SE250:lab-1:jham005

John's lab report

I played around with the number of loop iterations until the program took about 1 second to run.

 double f = 0;
 int n;
 clock_t t0 = clock( );

 for( n = 0; n < 100000000; n++ )
   f = f + 1;
 printf( "%ld\n", (long)( clock( ) - t0 ) );

By changing the type of f, I was able to gather data for the different C data types, and by commenting out the line f = f + 1 I obtained a time for the empty loop. To check the variance in the times, I added an outer loop to repeat the measurement 10 times (a sketch of the full harness is below).
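For reference, here is a minimal sketch of the complete harness, reconstructed from the fragments above (the full source isn't shown in this report). It assumes compilation without optimisation, so the loop isn't removed:

 #include <stdio.h>
 #include <time.h>

 int main( void ) {
   int i, n;

   for( i = 0; i < 10; i++ ) {   /* repeat to see the variance */
     double f = 0;               /* change this type for each data type */
     clock_t t0 = clock( );

     for( n = 0; n < 100000000; n++ )
       f = f + 1;                /* comment this out for the noop time */
     printf( "%ld\n", (long)( clock( ) - t0 ) );
   }
   return 0;
 }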

Here are the results from my laptop:

double  float  long  int  short  noop
484     469    391   390  360    375
469     469    391   391  359    391
453     453    390   453  359    390
469     453    391   438  360    407
469     469    391   422  343    390
437     453    406   406  360    391
469     453    390   437  359    391
469     469    391   438  360    390
453     453    391   406  343    375
468     469    390   422  375    391
(CLOCKS_PER_SEC is 1,000)

Subtracting the mean noop time (389.1) from each column's mean gave the following differences (for example, the mean int time is 420.3, so 420.3 - 389.1 = 31.2):

double  float  long   int  short
  74.9   71.9   3.1  31.2  -31.3

Long and int have broadly similar runtimes, as do double and float.

The result for short is most curious: it appears that a loop in which you increment a short takes less time than an empty loop.

It is worth taking a closer look at the generated assembly code, just in case the C compiler is doing something odd. I ran gcc -S add-times.c on the code for short, and again on the code with f = f + 1 commented out. The block on the left is the loop with the short increment, and on the right is the noop loop:

L5:
 cmpl   $99999999, -16(%ebp)    cmpl    $99999999, -16(%ebp)
 jg     L6                      jg      L6
 movzwl -10(%ebp), %eax
 incl   %eax
 movw   %ax, -10(%ebp)
 leal   -16(%ebp), %eax         leal    -16(%ebp), %eax
 incl   (%eax)                  incl    (%eax)
 jmp    L5                      jmp     L5
L6:

Nothing strange here -- the instructions are identical apart from the three lines for the increment. Those three additional lines of assembler actually reduce the runtime of the loop.

Repeating the experiment on the Linux server gives:

short    int      long      float    double   noop   
1100000  1620000  1620000   1170000  1120000  3000000
1090000  1630000  1630000   1160000  1130000  3010000
1070000  1630000  1630000   1130000  1140000  3010000
1100000  1620000  1630000   1140000  1130000  3000000
1090000  1640000  1620000   1180000  1120000  3010000
1100000  1610000  1630000   1160000  1140000  3020000
1100000  1630000  1630000   1200000  1130000  3000000
1070000  1630000  1630000   1080000  1140000  3010000
1100000  1630000  1630000   1170000  1130000  3010000
1100000  1620000  1630000   1190000  1140000  3010000
(CLOCKS_PER_SEC is 1,000,000)

The results show similar relative times, but an even stranger time for the empty loop: doing nothing takes nearly three times as long as a loop that performs a simple operation. Who would have guessed?

It's also notable that the Linux server is considerably slower than my laptop.

Why is the noop loop so slow? Modern processors can execute several instructions in parallel, but this gets tricky when there is a conditional branch. Rather than stall until the branch condition has been calculated, the processor may guess the outcome and start executing the code after the branch "just in case" the guess turns out to be right. If the guess is wrong, that work has to be undone. In the code above, the jg (jump if greater) instruction is only taken when the loop ends, and perhaps the processor is always guessing that it is taken. In the empty-loop case it would get further ahead, and so have more work to undo when it discovers its mistake.
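Branch prediction is easy to observe directly. The sketch below is my own, not part of the lab: it times the same branchy loop over random data and then over sorted data. The branch outcome is unpredictable in the first run and almost perfectly predictable in the second, so on a processor with a branch predictor the sorted run should be much faster:

 #include <stdio.h>
 #include <stdlib.h>
 #include <time.h>

 #define N 1000000
 #define REPS 100

 int cmp( const void *a, const void *b ) {
   return *(const int *)a - *(const int *)b;
 }

 int main( void ) {
   static int data[N];
   long sum = 0;
   int i, r;
   clock_t t0;

   for( i = 0; i < N; i++ )
     data[i] = rand( ) % 256;

   t0 = clock( );
   for( r = 0; r < REPS; r++ )
     for( i = 0; i < N; i++ )
       if( data[i] >= 128 )     /* unpredictable on random data */
         sum = sum + data[i];
   printf( "random: %ld\n", (long)( clock( ) - t0 ) );

   qsort( data, N, sizeof data[0], cmp );

   t0 = clock( );
   for( r = 0; r < REPS; r++ )
     for( i = 0; i < N; i++ )
       if( data[i] >= 128 )     /* same branch, now predictable */
         sum = sum + data[i];
   printf( "sorted: %ld\n", (long)( clock( ) - t0 ) );

   return sum == 0;             /* use sum so the loops aren't optimised away */
 }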

A completely different set of results can be obtained by taking the difference between a loop with two add operations and a loop with just one, i.e. by replacing the inner loop with this:

   for( n = 0; n < 100000000; n++ ) {
     f = f + 1;
     g = g + 1;
   }

Here g is declared with the same type as f. Obtaining times with and without the second increment, g = g + 1, gives the following:

short2  short1  long2  long1  float2  float1  double2  double1
860     359     438    406    515     469     454      469
843     360     422    391    500     468     468      453
844     359     421    375    500     453     469      468
859     359     438    422    500     454     469      454
844     344     422    391    500     468     469      468
860     375     453    406    485     453     453      453
843     359     422    390    484     469     453      454
844     360     422    391    500     453     453      468
844     359     437    406    500     453     484      453
859     344     422    391    500     454     453      454

Taking the mean differences gives very, very different results for short and double:

mean( short2 - short1 )   = 492.2
mean( long2 - long1 )     =  32.8
mean( float2 - float1 )   =  39
mean( double2 - double1 ) =   3.1

It now appears that shorts cost the earth, and two double adds are as cheap as one.

The times on the PowerPC64 Linux server are also damning for shorts and floats, but longs prove to be much cheaper.

short2  short1  long2   long1   float2  float1  double2 double1
1800000 1080000 1690000 1620000 1890000 1240000 1380000 1120000
1810000 1100000 1700000 1630000 1890000 1230000 1390000 1140000
1810000 1100000 1700000 1620000 1890000 1250000 1390000 1130000
1810000 1100000 1700000 1630000 1890000 1240000 1380000 1130000
1810000 1090000 1700000 1630000 1910000 1240000 1390000 1140000
1800000 1100000 1710000 1630000 1890000 1250000 1390000 1140000
1740000 1100000 1710000 1630000 1900000 1240000 1390000 1150000
1730000 1100000 1700000 1630000 1900000 1240000 1390000 1130000
1750000 1100000 1700000 1630000 1900000 1250000 1390000 1130000
1760000 1100000 1700000 1640000 1900000 1250000 1390000 1120000
 
mean( short2 - short1 )   = 685000
mean( long2 - long1 )     =  72000
mean( float2 - float1 )   = 653000
mean( double2 - double1 ) = 255000

A possible cause here is memory alignment: the PowerPC64 is optimised for fetching a 64-bit quantity (the size of a long and a double), and performs poorly when writing a smaller quantity because it has to first read a 64-bit block, update it, then write it back. The 32-bit Intel processor on my Dell appears to suffer the same problem when working with a 16-bit short.
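The alignment hypothesis can be tested directly. Here is a sketch of my own (not something the lab asked for) that times 16-bit stores against full-width stores; if the read-modify-write explanation is right, the short loop should be noticeably slower on the PowerPC64:

 #include <stdio.h>
 #include <time.h>

 #define N 100000000

 int main( void ) {
   volatile short s = 0;   /* volatile, so the stores can't be optimised away */
   volatile long  l = 0;
   long n;
   clock_t t0;

   t0 = clock( );
   for( n = 0; n < N; n++ )
     s = (short)n;         /* 16-bit store */
   printf( "short stores: %ld\n", (long)( clock( ) - t0 ) );

   t0 = clock( );
   for( n = 0; n < N; n++ )
     l = n;                /* full-width store */
   printf( "long stores:  %ld\n", (long)( clock( ) - t0 ) );

   return 0;
 }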

As for why a double is faster than a float, most FPUs operate internally on full-precision floating-point numbers, so working on a 32-bit float means converting to and from the wider format. I guess the Intel FPU can also execute several instructions in parallel, hence the "free" second increment.
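If the parallelism guess is right, it should show up as a difference between dependent and independent adds. This last sketch (again my own) compares two adds to the same variable, which form a dependency chain, with adds to two different variables, which the FPU is free to overlap; only the second version should get the second add for free:

 #include <stdio.h>
 #include <time.h>

 #define N 100000000

 int main( void ) {
   double f = 0, g = 0;
   long n;
   clock_t t0;

   t0 = clock( );
   for( n = 0; n < N; n++ ) {
     f = f + 1;
     f = f + 1;        /* depends on the previous add */
   }
   printf( "dependent:   %ld\n", (long)( clock( ) - t0 ) );

   t0 = clock( );
   for( n = 0; n < N; n++ ) {
     f = f + 1;
     g = g + 1;        /* independent of f, free to overlap */
   }
   printf( "independent: %ld\n", (long)( clock( ) - t0 ) );

   return f + g > 0 ? 0 : 1;   /* use f and g so they aren't optimised away */
 }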