SE250:lab-5:mgha023

From Marks Wiki

I first tested the buzhash function, using sample sizes of 1000, 10000 and 100000.


  • 1)

C:\Users\mgha023>H:

H:\>cd ec250

H:\ec250>cd lab5

H:\ec250\lab5>gcc *.c -o lab-5 && .\lab-5.exe
Testing Buzhash low on 1000 samples
Entropy = 7.843786 bits per byte.
Optimum compression would reduce the size of this 1000 byte file by 1 percent.
Chi square distribution for 1000 samples is 214.46, and randomly would exceed this value 95.00 percent of the times.
Arithmetic mean value of data bytes is 128.0860 (127.5 = random).
Monte Carlo value for Pi is 3.132530120 (error 0.29 percent).
Serial correlation coefficient is -0.017268 (totally uncorrelated = 0.0).
Buzhash low 10/1000000: llps = 1, expecting 1
  • 2)


Testing Buzhash low on 10000 samples
Entropy = 7.985498 bits per byte.
Optimum compression would reduce the size of this 10000 byte file by 0 percent.
Chi square distribution for 10000 samples is 201.50, and randomly would exceed this value 99.00 percent of the times.
Arithmetic mean value of data bytes is 125.8253 (127.5 = random).
Monte Carlo value for Pi is 3.181272509 (error 1.26 percent).
Serial correlation coefficient is -0.000047 (totally uncorrelated = 0.0).
Buzhash low 10/1000000: llps = 1, expecting 1
  • 3)
Testing Buzhash low on 100000 samples
Entropy = 7.998297 bits per byte.
Optimum compression would reduce the size of this 100000 byte file by 0 percent.
Chi square distribution for 100000 samples is 235.54, and randomly would exceed this value 75.00 percent of the times.
Arithmetic mean value of data bytes is 127.5775 (127.5 = random).
Monte Carlo value for Pi is 3.119404776 (error 0.71 percent).
Serial correlation coefficient is 0.000327 (totally uncorrelated = 0.0).
Buzhash low 10/1000000: llps = 1, expecting 1
I interpreted the above results using the information I got from the 2007 Hypertext Textbook:

  • Entropy: the closer the value is to 8 bits per byte, the more random the numbers returned by the function must be.
  • Percentage by which the result could be compressed: the higher this percentage, the more redundancy the results must contain (i.e. repeated values). This means that a high value equates to low randomness.
  • Chi square: the percentage of the time a truly random sequence would exceed the obtained value gives an indication of how random the numbers must be. Values close to 50% indicate the highest randomness; values very close to 0.01% or 99.99% indicate the sequence is almost certainly not random.
  • Arithmetic mean: the expected mean for a set of random numbers is the centre between the highest and lowest possible values. For bytes, the expected mean is 127.5.
  • Monte Carlo: the numbers are used as coordinates of points on a 2D grid containing a circle, and the proportion of points falling inside the circle gives an estimate of Pi. The error indicates how far the estimate is from the true value, so the closer the error is to 0, the more random the numbers generated must be.
  • Serial correlation: values close to 0 show that the relationship between consecutive numbers is practically non-existent (and hence the sequence must be random).

It can be seen that increasing the sample size from 1000 to 10000 gives a considerable increase in randomness. However, there is not much further improvement when the sample size is changed from 10000 to 100000. Although the sample size of 100000 is definitely more random, the results seem to level off around this size. Thus, I will use 100000 as my sample size for the rest of the exercises.

Next, I used ent_test for the rest of the functions. I commented out #define VERBOSE_OUTPUT so that the results were displayed without all the text around them.

Table comparing results for the low entropy source:

Function          Entropy  Chi square  Arithmetic mean  Monte Carlo  Serial correlation
Buzhash low       7.99830  75.00%      127.578          0.71%        0.000327
Buzhashn low      7.99824  50.00%      127.494          0.13%        -0.003092
hashCRC low       5.59831  0.01%       81.783           24.53%       0.028545
Base256 low       0.00000  0.01%       97.000           27.32%       undefined
Java Integer low  4.82824  0.01%       43.883           27.32%       -0.092002
Java Object low   2.00000  0.01%       77.000           27.32%       -0.521556
Java String low   7.99959  99.99%      127.438          0.22%        -0.000646
Rand low          7.95308  0.01%       111.441          11.17%       -0.051837
High rand low     7.99821  50.00%      127.402          0.59%        -0.001222

Buzhash low 10/1000000: llps = 1, expecting 1


Table comparing results for typical entropy source:


Function          Entropy  Chi square  Arithmetic mean  Monte Carlo  Serial correlation
Buzhash           7.99783  2.50%       127.374          0.77%        -0.000076
Buzhashn          7.99824  50.00%      127.494          0.13%        -0.003092
hashCRC           7.84046  0.01%       122.862          1.98%        0.018518
Base256           4.02297  0.01%       107.853          27.32%       0.034082
Java Integer      4.82824  0.01%       43.883           27.32%       -0.092002
Java Object       5.68171  0.01%       71.559           18.90%       0.008626
Java String       7.94554  0.01%       126.139          0.27%        0.021181
Rand              7.95308  0.01%       111.441          11.17%       -0.051837
High rand         7.99821  50.00%      127.402          0.59%        -0.001222



Studying the tables above, I was able to rank the functions in order of theoretical randomness:

  • 1-Buzhashn low
  • 2-High Rand low
  • 3-Buzhash low
  • 4-Java String low
  • 5-Rand low
  • 6-HashCRC low
  • 7-Java Integer low
  • 8-Base256 low

What I intended to do next was to replace 'buzhash' in the following code with 'buzhashn', 'base256' and the rest of the functions.

printf( "Buzhash low %d/%d: llps = %d, expecting %g\n",
        n_keys, table_size,
        llps( n_keys, low_entropy_src, table_size, buzhash ),
        expected_llps( n_keys, table_size ) );

For buzhashn and Java Integer, I was to use llps_i. However, I ran out of time in the lab and so decided to continue at home. At home, I was unable to run the code, as Command Prompt said it could not recognise 'gcc' as a command. I was puzzled, as this was exactly what I had used in the lab. I tried running it using Eclipse, but it did not work. I also tried Wedit, but it still did not work. Obviously all this led to a little bit of frustration. I went online to see how to make gcc work in my Command Prompt. One forum asked me to change some setting in 'System'. I didn't want to risk it.

I found this lab much (much) simpler than the previous ones. I was actually actively occupied for the two hours without constantly wondering what I was supposed to be doing. I wanted to finish this lab, as I understand how to proceed, but I am unable to do so while Command Prompt, Eclipse and Wedit are acting cranky. And I thought computers were meant to help you study (sheesh!).