SE250:lab-5:llay008

From Marks Wiki

HOW DO HASH FUNCTIONS PERFORM IN THEORY AND PRACTICE?

Task One: How Do the Functions Compare in Theoretical Randomness

This task took quite a long time, mostly because of the large amount of repetitive compiling of the code, and it took a while at the start to understand exactly what to do. The first thing I did was read through the code and try to grasp what it was doing. I noticed that the code was divided into the hash functions and the analysis, as well as the high- and low-entropy inputs. I worked out which of the functions I had to substitute into the main code; from then on it was simply a matter of modifying this to get the required output.

I chose a sample size of 1000, a key count of 1000 and a table size of 100000. I chose these for two reasons: first, when I ran the code these values gave a good entropy value, and altering them did not change the results considerably. The second reason is that I had a misconception about what they meant. It wasn't entirely clear to me what they were, so I asked for help and was told the wrong information, namely that the sample size was the number of inputs into the hash table, when in fact it is the number of times the statistical tests are run. When I realised this mistake I didn't change the values, because that would invalidate the results, but I did run some tests on how the results differed when I changed them.


I kept most of the output on "full" because, although it takes up more space, it was easier to interpret (I didn't have to try to remember which values corresponded to what).

Output

buzhash

low

Testing Buzhash low on 1000 samples
Entropy = 7.843786 bits per byte.

Optimum compression would reduce the size
of this 1000 byte file by 1 percent.

Chi square distribution for 1000 samples is 214.46, and randomly
would exceed this value 95.00 percent of the times.

Arithmetic mean value of data bytes is 128.0860 (127.5 = random).
Monte Carlo value for Pi is 3.132530120 (error 0.29 percent).
Serial correlation coefficient is -0.017268 (totally uncorrelated = 0.0).

Buzhash low 1000/100000: llps = 2, expecting 2.00948


typical

Testing Buzhash typical on 1000 samples
Entropy = 7.797775 bits per byte.

Optimum compression would reduce the size
of this 1000 byte file by 2 percent.
 
Chi square distribution for 1000 samples is 250.82, and randomly
would exceed this value 50.00 percent of the times.

Arithmetic mean value of data bytes is 126.5740 (127.5 = random).
Monte Carlo value for Pi is 3.277108434 (error 4.31 percent).
Serial correlation coefficient is -0.007005 (totally uncorrelated = 0.0).

Buzhash typical 1000/100000: llps = 2, expecting 2.00948

buzhashn

low

Testing Buzhashn low on 1000 samples
Entropy = 7.823873 bits per byte. 

Optimum compression would reduce the size
of this 1000 byte file by 2 percent.

Chi square distribution for 1000 samples is 220.61, and randomly
would exceed this value 90.00 percent of the times.

Arithmetic mean value of data bytes is 127.3730 (127.5 = random).
Monte Carlo value for Pi is 3.108433735 (error 1.06 percent).
Serial correlation coefficient is -0.007118 (totally uncorrelated = 0.0).

Buzhashn low 1000/100000: llps = 2, expecting 2.00948

typical

Java String Hash typical 	7.82387	90.00%	127.373	1.06%	-0.007118
Java String Hash typical 1000/10000: llps = 999, expecting 2.82556

hash_CRC

low

Testing hash_CRC low on 1000 samples
Entropy = 3.965965 bits per byte.

Optimum compression would reduce the size
of this 1000 byte file by 50 percent.

Chi square distribution for 1000 samples is 36163.52, and randomly
would exceed this value 0.01 percent of the times.

Arithmetic mean value of data bytes is 93.6860 (127.5 = random).
Monte Carlo value for Pi is 4.000000000 (error 27.32 percent).
Serial correlation coefficient is -0.380754 (totally uncorrelated = 0.0).

hash_CRC low 1000/100000: llps = 1, expecting 2.00948

typical

Testing hash_CRC typical on 1000 samples
Entropy = 7.202459 bits per byte.

Optimum compression would reduce the size
of this 1000 byte file by 9 percent.

Chi square distribution for 1000 samples is 1660.86, and randomly
would exceed this value 0.01 percent of the times.

Arithmetic mean value of data bytes is 114.9320 (127.5 = random).
Monte Carlo value for Pi is 3.204819277 (error 2.01 percent).
Serial correlation coefficient is -0.032076 (totally uncorrelated = 0.0).

hash_CRC typical 1000/100000: llps = 2, expecting 2.00948

base256

low

Testing base256 low on 1000 samples
Entropy = 0.000000 bits per byte.

Optimum compression would reduce the size
of this 1000 byte file by 100 percent.

Chi square distribution for 1000 samples is 255000.00, and randomly
would exceed this value 0.01 percent of the times.

Arithmetic mean value of data bytes is 97.0000 (127.5 = random).
Monte Carlo value for Pi is 4.000000000 (error 27.32 percent).
Serial correlation coefficient is undefined (all values equal!).

base256 low 1000/100000: llps = 1000, expecting 2.00948

typical

Testing base256 typical on 1000 samples
Entropy = 3.919224 bits per byte.

Optimum compression would reduce the size
of this 1000 byte file by 51 percent.

Chi square distribution for 1000 samples is 19854.27, and randomly
would exceed this value 0.01 percent of the times.

Arithmetic mean value of data bytes is 106.4100 (127.5 = random).
Monte Carlo value for Pi is 4.000000000 (error 27.32 percent).
Serial correlation coefficient is 0.217294 (totally uncorrelated = 0.0).

base256 typical 1000/100000: llps = 46, expecting 2.00948

Java Integer Hash

low

lab5.c:332: warning: passing arg 4 of `llps' from incompatible pointer type
Testing Java Integer Hash low on 1000 samples
Entropy = 2.791730 bits per byte.

Optimum compression would reduce the size
of this 1000 byte file by 65 percent.

Chi square distribution for 1000 samples is 143448.00, and randomly
would exceed this value 0.01 percent of the times.

Arithmetic mean value of data bytes is 31.1250 (127.5 = random).
Monte Carlo value for Pi is 4.000000000 (error 27.32 percent).
Serial correlation coefficient is -0.230200 (totally uncorrelated = 0.0).

Java Integer Hash low 1000/100000: llps = 1, expecting 2.00948

typical

lab5.c:332: warning: passing arg 4 of `llps' from incompatible pointer type
Testing Java Integer Hash typical  on 1000 samples
Entropy = 2.791730 bits per byte.

Optimum compression would reduce the size
of this 1000 byte file by 65 percent.

Chi square distribution for 1000 samples is 143448.00, and randomly
would exceed this value 0.01 percent of the times.

Arithmetic mean value of data bytes is 31.1250 (127.5 = random).
Monte Carlo value for Pi is 4.000000000 (error 27.32 percent).
Serial correlation coefficient is -0.230200 (totally uncorrelated = 0.0).

Java Integer Hash typical 1000/100000: llps = 91, expecting 2.00948

Java Object Hash

low

Testing Java Object Hash low  on 1000 samples
Entropy = 2.000000 bits per byte.

Optimum compression would reduce the size
of this 1000 byte file by 75 percent.

Chi square distribution for 1000 samples is 63000.00, and randomly
would exceed this value 0.01 percent of the times.

Arithmetic mean value of data bytes is 77.0000 (127.5 = random).
Monte Carlo value for Pi is 4.000000000 (error 27.32 percent).
Serial correlation coefficient is -0.521556 (totally uncorrelated = 0.0).

Java Object Hash low 1000/100000: llps = 1000, expecting 2.00948

typical

Testing Java Object Hash typical  on 1000 samples
Entropy = 4.232015 bits per byte.

Optimum compression would reduce the size
of this 1000 byte file by 47 percent.

Chi square distribution for 1000 samples is 33033.66, and randomly
would exceed this value 0.01 percent of the times.

Arithmetic mean value of data bytes is 88.4960 (127.5 = random).
Monte Carlo value for Pi is 4.000000000 (error 27.32 percent).
Serial correlation coefficient is -0.731743 (totally uncorrelated = 0.0).

Java Object Hash typical 1000/100000: llps = 1, expecting 2.00948

Java String hash

low

Java String Hash low 	7.91640	99.99%	129.471	0.48%	0.009052
Java String Hash low 1000/100000: llps = 1, expecting 2.00948

typical

Java String Hash typical 	7.37782	0.01%	117.390	8.92%	-0.013887
Java String Hash typical 1000/100000: llps = 2, expecting 2.00948

Around this point I realised that I had some misconceptions about the variables and tried it again using a table size of 1000. The results are below.

Java String Hash typical 	7.37782	0.01%	117.390	8.92%	-0.013887
Java String Hash typical 1000/1000: llps = 7, expecting 5.51384

Increasing the sample size to 10000 gave these results:

Java String Hash typical 	7.89301	0.01%	126.009	1.42%	-0.021360
Java String Hash typical 1000/1000: llps = 7, expecting 5.51384

For my benefit, here is the full output (to see exactly which numbers the condensed format corresponds to):

Testing Java String Hash typical  on 10000 samples
Entropy = 7.893010 bits per byte.

Optimum compression would reduce the size
of this 10000 byte file by 1 percent.

Chi square distribution for 10000 samples is 2047.92, and randomly
would exceed this value 0.01 percent of the times.

Arithmetic mean value of data bytes is 126.0088 (127.5 = random).
Monte Carlo value for Pi is 3.186074430 (error 1.42 percent).
Serial correlation coefficient is -0.021360 (totally uncorrelated = 0.0).

Java String Hash typical 1000/1000: llps = 7, expecting 5.51384

rand

low

Java rand low 	7.71844	0.01%	110.541	8.92%	-0.048389
Java rand low 1000/10000: llps = 2, expecting 2.82556

typical

Java rand typical 	7.74840	0.05%	112.891	7.38%	-0.081749
Java rand typical 1000/10000: llps = 3, expecting 2.82556

high rand

This gave me an error:

lab5.c:332: error: `high_rand' undeclared (first use in this function)

Results

TEST                             ENTROPY     S.C.C.       CHI-SQUARED EXCEEDANCE (%)

Java String hash     low         7.91640      0.009052    99.99
buzhashn             low         7.823873    -0.007118    90.00
buzhash              low         7.843786    -0.017268    95.00
rand                 low         7.71844     -0.048389     0.01
hash_CRC             low         3.965965    -0.380754     0.01
Java Object hash     low         2.000000    -0.521556     0.01
Java Integer hash    low         2.791730    -0.230200     0.01
base256              low         0.000000     undefined    0.01

buzhash              typical     7.797775    -0.007005    50.00
rand                 typical     7.74840     -0.081749     0.05
buzhashn             typical     7.82387     -0.007118    90.00
Java String hash     typical     7.37782     -0.013887     0.01
base256              typical     3.919224     0.217294     0.01
hash_CRC             typical     7.202459    -0.032076     0.01
Java Object hash     typical     4.232015    -0.731743     0.01
Java Integer hash    typical     2.791730    -0.230200     0.01


Looking at these results there is a clear difference between the top four and the bottom four hash functions, especially in their entropy. base256 low actually scores 0, which means every output byte was identical. The difference also shows in the serial correlation coefficient: for each of the top four the first significant figure is in the thousandths place, while for the bottom four it is larger, in most cases by a factor of ten or more. With the sole exception of buzhash typical, all of the hash functions performed badly on the chi-squared test, scoring either very high (too good to be true) or very low (not very random). In general the low-entropy input gave better results than the typical input.