SE250:lab-5:npit006


Task 1 - Using ent_test to compare randomness of hash functions

Results

Sample Test Results With Verbose Output

Randomness Tests Using Low Entropy Sources:
Testing Buzhash low on 100000 samples
Entropy = 7.998297 bits per byte.
 
Optimum compression would reduce the size
of this 100000 byte file by 0 percent.

Chi square distribution for 100000 samples is 235.54, and randomly
would exceed this value 75.00 percent of the times.

Arithmetic mean value of data bytes is 127.5775 (127.5 = random).
Monte Carlo value for Pi is 3.119404776 (error 0.71 percent).
Serial correlation coefficient is 0.000327 (totally uncorrelated = 0.0).


Test Results for All Hash Functions Without Verbose Output

Randomness Tests Using Low Entropy Sources:
Buzhash         7.99830 75.00%  127.578 0.71%   0.000327
Buzhashn        7.99824 50.00%  127.494 0.13%   -0.003092
Hash CRC        5.59831 0.01%   81.783  24.53%  0.028545
Base 256        0.00000 0.01%   97.000  27.32%  undefined
Java Int        4.82824 0.01%   43.883  27.32%  -0.092002
Java Object     2.00000 0.01%   77.000  27.32%  -0.521556
Java String     7.99959 99.99%  127.438 0.22%   -0.000646
Rand            4.87776 0.01%   47.916  27.32%  -0.056777
High Rand       0.00000 0.01%   0.000   27.32%  undefined


Randomness Tests Using Typical Entropy Sources:
Buzhash         7.99783 2.50%   127.374 0.77%   -0.000076
Buzhashn        7.99824 50.00%  127.494 0.13%   -0.003092
Hash CRC        7.84046 0.01%   122.862 1.98%   0.018518
Base 256        4.02297 0.01%   107.853 27.32%  0.034082
Java Int        4.82824 0.01%   43.883  27.32%  -0.092002
Java Object     5.76274 0.01%   81.526  24.78%  -0.063301
Java String     7.94554 0.01%   126.139 0.27%   0.021181
Rand            4.87892 0.01%   47.910  27.32%  -0.057853
High Rand       0.00000 0.01%   0.000   27.32%  undefined

Column 1 = Hash Function
Column 2 = Entropy (measured in bits per byte)
Column 3 = Chi squared %
Column 4 = Arithmetic Mean Value
Column 5 = Percentage error for Monte Carlo value of pi
Column 6 = Serial Correlation Coefficient

Randomness Definitions

These measurements of randomness have been thrown at us without any real prior explanation, so I've included the definitions from the lab maintainer's report and added my own interpretations of their nature/reliability after each one.


Entropy: a measure of how disordered a system is; this is a thermodynamics idea. Here it measures how random the hash output is: the value returned indicates how uniform the distribution of output bytes is, and for this particular test it is ideal if the entropy is as close to the maximum of 8 bits per byte as possible.

Based on my results, the entropy measurement looks quite reliable.
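
To make the definition concrete, here is a minimal sketch of how a figure like this can be computed: count byte frequencies, then sum the Shannon entropy over the 256 possible values. This is my own illustration of the calculation, not ent's actual source.

#include <stddef.h>
#include <math.h>

/* Shannon entropy in bits per byte: -sum(p * log2(p)) over the
   256 possible byte values.  A perfectly uniform distribution
   gives the maximum of 8 bits per byte. */
double entropy_bits_per_byte(const unsigned char *buf, size_t n) {
    long counts[256] = {0};
    double h = 0.0;
    size_t i;
    for (i = 0; i < n; i++) {
        counts[buf[i]]++;
    }
    for (i = 0; i < 256; i++) {
        if (counts[i] > 0) {
            double p = (double)counts[i] / n;
            h -= p * log(p) / log(2.0);
        }
    }
    return h;
}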

Chi Square: a statistical test of how well an observed distribution of data matches its theoretical probability function. It catches cases where the data is too irregular compared to the theoretical values, and also cases where it is suspiciously too regular. So for a good function, the chi-square value should imply that the data is close to the theoretical distribution, but not an implausibly exact match.

The percentage reported is how often truly random data would exceed the observed chi-square value, so results very close to 0% (too irregular) or 100% (too regular) indicate non-randomness; that makes a mid-range value like 50% about ideal.
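
As an illustration of the statistic itself, here is my own sketch (not ent's code; converting the statistic to the percentage requires the chi-square distribution with 255 degrees of freedom, which ent handles internally):

#include <stddef.h>

/* Chi-square statistic for n bytes against a uniform expectation
   of n/256 occurrences per byte value. */
double chi_square(const unsigned char *buf, size_t n) {
    long counts[256] = {0};
    double expected = (double)n / 256.0;
    double chisq = 0.0;
    size_t i;
    for (i = 0; i < n; i++) {
        counts[buf[i]]++;
    }
    for (i = 0; i < 256; i++) {
        double d = (double)counts[i] - expected;
        chisq += d * d / expected;
    }
    return chisq;
}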

Arithmetic mean: the expected mean for a set of uniformly random numbers is the midpoint between the highest and lowest possible values. For this set of tests the range is 0-255, so 127.5 is the expected mean.

I understand that the expected mean for these tests is 127.5, and so better hash functions should deliver an actual mean close to this value.

Monte Carlo: this method takes successive random numbers as the coordinates of points on a Cartesian plane, then calculates the ratio of the points that fall inside a unit circle centred at the origin to the total number of points. For uniformly distributed points this ratio approaches a quarter of pi, so the method can be used to estimate pi: the more accurate the estimate of pi, the more random the numbers.

The Monte Carlo method uses the distribution to approximate pi: if the distribution is uniform, the estimate should be close to pi, and the results for this test are given as an error percentage. I wouldn't say these values are very reliable, because they aren't entirely consistent with the entropy measurements. I suppose the limitation of the method is that the points inside the circle can be distributed in many different ways, but as long as a point is inside the circle, the intra-circle distribution doesn't matter.
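
A minimal sketch of the idea, assuming consecutive byte pairs are scaled into the unit square (I believe ent itself groups several bytes per coordinate for finer resolution, but the principle is the same):

#include <stddef.h>

/* Monte Carlo estimate of pi: treat consecutive byte pairs as
   (x, y) points in the unit square and count the fraction that
   falls inside the quarter circle of radius 1; that fraction
   approximates pi/4. */
double monte_carlo_pi(const unsigned char *buf, size_t n) {
    size_t i, points = 0, inside = 0;
    for (i = 0; i + 1 < n; i += 2) {
        double x = buf[i] / 255.0;
        double y = buf[i + 1] / 255.0;
        points++;
        if (x * x + y * y <= 1.0)
            inside++;
    }
    return points ? 4.0 * (double)inside / (double)points : 0.0;
}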

Serial Correlation: a measure of how much successive values are related to each other; basically, it tries to determine whether there is a pattern to the ordering of the values or not.

I wasn't initially sure what to make of the serial correlation values, but since the verbose output notes that totally uncorrelated data gives 0.0, I take it that the closer the coefficient is to zero, the better.
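
For completeness, a sketch of the coefficient as I understand it: the Pearson correlation between each byte and its successor (my own illustration; ent's exact internal formula may differ in detail):

#include <stddef.h>
#include <math.h>

/* Serial correlation: Pearson correlation between each byte and
   its successor.  0.0 means successive bytes are uncorrelated;
   values near +1 or -1 mean each byte strongly predicts the next. */
double serial_correlation(const unsigned char *buf, size_t n) {
    double sx = 0, sy = 0, sxy = 0, sxx = 0, syy = 0, num, den;
    size_t i, m = n - 1;      /* number of (byte, successor) pairs */
    for (i = 0; i < m; i++) {
        double x = buf[i], y = buf[i + 1];
        sx += x; sy += y;
        sxy += x * y; sxx += x * x; syy += y * y;
    }
    num = m * sxy - sx * sy;
    den = sqrt((m * sxx - sx * sx) * (m * syy - sy * sy));
    return den != 0.0 ? num / den : 0.0;
}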

Percentage by which the result could be compressed: the higher the percentage, the more the output can be compressed (the more repetition there is to eliminate), so a high percentage means low randomness.

This measurement was excluded when verbose output was turned off even though it seems quite reasonable; my guess is that it was dropped because it is derived directly from the entropy figure and so carries no extra information, as the sketch below suggests.
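
This derivation is my assumption, based on the verbose output above, where 7.998297 bits per byte corresponds to a 0 percent reduction:

/* Optimum compression percentage, derived from the entropy figure:
   a source carrying e bits of information per 8-bit byte could
   ideally be packed into e/8 of its original size. */
double optimum_compression_percent(double entropy_bits_per_byte) {
    return 100.0 * (8.0 - entropy_bits_per_byte) / 8.0;
}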

Results Interpretations

I chose a sample size of 100,000 because it would be big enough to ensure reliable results, but not so big that the program would take a noticeable amount of time to run.

It seems that my program didn't like some of the functions, such as Base 256 and High Rand, which gave undefined or nonsense results. Anyway:


For low entropy data, the Buzhash and Buzhashn functions consistently performed well, with high entropy and an actual mean close to the expected mean. The Java Int and Hash CRC functions performed averagely in comparison, in the vicinity of the Rand random number generator. The Java String function performed impeccably well. I'm not sure if I'm right in assuming that the chi-squared percentage of 99.99% means it performed unnaturally better than expected: random data would exceed its chi-square value 99.99% of the time, which suggests the distribution is suspiciously too uniform, and perhaps this unusual performance is due to the nature of the data we're feeding it.


For typical entropy data, the Buzhash and Buzhashn functions performed more or less the same as with low entropy data; presumably they are good enough that the entropy of the input data isn't of much consequence. This time Base 256 worked properly, though its performance was mediocre. Hash CRC and Java Object improved greatly with typical entropy data. Java Int remained relatively constant. Java String didn't perform as impeccably as before, but was still excellent; this time, though, its chi-squared percentage was at the other end of the scale, and I assume again that this means Java String wouldn't regularly perform this well.


Based on my results, I would rank the hash functions like this: Java String > Buzhash > Buzhashn > Hash CRC > Java Object > Java Int > Base 256


For general purposes, where data sources are more likely to be of typical entropy and aren't just strings, I definitely wouldn't rate Java String so highly. With numbers I assume Java Int would perform better than Java String, and Java Object would be a good mixture of the two. The Buzhash functions would be my functions of choice in practice. Base 256 is poor.


Task 2 - Measuring the Expected LLPS and Actual LLPS

Results

Low Entropy Sources 

Buzhash low 1500/500: llps = 9, expecting 9.45384
Hash CRC low 1500/500: llps = 15, expecting 9.45384
Base256 low 1500/500: llps = 1500, expecting 9.45384
Java Object low 1500/500: llps = 1500, expecting 9.45384
Java String low 1500/500: llps = 6, expecting 9.45384
Buzhashn low 1500/500: llps = 8, expecting 9.45384
Java Int 1500/500: llps = 3, expecting 9.45384


Typical Entropy Sources

Buzhash 1500/500: llps = 10, expecting 9.45384
Hash CRC 1500/500: llps = 9, expecting 9.45384
Base256 1500/500: llps = 109, expecting 9.45384
Java Object 1500/500: llps = 19, expecting 9.45384
Java String 1500/500: llps = 10, expecting 9.45384

I specified n_keys as 1500 with a table size of 500 (load factor = 3), which has an expected worst-case llps (length of the longest probe sequence) of 9.45384. I feel that a higher load factor gives results that better demonstrate how good the hash functions are at distributing data uniformly and avoiding collisions.
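
Since the llps measurement itself is simple, here is a minimal sketch of how it can be computed for a separate-chaining table; the names here are hypothetical stand-ins, not the lab's actual API:

#include <stdlib.h>

/* Hash n_keys keys into table_size buckets and report the length
   of the longest chain (llps).  hash_fn and the key array stand
   in for the lab's own code. */
int measure_llps(char **keys, int n_keys, int table_size,
                 unsigned (*hash_fn)(const char *)) {
    int *chain_len = calloc(table_size, sizeof(int));
    int i, llps = 0;
    for (i = 0; i < n_keys; i++) {
        int bucket = hash_fn(keys[i]) % table_size;
        if (++chain_len[bucket] > llps)
            llps = chain_len[bucket];
    }
    free(chain_len);
    return llps;
}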

From this data, we can see that the Buzhash functions were quite consistent, with reasonably low llps values. For a low entropy source, Base256 and Java Object both have an llps of 1500, which suggests all 1500 values are being stored in a single linked list. With typical entropy sources, both improved drastically.

Hash CRC, Java String and Java Int performed quite well too.

Conclusion

As most of the code was provided, and examples of hash function tests were given for each task, the only coding required was to duplicate the tests for each hash function and to specify the sample size, table size and number of keys. As such, I have justified my settings above and don't feel it is necessary or useful to include my code. On that note, MSVC 2005 wasn't able to compile the given code, so I used Dev-C++ for the job.

From my results, it seems that the Buzhash functions are the best all-purpose hash functions, on the grounds that they performed well and consistently in both tasks. Java String produced some interesting results, but I feel it would be more unreliable in practice. Base256 was by far the worst hash function.