SE250:lab-5:Maintainers report


Lab 5 Report

Introduction

A hash function is any well-defined procedure or mathematical function for turning some kind of data into a relatively small integer that may serve as an index into an array. The values returned by a hash function are called hash values, hash codes, hash sums, or simply hashes.

Hash functions are mostly used to speed up table lookup or data comparison tasks --- such as finding items in a database, detecting duplicated or similar records in a large file, finding similar stretches in DNA sequences, and so on.

A hash table can also be (loosely) defined as an array of linked lists.

A hash table is, in effect, a two-dimensional array of values with a specific key attached to each value. The table is populated using a hash function: a mapping that assigns a number to each piece of input data, and that number determines the array index at which the data is stored. Ideally, each assigned number is unique to the input data.

A good hash function is essential for good hash table performance. A bad choice of hash function leads to clustering, where the probability of keys being assigned the same hash value is higher than expected from a random function. In addition, some hash functions are computationally expensive, so the amount of time (and, in some cases, memory) taken to compute the hash may be burdensome.
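
As a toy illustration of these ideas (not one of the hash functions supplied in this lab), the sketch below maps string keys to bucket indices with two hypothetical hash functions: one that mixes every byte of the key, and a deliberately bad one that looks only at the first byte, so keys sharing a first letter all cluster in the same bucket.

#include <stdio.h>

#define TABLE_SIZE 101

/* Reasonable toy hash: mix every byte of the key, then fold into the table range. */
unsigned mix_hash( const char *key ) {
  unsigned h = 0;
  while ( *key )
    h = h * 31 + (unsigned char)*key++;
  return h % TABLE_SIZE;
}

/* Deliberately bad hash: uses only the first byte, so "apple", "apricot" and
   "avocado" all land in the same bucket: an example of clustering. */
unsigned bad_hash( const char *key ) {
  return (unsigned char)key[0] % TABLE_SIZE;
}

int main( ) {
  const char *keys[] = { "apple", "apricot", "avocado", "banana" };
  int i;
  for ( i = 0; i < 4; i++ )
    printf( "%-8s  mix_hash -> %3u   bad_hash -> %3u\n",
            keys[i], mix_hash( keys[i] ), bad_hash( keys[i] ) );
  return 0;
}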

Methodology

In this week's lab we were expected to perform experiments on hash tables, to show how different hash functions perform.

All the tests were provided, and it was up to us students to run them, look at the results, and come to conclusions about which hash function would be best to use.

The hash functions that were provided were:

  • rt_add_buzhash
  • rt_add_buzhashn
  • rt_add_hash_CRC
  • rt_add_base256
  • rt_add_Java_Integer
  • rt_add_Java_Object
  • rt_add_Java_String
  • rt_add_rand
  • rt_add_high_rand

I wanted to explain how each of these functions computes its hash values, but after looking at the code I decided it would not be the best idea, so I left it alone.

Definitions

  • Entropy: A measure of how disordered a system is; the idea comes from thermodynamics. Here it measures how random our hash values are: the value returned by this test indicates how random the distribution is, and for this particular test it is ideal if the entropy is as close to 8 (bits per byte) as possible.
  • Percentage by which the result could be compressed: The higher the percentage, the more the hash output can be compressed (the more data that can be eliminated because of repetition). A high percentage therefore means low randomness.
  • Chi Square: Chi square is a statistical test of how well the distribution of a set of observed data matches its theoretical probability distribution. It flags cases where the data is too regular compared with the theoretical values, as well as cases where it is too irregular. For a good hash function, the chi-square result should indicate that the data is close to the theoretical distribution, but not an exact match.
  • Arithmetic mean: The expected mean of a set of random numbers is the midpoint between the highest and lowest possible values. For this set of tests (byte values 0-255), the expected mean is 127.5.

  • Monte Carlo: This method takes random numbers, plots them as points in the unit square of a Cartesian coordinate plane, and then computes the ratio of the points that lie inside the quarter circle of radius 1 (centred at the origin) to the total number of points. This ratio approaches a quarter of pi, so it can be used to estimate pi; the more accurate the resulting value of pi, the more random the numbers. A small sketch of this estimate (together with the entropy measure) appears after these definitions.
  • Serial correlation: A measure of how strongly the values in the hash output are related to one another; essentially, it tries to determine whether there is a pattern in the ordering of the values.


Referenced from the HTTB 2007 and the Internet
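
As referenced above, here is a minimal sketch of two of these measures: entropy (in bits per byte) and the Monte Carlo estimate of pi. This is not the lab's ent_test; it only illustrates what those parts of the output are measuring, using rand( ) as a stand-in data source.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* Shannon entropy of a byte stream, in bits per byte; close to 8 for random data. */
double entropy_bits_per_byte( const unsigned char *buf, int n ) {
  int counts[256] = { 0 };
  double h = 0.0;
  int i;
  for ( i = 0; i < n; i++ )
    counts[ buf[i] ]++;
  for ( i = 0; i < 256; i++ )
    if ( counts[i] > 0 ) {
      double p = (double)counts[i] / n;
      h -= p * log( p ) / log( 2.0 );
    }
  return h;
}

/* Monte Carlo pi: the fraction of random points in the unit square that fall
   inside the quarter circle of radius 1 approaches pi/4. */
double monte_carlo_pi( int trials ) {
  int inside = 0, i;
  for ( i = 0; i < trials; i++ ) {
    double x = rand( ) / (double)RAND_MAX;
    double y = rand( ) / (double)RAND_MAX;
    if ( x * x + y * y <= 1.0 )
      inside++;
  }
  return 4.0 * inside / trials;
}

int main( ) {
  enum { N = 10000 };
  unsigned char buf[N];
  int i;
  for ( i = 0; i < N; i++ )
    buf[i] = rand( ) % 256;    /* stand-in for the bytes produced by a hash function */
  printf( "entropy     = %f bits/byte (ideal: close to 8)\n", entropy_bits_per_byte( buf, N ) );
  printf( "pi estimate = %f (ideal: close to 3.141593)\n", monte_carlo_pi( N ) );
  return 0;
}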

Part 1

We used ent_test to compare the randomness of all the hash functions on both the low entropy source and the typical entropy source.


We also compared the results with the output from the Unix random number generators, and then arranged the values obtained from the functions in order from best to worst.

Choosing the sample size

Some students did experiments to choose an appropriate sample size. Most of them used one function to test different sample sizes, stopping when increasing the sample size made no noticeable difference to the measured randomness of the hash function.

Example: (taken from hpan027's report)


To determine the sample size, I carried out a series of ent_test with different sample sizes.

Function / sample size	Entropy	Chi-square	Mean	Monte Carlo (pi error)	Serial correlation
Buzhash low 5		3.00000	50.00%	132.875	100.00%	-0.509682
Buzhash low 10		3.58496	50.00%	107.667	36.34%	0.235256
Buzhash low 100		6.15366	2.50%	130.460	3.45%	-0.088601
Buzhash low 1000	7.84646	97.50%	126.550	2.01%	0.006193
Buzhash low 10000	7.97970	25.00%	127.177	1.95%	-0.007153
Buzhash low 100000	7.99827	50.00%	127.587	0.14%	0.000712
Buzhash low 1000000	7.99989	99.99%	127.501	0.23%	-0.000832

Basically, the conclusion was that somewhere around 10,000 the results stop varying much, and hence 10,000 was chosen as the sample size for the rest of the tests.
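
Assuming the ent_test interface used in Part 1.1 below (the function pointer and entropy-source names come from the lab harness), such a sample-size sweep could be scripted roughly like this:

int main( ) {
  int sizes[] = { 5, 10, 100, 1000, 10000, 100000, 1000000 };
  int n_sizes = sizeof(sizes) / sizeof(sizes[0]);
  int i;

  /* Run the same hash function over increasing sample sizes and note where
     the entropy, chi-square and mean figures stop changing appreciably. */
  for ( i = 0; i < n_sizes; i++ )
    ent_test( "Buzhash low", low_entropy_src, sizes[i], &rt_add_buzhash );

  return 0;
}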

Most students made up numbers and used them without any justification. I think this is because the relationships between table size, number of keys, and sample size were not very clear to the majority of them.

Part 1.1

CODE:

int main( ) {
  int sample_size = 1000;
  int n_keys = 1000;
  int table_size = 10000;

  ent_test( "Buzhash low", low_entropy_src, sample_size, &rt_add_buzhash );
  ent_test( "Buzhash high", typical_entropy_src, sample_size, &rt_add_buzhash );

  ent_test( "buzhashn low", low_entropy_src, sample_size, &rt_add_buzhashn );
  ent_test( "buzhashn high", typical_entropy_src, sample_size, &rt_add_buzhashn );

  ent_test( "hash_CRC low", low_entropy_src, sample_size, &rt_add_hash_CRC );
  ent_test( "hash_CRC high", typical_entropy_src, sample_size, &rt_add_hash_CRC );

  ent_test( "base256 low", low_entropy_src, sample_size, &rt_add_base256 );
  ent_test( "base256 high", typical_entropy_src, sample_size, &rt_add_base256 );

  ent_test( "Java Int low", low_entropy_src, sample_size, &rt_add_Java_Integer );
  ent_test( "Java Int high", typical_entropy_src, sample_size, &rt_add_Java_Integer );

  ent_test( "Java Object low", low_entropy_src, sample_size, &rt_add_Java_Object );
  ent_test( "Java Objec high", typical_entropy_src, sample_size, &rt_add_Java_Object ); 

  ent_test( "Java String low", low_entropy_src, sample_size, &rt_add_Java_String );
  ent_test( "Java Strin high", typical_entropy_src, sample_size, &rt_add_Java_String );
  
  ent_test( "Rand  low", low_entropy_src, sample_size, &rt_add_rand );
  ent_test( "Rand  high", typical_entropy_src, sample_size, &rt_add_rand );

  ent_test( "high rand low", low_entropy_src, sample_size, &rt_add_high_rand );
  ent_test( "high rand high", typical_entropy_src, sample_size, &rt_add_high_rand );
 
  return 0;
}

Part 1.2

For:
  int sample_size = 1000;
  int n_keys = 1000;
  int table_size = 10000;

		
Function	Entropy	Chi-square	Mean	Monte Carlo (pi error)	Serial correlation
Buzhash low	7.99830	75.00%	127.578	0.71%	0.000327
Buzhash high	7.99783	2.50%	127.374	0.77%	-0.000076
buzhashn low	5.59831	0.01%	81.783	24.53%	0.028545 
buzhashn high	7.84046	0.01%	122.862	1.98%	0.018518
hash_CRC low	5.71827	0.01%	112.143	12.43%	-0.204906
hash_CRC high	7.84046	0.01%	122.862	1.98%	0.018518 
base256 low	0.00000	0.01%	97.000	27.32%	undefined
base256 high	4.02297	0.01%	107.853	27.32%	0.034082
Java Int low	4.82824	0.01%	43.883	27.32%	-0.092002
Java Int high	4.82824	0.01%	43.883	27.32%	-0.092002
Java Object low	2.00000	0.01%	77.000	27.32%	-0.521556
Java Objec high	5.72621	0.01%	118.129	4.28%	-0.343281
Java String low	7.99958	99.99%	127.474	0.15%	0.001097
Java Strin high	7.94554	0.01%	126.139	0.27%	0.021181 
Rand  low	7.95308	0.01%	111.441	11.17%	-0.051837 
Rand  high	7.95272	0.01%	111.395	10.65%	-0.049131
high rand low	7.99828	75.00%	127.441	0.75%	-0.001213
high rand high	7.99807	50.00%	127.406	0.07%	-0.002226


Rearranged from best to worst:

Buzhash low	7.9983	75.00%	127.578	0.71%	0.000327
high rand low	7.99828	75.00%	127.441	0.75%	-0.001213
high rand high	7.99807	50.00%	127.406	0.07%	-0.002226
Java String low	7.99958	99.99%	127.474	0.15%	0.001097 
Buzhash high	7.99783	2.50%	127.374	0.77%	-0.000076
Java Strin high	7.94554	0.01%	126.139	0.27%	0.021181 
buzhashn high	7.84046	0.01%	122.862	1.98%	0.018518
hash_CRC high	7.84046	0.01%	122.862	1.98%	0.018518 
Rand  low	7.95308	0.01%	111.441	11.17%	-0.051837
Rand  high	7.95272	0.01%	111.395	10.65%	-0.049131
base256 high	4.02297	0.01%	107.853	27.32%	0.034082
Java Objec high	5.72621	0.01%	118.129	4.28%	-0.343281
hash_CRC low	5.71827	0.01%	112.143	12.43%	-0.204906
buzhashn low	5.59831	0.01%	81.783	24.53%	0.028545
Java Int low	4.82824	0.01%	43.883	27.32%	-0.092002
Java Int high	4.82824	0.01%	43.883	27.32%	-0.092002
Java Object low	2	0.01%	77	27.32%	-0.521556 
base256 low	0	0.01%	97	27.32%	undefined

Part 2

We used llps (and llps_i) to measure the actual worst case performance of the supplied hash functions.
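
The supplied llps implementation was not examined for this report, but conceptually it inserts n_keys keys from an entropy source into a chained table of table_size buckets and reports the length of the longest chain (the longest probe sequence). A rough sketch of that idea, with a hypothetical stand-in hash function and integer keys, is:

#include <stdio.h>
#include <stdlib.h>

/* Conceptual sketch only: hash n_keys keys into table_size buckets and return
   the size of the largest bucket, i.e. the longest chain a lookup could walk. */
int llps_sketch( int n_keys, int table_size, unsigned (*hash_fn)(int) ) {
  int *count = calloc( table_size, sizeof(int) );
  int i, worst = 0;
  for ( i = 0; i < n_keys; i++ ) {
    int bucket = hash_fn( i ) % table_size;
    if ( ++count[bucket] > worst )
      worst = count[bucket];
  }
  free( count );
  return worst;
}

/* Hypothetical stand-in hash; the lab's functions (buzhash, hash_CRC, ...) would
   be plugged in here instead. */
static unsigned toy_hash( int key ) {
  return (unsigned)key * 2654435761u;
}

int main( ) {
  printf( "llps = %d\n", llps_sketch( 10000, 1000, toy_hash ) );
  return 0;
}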

Running the code:

int main( ) {
  int sample_size = 100000;
  int n_keys = 10000;
  int table_size = 1000;

  printf( "Buzhash low %d/%d: llps = %d, expecting %g\n", n_keys, table_size, 
  llps( n_keys, low_entropy_src, table_size, buzhash ), expected_llps( n_keys, table_size ));
  printf( "Buzhash high %d/%d: llps = %d, expecting %g\n", n_keys, table_size, 
  llps( n_keys, typical_entropy_src, table_size, buzhash ), expected_llps( n_keys, table_size ));

  printf( "Hash CRC low %d/%d: llps = %d, expecting %g\n", n_keys, table_size, 
  llps( n_keys, low_entropy_src, table_size, hash_CRC ), expected_llps( n_keys, table_size ));
  printf( "Hash CRC high %d/%d: llps = %d, expecting %g\n", n_keys, table_size, 
  llps( n_keys, typical_entropy_src, table_size,  hash_CRC ), expected_llps( n_keys, table_size )); 

  printf( "Base 256 low %d/%d: llps = %d, expecting %g\n", n_keys, table_size, 
  llps( n_keys, low_entropy_src, table_size, base256 ), expected_llps( n_keys, table_size ));
  printf( "Base 256 high %d/%d: llps = %d, expecting %g\n", n_keys, table_size, 
  llps( n_keys, typical_entropy_src, table_size,  base256  ), expected_llps( n_keys, table_size ));

 return 0;
}


Gives:

Buzhash low 10000/1000: llps = 20, expecting 21.7096 
Buzhash high 10000/1000: llps = 22, expecting 21.7096
Hash CRC low 10000/1000: llps = 88, expecting 21.7096
Hash CRC high 10000/1000: llps = 23, expecting 21.7096
Base 256 low 10000/1000: llps = 10000, expecting 21.7096
Base 256 high 10000/1000: llps = 497, expecting 21.7096

Rearranged from best to worst:

Buzhash low 10000/1000: llps = 20, expecting 21.7096
Buzhash high 10000/1000: llps = 22, expecting 21.7096
Hash CRC high 10000/1000: llps = 23, expecting 21.7096
Hash CRC low 10000/1000: llps = 88, expecting 21.7096
Base 256 high 10000/1000: llps = 497, expecting 21.7096
Base 256 low 10000/1000: llps = 10000, expecting 21.7096

Conclusion

For Part 1, it was very difficult to determine the order of "randomness" of the functions because it is hard to weigh the different statistical tests against each other. The ranking would also likely change with the sample size: each function performs better within a certain range of inputs and so may be disadvantaged by this particular sample size.

The difference between rand and high_rand is that high_rand is generally better, but at the cost of slightly more processing power and memory. high_rand tends to 'over'-randomize, whereas rand 'under'-randomizes (i.e. it tends to be biased towards lower values rather than higher ones).

Some students ranked Buzhash as a better function than Buzhashn because it performed well overall, but the majority rated Buzhashn as the best function.

I would rank the hash functions from best to worst like this:

Buzhashn, Buzhash, Java String, hash_CRC, Java Object, Java Integer, base256.


For Part 2,

  • Buzhash is pretty much consistent with the expected result
  • Java_String seems to perform better than expected
  • Java_Object seems to be broken with a low entropy source
  • hash_CRC tends to have a higher llps than expected

The results in Part 2 roughly match what we expected from theory (the results in Part 1 of the experiment).

The hash functions tested in Part 2, ordered from best to worst, are:

Buzhash
Hash CRC
Base 256

Base256 is the worst of all!


Difficulties or Problems of This Lab

After reading all the comments by John Hamer, it was apparent that most of the students had made choices for the values for sample size, n_keys and table size without solid justification. One of the best justifications seen in the lab reports was:

Results for the tests were nearly the same for sample sizes above about 10,000.

And since this was taken to be the sample size, the table size had to be somewhat greater than this to accommodate the sample size, and similarly for n_keys.

Another problem was that most people just copied and pasted the output without sufficiently explaining the results. Many people failed to do most, if not all, of the testing and tasks involved. In some reports, it was noticed that the table was either underloaded or overloaded.