SE250:lab-5:bvic005

From Marks Wiki
Jump to navigation Jump to search

This lab was quite hard, as a lot of the code we where provided was past our level of expertise, and it used statistical measures that where not explained, and we where not familiar with. However, after almost hour of blank staring, and a lot of question asking, i managed to come up with this code to test the provided function, and i had a general idea as to what the output meant:

The Code

int main( ) {
  int sample_size = 100000;
  int n_keys = 1000;
  int table_size = 500;

  ent_test( "Buzhash low     ", low_entropy_src, sample_size, &rt_add_buzhash );
  ent_test( "Buzhashn low    ", low_entropy_src, sample_size, &rt_add_buzhashn );
  ent_test( "Hash CRC low    ", low_entropy_src, sample_size, &rt_add_hash_CRC );
  ent_test( "Base256 low     ", low_entropy_src, sample_size, &rt_add_base256 );
  ent_test( "Java Integer low", low_entropy_src, sample_size, &rt_add_Java_Integer );
  ent_test( "Java Object low ", low_entropy_src, sample_size, &rt_add_Java_Object );
  ent_test( "Java String low ", low_entropy_src, sample_size, &rt_add_Java_String );
  ent_test( "Random low      ", low_entropy_src, sample_size, &rt_add_rand );
  ent_test( "High Random low ", low_entropy_src, sample_size, &rt_add_high_rand );

  printf("\n");

  ent_test( "Buzhash     ", typical_entropy_src, sample_size, &rt_add_buzhash );
  ent_test( "Buzhashn    ", typical_entropy_src, sample_size, &rt_add_buzhashn );
  ent_test( "Hash CRC    ", typical_entropy_src, sample_size, &rt_add_hash_CRC );
  ent_test( "Base256     ", typical_entropy_src, sample_size, &rt_add_base256 );
  ent_test( "Java Integer", typical_entropy_src, sample_size, &rt_add_Java_Integer );
  ent_test( "Java Object ", typical_entropy_src, sample_size, &rt_add_Java_Object );
  ent_test( "Java String ", typical_entropy_src, sample_size, &rt_add_Java_String );
  ent_test( "Random      ", typical_entropy_src, sample_size, &rt_add_rand );
  ent_test( "High Random ", typical_entropy_src, sample_size, &rt_add_high_rand );

  printf("\n");


	printf( "Buzhashn %d/%d: llps = %d, expecting %g\n",
  			  n_keys, table_size,
  		  llps_i( n_keys, table_size, buzhashn ),
		  expected_llps( n_keys, table_size ) );

	printf( "Java Integer %d/%d: llps = %d, expecting %g\n",
				  n_keys, table_size,
			  llps_i( n_keys, table_size, Java_Integer_hash ),
		  expected_llps( n_keys, table_size ) );

	printf("\n");

	printf( "Buzhash low %d/%d: llps = %d, expecting %g\n",
			  n_keys, table_size,
		  llps( n_keys, low_entropy_src, table_size, buzhash ),
		  expected_llps( n_keys, table_size ) );
	printf( "Hash CRC low %d/%d: llps = %d, expecting %g\n",
			  n_keys, table_size,
		  llps( n_keys, low_entropy_src, table_size, hash_CRC ),
		  expected_llps( n_keys, table_size ) );
	printf( "Base256 low %d/%d: llps = %d, expecting %g\n",
			  n_keys, table_size,
		  llps( n_keys, low_entropy_src, table_size, base256 ),
		  expected_llps( n_keys, table_size ) );
	printf( "Java Object low %d/%d: llps = %d, expecting %g\n",
			  n_keys, table_size,
		  llps( n_keys, low_entropy_src, table_size, Java_Object_hash ),
		  expected_llps( n_keys, table_size ) );
	printf( "Java String low %d/%d: llps = %d, expecting %g\n",
			  n_keys, table_size,
		  llps( n_keys, low_entropy_src, table_size, Java_String_hash ),
		  expected_llps( n_keys, table_size ) );

	printf("\n");

	printf( "Buzhash %d/%d: llps = %d, expecting %g\n",
			  n_keys, table_size,
		  llps( n_keys, typical_entropy_src, table_size, buzhash ),
		  expected_llps( n_keys, table_size ) );
	printf( "Hash CRC %d/%d: llps = %d, expecting %g\n",
			  n_keys, table_size,
		  llps( n_keys, typical_entropy_src, table_size, hash_CRC ),
		  expected_llps( n_keys, table_size ) );
	printf( "Base256 %d/%d: llps = %d, expecting %g\n",
			  n_keys, table_size,
		  llps( n_keys, typical_entropy_src, table_size, base256 ),
		  expected_llps( n_keys, table_size ) );
	printf( "Java Object %d/%d: llps = %d, expecting %g\n",
			  n_keys, table_size,
		  llps( n_keys, typical_entropy_src, table_size, Java_Object_hash ),
		  expected_llps( n_keys, table_size ) );
	printf( "Java String %d/%d: llps = %d, expecting %g\n",
			  n_keys, table_size,
		  llps( n_keys, typical_entropy_src, table_size, Java_String_hash ),
	  expected_llps( n_keys, table_size ) );


  return 0;
}

Code Explanation

Basically, my code tests the functions in two main parts.

First, it tests the theoretical randomness of hash values generated by the functions, from a "low entropy", and then from a "typical entropy" source list of keys. For this test, i used a sample size(number of hash keys generated) of 100,000, because it is a generic "large" number, to hopefully give the functions a fair shot at being random, and it is not to large, so the tests are preformed quickly.

Secondly, it tests the actual length of linked lists, when a set amount of keys are inserted in a hash table. Particularly, it looks at the longest list, compared to the statistically expected longest list. First, two integer hash functions are tested, and then the string hash functions are tested for "low entropy" and "typical entropy" keys.

The Results

Here are some results:

Buzhash low             7.99830 75.00%  127.578 0.71%   0.000327
Buzhashn low            7.99824 50.00%  127.494 0.13%   -0.003092
Hash CRC low            5.59831 0.01%   81.783  24.53%  0.028545
Base256 low             0.00000 0.01%   97.000  27.32%  undefined
Java Integer low        4.82824 0.01%   43.883  27.32%  -0.092002
Java Object low         2.00000 0.01%   77.000  27.32%  -0.521556
Java String low         7.99959 99.99%  127.438 0.22%   -0.000646
Random low              7.95308 0.01%   111.441 11.17%  -0.051837
High Random low         7.99821 50.00%  127.402 0.59%   -0.001222

Buzhash         7.99783 2.50%   127.374 0.77%   -0.000076
Buzhashn        7.99824 50.00%  127.494 0.13%   -0.003092
Hash CRC        7.84046 0.01%   122.862 1.98%   0.018518
Base256         4.02297 0.01%   107.853 27.32%  0.034082
Java Integer    4.82824 0.01%   43.883  27.32%  -0.092002
Java Object     5.68903 0.01%   88.695  22.13%  -0.136024
Java String     7.94554 0.01%   126.139 0.27%   0.021181
Random          7.95316 0.01%   111.468 11.13%  -0.048905
High Random     7.99807 50.00%  127.406 0.07%   -0.002226

Buzhashn 1000/500: llps = 5, expecting 7.45883
Java Integer 1000/500: llps = 2, expecting 7.45883

Buzhash low 1000/500: llps = 8, expecting 7.45883
Hash CRC low 1000/500: llps = 11, expecting 7.45883
Base256 low 1000/500: llps = 1000, expecting 7.45883
Java Object low 1000/500: llps = 1000, expecting 7.45883
Java String low 1000/500: llps = 4, expecting 7.45883

Buzhash 1000/500: llps = 10, expecting 7.45883
Hash CRC 1000/500: llps = 8, expecting 7.45883
Base256 1000/500: llps = 83, expecting 7.45883
Java Object 1000/500: llps = 13, expecting 7.45883
Java String 1000/500: llps = 8, expecting 7.45883

Theoretical Randomness Analysis

When looking at the results for randomness, the first colunm shows the entropy of the keys generated, or how many bytes on average where significant in making the hash unique(i think). As the hash key size is 8 bytes, the maximum and optimum entropy is 8. The buzhash, random and Java String functions worked well here, as did Hash CRC, in the typical entropy tests.

The second column is Chi Squared percentages. The optimum here is 50%, with buzhashn and high random doing well, but the values where very variably with sample size, so i am not sure we can draw anything significant from these values.

Third column is the mean value, with optimum randomness being a mean of 127.5. Once again, the best results come from the hash functions, the randoms, and the Java String. Hash CRC has a significantly "more random" mean with the typical entropy tests, once again(again).

The forth column is the error in a Monte Carlo calculation for pi, where why is estimated based on the samples. A perfectly random sample would theoretically generate pi exactly. Once again, the winners are Buzhash, Java String, and random.

Finally, we have the "Serial correlation coefficient", which has an optimum value of 0. The same winners once again, though the Java String typical entropy value is not that great.

From these results, it seems that Buzhash is the winner for theoretical randomness, though Java String seems to be a close second.

Practical Length Assessment

Next we need to look at the list length results. Note that i went for twice as many entries as the table length, to "stress" it a bit. For these values, a longest list of 7.4 is theoretically expected, so this or smaller would be good.

For the integer hash functions, Buzhash performs quite well with a longest list of 5, and very surprisingly, based on its randomness results, Java Integer performs perfectly, with a longest list of 2. there must be something special about the Java Integer function that allows it to distribute hash keys perfectly, without relying on randomness to do so.

For the string functions, Buzhash and Hash CRC perform well, with longest lengths around 8-10. Java String gets a similar 8 for typical entropy, but outperforms them with a 4 for low entropy keys.

Base 256 and Java object get perfectly bad longest lists of 1000 for the low entropy keys, though this is apparently an anomaly caused by how the low entropy keys are generated. Java object recovers quite well for a 13 in typical entropy, whereas Base 256 gets a still bad 83.

Conclusion

Basically, it seems that in out best test of real life performance, the typical entropy longest length, either of the integer key hash functions will do nicely, though the Java Integer one seems to magically get wonderful results, despite a lack of randomness. For string key hash functions, both Buzhash and Java String are very good. Hash CRC also could be a possibility, though its lower randomness may mean that it is not quite as consistent as the others.

Bvic005 02:18, 9 April 2008 (NZST)