CpSc 360

Lecture 19

 

Chapter 10 – Hashing

 

Orders of Complexity for Seeking Records in a File

 

·       Sequential searches require O(N) seeks to find a record

·       B-trees require O(log_k N) seeks to find a record, where k is a measure of the leaf size.

·       The ultimate goal of file design is to require O(1) seeks
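To make the three orders concrete, the following Python sketch counts seeks for a file of N records. The function names and the sample values (one million records, k = 100) are illustrative assumptions, not figures from the lecture.

```python
def sequential_seeks(n):
    """Worst-case seeks for a sequential search: every record is examined."""
    return n


def btree_seeks(n, k):
    """Smallest number of levels L such that k**L >= n, i.e. ceil(log_k n).

    Computed with integers to avoid floating-point rounding in math.log.
    """
    levels = 0
    reachable = 1
    while reachable < n:
        reachable *= k
        levels += 1
    return levels


def hashed_seeks(n):
    """Ideal hashing: the address is computed, so one seek regardless of n."""
    return 1


if __name__ == "__main__":
    n, k = 1_000_000, 100  # assumed sample values
    print(sequential_seeks(n), btree_seeks(n, k), hashed_seeks(n))
```

With these assumed values the three counts are 1000000, 3 and 1, which is why O(1) access is the goal.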

 

Hashing

 

·       A hash function is a function h(K) that transforms a key K into an address of the record containing K in a file

·       The value of the function h(K) is said to be the home address of K.

·       Hashing is different from indexing in 2 ways:

·       The home addresses appear to be random; that is, there is no apparent connection between the key and the location. For this reason hashing is sometimes referred to as randomizing.

·       Two different keys may be transformed to the same address.  The keys are said to collide and are said to be synonyms.

·       Collisions cause problems because we cannot put two records in the same location.

·       A hashing algorithm that produces no collisions is called a perfect hashing algorithm.

·       Perfect hashing algorithms are usually impossible to find, so in practice we try to reduce the number of collisions to an acceptable level and then design a work-around for the collisions that remain.  Ways to reduce collisions include:

·       Spread out the records more uniformly across the possible locations; that is, break up the clusters in the hashed output.

·       Have a large location-to-record ratio; that is, have many more locations than we have records.  This technique wastes a great deal of space.

·       Put more than one record at a single address.  Rather than hashing into a record location, hash into a bucket (block) that has a capacity of several synonym records.
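The following Python sketch illustrates synonyms and the bucket technique from the last item. The hash function toy_hash (sum of character codes mod the number of addresses) and the helper store are hypothetical, chosen only because they make collisions easy to produce; they are not the algorithm developed in this lecture.

```python
def toy_hash(key, num_addresses):
    """Hypothetical hash: sum of character codes modulo the address count."""
    return sum(ord(ch) for ch in key) % num_addresses


def store(buckets, key, bucket_size):
    """Place key in its home bucket; return False if that bucket is full."""
    home = toy_hash(key, len(buckets))
    if len(buckets[home]) >= bucket_size:
        return False  # overflow: a work-around for the collision is needed
    buckets[home].append(key)
    return True


if __name__ == "__main__":
    buckets = [[] for _ in range(7)]
    # "AC", "CA" and "BB" all have character-code sum 132, so they are synonyms.
    for key in ("AC", "CA", "BB"):
        print(key, toy_hash(key, 7), store(buckets, key, bucket_size=2))
```

With a bucket size of 2, the first two synonyms fit at their home address and only the third overflows. With a bucket size of 1 (one record per address), the second key would already be a problem.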

 

A Simple Hashing Algorithm

 

·       Hash means to “chop up into small pieces … muddle or confuse”. 

·       A simple algorithm might be to:

1.     Represent the key in numerical form

2.     Fold and add

3.     Divide by a prime number and use the remainder as the address

·       Example

 

1.     We want to hash the key:   LOWELL

 

The ASCII codes for this key, padded with blanks to 10 characters, are:

 

LOWELL = 76 79 87 69 76 76 32 32 32 32

          L  O  W  E  L  L  |- spaces -|

 

We treat each ASCII character as a number with a value as shown above.

 

2.     To fold and add the number we separate the numbers into pairs as shown below:

 

76 79 | 87 69 | 76 76 | 32 32 | 32 32

 

and add the pairs together to form a sum:

 

         7679 + 8769 + 7676 + 3232 + 3232 = 30588

 

·       Since this sum may in general exceed the capacity of a two-byte signed integer (maximum 32,767), we must protect against overflow.

·       One way to do this is to be certain at each step of the addition we leave room to add the largest possible number in the next step.

·       If the largest possible value pair is ZZ (9090) then we make sure that no partial sum exceeds 32,767 – 9090 = 23677.  We can pick a smaller number than 23677 and the logic still holds.

·       If we choose 19937 as the upper limit on partial sums then we are protected: the largest possible partial sum is 19936, and 19936 + 9090 = 29026, which is still below 32,767.  We can enforce this protection by using the mod function as follows:

 

7679 + 8769  = 16448             16448 mod 19937 = 16448

16448 + 7676 = 24124             24124 mod 19937 = 4187

4187 + 3232  = 7419              7419  mod 19937 = 7419

7419 + 3232  = 10651             10651 mod 19937 = 10651

 

The number 10651 is the result of the fold-and-add operation.
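The fold-and-add step with overflow protection can be sketched in Python as follows. The function name fold_and_add_trace and its parameters are illustrative assumptions. The sketch also assumes an even key width and character codes below 100 (true for uppercase letters and blanks), so that a pair such as 76 79 can be formed as 76 * 100 + 79.

```python
def fold_and_add_trace(key, width=10, limit=19937):
    """Return the partial sums produced while folding and adding key.

    Assumes width is even and every character code is below 100
    (uppercase letters and blanks), so a pair of codes can be
    combined as first * 100 + second.
    """
    padded = key.ljust(width)[:width]   # pad with blanks, or truncate
    partial_sums = []
    total = 0
    for i in range(0, width, 2):
        pair = ord(padded[i]) * 100 + ord(padded[i + 1])
        total = (total + pair) % limit  # keep the running sum below limit
        partial_sums.append(total)
    return partial_sums


if __name__ == "__main__":
    print(fold_and_add_trace("LOWELL"))
```

For LOWELL the partial sums are 7679, 16448, 4187, 7419 and 10651, matching the hand calculation. Python integers do not overflow, so the mod is applied here only to mirror the two-byte arithmetic described in the notes.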

           

3.     Divide by the size of the address space and take the remainder

 

Address = Sum mod Address space

 

In our example, assuming an address space of 251 locations (addresses 0 through 250) in which records will be stored:

 

         Address = 10651 mod 251 = 109
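Putting the three steps together gives a complete, if simple, hash function. The sketch below is one possible Python rendering; the name hash_address and the default arguments (251 addresses, 10-character keys, a partial-sum limit of 19937) are taken from the worked example or assumed for illustration. It assumes keys made of uppercase letters and blanks.

```python
def hash_address(key, address_space=251, width=10, limit=19937):
    """Three-step hash: numeric codes, fold and add, then mod address space."""
    padded = key.ljust(width)[:width]             # step 1: fixed-width key
    total = 0
    for i in range(0, width, 2):                  # step 2: fold and add
        pair = ord(padded[i]) * 100 + ord(padded[i + 1])
        total = (total + pair) % limit            # overflow protection
    return total % address_space                  # step 3: home address


if __name__ == "__main__":
    print(hash_address("LOWELL"))  # 109
```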

 

A prime number usually works better because primes tend to distribute remainders more uniformly than do non-primes.
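A small experiment shows the effect. The keys below are a hypothetical, deliberately patterned set (every tenth integer), and distinct_addresses is an illustrative helper. Because 250 shares the factor 10 with every key, the remainders pile up on a few addresses, whereas the prime 251 shares no factor with the key spacing.

```python
def distinct_addresses(keys, divisor):
    """Count how many different remainders the keys produce."""
    return len({key % divisor for key in keys})


if __name__ == "__main__":
    keys = range(0, 2500, 10)               # 250 assumed keys: 0, 10, ..., 2490
    print(distinct_addresses(keys, 250))    # 25
    print(distinct_addresses(keys, 251))    # 250
```

With divisor 250 the 250 keys share only 25 addresses (ten synonyms each); with divisor 251 every key gets its own address. Real key sets are rarely this regular, so the gain is usually smaller.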

 

·       Although there have been many approaches to constructing hashing functions, none seem to work significantly better with unknown data than those that end with a mod function using the address space of record locations as a base.

 

How Much Extra Memory Should Be Used

 

·       The packing density is the ratio of the number of records to be stored, r, to the number of available spaces N. 

 

Number of records / Number of spaces = r / N = packing density

 

·       In general, the greater the packing density, the higher the probability of collisions.
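The relationship can be illustrated numerically. The sketch below assumes one record per address and a hash function that assigns addresses uniformly and independently at random, which is an idealization and not a property of any particular function in these notes. Under that assumption the expected number of occupied addresses is N(1 - (1 - 1/N)^r), and every record beyond one per occupied address must overflow. The function names are illustrative.

```python
def packing_density(num_records, num_spaces):
    """Packing density r / N."""
    return num_records / num_spaces


def expected_overflow(num_records, num_spaces):
    """Expected number of records that cannot sit at their home address.

    Assumes one record per address and uniformly random, independent
    hashing (an idealized model).
    """
    empty_fraction = (1 - 1 / num_spaces) ** num_records
    expected_occupied = num_spaces * (1 - empty_fraction)
    return num_records - expected_occupied


if __name__ == "__main__":
    n = 100  # assumed number of available spaces
    for r in (25, 50, 75, 100):
        print(packing_density(r, n), round(expected_overflow(r, n), 1))
```

The expected number of overflow records grows as r approaches N, which is the trade-off between wasted space and collisions.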