CpSc 360
Chapter 10 – Hashing
Orders of Complexity for Seeking Records in a File
· Sequential searches require O(N) seeks to find a record.
· B-Trees require O(log_k N) seeks to find a record, where k is the measure of leaf size.
· The ultimate goal of file design is to require O(1) seeks.
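To give a feel for how far apart these orders are, the sketch below estimates seek counts for one file. The file size N, the node size K, and the helper name btree_seeks are hypothetical values chosen for illustration, not figures from the course.

```python
def btree_seeks(n_records: int, k: int) -> int:
    """Smallest h with k**h >= n_records, i.e. ceil(log_k N) using integers only."""
    height, reach = 0, 1
    while reach < n_records:
        reach *= k
        height += 1
    return height


N = 1_000_000  # hypothetical number of records in the file
K = 100        # hypothetical number of keys per B-tree node

seek_estimates = {
    "sequential (worst case)": N,         # O(N)
    "B-tree": btree_seeks(N, K),          # O(log_k N)
    "hashing (ideal)": 1,                 # O(1)
}

for method, seeks in seek_estimates.items():
    print(f"{method}: {seeks} seeks")
```

With these assumed values, a B-tree needs about 3 seeks where a sequential search may need a million, and ideal hashing needs one.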
· A hash function is a function h(K) that transforms a key K into the address of the record containing K in a file.
· The value of the function h(K) is said to be the home address of K.
· Hashing is different from indexing in 2 ways:
  · The home addresses appear to be random, i.e. there is no apparent connection between the key and the location. Hashing is sometimes referred to as randomizing.
  · Two different keys may be transformed to the same address. The keys are then said to collide and are called synonyms.
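A minimal sketch of how synonyms arise. The hash function h here is a deliberately naive stand-in (sum of character codes, mod an assumed 101 addresses), and the sample keys are made up for illustration.

```python
from collections import defaultdict


def h(key: str, n_addresses: int = 101) -> int:
    """Toy hash (an assumption): sum of character codes mod the number of addresses."""
    return sum(ord(ch) for ch in key) % n_addresses


keys = ["LOWELL", "OLWELL", "ADAMS"]  # hypothetical keys

# Group the keys by home address.
by_address = defaultdict(list)
for key in keys:
    by_address[h(key)].append(key)

# Any address holding more than one key is a collision; those keys are synonyms.
collisions = {addr: ks for addr, ks in by_address.items() if len(ks) > 1}
print(collisions)  # {59: ['LOWELL', 'OLWELL']}
```

Because this toy function ignores character order, any two anagrams are synonyms, which is one reason real hash functions mix the characters more thoroughly.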
· Collisions cause problems because we cannot put two records in the same location.
· A hashing algorithm that produces no collisions is called a perfect hashing algorithm.
· Perfect hashing algorithms are usually impossible to find, so we more often try to reduce the number of collisions to an acceptable level and then design a work-around for the collisions that remain.
· Ways to reduce the number of collisions:
  · Spread out the records more uniformly across the possible locations, i.e. break up the clusters in the hashed output.
  · Have a large location-to-record ratio, i.e. have many more locations than we have records. This technique wastes much space.
  · Put more than one record at a single address. Rather than hashing into a record location, hash into a bucket (block) that has a capacity of several synonym records.
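The bucket idea can be sketched as follows. The class name BucketFile, the bucket count, the capacity, and the toy address function are all assumptions made for this illustration. Overflow handling is left out on purpose: an insert into a full bucket simply reports failure.

```python
class BucketFile:
    """In-memory stand-in for a hashed file whose addresses are buckets."""

    def __init__(self, n_buckets: int, capacity: int) -> None:
        self.capacity = capacity
        self.buckets = [[] for _ in range(n_buckets)]

    def address(self, key: str) -> int:
        # Toy hash (an assumption): sum of character codes mod the bucket count.
        return sum(ord(ch) for ch in key) % len(self.buckets)

    def insert(self, key: str) -> bool:
        """Store the key in its home bucket; return False if the bucket is full."""
        bucket = self.buckets[self.address(key)]
        if len(bucket) >= self.capacity:
            return False  # overflow: a real file needs a work-around here
        bucket.append(key)
        return True


f = BucketFile(n_buckets=7, capacity=3)
# These four keys are anagrams, so the toy hash sends them all to one bucket.
results = [f.insert(k) for k in ["LOWELL", "OLWELL", "LLEWOL", "WOLLEL"]]
print(results)  # [True, True, True, False]
```

Three synonyms fit in the bucket without any special handling; only the fourth overflows.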
· Hash means to “chop up into small pieces … muddle or confuse”.
· A simple algorithm might be to:
  1. Represent the key in numerical form
  2. Fold and add
  3. Divide by a prime number and use the remainder as the address
· Example
  1. We want to hash the key: LOWELL
     The ASCII code for this 10-character key (padded with spaces) is:
       L  O  W  E  L  L  |-- spaces --|
       76 79 87 69 76 76 32 32 32 32
  2. To fold and add the number we separate the codes into pairs as shown below:
       76 79 | 87 69 | 76 76 | 32 32 | 32 32
     and add the pairs together to form a sum:
       7679 + 8769 + 7676 + 3232 + 3232 = 30588
     · Since this sum in general may exceed the capacity of a two-byte signed binary number (32,767), we must protect against overflow.
     · One way to do this is to be certain at each step of the addition that we leave room to add the largest possible number in the next step.
     · If the largest possible value pair is ZZ (9090), then we make sure that no partial sum exceeds 32,767 - 9090 = 23677. We can pick a smaller number than 23677 and the logic still holds.
     · If we choose 19937 as the upper limit on partial sums then we are protected, since 19936 + 9090 = 29026 is still below 32,767. We can enforce this protection by using the mod function as follows:
         7679 + 8769 = 16448     16448 mod 19937 = 16448
         16448 + 7676 = 24124    24124 mod 19937 = 4187
         4187 + 3232 = 7419      7419 mod 19937 = 7419
         7419 + 3232 = 10651     10651 mod 19937 = 10651
       The number 10651 is the result of the fold-and-add operation.
  3. Divide by the size of the address space and take the remainder:
       Address = Sum mod Address space
     In our example, assuming about 250 locations are needed for the records, we use the nearby prime 251 as the size of the address space:
       Address = 10651 mod 251 = 109
     A prime number usually works better because primes tend to distribute remainders more uniformly than do non-primes.
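The three steps of the example can be sketched in Python as below. The function names fold_and_add and hash_key are hypothetical, and the sketch assumes an even key length and keys made of characters whose ASCII codes have two digits (uppercase letters, digits, spaces), so that code1 * 100 + code2 matches the pairing used above. Python integers do not overflow, so the mod 19937 step is kept only to mirror the two-byte arithmetic in the notes.

```python
def fold_and_add(key: str, key_length: int = 10, limit: int = 19937) -> int:
    """Steps 1 and 2: turn the key into ASCII pairs and add them, mod `limit`."""
    # Pad with spaces (or truncate) to a fixed, even length.
    padded = key.ljust(key_length)[:key_length]
    total = 0
    for i in range(0, key_length, 2):
        # Two two-digit codes form one four-digit pair, e.g. 'L','O' -> 7679.
        pair = ord(padded[i]) * 100 + ord(padded[i + 1])
        # Reduce each partial sum so the next addition cannot pass 32,767.
        total = (total + pair) % limit
    return total


def hash_key(key: str, n_addresses: int = 251) -> int:
    """Step 3: divide by the address-space size and keep the remainder."""
    return fold_and_add(key) % n_addresses


print(fold_and_add("LOWELL"))  # 10651
print(hash_key("LOWELL"))      # 109
```

The printed values match the hand calculation: the fold-and-add result is 10651 and the home address is 109.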
· Although there have been many approaches to constructing hashing functions, none seems to work significantly better with unknown data than those that end with a mod function using the size of the address space as the divisor.
· The packing density is the ratio of the number of records to be stored, r, to the number of available spaces, N:
    packing density = Number of records / Number of spaces = r / N
· In general, the greater the packing density, the higher the probability of collisions.
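A rough simulation can illustrate the trend. It is a sketch under one strong assumption: the hash function is replaced by uniformly random addresses from a seeded generator, which is only a stand-in for a well-behaved hash. The function names and the file sizes are hypothetical.

```python
import random


def packing_density(r: int, n: int) -> float:
    """Ratio of records to be stored (r) to available spaces (N)."""
    return r / n


def count_collisions(r: int, n: int, seed: int = 0) -> int:
    """Place r records at random addresses in 0..n-1; count those landing on a used address."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    occupied = set()
    collisions = 0
    for _ in range(r):
        address = rng.randrange(n)  # assumption: stands in for h(K)
        if address in occupied:
            collisions += 1
        else:
            occupied.add(address)
    return collisions


N_SPACES = 10_000  # hypothetical address-space size
for r in (1_000, 5_000, 9_000):
    density = packing_density(r, N_SPACES)
    print(f"density {density:.1f}: {count_collisions(r, N_SPACES)} collisions")
```

The exact counts depend on the random draws, but the number of collisions grows much faster than the number of records as the density approaches 1.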
