The following article appeared in the April 1985 issue of the magazine "Computer Language".
An introductory article on searching by hashing.

Search by hashing


By Namir Clement Shammas


In the last issue we talked about mathematical parsers and interpreters. This month, let's take a look at searching methods that use hashing techniques, applied to data in memory as well as in stored files.
Hashing techniques are important for data base programs as well as language compilers, interpreters, preprocessors and any software that must search a list of names quickly and efficiently.
The idea behind the hash-based search is to transform the search key, be it numeric or alphanumeric, into a unique address used to locate the information. This procedure offers a fast search capability to a computer system, useful, for example, when an airline needs to pull out passenger reservation information without delays. We will explore hashing, its merits and faults, and suggest some remedies for its weaknesses.
The technique in question essentially relies on the use of a hash function, H(x), that maps the search key onto (we hope) a unique address in a predefined range of addresses.
The address range limitation is one weakness of hashing. The majority of hash functions generate random values (or addresses). Thus a list of ordered keys would create scattered addresses that do not reflect the original order, as in:
If key X > key Y ... then
H(X) > H(Y)
is not always true.
This immediately tells us that using such hash functions deprives us of the ability to easily maintain sorted lists, which is the price to pay for speed. If sorting is required, one can use an order-preserving hashing function, such that:
If key X > key Y ... then
H(X) > H(Y)
Since hashing functions produce numeric addresses, textual keys must be converted into numerical values. Using the ASCII codes is one way to go; however, each character yields two or three digits. Shifting values can also be used.
The calculated addresses can be used either to store the data records themselves, when a single search key is used, or to store their indices, when the data is searched by multiple keys. In the latter case the records are stored sequentially.
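The ASCII-based conversion can be sketched as follows. This is a minimal illustration of my own, not the article's code: `ascii_concat` concatenates character codes (two or three digits each, as noted above), and `ascii_shift` combines them by bit-shifting:

```python
# Sketch: converting a textual key into a numeric value via ASCII codes.

def ascii_concat(key: str) -> int:
    # "AB" -> 6566, since ord('A') = 65 and ord('B') = 66
    return int("".join(str(ord(c)) for c in key))

def ascii_shift(key: str) -> int:
    # Combine characters by shifting the running value left 8 bits
    # and adding the next character's code.
    value = 0
    for c in key:
        value = (value << 8) + ord(c)
    return value

print(ascii_concat("AB"))   # 6566
print(ascii_shift("AB"))    # 16706, i.e. 65*256 + 66
```

Either numeric form can then be fed to any of the hashing functions discussed below.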

What are the types of hashing functions? How do they work? What are their strengths and weaknesses? These are the questions to answer to properly choose a hashing function.
To answer the first question, keep in mind that there are as many types of hash functions as the imagination can create. Here are a few:

Truncation. A specific set of digits, such as the second, third and fifth digits in an eight-digit key, is selected. The rest is ignored. This method provides a fast and easy way of obtaining an address in the hash table but often fails to provide an even distribution in the table.
Folding. The original numeric key is divided into segments that are simply added. If the result is bigger than the hash table size, the address is taken as (sum of digits) modulus (table size). For example, a 10-digit key (1234567890) is folded into 34567 and 12890. Their sum yields 47457. The latter is a suitable address for a hash table with a size of 50,000. The method is reportedly very good at randomizing the key. However, it depends on the key sequence and is not very reliable.
Modular arithmetic. The address is obtained as the remainder from dividing the numeric key by the hash table size. This very popular method yields a good spread over the hash table.
Midsquare method. This method yields an address by squaring the numeric key and selecting an appropriate number of digits from the middle of the square. This method has been criticized by some but proven to be effective with certain types of keys.
Length dependent method. The length of the key is used along with some portion of the key to produce an address. On many occasions the address is seen as an intermediate value that yields the sought location via modulus calculations with the hash table size (as with modular arithmetic). This method is useful in dealing with alphanumeric keys.
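The five families above can be sketched roughly as follows. These are illustrative sketches, not the article's code: the digit positions, segment sizes, and table size are arbitrary choices, and the folding variant here is a plain split-and-add (the article's worked example combines its segments differently):

```python
# Hedged sketches of the five hash-function families described above.

TABLE_SIZE = 50000   # illustrative table size

def truncation(key: int) -> int:
    # Select the 2nd, 3rd, and 5th digits of an eight-digit key.
    d = str(key).zfill(8)
    return int(d[1] + d[2] + d[4])

def folding(key: int) -> int:
    # Split the key into 5-digit segments and add them.
    d = str(key)
    segments = [int(d[i:i + 5]) for i in range(0, len(d), 5)]
    return sum(segments) % TABLE_SIZE

def modular(key: int) -> int:
    # Remainder of dividing the key by the table size.
    return key % TABLE_SIZE

def midsquare(key: int) -> int:
    # Square the key and pick digits from the middle of the square.
    sq = str(key * key)
    mid = len(sq) // 2
    return int(sq[mid - 2:mid + 3]) % TABLE_SIZE

def length_dependent(key: str) -> int:
    # Mix the key's length with part of the key, then fold into the table.
    partial = sum(ord(c) for c in key[:4])
    return (partial * len(key)) % TABLE_SIZE
```

For instance, `modular(1234567890)` yields 17890 with this table size; the other functions are equally cheap to compute.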
Many of the books listed in the reference section discuss the types of hashing functions in more detail.
A reader may ask, "Is there any hashing function that guarantees no two different keys will yield the same address?" The answer is no.
Collision, as the effect is called, is almost certain and is caused by the choice of hashing function or the use of a small size for the hash table. Increasing the size will not eliminate collision completely, only decrease it.
The next question is, "What methods do we use to deal with collision?" The good news is that there are a good number. One category, called open addressing, includes:
Linear probing. This method is very simple. If the hashing function gives an address that turns out to be already occupied, it performs a sequential search for the locations that follow the collision site until an empty location is found. Keep in mind that the hash table should be regarded as a circular list. The major disadvantage of linear probing is the formation of clusters and an uneven distribution.
Quadratic probing. This technique is a modification of linear probing. If there is a collision at address H(x), then start probing at addresses H(x)+1, H(x)+4, H(x)+9, and so on. While this method resolves the problem of clustering, you can clearly see that not all the locations of the hash table will be probed.
Key dependent probing. This method uses part of the search key to decide the magnitude of an address offset once collision occurs.
Increment functions. A more sophisticated approach is to use a set of hash functions instead of one function. If the address from the first one causes collision, then we apply the second hash function to calculate an address. If collision occurs with the second, we use the third hash function and so on.
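The two simplest open-addressing schemes, linear and quadratic probing, can be sketched as follows. The table size, sample keys, and the use of `None` to mark an empty slot are illustrative assumptions of mine:

```python
# Sketch of open addressing: the table is treated as circular, and a
# collision triggers further probes until an empty slot is found.

TABLE_SIZE = 11   # illustrative; a prime size helps spread the probes

def linear_probe(table, key):
    # Probe H(x), H(x)+1, H(x)+2, ... until key or an empty slot appears.
    h = key % TABLE_SIZE
    for i in range(TABLE_SIZE):
        slot = (h + i) % TABLE_SIZE
        if table[slot] is None or table[slot] == key:
            return slot
    raise RuntimeError("table full")

def quadratic_probe(table, key):
    # Probe H(x), H(x)+1, H(x)+4, H(x)+9, ... as described above; note
    # that not every slot is necessarily reached this way.
    h = key % TABLE_SIZE
    for i in range(TABLE_SIZE):
        slot = (h + i * i) % TABLE_SIZE
        if table[slot] is None or table[slot] == key:
            return slot
    raise RuntimeError("no free slot probed")

table = [None] * TABLE_SIZE
for k in (5, 16, 27):          # all three collide: 5 % 11 == 16 % 11 == 27 % 11
    table[linear_probe(table, k)] = k
print(table[5], table[6], table[7])   # 5 16 27 -- a small cluster forms
```

The run above shows linear probing's weakness directly: the three colliding keys pile up in adjacent slots, the clustering mentioned earlier.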

Chaining, another method for dealing with collision, dictates the use of two types of storage locations: the hash table and the overflow area. When a collision occurs, say between the newly added key Y and resident key X, the former is stored in the overflow area at the next available location. Key X sets a pointer to the site where key Y is stored, forming a linked list. If another collision occurs between another inserted key Z and X, then Z is also stored in the overflow area. To maintain the linked list, key Y will contain the pointer to the site of key Z.
The advantage of chaining lies in economical storage, which becomes more evident as the records stored are larger. The hash table need not be oversized.
The disadvantage of chaining becomes apparent when searching through an unsorted list. If a sought key is nonexistent, then the entire linked list is searched. However, chaining works better than open addressing techniques. Data Structures and Program Design uses mathematical proof to demonstrate this.
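A bare-bones sketch of chaining follows; each hash table slot heads a linked list of colliding keys. The table size and sample keys are arbitrary choices of mine:

```python
# Sketch of chaining: each slot holds the head of a linked list of keys.

TABLE_SIZE = 7   # illustrative

class Node:
    def __init__(self, key):
        self.key = key
        self.next = None

table = [None] * TABLE_SIZE   # heads of the chains

def insert(key):
    # Link the new key at the head of its slot's chain.
    slot = key % TABLE_SIZE
    node = Node(key)
    node.next = table[slot]
    table[slot] = node

def search(key):
    # A miss must walk the entire chain -- chaining's weak point.
    node = table[key % TABLE_SIZE]
    while node is not None:
        if node.key == key:
            return True
        node = node.next
    return False

for k in (3, 10, 17):   # all three collide at slot 3
    insert(k)
print(search(10), search(24))   # True False
```

Note that the table itself holds only pointers, which is the storage economy mentioned above: the records can be as large as needed without inflating the table.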
Imagine searching in memory, using the chaining method to resolve collision. An alternative to maintaining linked lists in the overflow area is to use binary trees, especially when collisions occur frequently. This makes searching in the overflow area faster, since the keys in the overflow area are now sorted. The price to pay is that maintaining binary trees is more involved, especially when it comes to deletion.
The idea of using binary trees becomes more appealing when order-preserving hash functions are used. Such functions can simply use the first one or two most significant digits or letters of the search key.
The combination of hash table and binary trees — I'll call it the H-Tree — allows for fast searching and sorting in memory. The hash table will contain a set of roots for binary trees. By using the divide-and-conquer strategy, we are scanning fewer nodes at a time while adding, deleting and searching.
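The H-Tree can be sketched as a hash table whose slots hold the roots of binary search trees. This minimal illustration is my own, not the article's code; it shows colliding keys staying sorted within their slot:

```python
# Sketch of the "H-Tree": hash table slots hold binary search tree roots,
# so each slot's colliding keys remain sorted.

TABLE_SIZE = 4   # illustrative

class TreeNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

roots = [None] * TABLE_SIZE

def bst_insert(node, key):
    # Standard binary-search-tree insertion (duplicates ignored).
    if node is None:
        return TreeNode(key)
    if key < node.key:
        node.left = bst_insert(node.left, key)
    elif key > node.key:
        node.right = bst_insert(node.right, key)
    return node

def insert(key):
    slot = key % TABLE_SIZE
    roots[slot] = bst_insert(roots[slot], key)

def inorder(node):
    # In-order traversal yields a slot's keys in ascending order.
    return [] if node is None else inorder(node.left) + [node.key] + inorder(node.right)

for k in (12, 4, 8, 16):   # all collide at slot 0
    insert(k)
print(inorder(roots[0]))   # [4, 8, 12, 16]
```

Each tree holds only the keys of one slot, so additions, deletions and searches each scan a small tree rather than the whole key set; that is the divide-and-conquer payoff.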
What about searching in files saved on disks? First, we must remember that access time of a mechanical device is much greater than that for memory. So we need to perform the least number of disk accesses.
Second, since I/O operations occur via memory blocks, or buckets, of 256 bytes or multiples thereof, we can collect a number of keys in a bucket. Thus we can store the data records separately and create another storage area for the pages containing the search keys.
Again we have two storage areas for keys: the hash table and overflow area. The mechanism is very similar to that of memory-based search. The difference is that single keys in a memory-based search are replaced by buckets. Thus the hash table will contain a number of buckets, each containing colliding keys. When a hash table bucket overflows, a bucket in the overflow area is created, and the two buckets link using a pointer.
To decrease the number of collisions, bigger hash tables are required. This is a disadvantage, because we are reserving a lot of space and assuming even distribution of the keys. The following strategy deals with the problem of excessive space.
Using a two-dimensional hash table and two distinct hash functions, the first hash function determines the row address, and the second determines the column address. Any keys colliding due to the first hash function address are separated by the second hash function. The overflow area is still used, for nothing is collision-proof.
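A sketch of the two-dimensional addressing follows. The actual hash functions are an assumption of mine, chosen so the sample keys land where Figure 1 puts them (key 16 in Map(1,6), key 36 in Map(3,6)):

```python
# Sketch of two-dimensional addressing: Hash1 picks the row, Hash2 the
# column, so keys that collide on one often differ on the other.

ROWS, COLS = 10, 10   # illustrative map dimensions

def hash1(key):
    # Row address: the tens digit, for the two-digit keys of Figure 1.
    return (key // COLS) % ROWS

def hash2(key):
    # Column address: the last digit of the key.
    return key % COLS

def address(key):
    return hash1(key), hash2(key)

print(address(16))   # (1, 6) -- Map(1,6), as in Figure 1
print(address(36))   # (3, 6) -- same column, different row
```

Keys 16 and 36 share a column but get distinct rows, illustrating how the second function separates keys that the first would lump together.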
So far we have handled the collision problem but still have to solve the space allocation problem, especially in dealing with buckets. Here is the way out: rather than creating the space for the key buckets, we instead create, in memory, a two-dimensional map of pointers. As for the key buckets, we store them in chronological order as they are introduced into the system.
The map needs to be loaded at the beginning and stored at the end of a data-manipulation session. Here is how the scheme works:
Initialization. Assign the data file and the key-buckets file and zero all data counters and pointers. Initially the elements of the latter are pointing nowhere.
Start-up. As data starts to flow into the system, update the number of records and extract the search key. Use the latter to determine the row and column address in the key map. If the map element is zero, the creation of a new bucket is required. In any case the key is stored in a slot located inside a bucket. Figure 1 shows a fairly empty key map. The first record added has a key equal to 16, so Map(1,6) is the first-used pointer to the first bucket. A second record with a key equal to 36 is entered and is used to point to the second bucket, and so on.
Routine addition. As buckets are filled, new ones are created and pointers are used to maintain the linked list of buckets. Suppose that we add 19 more records, all with keys equal to 16. If each bucket holds 20 keys, then we have filled the first bucket. As we add the twenty-second record, again with a key equal to 16, a new bucket is created (number three, to be exact) and the key points to record number 22.
Deletion. It is usually more difficult to delete parts of data structures than to build them. As records and keys are deleted, they create gaps in the data lists. One can construct additional lists for the latter gaps. The lists are either used in periodic packing of files or to fill the gaps by storing newly added records and keys.
An alternative way to keep track of the lists is to move the last element in the linked lists of records and keys into the location of the deleted record and key. One must take into account the effect of emptying buckets. Their space must be regained. This will require that the buckets be double-linked lists to allow travel through the lists in both directions.
Listing 1 has a few procedures —written in programmer's pseudo listing (PPL) — to demonstrate the code for addition and search. I have not included the code for deletion due to space limitations.
If this technique is used with order-preserving hash functions then the ability to perform sorting is vastly improved. Once more the principle of divide-and-conquer is put to service. To sort, a program would go through the map in a systematic way (reflecting ascending or descending order) and read the unsorted list linked to that map element. The keys in each of the lists are then sorted in memory and the result output. Since there are relatively few elements in a set of linked lists, sorting should not be too painful.
The other alternative is to maintain the linked lists of buckets in a sorted order, which slows the process of adding and deleting data but yields an improved search.
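The map-traversal sort described above can be sketched as follows, assuming an order-preserving hash so that scanning the map row by row visits the buckets in key order. The nested-list layout is an illustrative stand-in of mine for the map and its linked bucket lists:

```python
# Sketch: with an order-preserving hash, reading the 2-D map in order and
# sorting each small bucket list yields the fully sorted key sequence.

def sorted_keys(key_map):
    # key_map: row-major 2-D list; each cell holds an unsorted key list.
    out = []
    for row in key_map:
        for bucket in row:
            out.extend(sorted(bucket))   # few keys per bucket -> cheap sorts
    return out

# Each cell covers an ascending key range, as an order-preserving hash ensures.
key_map = [[[3, 1, 2], [14, 11]],
           [[23, 21], [30]]]
print(sorted_keys(key_map))   # [1, 2, 3, 11, 14, 21, 23, 30]
```

Since each bucket list holds only a handful of keys, every in-memory sort is small, which is exactly the divide-and-conquer payoff described above.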
Data structures and management are of interest to most, if not all, programmers. Sorting and searching play a vital role in data processing. It is probably the subject most talked about and most researched. It is also an art.
The numerous algorithms involved and their modification form a vast number of methods. It is indeed a fascinating subject. Let me hear from you. If you have some code you developed that performs fast searching, then here is the good news: you may be the winner in our searching minicontest!
We will award prizes (at least a free one-year subscription extension) for the three best well-written and fastest (and I mean fast) codes. Because I may not have the same hardware and language implementation, I will rely on your speed test results, which should be included. The winners will be announced in four months.
In the next issue we will talk more about external hashing. Some interesting techniques overcome the limitations of a predetermined hash table size. I leave you with a list of references on the subject.

Mapped-keys strategy (pointers to the records belong to the first
keys added to a bucket)

      RAM key map
    0  1  2  3  4  5  6
   ---------------------
0  |  |  |  |  |  |  |  |           Key buckets      Data records
   ---------------------
1  |  |  |  |  |  |  |16|-----------v
   ---------------------           -------     1     ------------
2  |  |  |  |  |  |  |  |       +--| 16  | --------->|16 data   |
   ---------------------        |  +-----+     2     ------------
3  |  |  |  |  |  |  |36| -------->| 36  | --------->|36 data   |
   ---------------------        |  +-----+     3     ------------
                                 ->| 16  |----       |          |
                                   +-----+   |       ------------
                                             |       ............
                                             | 22    ------------
                                             ------> |16 data   |
                                                     ------------
Figure 1.
References

Hanson, O., 1982. Design of Computer Data Files. Rockville, Md.: Computer Science Press.
Horowitz, E. and Sahni, S., 1982. Fundamentals of Data Structures. Rockville, Md.: Computer Science Press.
Horowitz, E. and Sahni, S., 1984. Fundamentals of Data Structures in Pascal. Rockville, Md.: Computer Science Press.
Knuth, D., 1973. The Art of Computer Programming, Vol. 3: Sorting and Searching. Reading, Mass.: Addison-Wesley.
Kruse, R.L., 1984. Data Structures & Program Design. Englewood Cliffs, N.J.: Prentice-Hall.
Mehlhorn, K., 1984. Sorting and Searching. New York, N.Y.: Springer-Verlag.
Tenenbaum, A.M. and Augenstein, M.J., 1981. Data Structures Using Pascal. Englewood Cliffs, N.J.: Prentice-Hall.
Tremblay, J. and Sorenson, P.G., 1984. An Introduction to Data Structures with Applications. New York, N.Y.: McGraw-Hill.
Ullman, J.D., 1982. Database Systems, Second Edition. Rockville, Md.: Computer Science Press.
Wiederhold, G., 1983. Database Design, Second Edition. New York, N.Y.: McGraw-Hill.
Wirth, N., 1976. Algorithms + Data Structures = Programs. Englewood Cliffs, N.J.: Prentice-Hall.
A collection of procedures, written in PPL, for initializing,
adding, and searching for records

-- Data types will be defined in a Pascal-like style.

-- RECORD : Anytype;
-- KeyData : record Key : Anytype; RecordPointer : integer end;
-- Map : array[1..MAXROW,1..MAXCOL] of record First, Last : integer end;
-- Cell : record First, Last : integer end;
-- Bucket : record
--            NumSlot, PreviousBucket, NextBucket : Integer;
--            Slots : array [1..BUCKETSIZE] of (same type as) KeyData
--          end;

PROCEDURE Initialize

OPEN "DATAFILE",1,"RANDOM"
OPEN "KEYFILE",2,"RANDOM"
NData = 0; NBucket =0;
Zero all elements of Map

END Initialize


PROCEDURE Add

Obtain RECORD
NData += 1; KeyData.RecordPointer = NData;
WRITE 1, RECORD
Row = Hash1(KeyData.Key); Column = Hash2(KeyData.Key)
Cell = Map[Row, Column]
IF Cell.First = 0
THEN -- create a new bucket
   [Add_to_New_Bucket]
   Map[Row, Column].First = NBucket; Map[Row,Column].Last = NBucket;
ELSE -- Locate last bucket
   READ 2, Cell.Last, Bucket
   IF Bucket.NumSlot = BUCKETSIZE -- Is the bucket full?
   THEN -- Create a new bucket and maintain linked list
      Bucket.NextBucket = NBucket + 1; WRITE 2, Cell.Last,Bucket
      [Add_to_New_Bucket]
      Map[Row,Column].Last = NBucket -- keep track of last bucket in list
   ELSE -- Add in the same bucket
      Bucket.NumSlot += 1
      Bucket.Slots[Bucket.NumSlot] = KeyData
      WRITE 2, Cell.Last, Bucket
   END IF;
END IF;

END Add


PROCEDURE Add_to_New_Bucket

NBucket += 1; Bucket.NumSlot = 1;
Bucket.PreviousBucket = Cell.Last -- pointer to previous bucket or zero if
                                  -- this is the first bucket in the list
Bucket.NextBucket = 0 -- zero indicates end of linked list.
Bucket.Slots[1] = KeyData
WRITE 2, NBucket, Bucket

END Add_to_New_Bucket


PROCEDURE Search;
-- Search through buckets.
Obtain SearchKey
Row = Hash1(SearchKey); Column = Hash2(SearchKey)
Cell = Map[Row,Column]
IF Cell.First = 0 -- sought data is definitely not on file
THEN
   DISPLAY "Data nonexistent"
ELSE
   INITIALIZE: FoundFlag = False; NextOne = Cell.First
   LOOP <BIG>
   BEGIN
     READ 2, NextOne, Bucket
     INITIALIZE: None
     LOOP <Look_in_a_Bucket>
     BEGIN FOR i = 1 TO Bucket.NumSlot
       IF Bucket.Slots[i].Key = SearchKey THEN FoundFlag = True; EXIT <BIG> END IF;
     END LOOP <Look_in_a_Bucket>;
     TERMINATE: NextOne = Bucket.NextBucket
     IF NextOne = 0 THEN EXIT <BIG> END IF;  -- when end of link is found
                                             -- exit loop
   END LOOP <BIG>;
   TERMINATE: IF FoundFlag
              THEN
                 DISPLAY "Data in record # "; Bucket.Slots[i].RecordPointer
                 READ 1, Bucket.Slots[i].RecordPointer, RECORD
                 -- Perform more data processing of your choice
              ELSE
                 DISPLAY "Data nonexistent"
              END IF;
END Search
Listing 1.

HTML customization by Werner Cirsovius
January 2015
© CL Publications