Thursday, March 13, 2014

Core Java : Java hashCode()

In the Java programming language, every class implicitly or explicitly provides a hashCode() method, which digests the data stored in an instance of the class into a single hash value (a 32-bit signed integer). This hash is used by other code when storing or manipulating the instance – the values are intended to be evenly distributed for varied inputs in order to use in clustering.
Technically, in Java, hashCode() by default is a native method, meaning, it has the modifier 'native', as it is implemented directly in the native code in the JVM.
e.g. Hashing is designed to solve the problem of needing to efficiently find or store an item in a collection. For example, if we have a list of 10,000 words of English and we want to check if a given word is in the list, it would be inefficient to successively compare the word with all 10,000 items until we find a match. Hashing is a technique to make things more efficient by effectively narrowing down the search at the outset.
 

There are some restrictions placed on the behavior of equals() and hashCode(), which are enumerated in the documentation for Object. In particular, the equals() method must exhibit the following properties:

    Symmetry: For two references, a and b, a.equals(b) if and only if b.equals(a)
    Reflexivity: For all non-null references, a.equals(a)
    Transitivity: If a.equals(b) and b.equals(c), then a.equals(c)
    Consistency with hashCode(): Two equal objects must have the same hashCode() value


What is hashing?
Hashing means using some function or algorithm to map object data to some representative integer value. This so-called hash code (or simply hash) can then be used as a way to narrow down our search when looking for the item in the map

How hashing works

Purely as an example to help us grasp the concept, let's suppose that we want to map a list of string keys to string values (for example, map a list of countries to their capital cities). So let's say we want to store the data in Table 1 in the map.
KeyValue
CubaHavana
EnglandLondon
FranceParis
SpainMadrid
SwitzerlandBerne
Table 1: Example data to put in a map
And let's suppose that our hash function is to simply take the length of the string. For simplicity, we'll have two arrays: one for our keys and one for the values. So to put an item in the hash table, we compute its hash code (in this case, simply count the number of characters), then put the key and value in the arrays at the corresponding index. For example, Cuba has a hash code (length) of 4. So we store Cuba in the 4th position in the keys array, and Havana in the 4th index of the values array etc. And we end up with the following:
Position
(hash code = key length)
Keys arrayValues array
1
2
3
4CubaHavana
5SpainMadrid
6FranceParis
7EnglandLondon
8
9
10
11SwitzerlandBerne
Now, in this specific example things work quite well. Our array needs to be big enough to accommodate the longest string, but in this case that's only 11 slots. And we do waste a bit of space because, for example, there's no 1-letter keys in our data, nor keys between 8 and 10 letters. But in this case, the waste isn't so bad either. And taking the length of a string is nice and fast, so so is the process of finding the value associated with a given key (certainly faster than doing up to five string comparisons)1.
We can also easily see that this method wouldn't work for storing arbitrary strings. If one of our string keys was a thousand characters in length but the rest were small, we'd waste the majority of the space in the arrays. More seriously, this model can't deal with collisions: that is, what to do when there is more than one key with the same hash code (in this case, one than more string of a given length). For example, if our keys were random words of English, taking the string length would be fairly useless. Granted, the word psuedoantidisestablishmentarianistically would probably get its own place in the array. But on the other hand, we'd be left with thousands of, say, 6-letter words all competing for the same slot in the array. 

Introduction to hashing and hash maps
   
    On the previous page, we introduced the notion of hashing, mapping a piece of data such as a string to some kind of a representative integer value. We can then create a map by using this hash as an index into an array of key/value pairs. Such a structure is generally called a hash table or, particularly in Java parlance, hash map1. We saw that using the string length to create the hash, and indexing a simple array, could work in some restricted cases, but is no good generally: for example, we have the problem of collisions (several keys with the same length) and wasted space if a few keys are vastly larger than the majority.



Buckets
Now, we can solve the problem of collisions by having an array of (references to) linked lists2 rather than simply an array of keys/values. Each little list is generally called a bucket.
Then, we can solve the problem of having an array that is too large simply by taking the hash code modulo a certain array size3. So for example, if the array were 32 positions in size, going from 0-31, then rather than storing a key/value pair in the list at position 33, we store it at position (33 mod 32) = 1. (In simple terms, we "wrap round" when we reach the end of the array.) So we end up with a structure something like this: 

Graphical Representation

Each node in the linked lists stores a pairing of a key with a value. Now, to look for the mapping for, say, Ireland, we first compute this key's hash code (in this case, the string length, 7). Then we start traversing the linked list at position 7 in the table. We traverse each node in the list, comparing the key stored in that node with Ireland. When we find a match, we return the value from the pair stored in that node (Dublin). In our example here, we find it on the second comparison. So although we have to do some comparisons, if the list at a given position in the table is fairly short, we'll still reduce significantly the amount of work we need to do to find a given key/value mapping.
The structure we have just illustrated is essentially the one used by Java's hash maps and hash sets. However, we generally wouldn't want to use the string length as the hash code. In the next sections, we'll explore how to generate more adequate hash codes. 

Improving our hash function 
In our simple but impractical example, we took the length of the string as the hash function. If we solve the problem of collisions by having a linked list in each bucket, then taking the string length as the hash function will theoretically work. But it has an obvious problem if, say, we want our keys to be, say, 100,000 words of English. In this case, our linked lists at array position 6, as well as those corresponding to other common string lengths, will still have thousands of entries, while higher numbers will have practically none. Sure, it's better to scan a list of 10,000 entries than a list of 100,000 when looking for the key. But really, our hash function of taking the string length isn't adequate. We need to choose a better hash function that ideally has these properties:

  • It can distribute the keys more or less evenly over the range of positions in the array. We want to avoid the situation where position 6 has a list of 10,000 entries and position 13 has a list of only 20 entries.
  • It can distribute the keys over a larger range of values. If there are 100,000 words in the map and a perfectly-distributed hash function, then having 50 positions in the array would still give an average list size of 2,000 items. But if we could have 5,000 positions, there'd be just 20 items in each list. And if we had in the region of 100,000 positions then we'd on average make just one or two comparisons1 to find the mapping we were looking for.
 In practice, we combine these problems and define our task as coming up with a hash function that distributes hash codes evenly over the entire range of possible integers (in Java, that means a 4-byte number, or one with about 4 billion possible values)2. Then we assume that whatever the size of the array in our hash map (whatever the modulus), the mappings will be distributed more or less evenly.

Writing an adequate hash function for strings

for a given key X, the hash function must always generate the same hash code Y. But the point is that Y should essentially be a "random number across the possible range of hash codes". For typical keys, we don't want our hash codes to tend to "clump" within a certain range, or in binary terms, for certain bits of the hash code to tend to be always 0 or always 1. On the other hand, we also require our hash function to be fairly efficient. In practice, "reasonably random but quick to compute" is the tradeoff we want. For example, we'll see below that so long as a fair proportion of the bits of the hash code are random, then this randomness can be "ironed out" across all of the bits, and for most cases that's good enough.

Let's have a look at how the Java String class actually computes its hash codes. The algorithm goes something like his:

1
2
3
4
5
6
7
public int hashCode() {  
  int hash = 0;  
   for (int i = 0; i < length(); i++) {  
    hash = hash * 31 + charAt(i);  
   }  
   return hash;  
  }  

At this point, there are a couple of different routes you can take:
    If you just want to take a pragmatic approach and use Strings as keys to your map data without worrying about any more of the ins and outs of hashing, then you can start using the HashMap class immediately;
    If you want to use objects of your own classes as a key but without worrying too much more about the technical details of hash functions, then you can see some practical hash function guidelines and a guide to overriding the hashCde() and equals() methods, which is effectively what you need to do to "plug in" your hash function and make your class work as a hash map key.
    You may wish to understand more of the technical details of how hash functions work and what the basis is behind the guidelines mentioned.





No comments: