Sunday, January 22, 2012

Importance of equals and hashCode method in java

Hi,

   We all know that there are two methods in the "Object" class of java:

    a.  public boolean equals(Object obj) : Determines if the current (this) object and passed object (obj) is same/equal  to each other, more precisely from the business front of view, whether "this" and obj represent same data or NOT even if they both may be two different physical object in the memory.

    b. public int hashCode() : Determines the hash value of the current object to be used for direct access mechanisms like in Set, Map etc.

     There are many good explanations on internet which describes how Map and Set operations are performed, but I still see many (even experienced people) miss to understand it. They know when to use Set, Map etc. They also know that data access with this classes are fast, but overlook the reason behind it. Like how data is retrieved fast etc.There are some thumb rules to be followed while overriding both equals and hashCode methods. Please read them here.

        To me, one of the reason why this is happening is that, most people (including me as well) generally end up putting instances of String, Integer, Float, Double etc. in Set or as Key in Map. And all the magic to make sure that access will be fast is already provided by that respective class. How??, No prize to guess those two methods ;) (equals and hashCode).

    But you will start facing issues once started using your custom class to be added in Set or used as Key in the Map. Please continue reading to understand it in details.

 
1. Adding Data in Map:

Lets say we have following code:

HashMap<CustomKey, User> hm = new HasMap<CustomKey, User>();
hm.put(ck1, user1);


What actions are performed by HashMap's Put method internally? Assuming ck1 is NOT NULL

1. It calls hashCode method of ck1 and determines the bucket where it can drop the data (user)

2. If Bucket is empty, its good and simply put the KEY + Data in the bucket.

3. But, it might happen that, there is already some Key in the map (effectively pointing to same bucket) with the same hashCode, in that case:

     a. Apply equals method of current object (ck1) with all Key objects in that bucket until a match is found (ie. equals method returns true) OR all KEY objects in that bucket checked and confirmed to be NOT equal to ck1. This is the reason why people say insertion is slow in Map, Set, as it has to do some computation before really dumping the data.

     b. If a match is found for the KEY (ck1), means he has to replace existing "User" object with User object passed with this method call as parameter. It is  as good as passing the same String valued object as KEY again to the HashMap object with a new value in second parameter of the put function.

     c. If match is NOT found, then its a new KEY with same hashCode, and will again put the KEY+Data in the bucket.

2. Getting Data back:
      Lets assume, we have following code:

HashMap<CustomKey, User> hm = ....;
User user1 = hm.get(ck1);


  What actions are performed when get is called on Map (or contains on Set):
     a. Call "hashCode" on ck1 to find bucket where it can be residing. Because hasCode on (physically) same and also two object which returns "true" when compared will always return same value no matter how many times we call it, it will refer to same bucket where Key would have been stored while putting using put method.
     b. If Bucket is empty, its sure that KEY ck1 doesn't exists in the Map (Set) and it will return null.
     c. If Bucket is NOT  empty,
           i. Compare ck1 with all KEY objects, until a equal KEY object is found OR all Key object are compared (even if only one Key is there in the Bucket) in bucket to find exact KEY object whose Data has to be returned.  Don't forget, we have to do this because to different (un-equal) objects can still have SAME hashCode :) But still, this is the reason why people say retrieval is FAST in Map, Set, as it has to seach in only a subset of the whole data in the Map/Set in ideal conditions to reach to the data. Imagine otherwise you will have compare ck1 with many more KEY objects others wise.
     d. If No KEY is found which is equal to ck1, return null.

 3.  So, Where is the issue???
      Some of you might have already got answer for above question by now, but for those who still scratching their head or pulling their hair, don't worry ;) , I will try my level best to explain it.

     Answer/Problem lies in the Default behavior of equals and hashCode methods. When we declared our own Key class, KeyClass, it is extending Object class inherently. So the implementation of equals and hashCode is already there with this KeyClass as well. But its a very rigid implementation as Object class doesn't know "what business logic you have to decide whether two instances of same class are logically same to you from the business perspective. Also what are the attributes in your class that you can use to make sure that hashCode method returns proper hashCode" following all the guidelines to be followed for hashCode  method.
   For these reasons, the default implementation of equals and hashCode methods are as followed in the Object class:
    a. equals: If both object refs are to physically same location in the memory its assumed to be equal.
    b. hashcode: Basically the default implementation of hashCode() provided by Object is derived by mapping the memory address to an integer value. If look into the source of Object class , you will find the following code for the hashCode.

public native int hashCode();

It indicates that hashCode is the native implementation which provides the memory address to a certain extent.

  So if you make hashCode method to return hard-coded value (say 1) for all object of the class KeyClass, then all the KEY+Data will be kept in the same bucket in the Map's put operation. Effectively forcing java to to do sequential search on all the Key objects in the Map while retrieving the data back. If you by mistake apply logic to return some random hashCode even for the same object with multiple call to hashCode method, you might endup with inability to retrieve the data back from the Map as it might point to different bucket while getting data back.

 Similarly, if you don't implement equals correctly, you might endupto situation where you can't retireve the data back from the Map/Set.

  So, if you add Key + Value object in the Map/Set and then to retrieve the data back from Map, you will need physically same instance of KEY, else code will NOT return data even if the KEY object pass while retrieving is logically same from the business perspective!!!!

  One of the practical example of this issue I have faced myself is while using Grails with Hibernate Caching ON. Using Query Caching, with dynamic finders, makes use of equals object. Inherently its mechanism by which Secondary cache works. Details are provided here.  (http://jira.grails.org/browse/GRAILS-5893)
  It was a big shame to me that it took me almost a year to figure out issue even though I knew these concept of hashing, equals methods, hibernate secondary cache behavior etc. even before facing issue. Its all about facing the issue practically and knowing theory and then relating them to each other to reach to the solution.