Data Structures - Hash Tables

Why Hash Tables? ✨

Hash Tables are probably one of the most fundamental data structures you should have a good grasp on. Not only is it an extremely important topic for tech interviews, but it has tons of use cases in real-world software development.

And actually, it's a surprisingly simple concept. 👀

How do they work? 🤷‍♂️

Let's rewind to arrays, particularly to a specific property of arrays.

Arrays are powerful in the fact that we can lookup any value in the array given we have its index in constant time, or O(1) time for those acquainted with the Big O notation.

int arr[10] = {14, 4, 6, 3, 88, 13, 8, 12, 990, 10};
cout << arr[4] << endl; // Will output 88
cout << arr[9] << endl; // Will output 10
cout << arr[0] << endl; // Will output 14

In other words, the index acts like a key to the value of the array in that position. The index we place within the [] operator, such as 4, 9, and 0 act like keys to access the values 88, 10, and 14 respectively. You could even say that the indices implicitly are a map to the array values.

Hash tables extend the idea of this key/value association, where instead of the numeric indexes we're familiar with, our key can be virtually anything.

Now, the Hash Table keys, unlike the array keys, are mapped onto their values in a different manner, and this where we distinguish ourselves from arrays.

Keys undergo a process called hashing, a function which converts our key into an index value.

unordered_map<string, int>StudentMarks;
StudentMarks.insert({"Robert", 98});

Our key here, Robert, will undergo a hashing process, which would uniquely map the string "Robert" to the integer value marks 98, such that at any point of time we could lookup Robert's marks with StudentMarks["Robert"] in constant time (best case scenario).

This hashing process could be as simple and ineffective as the following piece of code:

int hash(const string& key)
{
   int hash_val{};
   for (int i = 0; i < key.size(); i++)
       {
           hash_val = (hash_val + key[i]) % tableSize;
       }
   return hash_val;
}

Of course, unordered_map definitely doesn't use this implementation to hash strings.

The point here is that through the hash function we're simply creating an index value from our arbitrary key value, allowing us to uniquely map the key to our value, analogous to how array indices map to array values.

Collision Resolution 🤦‍♂️

Well, you're not always going to have a unique hash value from a key. Consider our hash function from above, where we're simply accumulating the remainder of the ascii value of each character divided by the size of the Hash Table.

So, let's consider the name "Borter". which is just an anagram of Robert, which will yield the same hash value.

StudentMarks.insert({"Borter", 75}); // Same Hash Value because ASCII

It's likely to different keys hashing to the same value, which is also called collision, which is quite an apt term.

Except the fact that we can definitely come up with a better hashing function, there are ways to resolve this, one of them being separate chaining.

In this method, every index in a hash table is the head of a linked list, another data structure which I hope you're familiar with.

Now, every time we have a collision, we can simply add a node to the index in question and add our value there.

Now, suppose we wanted to look up the marks "Borter" got on his test. How would our hashtable know which value to grab?

StudentMarks["Borter"] // Will return 75, value for key "Borter:

Hashing "Borter" will give us an index which points to a linked list.
We'll iterate over the linked list, comparing our key "Borter" to the key in the current node of the linked list
When we get a match, we'll simply return the value, 75

This is where we kind of diverge from the idea of hash maps acting as arrays in the back end, where every [key/index] retrieval is O(1). If we have collisions taking place, we might be iterating over a linked list, which would be the notorious O(n).

There's quite a lot of other ways to avoid collisions, which also involve better hashing algorithms but that's a topic for another day.

Thanks for reading this post 👋