System design basics (Part 7) — #Hashing

Thomas Varghese
5 min read · Jan 6, 2022

Hashing, in its purest sense, is the process of converting an arbitrary piece of data into a fixed-size value (typically an integer).

The arbitrary piece of data in the context of system design can be an IP address, a username, an HTTP request, etc.

Now, as we saw in the prior article on proxies and load balancers, efficient systems need to distribute incoming client requests to servers which are best suited to serve these requests — and load balancers are proxies placed to take that call, of routing the request to the right server.

So, how is this routing, or server selection done?

Here is where we can introduce the concept of hashing functions; hashing functions take in a specific data type (such as a string or an identifier) and output a number. Different inputs may produce the same output, but a good hashing function aims to minimize such hashing collisions, which is essential to maximize what we call uniformity.
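As a minimal sketch of such a function, we can map a string to a fixed-size integer using SHA-256 (an assumed choice here; any hash with good uniformity works):

```python
import hashlib

def hash_to_int(key: str) -> int:
    """Map an arbitrary string to a fixed-size integer using SHA-256."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the digest as an unsigned integer,
    # giving a value in the range [0, 2**64).
    return int.from_bytes(digest[:8], "big")
```

The same input always yields the same output, while different inputs are spread roughly uniformly over the output range.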

As a simple example, say we have 4 clients, a load balancer and 4 servers.

The basic approach would be to pass each of the client names through a hashing function (which outputs an integer for each client); the server is then selected by taking the output modulo the number of servers.

So — say the client names are C1, C2, C3 and C4; the hashing function gives us an output of 11, 12, 13 and 14.

Then, the servers are selected in the following way:

Requests from C1 = 11%4 = 3; which means these requests will go to server 4.

Requests from C2 = 12%4 = 0; which means these requests will go to server 1.

Requests from C3 = 13%4 = 1; which means these requests will go to server 2.

Requests from C4 = 14%4 = 2; which means these requests will go to server 3.

Note that we have mapped the modulus outputs with the order of servers available in the system.
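The modulo-based selection above can be sketched as follows (the hash values 11–14 are the hypothetical outputs from the example, not a real hash function):

```python
def select_server(hash_value: int, num_servers: int) -> int:
    # Server indices are 0-based; index 0 corresponds to "server 1".
    return hash_value % num_servers

# Hypothetical hash outputs for clients C1..C4, as in the example above.
client_hashes = {"C1": 11, "C2": 12, "C3": 13, "C4": 14}

for client, h in client_hashes.items():
    idx = select_server(h, 4)
    # C1 -> server 4, C2 -> server 1, C3 -> server 2, C4 -> server 3
    print(f"{client} -> server {idx + 1}")
```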

In addition to this, caching can be used once we have a client-server mapping, to serve responses to requests faster. As an example, server 4 would typically cache responses for C1, so that repeated requests can be served from an in-memory cache.

What are some drawbacks of this approach?

  1. An existing server can fail, in which case the client-server mapping would become misaligned
  2. We may add or remove servers based on incoming traffic patterns, which again would disturb the client-server mapping obtained with the approach above

We can run the hashing function again, but this would produce a completely new set of client-server mappings, invalidating the caches built up in the prior example.
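This drawback is easy to demonstrate: resizing from 4 to 5 servers remaps every client in the example, even though only one server was added.

```python
def select_server(hash_value: int, num_servers: int) -> int:
    return hash_value % num_servers

# Same hypothetical hash outputs as in the earlier example.
client_hashes = {"C1": 11, "C2": 12, "C3": 13, "C4": 14}

before = {c: select_server(h, 4) for c, h in client_hashes.items()}
after = {c: select_server(h, 5) for c, h in client_hashes.items()}  # one server added

# Clients whose server assignment changed after the resize.
moved = [c for c in client_hashes if before[c] != after[c]]
print(moved)  # all four clients move
```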

Here is where a couple of slightly more complex hashing strategies come in:

  • Consistent hashing: this is a type of hashing which aims to minimize the number of keys that need to be remapped when a hash table is resized. It is often used by load balancers to distribute traffic to servers, and it minimizes the number of requests that get forwarded to different servers when new servers are added or existing servers are brought down.

Let's take an example: typically, the starting point is to pass all clients and servers in the system through a hashing function and position them on a circle (a hash ring), where C1, C2, C3, and C4 are clients and A, B, C, and D are servers.

The server selection strategy is: moving clockwise from the client's position, pick the nearest server. The client-server mappings follow directly from this rule.

Now, when we remove a server (say C), most of the client-server mappings remain consistent; only the clients that were mapped to C move to the next server clockwise.

Similarly, when adding a new server, we see that there is minimal disruption to the system.

This approach can be extended by passing the servers through multiple hashing functions, thereby positioning each one at multiple points on the same circle (often called virtual nodes). This builds more resilience into the system and spreads load more evenly, since each server's keys are scattered around the ring rather than concentrated in one arc.
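A minimal hash ring with virtual nodes might look like the sketch below (the replica count and SHA-256 positioning are assumptions for illustration):

```python
import bisect
import hashlib

def ring_position(key: str) -> int:
    # Position on the ring: an integer in [0, 2**32).
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

class ConsistentHashRing:
    def __init__(self, servers, replicas=3):
        # Each server gets `replicas` virtual positions on the ring.
        self._ring = []
        for server in servers:
            for i in range(replicas):
                self._ring.append((ring_position(f"{server}#{i}"), server))
        self._ring.sort()

    def server_for(self, client_key: str) -> str:
        pos = ring_position(client_key)
        positions = [p for p, _ in self._ring]
        # Moving clockwise, pick the first server at or after the client,
        # wrapping around to the start of the ring if needed.
        idx = bisect.bisect_right(positions, pos) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["A", "B", "C", "D"])
before = {c: ring.server_for(c) for c in ["C1", "C2", "C3", "C4"]}

# Remove server C: only clients that were mapped to C should move.
ring_without_c = ConsistentHashRing(["A", "B", "D"])
after = {c: ring_without_c.server_for(c) for c in ["C1", "C2", "C3", "C4"]}
```

Note the key property: removing a server leaves every client that was mapped to a different server untouched.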

  • Rendezvous hashing: also known as highest random weight (HRW) hashing, this approach likewise allows minimal redistribution of mappings when a server goes down. The idea is to compute a score for every (client, server) pair, ranking all the servers for each client (say, in this case, the clients are a set of usernames); each client is then mapped to its highest-ranking server. When a server goes down or is added, only the clients for which that server ranked highest need to be remapped, each moving to its next highest-ranked server.
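A sketch of highest-random-weight selection, using SHA-256 over the (server, client) pair as the score (an assumed scoring function for illustration):

```python
import hashlib

def weight(server: str, client: str) -> int:
    # Score for this (server, client) pair; higher wins.
    digest = hashlib.sha256(f"{server}:{client}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def pick_server(client: str, servers) -> str:
    # Rendezvous hashing: each client goes to its highest-weighted server.
    return max(servers, key=lambda s: weight(s, client))

servers = ["A", "B", "C", "D"]
clients = ["user1", "user2", "user3"]
before = {c: pick_server(c, servers) for c in clients}

# Remove the server that user1 was mapped to; only clients whose top-ranked
# server was removed fall through to their next highest-ranked server.
remaining = [s for s in servers if s != before["user1"]]
after = {c: pick_server(c, remaining) for c in clients}
```

As with consistent hashing, clients whose top-ranked server survives keep their original mapping.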

Hashing and its use in system design is fascinating; it is an integral part of large-scale systems, ensuring performance and optimal routing of incoming requests.


Thomas Varghese

I help build tech products and optimize outcomes with data; hobbyist musician and video creator.