Partitioning, also called sharding, is the technique of splitting data across many database instances. Its main purpose is to reduce the load on any single node, so the developer must ensure that the data is distributed evenly in terms of access patterns.
Partitioning is usually combined with replication, so a node can be the leader for some partitions and a follower for others.
Partitioning Key-Value Data
Key-value data can be partitioned by key range or by a hash of the key.
With range partitioning, each node stores a contiguous range of keys: for example, keys from A to B on node 1, C to F on node 2, and so on. Note that the ranges do not have to hold the same number of keys, as the example shows. This approach answers range queries efficiently, but it is susceptible to hot spots when accesses concentrate on a narrow set of adjacent keys.
With hash partitioning, a hash function is applied to the key and the resulting hash decides which partition stores it. This spreads adjacent keys across partitions and so reduces hot spots among them, but it gives up efficient range queries, since keys that were adjacent are now scattered.
Hashing alleviates hot spots among similar keys, but not for a single heavily accessed key, because equal keys always hash to the same partition. For such keys another technique can be combined with it: a random value is concatenated with the key before hashing, so that writes to the same logical key are spread across partitions. The drawback is that reads for that key must query every variant, potentially on all nodes, and combine the results.
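As a minimal sketch of range partitioning, the lookup below maps a key to a partition with a binary search over sorted boundaries. The boundary values and the function names are assumptions chosen to mirror the A to B / C to F example, not part of any particular database.

```python
import bisect
from typing import List

# Hypothetical boundaries: partition 0 holds keys below "c",
# partition 1 holds keys from "c" up to (not including) "g",
# partition 2 holds everything from "g" onwards.
BOUNDARIES = ["c", "g"]

def partition_for(key: str) -> int:
    """Return the index of the partition whose key range contains `key`."""
    return bisect.bisect_right(BOUNDARIES, key.lower())

def partitions_for_range(start: str, end: str) -> List[int]:
    """Return the partitions a range query over [start, end] has to touch."""
    return list(range(partition_for(start), partition_for(end) + 1))
```

Because neighbouring keys live in the same or neighbouring partitions, a range query only contacts the few partitions that `partitions_for_range` returns. The same property is what concentrates load when many accesses hit adjacent keys.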
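A rough sketch of hash partitioning follows. It assumes a fixed, hypothetical number of partitions and uses MD5 only because it is stable across processes (Python's built-in `hash()` is randomized for strings); real systems pick their own hash function and usually assign ranges of hashes to partitions.

```python
import hashlib

NUM_PARTITIONS = 4  # hypothetical, fixed number of partitions

def stable_hash(key: str) -> int:
    """Hash a key to a 64-bit integer that is the same in every process."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def partition_for(key: str) -> int:
    """Pick a partition from the hash of the key, not from the key itself."""
    return stable_hash(key) % NUM_PARTITIONS
```

Keys such as `user:1` and `user:2` are adjacent in sort order but will generally land on unrelated partitions, which is why a range scan would have to ask every partition.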
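The idea of spreading one hot key can be sketched as follows. The separator `#`, the bucket count and both function names are illustrative assumptions; an application would also need to track which keys are hot, since salting every key makes all reads more expensive.

```python
import random
from typing import List

SALT_BUCKETS = 10  # hypothetical number of sub-keys per hot key

def salted_write_key(key: str) -> str:
    """Append a random suffix so writes to one hot key spread over several keys."""
    return f"{key}#{random.randrange(SALT_BUCKETS)}"

def salted_read_keys(key: str) -> List[str]:
    """A read must fetch every suffixed variant and merge the results."""
    return [f"{key}#{i}" for i in range(SALT_BUCKETS)]
```

Each salted key is then partitioned like any other key, so one write goes to a single partition while a read fans out to up to `SALT_BUCKETS` of them.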
Partitioning and Secondary Indexes
The approaches described above work well when data is fetched by its primary key. Querying by other fields, which do not uniquely identify a document, requires secondary indexes, and these add complexity in exchange for query efficiency.
One option is to partition the secondary index by document: each node keeps a local index covering only the documents it stores. Writes touch a single node, but a query for a given value must be sent to all nodes and the results combined.
Another option is a global index, itself partitioned across nodes by the indexed value, with changes propagated between nodes. The advantage is that all entries for a given value are stored on a single node, so a query only needs to contact that node. The drawback is that the index may not be up to date as soon as the data is stored.
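A small in-memory sketch of document-partitioned (local) indexes is shown below. The `Partition` class, the `store` helper, the modulo routing by integer document id, and the sample car data are all assumptions made for illustration.

```python
from collections import defaultdict

class Partition:
    """One node: stores documents plus an index over its own documents only."""

    def __init__(self):
        self.docs = {}                  # doc_id -> document (dict of hashable values)
        self.index = defaultdict(set)   # (field, value) -> ids of local documents

    def put(self, doc_id, doc):
        # Remove index entries of a previous version before indexing the new one.
        for field, value in self.docs.get(doc_id, {}).items():
            self.index[(field, value)].discard(doc_id)
        self.docs[doc_id] = doc
        for field, value in doc.items():
            self.index[(field, value)].add(doc_id)

    def find(self, field, value):
        return set(self.index.get((field, value), set()))

def store(partitions, doc_id, doc):
    """Route a write to exactly one partition, chosen by document id."""
    partitions[doc_id % len(partitions)].put(doc_id, doc)

def scatter_gather(partitions, field, value):
    """Ask every partition and merge the answers."""
    result = set()
    for partition in partitions:
        result |= partition.find(field, value)
    return result

partitions = [Partition(), Partition()]
store(partitions, 191, {"color": "red", "make": "honda"})
store(partitions, 214, {"color": "black", "make": "dodge"})
store(partitions, 306, {"color": "red", "make": "ford"})
print(sorted(scatter_gather(partitions, "color", "red")))  # [191, 306]
```

The write path stays cheap because only one partition is updated, while the read path pays for it by contacting every partition.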
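For contrast, here is a sketch of a global index that is partitioned by the indexed value. The names, the partition count and the use of MD5 are assumptions, and the index is updated synchronously for simplicity, whereas a real system would often apply these updates asynchronously.

```python
import hashlib
from collections import defaultdict

NUM_INDEX_PARTITIONS = 3  # hypothetical

# One dictionary per index partition: (field, value) -> ids of matching documents.
index_partitions = [defaultdict(set) for _ in range(NUM_INDEX_PARTITIONS)]

def index_partition_for(field, value) -> int:
    """All entries for the same (field, value) pair land on the same partition."""
    term = f"{field}:{value}".encode("utf-8")
    return int.from_bytes(hashlib.md5(term).digest()[:8], "big") % NUM_INDEX_PARTITIONS

def index_document(doc_id, doc):
    """A single write may update several index partitions, one per field."""
    for field, value in doc.items():
        index_partitions[index_partition_for(field, value)][(field, value)].add(doc_id)

def find(field, value):
    """A read contacts only the one partition that owns this (field, value)."""
    owner = index_partitions[index_partition_for(field, value)]
    return set(owner.get((field, value), set()))
```

The trade-off is the mirror image of the local index: reads go to one partition, but every write may fan out to several.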
There are several strategies for rebalancing partitions. All of them involve moving data between nodes, so rebalancing must be done carefully. One strategy is to divide the range of hashes into segments and use the segment a key falls into to decide which node its data belongs to. Another is to keep many partitions on each node; when the load increases, whole partitions can be moved to a new node.
Rebalancing can be done manually or automatically. The manual approach is often preferable, since moving large chunks of data around is expensive and should be done carefully.
For clients to find the correct node, a coordination service such as ZooKeeper may be used to track which node holds which partition. There are usually three ways of routing a request. The first is to connect to any node, which forwards the request to the correct node if it does not own the data itself. The second is to have a routing layer that is responsible for forwarding each request to the correct node. Finally, the partition assignment can be stored with the client, which then connects directly to the correct node.
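The strategy of keeping many partitions per node and moving whole partitions can be sketched as below. The `rebalance` function and the node names are hypothetical; the sketch only computes a new assignment and leaves the actual data transfer out.

```python
from collections import Counter

def rebalance(assignment: dict, new_node: str) -> dict:
    """Return a new partition -> node mapping after adding `new_node`.

    Whole partitions are taken from nodes holding more than their fair share
    until the new node has its share. Keys never change partition, so only
    the moved partitions have to be copied.
    """
    nodes = set(assignment.values()) | {new_node}
    fair_share = len(assignment) // len(nodes)
    result = dict(assignment)
    counts = Counter(result.values())
    for partition in sorted(result):
        if counts[new_node] >= fair_share:
            break
        owner = result[partition]
        if owner != new_node and counts[owner] > fair_share:
            result[partition] = new_node
            counts[owner] -= 1
            counts[new_node] += 1
    return result

# 12 partitions spread over 3 nodes, then a fourth node joins.
before = {p: f"node{p % 3}" for p in range(12)}
after = rebalance(before, "node3")
moved = [p for p in before if before[p] != after[p]]
print(moved)  # [0, 1, 2]
```

Only three of the twelve partitions move, one from each existing node. If the node were derived directly from the hash modulo the number of nodes, adding a node would instead relocate most keys.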
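Whichever of the three options is used, the component that decides needs a mapping from partitions to nodes. The sketch below uses a hypothetical `RoutingTable` held in memory; in practice the mapping would be kept in sync with a coordination service, which is not modelled here. It assumes partitions are numbered from 0 and that keys are assigned to partitions by hash.

```python
import hashlib

class RoutingTable:
    """Partition -> node mapping, usable by a node, a routing layer, or a client."""

    def __init__(self, assignment: dict):
        # Assumes partition ids are 0 .. n-1.
        self.assignment = dict(assignment)

    def update(self, partition: int, node: str) -> None:
        """Apply a change, for example after a partition was moved by rebalancing."""
        self.assignment[partition] = node

    def node_for(self, key: str) -> str:
        """Find the node that currently owns the partition of `key`."""
        digest = hashlib.md5(key.encode("utf-8")).digest()
        partition = int.from_bytes(digest[:8], "big") % len(self.assignment)
        return self.assignment[partition]
```

The three routing options differ only in where this table lives: on every node, in a dedicated routing layer, or inside the client.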