Data Partitioning Strategies — A Simple Partition by Key in CouchDB (Hands-on)
When enterprise systems own very large datasets, it is not a wise idea to store all of the data in a single partition: keeping everything in one place makes it costly to query the data efficiently while still providing high query throughput. Partitioning the data, i.e. slicing it into different buckets based on the application's needs and querying those buckets, is an efficient strategy that improves both the performance and the scalability of an application. When partitioning data we may also have to consider trade-offs such as rebalancing and auto-balancing partitions. In this article we will walk through a basic strategy to partition data.
Strategies to Partition Data:
- Partitioning by Key Range — partition data by ranges of the key. For example, keys could be classified by first letter into 2 buckets (26/2), but this does not guarantee that data is evenly distributed between the two partitions; the imbalance is often referred to as "skew", and a partition that is hit with a disproportionate share of the load is called a "hot spot".
- Partitioning by Hash of Key — because of the risk of skew and hot spots, most distributed datastores partition by applying a hash function (consistent hashing) to the key. Consistent hashing also requires minimal effort when it comes to rebalancing partitions, i.e. when a partition is added to or removed from the datastore.
- Partitioning by Secondary Index — often we pick more than one column/field to search through the data; these extra fields are defined as secondary indexes. Partitioning by a secondary index is also an effective way to partition large amounts of data (example: HBase).
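The hash-of-key idea above can be sketched in a few lines of shell. This is a toy illustration only: it uses md5 with a plain modulo, while real datastores use consistent hashing so that changing the bucket count moves only a small fraction of the keys.

```shell
# Assign each key to one of 4 buckets by hashing it (md5 here, purely
# for illustration). The same key always lands in the same bucket.
for key in sensor-260 sensor-261 sensor-262 sensor-263; do
  h=$(printf '%s' "$key" | md5sum | cut -c1-8)   # first 8 hex chars of the hash
  echo "$key -> bucket $(( 0x$h % 4 ))"          # hash modulo the bucket count
done
```

Because the bucket is derived from the key alone, any node can compute where a key lives without a lookup table; the downside of the plain modulo is that changing the bucket count reshuffles most keys, which is exactly what consistent hashing avoids.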
Let's get started with partitioning by key on a NoSQL database (CouchDB).
1. Install the latest version of CouchDB. I am using Windows, and the installation steps are here: https://docs.couchdb.org/en/stable/install/windows.html
2. Create a database with partitioning enabled. This can be achieved either with curl or from CouchDB's browser utilities.
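With curl this is a single request. A sketch follows; the database name `sensors`, the host `127.0.0.1:5984`, and the `admin:password` credentials are placeholders for your own setup.

```shell
# Create a database named "sensors" with partitioning enabled.
# Host and admin:password credentials are assumptions -- substitute your own.
curl -X PUT 'http://admin:password@127.0.0.1:5984/sensors?partitioned=true'
# CouchDB replies {"ok":true} on success.
```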
3. We have now created a database with partitioning enabled. CouchDB supports partitioning by key out of the box; all we have to do is create a document whose _id has the form partitionkey:documentid, i.e.
{
  "_id": "sensor-260:sensor-reading-ca33c748-2d2c-4ed1-8abf-1bca4d9d03cf",
  "sensor_id": "sensor-260",
  "location": [41.6171031, -93.7705674],
  "field_name": "Bob's Corn Field #5",
  "readings": [
    ["2019-01-21T00:00:00", 0.15],
    ["2019-01-21T06:00:00", 0.14],
    ["2019-01-21T12:00:00", 0.16],
    ["2019-01-21T18:00:00", 0.11]
  ]
}
The above document creates a partition with the key sensor-260 (the portion of the _id before the colon).
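To store this document from the command line, a sketch like the following works, assuming the partitioned database from the earlier step is named `sensors`, the credentials are placeholders, and the JSON above is saved as `doc.json`.

```shell
# POST the document; CouchDB places it in the partition named by the part
# of "_id" before the colon, i.e. "sensor-260".
curl -X POST 'http://admin:password@127.0.0.1:5984/sensors' \
     -H 'Content-Type: application/json' \
     -d @doc.json
```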
4. Let's create one more document with a different key, i.e. sensor-261:
{
  "_id": "sensor-261:sensor-reading-ca33c748-2d2c-4ed1-8abf-1bca4d9d03cf",
  "sensor_id": "sensor-261",
  "location": [41.6171031, -93.7705674],
  "field_name": "Rangesh Corn Field #5",
  "readings": [
    ["2019-01-21T00:00:00", 0.15],
    ["2019-01-21T06:00:00", 0.14],
    ["2019-01-21T12:00:00", 0.16],
    ["2019-01-21T18:00:00", 0.11]
  ]
}
We now have two documents, one placed in each partition. Let's query them by partition ID.
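CouchDB exposes partition-scoped endpoints for this. The sketch below again assumes a database named `sensors` with placeholder credentials.

```shell
# List every document in the sensor-260 partition.
curl 'http://admin:password@127.0.0.1:5984/sensors/_partition/sensor-260/_all_docs'

# A Mango query scoped to the same partition -- only that partition's
# data is consulted, which is what makes partitioned queries cheap.
curl -X POST 'http://admin:password@127.0.0.1:5984/sensors/_partition/sensor-260/_find' \
     -H 'Content-Type: application/json' \
     -d '{"selector": {"sensor_id": "sensor-260"}}'
```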
When we have a huge volume of data, partitioning plays a huge role: querying the data becomes cheaper and more meaningful with the help of partitions. Most SQL/NoSQL databases have their own partitioning strategies; it is all about configuration and how effectively we use them.