Friday, June 12, 2009

Developer / Blog

Developer / Blog: "When you want to split up your data two questions spring to mind: which property of the data (which column of the table) will I use to make the decisions on where the data should go? And what will the algorithm be? Let's call the first one the 'sharding/partitioning key', and the second one the 'sharding/partitioning scheme'.

Which sharding key to use depends on the nature of your application and on how you want to access your data. In the blog example, if you display overviews of blog messages per author, it's a good idea to shard on the author's $userID. If instead your site's navigation is through archives per month or per category, it might be smarter to shard on publication date or $categoryID. (If your application requires both approaches, it might even be a good idea to set up a dual system with sharding on both keys.)
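
To make the effect of the key choice concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the post records, the field names user_id and category_id (standing in for $userID and $categoryID), the shard count of 4, and the simple modulo placement used as a stand-in scheme. It only shows that the same posts end up grouped differently depending on which key you shard on.

```python
from collections import defaultdict

NUM_SHARDS = 4  # assumed shard count, purely illustrative


def shard_for(key_value: int, num_shards: int = NUM_SHARDS) -> int:
    """Placeholder scheme: map an integer key to a shard number via modulo."""
    return key_value % num_shards


def place(posts, key_field):
    """Group post ids by the shard their chosen key maps to."""
    shards = defaultdict(list)
    for post in posts:
        shards[shard_for(post[key_field])].append(post["post_id"])
    return dict(shards)


# Hypothetical blog posts; field names are assumptions, not from the original.
posts = [
    {"post_id": 1, "user_id": 7, "category_id": 2},
    {"post_id": 2, "user_id": 7, "category_id": 5},
    {"post_id": 3, "user_id": 8, "category_id": 2},
]

by_author = place(posts, "user_id")        # all posts of one author share a shard
by_category = place(posts, "category_id")  # all posts of one category share a shard
```

Sharding on user_id keeps posts 1 and 2 (same author) together, so a per-author overview touches one shard; sharding on category_id keeps posts 1 and 3 together instead. A dual system would maintain both placements.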

What you can do with the 'shard key' to find its corresponding shard basically falls into 4 categories:


Vertical Partitioning: Splitting up your data on feature/table level can be seen as a kind of sharding, where the 'shard key' is e.g. the table name. As mentioned earlier, this way of sharding is pretty straightforward to implement and has a relatively low impact on the application as a whole.
Range-based Partitioning: In range-based partitioning you split up your data according to several ranges. Blog posts from 2000 and earlier go to database 1; blog posts from the new millennium go to the other database. This approach is typical for logging or other time-based data. Other examples of range-based partitioning could include federating users according to the first digit of their postal code.
Key or Hash based Partitioning: The modulo-function used in the photos example is a way of partitioning your data base"
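
The schemes quoted above can be sketched as small routing functions. This is an illustrative Python sketch under stated assumptions, not the quoted author's implementation: the database names, the table map, the shard count of 4, and the use of CRC32 as the hash are all hypothetical choices made for the example.

```python
import zlib
from datetime import date

# Vertical partitioning: the table name itself acts as the shard key.
# The table and database names are made up for this example.
TABLE_TO_DB = {"users": "db_users", "photos": "db_photos"}


def vertical_shard(table: str) -> str:
    return TABLE_TO_DB[table]


# Range-based partitioning: posts from 2000 and earlier go to one database,
# later posts go to the other, following the example in the text.
def range_shard(published: date) -> str:
    return "db1" if published.year <= 2000 else "db2"


# Key-based partitioning with a plain modulo on a numeric id.
def modulo_shard(user_id: int, num_shards: int = 4) -> int:
    return user_id % num_shards


# Hash-based partitioning for non-numeric keys. CRC32 is an assumed choice;
# it is stable across runs, unlike Python's built-in hash() for strings.
def hash_shard(key: str, num_shards: int = 4) -> int:
    return zlib.crc32(key.encode("utf-8")) % num_shards
```

For instance, range_shard(date(1999, 5, 1)) routes to "db1" and modulo_shard(10) routes to shard 2. Note that with the modulo and hash variants, changing num_shards remaps most keys, which is one reason the number of shards is usually decided up front.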
