How can I properly store data relationships with Microsoft Azure Table Storage?


Coming from a relational world, things are obviously very different with Azure Table storage. The first major thing I've run into is how to properly store many-to-many relationships.

For example, I may have a system that keeps track of Users and the Books that they own. I found another post here on SO that suggested having a String property on the User that stores a list of the IDs of the Books that User owns. While I understand that this is sometimes an accepted way to store data, the problem is that Azure only allows you to store 64KB of data in a String property. That definitely puts a limit on how many Books a User could potentially own.

Another possible solution is to duplicate data. I may have a table that stores all known Books in the system, but when a User needs to be associated with a Book, I copy the Book data into a different table called OwnedBooks, which is essentially identical to the Book table except that it also has an OwnedByUserID property.

Are there other possible solutions?

Besides this issue, does anyone have any good suggestions for other patterns and practices when using Azure table storage?

There are a number of solutions to this - all with drawbacks of course :-)

  1. Use a simple mapping table as you would in an RDBMS. Each row would contain a Book key and a User key.

    Then, to find all Books for a User, you'd select the Book keys in the mapping table, and then for each of those keys, select the Book entity from the Books table. You'd be able to do the Book retrievals in parallel using async fetching, but even so, this solution obviously doesn't scale.

  2. Use a mapping table as above, but include all Book data you need in the mapping table too. This is the denormalised, or "duplicated data" solution you've already proposed with your OwnedBooks table.

    The main drawback of this method is that if you need to update any of the Book data, you will potentially be updating many entities - and because they live in a separate table from the Book itself, the update can't be completed in a single transaction/batch (and I'd imagine you'd use the User identity as the Partition key in the mapping table anyway, which already precludes a single batch update in that table).

  3. Store the Book keys joined in a single property of the User. Again, you've suggested this method already.

    This actually wouldn't be so bad if it weren't for the fact that Azure doesn't currently support "contains"-type queries - i.e., you can't search on a substring - so if you ever wanted to find out which Users owned a particular Book, this would be impossible. Interestingly, Google App Engine supports this fairly transparently in its storage system - and will index the list for you too. In any case, you'd still need to retrieve each Book's data with this method too.

  4. Use the "schemaless" nature of the Azure table storage to store associated Book keys as individual properties. Eg, one User entity may look like this:

    { Name: "User1", Book_4325: true, Book_5123: true }

    While another may look like this:

    { Name: "User2", Book_5346: true, Book_8753: true, Book_6135: true }

    Then if you did want to find all Users that own a particular Book you can select where that particular property is true (well, it just needs to exist really).

    The obvious drawbacks of this are that it's a little brittle, you need to fiddle with keys in property names, and you wouldn't be able to use the standard methods of StorageClient for this - you'd have to roll your own. Also, Azure only supports 255 properties on an entity (including the system ones, so 252 of your own). All that said, I think it would scale quite well - although I've never tried it.
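To make option 4 a little more concrete, here's a minimal sketch in Python using plain dicts to stand in for table entities (the helper names and Book IDs are made up for illustration - a real implementation would go through the Table storage REST API with open-typed entities):

```python
# Sketch of option 4: each owned Book becomes a dynamic Book_<id> property
# on the User entity. Plain dicts stand in for Azure table entities here.

def add_book(user_entity, book_id):
    """Mark a Book as owned by setting a Book_<id> property on the User."""
    user_entity["Book_%s" % book_id] = True

def users_owning(entities, book_id):
    """Return the Names of all Users whose entity carries the Book_<id>
    property - the equivalent of filtering on that property server-side."""
    prop = "Book_%s" % book_id
    return [e["Name"] for e in entities if e.get(prop)]

user1 = {"Name": "User1"}
user2 = {"Name": "User2"}
add_book(user1, 4325)
add_book(user1, 5123)
add_book(user2, 5123)

print(users_owning([user1, user2], 5123))  # ['User1', 'User2']
print(users_owning([user1, user2], 4325))  # ['User1']
```

The point of the sketch is just the shape of the data: ownership lives in the property *names*, so the "who owns Book X?" query becomes a filter on property existence rather than a substring search.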

Out of all these options, I'd say the one you were going to go with, option 2, would be the best, just for the fact that it's currently supported by Azure and you can typically achieve everything with fewer queries.

You'd just need to scrutinise your use cases to decide how and when the data should be updated, given that atomic transactions are out the window. I can almost guarantee that you'd be able to live with things being "eventually consistent" and just account for the fact that your mapping table may not always be 100% up to date.
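As an illustration of what that fan-out update looks like with option 2, here's a hedged Python sketch - dicts stand in for table entities, and the property names are invented for the example:

```python
# Sketch of the denormalised update in option 2: when a Book's title changes,
# every copy in the OwnedBooks mapping entities must be rewritten separately.
# In real Table storage each mapping entity would live under
# PartitionKey=UserId, RowKey=BookId, so no single batch covers them all.

books = {"4325": {"Title": "Old Title"}}

owned_books = [
    {"UserId": "User1", "BookId": "4325", "Title": "Old Title"},
    {"UserId": "User2", "BookId": "4325", "Title": "Old Title"},
]

def rename_book(book_id, new_title):
    # Update the primary Book entity first...
    books[book_id]["Title"] = new_title
    # ...then fan the change out to every mapping entity. Each of these is
    # an independent write; a reader in between sees inconsistent copies.
    for entity in owned_books:
        if entity["BookId"] == book_id:
            entity["Title"] = new_title

rename_book("4325", "New Title")
```

The window between the first write and the last is exactly where the "eventually consistent" caveat bites.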

If it becomes too expensive to update the data in the mapping table at the same time as the primary table, you could put a message on a queue and get a worker role to perform the updates for you asynchronously.
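A minimal sketch of that queue-based approach, using Python's in-process `queue` to stand in for an Azure queue and a simple function to stand in for the worker role (all names here are illustrative):

```python
import queue

# Stand-ins: an in-process queue for the Azure queue, a list of dicts for
# the OwnedBooks mapping table.
update_queue = queue.Queue()
owned_books = [
    {"UserId": "User1", "BookId": "4325", "Title": "Old Title"},
    {"UserId": "User2", "BookId": "4325", "Title": "Old Title"},
]

def enqueue_update(book_id, new_title):
    """Web role side: record the change cheaply and return immediately."""
    update_queue.put({"BookId": book_id, "Title": new_title})

def worker_drain():
    """Worker role side: apply queued updates to the mapping table."""
    while not update_queue.empty():
        msg = update_queue.get()
        for entity in owned_books:
            if entity["BookId"] == msg["BookId"]:
                entity["Title"] = msg["Title"]

enqueue_update("4325", "New Title")
# Until the worker runs, the mapping table is stale - that's the
# "eventually consistent" trade-off made explicit.
worker_drain()
```

The queue decouples the cheap write path from the expensive fan-out, and a real Azure queue also gives you retries if the worker dies mid-update.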