This is a general design problem: I want to validate a username field for uniqueness when the user enters the value and tabs out. I do an Ajax validation and get a response from the server. This is all very standard. Now, what if I have a HUGE user database? How should I handle this situation? I want to find out whether a username such as "foozbarz" is present among 150 million usernames.
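For concreteness, the server side of that Ajax check might look like the minimal Flask sketch below. Flask, the route name, and the toy is_taken stand-in are all my assumptions; the whole question is what goes inside is_taken:

```python
# Minimal sketch of the endpoint behind the Ajax call.
# Assumptions: Flask, a /check-username route, and lowercase
# normalization; is_taken() is the part this question is about.
from flask import Flask, jsonify, request

app = Flask(__name__)

def is_taken(name):
    # Toy stand-in; the question is how to implement this lookup
    # efficiently against 150M usernames (see the options below).
    return name in {"foozbarz"}

@app.route("/check-username")
def check_username():
    name = request.args.get("u", "").strip().lower()
    return jsonify(available=bool(name) and not is_taken(name))
```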
- Database queries are out of the question. [EDIT] Instead, read the username database once and populate a cache/hash for faster lookups (to clarify Emil Vikström's point).
- In-memory databases won't help either.
- Keep an in-memory hash (or cache/memcache) to store all usernames - usernames can be hashed easily and lookups will be very fast. But there are some problems with this (see the sketch below the list):
  a. the size of the hash - can we optimize it to reduce the memory footprint?
  b. the hash/cache refresh frequency (users might get added while we are validating)
- Shard the username table based on some criterion (e.g. names starting with A-B in table username_1, and so on) - thanks piotrek for this suggestion.
Or is there any other, better approach?
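Here is a minimal sketch of the in-memory hash option and its refresh problem, to make points a and b concrete. The names (fetch_all, refresh_loop) and the interval are placeholders, and it assumes the full list fits in RAM (150M names at ~10 bytes each is roughly 1.5 GB raw; a Python set of strings needs several times that, which is exactly problem a):

```python
# Sketch of the in-memory hash option. Assumption: fetch_all is a
# hypothetical callable that streams every username from the database.
import threading
import time

usernames = set()           # in-memory hash of all usernames
lock = threading.Lock()

def load_all_usernames(fetch_all):
    """Populate the set at startup (and again on each refresh)."""
    global usernames
    fresh = {name.lower() for name in fetch_all()}
    with lock:
        usernames = fresh   # swap in the fresh set atomically

def is_taken(name):
    """O(1) average-case membership test for the Ajax handler."""
    with lock:
        return name.lower() in usernames

def refresh_loop(fetch_all, interval_sec=60):
    """Problem b above: names added since the last load are invisible
    until the next refresh, so the final INSERT must still be guarded
    by a unique index as the real source of truth."""
    while True:
        time.sleep(interval_sec)
        load_all_usernames(fetch_all)
```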
Why don't you simply partition the data? If you have (or plan to have) 150M+ users, I assume you have (or will have) the budget for this. If you are just starting out (with 2k users), do it the traditional way with a simple indexed search on the database. When you have so many users that you observe performance issues, and you have measured that the database (and not, e.g., the web server) is the cause, then you simply add another database: on the first one you keep users with names from a to m, and the rest on the other. You may choose another criterion, such as a hash, to keep the data balanced. When you need more capacity, you add more databases. But if you don't have that many users right now, I advise you not to do any premature optimization; there are many other things that may become a bottleneck with this amount of data.
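As one concrete illustration of the hash criterion mentioned above, the routing could look like this. The shard count, table layout, and shard_connections (a hypothetical list of DB-API connections using the %s paramstyle) are all my assumptions, not part of the answer:

```python
# Sketch of hash-based routing across N database shards.
# Assumption: each shard has a users table with an index on username.
import hashlib

NUM_SHARDS = 4  # assumption; grows as more databases are added

def shard_for(username):
    """Stable hash -> shard index: the same name always lands on the
    same database, and names spread roughly evenly across shards."""
    digest = hashlib.md5(username.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

def is_taken(username, shard_connections):
    """Route the uniqueness check to the one shard that can own this
    name; that shard only does an ordinary indexed lookup."""
    conn = shard_connections[shard_for(username)]
    cur = conn.cursor()
    cur.execute("SELECT 1 FROM users WHERE username = %s LIMIT 1",
                (username.lower(),))
    return cur.fetchone() is not None
```

One trade-off worth noting: with a plain modulo, adding a shard remaps most names, whereas the alphabetic ranges from the answer (a-m / n-z) can be split in place.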