We were assigned the task of building a real-time analytics API. The data to be analyzed was simple counters for various events that happen across the website: for example, how many times a particular listing appeared in search results, how many times the listing page was viewed, how many times Click to Reveal was pressed, etc.
The problem is simple when you have a small, fixed set of listings. But in our case we had around a million listings, each listing had around 5 events, and we had to keep data for each listing for a year.
The first approach that comes to mind is to use a NoSQL database like MongoDB. To be honest, MongoDB just for maintaining counters seemed like overkill. Other reasons for not using MongoDB were:
- Write throughput would be a cause for concern, given MongoDB's collection-level lock.
- No support for transactions. This becomes very important for an analytics API, as we cannot afford to update one counter and then fail to update another.
- No support for atomic operations on a document, so there was no way to synchronize the counter increments and decrements.
- No native support for incrementing/decrementing counters.
The features that tipped the scale in Redis' favor were:
- It's crazy fast, as the whole DB lives in memory.
- Transactions are supported.
- Native support for atomic operations.
- Native support for incrementing/decrementing counters (see the short sketch below).
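As a taste of that counter support, here is a minimal sketch with the redis-py client; the key names are made up for illustration and a local instance is assumed:

    import redis

    r = redis.Redis(host='localhost', port=6379)

    # Each increment is a single atomic, server-side command, so there is no
    # read-modify-write race between app servers.
    r.incr('listing:42:page_views')                 # +1
    r.incrby('listing:42:search_impressions', 3)    # +3 in one command
    r.decr('listing:42:page_views')                 # counters can be decremented too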
So we implemented our own solution, and here are some of my learnings from it.
1) Design for failure: you cannot afford to have an analytics API down. So have a file-backed fallback for when the datastore (Redis) is down, and write a script that replays these logs once your datastore is back up. Also keep multiple app servers behind a load balancer, which lets you do hot releases without bringing the site down.
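A minimal sketch of such a fallback, assuming the redis-py client, a JSON-lines log file, and hypothetical helper names (record_event, replay_log):

    import json
    import redis

    r = redis.Redis(host='localhost', port=6379)
    FALLBACK_LOG = '/var/log/analytics/fallback.log'   # assumed path

    def record_event(listing_id, event):
        # Try Redis first; if it is down, append the event to the local log.
        try:
            r.hincrby('l:%s' % listing_id, event, 1)
        except redis.exceptions.ConnectionError:
            with open(FALLBACK_LOG, 'a') as f:
                f.write(json.dumps({'id': listing_id, 'e': event}) + '\n')

    def replay_log(path=FALLBACK_LOG):
        # Run this once Redis is back up to apply everything that was queued on disk.
        with open(path) as f:
            for line in f:
                entry = json.loads(line)
                r.hincrby('l:%s' % entry['id'], entry['e'], 1)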
2) All non-counter data is written to a file. We did not know what to do with this additional data yet, so we just persisted it to a file and pushed it periodically to AWS S3.
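A sketch of that flow, assuming JSON lines on disk and a periodic job that ships the file with boto3 (the bucket name and paths are placeholders):

    import json
    import time
    import boto3

    RAW_LOG = '/var/log/analytics/raw_events.log'   # assumed path
    BUCKET = 'my-analytics-archive'                  # placeholder bucket name

    def persist_raw_event(payload):
        # We don't know what we'll do with this data yet, so just keep it.
        with open(RAW_LOG, 'a') as f:
            f.write(json.dumps(payload) + '\n')

    def push_to_s3():
        # Run periodically (e.g. from cron) to archive the raw log.
        boto3.client('s3').upload_file(RAW_LOG, BUCKET, 'raw/%d.log' % int(time.time()))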
3) Enable AOF for Redis with a flush frequency of 1 second. This is a must if you don't want to lose data on a Redis reboot :). The AOF file should also be pushed to S3 periodically for higher durability.
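The two settings involved are appendonly and appendfsync (in redis.conf: appendonly yes, appendfsync everysec). A small sketch that applies them at runtime via redis-py, if you prefer CONFIG SET over editing the config file:

    import redis

    r = redis.Redis(host='localhost', port=6379)

    # Equivalent to "appendonly yes" and "appendfsync everysec" in redis.conf.
    r.config_set('appendonly', 'yes')
    r.config_set('appendfsync', 'everysec')   # fsync the AOF roughly once per second

    print(r.config_get('appendfsync'))        # sanity check that the setting took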
4) Take RDB snapshots at least twice a day and push those to S3.
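One way to drive this is a twice-daily cron job that triggers a background save and ships dump.rdb to S3; a sketch under those assumptions (paths and bucket are placeholders):

    import time
    import boto3
    import redis

    r = redis.Redis(host='localhost', port=6379)
    RDB_PATH = '/var/lib/redis/dump.rdb'   # default location; adjust for your install
    BUCKET = 'my-analytics-archive'        # placeholder bucket name

    def snapshot_to_s3():
        # Trigger a background RDB save, wait for it to finish, then archive it.
        before = r.lastsave()
        r.bgsave()
        while r.lastsave() == before:      # LASTSAVE only changes once BGSAVE completes
            time.sleep(1)
        boto3.client('s3').upload_file(RDB_PATH, BUCKET, 'rdb/%d.rdb' % int(time.time()))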
5) When the data size grows to a few GBs, Redis takes a few minutes (yes, minutes) to start, as it replays the whole AOF file on startup. The AOF file is nothing but a list of all the write commands (DMLs) that have happened, and Redis replays them serially. Until the AOF has been replayed completely, Redis returns an error for every request. This is another reason why you need the file-backed fallback.
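With redis-py this startup phase typically surfaces as a BusyLoadingError (Redis replies with a LOADING error to most commands until the AOF is replayed), so a replay or health-check script can simply wait it out; a sketch, assuming that behaviour:

    import time
    import redis

    r = redis.Redis(host='localhost', port=6379)

    def wait_until_loaded(poll_seconds=5):
        # Block until Redis has finished replaying its AOF.
        while True:
            try:
                r.ping()
                return
            except redis.exceptions.BusyLoadingError:
                time.sleep(poll_seconds)   # still replaying, keep waiting
            except redis.exceptions.ConnectionError:
                time.sleep(poll_seconds)   # not even accepting connections yet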
6) Use Redis hashes, as they are superbly optimized for space (at the cost of somewhat higher time complexity, but this trade-off doesn't hurt much).
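The layout this suggests is one hash per listing with one field per event, so all five counters for a listing share a single key; small hashes get a compact internal encoding (tunable via hash-max-ziplist-entries / hash-max-ziplist-value). A sketch with illustrative key and field names:

    import redis

    r = redis.Redis(host='localhost', port=6379)

    # One hash per listing, one short field per event type.
    r.hincrby('l:42', 'srch', 1)    # appeared in search results
    r.hincrby('l:42', 'view', 1)    # listing page viewed
    r.hincrby('l:42', 'ctr', 1)     # Click to Reveal pressed

    # All counters for a listing come back with a single HGETALL.
    print(r.hgetall('l:42'))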
7) Redis is single-threaded, so adding more CPUs is of no use. Throw as much memory at it as you can.
8) No Redis client library supports transactions when you run Redis Cluster (they actually cannot), as it is not possible to have a transaction spanning multiple nodes. Redis Cluster works much the same way as memcached: data is sharded across nodes based on key hashes.
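On a single instance the MULTI/EXEC route works fine; a sketch of that, with the cluster caveat noted in the comments (key names are illustrative):

    import redis

    r = redis.Redis(host='localhost', port=6379)

    # On a single instance, two counters can be updated atomically with MULTI/EXEC.
    pipe = r.pipeline(transaction=True)
    pipe.hincrby('l:42', 'view', 1)
    pipe.hincrby('totals', 'view', 1)
    pipe.execute()

    # On Redis Cluster this only works if every key in the transaction hashes to
    # the same slot (e.g. via hash tags such as 'l:{42}' and 'totals:{42}');
    # a transaction spanning keys that live on different nodes is not possible.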
9) Needless to say, use the smallest possible key and value names; even a single extra "_" repeated across millions of keys adds up, in our case to a couple of hundred MBs of memory footprint.
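For example, a compact naming scheme like the one below (the abbreviations are just an illustration) keeps per-key overhead down compared to fully spelled-out names:

    # Verbose:  listing:123456:click_to_reveal   ->   compact:  l:123456:ctr
    def counter_key(listing_id, event_abbrev):
        # Build the shortest key that is still unambiguous.
        return 'l:%s:%s' % (listing_id, event_abbrev)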
The request flow of our application is something like this:
Hope this helps someone to make a decision about using Redis for analytics.