Monday, April 14, 2014

Jaspersoft with AWS Redshift Experince, Learnings and Problems

This was my first experience with AWS Redshift and Jaspersoft. Both the technologies are good and easy to use but can sometimes throw up issues which are difficult to decode/fix . Here are some of the things that I faced/discovered using both.


  1. Has a decent performance when the queries that you are making do not have joins in them. For eg if there is one table which has 1 Bn rows and another which has just 50k rows and you do a join on a column which is not a sort key, then the query may take anywhere between 5-10min which for me is a no go for a dashboard (a user would not wait for 5 mins for a chart to load). This is because Redshift is very disk intensive and the caching of blocks in memory in Redshift is still not up to the mark so it has to read a lot of blocks on disk for every query.
  2. Always try to denormalize data as much as possible. Joins are a crime. It will suck the life out of you and the query.
  3. Concurrency is not upto the mark in redshift. The performance goes down exponentially with every new parallel execution.
  4. A LIKE match would almost always make the query slow (no surprise here as even an OLTP DB would do the same). But the degradation is significant.
  5. We have a query which when run makes the whole cluster unresponsive :) . We still dont know the reason but it happens. Again no clue why.
  6. We used to run VACCUM everyday after importing data into the cluster, but on numerous occasion we saw that the whole cluster would hang because VACCUM stalled on one of the nodes. One has to do a cluster reboot to fix this. Also the command does not time out so it may hang your cluster (READ/WRITE) for almost a day if not rebooted. So we moved this command to just once a week so that we do not hit this problem that often.
  7. Always make use of the SORT key in your queries. It works like a charm.
  8. Build aggregates for the charts as time taken by queries is not predictable when run on huge tables. We build aggregates for all the charts and dont run queries on big tables at run time.
It also has its fair share of issues but most of them are due to Redshift :). The following is for HTML 5 reports which we built using the web wizard and NOT through report designer.

  1. CREATE VIEW may throw an error or may not load at all when redshift is having performance issues or there is a lot of load on redshift, still dont know the reason why but it happens.
  2. If you change the chart type in VIEW , it may not reflect in the REPORT. Recreate the report.
  3. There is no support for having a single axis chart with multiple measures with tooltip showing all the measures together. I found a work around and is present here
  4. There is no API to clear the Query Cache. I tried a lot of command line tools like phantomjs to script it but it did not work as every page has an executionKey and jaspersoft was smart enough to know that it was a scripted attempt :( . I had to write a selenium test case to do the same. As the cache is in memory and NOT DB I guess one would have to modify the JAVA code. It uses EH Cache for caching the queries.
  5. One Good thing about jaspersoft is that even if you leave a report without it completely loading the query still works in the background and populates the cache. This helped me in automating cache warming by simulating user clicks (selenium) without waiting for the page to load.
  6. There is no eager/ preemptive caching of queries and is on demand. So one has to write a selenium test case to warm up the cache. This is a much needed thing for redshift which is not good with concurrent queries.
Hope this will help somebody.

No comments:

Post a Comment