Thursday, June 12, 2014

Jaspersoft Ad Hoc Cache clear and build programmatically using Django and Selenium

Jaspersoft is amazing at what it does, but it really sucks when it doesn't work the way you want it to. I am referring to the Ad Hoc Cache behavior in Jaspersoft. Those familiar with it will know that the only thing we can set for the Query Cache is the TTL, which is fine, but the problem comes when one has to clear the cache on demand or programmatically (from a script or an ETL).

AFAIK, Jaspersoft uses Ehcache with Hibernate to save the results and the query. It has three problems.

  1. The query cache is lazy in nature, so someone has to visit the report before the cache warms up for that query.
  2. Every query has its own expiry time, depending on when it was first hit and on the TTL of the cache. This means that one report could show old data while another shows new data, depending on which one was cached when. This causes a lot of confusion for the end users.
  3. Another problem is invalidation of the cache. There is no standalone HTTP endpoint that a script can simply hit to clear the cache. One has to log in as an Admin and then click the "Clear Cache" button.
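To make the second problem concrete, here is a toy model of a lazy cache with a per-entry TTL. It is only an illustration and not Jaspersoft's actual implementation: the class name, the report names and the hour-based clock are all made up for the example.

```python
class LazyTTLCache:
    """Toy lazy cache: an entry is loaded on first access and expires ttl hours later."""

    def __init__(self, ttl, loader):
        self.ttl = ttl
        self.loader = loader
        self.entries = {}  # query -> (hour it was cached, cached value)

    def get(self, query, now):
        hit = self.entries.get(query)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]  # still within its own TTL window
        value = self.loader(query)  # lazy: loaded only when someone asks
        self.entries[query] = (now, value)
        return value


# Hypothetical warehouse whose contents change when the ETL runs.
warehouse = {"load": "monday"}
cache = LazyTTLCache(ttl=24, loader=lambda q: f"{q}@{warehouse['load']}")

cache.get("sales", now=0)     # first visit at hour 0
cache.get("traffic", now=20)  # first visit at hour 20
warehouse["load"] = "tuesday"  # the ETL finishes at hour 22

sales = cache.get("sales", now=25)      # expired, so it is reloaded
traffic = cache.get("traffic", now=25)  # not expired, so it is still the old load
print(sales, traffic)
```

At hour 25 the two reports disagree about which load they are showing, which is exactly the mix of old and new data described above.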

The above has a lot of limitations. In our case we run an ETL every day to load data into our Redshift DW. This ETL takes a couple of hours to run as it includes some aggregates too. Since the datastore for our application is a data warehouse, queries are not exactly superfast. What we wanted was a system which would invalidate the cache as soon as the ETL finishes and then build (warm up) the cache, so that the end user neither faces the initial slowness of the system nor is shown stale data.

Since the only option was to go through the browser, we tried to automate that. PhantomJS was our first choice as it runs on Linux and is headless, so it does not need a desktop browser to automate user behavior. This is important as all our production systems are Linux and devoid of any desktop environment.

But this did not work (at least for us). The security in Jaspersoft is really good and makes use of some hidden ExecutionKeys to ensure that the request is coming from a proper source. We sent all the headers a normal browser adds with our requests, but Jaspersoft did not hand out an execution key, and without it we could not hit the Clear Cache HTTP endpoint.

The only hope now was to use Selenium, which requires a browser and hence a machine with a desktop environment. We had to launch one Windows box (t1.micro) in our cloud just for this purpose.

The script first clears the cache and then rebuilds it. To launch this script we had to integrate it with a web framework which would accept our on-demand calls from the ETL and launch the process.
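The clear-then-rebuild flow could look roughly like the sketch below. This is not the code from our project, just a minimal outline of the idea. Every path and element locator in `JasperPages` is a placeholder that you would have to replace after inspecting your own Jaspersoft pages, and the function names are made up for this example. The driver calls used (`get`, `find_element`, `send_keys`, `click`, `implicitly_wait`, `quit`) are standard Selenium WebDriver methods.

```python
from dataclasses import dataclass


@dataclass
class JasperPages:
    """Where things live on the server. All defaults are placeholders, not verified values."""
    base_url: str
    login_path: str = "/login.html"
    cache_path: str = "/path/to/ad-hoc-cache-page"       # placeholder
    username_locator: tuple = ("id", "j_username")        # assumed element id
    password_locator: tuple = ("id", "j_password")        # assumed element id
    login_button_locator: tuple = ("id", "submitButton")  # assumed element id
    clear_button_locator: tuple = ("id", "clearCache")    # placeholder


def clear_and_warm_cache(driver, pages, username, password, report_urls):
    """Log in as admin, click Clear Cache, then open each report to cache it again."""
    driver.get(pages.base_url + pages.login_path)
    driver.find_element(*pages.username_locator).send_keys(username)
    driver.find_element(*pages.password_locator).send_keys(password)
    driver.find_element(*pages.login_button_locator).click()

    # Step 1: invalidate everything through the admin page.
    driver.get(pages.base_url + pages.cache_path)
    driver.find_element(*pages.clear_button_locator).click()

    # Step 2: warm up. Opening a report makes the server run and cache its query.
    # A real script would also wait until each report has finished rendering.
    visited = []
    for url in report_urls:
        driver.get(url)
        visited.append(url)
    return visited


def run_with_firefox(pages, username, password, report_urls):
    """Run the flow in a real browser; needs Selenium and a desktop environment."""
    from selenium import webdriver  # imported here so the flow above can be tested without it

    driver = webdriver.Firefox()
    driver.implicitly_wait(30)  # seconds to keep retrying element lookups
    try:
        return clear_and_warm_cache(driver, pages, username, password, report_urls)
    finally:
        driver.quit()
```

Keeping the flow in a function that takes the driver as an argument means it can be dry-run against a fake driver before it is pointed at the real server.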

We have used Django as the web framework to achieve this, and the source code is available here. It has both Windows and Linux ports. To run the Linux port you need a desktop environment running on that machine.
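For readers who just want the idea, the Django side can be as small as one view that starts the refresh in the background and returns immediately, so that the ETL is not blocked while the browser automation runs. The sketch below is an assumption about how one might write it, not a copy of the project code: `start_refresh`, `run_selenium_refresh` and `refresh_cache_view` are names invented for this example.

```python
import threading

_lock = threading.Lock()
_worker = None  # the refresh thread currently running, if any


def start_refresh(job):
    """Start job() in a background thread unless a refresh is already running.

    Returns the new thread, or None if a previous refresh is still in progress.
    """
    global _worker
    with _lock:
        if _worker is not None and _worker.is_alive():
            return None
        _worker = threading.Thread(target=job, daemon=True)
        _worker.start()
        return _worker


def run_selenium_refresh():
    """Hypothetical hook: call the Selenium clear-and-warm script from here."""
    pass


def refresh_cache_view(request):
    """Django view the ETL calls once its load has finished."""
    # Imported here so the helper above can be exercised without Django installed.
    from django.http import JsonResponse

    if request.method != "POST":
        return JsonResponse({"error": "POST required"}, status=405)
    thread = start_refresh(run_selenium_refresh)
    if thread is None:
        return JsonResponse({"started": False}, status=409)
    return JsonResponse({"started": True}, status=202)
```

The view would be wired into `urls.py` with `path(...)` like any other view, and the last step of the ETL would POST to that URL. Since the caller is a script and not a browser form, the view would probably also need `csrf_exempt` or some token check of its own. The lock makes sure that two ETL calls arriving close together do not launch two browsers at once.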

Project is available here for download.