Wednesday, December 3, 2014

Migrating a single-node Cassandra to multi-node on AWS (EC2) with DataStax Enterprise Edition and OpsCenter

Migration of a single-node Cassandra deployment to an HA cluster:


As an example we will migrate a single-node Cassandra cluster to a 4-node DataStax Enterprise Cassandra cluster. More often than not we start with Cassandra on a single node, and when the time comes to scale we move it to a cluster for HA. We also like to have a monitoring layer on top of it, OpsCenter, which is a GUI tool from DataStax to manage one or more Cassandra clusters.


The first thing we are going to do is launch a new Ubuntu 14.04 VM in AWS. This machine will act as a template for the other machines to be launched in the cluster. Once all the desired applications are installed on this machine, we will create an AMI from it.


Once a plain Ubuntu 14.04 instance has been launched, follow these steps to create your first Cassandra node:
  1. Install Python 2.6+.
  2. Now install DataStax Cassandra.
    1. echo "deb http://username:password@debian.datastax.com/enterprise stable main" | sudo tee -a /etc/apt/sources.list.d/datastax.sources.list where username and password are the DataStax account credentials from your registration confirmation email. You need to register to be able to download. Registration is free.
    2. curl -L https://debian.datastax.com/debian/repo_key | sudo apt-key add - . Note: If you have trouble adding the key, use http instead of https.
    3. sudo apt-get update
    4. sudo apt-get install dse-full (Installs only DataStax Enterprise and the DataStax Agent.)


We now have a DataStax Cassandra node as well as the DataStax Agent installed on this machine. The agent is required by OpsCenter to monitor each Cassandra node remotely.


Now we need to copy all the Cassandra files to the new machine. You can either attach an empty EBS volume to the new machine and copy all the files from the old machine onto it, or simply detach the EBS volume from the old machine and attach it to the new one. For the rest of this post, assume all the Cassandra files end up in /data/cassandra. This EBS volume should have good I/O throughput; use a Provisioned IOPS volume if possible. The instance should have at least 8 GB of memory and at least 4 CPU cores. If you go the copy route, a rough set of commands is shown below.
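
For example, the preparation and copy could look roughly like this (the device name /dev/xvdf, the mount point and the old machine's data directory are assumptions; adjust them to your setup, and stop Cassandra on the old machine before copying):

sudo mkfs -t ext4 /dev/xvdf        # format the newly attached EBS volume
sudo mkdir -p /data
sudo mount /dev/xvdf /data         # mount it at /data
sudo rsync -avz ubuntu@<old-machine-ip>:<old-data-dir>/ /data/cassandra/
sudo chown -R cassandra:cassandra /data/cassandra   # DSE runs as the cassandra user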


Now we need to set up the new machine to use the Cassandra data at its new location. Do the following on the new node:
  1. sudo service dse stop
  2. Edit /etc/dse/cassandra/cassandra.yaml and configure the following properties (a consolidated snippet of the edited file is shown right after this list):
    1. cluster_name: 'cluster1' . If you want to change the cluster name, put the new name here. One more step is required to make this effective, which is covered in step 9 below.
    2. num_tokens: 256
    3. data_file_directories:  - /data/cassandra
    4. commitlog_directory: /data/cassandra/commitlog
    5. saved_caches_directory: /data/cassandra/saved_caches
    6. endpoint_snitch: Ec2Snitch # or the desired one
    7. - seeds: "x.x.x.x" . Set this to the private IP address of the current (seed) machine. Seeds is a list of servers that a node contacts at bootstrap to learn the metadata about the cluster; it is used only at startup. Since this is the first machine, we use its own IP as the seed.
    8. listen_address: y.y.y.y . Set this to the private IP address of the current node.
    9. rpc_address: y.y.y.y . Set this to the private IP address of the current node.
    10. auto_bootstrap: false
  3. Edit /etc/dse/dse.yaml and configure the following property:
    1. delegated_snitch: org.apache.cassandra.locator.Ec2Snitch
  4. Now we need to set up the data center name. Edit cassandra-rackdc.properties and make the following change:
    1. dc_suffix=cassandra_dc1 . Every node which is part of the same data center should have the same suffix. If you want to create more than one data center within a cluster, give each one a different suffix, such as cassandra_dc2, and so on.
  5. sudo service datastax-agent start
  6. sudo service dse start
  7. Now check the status of the node with sudo nodetool status. Note that it will take 1-2 minutes to start and may throw an exception initially. Once it is up, it should show only one node in the list, with state UN.
  8. Verify that your tables are intact: cqlsh <private ip> -u <username> -p <password> (the default username/password is cassandra/cassandra).
  9. If you need to update the cluster name then make sure that you have done Step 2a first and then do the following:
    1. cqlsh> UPDATE system.local SET cluster_name = 'cluster1' where key='local';
    2. sudo nodetool flush
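
For reference, here is roughly what the edited portion of /etc/dse/cassandra/cassandra.yaml looks like when put together, using the placeholder IPs from the steps above and /data/cassandra as the data location (the seed_provider block is the stock SimpleSeedProvider section that already contains the seeds line):

cluster_name: 'cluster1'
num_tokens: 256
data_file_directories:
    - /data/cassandra
commitlog_directory: /data/cassandra/commitlog
saved_caches_directory: /data/cassandra/saved_caches
endpoint_snitch: Ec2Snitch
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "x.x.x.x"
listen_address: y.y.y.y
rpc_address: y.y.y.y
auto_bootstrap: false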


Now we have the single-node Cassandra cluster ready. Take an AMI of the above machine, and make sure that everything under /data/cassandra is removed on the machines launched from it so that each new node starts clean. Make sure that all the machines launched in this tutorial share the same Security Group and that all traffic within that Security Group is open. This is very important; otherwise the nodes will not be able to communicate with each other.


Launch more machines from the AMI taken from the first machine and follow the above steps, changing only the following:
  1. - seeds: "x.x.x.x" . This should be the private IP of the first machine.
  2. auto_bootstrap: true


When you launch each node and start all the services, run nodetool status. Each node's initial status will be UJ (up and joining), which changes to UN (up and normal) once the node has completely joined the cluster. In this example we end up with a 4-node Cassandra cluster.


Now we need to change the replication factor of our cluster to 3. Log in to any of the boxes and do the following:
  1. Get the datacenter name. cqlsh> use system;select data_center from local;
  2. cqlsh> ALTER KEYSPACE <keyspace_name> WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', '<datacenter_name>' : 3 } ;


Very important: after this we need to run nodetool -h <private_ip> repair on every node of the cluster. In our experience this can take days, so run the command under nohup (see the example below). The cluster can be used without any issues even while this is running. It took us almost 20 days to repair all the nodes.
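
For example, on each node (the log file path is just a suggestion):

nohup nodetool -h <private_ip> repair > /tmp/nodetool-repair.log 2>&1 &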


Now we need to set up OpsCenter to monitor the cluster. Launch a new machine (even an m1.small will do) and install OpsCenter on it: set up the DataStax apt repository on this machine as described above and run sudo apt-get install opscenter. This machine should have a public IP. Make sure you run this server in the same Security Group as the Cassandra cluster, or open all traffic between the Security Groups if it runs in a different one.


Once OpsCenter is installed, execute sudo service opscenter start.


OpsCenter can be accessed via https://<IP>/opscenter.


To add the cluster to OpsCenter, do the following:
  1. New Cluster  -> Manage Existing Cluster.
  2. Enter the private IP of any one of the nodes. We are assuming that OpsCenter and the nodes are all in the same VPC and that communication is open between them.
  3. It may ask for the .pem private key for the node. Provide the key.
  4. Done. In a few minutes it will show the cluster status and all the cluster metrics.


Our cluster with 4 nodes and a replication factor of 3 is now ready, along with OpsCenter to monitor it.

Friday, July 4, 2014

Difference between NAT vs PROXY vs ROUTER

To understand the subtle and not-so-subtle differences between the three, one needs to know the OSI model and where in that model each of them operates.

Here is a brief overview of the OSI model. Each layer has its own unit of data: the Application, Presentation and Session layers deal with data, the Transport layer with segments, the Network layer with packets, the Data Link layer with frames, and the Physical layer with bits. So the smallest data block at the Network layer is a packet, and the smallest data block at the Transport layer is a segment.

Protocols like TCP and UDP function at the Transport layer, whereas IP and routing work at the Network layer.

Now coming to what operates in which layer:

Router

See the diagram below:


There are two machines, each on a different network, connected through a router. If machine A wants to communicate with machine B, it needs to create a TCP connection. When the connection is created, both machines A and B are unaware of the presence of a router in between. What this means is that on both machines, if we look up the IP:PORT combination of the connection, it is 10.0.0.1:1234 - 11.0.0.1:4567, i.e. the router is transparent.

In this case the router does not participate at the Transport (TCP) layer and just acts as a relay for packets. All it knows is where to send each packet. It does not modify the packets and does not require the response to come back through it. In fact, if there is another router somewhere between these networks, which is often the case when two machines connect over the internet, the response may come back through some other route. The router works at the Network layer and is unaware of the Transport layer (TCP) protocol.

NAT (Network Address Translator)


A NAT is nothing but an intelligent router. IPv4 addresses are in short supply, and these days even something as dumb as a fridge has an IP. It is also a fact that most devices are consumers and not producers, i.e. they do not need a publicly resolvable IP. For example, the workstations inside a building do not all need a public IP assigned to them.

This is where NAT comes into play. What a NAT does is hide the machines behind it from the internet. All those machines go through the NAT to access the outside world.

 See the image below:

Machine A still thinks that it is talking to machine B directly (10.0.0.1:1234 - 11.0.0.1:4567). What the NAT does is replace the source IP:PORT header of the packet it received from A with its own IP (w.x.y.z) and its own random port (7897). When the packet reaches machine B, B thinks it came from the NAT's IP.

Since a response always goes back to the source, machine B sends the response packet to the NAT, at the same port that was written into the header (7897). When the NAT received the packet from A, it assigned it a random port (7897) and kept the mapping in a table called the NAT table. So when the response comes back from machine B, the NAT just does a reverse lookup in the same table and forwards the packet to the intended recipient (machine A). This way more than one machine can access the internet through the NAT.

One important point to note is that at each layer of the OSI model there is a checksum to determine whether the packet/segment is valid. The same applies here: if the NAT changes the source IP:PORT combination, it needs to recalculate the checksums at both the Transport and the Network layer. This is additional work for the NAT. A minimal sketch of a NAT table follows.
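
Conceptually, the NAT table is just a two-way mapping between the NAT's externally visible port and the internal (IP, port) it stands in for. The Python sketch below is purely illustrative (the addresses, ports and function names are made up) and ignores checksum recalculation and TCP state:

# Illustrative NAT table: maps the NAT's external port to the internal (ip, port)
nat_table = {}
next_port = 7897

def translate_outgoing(src_ip, src_port):
    """Rewrite an outgoing packet's source to the NAT's IP and a fresh port."""
    global next_port
    external_port = next_port
    next_port += 1
    nat_table[external_port] = (src_ip, src_port)
    return ("w.x.y.z", external_port)        # new source as seen by machine B

def translate_incoming(dst_port):
    """Reverse lookup: find the internal machine a response belongs to."""
    return nat_table.get(dst_port)

# Machine A (10.0.0.1:1234) sends a packet out through the NAT
print(translate_outgoing("10.0.0.1", 1234))  # ('w.x.y.z', 7897)
# Machine B replies to w.x.y.z:7897; the NAT forwards it back to A
print(translate_incoming(7897))              # ('10.0.0.1', 1234)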

Proxy:  

A proxy works at the Transport layer (and above) and is aware of the protocol. It is not transparent in nature: it actually creates two connections, one each with the source and the destination. Machine A does not even know about machine B. For machine A, the proxy is the only thing it is talking to, and A does not care how or where the proxy gets its data.

See the image below:

In the above picture Server A does not even know the IP of Server B. All it knows is the IP and port of the proxy server. The proxy server creates two connections, one with Server A and one with Server B. This happens at the Transport layer. Similarly, Server B does not even know the IP of Server A; for it, the proxy is the source.

Examples of proxy servers are load balancers like HAProxy, Nginx, Apache, AWS ELB and F5 BIG-IP. They hide the backend from the outside world and do a lot of nifty stuff like load balancing, optimization, etc. A minimal sketch of such a proxy is shown below.
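
To make the "two connections" point concrete, here is a minimal TCP proxy sketch in Python (illustrative only: no error handling, and the listen and backend addresses are placeholders):

import socket
import threading

LISTEN_ADDR = ("0.0.0.0", 8080)        # where clients (Server A) connect
BACKEND_ADDR = ("192.168.1.10", 80)    # the real backend (Server B), placeholder

def pipe(src, dst):
    """Copy bytes from one socket to the other until the connection closes."""
    while True:
        data = src.recv(4096)
        if not data:
            break
        dst.sendall(data)
    dst.close()

def handle(client):
    # The proxy opens its own, separate connection to the backend:
    # the client only ever talks to the proxy, and the backend only sees the proxy.
    backend = socket.create_connection(BACKEND_ADDR)
    threading.Thread(target=pipe, args=(client, backend)).start()
    threading.Thread(target=pipe, args=(backend, client)).start()

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(LISTEN_ADDR)
server.listen(5)
while True:
    conn, _ = server.accept()
    handle(conn)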

The above is a very high-level distinction between the three, and in practice the lines have blurred; "proxy", for one, is loosely used even for a NAT and vice versa. This is just meant to give you a starting point to learn more.

Thursday, June 12, 2014

Jaspersoft Ad Hoc Cache clear and build programmatically using Django and Selenium

Jaspersoft is amazing in what it does, but really sucks when it doesn't work the way you want it to. I am referring to the Ad Hoc cache behavior in Jaspersoft. Those familiar with it will know that all we can set for the query cache is the TTL, which is fine, but the problem comes when one has to clear the cache on demand or programmatically (from a script or an ETL).

AFAIK, Jaspersoft uses Ehcache with Hibernate to save the queries and their results. This approach has a few problems.

  1. The query cache is lazy in nature, so one has to visit the report for the cache to warm up for that query.
  2. Every query has its own expiry time depending on when it was hit and what the cache duration is. This means that one report could show old data and another could show new data, depending on which one was cached when. This causes a lot of confusion for the end users.
  3. Another problem is invalidation of the cache. There is no HTTP endpoint one can hit to clear the cache. One has to log in as an Admin and click the "Clear Cache" button.

The above has a lot of limitations. In our case we run an ETL every day to load data into our Redshift data warehouse. This ETL takes a couple of hours to run, as it includes some aggregates too. Since the datastore for our application is a data warehouse, queries are not exactly super fast. What we wanted was a system which would invalidate the cache as soon as the ETL finishes and then build (warm up) the cache, so that the end user neither faces the initial slowness of the system nor is shown stale data.

Since the only option was to go through the browser, we tried to automate exactly that. PhantomJS was our first choice, as it runs headless on Linux and does not need a real browser or desktop environment to automate user behavior. This is important, as all our production systems are Linux and devoid of any desktop environment.

But this did not work (at least for us). The security in Jaspersoft is really good and makes use of hidden execution keys to ensure that the request is coming from a proper source. We tried all the headers a normal browser adds to our requests, but Jaspersoft did not give out an execution key, without which we could not hit the Clear Cache HTTP endpoint.

The only hope now was to use Selenium, which requires a browser and hence a machine with a desktop environment. We had to launch one Windows box (t1.micro) in our cloud just for this purpose.

The script first clears the cache and then rebuilds it. To launch this script we had to integrate it with a web framework which would accept our on-demand calls from the ETL and launch the process.

We used Django as the web framework for this, and the source code is available here. It has both Windows and Linux ports. To run the Linux port you need a desktop environment running on that machine. A rough sketch of the Selenium flow is shown below.
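
The core of the Selenium part looks roughly like the following (a sketch only, using the Selenium 2-era Python API; the URLs and element locators are placeholders, the real values are in the linked project):

from selenium import webdriver

JASPER_URL = "http://<jasper-host>:8080/jasperserver-pro"   # placeholder

driver = webdriver.Firefox()   # needs a desktop / X session to run

# 1. Log in as an administrator (field names assumed to be the standard
#    Spring Security ones used by the JasperServer login page)
driver.get(JASPER_URL + "/login.html")
driver.find_element_by_name("j_username").send_keys("superuser")
driver.find_element_by_name("j_password").send_keys("<password>")
driver.find_element_by_name("j_password").submit()

# 2. Open the Ad Hoc cache admin page and click "Clear Cache"
#    (page URL and button locator are placeholders)
driver.get(JASPER_URL + "/<ad-hoc-cache-admin-page>")
driver.find_element_by_id("<clear-cache-button-id>").click()

# 3. Warm the cache again by simply visiting each report; Jaspersoft keeps
#    running the query in the background even if we do not wait for the page.
for report in ["/public/Reports/report1", "/public/Reports/report2"]:
    driver.get(JASPER_URL + "/flow.html?_flowId=viewReportFlow&reportUnit=" + report)

driver.quit()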

Project is available here for download.

Sunday, May 11, 2014

Jaspersoft HTML5 charts remove decimal (where not required) from tooltip

For those of you who don't know, Jaspersoft uses the Highcharts JS library for its HTML5 (dynamic) charts. We were facing a problem where Jaspersoft added a decimal part (XX.00) to all values in the tooltip. This is very disconcerting when the measure is an integer, like number of users, etc.

We thought that maybe our domain query or our data store was to blame and was returning decimals, but it was not.

The Highcharts tooltip has a property called valueDecimals which determines how many decimal places it should show.

Just add this line to getCommonSeriesGeneralOptions method in /var/lib/tomcat7/webapps/jasperserver-pro/scripts/adhoc/highchart.datamapper.js and live happily ever after :).

options.tooltip.valueDecimals=0;

This will show a decimal part only if you give one, so all your floats (coming from the data store) will remain floats and integers will remain integers.
 

Monday, April 14, 2014

Jaspersoft with AWS Redshift: Experience, Learnings and Problems

This was my first experience with AWS Redshift and Jaspersoft. Both technologies are good and easy to use, but can sometimes throw up issues which are difficult to decode and fix. Here are some of the things that I faced/discovered using both.

Redshift:

  1. Redshift performs decently when your queries do not have joins in them. For example, if one table has 1 Bn rows and another has just 50k rows and you join on a column which is not a sort key, the query may take anywhere between 5-10 min, which for me is a no-go for a dashboard (a user will not wait 5 minutes for a chart to load). This is because Redshift is very disk intensive, and its caching of blocks in memory is still not up to the mark, so it has to read a lot of blocks from disk for every query.
  2. Always try to denormalize data as much as possible. Joins are a crime: they will suck the life out of you and the query.
  3. Concurrency is not up to the mark in Redshift. Performance goes down drastically with every new parallel execution.
  4. A LIKE match will almost always make the query slow (no surprise here, as even an OLTP DB would behave the same), but the degradation is significant.
  5. We have a query which, when run, makes the whole cluster unresponsive :) . We still don't know the reason, but it happens.
  6. We used to run VACUUM every day after importing data into the cluster, but on numerous occasions we saw the whole cluster hang because VACUUM stalled on one of the nodes. One has to reboot the cluster to fix this. Also, the command does not time out, so it may hang your cluster (reads and writes) for almost a day if not rebooted. So we moved this command to once a week so that we do not hit this problem as often.
  7. Always make use of the SORT key in your queries. It works like a charm.
  8. Build aggregates for the charts, as the time taken by queries on huge tables is not predictable. We build aggregates for all the charts and don't run queries on big tables at run time.

Jaspersoft:
It also has its fair share of issues, but most of them are due to Redshift :). The following applies to the HTML5 reports which we built using the web wizard and NOT through the report designer.

  1. Creating a VIEW may throw an error or may not load at all when Redshift is having performance issues or is under heavy load. We still don't know why, but it happens.
  2. If you change the chart type in a VIEW, it may not be reflected in the REPORT; recreate the report.
  3. There is no support for a single-axis chart with multiple measures where the tooltip shows all the measures together. I found a workaround, which is present here.
  4. There is no API to clear the query cache. I tried a lot of command-line tools like PhantomJS to script it, but it did not work, as every page has an executionKey and Jaspersoft was smart enough to know that it was a scripted attempt :( . I had to write a Selenium test case to do the same. As the cache is in memory and NOT in the DB, I guess one would have to modify the Java code to do this properly. It uses Ehcache for caching the queries.
  5. One good thing about Jaspersoft is that even if you leave a report before it has completely loaded, the query still runs in the background and populates the cache. This helped me automate cache warming by simulating user clicks (Selenium) without waiting for the page to load.
  6. There is no eager/preemptive caching of queries; it is all on demand. So one has to write a Selenium test case to warm up the cache. This is much needed for Redshift, which is not good with concurrent queries.
Hope this will help somebody.

Friday, April 11, 2014

Jaspersoft Shared/Common Tooltip for non-multi-axis/single-axis graphs (HTML5)

We were facing a problem wherein we needed four measures to be shown on the same graph (a line graph). By default Jaspersoft has two chart types, single axis and multi axis. The single-axis graph shows only one y-axis, and one has to hover/move from one measure to the other to view the data in the tooltip.

This was very inconvenient for the end users, as they were not able to see all the measures in a single tooltip for the same point on the x-axis (a date in our case), making it difficult for them to compare the data for all the measures on the graph.

We tried using the multi-axis chart, which by default has a shared tooltip, but it shows multiple y-axes, which we did not need: one, all the measures were comparable to each other, and two, it gave a very wrong impression to the end user, as a series with very low values could appear above a series with very high values due to the difference in scales. This created a lot of confusion.

We tried all the forums and blogs but did not get an answer, so we decided to get our hands dirty. It turns out the fix/change is very easy (it took me 3 days to find, though).

For those who do not know, Jaspersoft uses Highcharts as the charting library for dynamic (HTML5) charts. The Highcharts API has a property called "shared" on "tooltip"; if we enable it, the tooltip becomes shared. So we just needed to find the JS file where it was being set, and we found it :) .

Go to the file scripts/adhoc/highchart.datamapper.js, method "getCommonSeriesGeneralOptions", line number 342, and change

options.tooltip.shared = HDM.isDualOrMultiAxisChart(extraOptions.chartState.chartType);
to
options.tooltip.shared = true;

Beware, this will make the tooltip shared for all graphs. If you want more granularity, you need to create another method similar to HDM.isDualOrMultiAxisChart() and return true or false accordingly; a rough sketch is shown below.
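
For example, something along these lines (a sketch only; isSharedTooltipChartType is a made-up helper name and the list of chart types is just an illustration):

// Decide per chart type instead of making every tooltip shared
function isSharedTooltipChartType(chartType) {
    var sharedTypes = ["line", "spline", "area"];   // illustrative list
    return sharedTypes.indexOf(chartType) !== -1;
}

options.tooltip.shared = HDM.isDualOrMultiAxisChart(extraOptions.chartState.chartType)
        || isSharedTooltipChartType(extraOptions.chartState.chartType);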


Sunday, January 12, 2014

AWS Java client examples for Auto Scaling metrics (Asynchronous)

Below are a few code snippets for gathering Auto Scaling metrics from CloudWatch using the AWS Java async client (AmazonCloudWatchAsyncClient). It is very similar to the other code snippets I have shared. The only thing that took me almost a day to discover was the namespace, which according to the documentation should be "AWS/AutoScaling", but what actually worked for me was "AWS/EC2" :(

As always, first create the client:


import com.amazonaws.ClientConfiguration;
import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.cloudwatch.AmazonCloudWatchAsyncClient;

AWSCredentials credentials = new BasicAWSCredentials(obj.getString("AWS_ACCESS_KEY"), obj.getString("AWS_SECRET_KEY"));
ClientConfiguration config = new ClientConfiguration();
config.setMaxConnections(1); // This is done to create a fixed number of connections per client
AmazonCloudWatchAsyncClient client = new AmazonCloudWatchAsyncClient(credentials);
client.setConfiguration(config);

Now a utility method to initialize the request object:

private static GetMetricStatisticsRequest initializeRequestObject(AmazonCloudWatchAsyncClient client,JSONObject groupDetails){ 
    GetMetricStatisticsRequest request   = new GetMetricStatisticsRequest();
     
    request.setPeriod(60*5); // 5 minutes 
     
    request.setNamespace("AWS/EC2"); 
         
    List<Dimension> dims  = new ArrayList<Dimension>(); 
    Dimension dim  = new Dimension(); 
    dim.setName("AutoScalingGroupName"); 
    dim.setValue(groupDetails.getString("NAME")); 
    dims.add(dim); 
     
    Date end = new Date(); 
    request.setEndTime(end); 
    // Back up 10 minutes 
    Date beg = new Date(end.getTime() - 10*60*1000); 
    request.setStartTime(beg); 
    request.setDimensions(dims); 
    return request; 
}

Lets gather some metrics now:


    public static void get5MinCPUUtilization(AmazonCloudWatchAsyncClient client, final JSONObject groupDetails, final String clientName){ 
        client.setEndpoint(groupDetails.getString("END_POINT")); 
        GetMetricStatisticsRequest request = initializeRequestObject(client, groupDetails); 
         
        request.setMetricName("CPUUtilization"); 
        request.setUnit(StandardUnit.Percent); 
         
        List<String> stats = new ArrayList<String>(); 
        stats.add("Average"); 
        stats.add("Maximum"); 
        stats.add("Minimum"); 
        request.setStatistics(stats); 
         
        client.getMetricStatisticsAsync(request, new AsyncHandler<GetMetricStatisticsRequest, GetMetricStatisticsResult>() { 
             
            @Override 
            public void onSuccess(GetMetricStatisticsRequest arg0,
                    GetMetricStatisticsResult arg1) { 
                List<Datapoint> data = arg1.getDatapoints(); 
                // Use the first datapoint if present, defaulting to 0.0 when no data came back
                Double avg = data.size() > 0 ? data.get(0).getAverage() : 0.0; 
                Double min = data.size() > 0 ? data.get(0).getMinimum() : 0.0; 
                Double max = data.size() > 0 ? data.get(0).getMaximum() : 0.0; 
                // Store or log avg/min/max here as needed
                 
            } 
             
            @Override 
            public void onError(Exception arg0) {
                 
            } 
        }); 
        return; 
    } 
     
    public static void get5MinDiskReadOps(AmazonCloudWatchAsyncClient client, final JSONObject groupDetails, final String clientName){ 
        client.setEndpoint(groupDetails.getString("END_POINT")); 
        GetMetricStatisticsRequest request = initializeRequestObject(client, groupDetails); 
         
        request.setMetricName("DiskReadOps"); 
        request.setUnit(StandardUnit.Count); 
         
        List<String> stats = new ArrayList<String>(); 
        stats.add("Average"); 
        stats.add("Maximum"); 
        stats.add("Minimum"); 
        request.setStatistics(stats); 
         
        client.getMetricStatisticsAsync(request, new AsyncHandler<GetMetricStatisticsRequest, GetMetricStatisticsResult>() { 
             
            @Override 
            public void onSuccess(GetMetricStatisticsRequest arg0,
                    GetMetricStatisticsResult arg1) { 
                List<Datapoint> data = arg1.getDatapoints(); 
                Double avg = data.size() > 0 ? data.get(0).getAverage() : 0.0; 
                Double min = data.size() > 0 ? data.get(0).getMinimum() : 0.0; 
                Double max = data.size() > 0 ? data.get(0).getMaximum() : 0.0; 
                 
            } 
             
            @Override 
            public void onError(Exception arg0) {
            } 
        }); 
        return; 
    } 
     
    public static void get5MinStatusCheckFailed(AmazonCloudWatchAsyncClient client, final JSONObject groupDetails, final String clientName){ 
        client.setEndpoint(groupDetails.getString("END_POINT")); 
        GetMetricStatisticsRequest request = initializeRequestObject(client, groupDetails); 
         
        request.setMetricName("StatusCheckFailed"); 
        request.setUnit(StandardUnit.Count); 
         
        List<String> stats = new ArrayList<String>(); 
        stats.add("Average"); 
        stats.add("Maximum"); 
        stats.add("Minimum"); 
        request.setStatistics(stats); 
         
        client.getMetricStatisticsAsync(request, new AsyncHandler<GetMetricStatisticsRequest, GetMetricStatisticsResult>() { 
             
            @Override 
            public void onSuccess(GetMetricStatisticsRequest arg0,
                    GetMetricStatisticsResult arg1) { 
                List<Datapoint> data = arg1.getDatapoints(); 
                Double avg = data.size() > 0 ? data.get(0).getAverage() : 0.0; 
                Double min = data.size() > 0 ? data.get(0).getMinimum() : 0.0; 
                Double max = data.size() > 0 ? data.get(0).getMaximum() : 0.0; 
                 
            } 
             
            @Override 
            public void onError(Exception arg0) {
            } 
        }); 
        return; 
    } 
     
    public static void get5MinDiskWriteOps(AmazonCloudWatchAsyncClient client, final JSONObject groupDetails, final String clientName){ 
        client.setEndpoint(groupDetails.getString("END_POINT")); 
        GetMetricStatisticsRequest request = initializeRequestObject(client, groupDetails); 
         
        request.setMetricName("DiskWriteOps"); 
        request.setUnit(StandardUnit.Count); 
         
        List<String> stats = new ArrayList<String>(); 
        stats.add("Average"); 
        stats.add("Maximum"); 
        stats.add("Minimum"); 
        request.setStatistics(stats); 
         
        client.getMetricStatisticsAsync(request, new AsyncHandler<GetMetricStatisticsRequest, GetMetricStatisticsResult>() { 
             
            @Override 
            public void onSuccess(GetMetricStatisticsRequest arg0,
                    GetMetricStatisticsResult arg1) { 
                List<Datapoint> data = arg1.getDatapoints(); 
                Double avg = data.size() > 0 ? data.get(0).getAverage() : 0.0; 
                Double min = data.size() > 0 ? data.get(0).getMinimum() : 0.0; 
                Double max = data.size() > 0 ? data.get(0).getMaximum() : 0.0; 
                 
            } 
             
            @Override 
            public void onError(Exception arg0) {
            } 
        }); 
        return; 
    } 
     
    public static void get5MinNetworkOutBytes(AmazonCloudWatchAsyncClient client, final JSONObject groupDetails, final String clientName){ 
        client.setEndpoint(groupDetails.getString("END_POINT")); 
        GetMetricStatisticsRequest request = initializeRequestObject(client, groupDetails); 
         
        request.setMetricName("NetworkOut"); 
        request.setUnit(StandardUnit.Bytes); 
         
        List<String> stats = new ArrayList<String>(); 
        stats.add("Average"); 
        stats.add("Maximum"); 
        stats.add("Minimum"); 
        request.setStatistics(stats); 
         
        client.getMetricStatisticsAsync(request, new AsyncHandler<GetMetricStatisticsRequest, GetMetricStatisticsResult>() { 
             
            @Override 
            public void onSuccess(GetMetricStatisticsRequest arg0,
                    GetMetricStatisticsResult arg1) { 
                List<Datapoint> data = arg1.getDatapoints(); 
                Double avg = data.size() > 0 ? data.get(0).getAverage() : 0.0; 
                Double min = data.size() > 0 ? data.get(0).getMinimum() : 0.0; 
                Double max = data.size() > 0 ? data.get(0).getMaximum() : 0.0; 
                 
            } 
             
            @Override 
            public void onError(Exception arg0) {
            } 
        }); 
        return; 
    } 
     
    public static void get5MinNetworkInBytes(AmazonCloudWatchAsyncClient client, final JSONObject groupDetails, final String clientName){ 
        client.setEndpoint(groupDetails.getString("END_POINT")); 
        GetMetricStatisticsRequest request = initializeRequestObject(client, groupDetails); 
         
        request.setMetricName("NetworkIn"); 
        request.setUnit(StandardUnit.Bytes); 
         
        List<String> stats = new ArrayList<String>(); 
        stats.add("Average"); 
        stats.add("Maximum"); 
        stats.add("Minimum"); 
        request.setStatistics(stats); 
         
        client.getMetricStatisticsAsync(request, new AsyncHandler<GetMetricStatisticsRequest, GetMetricStatisticsResult>() { 
             
            @Override 
            public void onSuccess(GetMetricStatisticsRequest arg0,
                    GetMetricStatisticsResult arg1) { 
                List<Datapoint> data = arg1.getDatapoints(); 
                Double avg = data.size() > 0 ? data.get(0).getAverage() : 0.0; 
                Double min = data.size() > 0 ? data.get(0).getMinimum() : 0.0; 
                Double max = data.size() > 0 ? data.get(0).getMaximum() : 0.0; 
                 
            } 
             
            @Override 
            public void onError(Exception arg0) {
                log.error("Could not get Autoscaling data for " + groupDetails.getString("NAME") + " for client "+ clientName,arg0); 
                NotificationMail.sendMail("Could not get Autoscaling data for " + groupDetails.getString("NAME") + " for client "+ clientName, "AutoScaling data could not be read"); 
            } 
        }); 
        return; 
    } 
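
For completeness, calling one of these methods looks roughly like this (a sketch; the group name, endpoint and client label are placeholders, and JSONObject is assumed to be org.json's):

JSONObject groupDetails = new JSONObject();
groupDetails.put("NAME", "my-autoscaling-group");                     // placeholder Auto Scaling group name
groupDetails.put("END_POINT", "monitoring.us-east-1.amazonaws.com");  // CloudWatch endpoint for your region

get5MinCPUUtilization(client, groupDetails, "someClient");
get5MinNetworkInBytes(client, groupDetails, "someClient");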

For some more examples (ELB and RDS metrics), go here.

Cloud-Based (AWS) Elastic JMeter Load Testing Application (SWARM)

In this age of the internet, it is imperative for any web-based application to benchmark itself for high concurrency. As AWS Advanced Technology Partners, our work includes helping enterprises and start-ups embrace AWS for their production as well as testing workloads. A few questions that people have are:

1) Is AWS scalable?
2) How many requests/min can an EC2 instance serve?
3) What instance class should I choose for my application?
4) How many instances should I choose for my application?
5) Does Auto Scaling actually work?

It turns out there are no simple answers to these questions, as they are very subjective in nature and vary from one application to another. The only way to find out is to do a load test.

JMeter is almost an industry standard for load testing. We can run a test with the desired concurrency and duration and write our own test cases through the GUI provided with it. It can provide a summary in the form of a raw log file (JTL), a table or a graph.

All this is good when you want to run a load test from one machine, but what if you want to run the load test from multiple machines? How would you aggregate the data across those machines?

You must be wondering why we would need to run a load test from multiple machines and not from one machine only.

Some things that I have learnt from my experience are:

1) The test should always be run in a distributed manner. When running concurrent connections from a single machine, one can easily reach the network/IO limit of that machine, which inflates the measured response time and skews the results.

2) Since JMeter creates multiple concurrent threads, the more threads there are, the more CPU contention there is, which again inflates the measured response time.

3) You cannot target requests per unit time for your load test, as it is a function of the number of concurrent threads and the server response time.

4) You can only simulate concurrency with JMeter. For example, if you select 100 threads, JMeter makes sure that there are 100 concurrent requests at any given time. JMeter also reuses these threads for maximum performance.

5) When doing load testing for an application behind an ELB, make sure that either the ELB is pre-warmed (details here) or you use a ramp-up period. Please note that this is required only when the concurrency you are testing for is very high (there are no exact numbers shared by AWS). To know whether you are reaching the limits of the ELB, look for ELB 5XX errors in CloudWatch for your ELB.

6) To know which part of your stack is the bottleneck, use a profiler. My favorite is New Relic. It has plugins for almost all software.

To try out our product please visit https://swarm.minjar.com/ .