Saturday, 4 June 2011

Follow the lady - on not getting tricked by your OS when performance testing

Recently my colleagues and I were involved in working on a story to achieve a certain level of query performance for a customer. We'd put a lot of effort into trying to generate data which would be representative of the customer's for the purpose of querying. The massive size of the target installation, however, prevented us from generating data to the same scale so we had created a realistic subset across a smaller example date range. This is an approach we have used many times before to great effect in creating acceptance tests for customer requirements. The target disk storage system for the data was NFS, so we'd created a share to our SAN mounted from a gateway Linux server and shared out to the application server.

False confidence

Through targeted improvements by the programmers we'd seen some dramatic reductions in the query times. Based on the figures that we were seeing for the execution of multiple parallel queries, we thought that we were well within target. Care was taken to ensure that each query was accessing different data partitions and that no files were being cached on the application server.

Missing a trick

We were well aware that our environment was not a perfect match for the customer's, and had flagged this as a project risk to address. Our particular concerns revolved around using a gateway server instead of a native NAS device, as this was a fundamental difference in topology. As we examined the potential dangers, it dawned on us very quickly that the operating system on the gateway box could be invalidating the test results.

Most operating systems cache recently accessed files in spare memory to improve IO performance, and Linux is no exception. We were well aware of this behaviour and for the majority of tests we take action to prevent this from happening; however, we'd failed to take it into account for the file sharing server in this new architecture. For many of our queries all of the data was coming out of memory rather than from disk, giving us unrealistically low query times.
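You can watch this caching happen on a Linux box simply by reading a file and observing the kernel's page cache grow. This is a minimal sketch, assuming a Linux /proc filesystem; the scratch file path and size are illustrative only:

```shell
#!/bin/sh
# Sketch: watch the kernel page cache grow as a file is read.
# The /tmp path and 50 MB size are examples only.
cached_kb() {
    awk '/^Cached:/ {print $2}' /proc/meminfo
}

before=$(cached_kb)

# Create and read a 50 MB scratch file
dd if=/dev/zero of=/tmp/cache_demo bs=1M count=50 2>/dev/null
cat /tmp/cache_demo > /dev/null

after=$(cached_kb)
echo "Cached before: ${before} kB, after: ${after} kB"
rm -f /tmp/cache_demo
```

If memory is spare, the "after" figure will typically have climbed by roughly the file size, and any subsequent read of that file will never touch the disk.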

Won't get fooled again

Understanding the operating system behaviour is critical to performance testing. What may seem to be a perfectly valid performance test can yield wildly inaccurate results if the caching behaviour of the operating system is not taken into account and excluded. Operating systems have their own agenda in optimising performance, which can conflict with our attempts to model performance to predict behaviour when operating at scale. In particular, our scalability graph can exhibit a significant point of inflection when the file size exceeds that which can be contained in the memory cache of the OS.
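A cheap sanity check before trusting any test figures is to compare the size of the dataset under test against the memory the kernel could devote to page cache. This is a rough sketch, assuming Linux; the data directory path is a hypothetical example:

```shell
#!/bin/sh
# Sketch: compare a dataset's size with memory potentially available
# for page cache, to anticipate where the inflection point may sit.
# The default directory is a hypothetical example.
DATA_DIR=${1:-/var/lib/mydata}

data_kb=$(du -sk "$DATA_DIR" 2>/dev/null | awk '{print $1}')
free_kb=$(awk '/^MemFree:/ {print $2}' /proc/meminfo)
cached_kb=$(awk '/^Cached:/ {print $2}' /proc/meminfo)
cacheable_kb=$((free_kb + cached_kb))

echo "Dataset: ${data_kb:-0} kB, cacheable memory: ${cacheable_kb} kB"
if [ "${data_kb:-0}" -le "$cacheable_kb" ]; then
    echo "Dataset may fit entirely in page cache - results could flatter you"
fi
```

If the dataset fits comfortably in cacheable memory, results on that environment tell you little about how the same queries will behave at the customer's scale.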

In this case, despite our solid understanding of file system caching behaviour, we still made an error of judgement as we had not applied this knowledge to every component in a multi-tiered model. Thankfully our identification of the topology as a risk to the story, and subsequent investigation, flushed out the deception in this case and we were able to retest and ensure that the customer targets were met in a more realistic architecture. It was a timely reminder, however, of how vital it is to examine every facet of the test environment to ensure that we do not end up as the mark in an inadvertent three card trick.

(By the way - from RHEL 5 onwards Linux has supported a hugely useful method to clear the OS cache, run as root:
sync
echo n > /proc/sys/vm/drop_caches
Where n = 1 (page cache), 2 (dentries and inodes) or 3 (both). The sync first flushes dirty pages so that as much of the cache as possible can be dropped. Sadly, not all OSs are as accommodating.)
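To see what dropping the cache actually buys you, time the same read cold (caches dropped) and then warm. This is a minimal sketch, assuming root on Linux; the file path and size are illustrative:

```shell
#!/bin/sh
# Sketch: read the same file cold and warm to expose the cache effect.
# Needs root to drop caches; the /tmp path and 64 MB size are examples.
FILE=/tmp/cold_warm_demo
dd if=/dev/zero of="$FILE" bs=1M count=64 2>/dev/null

sync
( echo 3 > /proc/sys/vm/drop_caches ) 2>/dev/null \
    || echo "warning: need root to drop caches; 'cold' run may be warm"

# dd reports its own elapsed time and throughput on stderr
cold=$(dd if="$FILE" of=/dev/null bs=1M 2>&1 | tail -n1)
warm=$(dd if="$FILE" of=/dev/null bs=1M 2>&1 | tail -n1)

echo "cold: $cold"
echo "warm: $warm"
rm -f "$FILE"
```

On a typical machine the warm figure is an order of magnitude or more faster than the cold one - exactly the flattering difference that misled our initial test results.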

Copyright (c) Adam Knight 2011
Twitter: adampknight
