False confidenceThrough targeted improvements by the programmers we'd seen some dramatic redutions in the query times. Based on the figures that we were seeing for the execution of multiple parallel queries, we thought that we were well within target. Care was taken to ensure that each query was accessing different data partitions and that no files were being cached on the application server.
Missing a trickWe were well aware that our environment was not a perfect match for the customers, and had flagged this as a project risk to address. Our particular concerns revolved around using a gateway server instead of a native NAS device as it was a fundamental difference in topology. As we examined the potential dangers it dawned very quickly that the operating system on the gateway box could be invalidating the test results.
Most operating systems cache recently accessed files in spare memory to improve IO performance, and Linux is no exception. We were well aware of this behaviour and for the majority of tests we take action to prevent this from happening, however we'd failed to take it into account for the file sharing server in this new architecture. For many of our queries all of the data was coming out of memory rather than from disk and giving us unrealistically low query times.
Won't get fooled again
Understanding the operating system behaviour is critical to performance testing. What may seem to be a perfectly valid performance test can yield wildly innaccurate results if the caching behaviour of the operating system is not taken into account and excluded. Operating systems have their own agenda in optimising performance which can operate in conflict with our attempts to model performance to predict beaviour when operating at scale. In particular our scalability graph can exhibit a significant point of inflection when the file size exceeds that which can be contained in the memory cache of the OS.
In this case, despite out solid understanding of file system caching behaviour, we still made an error of judgement as we had not applied this knowledge to every component in a multi tiered model. Thankfully our identification of the topology as a risk to the story and subsequent investigation flushed out the deception in this case and we were able to retest and ensure that the customer targets were met in a more realistic architecture. It was a timely reminder, however, how vital it is to examine every facet of the test environment to ensure that we do not end up as the mark in an inadvertent three card trick.
(By the way - from RHEL 5 onwards linux has supported the hugely useful method to clear the OS cache
echo "n" > /proc/sys/vm/drop_cachesWhere n=1,2 or 3. Sadly, not all OSs are as accomodating)
Copyright (c) Adam Knight 2011