Sunday, 8 December 2013

Elementary, my dear customer

One of my personal traditions as winter approaches in England is sitting in front of the fire and watching one of the many excellent dramatisations of the classic story, The Hound of the Baskervilles. Having read the complete Sherlock Holmes repeatedly when I was younger, I find the characters and plots have a comforting familiarity when the weather outside turns spiteful. One of the most famous literary characters of all time, Holmes, as I'm sure you are aware, applies logical reasoning to the investigation of crime.

I find that exactly the same processes of logical and deductive reasoning are invaluable to the software testers and technical support agents that I work with when performing some of the more challenging aspects of those roles, for example when trying to establish what might be causing bugs that have been observed in our software and systems.

More information than you know

"'Pon my word, Watson, you are coming along wonderfully. You have really done very well indeed. It is true that you have missed everything of importance, but you have hit upon the method, and you have a quick eye for colour. Never trust to general impressions, my boy, but concentrate yourself upon details." (A Case of Identity)

One of the characteristics of Holmes's deductions is that he does not make great inspirational leaps. What appear to be fantastic demonstrations of deduction are made possible simply by observing details that others miss, details which provide a wealth of information once a process of logical reasoning is applied to them. It is my experience that those who achieve the best results in investigations are the ones who take the time to really understand the information available and what they can deduce from it.

One of my team recently asked me for some help in knowing where to start when trying to investigate a software problem. My answer was, on the face of it, quite trite:

"Write down everything that you know".

Whilst this seems childishly obvious, the point that I was leading to was that if we examine and document all of the facts at our disposal, "what we know" can actually be a much larger set of information, and get us a lot closer to an explanation, than might first appear possible. Collating all of the information that we can identify from the various sources available to us into a central set of notes helps to organise our knowledge of a situation and ensures that we don't overlook simple details or commit too early to conclusions without looking at all the available data. I, and very capable individuals that I have worked with, have come to erroneous conclusions on problems simply because we committed to a diagnosis before examining the available information in sufficient detail.

"It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts." (A Scandal In Bohemia)

Story 1

A simple example presented itself recently. We had a failure that had occurred repeatedly across a cluster of servers over a period of continuous running, and one of my team was investigating the occurrences. I'd asked him to examine the occurrences of the failure in the logs to see if he could identify any relationships.

They were:-

10/21 2:01
10/21 17:39
10/21 18:39
10/22 8:41
10/22 9:41
10/22 13:42

His first thought was that there was no relationship between the times, yet if we exclude the anomaly of the first reading we can see that all of the occurrences are at around 20 minutes to the hour. The pattern may look obvious when presented as above, but when lost in a large set of log files it is easy to miss. Only by taking the information out and looking at the relevant values in isolation does a clear pattern emerge. When I highlighted the relationship to him he re-examined the logs in light of this pattern and established more information on the first value, identifying that it was due to an unrelated and explainable human error. The remaining entries that fitted the pattern were all caused by the problem we were investigating.

Write down everything that you know:-

"All of the failures relating specifically to the problem at hand occurred around 20 minutes to the hour. The time of the failures gets approximately 1 minute closer to the hour with every 12-14 hours. One other failure occurred during the period which the logs show to be due to human syntax error."

By establishing that there was a temporal element to the occurrence of this issue, we deduced that it must be affected by a timed or cyclical process, most likely operating on an hourly loop. Based on this information we were able to correlate the occurrence of the error with the hourly refresh of a caching service.
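This kind of check is easy to script once the timestamps are extracted from the logs. A minimal sketch using the occurrence times above (the parsing and thresholds are illustrative, not the tooling we actually used):

```python
from datetime import datetime

# Failure times pulled from the logs (month/day hour:minute)
occurrences = ["10/21 2:01", "10/21 17:39", "10/21 18:39",
               "10/22 8:41", "10/22 9:41", "10/22 13:42"]

# Look at the minutes-past-the-hour value in isolation
minutes = [datetime.strptime(t, "%m/%d %H:%M").minute for t in occurrences]
print(minutes)  # [1, 39, 39, 41, 41, 42]

# All but the first (anomalous) entry cluster around 20 minutes to the hour
pattern = [m for m in minutes[1:] if 38 <= m <= 42]
```

Isolating one field at a time like this is exactly the "concentrate yourself upon details" step: the minute values alone reveal the cluster that is invisible in the raw logs.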

Story 2

In a similar vein, on investigating a query failure that occurred over a weekend, one of the team pulled the logs across the whole weekend for the log message in question. On looking at when the problems occurred it was immediately apparent that all of the failures occurred just before twenty-past one in the morning.

Server:192.168.0.1|2013-10-21 01:18:01 -0600|3|Task:12-2013102301180103210231|A query error occurred ...
Server:192.168.0.2|2013-10-22 01:19:21 -0600|3|Task:12-2013102301180103210231|A query error occurred ...
Server:192.168.0.1|2013-10-22 01:19:23 -0600|3|Task:12-2013102301180103210231|A query error occurred ...
Server:192.168.0.3|2013-10-23 01:19:52 -0600|3|Task:12-2013102301180103210231|A query error occurred ...

The immediate conclusion was that all had been impacted by a single event. On re-examining the timings and looking at the dates of the failures in addition to the times, it was clear that the failures had actually occurred at similar times, but on different days. Not only that, they had affected different servers. This may seem like an obvious detail, but if you are looking at a list of consolidated error log entries and see multiple events with matching times, it is very easy to unconsciously assume that the dates will also be the same and not actually confirm that detail.
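Parsing the date alongside the time makes the assumption impossible to slip past. A small sketch over the log lines above (the field layout is taken from the excerpt; the parsing approach is illustrative):

```python
from datetime import datetime

# The four error lines from the consolidated log (message text truncated)
log_lines = [
    "Server:192.168.0.1|2013-10-21 01:18:01 -0600|3|Task:12-2013102301180103210231|A query error occurred",
    "Server:192.168.0.2|2013-10-22 01:19:21 -0600|3|Task:12-2013102301180103210231|A query error occurred",
    "Server:192.168.0.1|2013-10-22 01:19:23 -0600|3|Task:12-2013102301180103210231|A query error occurred",
    "Server:192.168.0.3|2013-10-23 01:19:52 -0600|3|Task:12-2013102301180103210231|A query error occurred",
]

parsed = []
for line in log_lines:
    server, stamp = line.split("|")[:2]
    # Parse the full timestamp -- the date is the easy detail to miss
    when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S %z")
    parsed.append((when, server))

# Similar times of day, but three different dates across the cluster
dates = {when.date() for when, _ in parsed}
```

Printing or asserting on the set of distinct dates, rather than eyeballing the times column, forces the "same time, different days" pattern into view.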

Write down what you know

All of the failures occurred between 1:18am and 1:20am server local time, which is six hours behind UTC. The failures occurred on separate days on separate servers in the cluster.

Based on this information we could infer that the problem was being caused by a process common to all machines, or a process external to yet shared by all of the machines. Either way, there was clearly a timing element to the issue which made the problem more likely around 1:20am, and it occurred on all three days. We were able to provide this information back to the customer, who were then able to investigate the virtual environment with their infrastructure team on this basis.

The power of Contraposition

"when you have eliminated the impossible, whatever remains, however improbable, must be the truth". (The Sign of Four)

This is perhaps the most famous Holmes quote of all, and refers to the indirect deductive approach of contrapositive reasoning. This is the process of deducing that an event has not occurred by establishing a contradiction to an antecedent of that event, i.e. establishing that a logically necessary condition for that event has not taken place. When faced with a failure, the typical behaviour (in my company at least) of all of the individuals concerned is to offer theories and ideas about what may be causing that failure. This can result in a number of plausible theories, which can make narrowing down to the likely cause very difficult. By examining all of the information available, both to identify what we know is the case and to see what that information tells us isn't happening, we can make far more accurate inferences about the nature of a problem than would be apparent by taking the information at face value.

Story 3

A couple of years ago I was investigating a customer problem and struggling. I'd spent over a week collecting information and could still not identify the cause of the issue. Some of the key points were:

  • A problem occurred running a job through a specific tool, yet a different tool running the same job via the same connection ran successfully, and other smaller operations through the original tool were also OK.
  • The problem appeared to result in files in our processing queue being truncated or corrupted in some way.

Three of the other team members and I sat in a room and literally mind-mapped everything that we knew about this process. Over the next hour, as a team, we established the facts and made a series of new deductions on the behaviour:-

  • One suggested another process was corrupting ('gobbling up') some of the files - we established that the files were missing entire written sections. If something was affecting the files post-write then an expected outcome would be that the files were truncated at random points rather than at these clean section endings.
  • We discussed a common failure of each parallel process creating the files. Each file had the same modified date yet a different size. If the write process was failing on each write separately then we'd expect different modified dates and probably more consistent sizes, so we deduced that one event had affected all of the files.
  • We discussed an event that could have occurred at the point of failure. On examination, the last write time of the files was well before the problem was reported. If the problem was caused by an event at the time of failure then we'd expect to see matching write times; therefore the actual problem was taking place earlier, and the files were taking time to work their way through the processing queue before the failure event was reported.

Through this group session of logical reasoning we eliminated the other possibilities and established that the problem was related to running out of space in a processing queue. This seemed unlikely given the configuration, however on the basis of our deductions we probed the other information that we had on the configuration and identified that the customer installation had been copied from a much larger server without being reconfigured appropriately for the new environment. The disk space provisioned for the queue on the new machine, combined with the slow ingestion of results by the client application, was causing the exact behaviour that we had established must be happening.

None of these deductions required any real-time tracing or extra debugging. All that we had was a recursive listing of the files in the queue and a couple of example files from the failing process. By taking each hypothesis on what could have caused the problem and using the information available to prove the absence of a logical outcome of that cause, we could disprove the hypothesis and narrow down to what remained, which had to be the truth.
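The elimination process above can be sketched as data: each hypothesis implies a consequence that must hold if it is the cause, and an observation contradicting that consequence rules the hypothesis out. The hypothesis names and observations below paraphrase our session; the structure is illustrative, not a tool we actually built:

```python
# Each hypothesis implies a consequence that must be true if it is the cause.
# Contraposition: if the consequence is observably false, the hypothesis falls.
hypotheses = {
    "post-write corruption":     "files truncated at random points",
    "per-process write failure": "differing modified dates across files",
    "event at time of failure":  "write times match the reported failure time",
}

# What the file listing and example files actually showed
observations = {
    "files truncated at random points": False,        # clean section endings
    "differing modified dates across files": False,   # all dates identical
    "write times match the reported failure time": False,  # writes much earlier
}

# A hypothesis survives only if its necessary consequence was observed
surviving = [h for h, consequence in hypotheses.items()
             if observations.get(consequence, True)]
print(surviving)  # [] -- all three eliminated; look elsewhere (queue space)
```

When every offered theory is eliminated this way, whatever explanation remains consistent with the observations, however improbable, is where to dig next.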

A role model for Testers

Holmes is one of my great fictional heroes and it is hugely rewarding that exactly the same processes of logical and deductive reasoning that are made so famous in those novels are also invaluable to the software testers and technical support agents that I work with. In fact, when providing some mentoring to one of my team recently, I actually recommended that he read some Sherlock Holmes novels to gain inspiration in the art of deduction and help him to track down bugs.

I'm aware that I'm not the first to make such a comparison, yet all too often I am on the receiving end of requests for information, or erroneous deductions, from other organisations because the individual in question has not fully examined the information available to them. In too many testing cultures that I have encountered it is rare for the individual raising an issue to have made any attempt to apply a process of logical reasoning to establish what that information is telling them. With the cheapness and immediacy of communication, some find it easier to fire the immediate 'end behaviour' of issues over to others rather than taking the time to establish the facts themselves.

One of the things that I love about my organisation is that I, and the people that I work with, will always strive to fully understand each new situation, using the information available and our own powers of logical reasoning to best advantage. Sadly we'll never have Sherlock Holmes working for us as a software tester, but I believe that having a culture of applying the same skills that make the fictional detective so famous is the next best thing.

2 comments:

  1. I don't usually like "testers are like detectives" posts but this one was an exception with the use of real-life examples. Great read

    1. Phil,

      One of my missions in life is to present your personal bugbears back to you in a way that you grudgingly appreciate!

      Thanks for reading and commenting

      Adam


Thanks for taking the time to read this post, I appreciate any comments that you may have:-
