Sunday, 8 December 2013


Elementary, my dear customer

One of my personal traditions as winter approaches in England is sitting in front of the fire and watching one of the many excellent dramatisations of the classic story, The Hound of the Baskervilles. Having read the complete Sherlock Holmes repeatedly when I was younger, I find that the characters and plots have a comforting familiarity when the weather outside turns spiteful. One of the most famous literary characters of all time, Holmes, as I'm sure you are aware, uses the application of logical reasoning in the investigation of crime.

I find that exactly the same processes of logical and deductive reasoning are also invaluable to both the software testers and technical support agents that I work with when performing some of the more challenging aspects of those activities, for example when trying to establish what might be causing bugs that have been observed in our software and systems.

More information than you know

"'Pon my word, Watson, you are coming along wonderfully. You have really done very well indeed. It is true that you have missed everything of importance, but you have hit upon the method, and you have a quick eye for colour. Never trust to general impressions, my boy, but concentrate yourself upon details." (A Case of Identity)

One of the characteristics of Holmes's deductions is that he does not make great inspirational leaps. What appear to be fantastic demonstrations of deduction are made possible simply by observing details that others miss, details which provide a wealth of information when a process of logical reasoning is applied to them. It is my experience that those who achieve the best results in investigations are the ones who take the time to really understand the information available and what they can deduce from it.

One of my team recently asked me for some help in knowing where to start when trying to investigate a software problem. My answer was, on the face of it, quite trite:

"Write down everything that you know".

Whilst this seems childishly obvious, the point that I was leading to was that if we examine and document all of the facts at our disposal, "what we know" can actually be a much larger set of information, and get us a lot closer to an explanation, than might first appear possible. Collating all of the information that we can identify from the various sources available to us into a central set of notes helps to organise our knowledge of a situation and ensures that we don't overlook simple details or commit too early to conclusions without looking at all the available data. I, and very capable individuals that I have worked with, have come to erroneous conclusions on problems simply because we committed to a diagnosis before examining the available information in sufficient detail.

"It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts." (A Scandal In Bohemia)

Story 1

A simple example presented itself recently. We had a failure that had occurred repeatedly across a cluster of servers over a period of continuous running, and one of my team was investigating the occurrences of the failure. I'd asked him to examine the occurrences of the failure in the logs to see if he could identify any relationships.

They were:-

10/21 2:01

10/21 17:39

10/21 18:39

10/22 8:41

10/22 9:41

10/22 13:42

His first thought was that there was no relationship between the times, yet if we exclude the anomaly of the first reading we can see that all of the occurrences are at around 20 minutes to the hour. The pattern may look obvious when presented as above, but when lost in a large set of log files it is easy to miss. Only by taking the information out and looking at the relevant values in isolation does a clear pattern emerge. When I highlighted the relationship to him he re-examined the logs in light of this pattern and established more information on the first value, identifying that it was due to an unrelated and explainable human error. The remaining entries that fitted the pattern were all caused by the problem we were investigating.

Write down everything that you know:-

"All of the failures relating specifically to the problem at hand occurred around 20 minutes to the hour. The time of the failures gets approximately 1 minute closer to the hour with every 12-14 hours. One other failure occurred during the period which the logs show to be due to human syntax error."

Having established that there was a temporal element to the occurrence of this issue, we deduced that it must be affected by a process with a time component or a cyclical process, most likely operating on an hourly loop. Based on this information we were able to correlate the occurrence of the error with the hourly refresh of a caching service.
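To make the "look at the values in isolation" step concrete, here is a minimal sketch in Python (not the script we actually used; the timestamps are simply the values listed above) that pulls the failure times out on their own and shows how far each one falls from the hour:

from datetime import datetime

# The failure times exactly as extracted from the logs above.
failure_times = [
    "10/21 2:01", "10/21 17:39", "10/21 18:39",
    "10/22 8:41", "10/22 9:41", "10/22 13:42",
]

for raw in failure_times:
    ts = datetime.strptime(raw, "%m/%d %H:%M")
    minutes_to_hour = (60 - ts.minute) % 60
    print(f"{raw:>12} -> {minutes_to_hour} minutes to the hour")

Laid out like this, the 2:01 entry stands out as the anomaly while the rest cluster around the twenty-minutes-to-the-hour mark.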

Story 2

In a similar vein, on investigating a query failure that occurred over a weekend, one of the team pulled the logs across the whole weekend for the log message in question. On looking at when the problems occurred it was immediately apparent that all of the failures occurred just before twenty-past one in the morning.

Server:192.168.0.1|2013-10-21 01:18:01 -0600|3|Task:12-2013102301180103210231|A query error occurred ...
Server:192.168.0.2|2013-10-22 01:19:21 -0600|3|Task:12-2013102301180103210231|A query error occurred ...
Server:192.168.0.1|2013-10-22 01:19:23 -0600|3|Task:12-2013102301180103210231|A query error occurred ...
Server:192.168.0.3|2013-10-23 01:19:52 -0600|3|Task:12-2013102301180103210231|A query error occurred ...

The immediate conclusion was that all had been impacted by a single event. On re-examining the timings and looking at the dates of the failures in addition to the times, it was clear that the failures had actually occurred at similar times, but on different days. Not only that, they had affected different servers. This may seem like an obvious detail, but if you are looking at a list of consolidated error log entries and see multiple concurrent events with matching times, it is very easy to unconsciously assume that the dates will also be the same and not actually confirm that detail.
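To illustrate, here is a minimal sketch in Python (the log lines are simplified copies of the example above, not our real tooling) that keys each failure on the full date as well as the time, which makes it immediately obvious that they fell on different days and different servers:

from collections import defaultdict

# Simplified copies of the consolidated log entries shown above.
log_lines = [
    "Server:192.168.0.1|2013-10-21 01:18:01 -0600|3|A query error occurred",
    "Server:192.168.0.2|2013-10-22 01:19:21 -0600|3|A query error occurred",
    "Server:192.168.0.1|2013-10-22 01:19:23 -0600|3|A query error occurred",
    "Server:192.168.0.3|2013-10-23 01:19:52 -0600|3|A query error occurred",
]

failures_by_date = defaultdict(list)
for line in log_lines:
    server, timestamp = line.split("|")[:2]
    date, time_of_day, _offset = timestamp.split()
    failures_by_date[date].append((time_of_day, server))

for date, events in sorted(failures_by_date.items()):
    print(date, events)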

Write down what you know

All of the failures occurred between 1:18am and 1:20am server local time, which is UTC - 6 hours. The failures occurred on separate days on separate servers in the cluster.

Based on this information we could infer that the problem was being caused by a process common to all machines, or a process external to yet shared by all of the machines. Either way there was clearly some timing element to the issue which made the problem more likely around 1:20am, and it occurred on all three days. We were able to provide this information back to the customer, who were then able to investigate the virtual environment with their infrastructure team on this basis.

The power of Contraposition

"when you have eliminated the impossible, whatever remains, however improbable, must be the truth". (The Sign of Four)

This is perhaps the most famous Holmes quote of all, and refers to the indirect deductive approach of contrapositive reasoning. This is the process of deducing that an event has not occurred by establishing that a logically necessary consequence of that event has not taken place. When faced with a failure, the typical behaviour (in my company at least) for all of the individuals concerned is to offer theories and ideas about what may be causing that failure. This can result in a number of plausible theories on the problem, which can make narrowing down to the likely cause very difficult. By examining all of the information available, both to identify what we know is the case and to see what that information tells us isn't happening, we can make far more accurate inferences about the nature of a problem than would be apparent by taking the information at face value.

Story 3

A couple of years ago I was investigating a customer problem and struggling. I'd spent over a week collecting information and could still not identify the cause of the issue. Some of the key points were:-

  • A problem occurred running a job through a specific tool; however, a different tool running the same job via the same connection ran successfully, and other smaller operations through the original tool were also OK.
  • The problem appeared to result in files in our processing queue being truncated or corrupted in some way.

Three other team members and I sat in a room and literally mind-mapped everything that we knew about this process. Over the next hour, as a team, we established the facts and made a series of new deductions on the behaviour:-

  • One suggested another process corrupting ('gobbling up') some of the files - we established that the files were missing entire written sections. If something had been affecting the files post-write then an expected outcome would be that the files were truncated at random points rather than at these clean section endings.
  • We discussed a common failure affecting each of the parallel processes creating the files. Each file had the same modified date yet a different size. If the write process had been failing on each write separately then we'd expect different modified dates and probably more consistent sizes, so we deduced that one event was affecting all of the files.
  • We discussed an event that could have occurred at the point of failure. On examination the last write time of the files was well before the problem was reported. If the problem had been caused by an event at the time of failure then we'd expect to see matching write times; therefore the actual problem was taking place earlier and the files were taking time to work their way through the processing queue before causing the failure event to be reported.

Through this group session of logical reasoning we eliminated other possibilities and established that the problem was related to running out of space in a processing queue. This seemed unlikely given the configuration, however on the basis of our deductions we probed the other information that we had on the configuration and identified that the customer installation had been copied from a much larger server without being reconfigured appropriately for the new environment. The disk space provisioned for the queue on the new machine, combined with the slow ingestion of results by the client application, was causing the exact behaviour that we had established must be happening.

None of these deductions required any real-time tracing or extra debugging. All that we had was a recursive listing of the files in the queue and a couple of example files from the failing process. By taking each hypothesis on what could have caused the problem and using the information available to prove the absence of a logical outcome of that cause, we could disprove the hypothesis and narrow down to what remained, which had to be the truth.
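As a rough sketch of how that elimination works (the hypotheses and observations below are paraphrased from our session rather than taken from any real artefact), each theory predicts something we should have seen in the file listing, and the absence of that prediction rules the theory out:

# Each hypothesis paired with an observation that would necessarily follow from it.
hypotheses = {
    "another process corrupting files post-write": "files truncated at random points",
    "each parallel write failing independently": "differing modified dates across the files",
    "an event at the reported time of failure": "last write times matching the failure time",
}

# What the recursive file listing and example files actually showed.
observations = {
    "files missing entire sections with clean endings",
    "identical modified dates but differing sizes",
    "last write times well before the failure was reported",
}

for theory, expected in hypotheses.items():
    if expected not in observations:
        print(f"Eliminated: {theory} (no sign of: {expected})")

Every theory whose necessary consequence is missing from the observations drops out, and whatever survives is where the investigation should focus.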

A role model for Testers

Holmes is one of my great fictional heroes and it is hugely rewarding that exactly the same processes of logical and deductive reasoning that are made so famous in those novels are also invaluable to both the software testers and technical support agents that I work with in performing their work. In fact, when providing some mentoring to one of my team recently I actually recommended that he read some Sherlock Holmes novels to gain inspiration in the art of deduction to help him to track down bugs.

I'm aware that I'm not the first to make such a comparison, yet all too often I am on the receiving end of requests for information, or erroneous deductions, from other organisations because the individual in question has not fully examined the information available to them. In too many testing cultures that I have encountered it is rare for the individual raising an issue to have made any attempt to apply a process of logical reasoning to establish what that information is telling them. With the cheapness and immediacy of communication, some find it easier to fire the immediate 'end behaviour' of issues over to others rather than taking the time to establish the facts themselves.

One of the things that I love about my organisation is that I, and the people that I work with, will always strive to fully understand each new situation and use the information available, and our own powers of logical reasoning, to our best advantage to achieve this. Sadly we'll never have Sherlock Holmes working for us as a software tester, but I believe that having a culture of attempting to use the same skills that make the fictional detective so famous is the next best thing.

Sunday, 1 December 2013


Potential and Kinetic Brokenness - Why Testers Do Break Software

I'm writing this to expand on an idea that I put forward in response to a twitter conversation last week. Richard Bradshaw (@friendlytester) stated that he disliked saying that testers "break software" as the software is already broken. His comments echo a recent short blog post by Michael Bolton, "The Software is already broken". I know exactly what Richard and Michael are saying. Testers don't put problems into software, we raise awareness of behaviour that is already there. It sometimes feels that the perception of others is that the software is problem free until the testers get involved and suddenly start tearing it apart like the stereotypical beach bully jumping on the developers' carefully constructed sandcastles.

I disagree with this statement in principle, however, as I believe that breaking software is exactly what we do...

Potential and Kinetic Failure

I'm not overly keen on using the term 'broken' in relation to software as it implies only two states - in all but the simplest programs, 'broken' is not a bit. I'm going to resort here to one of my personal dislikes and present a dictionary definition - what I believe to be the relevant definition of the word 'break' from the Oxford Dictionary:-

Break: make or become inoperative: [with object] 'he's broken the video'

The key element that stands out for me in this definition is the "make or become" - the definition implies that a transition of state is involved when something breaks. The software becomes broken at the point when that transition occurs. I'd argue that software presented to testers is usually not inoperative to the extent that I'd describe it as broken when we receive it for testing. I believe that a more representative scenario is that the basic functionality is likely to work, at least in a happy path scenario and environment. The various software features may be rendered inoperative through the appropriate combination of environment changes, actions and inputs. In the twitter conversation I likened this to energy:-

It's like energy. A system may have high 'potential brokenness', testers convert to 'kinetic brokenness'

What we do in this case is search for the potential for the system to break according to a relevant stakeholder's expectations of and relationships with the product. In order to demonstrate that this potential exists we may need to force the system into that broken state, thereby turning the potential into what could be described as kinetic failure. Sometimes this is not necessary, as simply highlighting the potential for a problem to occur can be sufficient to demonstrate the need for rework or redesign, but in most cases forcing the failure is required to identify and demonstrate the exact characteristics of the problem.

Anything can be broken

In the same conversation I suggested that:-

Any system can be broken, I see #testing role to demonstrate how easy/likely it is for that to happen.

With a sufficient combination of events and inputs, pretty much any system can be broken in the definitive sense of being 'rendered inoperative' - for example if we take the operating factors to extremes of temperature, resource limits, hardware failure or file corruption. I suggest that the presence or not of bugs/faults depends not on the presence of factors by which the system can be broken, but on whether this combination falls within the range of operation that the stakeholders want or expect it to support. As I've written about before in this post, bugs are subjective and depend on the expectations of the user. Pete Walen (@PeteWalen) made this same point during the same twitter conversation:-

It may also describe a relationship. "Broken" for 1 may be "works fine" for another; Context wins

The state of being broken is something that is achieved through transition, and is relative to the expectations of the user and the operating environment.

An example might be useful here.

A few years ago I had my first MP3 player. When I received it, it worked fine: it uploaded and played my songs and I was really happy with it. One day I put the player in my pocket with my car keys and got into my car. When I took the player out of my pocket the screen had broken. On returning it to the shop I discovered that the same thing had happened to enough people that they'd run out of spare screens. I searched the internet and found many similar examples where the screen had broken in bags or pockets. It seems reasonable that if you treat an item carelessly and it breaks then that is your responsibility, so why had this particular model caused such feedback? The expectation amongst the experiences that I had read was that the player would be as robust as other mobile electronic devices such as mobile phones or watches. This was clearly not the case, which is why its breaking in this way constituted a fault. I've subsequently had a very similar MP3 player which has behaved as I would expect and stood up to the rigours of my pockets.

  • So was the first player broken when I got it? No. It worked fine and I was happy with it.
  • Who broke the first mp3 player? I did.
  • Was the first player broken for everyone who bought it? - No. My model broke due to the activity that I subjected it to. I'm sure that many more careful users had a good experience with the product.
  • Was the second player immune to breaking in this way? - No. I'm pretty sure that if I smacked the one I have now with a hammer the screen would break. But I'm not planning to do that.

The difference was that the first player had a weaker screen and thereby a much higher potential for breaking, such that it was prone to breaking well within the bounds of use expected by most users. This constituted a fault from the perspective of many people, and could have been detected through the appropriate testing.

Any technology system will have a range of operating constraints, outside the limits of which it will break. It will also have a sphere of operation within which it is expected to function correctly by the person using it and the operating environment. If the touch screen on the ticket machine in this post had failed at -50 degrees Celsius I wouldn't have been surprised and would certainly not have written about it. The fact that it ceased working at temperatures between -5 and 0 degrees is what constituted a breakage for me, given the environment in which it was working. It wasn't broken until it got cold. It possessed the potential to break given the appropriate environmental inputs, which manifested itself in 'kinetic' form when it was re-installed outside and winter came along. Interestingly it had operated just fine inside the station for years, and would not have been broken in this way if it had not been moved outside.

Taking a software example, a common problem for applications arises when a specific value is entered into the system data. An application which starts to fail when a surname containing an apostrophe is entered into the database becomes 'broken' at the point that such a name is entered. If we never desire to enter such a name then it is not a problem. At the point that we enter such data into the system we change the state and realise the potential for that breakage to occur. Testers entering such data and demonstrating this problem are intentionally 'breaking' the software in order to demonstrate that the potential exists in live use, so that the decision can be made to remove that potential before it hits the customers.
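As a minimal sketch of this (using an in-memory sqlite3 database purely for illustration, not any system from the stories above), a query built by naive string formatting breaks the moment a surname like O'Brien is entered, while the parameterised equivalent handles the same input happily:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (surname TEXT)")

surname = "O'Brien"

try:
    # The apostrophe terminates the string literal early and the statement fails.
    conn.execute("INSERT INTO staff (surname) VALUES ('%s')" % surname)
except sqlite3.OperationalError as error:
    print("Potential brokenness realised:", error)

# The parameterised version accepts the same name without complaint.
conn.execute("INSERT INTO staff (surname) VALUES (?)", (surname,))
print(conn.execute("SELECT surname FROM staff").fetchall())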

State change can occur outside the software

You could argue that not all bugs include a change of state in the software as I describe above. What about the situation, for example, where a system will simply not accept a piece of data, such as a name with an apostrophe, and rejects it with no change of state in the software before or after this action? Surely then the software itself was already broken?

In this situation I'd argue that the change of state occurred not in the software itself but in its operating environment. A company could be using such an application internally for years without any issues until they hire an "O'Brien" or an "N'jai". It is at the point at which this person joins the company and someone attempts to enter that employee's details that the state of the software changes, from "can accept all staff names" to "cannot accept all staff names", and breaks. Given that testers are creating models to replicate possible real world events in order to exercise the application and obtain information on how it behaves, the point at which we add names containing apostrophes to our test data and expose the application to these is the point at which we 'break' the software and realise the potential for this problem to occur.

As well as event based changes such as this, breakages can also occur over time through a changing market or user environment, and our lack of response to it. To take a similar example with Hangul characters, the potential for breaking increases dramatically the moment that our software starts being used in eastern markets. I won't expand on this subject here as I covered it previously in this post.

So Testers Do Break Software

I can understand why we'd want to suggest that the software was already broken. From a political standpoint we don't want to be seen as the point at which it broke. I think that saying it was already broken can have political ramifications in other ways too, such as with the development team. I'd argue that when we receive software to test it is usually not 'broken' in the sense that it has been rendered inoperative, and suggesting that it was may affect our relationships with the people coding it. Instead I think a more appropriate way of looking at it is that it possesses the potential to break in a variety of ways and that it is our job to come up with ways to identify and expose that potential. If we need to actually 'break' the software to do so then so be it. We need to find the limits of the system and establish whether these sit within or outside the expected scope of use.

If we have a level of potential breakability looming in our software as precariously as the rock in the picture above, then it is the tester's job to ensure that we 'push it over' and find out what happens, because if we don't exercise that potential then someone else will.
