Tuesday, 29 May 2012

Spot the Difference - using DiffMerge on log files to investigate bugs

I've been meaning for a while to publish a blog post on tools that I find useful in my day-to-day testing. This idea was brought back to the forefront of my mind recently when reading this post by Jon Bach on an exchange of tool ideas between Jon and Ajay Balamurugadas. The problem is that there are so many great tools that help me in my testing, and I'd want to do justice to them all with a good level of detail, as I did with WinSCP and Notepad++ in this post. I've started to compile a summary list, which I'll hopefully post sometime soon, but in the meantime I thought I'd do a more detailed post on a fantastic tool by SourceGear called DiffMerge, and how I use it to investigate software problems through the comparison of log files between systems.

There are a number of diff tools available. The reason I choose DiffMerge is that it has the nicest interface I've found for visually representing differences between files in a clean and simple manner. It is this visual element that is most important for my use case. It is not immediately obvious why someone working primarily with big data, manipulating huge log files and SQL via command line interfaces, would worry too much about visual elegance. But as my data warehouse tester friend Ranjit wrote in his blog, being able to visualise key characteristics of information is a critical skill in testing in our domain, and here is a great technique to demonstrate it.

An inaccessible problem


Both through my testing work and my role running technical support, I am often faced with investigating why certain operations that work fine in one context fail to do so in another. The classic "it works on my machine" scenario is a common example. If the installation appears to be sound and the cause of the issue is not obvious, then it can be difficult to know where to look when trying to recreate and diagnose problems. This is the situation I found myself in a while ago, when someone found an issue trying to use our software via an Oracle Gateway interface and had encountered an error returning LONG data types. Having set up a similar environment myself a few weeks before, I was well placed to investigate. Unfortunately, after checking all of the configuration files and parameters that I knew could cause issues if not set correctly, I was no nearer to the cause of the problem. I had requested and received full debug tracing from the Oracle gateway application, however I was not overly familiar with the structure of the logs, which contained a number of bespoke parameters and settings.


Staring at the log file was getting me nowhere. I knew that my own test environment could perform the equivalent operation without issue - it worked on my machine. Figuring that I could at least take the logs from my healthy system to refer to, I performed the same operation, collected the trace files, then compared the two equivalent logs in DiffMerge.
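As an aside, if a graphical diff tool isn't to hand, a rough equivalent of this side-by-side comparison can be produced with Python's standard difflib module. A minimal sketch, with the trace file names below being hypothetical placeholders rather than real gateway trace names:

    import difflib

    # Hypothetical file names - substitute the two equivalent trace files.
    with open("customer_gateway.trc", encoding="utf-8", errors="replace") as f:
        customer_log = f.readlines()
    with open("local_gateway.trc", encoding="utf-8", errors="replace") as f:
        local_log = f.readlines()

    # HtmlDiff renders a side-by-side table with differences highlighted,
    # roughly analogous to DiffMerge's two-pane view.
    report = difflib.HtmlDiff(wrapcolumn=100).make_file(
        customer_log, local_log,
        fromdesc="customer system", todesc="my system",
    )

    with open("gateway_diff.html", "w", encoding="utf-8") as out:
        out.write(report)

Opening the resulting HTML file in a browser gives a two-pane view of the differences, although for large traces a dedicated tool like DiffMerge is far more responsive.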


I was unsure about how much useful information this would provide, or even whether the logs would be comparable at all, however the initial few rows looked promising. A quick examination showed that:
  • The logs were from equivalent systems
  • The logs were comparable
At this point my flawed but still amazing human brain came into its own. The tool was showing differences on almost every line between the two files, yet my brain was quickly picking out patterns among them. Within a few seconds I was able to identify and categorise the common patterns of difference: date stamps, database names and session IDs were easily spotted and mentally filtered out. Very quickly I was scanning through the files, able to ignore the vast majority of the differences being shown.

The first few pages revealed nothing interesting, then my eye was drawn to some differences that did not conform to any of the mental patterns I'd already identified.



It was clear that some of the settings being applied in the gateway connection were different to those in my installation. Although I was not familiar with the keys or values, I could make a reasonable inference that these settings related to the character sets in the connection. Definitely something worth following up on - I noted it and pressed on.

I rattled through the next couple of hundred lines, my brain now easily identifying and dismissing the repeated date and connection ID difference patterns. I could, with a little effort, have written a script to abstract these out. I decided this wasn't necessary given how easy it was to scan through the diffs, and it also left me in a position to spot any change in the use or pattern of these values.
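For anyone who does want to go down that route, the idea is simply to mask the volatile fields before comparing. A minimal sketch in Python - the regular expressions here are hypothetical, and a real trace format would need its own patterns:

    import re
    import sys

    # Hypothetical patterns for the fields that differ on almost every line.
    MASKS = [
        (re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+"), "<TIMESTAMP>"),
        (re.compile(r"session id=\d+", re.IGNORECASE), "session id=<SID>"),
        (re.compile(r"\bDB_[A-Z0-9_]+\b"), "<DBNAME>"),
    ]

    def normalise(path):
        """Return the log lines with volatile fields replaced by fixed placeholders."""
        lines = []
        with open(path, encoding="utf-8", errors="replace") as f:
            for line in f:
                for pattern, placeholder in MASKS:
                    line = pattern.sub(placeholder, line)
                lines.append(line)
        return lines

    if __name__ == "__main__":
        # Usage: python normalise_log.py trace.log > trace.normalised.log
        sys.stdout.writelines(normalise(sys.argv[1]))

Diffing the normalised copies leaves only the meaningful differences highlighted, at the cost of no longer being able to spot a change in the pattern of the masked values themselves - which is exactly why I chose not to bother here.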


The next set of differences was interesting again - it showed some settings being detected or applied for the data columns in my query. Again the differences that caught my eye appeared to relate to the character set (CSET) of the columns.


A scan to the end of the file revealed nothing else of interest, so I turned to the character set changes. With an area of interest to focus on, I was able to recreate the issue very quickly by altering the character set settings of the Oracle database; these, I discovered, were passed by default into the gateway and resulted in the incorrect language settings being used on the ODBC connection. I verified that my new settings exhibited the same behaviour by re-diffing the log files from my newly failing system against the ones I'd been given. A bit of internet research revealed how to explicitly configure the Gateway connection language settings to force the correct language, and the problem was resolved.

An addition to my testing toolkit


I've used this technique many times since, often to great effect. It is particularly useful in testing to investigate changes in behaviour across different versions of software, or when faced with a failing application or test that has worked successfully elsewhere. It is similarly effective for examining why an application starts failing when a certain setting is enabled - only today I was using it to examine a difference in behaviour when a specific flag was enabled on the ODBC connection.

Give it a go - you might be surprised at how good your brain is at processing huge volumes of seemingly inaccessible data once it is visualised in the right way.

Monday, 21 May 2012

Automation and the Observer Effect, or Why I Manually Test My Installers

As anyone who has read my blog before will know, I make extensive use of automation in the testing at my organisation. I believe that the nature of the product and the interfaces into it make this a valid and necessary approach, when backed up with appropriate manual exploration of new features and risk areas. There is one area of functionality, however, that I ensure always undergoes manual testing with every new release, and that is the installer packages. I've had occasion to defend this approach in the past, so I thought I'd share a great example that highlights why I think it is so important, whilst also providing some excellent examples of the observer effects that can be particularly apparent in test automation.

Still waters run ... slowly


As part of our last release, one of my team was testing the installers on Linux and found that it was taking an inordinately long time to install the server product - in one test it took him 15 minutes. The programmers investigated and found that a new randomisation library used to generate keys was relying on machine activity to provide the randomisation. On a system with other software running, the random data was generated very quickly. On a quiet machine with no activity other than the user running the installer, it could take minutes to generate enough random data to complete the process. By its very nature, our automated testing had not uncovered the problem, as the monitoring harnesses were generating sufficient background activity to feed the random data and keep the install time down. Through manual testing on an isolated system we uncovered an issue which could otherwise have seriously impacted customers' first impressions of our software.
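For the curious, the underlying effect is easy to see for yourself on an older Linux kernel, where reads from /dev/random block until the kernel has gathered enough entropy from hardware events. The snippet below is only a rough illustration of that behaviour - it is not the key-generation library our installer used, and on recent kernels /dev/random no longer blocks in this way:

    import time

    # Request 64 bytes from /dev/random and time how long it takes. On a busy
    # machine this returns almost instantly; on an idle machine with an older
    # kernel it can stall until enough entropy has been collected.
    start = time.time()
    with open("/dev/random", "rb") as dev_random:
        data = dev_random.read(64)
    print("Read {0} bytes in {1:.2f}s".format(len(data), time.time() - start))

Run it on a freshly booted, idle virtual machine and the read can take a surprisingly long time - the same kind of entropy starvation our installer was hitting.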

This is a great example of the observer effect, most commonly associated with physics but applicable in many fields, notably psychology and information technology: the act of observing a process can affect the actual behaviour and outcome of that process. In another good example, earlier this year we had a problem reported by a customer using an old version of one of our drivers, complaining about missing library dependencies. It turned out that the tool that had been used to test the successful connectivity of the installation actually placed some runtime libraries on the library path that were needed for the drivers to function, but were not included in the install package. The software used to perform the testing had changed the environment sufficiently to mask the true status of the system. Without the tool and its associated libraries, the drivers did not work.

Such observer effects are a risk throughout software testing efforts, where the presence of observing processes can mask problems such as deadlocks and race conditions by changing the execution profile of the software. The problem is particularly apparent with test automation, due to the presence of another software application which is accessing and monitoring exactly the same resources as the application under test. The reason I'm discussing observer effects specifically in a post on installers is that I've found this to be the area where they can be most apparent. Software installation testing is by its nature particularly susceptible to environmental problems, and the presence of automation products and processes can fundamentally change the installation environment. Relying on automation alone to perform this kind of testing seems particularly risky.

Falling at the first


The install process is often the "shop window" of software quality, as it provides people with their first perception of working with your product. A bad first impression on an evaluation, proof of concept or sales engagement can be costly. Even when the severity of the issue is low, the impact in terms of customer impressions at the install stage can be much higher. If your install process is full of holes then this can shatter confidence in what is otherwise a high quality product. You can deliver the best software in the world but if you can't install it then this gets you nowhere.

This week I was using a set of drivers from another organisation as part of my testing. The Unix-based installers worked fine, however the Windows installer packages failed to install, throwing an exception. It was clear that the software had come out of an automated build system and no-one had actually checked that the installers worked. No matter how well the drivers themselves had been tested, I wasn't in a position to find out, as I couldn't use them. My confidence in the software had also been shattered by the fact that the delivery had fallen at the first hurdle.

I can't claim that our install process has never had issues, however I do know that we've identified a number of problems when manually testing installations that would otherwise have made it into the released software. I've also seen install issues from other software companies that I know wouldn't have happened for us. Reports from our field guys are that in most cases our install is one of the easier parts of any integrated delivery, which gives me confidence that the approach is warranted. Every hour spent on testing is an investment, and I believe that making that investment in the install process is money very well spent.

Image: http://www.flickr.com/photos/alanstanton/3339356638/
