Tuesday 29 May 2012

Spot the Difference - using Diffmerge on log files to investigate bugs

I've been meaning for a while to publish a blog post on tools that I find useful in my day to day testing. This idea was brought back to the forefront of my mind recently when reading this post by Jon Bach on and exchange of tool ideas between Jon and Ajay Balamurugadas. The problem that I have is that there are so many great tools that help me in my testing and I'd want to do justice to them all with a good level of detail as I did with WinSCP and NotePad++ in this post. I've started to compile a summary list, which I'll hopefully post sometime soon, but in the meantime I thought I'd do a more detailed post on a fantastic tool by SourceGear called DiffMerge, and how I use it to investigate software problems through the comparison of log files between systems.

There are a number of diff tools available. The reason that I choose to use DiffMerge is that I find it has the nicest interface I've found for visually representing differences between files in a clean and simple manner. It is this visual element that is most important for my use case. It is not immediately obvious why someone working primarily with big data, manipulating huge log files and SQL via command line interfaces, would worry too much about visual elegance. As a data warehouse tester friend of mine Ranjit wrote in his blog - being able to visualise key characteristics of information gives is a critical skill in testing in our domain, and here is a great technique to demonstrate this.

An inaccessible problem

Both through my testing work and my role running the technical support I am often faced with investigating why certain operations that work fine in one context will fail to do so in another. The classic It works on my machine scenario is a common example. If the installation appears to be sound and the cause of the issue is not obvious, then it can be difficult to know where to look when trying to recreate and diagnose problems. This is the situation that I found myself in a while ago when a someone found an issue trying to use our software via an Oracle Gateway interface and had encountered an error returning LONG data types. Having set up a similar environment myself a few weeks before I was well placed to investigate. Unfortunately after checking out all of the configuration files and parameters that I knew could cause issues if not set correctly, I'd not got any nearer to the cause of the problem. I had requested and received full debug tracing from the Oracle gateway application, however I was not overly familiar with the structure of the logs, which contained a number of bespoke parameters and settings.

Staring at the log file was getting me nowhere. I knew that my own test environment could perform the equivalent operation without issue - it worked on my machine. Figuring that I could at least take the logs from my healthy system to refer to , I performed the same operation, got the trace files then compared the two equivalent logs in DiffMerge

I was unsure about how much useful information this would provide, or even whether the logs would be comparable at all, however the initial few rows looked promising. A quick examination showed that
  • The logs were from equivalent systems
  • The logs were comparable
At this point my flawed but still amazing human brain came into its own. The tool was showing me that there were differences on almost every line between the two files, yet my brain was quickly identifying patterns of differences. Within a few seconds I was able to identify and categorise common patterns of differences: date stamps; database names and session ids were easily spotted and filtered out. Very quickly I was scanning through the files, able to ignore the vast majority of differences being shown.

The first few pages revealed nothing interesting, then my eye was drawn to some differences that did not conform to any of the mental patterns I'd already identified.

It was clear that some of the settings being applied in the gateway connection were different to my installation. Although not familiar with the keys or values, I could make a reasonable inference that these settings related to the character sets in the connection. Definitely something worth following up on - I noted it and pressed on.

I rattled through the next couple of hundred lines, my brain now easily identifying and dismissing repeated date and connection id difference patterns. I could, with a little effort, have written a script to abstract these out. I decided that this wasn't necessary given how easy it was to scan through the diffs, and I also was still in a position to spot a change in use or pattern of these values.

The next set of differences was interesting again - it showed some settings being detected or applied for the data columns in my query. Again the differences that caught my eye appeared to relate to the character set (CSET) of the columns.

With a scan to the end of the file revealing nothing further that appeared interesting, I investigated the character set changes further. Having an area of interest to focus on I was able to very quickly recreate the issue by altering the character set settings of the Oracle database, which I discovered were then passed by default into the gateway and resulted in the incorrect language settings being used on the ODBC connection. I verified my new settings exhibited the same behaviour by re-diffing the log files from my newly failing system against the ones I'd been given. A bit of internet research revealed how to explicitly configure the Gateway connection language settings to force the correct language, and the problem was resolved.

An addition to my testing toolkit

I've used this technique many times since, often to great effect. It is particularly useful in testing to investigate changes in behaviour across different versions of software , or when faced with a failing application or test that has worked successfully elsewhere. It is similarly effective for examining why an application starts failing when a certain setting is enabled - only today I was examining a difference in behavior when a specific flag was enabled on the ODBC connection using this technique.

Give it a go - you might be surprised at how good your brain is at processing huge volumes of seemingly inaccessible data once it is visualized in the correct way.
Joe said...

Nice article, Adam!

I like the way you take the reader step-by-step through your process.

And while I use WinMerge for similar reasons (http://www.allthingsquality.com/2010/06/blink-tests-in-blink.html), I agree with you that nice clean visualization of differences is an important attribute for this sort of tool.

Adam Knight said...

Thanks Joe for taking the time to read and comment. I really like your post - I'd not come across Baretail before - I work a lot in Linux and use "tail -f" in a similar way. I will certainly be adding BareTail to my toolkit, and it is great to see that you use WinDiff in much the same way that I describe here.



Whatsapp Button works on Mobile Device only

Start typing and press Enter to search