Wednesday, 28 May 2014

The Friday Puzzle


Every Friday I put together an email entitled the 'Friday Report' which I send round to all of my teams. In it I include relevant articles and blog posts of interest to the various roles within the teams, including testing, documentation and support. As well as these links I also try to include a puzzle for people to tackle.

The original intention of including a puzzle was, as one might expect, as a bit of fun to encourage people to discuss the problem and find a solution. As I researched and presented more puzzles I started to see the scope for a less obvious, yet potentially more valuable benefit to these. I found that the most interesting discussions around the puzzles arose when individuals move beyond searching for the expected answer and start to apply critical thinking around the premises of the puzzles themselves. Through some simple questions I've been able to encourage some excellent debate around the puzzles themselves, calling into question the validity of the assumptions, the integrity of the constraints, and the viability of the solutions themselves.

  • Is the puzzle logically sound?
  • Is the solution "on the card" the only one that fits?
  • Is it possible to come up with a different solution if we remove the assumed constraints?

Puzzles share many common characteristics with the situations that we are presented with working in a software company. They are usually based around some simplistic model of the world and the individuals within it that restricts behaviour, in a similar way that behaviour is modelled when developing software solutions. Puzzles may be presented with a limited set of information on the scenario in question, just as problems requiring diagnosis are rarely presented with all of the salient facts up front. Puzzles are presented with carefully chosen language to lead or mislead, just as customers, suppliers and other parties we interact with may sometimes use language to define their problems with the intention of leading us to the conclusions that they feel are relevant. In questioning the puzzles we are practising valuable skills in critical thinking that can be applied directly in our working contexts to ensure we maintain an open mind and consider options beyond the obvious ones presented to us.

I'm really pleased with how enthusiastically everyone tackles these puzzles. On most Fridays now the discovery of the given answer is only the beginning of an excellent display of questioning and critical thinking around the problem. I've included here a few choice examples of puzzles and subsequent discussions that I think exhibit the exact characteristics that I'm really hoping for in this kind of exercise.

Choose your execution

This was the first puzzle in which I really started to see and introduce the idea of questioning the integrity of the puzzle itself as opposed to simply looking for the 'correct' answer. The interesting thing here is that the basis of the puzzle is that the prisoner escapes through manipulating the logic of his situation, however for me the language of the puzzle scenario provided scope for exploring other ideas.

Puzzle: General Gasslefield, accused of high treason, is sentenced to death by the court-martial. 
He is allowed to make a final statement, after which he will be shot if the statement is false 
or will be hung if the statement is true. Gasslefield makes his final statement and is released. 
What was his statement?
  • A: I think the General said “I’m going to be shot”.
  • Me: Well done, that’s the solution given. I personally would not have let him go – anyone see a way round letting him go?
  • J: Yes – I don’t see any reason why the alternative to “shot” or “hung” had to be “set free”. That’s not stated anywhere. It could have been continued incarceration – or some other method of execution. Anyway, given the statement, as soon as you take one or other of the two prescribed actions, either the statement or the action becomes invalid under the rules. If you don’t take an action, then we don’t yet know whether the statement is true or false – so he could continue to be held.
  • Me: They say they’ll shoot him if his statement is false but don’t explicitly say they won’t shoot him if it is true (i.e. it is an OR not an XOR) I’d have hung him then shot him.
  • S: Or may be shoot him on the leg and then hang him.

The Smuggler

This is a great example of something that is seen a lot in software circles. A solution that, on first examination, appears plausible, but that does not stand up well to closer scrutiny considering the practicalities of the situation.

A man comes up to the border of a country on his motorbike. He has three large sacks on his bike. The customs officer at the border crossing stops him and asks, “What is in the sacks?”
“Sand,” answered the man.
The guard says, “We’ll see about that. Get off the bike.”
The guard takes the sacks and rips them apart; he empties them out and finds nothing in them but sand. He detains the man overnight and has the sand analysed, only to find that there is nothing but pure sand in the bags. The guard releases the man, puts the sand into new bags, lifts them onto the man’s shoulders and lets him cross the border.
A week later, the same thing happens. The customs officer asks, “What have you got?”
“Sand,” says the man.
The officer does another thorough examination and again discovers that the sacks contain nothing but sand. He gives the sand back to the man, and the man again crosses the border.
This sequence of events repeats every day for the next three years. Then one day, the man doesn’t show up. The border official meets up with him in a restaurant in the city. The officer says, “I know you’re smuggling something and it’s driving me crazy. It’s all I think about. I can’t even sleep. Just between you and me, what are you smuggling?”
What was he smuggling?
  • A: Smuggling gold dust in the sand?
  • L: I think he’s smuggling motorbikes.
  • Me: That is the 'answer on the card' - he was smuggling motorbikes. Any flaws in the puzzle?
  • O: Finding pure sand seems amiss as sand has absorbing properties. The analysis should have uncovered the sand plus all that it had absorbed e.g. water, sack fibre etc.
  • N: Surely the border guard would have noticed that it was a different bike each time?
  • S: could be the same model/colour with false number plates?
  • A: The official must have been suspicious if he was riding what appeared to be the same bike for 3 years but not showing any signs of wear and tear
  • S: He drove at night time; with poor lighting :)
  • Me: If he was smuggling motorbikes into the country - how did he get back? Would the lack of a motorbike on the return journey not have triggered some suspicion?
  • N: Maybe he smuggled cars or bicycles in the opposite direction? Or maybe he came back by bus or train.
  • S: The customs officer won’t be working 24 hours, so on his return on foot, a different officer is working and always sees him on foot.
  • Me: So the guy goes across every day and the customs officer is convinced that he's smuggling something to the extent that it is "driving him crazy" - but he doesn't tell his colleagues to check how the guy gets back and what he has with him? I wouldn't want to hire that customs officer for anything important.
  • S: We know little about him coming back – maybe he is doing the same in reverse…..(different model/make) and the customs officers did talk about it, but the ‘crazy’ one assumed it would be the same bike.

Predicting the score

This is a great example of a puzzle that is trying so hard to be clever that it has omitted the glaringly obvious: that the simplest answer may just be correct. I've also included it for a great piece of lateral thinking in the final response, which is actually more pleasing than the rather unsatisfactory answer on the card.

Bill bets Craig £100 that he can predict the score of the hockey game before it starts. 
Craig agrees, but loses the bet. Why did Craig lose the bet?
  • S: Because the score of a hockey game before it starts is 0:0
  • Me: That's the answer on the card - everyone happy with that?
  • W: Because Bill guessed the correct score.
  • Me: That is some seriously twisted lateral thinking.
  • J: You can predict anything. Bill predicted the score. He didn't bet he would predict it correctly.

The Woman with half boys

This is a great example of where the language of the puzzle included unintended ambiguity. The answer on the card is based on the unsatisfactory premise that you can have half a boy. In this case, as with the previous one, I think that the team came up with a far more elegant solution than the answer on the card.

 A woman has 7 children, half of them are boys. How can this be possible?
  • A: Maybe “half” means at least half, in which case she has at least 4 boys. Maybe she is expecting another one but doesn’t know the gender, so 4 of the 7 are boys, 3 are girls, and the one on its way could be either.
  • J: The statement is only that half are boys, there's no saying what the other half are - they could be boys too. If so, half are boys, a quarter are boys - any fraction. However, half of 7 of anything seems odd when the individual items are logically indivisible. Of course it could be in a specific application of the term "boy". So if one of the woman's children was a man or a woman rather than a boy, that could leave 3 and 3 boys and girls. That seems really inconsistent, though - being pedantic about "boy" but loose with "children" (and indeed loose with "them").
  • J: The "them" could include the woman, at which point, there are 8 people and if 4 are boys we're there.
  • Me: Well done J for coming up with an answer which, in my opinion, is actually neater than the one on the card (which was that they were all boys)

Carrying Sacks

This one I chose specifically due to ambiguity in the question that I thought would prompt interesting debate. It is a poorly worded puzzle and the team had no problem in dismantling it, and interestingly using it as an opportunity to explore the socio-political implications of the problem scenario!

A farmer and his hired help were carrying grain to the barn. 
The farmer carried one sack of grain and the hired help carried two sacks. 
Who carried the heavier load and why?
  • A: The farmer… the hired help had empty sacks… well it doesn't specify there was grain in them.
  • N: Even if there WAS grain in the hired help’s sacks, it would depend on how much grain was in each sack. So it could be either of them.
  • S: "A farmer and his hired help were carrying grain to the barn." This suggests that both were carrying grain. We don’t know the size or content of the sacks, or the weight of an empty sack. As it stands I do not think we can tell.
  • A: My thoughts exactly. Or maybe they were carrying different types of grain that weighed different amounts. Or maybe one was carrying dry grain and one was carrying waterlogged grain. Without further information it’s all just guesswork!
  • L: I’d hope that if the farmer had paid the guy, the hired man would have been of some help so therefore should be carrying at least as much weight as the farmer.
  • N: Depends what he or she was hired for. It might have been to feed the chickens and collect their eggs.
  • Me: Well done to N and L for getting the answer on the card. I'm with S and N - the first sentence states that they were both carrying grain. The second does not exclude the possibility of the hired help carrying grain, so the question is ambiguous and unanswerable. For me the most sensible answer is the 'obvious' one that the hired help carried more as the probability is that the sacks were full and of equal size, and as L points out, he was hired to help not just mess about with sacks.
  • E: But isn’t it more of a political/economic/philosophical question? The “hired help” represents the oppressed masses, forced to sell their labour for little or no reward. The “farmer” represents the capitalist oppressor. The burden of the hired help will always be the greater, in the exploitative capitalist society in which the farmer and the hired help notionally co-habit. My only question is why the farmer is carrying anything at all.

Riding into the Forest

Another one where the simplistic model of the problem doesn't stand up to practical considerations. Whilst something of an easy target, I really like the answers here.

How far can a horse run into a forest?
  • N: The answer to the puzzle is “half way”. Any further and the horse will be running OUT of the forest.
  • M: Depending on the density of the forest, I’d have to say “as far as the first tree”
  • Me: N has the answer on the card, although M has applied some practical considerations to the issue….any other thoughts?
  • S: With regard to N’s answer, that assumes the horse is running in a straight line. It might be going in circles. A few answers from me:
    • As soon as it crosses the boundary into the forest, it is in, so it can’t run into it any further.
    • Until it stops running
    • Until it runs out of the forest
  • N: If the horse was riderless, it probably wouldn’t run into the forest at all. It would run over open ground if it could – less hazardous!
  • L: I agree with N - if a horse was riderless, it would not run through a forest. It would walk. Unless it was being chased, at which point I guess the answer is ‘until it got eaten’.

A Valuable Distraction

Whilst you could argue that these exercises distract from 'real' work I think that they are invaluable in promoting team debate around a subject and, more importantly, engender a team culture of collaborative criticism. I believe in encouraging an environment where it is acceptable to question the invisible boundaries that may have been established. Whether through our own assumptions or the dictate of others, we are often faced with situations where we are introduced to scenarios with pre-established constraints, and the easiest option is usually to accept these and operate within them. I hope that practising the process of questioning the scope and boundaries of every situation helps to ensure that we don't just blindly accept the information presented to us. I don't want to work with testers who only look for problems in the functionality presented to them, I want to work with testers who question the functionality itself. I want to work with support agents who question the conclusions that customers have reached about what has caused the problems they are facing. I want to work with technical authors who question behaviour and how it fits with the product set as a whole rather than simply documenting what has been done. Luckily for me that is exactly what I have, something I'm reminded of every Friday.


Some nice puzzle links can be found here:-


Saturday, 10 May 2014

Testing with International Data

I've recently been working on a new database utility that my company are releasing, helped by our intern-turned-employee Tom. One of the areas that I've been focussing on, for the first time in a while, is testing support for international characters and different character encodings. Tom is new to testing of this nature and working with him on this has reminded me of two things: firstly how difficult testing support for international characters can be, and secondly how enjoyable it is.

Tackling the challenge of ensuring data preservation and consistency across different data applications is a tricky job that I've seen trip many people up. In that difficulty comes a great deal of satisfaction when you understand behaviours and identify problems. I thought that, while working in this area, it might be useful to document some of the tips and gotchas that I have encountered.

Don't Trust Characters

The single biggest mistake that I've seen people make working on data verification is to base their deductions on the characters that are presented to them by their application interfaces. I have found the characters that are presented in an application interface to be an extremely unreliable means of checking the data that has been transferred or stored in a data system.

The reason is simple. Data is not stored or transferred as characters; it is stored and transferred as streams of bytes. The decision on how to present those bytes as characters is typically only made at the point of presenting the data through an interface. The thing is, much as the metaphorical tree in the forest, the data won't actually 'appear' unless someone views it, and in order to do that it must be mapped to characters. The decision on how to present a byte or set of bytes as a character will depend on the encoding that the application is using, and that can vary between applications and interfaces.

This exposes the tester to two fundamental characteristics of character data which, in my opinion, are the most important to understand when working with data and systems of this nature:

1. The same data can appear as different characters when presented using a different encoding.

2. Different data can appear the same when presented using a different encoding.
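Both characteristics are easy to demonstrate for yourself. Here is a small Python sketch (Python is simply a convenient tool here; any language with encoding support would do) that decodes the same byte under two different single byte encodings, then encodes the same character under three different encodings:

```python
# Characteristic 1: the same byte appears as a different character
# depending on the encoding used to present it.
raw = bytes([0xE4])
print(raw.decode("iso-8859-1"))   # 'ä' - Latin-1
print(raw.decode("iso-8859-5"))   # 'ф' - Cyrillic

# Characteristic 2: the same character is stored as different bytes
# depending on the encoding used to store it.
char = "é"
print(char.encode("iso-8859-1"))  # b'\xe9'     - one byte
print(char.encode("utf-8"))       # b'\xc3\xa9' - two bytes
print(char.encode("utf-16-le"))   # b'\xe9\x00' - two bytes, different again
```

Neither representation is 'wrong'; each is only meaningful relative to the encoding used to interpret it.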

What really confuses the matter is that the decision on which encoding to use is not made in one place, it will depend on factors internal to the application itself and in its operating environment. These include the data, the operating system, the database settings, the application settings and even the method of connection.

All about encodings

I'm not going to cover character sets and encodings in detail here as it is much too broad a subject. I will present some of the key encodings and their characteristics, but I suggest that anyone working on a data system reads up and gains a solid understanding of character encodings. Character encodings are a mechanism to map stored information to specific characters. Each encoding discussed here maps a byte or set of bytes to the code points in the encoding, each of which typically then represents a character, punctuation mark or control code.

Single byte encodings.

Single byte encodings are those in which a single byte is used to represent any individual character or control code. Hence single byte encodings are limited to the range of unique single bytes, at most 256 characters. Most are based around using the 7-bit Ascii encoding to map the byte range 0-127 ( or 0x00 to 0x7F ) to the basic Latin characters, numerics and punctuation required for the English language. Clearly there were many languages based on other accents, characters and alphabets, and different encodings evolved around the use of 8 bits, or 1 byte, which support different ranges of characters in the 128-255 ( 0x80 to 0xFF ) range in an attempt to support these. Many different single byte encodings exist which map this small range of possible single byte 'extended' code points to different characters; ISO-8859-1 and the Windows code pages are common examples that I've encountered.
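A quick way to see how the 'extended' range diverges between single byte encodings is to decode the same high byte under a few of them. Here in Python, whose standard library ships with codecs for the common ISO-8859 and Windows code pages:

```python
# The lower half (0x00-0x7F) is Ascii-compatible in all of these,
# so byte 0x41 is 'A' everywhere...
for enc in ("iso-8859-1", "iso-8859-5", "cp1252"):
    print(enc, bytes([0x41]).decode(enc))   # always 'A'

# ...but the upper half (0x80-0xFF) is where they diverge:
print(bytes([0xC0]).decode("iso-8859-1"))   # 'À' - Latin capital A with grave
print(bytes([0xC0]).decode("iso-8859-5"))   # 'Р' - Cyrillic capital Er
print(bytes([0x80]).decode("cp1252"))       # '€' - Windows-1252 Euro sign
```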

Unicode encodings

Unicode encodings differ from single byte ones in that they are designed to support a representation of every possible character in every language: Unicode defines a specific, unique code point for every character. In order to represent this many unique points it is clear that more than one byte per character is required, and different Unicode encodings exist to map multiple bytes to each Unicode code point.

  • UTF-8
  • This is the unicode encoding most used in Linux and Unix based operating systems. One of the greatest assets of UTF-8 can also prove to be a cause of great confusion when it comes to testing: the first 128 code points (0-127) are single byte, and match the Ascii single byte encoding exactly. This means that you can view data encoded in UTF-8 using a single byte encoding and much of the data will appear as you expect. It is only when code points above the 0-127 range are used that the confusion starts, as UTF-8 uses 2 or more bytes for all characters outside this range. Many times I've been presented with something like this

    UTF-8 data viewed as Ascii

    as an example of corrupt data, when in actual fact it was UTF-8 data being viewed in an application using a single byte encoding. (As a rule of thumb, if when looking at your data you see a lot of "Â" characters - you're viewing multi-byte data using an ISO-8859 encoding).
  • UTF-16
  • This is the encoding used for Unicode operations on Windows operating systems. Many Windows applications will use the term 'Unicode' to refer to UTF-16, which can be confusing if you are working across operating systems. UTF-16 uses at least 2 bytes, sometimes 4, for every code point. It is therefore incompatible with single byte encodings which, whilst sometimes making it less efficient, at least avoids confusion and ambiguity between UTF-16 data and single byte data.
  • UTF-32
  • I've not had a great deal of experience with UTF-32 - I know that it uses exactly 4 bytes for each code point, with no variability in the number of bytes used, but I've not encountered it much in commercial use.
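The "Â" rule of thumb for UTF-8 data viewed through a single byte encoding is easy to reproduce: take some text containing characters above the Ascii range, encode it as UTF-8, then decode the bytes as ISO-8859-1, just as a mis-configured application would. A Python sketch:

```python
text = "£10 for a café"
utf8_bytes = text.encode("utf-8")

# A mis-configured viewer decodes those UTF-8 bytes as ISO-8859-1:
# every two-byte UTF-8 sequence becomes two single byte characters,
# the first of which is usually 'Â' or 'Ã'.
garbled = utf8_bytes.decode("iso-8859-1")
print(garbled)   # 'Â£10 for a cafÃ©'
```

The data itself is perfectly intact; only the presentation is wrong.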

Given the many different encodings that can be used it is hardly surprising that confusion occurs. Whilst the increasing use of Unicode encodings, in preference to the legacy single byte encodings, is solving many of the problems involved with ambiguity in character encodings, anyone testing with legacy data or across multiple systems will need to maintain awareness of both the single-byte and Unicode encodings that are in play.

As I've mentioned, character encodings can be influenced by a number of factors. The primary one is the locale that is being used by the operating system or session. In Windows this is controlled via the "Control Panel - Region and Language" panel under "Language for non-Unicode programs". This setting controls the system locale which, amongst other things, sets the encoding used to interpret single byte data. Linux uses UTF-8 as its standard encoding and will typically, in my experience, be configured with a UTF-8 locale. As I describe above, this has the advantage that any data containing bytes solely in the Ascii range will be consistent with most single byte source encodings, but it can be confusing when dealing with single byte data from different locales e.g. data coming from other systems. The locale in Linux is controlled by the LC_* and LANG environment settings - a decent summary of which can be found here.
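When in doubt about what encoding a session will use, it is worth asking the environment directly rather than guessing. In Python, for example (one convenient cross-platform probe, not the only way):

```python
import locale
import sys

# The encoding the operating system locale reports for text I/O.
print(locale.getpreferredencoding(False))

# The encoding Python itself uses for str objects internally
# (always 'utf-8' in Python 3).
print(sys.getdefaultencoding())

# The encoding used when printing to this particular terminal -
# this can differ from both of the above.
print(sys.stdout.encoding)
```

Equivalent checks exist in most environments; the point is that these three values are set independently and need not agree.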

Let's look at some areas where I've encountered problems.

Don't trust applications

Application interfaces are the first place that we might look to verify whether the data that we expect is present, however these can be the least reliable. Application interfaces are likely to yield the least visibility and control over what you are seeing. The application may be using operating system settings or internal defaults to perform character mapping, and the absence of access to underlying data makes it hard to be sure what is stored.

In client-server or n-tier systems the problem is made worse due to the potential for both client and server side settings to be in play and for data transformations to be occurring in internal layers that you have no visibility of.

As I demonstrated in this post it can be hard to understand what's going on between the various interfaces of a data system. If the characters being presented are not what you expect then you might need to examine the data settings of the application and use application tracing to help understand the situation (of course, here you are beholden to the quality of that tracing).

Another thing that you need to bear in mind when viewing character data in an application is whether the font that is being used by that application is capable of displaying the characters represented by your data. Some fonts are limited to specific ranges of characters, with support for Chinese characters often missing due to the sheer number of characters required. If you are using an application that allows control over fonts then changing to a font that is appropriate for the data that you are viewing may be necessary to view the characters it represents.

When looking to test data systems to check the validity or conservation of data, the application interface can provide useful immediate feedback that something is wrong, however if you want to really see what is stored then getting access to the underlying data is essential.

Don't trust Databases

Accessing a database directly should be a much less ambiguous operation than working through an application interface, however it can still be confusing. I've seen a number of issues raised with our support teams resulting from confusion as to what a customer is seeing.

When accessing databases directly you are still using an application. This may be a GUI based application such as SQL Server Management Studio or Toad, or a command line tool such as SQL*Plus. It is still a client application and it pays to understand how the data is being presented. Inserting character data via a console or GUI interface using SQL is particularly unreliable, as the bytes that are stored may differ from those intended based on the database settings and the interface used.

Databases will typically store character data either in single-byte CHAR/VARCHAR columns, or in Unicode NCHAR/NVARCHAR columns. There are two important properties that need to be understood when working with character data in databases.

  • The database will use a CHARACTER SET to map the single byte column data. This character set is usually chosen from the same set of single-byte character sets that are used in most operating systems, however as we will see this is not always the case.
  • A database will also support a COLLATION which dictates sort order and equivalence when comparing character data - for example whether upper case and lower case letters are considered equivalent when comparing and sorting, and whether accented letters match to non-accented ones or not.

Different databases have different implementations with regard to character sets and collations - SQL Server, for example, uses the collation of a database, table or column to cover both the character set of the data on the server and the SQL collation. Oracle, on the other hand, supports separate properties for the character set and collation of a database, table or column.

The effect of character sets on character columns is nicely demonstrated in SQL Server if we run a simple script to populate the bytes from 1 to 255 into a set of columns, each using a different collation.

CREATE TABLE MixedEncodings (
     code INT NULL,
     Arabic_CS_AS CHAR(1) COLLATE Arabic_CS_AS NULL,
     Cyrillic_General_CS_AS CHAR(1) COLLATE Cyrillic_General_CS_AS NULL,
     Latin1_General_CS_AS CHAR(1) COLLATE Latin1_General_CS_AS NULL,
     Chinese_PRC_CS_AS CHAR(1) COLLATE Chinese_PRC_CS_AS NULL,
     Vietnamese_CS_AS CHAR(1) COLLATE Vietnamese_CS_AS NULL
)

INSERT INTO MixedEncodings(code) VALUES (1),(2),(3),(4),(5),(6), ... you get the idea ... (253), (254),(255)

UPDATE MixedEncodings
  SET Arabic_CS_AS=CAST(code AS BINARY(1)),
  Cyrillic_General_CS_AS=CAST(code AS BINARY(1)),
  Latin1_General_CS_AS=CAST(code AS BINARY(1)),
  Chinese_PRC_CS_AS=CAST(code AS BINARY(1)),
  Vietnamese_CS_AS=CAST(code AS BINARY(1))

Querying this data, we can see that the "code" byte is mapped to different characters at the point of presenting the results from the different columns, based on the collation of each column.

SQLserver bytes encoded with different collations

Yet, based on a CAST of the data to BINARY, we can see that the actual bytes stored are the same in each column; it is the collation that dictates that these bytes be presented as different characters - a neat demonstration of fundamental characteristic number 1 that I highlighted at the start.

Bytes matching different collations in SQLserver

This will not always be the case. The same text data, stored in NCHAR or NVARCHAR columns, will have a different underlying byte encoding from single byte CHAR data. I've had to deal with confused customers who were comparing HEX output from their source database with my product, not realising that our underlying UTF-8 server encoding resulted in different bytes representing the same character data as their source single byte encoding - fundamental characteristic number 2. To avoid confusion when creating test data in databases I recommend, as with the example above, explicitly injecting data using binary format rather than attempting to insert characters using SQL string INSERT statements.
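This kind of confusion is simple to reproduce outside a database: take identical character data, encode it a few different ways, and compare the hex. A Python sketch:

```python
text = "Grüße"

# Identical character data, three different byte representations -
# a raw hex comparison across encodings will always show 'differences'
# even though the text is the same.
for enc in ("iso-8859-1", "utf-8", "utf-16-le"):
    data = text.encode(enc)
    print(f"{enc:12} {len(data)} bytes: {data.hex()}")
```

A hex diff between any two of these rows would flag every non-Ascii character, despite the data being a faithful representation of the same text in each case.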

As I hinted above, some databases do not use standard character encodings but have their own custom encodings, which can trip up the unwary. Teradata uses different encodings for the data on the server and the client, and the Teradata Latin server encoding is a custom one which does not match any of the ISO or Windows single byte Latin ones. The byte 0xFF in ISO-8859-1 is the character ÿ, yet that byte loaded into Teradata using the custom Teradata Latin encoding is the Euro symbol. Therefore the resulting characters when a single byte file is loaded into a Teradata database will be slightly different from those we might expect from looking at the source file in the operating system terminal.

In fact, while we are on the subject of terminals...

Don't trust Terminals

Working with command line connections can be particularly misleading, as folks often don't realise that these are applications which will have their own encodings to map bytes to characters, which are sometimes independent from the system being connected to. I think the fact that it is a command line leads people to place more trust that they are seeing the data 'as it appears on the server'.

Some terminals will not support Unicode by default, and are often limited to specific fonts which, as mentioned, can result in the inability to present certain characters. Here, for example, is a small file containing a mixture of characters from different alphabets stored in a Unicode file, viewed via the Windows command line:-

Remote console connections can be particularly bemusing. I've dealt with confused clients who were trying to view multi-byte data using a PuTTY terminal, not realising that this was using a single byte encoding to present characters, when the database server they were connecting to was using a UTF-8 encoding. The character set setting for PuTTY is found in the connection properties - when working with Linux I have found that changing this to UTF-8 by default avoids a lot of mistakes.

Putty Encoding settings

Here are some examples of the same data viewed in different PuTTY sessions with different encodings - the first with a default Latin encoding and the second with the correct UTF-8 setting:-

Data viewed with Latin console settings

Putty file viewed as UTF8

Sometimes it is easier, rather than viewing remotely, to get the files onto your machine and view them there. Or is it.....?

Don't trust files

Even looking at files on a local machine is fraught with peril. Copying files needs to be done in binary transfer mode to avoid changing characters 'helpfully' for your operating system. Then you need to make sure that you view the file using an appropriate tool. Using a text editor brings you back to the problem of applications.

One of the classic mistakes that I've seen people make is to open a Unicode file in a text editor, not see the characters that they expected, and so perform a File - Save As that re-encodes the data as presented, i.e. the incorrectly interpreted set of characters, back into Unicode format, thus permanently losing the original data. Luckily text editors are getting better at automatically detecting Unicode encodings, however Windows achieves this through the use of a "BOM" (byte order mark) at the start of the file, which can be confusing when viewing the file in other operating systems or as hex. There are excellent editors such as Notepad++ which allow you to select the encoding that you want to use to view the data, without modifying the file.

Encoding settings in Notepad++

There is still a need for care here - the "Encode" menu items do not change the data, only the encoding used to present it as characters. The lower "Convert" options, however, will encode whatever characters are currently in the application window into the selected encoding, very much as I've described above. A useful option, but one to be used with care.
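The difference between the two menus can be sketched in Python: "Encode" amounts to re-decoding the same bytes, while "Convert" transcodes whatever is currently displayed. A minimal illustration (the example word is arbitrary):

```python
raw = "naïve".encode("utf-8")            # the bytes actually in the file

# "Encode"-style behaviour: reinterpret the SAME bytes with another
# codec. The file is untouched; only the characters shown change.
viewed_wrong = raw.decode("latin-1")     # mojibake on screen, data intact

# "Convert"-style behaviour: re-encode whatever is currently shown.
# If the display was wrong, the corruption becomes permanent:
converted = viewed_wrong.encode("utf-8") # genuinely different bytes now
print(converted != raw)                  # True
```

Saving `converted` over the original file is exactly the destructive Save As mistake described above.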

So what do I trust?

The main thing that you need to be able to trust when working with international data and different encodings is your own understanding. If you are confident in what you are looking at, what you are expecting to see, and whether you can trust the characters that you are seeing, then the battle is almost won. Working out where data is stored, what translations it undergoes, and what encoding is being used to map that data to representable characters is critical to making sure that your testing is based on the appropriate target behaviours.

The most reliable way that I've found to verify the content of data is to view it using a hex editor. The hexdump command on Linux with the -c option is a great way to view character files. On Windows I've found HHD Hex Editor Neo very useful, as it allows you to view hex data alongside different encoding representations of the same bytes, including Unicode ones. Most hex editors are limited to single-byte representations.
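For illustration, a minimal hexdump-style viewer can be written in a few lines of Python - a sketch of the offset / hex / ASCII layout these tools present, not a replacement for them:

```python
def hex_view(data: bytes, width: int = 16) -> str:
    """Render bytes as offset, hex values and printable ASCII."""
    lines = []
    for off in range(0, len(data), width):
        chunk = data[off:off + width]
        hexpart = " ".join(f"{b:02x}" for b in chunk)
        # Non-printable and multi-byte values show as "." - a reminder
        # that the character column is only an ASCII interpretation.
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{off:08x}  {hexpart:<{width * 3}} {text}")
    return "\n".join(lines)

print(hex_view("héllo\n".encode("utf-8")))
```

Note how the two bytes of the UTF-8 "é" appear as c3 a9 in the hex column but as two dots in the character column - the bytes are the trustworthy part.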

In addition to allowing you to view the underlying bytes in a data file, many hex editors also support the comparison of files. The downside of comparing in hex is that you don't tend to get the nice matching and alignment that comes with textual comparisons. If you can obtain supposedly identical data in a matching encoding and format, however, including line endings and delimiters, then hex comparison can be a very precise and surgical means of checking for consistency.

HHD Hex Editor Neo

As I mentioned in the 'database' section, you can't necessarily rely on a difference in bytes indicating a difference in the character data that is represented - remember the second of the key principles presented at the start. However, if you can obtain data in a consistent format, then viewing it as hex can remove all manner of confusion over the differences that you are seeing. I also suggest that, where possible, data comparisons be done programmatically based on the byte content of the data rather than relying on visualising the characters. I have a Java program that I use to compare output files from our application with the source data loaded in, to validate data conservation within the system.
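A byte-level file comparison along these lines is straightforward to sketch - here in Python rather than the Java mentioned above, and purely as my own illustration:

```python
import hashlib

def same_bytes(path_a: str, path_b: str, chunk: int = 1 << 16) -> bool:
    """Compare two files on raw byte content, never on rendered characters.

    Hashing in fixed-size chunks keeps memory use flat for large exports.
    """
    digests = []
    for path in (path_a, path_b):
        h = hashlib.sha256()
        with open(path, "rb") as f:          # binary mode: no translations
            while block := f.read(chunk):
                h.update(block)
        digests.append(h.digest())
    return digests[0] == digests[1]
```

Because the files are opened in binary mode, no encoding or line-ending interpretation can creep in between the data and the comparison.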

One trick that I also find particularly helpful for testing character data conservation is to create a Unicode copy of any character data that I test with. I can then use this copy to compare back to on a character basis if the transformations that have occurred to the data within the system make binary comparisons to my source data inappropriate - for example, if I've imported single-byte data into a UTF-8 based system. To check for character support I keep copies of the entire byte ranges for both single-byte and Unicode encodings as test source data files; these are incredibly useful to push through a data system or process to check for conservation and consistency.
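Generating test source files like these is simple to script. A Python sketch (the file names are illustrative; the Unicode file here covers the Basic Multilingual Plane, skipping the surrogate range, which cannot be written as UTF-8):

```python
import os
import tempfile

out_dir = tempfile.mkdtemp()

# Every single-byte value 0x00-0xFF, for single-byte encoding checks:
single = bytes(range(256))
with open(os.path.join(out_dir, "all_bytes.bin"), "wb") as f:
    f.write(single)

# Basic Multilingual Plane characters from space upwards, excluding
# the surrogate block (U+D800-U+DFFF), which is not encodable:
chars = "".join(chr(cp) for cp in range(0x20, 0x10000)
                if not 0xD800 <= cp <= 0xDFFF)
with open(os.path.join(out_dir, "bmp_chars.txt"), "w",
          encoding="utf-8") as f:
    f.write(chars)
```

Pushing files like these through a system and comparing what comes out against what went in quickly exposes any character range that isn't conserved.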

Working with character data can be a hugely confusing field if you aren't sure what you are doing. By understanding the key concept that character representations of data are untrustworthy and inconsistent, we have a much better chance of avoiding getting caught out. Developing a familiarity with the concepts of encodings, and the various settings that control these, provides an invaluable foundation for anyone involved in data testing.

Further Reading

As with all things in computing there are multiple layers of complexity to understanding the impact of character sets, encodings and locales. I've not found the need myself to delve to the depths of, for example, how Linux processes use locale settings to perform data validation operations, but there is plenty of information available.

  • This site has a good summary of the settings and considerations when working with Unicode in Linux.
  • A good introduction to code pages on Microsoft platforms can be found here.
  • For more generic cross-platform considerations, this page provides a good, concise history of encodings and the implications of mixing unicode and non-unicode ones.
  • This blog post provides a good in-depth explanation of character encoding considerations, targeted at programmers.