Consider a spreadsheet with a field that captures an amount. Instead of the actual amount, the field has a number of interval possibilities, like the following.
- < € 5k
- € 5k - € 10k
- € 10k - € 20k
- € 20k - € 50k
- € 50k - € 100k
- > € 100k
What could possibly go wrong?
Recently, I worked on a spreadsheet with a field similar to this. At first glance, I thought, that is easy enough. 'What could possibly go wrong?'. However, when I started checking and updating the data, I noticed that the list had a number of problems. Before you continue reading, ask yourself for a moment, what could go wrong? And, how could that be solved or prevented? Scroll down, and compare with my findings.
This problem becomes clear from just looking at the list. What happens when the amount is € 10k? Should one take the € 5k - € 10k interval or the one of € 10k - € 20k ?
The answer may depend on the list's purpose, but eliminating the problem would be preferable. Typically, one would replace the high end of the interval by the same number minus one.
- < € 5k
- € 5k - € 9.999
- € 10k - € 19.999
A downside of this is of course that the interval's high end number becomes large. For the sake of clarity, that price could be paid. After all, the interval's low end remains a short number, so readability may not be lost.
The field concerns currency intervals for Euro's, but what to do with other currencies? In my example case, the idea was that amounts mentioned by the websites where the data was derived from ( hereafter the 'source websites' ), be converted to Euro's.
There are at least two problems with this. One is that the data editor needs to do additional work. This may be a minor problem depending on resources available. The other problem is that exchange rates vary over time. By the time that the data user sees the data, the rates may have changed to such an extent that the original amount moved to a different interval.
A potential solution for both problems is to move the currency to a separate variable. However, for the data users, this may not be acceptable and they insist that the amounts be presented in Euros. To get the best for both user groups, one would need to have two versions of the same spreadsheet: the data production version with the added variable for a currency and the data use version which calculates the exchange with Euros at the moment the data is used.
Cross border cases
Some of source websites did not mention a number but used an interval themselves. Worse, these intervals did not always fall withing the possibilities of the field. What to do with € 15k - € 30k?
Here too, the answer may depend on the application. Perhaps the data users are mostly interested in the maximum, in which case the interval of € 20k - € 50k should be selected. This could work, but requires additional instruction of the data editors, and possibly of the data users. This solution would not have my vote because most people do not read the instructions, and even if they do there may be an overload of instructions. If one can prevent instructions by letting the design speak for itself, that would be the better option.
To design the spreadsheet in such a way that this particular instruction can be prevented, one would have to have three fields. One for the individual value, in case one is mentioned in the source website; and the other two for the lower and upper limits in case the source website mentions an interval. Arguably, two fields could suffice, one for the individual value or the lower limit and one for the upper limit. However, this may require additional instruction of the data editor about the double use of the first field.
This solution would also require two versions of the spreadsheet. Here, an additional argument can be made. What if the data users change their minds? What if instead of the upper limit, they suddenly would want the lower limit of a border crossing interval? Working with only one version of the spreadsheet and one field would require revisiting all records. After all, selecting an interval irrespective of whether the source website mentions an interval or a number, results in data loss about what was the case. So all records would have to be checked. Using two versions of the spreadsheet would only require a re-calculation of the data use version.
More than € 35K
This is a special category of the cross border cases of the previous section. The problem is more or less the same, viz. it is unclear which category should be selected. The solution is also the same: work with two versions of the spreadsheet, whereas the data production version has three fields. The data editor only enters the lower limit if the source website says € 35k or more. And if the website says at most € 35k, then the editor only enters the upper limit. What could possibly go wrong? ;-)
As far as I remember, the spreadsheet that triggered the writing of this post did not have this particular problem, but it could occur. In the example list, the second last option is '> € 100k'. What happens if, say, three quarters of the source websites mention amounts much higher than € 100k? What if most sources talk about millions of Euros? There is a systematic mismatch between the source websites and the list of options, which means that the data is not as useful to the data users as could be.
It could be that the spreadsheet's designers had sampled a number of source websites but by coincidence mostly found websites mentioning amounts between € 5k and € 100k. Alternatively, they probed a representative sample, but that some years later, the world had changed and the list of options needed an update.
Regular checking would indeed be a solution. However, having the solutions in place that have been discussed so far, this would not be necessary. As part of the calculation of the data use version, the applicability of the list of options can be checked and adjusted as needed.
Any other problems? Should one solve them?
This post is about data entry and data use in general, not about statistics. For statistical use, the intervals do not suffice because they do not have the same size. This can of course easily be solved by changing the intervals.
Perhaps, you came up with something else that could go wrong? Please, write it down in a comment. I would be interested to know. Could your problem be tackled with the solutions proposed above? How else could one solve it?
There is at least on more possibility that the interval list can not deal with, which is that the source website mentions two or more distinct possibilities. For example, it mentions the amount of € 7k and a range of more than € 25k. However, after visiting about 150 websites, this possibility never occurred. There are two questions to ask in cases like this.
First, how does one capture the situations that do not fit the list of possibilities or the data structure of multiple fields? Notice that the list of options has the possibility of 'unknown'. It is a good addition because it shows that the field is not left empty for other reasons ( for example, the field remained empty because it was overlooked during data entry ). However, the option of 'Other' could be added to the list and accompanied by a 'Notes' field to describe which other situation occurred in the source website.
Secondly, once a new possibility is encountered and registered in the spreadsheet, one should ask or measure how often this possibility occurs and if it is worth the while to redesign the spreadsheet so that it can be captured? In fact, this is something to consider for all the propositions for improvements made above.
This post discussed a number of things that went wrong or could have gone wrong with the interval list : border cases, different currencies, cross border cases, special cross border cases, and options mismatch.
From interval list to data structure to user interface
This post also proposed a number of solutions, most of which involved changes that can not be achieved by merely changing the list of intervals. All in all, instead of having one field, the spreadsheet could eventually have five fields : A currency field, an amount field in case the source website mentions a specific amount, and, in case the website mentions an interval, a lower limit field and an upper limit field, and a notes field.
By proposing these additional fields, the topic changed from design of the list to the design of the wider data structure. This in turn may require work on the user interface, because merely showing them side-by-side could work, but there are better ways of presenting. Here, a form view with controls can do a better job than a table view on the data. It would go too far to go into the details here. Suffice it to say that it may require moving the production version from the spreadsheet to a relational database or other software. This blog has a post dedicated to such a data hub.
Different versions and views for different user groups
Almost all solutions rested on making a distinction between a data production version and a data use version. It maybe overkill for solving issues with an interval list, but these issues are a symptom of a bigger problem: there usually are different uses or user groups with different requirements. In this example of the field with the interval options and for the sake of simplicity, I identified only two groups, viz. data editors and data users. The data users 'simply' want a single list of options so that they can search and select data. For the data editors it would be much easier to stay close to the data provided by the source websites without having to calculate exchange rates, and to make difficult and sometimes arbitrary decisions about border crossing categories.
In real life, and also in the case of this example, many groups exist besides data editors and users. For example, owners, supervisors, data subjects and regulators can be identified. Moreover, sub-divisions within all these groups may exist as well. Each (sub) group has different and possibly contradicting requirements regarding data and interface. These differences can be addressed by having different versions of the 'same' data set, geared towards their respective uses through different data structures, views, features and functionalities, and different interfaces. Again, a spreadsheet program may not be the most suitable tool for that.
test, test, test against reality
The list of interval options seemed simple and straightforward. One can now answer the question stated at the beginning with : 'Far more than one would guess'. Few things are more surprising than reality. The design of spreadsheets, data structures, databases and applications benefit from taking this into account. One should test, test and test everything, not just against the imagination of the designers but against real life use.