Thursday, 1 September 2016

Why I still use Excel



Microsoft's spreadsheet application, Excel, was in the news for all the wrong reasons last week. A paper in Genome Biology documented how numerous scientific papers had errors in their data because they had used default settings in Excel, which had unhelpfully converted gene names to dates or floating-point numbers. It was hard to spot because it didn't happen to all gene names, but, for instance, the gene Septin 2, with acronym SEPT2, would be turned into the date 2006/09/02. This is not new: this paper in 2004 documented the problem, but it seems many people weren't aware of it, and it is now estimated that the genetics literature is riddled with errors as a consequence.
This isn't the only way Excel can mess up your data. If you want to enter a date, you need to be very careful to ensure you have the correct regional setting. If you are in the UK and you enter a date like 23/4/16, it will be correctly entered as 23rd April regardless of the setting, because there is no 23rd month. But if you enter 12/4/16, it will be treated as 4th December on US settings and as 12th April on UK settings.
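One way to guard against this once data leave Excel is to parse dates with an explicit format string, so that locale settings cannot change their meaning. A minimal sketch in R, assuming the dates arrive as text in day/month/year order with two-digit years:

    # Ambiguous-looking dates, stored as plain text
    dates <- c("23/4/16", "12/4/16")
    # An explicit format removes the locale guesswork:
    # %d = day, %m = month, %y = two-digit year
    parsed <- as.Date(dates, format = "%d/%m/%y")
    print(parsed)  # "2016-04-23" "2016-04-12"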
Then there is the dreaded autocomplete function. This can really screw things up by assuming that, when you start typing text into a cell, you want to repeat a previous entry in that column that begins with the same sequence of letters. It can be a boon and a time-saver in some circumstances, but a way to introduce major errors in others.
I've also experienced odd bugs in Excel's autofill function, which makes it easy to copy a formula across columns or rows. It's possible for a file to become corrupted so that the cells referenced in the formula are wrong. Such errors are often introduced by users, but I have also seen corrupted files containing formulae, which is pretty scary.
Many people's response to this is to say that serious scientists shouldn't use Excel: it's just too risky having software that can actively introduce errors into your data entry or computations. But people, including me, persist in using it, and we have to consider why.
So what are the advantages of keeping going with Excel?
Well, first, it usually comes bundled with Windows computers, so it is widely available free of charge*. This means most people will have some familiarity with it – though few bother to learn how to use it properly.
Second, you can scan a whole dataset easily: scrolling through rows or columns is very direct. You can use Freeze Panes to keep column and row headers static, and you can hide columns or rows that you don't want getting in the way.
Third, you can format a worksheet to facilitate data entry. A lot of people dismiss colour coding of columns as prettification, but it can help ensure you keep the right data in the right place. Data validation is easily added and can ensure that only valid values are entered.
Fourth, you can add textual comments – either as a row in their own right, or using the Comment function.
Fifth, you can very easily plot data. Better still, you can do so dynamically, as it is easy to create a plot and then change the data range it refers to.
Sixth, you can use lookup functions. In my line of work we need to convert raw scores to standard scores based on normative data. This is typically done using tables of numbers in a manual, which makes it very easy to introduce human error. I have found it worth investing the time to enter the full table of numbers as a separate worksheet, so that the lookup can then be automated.
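The same idea carries over to script-based tools. Here is a minimal sketch of a score lookup in R – the normative values and column names are invented purely for illustration:

    # Toy normative table: each raw score maps to a standard score
    # (the numbers are invented, purely for illustration)
    norms <- data.frame(raw = 0:5,
                        standard = c(70, 78, 85, 92, 100, 108))
    # Raw scores for a few hypothetical participants
    scores <- data.frame(id = c("p1", "p2", "p3"),
                         raw = c(2, 5, 0))
    # Look up each raw score in the norms table,
    # much as VLOOKUP does in Excel
    scores$standard <- norms$standard[match(scores$raw, norms$raw)]
    print(scores)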
Many of my datasets are slowly generated over a period of years: we gather large amounts of data on individuals, record responses on paper, and then enter the data as it comes in. The people doing the data entry are mostly research assistants who are relatively inexperienced. So a very transparent method of data entry, which can include clear instructions on the worksheet, and data validation, is important. I'm not sure any other software would suit my needs as well.
But I'm concerned about errors and need strategies to avoid them. So here are the working rules I have developed so far.
1. Before you begin, turn off any fancy Excel defaults you don't need. And if entering gene names, ensure they are entered as text (for instance, by formatting the column as Text before you start typing).
2. Double data entry is crucial: have the data re-entered from scratch once the whole dataset is in, and cross-check the two files (a sketch of such a cross-check appears after this list). This costs money, but it is important for data quality: there are always errors.
3. Once you have the key data entered and checked, export it to a simple, robust format such as tab-separated text. It can then be read and re-used by people working with other packages.
4. The main analysis should be done in software that works from a script, so that the whole analysis can be reproduced. Excel is therefore not suitable. I increasingly use R, though SPSS is another option, provided you keep a syntax file.
5. I still like to cross-check analyses using Excel – even if it is just to do a quick plot to ensure that the pattern of results is consistent with an analysis done in R.  
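To make rules 2 and 3 concrete, here is a minimal sketch in R – the file names are hypothetical – that compares two independently entered versions of a dataset cell by cell and, once they agree, writes out a tab-separated master copy:

    # Read the two independently entered versions of the dataset
    # (file names are hypothetical)
    entry1 <- read.csv("data_entry_1.csv", stringsAsFactors = FALSE)
    entry2 <- read.csv("data_entry_2.csv", stringsAsFactors = FALSE)
    # Flag every cell where the two entries disagree
    # (cells that are NA in either file need separate checking)
    mismatches <- which(entry1 != entry2, arr.ind = TRUE)
    if (nrow(mismatches) > 0) {
      print(mismatches)  # row/column positions to check against the paper records
    } else {
      # Once the files agree, export a robust tab-separated master copy
      write.table(entry1, "master_data.txt", sep = "\t",
                  row.names = FALSE, quote = FALSE)
    }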
Now, I am not an expert data scientist – far from it. I'm just someone who has been analysing data for many years and learned a few things along the way. Like most people, I tend to stick with what I know, as there are costs in mastering new skills, but I will change if I can see benefits. I've become convinced that R is the way to go for data analysis, but I do think Excel still has its uses, as a complement to other methods for storing, checking and analysing data. But, given the recent crisis in genetics, I'd be interested to hear what others think about optimal, affordable approaches to data entry and data analysis – with or without Excel.

*P.S.  I have been corrected on Twitter by people who have told me it is NOT free; the price for Microsoft products may be bundled in with the cost of the machine, but someone somewhere is paying for it!

Update: 2nd September 2016
There was a surprising amount of support for this post on Twitter, mixed in with the anticipated criticism from those who just told me Excel is rubbish. What's interesting is that very few of the latter group could suggest a usable alternative for data entry (and some had clearly not read my post and thought I was advocating using Excel for data analysis). And yes, I don't regard Access as a usable alternative: been there, tried that, and it just induced a lot of swearing.
There was, however, one suggestion that looks very promising, which I will chase up: @stephenelane suggested I look at REDCap.
Website here: https://projectredcap.org/

Meanwhile, here's a very useful link, which came in via @tjmahr on Twitter, on setting up Excel worksheets to avoid later problems:
http://kbroman.org/dataorg/

Update: 4th October 2016
Just to say we have trialled REDCap, and I love it. Very friendly interface. It is extremely limited for any data manipulation/computation, but that doesn't matter, as you can readily import/export information to other applications for processing. It's free, but your institution needs to be signed up for it: Oxford is not yet fully set up with it, but we were able to use it via a colleague's server for a pilot.

5 comments:

  1. I like Excel, and call it Magic Maths Paper.
    However, it needs to be kept simple, and I find the graphing facilities frustrating. Getting a reasonable graph is easy; getting it just as you want is very difficult.
    Not bad for freeware.

  2. I still use Excel a lot. I don't use it for graphs as I find the time taken to get a graph right exceeds the time required to write an R function to plot the data (in most cases).

    The big advantages are the ubiquity of the package, the utility of sharing csv files, and the ability to view the raw data with a flexibility that is unmatched by any other package I've used. Once you've got datasets that can't be viewed on a single screen this advantage diminishes, and I can see a case for other approaches.

    Two other uses: setting up simple calculators for students and collaborators (showing working etc.) and data checking and cleansing using logic and text functions.

    One thing I don't generally do is use it for proper stats (except where I can check the results by hand or in a package such as R).

  3. Over the last few years I have slowly come to the conclusion that using a spreadsheet or anything more complicated than my shopping list was madness. I am now reconsidering my position on shopping lists.

    In general, spreadsheets often seem to be used in extremely inappropriate ways, and the actions of a spreadsheet are very opaque to the user, which makes auditing a spreadsheet a bit of a problem.

    For highly experienced and knowledgeable users accustomed to all the vagaries of a spreadsheet, they can be a useful tool, particularly for data entry, but, as the Genome Biology paper illustrates, too many people don't understand what a spreadsheet does.

    A possible alternative to a data-entry spreadsheet for some in the Windows world might be EpiData. http://www.epidata.dk/

    I think that the best approach for data entry is a well-designed data entry form attached to a database.

    It seems a pity to teach students and budding young researchers to use an inferior and error-prone tool.

    I totally agree with Thom Baguley on graphs. Using a spreadsheet for graphs is way too much work.

    Many of the sins of Excel and other spreadsheets are ones that don't appear if one is just using it for data entry but in many cases people will use it for analysis, modeling and heaven knows what else.

    A problem, though not one likely to bother a researcher who is only using a spreadsheet for data entry or a quick bit of EDA, is that Excel and other spreadsheets do not, in general, have strong typing. Without it, you can do some weird things.

    As an experiment, enter your telephone number in spreadsheet cell A1 and format it to your usual national format. I am in Canada, so my format is usually 555-666-7777. Then in B1 enter the number 2. In C1 enter “=A1*B1”.

  4. SPSS isn't free either, so for reproducibility one would stick to R or something else freely available and open-source. (Without the latter we would be talking about some magic black box, not something that everyone can take apart to peek inside if needed.)

  5. Does Excel have data limits for statistical functions? I have been told during an Excel call that t-tests, for example, are good for data sets up to N=60. Does that mean if I have a data set of N=120 that the calculation can be in error? Lpalmer@winona.edu
