Friday 18 April 2014

Data analysis: ten tips I wish I'd known sooner

I enjoy analysing data and I've been doing it for years, but I still do things inefficiently. All the same, I have learned some things along the way. As I work through another complex dataset, I thought it worth sharing some of the most useful tips I've picked up to make life simpler and smoother.  Some are very elementary and they will mostly be of relevance to psychologists who use Excel and SPSS to do fairly straightforward analyses, though some points are more generic.

A lot of these tips have to do with deploying those frontal lobes and thinking ahead: you need to be aware of three things:
  • How you set up a dataset can make a big difference to how easy it is to analyse later on
  • You will not remember anything about your data in a few years (or even months) time
  • There are increasing moves towards data-sharing - i.e. making your data available in a public repository - and you need to plan for that too.

1. Label your subjects consistently and anonymously

You will almost certainly be required to anonymise data collected from human subjects. It is amazing how often people sign up to this in their ethics application but then forget all about it. I've seen supposedly anonymised data identified by people's names, initials and/or dates of birth. Don't do it!

I find simple consecutive number codes work fine, with a prefix denoting which study the subjects come from. There are two things to think about, in addition to anonymisation. First, will it be useful to be able to sort subjects by code number into specific groups? For instance, if you have three groups identified as young, middle-aged and old, you might think of labelling them Y1, Y2, .... M1, M2, etc. But in certain computing systems, this will mean that when you generate output, e.g. means for each group, the groups will occur in alphabetical order, so M, then O, then Y. So it may be better to use codes that follow a natural sequence - this means that when you want to paste output into a table, you don't need to fiddle about with it.

Another thing to note is that if you have a program that treats codes as alphabetic, then if you have, say, 20 subjects, and you sort them, they will come out in the order: S1, S10, S11, S12, S13, S14, S15, S16, S17, S18, S19, S2, S20, S3, S4, S5, S6, S7, S8, S9. This mildly irritating feature can be avoided if you ensure all codes are the same length, e.g. S01, S02, and so on.
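
This sorting quirk is easy to demonstrate. Here is a quick illustration in Python (any program that sorts codes alphabetically will behave the same way):

```python
# Plain alphabetic sorting compares codes character by character,
# so "S10" sorts before "S2".
codes = [f"S{i}" for i in range(1, 21)]
print(sorted(codes)[:5])   # ['S1', 'S10', 'S11', 'S12', 'S13']

# Zero-padding makes all codes the same length, so alphabetic
# order and numeric order coincide.
padded = [f"S{i:02d}" for i in range(1, 21)]
print(sorted(padded)[:5])  # ['S01', 'S02', 'S03', 'S04', 'S05']
```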

On the other hand, if you are generating raw data files that you want to process automatically through some analysis package, make sure you have codes that can be easily read in a loop. It's much easier to tell a program to sequentially analyse files beginning with S and ending in numbers 1 to 20, than it is to type in each subject code separately.

And if you are going to analyse a whole set of files using an automated procedure, use a logical folder structure to organise your files. I've been stymied by finding that someone has carefully organised data files so that all the data for one child on various tasks are in one folder, and these folders are then within other folders that group them by age. I prefer it if all the files that are to be analysed together are kept together: provided the file-naming system is well-constructed, there should not be any chance of confusing who is who, and you can then point the analysis program just to one relevant folder, without a lot of if statements.
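
As a sketch of this arrangement (in Python, with made-up folder and file names), a single well-organised folder with a consistent naming scheme lets one loop pick up every file:

```python
from pathlib import Path
import tempfile

# Set up a throwaway folder with mock subject files so the example
# is runnable; in real use this would be your existing data folder.
data_dir = Path(tempfile.mkdtemp())
for i in range(1, 6):
    (data_dir / f"S{i:02d}_task.csv").write_text("mock data")

# With everything in one folder and a consistent naming scheme,
# one loop finds all the files -- no nested folders, no if-statements.
files = sorted(p.name for p in data_dir.glob("S??_task.csv"))
print(files)  # all five files, in numeric order
```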

2. Label your variables in a consistent and intuitive fashion that will work across platforms

If you are going to end up analysing your results in SPSS, start out with variable names that SPSS will accept - i.e. no blanks, no leading numbers and no prohibited characters. Think ahead to the paper you plan to write about the results and consider the variable names you will use there. I seldom obey my own advice here, but it's a common source of irritation to reviewers if you are inconsistent in how you refer to a variable. Much better to start as you mean to go on, as far as is possible.

3. Use 'freeze panes' in Excel

This is about as basic as it gets, but I'm surprised at how many people don't know about it. In Excel, when you scroll down or across your file, the variable names or subject IDs scroll off the screen. If you place your cursor in the cell just below the variable names and just to the right of the subject IDs (i.e. the first cell of data), and select View|Freeze panes, the rows and columns above the current cell will stay put when you scroll. If you get it wrong, you can always unfreeze.

There's a rather clunky method that lets you freeze panes in SPSS, which can be useful if you want to see subject IDs while scrolled over to the right: see here for instructions

4. Keep an explanatory list of your variables

In SPSS, the 'labels' field can be useful for keeping a record of what the variable is, but it is limited in length and sometimes more detail is needed. It's worth keeping a data coding file which lists all variables and gives a brief description of what they are, what missing value codes are, and so on. This is absolutely critical if you plan to deposit data in an archive for sharing. It's good practice to work as if that is going to be the case.

5. Use one big file, rather than lots of little files, and hide variables that you aren't currently using

If you have a large dataset, it's tempting to break it up to make it more manageable. You may have hundreds of variables to contend with. You may therefore be tempted to pull out variables of current interest and store in a separate file. The problem is that if you do that, it's easy to lose track of where you are.

Suppose you have a giant master file, and you select a few variables to make a new file, and then in the new file you find an error. You then need to correct it in both files, which is tedious - so tedious that you may not bother and will then end up with different versions and be unsure which is correct.

Also, you may want to look at relationships between the variables you have extracted and other variables in the master file. This involves more fiddling with files, which is a good way of generating errors, especially if you use cut and paste.  So my advice is stick with one master file, which is scrupulously labelled with a version number when you update it. You can avoid the 'too much data' problem by just hiding bits of file that aren't currently in use.

In Excel, it's simple to hide rows or columns that you aren't using: select the columns you want to hide, right-click and choose Hide. If you want to see them again, select the columns on either side of the hidden ones, right-click and choose Unhide.

You can do a similar thing in SPSS by defining Variable Sets. It's a bit more fiddly than the Excel operation, but once you have defined a variable set, it is saved with the file and you can reselect it very easily. Instructions for doing this are here.

This website, incidentally, is a treasure trove of useful advice, including instructions on  how to produce APA formatted tables from SPSS output.

6. Never name a file with the suffix 'final' and always back up key data

It's just tempting fate to call a file 'final'. You will end up with files called 'final final' or 'really final' or 'final v.10'. Better to use dates to identify the most recent version.

The back-up advice is blindingly obvious but even the most seasoned of us still forgets to do it. If you've put in several days' work on something, you need to have it stored safely in case your computer dies or gets stolen.

7. Look at your data

Before rushing into an analysis, it is important to plot your data. This can be a good way of picking up gremlins in the dataset, such as values which are out of range. For instance, if you have raw scores and scaled scores on a variable, plot one vs the other in a scatterplot - depending on whether there is age variation in the sample, scores should either fall on a straight line, or at least cluster around it.

You should also check whether variables are normally distributed, and if not, consider how to deal with this if your analysis assumes normality. Andy Field's textbook Discovering Statistics Using SPSS gives helpful advice on this general topic (and many more!).

8. Keep a log of your analysis

I find it helpful to think through what I want to do, then make a list of the analyses and outputs that I need and work through them systematically. In practice, things seldom go smoothly. You may find that a subject's file is missing, or corrupted, or that data need transforming in some way. It's traditional to work with a lab book in which such things are recorded, though I prefer to do this electronically when at the analysis stage, by just keeping a Word document open, and making a note of everything as I go along, under the relevant date.

In this log I note the names of the files I've created and their location and what they do. This really is necessary if you are to avoid a future time when you have to wade through numerous versions of numerous files trying to find the one you can only dimly remember creating.

I also record problems in the log. For instance, suppose you have a task that was wrongly administered in a couple of cases so the data will have to be excluded. The log is where you record the date at which this was noted, which data were affected and what action was taken. If you don't do this, you may well find that you, or someone else, comes back to the dataset at a later time and cannot make sense of it. Or, as I have done all too often, spends a morning discovering the same problem that was already found months previously.

9. Use scripting in SPSS

The best advice of all is to keep a script of your analysis if you are working in SPSS. Even if you know nothing about scripting, this is very easy to do. Set up your analysis using the menu, and instead of hitting OK, you hit Paste. This opens a script window showing the script-version of the commands you have selected. You can then select all or part of the script and run it by hitting the big green arrow at the top of the script window. If you then run another analysis, and again hit Paste, the new syntax will be appended at the bottom of the script.

There are four advantages to doing things this way:
  • You can save the script and thus have a permanent record of the analysis you have carried out. This complements the log you are keeping (and you can save its name in the log).
  • You can annotate the script. Any text you type in that begins with an asterisk and ends with a full stop is treated as a comment. You can use this to remind yourself of what the analysis does.
  • You can come back and re-run the analysis. Suppose you find that you had an error in your dataset and you had to correct one person's data. Instead of having to laboriously reconstruct the whole analysis, you just  re-run the script.
  • This is a good way to learn about scripting. A lot can be picked up by just taking the auto-generated script as a model and tweaking things like the variables that are analysed. If you want to do a parallel analysis on several variables, it is much easier to copy and paste the relevant section of a script and change one or two variable names than to go through a menu.

10. Check all analyses before publishing

It is really important to check all your analyses before you submit a paper. Always go through a paper and make sure that all reported tables and analyses are reproducible. It is amazing how often you re-run an analysis and things come out differently. It can seem sometimes that a malign spirit inhabits your computer and is just teasing you. Even if the differences are not serious, they can still be a source of worry. In my experience, they often have to do with things like forgetting to select the correct subset of participants for a specific analysis - if, for instance, there are exclusionary criteria. Or a missing data code may have been inadvertently treated as data. In the worst case you may find you cut and pasted something into a file, forgetting that the file had been sorted in a different order. And sometimes, just sometimes, both Excel and SPSS files can get corrupted. It can be useful to have data in both formats for an overall consistency check on things like mean values.

This is where SPSS scripting comes into its own. The whole business of re-running the analyses becomes much more tractable with a script. Furthermore, in a final check, you can further annotate the script, noting which table in the paper corresponds to a particular output and so have a nice, clear account of what you did to refer back to.


  1. Re number 6: Don't save in this way in the first place. Use a tool such as the Open Science Framework and always save your file with the same name. The Open Science Framework uses version control, so at the end of the process, you may have 10 or 20 or 100 versions of the same file. The last one is the final. But every other version is there too, in chronological order. No confusion, no mess, no wondering which version is current.

  2. Thanks Stuart. I've taken a look at Open Science Framework and am definitely going to give it a try! I have a project that is nearly complete but has lots of collaborators, and I suspect this would be a good way of sharing materials, etc.

  3. But... but... If I save a script of everything I did, and there is a file that has all 100 versions of how I manipulated the data, how am I supposed to get away with data mining and p-value fishing?!? (Sarcastic, if not obvious)

    1. You can declare it as exploratory work. Nothing wrong with that. (Seriously!)

    2. Anon,
      There is everything wrong with that, if it is not actually exploratory work. If it is legitimately exploratory, then it is fine.

  4. I would add another point:

    "Do not use SPSS - use R instead!" Doing so could dramatically ease your life.

    1. Disagree! I use R in some contexts, and it has certain advantages - flexible, powerful and free. But it is not easy to learn, and the notion of 'dramatically easing your life' strikes me as very misleading. Let's just say the amount of swearing accompanying my use of R is about 10 times more than when I use SPSS. And the reality is that, at least in psychology, SPSS still is the default for many grad students/postdocs. See:

    2. You are definitely right: there is a learning curve for R, and it is not quite user-friendly - and you have mentioned R's advantages. Maybe the geekiness is where you and I 'dramatically' differ.

    3. My own opinion is that R is user-hostile for the first few weeks/months if you are coming from an SPSS or SAS background. I swear my brain was actually twisting itself in knots for a while.

      If one starts with R rather than moving from SPSS or SAS, I suspect it is not that difficult. It's just that it does things so differently from other stats packages that it takes a while to unprogram yourself.

      I totally agree with you, Arman. Use R instead of SPSS. It can drastically improve productivity once one is used to it. A biostatistician, Frank Harrell (a SAS expert), claims he is 20% more efficient since he switched to R.

      Also if you are using Apache OpenOffice, LaTeX or the LaTeX offshoot, LyX, you can create some pretty good self-documenting papers using Sweave or knitr. You just insert the R code into the paper you are writing, compile the paper and voila, you have the entire paper, complete with all the tables and figures created at one go.

      No need to copy and paste things and, if you change the analysis a bit or decide to add a figure you just make the change in the R code and recompile.

      On the other hand, I don't think that anyone should use a spreadsheet for serious data analysis. Spreadsheets are just too error-prone (Google "Reinhart and Rogoff spreadsheet errors" for an example).

      There is good reason to suspect that spreadsheets such as Excel, and to a lesser but still very serious extent, Apache OpenOffice return incorrect results when used for statistical analysis. I have heard of one linear regression in Excel that returned an R^2 with a negative value. See for a few potential problems.

  5. I have (for the last year) run a research support team that does training in R, STATA, SPSS, and several other programs. Most of my team loves STATA and tries to direct people to it, but for people who are not going to do much, I could never recommend anything but SPSS, if they can get access.

    If you are going to be doing a lot of complex analyses, I would switch all the way up to R or SAS. The transition to R can be a bit tough, but if you do the same types of analyses repeatedly, then after a little while you will get the hang of it, and it won't really matter.

  6. Eric, that sounds like good advice. Maybe someone should do a controlled trial comparing outcomes for people who've been trained in R or SPSS (or SAS for that matter)! I agree with jrk that part of the problem is interference from prior learning (particularly acute if you are a Matlab user, I find), but I think that even if you are a newbie, R is less intuitive and easy to learn for most brains than other packages. I still remember my amazement at finding that there was no simple command to clear all variables - yes, you could do it, but only by typing in an unmemorable and cryptic string. And what's with requiring two keystrokes for <- ? Yes, I can see the logic, but it's not exactly efficient, when other languages seem to get by cheerfully with =. (And I know you can use = with R, but if you read others' scripts, they mostly use <-.)
    There's no doubt of the appeal of R to those who are statisticians or who have a strong geeky streak. But there are many people in psychology where that is not the case. And quite often their purposes are served by relatively simple and straightforward statistical methods.
    What leads to a great deal of profanity from me is my attempts to get data into R in the right format, and to get it out at the other end in a usable format - either tables or figures. I resent spending hours on these activities, which of course are things both SPSS and Excel do with admirable efficiency. So if I just wanted a quick t-test or correlation or whatever, I'd never use R.
    My use of R is greatly improved by some marvellous people who have set up websites that lead you by the hand with worked examples, e.g. Quick-R. The help that comes with the package assumes you are a statistician, and in my experience often just leads to incomprehension and demoralisation.
    Having said all that, where R comes into its own is in making it easy to simulate data sets. I've argued elsewhere that working with simulated data should be a mandatory part of stats instruction:
    So the bottom line is: horses for courses.

  7. The efficiency argument for R is interesting because I suspect it is true if you are someone who spends most of your day working with R, but not true if (like many researchers) you dip into it less frequently. I work mainly with R these days, and despite several years' experience, and having written a book on it, I find myself having to look up how to do things that I haven't done for a while (usually in my own book, which kind of amuses me ...). In that respect it isn't efficient, because these are often things I can do in SPSS without looking them up. Efficiency in R relies heavily on memory rather than intuition; in SPSS you can usually work stuff out with a bit of common sense and a forage through the menus. However, things that are lodged in memory in R I can do much quicker than fanning about with dialog boxes.
    I basically think this whole SPSS vs R thing is one of those things where people lose sight of the fact that everything has strengths and weaknesses. People who use R fanatically endorse it - and I can understand why (and do so a bit myself) but I sometimes feel that it's almost like the pain of learning R is so immense that you have to come out the other side believing that R is wonderful to justify your own suffering. So, here's how I see it:

    SPSS: relatively easy to learn, great for straightforward analyses, good for data restructuring, the interface keeps you close to your data (I do like a spreadsheet). Like Dorothy says, keep a syntax record and you're laughing. Downsides - seems to be moving towards business analytics since IBM took over, is run by a behemoth and so genuine innovation in functionality tends not to happen ... at least not quickly. Graphs are really poor.

    R: Graphical capabilities are fantastic (and I hear what Dorothy says about trying to get graphs to work, but the more you do, the more templates you'll have sitting on your hard drive .... it gets quicker!), and it keeps up to speed with the latest techniques (you can do Bayes stuff, robust tests and a whole bunch of stuff that SPSS can't do). It can be very quick to do stuff if it's in your memory. You can usually find out how to do something within 30s using Google. Also, you are forced to use script files, which, as Dorothy points out, is good practice. Downsides: steep learning curve; things not in your memory can be slow to do; you can find an answer in 30s using Google, but often that answer is completely impenetrable to anyone without a statistics degree; I feel detached from the data (yes, I know there are data editor GUIs, but they are clunky and I don't use them). Until they improve getting data in and out I wouldn't teach R to an undergrad ;)

    So, why do I mainly use R? It's because it does more interesting things than SPSS. I think it's rare that anyone can justify doing anything other than a robust test, so that more or less rules out SPSS for anything other than a basic bootstrap of a t-test or correlation. I still use SPSS to restructure data, and sometimes for a quick and dirty look at the data, then I switch to R. This is possibly a diagnostic test of masochism though ....

  8. Enjoyed the post and discussion. R is the clear winner if you want to run linear mixed effect models, but for regression diagnostics and most analyses I also use SPSS. I'd eventually like to write snippets of R code for everything I know how to do in SPSS, but that's a weekend project I haven't found time for yet. There's nothing worse than having your SPSS license expire while you're in the thick of data analysis...

    And one SPSS trick I've found useful: You can edit the titles and subtitles of analyses and plots in the SPSS output viewer, to make them more useful and informative when you return to them years later.

  9. These sorts of posts frequently get dragged into a "R is best!" "No it isn't!" "Yes it is!" squabble which is unhelpful. Use whatever software you like, can afford, and know how to use. Preferably one which is programmable/scriptable. Here are some of my thoughts on some of the other tips:

    #5 - this sounds like a matter of taste to me, and not one that I personally agree with. If you are skilled in joining, merging, reshaping and collapsing datasets then there is no reason at all why you can't keep things separate. I think this advice is highly intertwined with the workflow of the specific software/programming language that you're using, rather than universal advice. Furthermore, I personally think that errors introduced by manual copy-and-pasting are an indictment of copy-and-pasting rather than of having many input datasets.

    #6 - I think Stuart touched on this earlier - use version control for your programs and reports. I've not heard of "Open Science Framework" but it sounds similar to other ones I've used before such as Subversion (frowned upon by modern computing buffs but it does the job), Git or Mercurial. Learning how this works is such a worthwhile investment.

    #9: The key thing here is *reproducibility*. Your aim should be to build a program where you take a bunch of datasets, press a button, and all your analysis and reporting is done just like that with no further manual intervention. Make this your goal.

    A couple of further points which I feel were missing:
    - Learn what list format is and make sure all your datasets conform to it, especially if you're going to use Excel, which is great because it gives you the freedom to do what you want, but terrible because it gives you the freedom to make a complete mess. This guide is very old but still very relevant.
    - Back up your original datasets and never overwrite them directly
    - Divide your programming up into three bits: taking your original (probably jumbled up) datasets and reformatting them into something useful and "analysis ready", doing your analysis, and reporting your results. Allow way more time for the first and last bits than for the second.

  10. Thanks everyone for the discussion. This discussion is common whenever we have overlaps in functionality, so for example you might find R vs. Matlab or Python vs. Matlab, etc all around the web. I'm a neuroimager, and a huge fan of open science, living in a realm between statistics and computer science and I would always use R (and discourage SPSS). However, it does not mean that if I were Dorothy, I would have used R.

    For my research, R is scalable (i.e. I can run hundreds of R instances for parallel analysis), and very easy to use (sorry deevybee). Most of my work is focused on Bayesian analysis these days, and I can even do CUDA programming inside R.
    To be honest, R is one of the easiest languages (and very high level) out there. I can use an R kernel inside an IPython notebook, write dynamic Markdown reports, and deploy cloud computing resources very easily.

    Again, it comes back to what your purpose in life is.

  11. Using consecutive numbers for IDs is a common mistake. When IDs are consecutive there is no way to validate an ID, because every number is a good one. This will bite you down the road. At the very least, please skip numbers. Then, when a person has butter fingers and enters "123" instead of "124", the application can validate the ID (check it against a list of valid IDs) and not allow entry of the bad one.
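
A minimal sketch, in Python, of the commenter's validation idea; the ID list here is made up:

```python
# Because the IDs are non-consecutive, a one-digit slip is likely
# to fall outside the valid set, where it can be caught at entry.
valid_ids = {"S103", "S107", "S112", "S118"}  # deliberately non-consecutive

def check_id(entered):
    """Accept an entry only if it matches a known subject ID."""
    return entered in valid_ids

print(check_id("S107"))  # True: a genuine ID
print(check_id("S106"))  # False: a typo is rejected
```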

  12. These are wonderful tips for data analysis, as it's a very tough job that needs attention and specific skills. I would say that keeping a log of your analysis is one of the most important things.

  14. To make the raw data easy to manoeuvre and interpret, so that one can make better sense of the information, it is imperative to organise the entire compilation of data systematically.