Friday 18 April 2014

Data analysis: ten tips I wish I'd known sooner

I enjoy analysing data and I've been doing it for years, but I still do things inefficiently. All the same, I have learned some things along the way. As I work through another complex dataset, I thought it worth sharing some of the most useful tips I've picked up to make life simpler and smoother.  Some are very elementary and they will mostly be of relevance to psychologists who use Excel and SPSS to do fairly straightforward analyses, though some points are more generic.

A lot of these tips have to do with deploying those frontal lobes and thinking ahead: you need to be aware of three things:
  • How you set up a dataset can make a big difference to how easy it is to analyse later on
  • You will not remember anything about your data in a few years (or even months) time
  • There are increasing moves towards data-sharing - i.e. making your data available on a public repository and you need to plan for that too.

1. Label your subjects consistently and anonymously

You will almost certainly be required to anonymise data collected from human subjects. It is amazing how often people sign up to this in their ethics application but then forget all about it. I've seen supposedly anonymised data identified by people's names, initials and/or dates of birth. Don't do it!

I find simple consecutive number codes work fine, with a prefix denoting which study the subjects  come from. There are two things to think about, in addition to anonymisation. First, will it be useful to be able to sort subjects by code number into specific groups? For instance, if you have three groups identified as young, middle-aged and old, you might think of labelling them with Y1, Y2, .... M1, M2 etc.  But in certain computing systems, this will mean that when you generate output, e.g. means for each group, they will occur in alphabetic order, so M then O then Y.  So it may be  better to use codes that will follow a natural sequence - this means when you want to paste output into a table you don't need to fiddle about with it.

Another thing to note is that if you have a program that treats codes as alphabetic, then if you have, say, 20 subjects, and you sort them, they will come out in the order: S1, S10, S11, S12, S13, S14, S15, S16, S17, S18, S19, S2, S20, S3, S4, S5, S6, S7, S8, S9. This mildly irritating feature can be avoided if you ensure all codes are the same length, e.g. S01, S02, and so on.

On the other hand, if you are generating raw data files that you want to process automatically through some analysis package, make sure you have codes that can be easily read in a loop. It's much easier to tell a program to sequentially analyse files beginning with S and ending in numbers 1 to 20, than it is to type in each subject code separately.

And if you are going to analyse a whole set of files using an automated procedure, use a logical folder structure to organise your files. I've been stymied by finding that someone has carefully organised data files so that all the data for one child on various tasks are in one folder, and these folders are then within other folders that group them by age. I prefer it if all the files that are to be analysed together are kept together: provided the file-naming system is well-constructed, there should not be any chance of confusing who is who, and you can then point the analysis program just to one relevant folder, without a lot of if statements.

2. Label your variables in a consistent and intuitive fashion that will work across platforms

If you are going to end up analysing your results in SPSS, start out with variable names that SPSS will accept - i.e. no blanks, leading numbers or prohibited characters should be included. Think ahead to the paper you plan to write about the results and consider the variable names you will use there. I seldom obey my own advice here, but it's a common source of irritation to reviewers if you are inconsistent in how you refer to a variable. Much better to start as you mean to go on as far as is possible.

3. Use 'freeze panes' in Excel

This is about as basic as it gets, but I'm surprised at how many people don't know about it. In Excel, when you scroll down or across your file, the variable names or subject IDs scroll off the screen. If you place your cursor in the cell just below the variable names and just to the right of the subject IDs (i.e. the first cell of data), and select View|Freeze panes, the rows and columns above the current cell will stay put when you scroll. If you get it wrong, you can always unfreeze.

There's a rather clunky method that lets you freeze panes in SPSS, which can be useful if you want to see subject IDs while scrolled over to the right: see here for instructions

4. Keep an explanatory list of your variables

In SPSS, the 'labels' field can be useful for keeping a record of what the variable is, but it is limited in length and sometimes more detail is needed. It's worth keeping a data coding file which lists all variables and gives a brief description of what they are, what missing value codes are, and so on. This is absolutely critical if you plan to deposit data in an archive for sharing. It's good practice to work as if that is going to be the case.

5. Use one big file, rather than lots of little files, and hide variables that you aren't currently using

If you have a large dataset, it's tempting to break it up to make it more manageable. You may have hundreds of variables to contend with. You may therefore be tempted to pull out variables of current interest and store in a separate file. The problem is that if you do that, it's easy to lose track of where you are.

Suppose you have a giant master file, and you select a few variables to make a new file, and then in the new file you find an error. You then need to correct it in both files, which is tedious - so tedious that you may not bother and will then end up with different versions and be unsure which is correct.

Also, you may want to look at relationships between the variables you have extracted and other variables in the master file. This involves more fiddling with files, which is a good way of generating errors, especially if you use cut and paste.  So my advice is stick with one master file, which is scrupulously labelled with a version number when you update it. You can avoid the 'too much data' problem by just hiding bits of file that aren't currently in use.

In Excel, it's simple to just hide rows or columns that you aren't using. You just select the columns you want to hide and select View|Hide. If you want to see them again, you select the columns adjacent to the hidden columns and select View|Unhide.

You can do a similar thing in SPSS by defining Variable Sets. It's a bit more fiddly than the Excel operation, but once you have defined a variable set, it is saved with the file and you can reselect it very easily. Instructions for doing this are here.

This website, incidentally, is a treasure trove of useful advice, including instructions on  how to produce APA formatted tables from SPSS output.

6. Never name a file with the suffix 'final' and always back up key data

It's just tempting fate to call a file 'final'. You will end up with files called 'final final' or 'really final' or 'final v.10'. Better to use dates to identify the most recent version.

The back-up advice is blindingly obvious but even the most seasoned of us still forgets to do it. If you've put in several days' work on something, you need to have it stored safely in case your computer dies or gets stolen.

7. Look at your data

Before rushing into an analysis, it is important to plot your data. This can be a good way of picking up gremlins in the dataset, such as values which are out of range. For instance, if you have raw scores and scaled scores on a variable, plot one vs the other in a scatterplot - depending on whether there is age variation in the sample, scores should either fall on a straight line, or at least cluster around it.
You should also check whether variables are normally distributed, and if not consider how to deal with this if your analysis assumes normality. Andy Field's textbook Discovering Statistics Using SPSS gives helpful advice on this general topic (and many more!).

8. Keep a log of your analysis

I find it helpful to think through what I want to do, then make a list of the analyses and outputs that I need and work through them systematically. In practice, things seldom go smoothly. You may find that a subject's file is missing, or corrupted, or that data need transforming in some way. It's traditional to work with a lab book in which such things are recorded, though I prefer to do this electronically when at the analysis stage, by just keeping a Word document open, and making a note of everything as I go along, under the relevant date.

In this log I note the names of the files I've created and their location and what they do. This really is necessary if you are to avoid a future time when you have to wade through numerous versions of numerous files trying to find the one you can only dimly remember creating.

I also record problems in the log. For instance, suppose you have a task that was wrongly administered in a couple of cases so the data will have to be excluded. The log is where you record the date at which this was noted, which data were affected and what action was taken. If you don't do this, you may well find that you, or someone else, comes back to the dataset at a later time and cannot make sense of it. Or, as I have done all too often, spends a morning discovering the same problem that was already found months previously.

9. Use scripting in SPSS

The best advice of all is to keep a script of your analysis if you are working in SPSS. Even if you know nothing about scripting, this is very easy to do. Set up your analysis using the menu, and instead of hitting OK, you hit Paste. This opens a script window showing the script-version of the commands you have selected. You can then select all or part of the script and run it by hitting the big green arrow at the top of the script window. If you then run another analysis, and again hit Paste, the new syntax will be appended at the bottom of the script.

There are four advantages to doing things this way:
  • You can save the script and thus have a permanent record of the analysis you have carried out. This complements the log you are keeping (and you can save its name in the log).
  • You can annotate the script. Any text you type in that begins with an asterisk and ends with a full stop is treated as a comment. You can use this to remind yourself of what the analysis does.
  • You can come back and re-run the analysis. Suppose you find that you had an error in your dataset and you had to correct one person's data. Instead of having to laboriously reconstruct the whole analysis, you just  re-run the script.
  • This is a good way to learn about scripting. A lot can be picked up by just taking the auto-generated script as a model and tweaking things like the variables that are analysed. If you want to do a parallel analysis on several variables, it is much easier to copy and paste the relevant section of a script and change one or two variable names than to go through a menu.

10. Check all analyses before publishing

It is really important to check all your analyses before you submit a paper. Always go through a paper and make sure that all reported tables and analysis are reproducible. It is amazing how often you re-run an analysis and things come out differently. It can seem sometimes that a malign spirit inhabits your computer and is just teasing you. Even if the  differences are not serious, they can still be a source of worry. In my experience, they often have to do with things like forgetting to select the correct subset of participants for a specific analysis - if, - for instance, there are exclusionary criteria. Or a missing data code may have been  inadvertently been treated as data. In the worst case you may find you cut and pasted something into a file forgetting that the file had been sorted in a different order. And sometimes, just sometimes, both Excel and SPSS files can get corrupted. It can be useful to have data in both formats for an overall consistency check on things like mean values.

This is where SPSS scripting comes into its own. The whole business of re-running the analyses becomes much more tractable with a script. Furthermore, in a final check, you can further annotate the script, noting which table in the paper corresponds to a particular output and so have a nice, clear account of what you did to refer back to.

Saturday 12 April 2014

WeSpeechies: A meeting point on Twitter

As an avid Twitter user, I'm always interested in new developments in social media, and so I was intrigued when Caroline Bowen (@speech_woman) and Bronwyn Hemsley (@bronwynhemsley) invited me to curate a session of @WeSpeechies for a week.
I said yes immediately, though I had no real idea (and still have no real idea) of what I was getting myself into. But I knew that Caroline had done an enormous amount to encourage 'Speechies' (i.e., Speech and Language Therapists in the UK/Speech-Language Pathologists elsewhere) to use the internet. In 1998, she started a web-based resource: this is a mine of useful information, including links to assessments, interventions and evidence-based information.
I'm not a speechie myself –  I'm a full-time researcher with a Principal Research Fellowship from the Wellcome Trust and my background is neuropsychology -   but I've always worked closely with the profession, because of my interest in children's communication impairments. So it's great to have the opportunity to interact with those who work at the coalface, and @WeSpeechies seems like a great idea.
But how does it work? Well, there's a description here.
In brief, the idea is that the Twitter handle @wespeechies is taken over by someone – the 'curator' – for a week, with the idea that they can foster interesting exchanges on Twitter. What's more, the interactions can be preserved: this can be helped if those interacting with @wespeechies remember to add the hashtag #WeSpeechies to their tweets.This way we can build an archive of useful and interesting interactions.
Now, of course, most of us are not glued to Twitter all day, even though we may give that impression, and there can be difficulties in interacting across time zones. As curator, I plan to check in a couple of times every day during the week to respond to tweets, but Caroline and Bronwyn wanted also to ensure that there is an opportunity for some live chat. They therefore include in the course of the week a "Tuesday Chat", which involves me chairing a Q&A session. Caroline and Bronwyn have kindly given me an easy topic for my first attempt, namely Apprehensive Academics on Twitter – inspired by this blogpost.
I'm going to do this at 8 to 9 in the morning on Tuesday 15th April 2014 on British Standard Time. This means that the live chat should appeal to night owls in North America, and larks in UK and Europe, whereas for those in Australasia it will be afternoon or early evening.
If you'd like to take part, please feel free to respond to any message from @wespeechies, and remember to include #WeSpeechies in your message. I've generated some questions to get the ball rolling; if you want to answer one of these, please make it clear which one by including, for instance, A1, A2, A3 etc at the start of your Tweet.
Q1 What makes for an effective/ineffective tweet? #WeSpeechies

Q2 Have you got any tips for attracting followers? #WeSpeechies

Q3 Are there downsides to Twitter? #WeSpeechies 
Q4 What unexpected benefits of Twitter have you found?  #WeSpeechies

Q5 Does your employer have an issue or policy on staff being on Twitter, that you know about?  #WeSpeechies
Please note that the idea of #WeSpeechies is to engage and connect with those who work as SALTs/SLPs or who have shared interests in problems affecting communication and/or swallowing. You don't need to be an expert – on the contrary, part of the fun of the exercise is that we can all share experiences and maybe learn something new. The hope is that, if you take part, you may find new contacts around the globe: people with shared interests who can make you feel more of a world-wide community.
A final note of warning: WeSpeechies does not encourage advertisers, spammers or trolls. It is not the place to try and sell or endorse a product. Anyone who does not abide by the basic @WeSpeechies guidelines will be warned and then blocked if the advertising or unsuitable promo Tweets continue.

If you are interested in speechie issues, I do hope you will participate by following @WeSpeechies. I'll be taking over the twitter handle on Sunday 13th April for a week, so look forward to meeting you there!

P.S. 15th April 2014
The Live Chat was fun, if somewhat frenetic.
If you missed it, you can see the tweets collated here: