Once you have your script file created and your basic information annotated, the next step is importing your data. In the sections below, we discuss how to import from CSV, Excel, and SPSS files. This is where we start talking about actual code, but it’s nothing too complicated!
If you’re using the folder structure recommended by earlier workshops, your Files tab should look something like this:
The raw data we’ll be importing is saved to the ‘Data’ folder, and the script file to the ‘RCode’ folder. Now that we’ve established this, we can start writing the code.
The best file type for data storage is CSV. It’s extremely basic, so even huge datasets stay pretty small, and there are no extra bits and bobs to confuse R (or any other program) when you open the file. It’s also best practice not to edit your CSV beforehand – just download the file from Qualtrics (or whatever survey software you use) and import it into R as-is. That way, every data cleaning decision is saved in your code, both for your future records (so you know what you did) and your future sanity (so you can double-check for mistakes or the like).
The first command we’ll use is read.csv(). Let’s give it a shot! Your code should look like this:
data <- read.csv(file="Data/workshop_data_toy.csv")
In this case, we’ve told R to run the read.csv command on the file ‘workshop_data_toy.csv’ in the ‘Data’ folder of our working directory, and then save the result as an object (in this case, a dataframe) in our workspace, titled ‘data’. Maybe you don’t think ‘data’ is a very descriptive name for an object, and want to label yours as ‘raw_data_toy’. In that case, your code would look like this:
raw_data_toy <- read.csv(file="Data/workshop_data_toy.csv")
(You can call your dataframe whatever you want! I’m going to call mine ‘data’, though, because I’m lazy.)
Once you’ve run this code, you can click the ‘data’ object that’s appeared in your Environment tab, and tada! There’s your data. But if you look closely, you’ll quickly notice an issue:
The topmost row of your CSV file has become the header of your new dataframe. Here, that means your header consists of long strings that would be unwieldy to work with in your scripts. What to do?
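(Side note: if you’d rather check this from the console than by clicking around in the data viewer, a couple of base R commands will show you the same thing. This is just a quick sketch, assuming your dataframe is named ‘data’ as above:)

names(data)  # prints the column names - here, the long strings from the top row of the CSV
head(data)   # prints the first six rows of the dataframe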
You could open your CSV in Excel, delete the top row, save it, and re-import it in R. This probably seems easier, especially at first when you’re unfamiliar with R. But it isn’t the best way to proceed!
The read.csv() command actually gives you the tool you need to fix this problem: the ‘skip’ argument, which tells the command ‘the number of lines of the data file to skip before beginning to read data’. If we give it a try:
data <- read.csv(file="Data/workshop_data_toy.csv", skip = 1)
Much better! If we look through our data a bit more, however, we notice a second potential issue in row 19 (aka Stu.ID 1019). There are a bunch of blank cells! It looks like the participant left some questions unanswered. They aren’t the only one – Stu.ID 1012 didn’t respond to the ‘Intent’ item, and 1008 didn’t include their age. However, in these cases there’s a greyed out NA instead of a blank cell. What’s up?
What’s up is that R isn’t recognizing 1019’s missing responses as missing. When R treats a cell as missing, it displays the greyed-out NA; if anything else is displayed – even a blank – R assumes there’s a genuine value there. If we were to run an analysis on the data as it is, R would consider ‘blank’ as valid a response as ‘Agree’, and that’s not what we want. So we need to tweak our import code so that R correctly identifies these blank cells as NAs and displays them as the greyed-out NA, too.
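(If you want to see exactly what R has read into one of these columns, here’s a quick sketch – ‘Intent’ is one of the items mentioned above, but you can swap in whichever column has the blank cells:)

unique(data$Intent)  # values R recognizes as missing print as NA; unrecognized blanks print as "" or " "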
This might seem an overwhelming task, but the solution is simple. Because there are different ways of indicating missing data – some survey software uses codes like 99, Excel uses #N/A, and so on – the read.csv() command has an argument that lets you specify which values it should treat as NAs. Like this:
data <- read.csv(file="Data/workshop_data_toy.csv", skip = 1, na.strings = c("#N/A",""," "))
The beginning of the code is identical to the import command we used earlier, but the added argument tells R to mark any cells with the provided values as NAs. When I import CSVs, I always include the ‘na.strings’ argument as written here, even when I don’t expect any issues, just to avoid tripping over them later.
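To double-check that the new import worked, you can also count how many values R now treats as missing in each column. A quick sketch, again assuming your dataframe is named ‘data’:

colSums(is.na(data))  # number of NAs in each column of the dataframe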
Did this fix the issue with Stu.ID 1019? If not, what additional values could you add to the list to catch any remaining import errors? Note: make sure your additions to the list are enclosed in quotes and separated by commas, as displayed above. Otherwise R will assume they’re all one long value or spit out an error!
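If you’re not sure where to look, here’s one way to check – a sketch that assumes your ID column really is named ‘Stu.ID’ in your dataframe:

data[data$Stu.ID == 1019, ]  # pull up 1019's row to see whether the blanks now show as NA
lapply(data, unique)         # list every distinct value in each column - any stray 'missing' codes will show up here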
Did you get stuck anywhere in this step? What are some things to think about as you are preparing to import a data file? Head back to the worksheet to share your thoughts and get the next step.