Hello! This document was created as part of the CC Bio INSITES Quantitative Workshop #2, specifically as a guide for the pre-activity to be completed before our meeting on ?????. This guide has been written assuming that you have little to no familiarity with R, and so begins with a brief overview of the software installation. You shouldn’t feel pressured to memorize or even completely understand everything written here – this is designed as an introduction and a future resource, and as such can be a bit overwhelming at first.
[Add additional bit about “your goal is to get stuck” and link to worksheet.]
Installing R can be unexpectedly complicated at first, as there are two programs from separate websites that need to be installed. To provide a bit of background for the curious: the original ‘R’ is a programming language (like Java, Python, C, etc.) that began development in the 90s. With the development of this programming language came the first computer program you’ll install, simply titled R, which is a bare-bones interface for writing and running code.
As R became more popular, the RStudio IDE (the second program you’ll install) was developed to build upon the original R and to make writing and running code a bit easier. RStudio doesn’t replace R – instead, it loads it in the background and interfaces between you and the R engine in order to provide some additional functionality. Therefore, in order to use RStudio, you have to download R first (note that you don’t have to run R before running RStudio, so you should only need one additional shortcut on your desktop).
With that information in mind, hopefully getting R and RStudio installed won’t be too confusing anymore! This page provides a good overview for PC and Mac users, but in case it isn’t available (or you would just prefer to stay on this page) we’ll also provide directions for Windows and Macs here. If you use Linux, the link also provides step-by-step directions for Ubuntu towards the bottom of the page.
Head over to CRAN to download the current version of R. The link you need is right on the front page, labeled ‘Download R for Windows’; from there, look for the bolded ‘install R for the first time’ link and click to download the software. You can save the downloaded EXE wherever you like; when run, it will install R into the default directory, which you shouldn’t need to alter.
The download page for RStudio makes it very easy to install RStudio for Windows. Hit the big blue button to get the installer, and then run it to get RStudio set up. After that, you’re done! The link provided above will walk you through some additional stuff, but we’ll also cover that on our own so you don’t have to worry about it now.
Head over to CRAN to download the current version of R. The link you need is right on the front page, labeled ‘Download R for (Mac) OS X’; from there, look for the section titled ‘Latest release’ and download the indicated PKG. When run, it will install R into the default directory that you shouldn’t need to alter.
On the download page for RStudio check out the section titled ‘All Installers’, and then locate the link for macOS. Save the indicated file and run; once again, you should accept the defaults (unless you’re familiar enough with your system to know why you shouldn’t). After that, you’re done! The link provided above will walk you through some additional stuff, but we’ll also cover that on our own so you don’t have to worry about it now.
Now that you’ve got everything installed, here’s a quick overview of what’s what in the program. When you open RStudio for the first time, you should see this:
The only thing that won’t match up is the contents of your ‘Files’ window, but that’s okay. Now for a quick breakdown of what the various windows are:
This window is where most of the output will be displayed, including errors and warnings. Everything in this window is text – if you create a plot, for instance, it will appear in the ‘Plots’ tab (the same window as the ‘Files’ tab). The window does have a limit; if you run a bunch of code, or ask R to display something lengthy, some of your older code and output may get cut off as the window fills with new output. You can write code here, but it won’t be saved, so make sure you’re writing all of your code in the script window (#4, discussed more below).
This window lets you keep track of the objects you’re working with. If you import a dataset, that will be one object (usually a ‘dataframe’ that holds all of your rows and columns). If you run a regression and save the output, that will be another object (usually an ‘lm’, a collection of all the output produced by the test, such as coefficients, residuals, fitted values, and so on). Sometimes you can click objects in this window (e.g., a dataframe) and the object will be output into the console or it will open in a new window. There are some other tabs here as well, but they aren’t used often.
Lots of stuff can be displayed in the tabs of this window, but it’s not as interactive as the others.
This is the window where you’ll spend most of your time, but it doesn’t show up automatically. This window only appears when you create a code file (File > New File) for writing and saving code. The files you create will be Untitled and unsaved, like other coding software or word processing programs you might be familiar with, so don’t forget to save! RStudio has a handy default setting that lets it save your workspace, specifically all of the code windows you have open and all of the objects in your environment, but sometimes there are issues so don’t rely on it too much. Note: when you save files, the save location will default to your current working directory.
The best way to get familiar with R is to use it, so let’s get started. First step: creating a Project!
Projects are a useful way of organizing your work environment, working directory, and file structure. They’re especially useful if you’re working on multiple projects (lowercase p). For instance, you might have a dataset consisting of information about your students’ grades, science interest, science self-efficacy, and growth mindsets. One collaborator might be interested in analyzing the relationship between grades and mindsets, while another wants to examine the relationship between interest and self-efficacy. You can, of course, run both analyses from the same Project, but splitting them helps keep things organized. For instance, one project might limit the dataset to first-year students, while the other looks at the entire sample, or even requires you to combine your data with some collected by your collaborator. Keeping track of any such decisions you’ve made is easier when you keep the projects in separate Projects.
They also make switching between projects easier. If you’re going from a meeting with Collaborator 1 to Collaborator 2, all you have to do is pull up the second project and everything – your working directory, the scripts you had open, the objects in your environment, etc. – will all come up exactly as you left them last. Your working directory is the current location that R will use to output files, and so it’s easy to forget about until something goes wrong (e.g., you save something to the wrong location and can’t find it, or overwrite a file you didn’t intend to). So again, Projects help you avoid this hassle by setting your working directory and allowing you to focus on other things.
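If you’re ever unsure where R is currently reading and writing files, you can ask it directly. The lines below are a minimal sketch using base R functions; the ‘Data’ folder and file name are only examples borrowed from later in this guide, and may not exist on your computer yet.

```r
# Print the current working directory (the folder R reads from and saves to)
getwd()

# List the files and folders inside the working directory
list.files()

# Build a path relative to the working directory
# ("Data" and the file name are example values)
data_path <- file.path("Data", "workshop_data_toy.csv")
data_path

# Check whether that file exists before trying to import it
file.exists(data_path)
```

If `file.exists()` returns FALSE, the usual culprits are a working directory that isn’t what you expected or a typo in the folder or file name.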
In the top right corner of your screen is a little blue box and a label that says Project: (None) (assuming, of course, that you don’t have a project open: see the screenshot below for more information).
The next screen provides a list of potential project types. In this case, we want to select ‘New Project’ again.
The last screen lets you name your project and select the project’s working directory. You don’t need to create a new folder just for your new project! Whatever you input as the directory name will become the new folder for your project. If you created the folder structure recommended by the previous workshop, you can navigate to this location and select it as your new project subdirectory, or you can move your previously created folders into your new project location outside of R. I titled my project ‘Quant Workshop 2’.
When complete, the project button will have the name of your new project, and the Files window will change to reflect your new working directory. If you look closely at the Files window in the screenshot below, you can see that the full directory for my new project is D:\Projects\Workshop 2.
And you’re done! Your new project is ready for you to start creating script files, adding code, and importing your data.
File > New File will produce a list of all the programming languages that RStudio can handle, but the most commonly used is the R Script. Once you’ve created the new script, File > Save As will let you save it. The default location will be your project subdirectory, although you can save it outside this folder as well (not recommended).
In the screenshot above, you can see I created the project and script in a folder without any sub-folders. I added folders using the structure recommended in the previous workshop and moved my new script into the RCode location afterwards.
If you have experience with other programming languages, you’re probably familiar with comments. Comments are bits of ‘code’ that aren’t run as code; instead, they’re displayed beside the code as commentary for anyone looking at the code. For an example from another of my projects, see the screenshot below.
In RStudio, comments will automatically display in green (unless you change the default settings) and code will display in blue and black. Comments can be used to provide additional information or guidance. For instance, in line 32 the table() command is used to organize responses to Q42 into a frequency table, and there is a comment afterwards with the text of Q42 and the list of response options.
It’s a good idea to provide comments, but there are no hard-and-fast rules about what comments should be included or not. In the example above, I provided the text for survey items that generated the data, but I didn’t provide an explanation of the table() command, or of the grep() command that I use in lines 49 and onward, although I might end up doing so later when I share this file with others. That said, it’s usually a good idea to include a few comments including your name and the date you worked on the code. I like to include things (like the survey item text above) that I would otherwise have to spend time looking up. I also use comments to help organize my code, which I talk about more in the next section.
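As an illustration of the points above, here is a minimal sketch of how the top of a script might look. The name, date, and survey item text are placeholders made up for this example, and the object `responses` is a hypothetical stand-in for a real data column.

```r
# Author:  Your Name (placeholder)
# Date:    YYYY-MM-DD (placeholder)
# Purpose: frequency table for one survey item

# Setup ----
# In RStudio, a comment line ending in four or more dashes marks a
# collapsible code section

# Hypothetical responses standing in for a real data column
responses <- c("Agree", "Disagree", "Agree", "Agree")

# Item text (placeholder): "I enjoy learning about science."
table(responses)  # counts how many times each response appears
```

Anything after a `#` on a line is ignored when the code runs, so comments can sit on their own line or at the end of a line of code.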
Comments can be used to help make code easier to interpret, but they can also be used to present it more tidily. In the example above, I use two ‘tricks’ to organize my code that you might find useful:
As you can see, when the sections are collapsed, the —— lines are replaced by a purple arrow graphic to further indicate there is code hidden. I like code sections because they help prevent overload and confusion, and allow for a quick review of the entire script and what’s included. Your mileage may vary, but I encourage you to give them a try! When you’re working in RStudio, there are some helpful shortcuts that you can use to add, collapse, and expand code sections easily:
Once you have your script file created and your basic information annotated, the next step is importing your data. In the sections below, we discuss how to import from CSV, Excel, and SPSS files. This is where we start talking about actual code, but it’s nothing too complicated!
If you’re using the folder structure recommended by earlier workshops, your Files tab should look something like this:
The raw data we’ll be importing is saved to the ‘Data’ folder, the script file is saved to the ‘RCode’ folder. Now that we’ve established this, we can start writing the code.
The best filetype for data storage is CSV. It’s extremely basic, and so even huge datasets stay pretty small, and there are no extra bits and bobs to confuse R or any other programs when you open it. It’s best practice not to edit your CSV file beforehand either – just download the file from Qualtrics (or any other survey software) and import it into R. That way, any data cleaning decisions are saved in your code, both for your future records (so you know what you did) and your future sanity (so you can double-check for mistakes or the like).
The first command we’ll use is the read.csv() command. Let’s give it a shot:
data <- read.csv(file="Data/workshop_data_toy.csv")
In this case, we’ve told R to run the read.csv command on the file ‘workshop_data_toy.csv’ in the ‘Data’ folder of our working directory, and then save the result as an object (in this case, a dataframe) in our workspace, titled ‘data’. Maybe you don’t think ‘data’ is a very descriptive name for an object, and want to label yours as ‘raw_data_toy’. In that case, your code would look like this:
raw_data_toy <- read.csv(file="Data/workshop_data_toy.csv")
(You can call your dataframe whatever you want! I’m going to call mine ‘data’, though, because I’m lazy.)
Once you’ve run this code, you can click the ‘data’ object that’s appeared in your Environment tab, and tada! There’s your data. But if you look closely, you’ll quickly notice an issue:
The topmost row of your CSV file has become the header of your new dataframe. In this case, this means your header consists of long strings that would be unwieldy to use in your scripting. What to do?
You could open your CSV in Excel, delete the top row, save it, and re-import it in R. This probably seems easier, especially at first when you’re unfamiliar with R. But it isn’t the best way to proceed!
The read.csv() command actually gives you the tool you need to fix this problem through the ‘skip’ argument. This argument tells the command ‘the number of lines of the data file to skip before beginning to read data’. If we give it a try:
data <- read.csv(file="Data/workshop_data_toy.csv", skip = 1)
Much better! If we look through our data a bit more, however, we notice a second potential issue in row 19 (aka Stu.ID 1019). There are a bunch of blank cells! It looks like the participant left some questions unanswered. They aren’t the only one – Stu.ID 1012 didn’t respond to the ‘Intent’ item, and 1008 didn’t include their age. However, in these cases there’s a greyed out NA instead of a blank cell. What’s up?
What’s up is that R isn’t recognizing 1019’s missing responses as missing. When R recognizes a cell as missing, it will always display the greyed-out NA. If anything else is displayed (even a cell that looks blank), then R is assuming that something is there. Let’s look at the output from the table() command, a useful command that creates frequency tables of the indicated columns:
table(data$Q1, useNA = "always")
##
## Agree Disagree
## 1 10 4
## Prefer not to respond Somewhat agree Somewhat disagree
## 2 1 5
## Strongly Agree <NA>
## 1 0
To translate: this line of code is telling R to use the table() command on the ‘Q1’ column of the ‘data’ dataframe, and to always display NAs, or cells with missing data (we will talk more about calling objects and what’s up with the ‘data$Q1’ bit later). As you can see from the output displayed above (it will show in your Console window if you run the code on your own in RStudio), the first response option is a blank line, with only one case. It’s followed by the Agree, Disagree, and Prefer not to respond options, which have 10, 4, and 2 cases, respectively.
If we were to run an analysis on the data as it is, it would consider ‘blank’ as valid a response as ‘Agree’, and that’s not what we want. So we need to tweak our import code so that R correctly identifies these blank cells as NAs, and displays them in the <NA> column of the table() output.
This might seem an overwhelming task, but the solution is simple. Because there are different ways of indicating missing data – some survey software uses values like 99, Excel uses #N/A, and so on – the read.csv() command has an argument in which you can specify which values it should treat as NAs. Like this:
data <- read.csv(file="Data/workshop_data_toy.csv", skip = 1, na.strings = c("#N/A",""," "))
This code is identical to the import command we used earlier, but the added argument tells R to mark any cells with the provided values as NAs. When I import CSVs, I always include the ‘na.strings’ argument as written here, even when I don’t expect any issues, just to avoid tripping over them later.
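If you’d like to see the ‘skip’ and ‘na.strings’ arguments at work without touching the workshop file, the sketch below builds a tiny throwaway CSV and imports it. The file contents and the object name `toy` are invented for illustration; only the read.csv() arguments mirror the code above.

```r
# Write a small temporary CSV (made-up values, not the workshop data)
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Student identifier,Please rate your agreement",  # long survey-text row to skip
  "Stu.ID,Q1",                                      # the row we want as the header
  "1001,Agree",
  "1002,",                                          # empty cell
  "1003, ",                                         # cell holding a single space
  "1004,#N/A"                                       # Excel-style missing value
), tmp)

# skip = 1 drops the first line; na.strings lists the values to treat as missing
toy <- read.csv(file = tmp, skip = 1, na.strings = c("#N/A", "", " "))

colnames(toy)                    # "Stu.ID" "Q1"
table(toy$Q1, useNA = "always")  # one "Agree" and three <NA>
```

Try removing `" "` from the na.strings list and re-running the import: the cell with a single space should then show up as its own (blank-looking) category rather than as NA.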
Did this fix the issue with Stu.ID 1019? If not, what additional values could you add to the list to catch any remaining import errors? Note: make sure your additions to the list are enclosed in quotes and separated by commas, as displayed above. Otherwise R will assume they’re all one long value or spit out an error!
Once your data is in R you need to understand how R sees it. Just because you know a question has categorical or continuous data does not mean R sees it this way. There are multiple ways to understand how R sees your data. We share two below.
Now that you’ve got your data into R, let’s get some practice looking at it. There are a few ways to view a dataframe object like our new data file. One is to click it in the Environment tab – it will open in a new window that can do some useful stuff (e.g., sort by columns or filter). Another way is to use the head() command (and its opposite, the tail() command).
head(data)
## Stu.ID Intent Q1 Q2 Q3
## 1 1001 8 Disagree Disagree Disagree
## 2 1002 6 Prefer not to respond Disagree Agree
## 3 1003 5 Somewhat disagree Somewhat disagree Somewhat disagree
## 4 1004 9 Agree Agree Strongly Agree
## 5 1005 7 Strongly Agree Strongly Agree Strongly Agree
## 6 1006 8 Disagree Disagree Disagree
## Q4 Q5 Gender Par1.Educ
## 1 Agree Agree Agender High School
## 2 Agree Agree Female &/or Feminine &/or Woman High School
## 3 Agree Agree Female &/or Feminine &/or Woman High School
## 4 Strongly Agree Strongly Agree NA Bachelor's
## 5 Strongly Agree Strongly Agree Female &/or Feminine &/or Woman High School
## 6 Somewhat agree Somewhat agree Male &/or Masculine &/or Man Bachelor's
## Par2.Educ Age Rac.Eth
## 1 Some College 18 White
## 2 High School 19 Middle Eastern or North African, White
## 3 High School 25 I prefer to identify as [Columbian]
## 4 PhD 16 White
## 5 Not applicable 19 White
## 6 PhD 18 Black or African American
tail(data)
## Stu.ID Intent Q1 Q2
## 19 1019 9 <NA> <NA>
## 20 1020 8 Prefer not to respond Prefer not to respond
## 21 1021 5 Agree Agree
## 22 1022 5 Disagree Disagree
## 23 1023 7 Agree Agree
## 24 1024 6 Agree Agree
## Q3 Q4 Q5
## 19 <NA> <NA> <NA>
## 20 Prefer not to respond Prefer not to respond Prefer not to respond
## 21 Strongly Agree Strongly Agree Strongly Agree
## 22 Disagree Somewhat disagree Somewhat disagree
## 23 Strongly Agree Strongly Agree Strongly Agree
## 24 Strongly Agree Strongly Agree Strongly Agree
## Gender Par1.Educ Par2.Educ Age
## 19 NA Bachelor's Bachelor's 18
## 20 NA NA NA 18
## 21 Male &/or Masculine &/or Man High School High School 30
## 22 Female &/or Feminine &/or Woman High School High School 19
## 23 I don't understand the question Bachelor's Bachelor's 18
## 24 Male &/or Masculine &/or Man Bachelor's Bachelor's 18
## Rac.Eth
## 19 Black or African American
## 20 Black or African American
## 21 Hispanic; Latino; or Spanish origin
## 22 Asian
## 23 Asian
## 24 White
As you can see, these commands show you the first and last six rows of the specified dataframe. This is often a quick and easy way to see how your data looks without troubling to open it in a new window.
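Both commands also accept an optional `n` argument if six rows is more or less than you want. The sketch below uses `mtcars`, a small dataset that ships with R, rather than the workshop data, so it should run in any R session.

```r
first_three <- head(mtcars, n = 3)  # first three rows
last_two <- tail(mtcars, n = 2)     # last two rows

nrow(first_three)  # 3
nrow(last_two)     # 2
```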
Another useful command is the colnames() command. This shows you a list of all the columns in a dataframe, and can be very helpful if you have a lot of variables and need to see how one is spelled. But more than that, it can be useful if your current column names aren’t serving your purposes. For instance, maybe we want our column names to be lowercase:
colnames(data) # see what the current column names are
## [1] "Stu.ID" "Intent" "Q1" "Q2" "Q3" "Q4"
## [7] "Q5" "Gender" "Par1.Educ" "Par2.Educ" "Age" "Rac.Eth"
old_columns <- colnames(data) # save the old column names in a new object
new_columns <- c("stu.id","intent","q1","q2","q3","q4","q5","gender","par1.educ","par2.educ","age","race.eth") # create the new column names and place them in a vector
colnames(data) <- new_columns # apply the new column names to the dataframe
colnames(data) # see the new column names
## [1] "stu.id" "intent" "q1" "q2" "q3" "q4"
## [7] "q5" "gender" "par1.educ" "par2.educ" "age" "race.eth"
colnames(data) <- old_columns # apply the old column names to the dataframe
colnames(data) # confirm that the old column names are back
## [1] "Stu.ID" "Intent" "Q1" "Q2" "Q3" "Q4"
## [7] "Q5" "Gender" "Par1.Educ" "Par2.Educ" "Age" "Rac.Eth"
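If the only change you want is lowercase names, typing out every name by hand isn’t strictly necessary. As a minimal sketch, the base R function tolower() can be applied to the existing names; the small dataframe `toy` below is made up for illustration.

```r
# A made-up dataframe with mixed-case column names
toy <- data.frame(Stu.ID = c(1001, 1002),
                  Intent = c(8, 6),
                  Rac.Eth = c("White", "Asian"))

# Replace the column names with lowercase versions of themselves
colnames(toy) <- tolower(colnames(toy))
colnames(toy)  # "stu.id" "intent" "rac.eth"
```

Unlike the hand-typed vector, this approach only changes the case, so it can’t fix a name you also wanted to respell.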
Sometimes you need to view or manipulate only part of an object, such as a single variable in a dataset. To do this you use the $ operator. If we wanted to preview the first six lines of the Q1 column in our data, we could use the code below.
head(data$Q1)
## [1] "Disagree" "Prefer not to respond" "Somewhat disagree"
## [4] "Agree" "Strongly Agree" "Disagree"
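For the curious, `$` is one of a few equivalent ways to pull a single column out of a dataframe. The sketch below uses a made-up dataframe called `toy` to show that the double-bracket forms return the same thing.

```r
toy <- data.frame(Stu.ID = c(1001, 1002, 1003),
                  Q1 = c("Agree", "Disagree", "Agree"))

by_dollar <- toy$Q1        # column by name, using $
by_name <- toy[["Q1"]]     # column by name, using double brackets
by_position <- toy[[2]]    # column by position (second column)

identical(by_dollar, by_name)      # TRUE
identical(by_dollar, by_position)  # TRUE
```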
Last but not least, you can use your new understanding of data structures in R to subset your data. Take our imported data, for instance. Suppose we want to use some of the variables (Stu.ID, Intent, and Age) but not the others, and find that they’re cluttering things up. Subsetting the current dataframe will let us create a new dataframe with the specified information without altering the previous dataframe.
data2 <- subset(data, select=c(Stu.ID,Intent,Age))
head(data2)
## Stu.ID Intent Age
## 1 1001 8 18
## 2 1002 6 19
## 3 1003 5 25
## 4 1004 9 16
## 5 1005 7 19
## 6 1006 8 18
Note that you can refer to columns by their names or their position in the dataframe (although this can backfire if you change your dataframe and re-run old code without updating it, so it’s not recommended). The code below does the same thing as the code above, but refers to column numbers instead of names.
data2 <- subset(data, select=c(1,2,11))
head(data2)
## Stu.ID Intent Age
## 1 1001 8 18
## 2 1002 6 19
## 3 1003 5 25
## 4 1004 9 16
## 5 1005 7 19
## 6 1006 8 18
Perhaps we want to filter our dataframe so that only 19 year olds are included and we only see the columns relating to IDs, intent, and age.
data3 <- subset(data, select=c(Stu.ID,Intent,Age), Age == "19")
head(data3)
## Stu.ID Intent Age
## 2 1002 6 19
## 5 1005 7 19
## 10 1010 9 19
## 12 1012 NA 19
## 15 1015 10-Definitely will 19
## 18 1018 7 19
Or maybe we want to filter only to those over 19 and then get rid of the age column altogether, leaving the others untouched?
data4 <- subset(data, select=-c(Age), Age > "19")
head(data4)
## Stu.ID Intent Q1 Q2 Q3
## 3 1003 5 Somewhat disagree Somewhat disagree Somewhat disagree
## 8 1008 7 Somewhat agree Somewhat agree Agree
## 9 1009 8 Agree Agree Agree
## 13 1013 5 Disagree Disagree Disagree
## 21 1021 5 Agree Agree Strongly Agree
## Q4 Q5
## 3 Agree Agree
## 8 Agree Agree
## 9 Agree Agree
## 13 Somewhat disagree Somewhat disagree
## 21 Strongly Agree Strongly Agree
## Gender Par1.Educ Par2.Educ
## 3 Female &/or Feminine &/or Woman High School High School
## 8 Male &/or Masculine &/or Man, Transgender Bachelor's Bachelor's
## 9 Genderfluid PhD Bachelor's
## 13 Female &/or Feminine &/or Woman I don't know Bachelor's
## 21 Male &/or Masculine &/or Man High School High School
## Rac.Eth
## 3 I prefer to identify as [Columbian]
## 8 Hispanic; Latino; or Spanish origin
## 9 Hispanic; Latino; or Spanish origin
## 13 White
## 21 Hispanic; Latino; or Spanish origin
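One thing worth knowing about the filters above: the Age values are wrapped in quotes, so R compares them as text, character by character, rather than as numbers. The sketch below uses a made-up vector of ages to show where the two kinds of comparison agree and where they part ways.

```r
# Made-up ages stored as text, the way a character column would hold them
ages <- c("18", "19", "25", "9", "100")

# Text comparison: characters are compared one at a time from the left
text_result <- ages > "19"
text_result  # FALSE FALSE TRUE TRUE FALSE

# Numeric comparison after converting the text to numbers
number_result <- as.numeric(ages) > 19
number_result  # FALSE FALSE TRUE FALSE TRUE
```

For two-digit ages the two approaches give the same answer, but "9" and "100" are sorted differently as text, so it’s worth checking how a column is stored before filtering on it.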
ON YOUR OWN. If you look at the data, you’ll notice we have a participant under the age of 18, which means we can’t use their data without their guardian’s permission, which we didn’t collect. Try subsetting your data to drop those below the age of 18 and see what happens!
One way to understand how R is reading your data is the str() (structure) command.
str(data)
## 'data.frame': 24 obs. of 12 variables:
## $ Stu.ID : int 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 ...
## $ Intent : chr "8" "6" "5" "9" ...
## $ Q1 : chr "Disagree" "Prefer not to respond" "Somewhat disagree" "Agree" ...
## $ Q2 : chr "Disagree" "Disagree" "Somewhat disagree" "Agree" ...
## $ Q3 : chr "Disagree" "Agree" "Somewhat disagree" "Strongly Agree" ...
## $ Q4 : chr "Agree" "Agree" "Agree" "Strongly Agree" ...
## $ Q5 : chr "Agree" "Agree" "Agree" "Strongly Agree" ...
## $ Gender : chr "Agender" "Female &/or Feminine &/or Woman " "Female &/or Feminine &/or Woman " "NA" ...
## $ Par1.Educ: chr "High School" "High School" "High School" "Bachelor's" ...
## $ Par2.Educ: chr "Some College" "High School" "High School" "PhD" ...
## $ Age : chr "18" "19" "25" "16" ...
## $ Rac.Eth : chr "White" "Middle Eastern or North African, White" "I prefer to identify as [Columbian]" "White" ...
It’s similar to the head() command, but provides two additional bits of information. Between the name of each column (e.g., Stu.ID) and the excerpt of the data in that column (e.g., 1001 1002 1003, etc) is a short code that tells you what type of data is in the column, or what type of variable it is. The Stu.ID variable is an integer type variable, while the remaining variables are character types. The most commonly used types are:
Another way to understand how R is reading your data is the summary command. This function allows you to look at a summary of each column of your data. For numeric data you get to see the spread of the data and for categorical data you get how many individuals are in each category. Character data just returns the class of the data. Generally when you see character data you will want to convert it to categorical data (unless you have other plans for that data).
summary(data)
## Stu.ID Intent Q1 Q2
## Min. :1001 Length:24 Length:24 Length:24
## 1st Qu.:1007 Class :character Class :character Class :character
## Median :1012 Mode :character Mode :character Mode :character
## Mean :1012
## 3rd Qu.:1018
## Max. :1024
## Q3 Q4 Q5 Gender
## Length:24 Length:24 Length:24 Length:24
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Par1.Educ Par2.Educ Age Rac.Eth
## Length:24 Length:24 Length:24 Length:24
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
You might find that your data has been imported as one type (e.g., character) when you want it to be something else (e.g., integer). There’s a very easy way to tell R to convert it from one type to the other!
data$Gender <- as.factor(data$Gender) # recode gender variable from character to factor
This code overwrites the Gender column in your existing dataframe, converting it from character data to a factor. If you rerun the summary() or str() commands you can see the change. Instead of Class :character, you should see a list of the categories for the column you converted, as well as how many participants chose each category.
Some columns may have so many categories that R doesn’t show them all when you run a global summary. You can tell levels are missing if you see (Other) among your factors (for an example of this look at Gender). To see all the levels you can run a summary() command on a particular column and you will get all the levels of the factor.
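Converting character data to numbers works the same way, with one catch: any value that isn’t a clean number turns into NA. The sketch below uses a made-up vector modeled on the Intent column (which includes the response ‘10-Definitely will’); the workaround shown is just one possible approach, and whether it’s appropriate depends on your data.

```r
# Made-up values modeled on the Intent column
intent <- c("8", "6", "10-Definitely will", NA)

# Direct conversion: text that isn't a clean number becomes NA
# (R normally prints a warning here; it is silenced to keep the output tidy)
intent_direct <- suppressWarnings(as.integer(intent))
intent_direct  # 8 6 NA NA

# One possible workaround: remove everything from the first hyphen onward,
# then convert what is left
intent_clean <- as.integer(sub("-.*$", "", intent))
intent_clean  # 8 6 10 NA
```

Saving the result under a new name, as done here, leaves the original values untouched in case the conversion doesn’t turn out the way you expected.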
summary(data$Gender)
## Agender
## 1
## Agender, Female &/or Feminine &/or Woman
## 1
## female
## 1
## Female &/or Feminine &/or Woman
## 9
## Genderfluid
## 1
## I don't understand the question
## 1
## Male &/or Masculine &/or Man
## 6
## Male &/or Masculine &/or Man, Transgender
## 1
## NA
## 3
To change multiple columns at the same time you can use the following code, which uses the lapply() function. The lapply() function is powerful but is also a bit beyond this workshop, so we won’t go into depth, but the code is presented here for your use. Instead of naming each column you want it to look at, you can just tell it the column numbers. To get the column numbers you can use the names() command, which prints the name of each column in order; the bracketed number at the start of each output line gives the position of the first name on that line.
names(data)
## [1] "Stu.ID" "Intent" "Q1" "Q2" "Q3" "Q4"
## [7] "Q5" "Gender" "Par1.Educ" "Par2.Educ" "Age" "Rac.Eth"
data[,3:7] <- lapply(data[,3:7],as.factor)
With this command we are telling R to make columns 3 through 7 (questions Q1 through Q5) factors. You can run the summary() command again to see your data.
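The same approach also works with column names instead of column numbers, which avoids the problem of positions shifting if the dataframe changes later. This is a minimal sketch on a made-up dataframe; `toy` and `likert_cols` are names invented for the example.

```r
# A made-up dataframe with two text columns
toy <- data.frame(
  id = 1:3,
  q1 = c("Agree", "Disagree", "Agree"),
  q2 = c("Agree", "Agree", "Disagree"),
  stringsAsFactors = FALSE
)

# Name the columns to convert, then apply as.factor() to each of them
likert_cols <- c("q1", "q2")
toy[likert_cols] <- lapply(toy[likert_cols], as.factor)

# Check the type of every column
sapply(toy, class)
```

The last line should report that ‘id’ is still an integer while ‘q1’ and ‘q2’ are now factors.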
ON YOUR OWN: Gender isn’t the only variable that would be better as a factor. What about Par1.Educ? Try converting it on your own and see how it goes!
Now that you have your data visualized, do you see any mistakes in the data that need to be fixed? Or do you see where further processing of the data might be necessary to reach the research goals?
There are numerous small things that go into cleaning your data. A few of the most important ones are included in the section below.
You might have noticed in your reflection on the data that the Gender column contains one entry for ‘female’. Check it out using the summary() command.
summary(data$Gender)
## Agender
## 1
## Agender, Female &/or Feminine &/or Woman
## 1
## female
## 1
## Female &/or Feminine &/or Woman
## 9
## Genderfluid
## 1
## I don't understand the question
## 1
## Male &/or Masculine &/or Man
## 6
## Male &/or Masculine &/or Man, Transgender
## 1
## NA
## 3
This entry should be combined with ‘Female &/or Feminine &/or Woman’. Notice also that there is a space after the ‘n’ in ‘Woman’. These little differences can be the most challenging to diagnose. We will first remove all leading and trailing spaces (undesired spaces at the beginning or ending of a string) and then recode the variable.
The best way to remove leading and trailing spaces is one column at a time. In the code below, we use the trimws() command to remove the spaces and create a new column from the result. However, upon viewing the result we can see that the new column has the data type ‘character’, and so we need to change it to factor again. And ta-da! No more unwanted spaces.
data$Gender_recode <- trimws(data$Gender, which = c("both"))
data$Gender_recode <- as.factor(data$Gender_recode)
summary(data$Gender_recode)
## Agender
## 1
## Agender, Female &/or Feminine &/or Woman
## 1
## female
## 1
## Female &/or Feminine &/or Woman
## 9
## Genderfluid
## 1
## I don't understand the question
## 1
## Male &/or Masculine &/or Man
## 6
## Male &/or Masculine &/or Man, Transgender
## 1
## NA
## 3
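Because stray spaces are invisible in most output, it can help to see their effect on a small example. In this sketch the vector `gender` is made up; it holds the same word three times, with a trailing space, a leading space, and no extra space.

```r
# The same response typed three slightly different ways
gender <- c("Woman ", " Woman", "Woman")

# Before trimming, R treats these as three different values
length(unique(gender))          # 3

# After trimming, they collapse into a single value
length(unique(trimws(gender)))  # 1
```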
The dplyr package is a popular and powerful package that you’ll probably use often. The recode() command from that package is straightforward and easy to use. Here, we use it to edit the data so that ‘female’ is added to the larger ‘Female &/or Feminine &/or Woman’ category and the result is saved as a new variable.
install.packages("dplyr") # only needs to be run once per computer; you can skip this line afterwards
library(dplyr)
data$Gender_recode2 <- recode(data$Gender_recode, female = "Female &/or Feminine &/or Woman")
summary(data$Gender_recode2)
## Agender
## 1
## Agender, Female &/or Feminine &/or Woman
## 1
## Female &/or Feminine &/or Woman
## 10
## Genderfluid
## 1
## I don't understand the question
## 1
## Male &/or Masculine &/or Man
## 6
## Male &/or Masculine &/or Man, Transgender
## 1
## NA
## 3
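If dplyr isn’t available, a similar merge can be done in base R by renaming a factor level to match an existing one. This is an alternative technique, not the recode() approach used above, and the factor `gender` below is made up for illustration.

```r
# A made-up factor with an inconsistent 'female' entry
gender <- factor(c("female", "Female &/or Feminine &/or Woman", "Agender"))

# Rename the 'female' level; because the new name already exists,
# R merges the two levels into one
levels(gender)[levels(gender) == "female"] <- "Female &/or Feminine &/or Woman"

nlevels(gender)  # 2
table(gender)
```

Note that this changes `gender` in place, whereas the recode() example saves its result as a new variable; assign to a copy first if you want to keep the original.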
ON YOUR OWN: There’s a lot you can do with the recode command! Try checking out the Rac.Eth data. One of the written-in responses is ‘I prefer to identify as [black]’. This probably could be combined with the ‘Black or African American’ category – try writing your own code to make that happen.
Often with Likert scale data you want to convert phrases to numbers (Strongly Disagree to 1, for example) so you can treat the variable as numeric or ordinal. To do this we will use the same command as when we were cleaning up the factors: recode(). This time we enter multiple levels that we want to recode. Any levels that aren’t recoded (such as ‘Prefer not to respond’) will automatically be recoded as NA (R may print a warning about unreplaced values when this happens).
Last, we use the table() command to make sure that the recoded variable and original variable line up. For instance, we can see that there were 10 participants who selected ‘Agree’ in the original variable, and all 10 of them have a score of ‘5’ in the new variable, suggesting that the recoding worked and everyone was assigned the correct value.
library(dplyr)
summary(data$Q1)
data$Q1_recode <- recode(data$Q1,
                         "Strongly disagree" = 1,
                         "Disagree" = 2,
                         "Somewhat disagree" = 3,
                         "Somewhat agree" = 4,
                         "Agree" = 5,
                         "Strongly Agree" = 6)
summary(data$Q1_recode)
table(data$Q1, data$Q1_recode)
##                 Agree              Disagree Prefer not to respond
##                    10                     4                     2
##        Somewhat agree     Somewhat disagree        Strongly Agree
##                     1                     5                     1
##                  NA's
##                     1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##   2.000   3.000   5.000   3.952   5.000   6.000       3
##
##                          2  3  4  5  6
##   Agree                  0  0  0 10  0
##   Disagree               4  0  0  0  0
##   Prefer not to respond  0  0  0  0  0
##   Somewhat agree         0  0  1  0  0
##   Somewhat disagree      0  5  0  0  0
##   Strongly Agree         0  0  0  0  1
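Notice that the table has no column for NA, so the ‘Prefer not to respond’ row shows only zeros. The sketch below uses made-up responses (the names `q`, `scores` and `q_num` are hypothetical) and a base R lookup in place of recode(), then asks table() to show the missing values as well.

```r
# Hypothetical toy Likert responses (not the workshop data)
q <- factor(c("Agree", "Disagree", "Prefer not to respond", "Agree", NA))

# Named vector: response text -> score
scores <- c("Strongly disagree" = 1, "Disagree" = 2, "Somewhat disagree" = 3,
            "Somewhat agree" = 4, "Agree" = 5, "Strongly Agree" = 6)

# Look up each response by its text; as.character() matters here because
# indexing with a factor would use its internal integer codes instead.
# Anything that is not a name in `scores` comes back as NA.
q_num <- unname(scores[as.character(q)])

# useNA = "ifany" adds an NA row and column so unmatched responses stay visible
table(q, q_num, useNA = "ifany")
```

The same useNA = "ifany" argument can be added to the table() call in the main example if you want to see where the NA values ended up.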
ON YOUR OWN: Look at the Rac.Eth data. One of the write-in responses is ‘I prefer to identify as [black]’. This could probably be combined with the ‘Black or African American’ category. Try recoding it on your own using the code examples above and see how it goes!
The most straightforward, but also the most time-consuming, way to recode multiple variables is to use the above code and switch out the variable names. For instance, to recode column ‘Q2’ you could duplicate the above code, replace all instances of ‘Q1’ with ‘Q2’, and run it again. See below:
library(dplyr)
summary(data$Q2)
data$Q2_recode <- recode(data$Q2,
                         "Strongly disagree" = 1,
                         "Disagree" = 2,
                         "Somewhat disagree" = 3,
                         "Somewhat agree" = 4,
                         "Agree" = 5,
                         "Strongly Agree" = 6)
summary(data$Q2_recode)
table(data$Q2, data$Q2_recode)
The other option is to use more complicated code. The code below will work, but we won’t be explaining it in depth in the current workshop. Remember to check your recoding using the table() command and you should be safe, but if in doubt, use the lengthier but easier-to-understand script.
library(dplyr)
# Note: funs() is deprecated in newer versions of dplyr; a ~ formula does the same job
data2 <- data %>%
  transmute_at(c("Q2","Q3","Q4","Q5"), ~ recode(., "Strongly disagree" = 1,
                                                "Disagree" = 2,
                                                "Somewhat disagree" = 3,
                                                "Somewhat agree" = 4,
                                                "Agree" = 5,
                                                "Strongly Agree" = 6))
colnames(data2) <- paste0(colnames(data2),"_recode")
data <- cbind.data.frame(data, data2)
table(data$Q2, data$Q2_recode)
##
##                          2  3  4  5  6
##   Agree                  0  0  0 10  0
##   Disagree               6  0  0  0  0
##   Prefer not to respond  0  0  0  0  0
##   Somewhat agree         0  0  1  0  0
##   Somewhat disagree      0  4  0  0  0
##   Strongly Agree         0  0  0  0  1
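For comparison, here is a rough base R sketch of the same idea: apply one recoding rule to several columns and store the results in new columns ending in "_recode". It is an alternative to the dplyr code above, not a replacement for it, and the data frame `toy` and the helper `to_score()` are made up for illustration.

```r
# Hypothetical toy data frame standing in for the survey data
toy <- data.frame(
  Q2 = c("Agree", "Disagree", "Prefer not to respond"),
  Q3 = c("Strongly Agree", "Somewhat agree", "Agree"),
  stringsAsFactors = TRUE
)

# Named vector: response text -> score
scores <- c("Strongly disagree" = 1, "Disagree" = 2, "Somewhat disagree" = 3,
            "Somewhat agree" = 4, "Agree" = 5, "Strongly Agree" = 6)

# Hypothetical helper: turn one column of responses into scores
# (responses that are not in `scores` become NA)
to_score <- function(x) unname(scores[as.character(x)])

# Apply the helper to each listed column and save the results as new columns
cols <- c("Q2", "Q3")
toy[paste0(cols, "_recode")] <- lapply(toy[cols], to_score)

# Check one of the recoded columns against its original
table(toy$Q2, toy$Q2_recode, useNA = "ifany")
```

Whichever approach you use, the table() check at the end is the important part.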
One of the nice things about converting a string-type variable into a numeric variable, as we did in the last step, is that you can plot it! The code below will create a simple histogram to help you visualize the distribution of your data.
hist(data$Q1_recode, breaks = 5)
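One thing to watch for: with whole-number scores, hist() can place two neighbouring scores in the same bar, because the first bin includes both of its endpoints. The sketch below uses made-up scores (the vector `x` is hypothetical) and shows two ways to give every score its own bar.

```r
# Hypothetical recoded scores on a 1-6 scale (not the workshop data)
x <- c(2, 2, 3, 5, 5, 5, 6, NA)

# Breaks at the half-points give each whole-number score its own bar
h <- hist(x, breaks = seq(0.5, 6.5, by = 1),
          main = "Toy Likert scores", xlab = "Score")

# The counts behind the bars, one per score from 1 through 6
h$counts

# A bar chart of the counts is another option for discrete scores;
# factor(..., levels = 1:6) keeps empty scores on the axis
barplot(table(factor(x, levels = 1:6)), xlab = "Score", ylab = "Count")
```

Either plot works; the bar chart is often easier to read when there are only a handful of possible values.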
There are some other variables that would benefit from similar treatment. On your own, see if you can make the ‘Intent’ variable numeric and visualize it in a histogram.
The self-efficacy questions come from two scales: achievement (Q1, Q2, Q5) and mastery (Q3 & Q4). A two-question factor is really not great, but we will use it just as an example. The researchers don’t care about the individual questions but want a factor score for each scale, calculated by averaging each student’s responses on that scale. To do this they need to make a new column for the factor score and fill that column in with the average of the individual items. This is actually really easy! Note that a student who is missing a score on any of the three items will get NA for the scale.
data$Achieve <- (data$Q1_recode + data$Q2_recode + data$Q5_recode)/3
summary(data)
## Stu.ID Intent Q1
## Min. :1001 Length:24 Agree :10
## 1st Qu.:1007 Class :character Disagree : 4
## Median :1012 Mode :character Prefer not to respond: 2
## Mean :1012 Somewhat agree : 1
## 3rd Qu.:1018 Somewhat disagree : 5
## Max. :1024 Strongly Agree : 1
## NA's : 1
## Q2 Q3 Q4
## Agree :10 Agree :6 Agree :8
## Disagree : 6 Disagree :4 Disagree :1
## Prefer not to respond: 1 Prefer not to respond:2 Prefer not to respond:2
## Somewhat agree : 1 Somewhat disagree :4 Somewhat agree :1
## Somewhat disagree : 4 Strongly Agree :7 Somewhat disagree :3
## Strongly Agree : 1 NA's :1 Strongly Agree :8
## NA's : 1 NA's :1
## Q5 Gender
## Agree :8 Female &/or Feminine &/or Woman :9
## Disagree :2 Male &/or Masculine &/or Man :6
## Prefer not to respond:2 NA :3
## Somewhat agree :1 Agender :1
## Somewhat disagree :2 Agender, Female &/or Feminine &/or Woman:1
## Strongly Agree :8 female :1
## NA's :1 (Other) :3
## Par1.Educ Par2.Educ Age Rac.Eth
## Length:24 Length:24 Length:24 Length:24
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Gender_recode
## Female &/or Feminine &/or Woman :9
## Male &/or Masculine &/or Man :6
## NA :3
## Agender :1
## Agender, Female &/or Feminine &/or Woman:1
## female :1
## (Other) :3
## Gender_recode2 Q1_recode
## Female &/or Feminine &/or Woman :10 Min. :2.000
## Male &/or Masculine &/or Man : 6 1st Qu.:3.000
## NA : 3 Median :5.000
## Agender : 1 Mean :3.952
## Agender, Female &/or Feminine &/or Woman: 1 3rd Qu.:5.000
## Genderfluid : 1 Max. :6.000
## (Other) : 2 NA's :3
## Q2_recode Q3_recode Q4_recode Q5_recode
## Min. :2.000 Min. :2.000 Min. :2.000 Min. :2.000
## 1st Qu.:2.250 1st Qu.:3.000 1st Qu.:5.000 1st Qu.:5.000
## Median :4.500 Median :5.000 Median :5.000 Median :5.000
## Mean :3.818 Mean :4.381 Mean :4.905 Mean :4.857
## 3rd Qu.:5.000 3rd Qu.:6.000 3rd Qu.:6.000 3rd Qu.:6.000
## Max. :6.000 Max. :6.000 Max. :6.000 Max. :6.000
## NA's :2 NA's :3 NA's :3 NA's :3
## Achieve
## Min. :2.333
## 1st Qu.:2.917
## Median :5.000
## Mean :4.267
## 3rd Qu.:5.333
## Max. :6.000
## NA's :4
Ta-da!
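You may have noticed that Achieve has more NA values than any single item. That happens because adding columns together gives NA whenever one of the items is missing. The sketch below, using a made-up data frame `toy`, compares that approach with rowMeans(), which can average whichever items a student did answer. Whether that is appropriate is an analysis decision for your own project, so treat this as an illustration only.

```r
# Hypothetical recoded items for three students (not the workshop data)
toy <- data.frame(
  Q1_recode = c(5, 2, NA),
  Q2_recode = c(5, 3, 4),
  Q5_recode = c(6, NA, 4)
)

# Adding the columns: any missing item makes the whole score NA
toy$Achieve <- (toy$Q1_recode + toy$Q2_recode + toy$Q5_recode) / 3

# rowMeans() with na.rm = TRUE averages only the items that were answered
items <- c("Q1_recode", "Q2_recode", "Q5_recode")
toy$Achieve_avail <- rowMeans(toy[items], na.rm = TRUE)

# Compare the two versions side by side
toy[c("Achieve", "Achieve_avail")]
```

If a student skipped every item, rowMeans() with na.rm = TRUE returns NaN for that row rather than NA.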
ON YOUR OWN: Give it a try on your own – try to make a new column called Master with the average of the two mastery items (Q3 and Q4). Make sure you use the recoded items!
You made it! Whew! Regardless of how much of this code you were able to run, you have at least familiarized yourself with some of the language and the common commands for data cleaning. And you have seen some ways to think through the problems that arise in data cleaning.
Now we’d like you to go back and take a look at your data. What things do you need to do to get your data ready for analysis? What do you already have example code for? What do you still need code for?