Introduction (Read Me First)

Hello! This document was created as part of the CC Bio INSITES Quantitative Workshop #2, specifically as a guide for the pre-activity to be completed before our meeting on ?????. This guide has been written assuming that you have little to no familiarity with R, and so begins with a brief overview of the software installation. You shouldn’t feel pressured to memorize or even completely understand everything written here – this is designed as an introduction and a future resource, and as such can be a bit overwhelming at first.

[Add additional bit about “your goal is to get stuck” and link to worksheet.]

Introduction (Read Me First)
Step 1: Installing R and RStudio
Step 2: Getting Started
Step 3: Importing Data
- Step 3.1: Importing from CSV (Recommended)
Step 4: Strategize Data Cleaning
Step 5: Cleaning the Data

Step 1: Installing R and RStudio

Installing R can be unexpectedly complicated at first, as there are two programs from separate websites that need to be installed. To provide a bit of background for the curious: the original ‘R’ is a programming language (like Java, Python, C, etc.) that began development in the 90s. With the development of this programming language came the first computer program you’ll install, simply titled R, which is a bare-bones interface for writing and running code.

As R became more popular, the RStudio IDE (the second program you’ll install) was developed to build upon the original R and to make writing and running code a bit easier. RStudio doesn’t replace R – instead, it loads it in the background and interfaces between you and the R engine in order to provide some additional functionality. Therefore, in order to use RStudio, you have to download R first (note that you don’t have to run R before running RStudio, so you should only need one additional shortcut on your desktop).

With that information in mind, hopefully getting R and RStudio installed won’t be too confusing anymore! This page provides a good overview for PC and Mac users, but in case it isn’t available (or you would just prefer to stay on this page) we’ll also provide directions for Windows and Macs here. If you use Linux, the link also provides step-by-step directions for Ubuntu towards the bottom of the page.

Step 1.1: Installing on Windows PC

Installing R

Head over to CRAN to download the current version of R. The link you need is right on the front page, labeled ‘Download R for Windows’; from there, look for the bolded ‘install R for the first time’ link and click to download the software. You can save the downloaded EXE wherever you like; when run, it will install R into the default directory, which you shouldn’t need to alter.

Installing RStudio

The download page for RStudio makes it very easy to install RStudio for Windows. Hit the big blue button to get the installer, and then run it to get RStudio ready to use. After that, you’re done! The link provided above will walk you through some additional stuff, but we’ll also cover that on our own so you don’t have to worry about it now.

Step 1.2: Installing on Mac

Installing R

Head over to CRAN to download the current version of R. The link you need is right on the front page, labeled ‘Download R for (Mac) OS X’; from there, look for the section titled ‘Latest release’ and download the indicated PKG. When run, it will install R into the default directory that you shouldn’t need to alter.

Installing RStudio

On the download page for RStudio check out the section titled ‘All Installers’, and then locate the link for macOS. Save the indicated file and run; once again, you should accept the defaults (unless you’re familiar enough with your system to know why you shouldn’t). After that, you’re done! The link provided above will walk you through some additional stuff, but we’ll also cover that on our own so you don’t have to worry about it now.

Step 1.3: Run RStudio & Get Familiar With the Windows

Now that you’ve got everything installed, here’s a quick overview of what’s what in the program. When you open RStudio for the first time, you should see this:

Default setup of RStudio

The only thing that won’t match up is the contents of your ‘Files’ window, but that’s okay. Now for a quick breakdown of what the various windows are:

Default setup of RStudio, with some extra labels thrown in

1. Console

This window is where most of the output will be displayed, including errors and warnings. Everything in this window is text – if you create a plot, for instance, it will appear in the ‘Plots’ tab (in the same window as the ‘Files’ tab). The window does have a limit; if you run a bunch of code, or ask R to display something lengthy, some of your older code and output may get cut off as the window fills with new output. You can write code here, but it won’t be saved, so make sure you’re writing all of your code in the script window (#4, discussed more below).

2. Environment

This window lets you keep track of the objects you’re working with. If you import a dataset, that will be one object (usually a ‘dataframe’ that holds all of your rows and columns). If you run a regression and save the output, that will be another object (usually an ‘lm’, a collection of all the output produced by the test, such as coefficients, residuals, fitted values, and so on). Sometimes you can click objects in this window (e.g., a dataframe) and the object will be output into the console or it will open in a new window. There are some other tabs here as well, but they aren’t used often.
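To make the idea of ‘objects’ a bit more concrete, here is a minimal sketch you can run once RStudio is installed. The numbers and the names toy_scores and toy_fit are made up for illustration and are not part of the workshop data.

```r
# A small made-up dataframe (hypothetical values, not the workshop data)
toy_scores <- data.frame(
  hours = c(1, 2, 3, 4),
  grade = c(60, 70, 75, 90)
)

# Run a regression and save the output as a second object
toy_fit <- lm(grade ~ hours, data = toy_scores)

class(toy_scores) # "data.frame"
class(toy_fit)    # "lm"
names(toy_fit)    # the pieces stored inside the lm object (coefficients, residuals, ...)
```

After running these lines, both toy_scores and toy_fit should be listed in the Environment window.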

3. Files (and Plots, Packages, Help, and Viewer)

Lots of stuff can be displayed in the tabs of this window, but it’s not as interactive as the others.

The ‘Files’ tab shows you the folder structure of your current working directory (more on working directories in Step 2.1).
‘Plots’ shows you your most recently created plot, and has arrows to let you scroll through past plots or icons to help you save the plots to your computer.
‘Packages’ lists all of the packages available to your current installation of R (more on packages, what they are, and why they’re important later).
‘Help’ can be extremely useful, but it’s not as beginner-friendly as it sounds. If you’re new to R and working your way up the introductory learning curve, this window might leave you more confused than when you started. It’s most useful when you have a specific question about a command you’re using (for instance, you want to know what the summary() command does, or how to specify variables for an lm() command).
‘Viewer’ is used to display interactive output, which isn’t as common. For an example: some packages will let you create interactive tables (e.g., sort your variables by mean), and this output will display in this window.
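As a small sketch of how the ‘Help’ tab is usually reached from code (the commands chosen here are just examples):

```r
# Open the help page for a command; typing ?summary in the console does the same thing
help("summary")

# Show only the arguments a command accepts, without opening the full help page
args(lm)
```

In RStudio the help page opens in the ‘Help’ tab, while the output of args() appears in the Console.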

4. Code or Script Window

This is the window where you’ll spend most of your time, but it doesn’t show up automatically. This window only appears when you create a code file (File > New File) for writing and saving code. The files you create will be Untitled and unsaved, like other coding software or word processing programs you might be familiar with, so don’t forget to save! RStudio has a handy default setting that lets it save your workspace, specifically all of the code windows you have open and all of the objects in your environment, but sometimes there are issues so don’t rely on it too much. Note: when you save files, the save location will default to your current working directory.

Step 2: Getting Started

The best way to get familiar with R is to use it, so let’s get started. First step: creating a Project!

Step 2.1: Creating a Project

What is a Project (uppercase P)?

Projects are a useful way of organizing your work environment, working directory, and file structure. They’re especially useful if you’re working on multiple projects (lowercase p). For instance, you might have a dataset consisting of information about your students’ grades, science interest, science self-efficacy, and growth mindsets. One collaborator might be interested in analyzing the relationship between grades and mindsets, while another wants to examine the relationship between interest and self-efficacy. You can, of course, run both analyses from the same Project, but splitting them helps keep things organized. For instance, one project might limit the dataset to first-year students, while the other looks at the entire sample, or even requires you to combine your data with some collected by your collaborator. Keeping track of any such decisions you’ve made is easier when you keep the projects in separate Projects.

They also make switching between projects easier. If you’re going from a meeting with Collaborator 1 to Collaborator 2, all you have to do is pull up the second project and everything – your working directory, the scripts you had open, the objects in your environment, etc. – will all come up exactly as you left them last. Your working directory is the current location that R will use to output files, and so it’s easy to forget about until something goes wrong (e.g., you save something to the wrong location and can’t find it, or overwrite a file you didn’t intend to). So again, Projects help you avoid this hassle by setting your working directory and allowing you to focus on other things.
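If you’re ever unsure which working directory R is currently using, a few base R commands will tell you. This is only a sketch; the ‘Data’ folder name assumes the folder structure used elsewhere in this guide.

```r
getwd()             # print the current working directory
list.files()        # list the files and folders inside it
file.exists("Data") # TRUE if a folder (or file) named 'Data' exists here (assumed name)
```

Inside a Project, getwd() should return the Project’s folder without you having to set anything yourself.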

So how do you create a project?

In the top right corner of your screen is a little blue box and a label that says Project: (None) (assuming, of course, that you don’t have a project open: see the screenshot below for more information).

The project button (outlined in black) and the tab that appears when it’s clicked (outlined in red)

Selecting ‘New Project’ will open a new window, and the option to create a project in a new directory.

Create project window (select New Directory)

The next screen provides a list of potential project types. In this case, we want to select ‘New Project’ again.

The last screen lets you name your project and select the project’s working directory. You don’t need to create a new folder just for your new project! Whatever you input as the directory name will become the new folder for your project. If you created the folder structure recommended by the previous workshop, you can navigate to this location and select it as your new project subdirectory, or you can move your previously created folders into your new project location outside of R. I titled my project ‘Quant Workshop 2’.

Specify your directory and the name of the folder to be created for your new project

When complete, the project button will have the name of your new project, and the Files window will change to reflect your new working directory. If you look closely at the Files window in the screenshot below, you can see that the full directory for my new project is D:\Projects\Workshop 2.

Final view after creating new project

And you’re done! Your new project is ready for you to start creating script files, adding code, and importing your data.

Step 2.2: Creating a Script File

File > New File will produce a list of all the programming languages that RStudio can handle, but the most commonly used is the R Script. Once you’ve created the new script, File > Save As will let you save it. The default location will be your project subdirectory, although you can save it outside this folder as well (not recommended).

RStudio after creating a new script file and saving it

In the screenshot above, you can see I created the project and script in a folder without any sub-folders. I added folders using the structure recommended in the previous workshop and moved my new script into the RCode location afterwards.

Step 2.3: Annotating Your Code

If you have experience with other programming languages, you’re probably familiar with comments. Comments are bits of ‘code’ that aren’t run as code; instead, they’re displayed beside the code as commentary for anyone looking at the code. For an example from another of my projects, see the screenshot below.

Screenshot of commented code

In RStudio, comments will automatically display in green (unless you change the default settings) and code will display in blue and black. Comments can be used to provide additional information or guidance. For instance, in line 32 the table() command is used to organize responses to Q42 into a frequency table, and there is a comment afterwards with the text of Q42 and the list of response options.

It’s a good idea to provide comments, but there are no hard-and-fast rules about which comments should be included. In the example above, I provided the text for survey items that generated the data, but I didn’t provide an explanation of the table() command, or of the grep() command that I use in lines 49 and onward, although I might end up doing so later when I share this file with others. That said, it’s usually a good idea to include a few comments including your name and the date you worked on the code. I like to include things (like the survey item text above) that I would otherwise have to spend time looking up. I also use comments to help organize my code, which I talk about more in the next section.
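In case the screenshot isn’t available, here is a rough sketch of what a commented script might look like. In R, anything after a # on a line is treated as a comment. The name, date, item wording, and responses below are all placeholders invented for illustration.

```r
# Author: Your Name (placeholder)
# Date: YYYY-MM-DD (placeholder)
# Purpose: look at responses to a single survey item

# Q1 text (hypothetical wording): "I enjoy learning science."
# Response options: Strongly Agree, Agree, Somewhat agree,
#   Somewhat disagree, Disagree, Prefer not to respond
q1_responses <- c("Agree", "Disagree", "Agree") # made-up responses for illustration

table(q1_responses) # frequency table of the responses
```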

Step 2.4: Organizing Your Code (Optional)

Comments can be used to help make code easier to interpret, but they can also be used to present it more tidily. In the example above, I use two ‘tricks’ to organize my code that you might find useful:

Comment Lines. These are the ‘##########’ lines that span the window and chop the code into chunks. Nothing fancy, but they can have a huge impact.
Code Sections. These look like regular comments, but they have some additional functionality. See lines 23 and 26 in the screenshot above? There is a bit of text in each, and then a long line of dashes (-----) after it, plus a little arrow beside the line numbers. All of this indicates that the following code is a ‘code section’ and can be collapsed or expanded as needed. In the screenshot below is the same code from earlier, but with all of the code sections collapsed.

Screenshot of commented code, all code sections collapsed

As you can see, when the sections are collapsed, the ----- lines are replaced by a purple arrow graphic to further indicate there is code hidden. I like code sections because they help prevent overload and confusion, and allow for a quick review of the entire script and what’s included. Your mileage may vary, but I encourage you to give them a try! When you’re working in RStudio, there are some helpful shortcuts that you can use to add, collapse, and expand code sections easily:

Code > Insert Section. PC shortcut is Ctrl-Shift-R. This will produce a little window where you type the text that will appear in the section divider (e.g., lines 23 and 26 in the above examples).
Edit > Folding > Collapse. PC shortcut is Alt-L. This will collapse a code section if it isn’t already. You can also use the little arrow beside the line number. To collapse all code sections in the entire document, use Alt-O.
Edit > Folding > Expand. PC shortcut is Alt-Shift-L. This will expand a code section if it isn’t already. You can also use the little arrow beside the line number. To expand all code sections in the entire document, use Alt-Shift-O.
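For reference, here is a sketch of how the two ‘tricks’ look when typed by hand. RStudio treats a comment line that ends in at least four dashes as a code section header; the section titles and the toy_values object are made up for illustration.

```r
########################################################################
# A comment line like the one above simply divides the script visually

# Import data ----
toy_values <- c(8, 6, 5) # made-up numbers so the section has something to fold

# Describe data ----
mean(toy_values) # average of the made-up numbers
```

Each of the two headers should get its own little arrow beside the line number, and collapsing one hides the code underneath it until the next header.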

Step 3: Importing Data

Once you have your script file created and your basic information annotated, the next step is importing your data. In the sections below, we discuss how to import from CSV, Excel, and SPSS files. This is where we start talking about actual code, but it’s nothing too complicated!

If you’re using the folder structure recommended by earlier workshops, your Files tab should look something like this:

Files tab using established folder structure

The raw data we’ll be importing is saved to the ‘Data’ folder, and the script file is saved to the ‘RCode’ folder. Now that we’ve established this, we can start writing the code.

Step 3.1: Importing from CSV (Recommended)

The best filetype for data storage is CSV. It’s extremely basic, so even huge datasets stay pretty small, and there are no extra bits and bobs to confuse R or any other programs when you open it. It’s best practice not to edit your CSV file beforehand either – just download the file from Qualtrics (or any other survey software) and import it into R. That way, any data cleaning decisions are saved in your code, both for your future records (so you know what you did) and your future sanity (so you can double-check for mistakes or the like).

Step 3.1.1: Import the Raw Data

The first command we’ll use is the read.csv() command. Let’s give it a shot:

data <- read.csv(file="Data/workshop_data_toy.csv")

In this case, we’ve told R to run the read.csv command on the file ‘workshop_data_toy.csv’ in the ‘Data’ folder of our working directory, and then save the result as an object (in this case, a dataframe) in our workspace, titled ‘data’. Maybe you don’t think ‘data’ is a very descriptive name for an object, and want to label yours as ‘raw_data_toy’. In that case, your code would look like this:

raw_data_toy <- read.csv(file="Data/workshop_data_toy.csv")

(You can call your dataframe whatever you want! I’m going to call mine ‘data’, though, because I’m lazy.)

Once you’ve run this code, you can click the ‘data’ object that’s appeared in your Environment tab, and tada! There’s your data. But if you look closely, you’ll quickly notice an issue:

Your newly imported data

The topmost row of your CSV file has become the header of your new dataframe. In this case, this means your header consists of long strings that would be unwieldy to use in your scripting. What to do?

You could open your CSV in Excel, delete the top row, save it, and re-import it in R. This probably seems easier, especially at first when you’re unfamiliar with R. But it isn’t the best way to proceed!

Step 3.1.2: Skip Unneeded Rows

The read.csv() command actually gives you the tool you need to fix this problem through the ‘skip’ argument. This argument tells the command ‘the number of lines of the data file to skip before beginning to read data’. If we give it a try:

data <- read.csv(file="Data/workshop_data_toy.csv", skip = 1)

Ta-da! Problem solved.

New dataframe with first row ‘skipped’
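If you’d like to see what ‘skip’ does without touching the workshop file, the sketch below builds a tiny made-up CSV in a temporary location and imports it both ways. The file contents and the names toy_path, no_skip, and with_skip are invented for illustration.

```r
# Write a tiny CSV with an extra row of long question text on top (made-up contents)
toy_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Student ID,How likely are you to enroll?",
  "Stu.ID,Intent",
  "1001,8",
  "1002,6"
), toy_path)

no_skip   <- read.csv(file = toy_path)           # long question text becomes the header
with_skip <- read.csv(file = toy_path, skip = 1) # the short names become the header

colnames(no_skip)   # long, unwieldy names built from the question text
colnames(with_skip) # "Stu.ID" "Intent"
nrow(with_skip)     # 2
```

Note that without ‘skip’, the row of short names is read as if it were a participant’s data, which is another reason the first import looked odd.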

Much better! If we look through our data a bit more, however, we notice a second potential issue in row 19 (aka Stu.ID 1019). There are a bunch of blank cells! It looks like the participant left some questions unanswered. They aren’t the only one – Stu.ID 1012 didn’t respond to the ‘Intent’ item, and 1008 didn’t include their age. However, in these cases there’s a greyed out NA instead of a blank cell. What’s up?

Step 3.1.3: Handle Missing Data Consistently

What’s up is that R isn’t recognizing 1019’s missing responses as missing. When R recognizes a cell as missing, it displays the greyed-out NA. If a cell shows anything else, even a blank, then R is assuming that something is there. Let’s look at the output from the table() command, a useful command that creates frequency tables of the indicated columns:

table(data$Q1, useNA = "always")

## 
##                                       Agree              Disagree 
##                     1                    10                     4 
## Prefer not to respond        Somewhat agree     Somewhat disagree 
##                     2                     1                     5 
##        Strongly Agree                  <NA> 
##                     1                     0

To translate: this line of code is telling R to use the table() command on the ‘Q1’ column of the ‘data’ dataframe, and to always display NAs, or cells with missing data (we will talk more about calling objects and what’s up with the ‘data$Q1’ bit later). As you can see from the output displayed above (it will show in your Console window if you run the code on your own in RStudio), the first response option is a blank line, with only one case. It’s followed by the Agree, Disagree, and Prefer not to respond options, which have 10, 4, and 2 cases, respectively.
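The difference between a blank and a true NA can be seen in a small sketch with made-up responses (toy_responses is a hypothetical name):

```r
# Made-up responses: one blank string and one true missing value
toy_responses <- c("Agree", "", "Disagree", "Agree", NA)

table(toy_responses, useNA = "always") # the blank gets its own category, separate from <NA>

sum(toy_responses == "", na.rm = TRUE) # 1 blank that R treats as a real response
sum(is.na(toy_responses))              # 1 value that R treats as missing
```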

If we were to run an analysis on the data as it is, it would consider ‘blank’ as valid a response as ‘Agree’, and that’s not what we want. So we need to tweak our import code so that R correctly identifies these blank cells as NAs, and displays them in the <NA> section of the table.

This might seem an overwhelming task, but the solution is simple. Because there are different ways of indicating missing data – some survey software use values like 99, Excel uses #N/A, and so on – the read.csv() command has an argument in which you can specify which values it should treat as NAs. Like this:

data <- read.csv(file="Data/workshop_data_toy.csv", skip = 1, na.strings = c("#N/A",""," "))

This code is identical to the import command we used earlier, but the added argument tells R to mark any cells with the provided values as NAs. When I import CSVs, I always include the ‘na.strings’ argument as written here, even when I don’t expect any issues, just to avoid tripping over them later.
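Here is a small self-contained sketch of what ‘na.strings’ changes, using a made-up CSV written to a temporary location rather than the workshop file. The names toy_path, toy_raw, and toy_fixed are invented for illustration.

```r
# Made-up CSV: 1002 left Q1 blank, and 1003 has Excel's #N/A marker
toy_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Stu.ID,Q1",
  "1001,Agree",
  "1002,",
  "1003,#N/A",
  "1004,Disagree"
), toy_path)

toy_raw   <- read.csv(file = toy_path)
toy_fixed <- read.csv(file = toy_path, na.strings = c("#N/A", "", " "))

sum(is.na(toy_raw$Q1))   # 0: the blank and "#N/A" are kept as text
sum(is.na(toy_fixed$Q1)) # 2: both are now marked as missing
```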

Did this fix the issue with Stu.ID 1019? If not, what additional values could you add to the list to catch any remaining import errors? Note: make sure your additions to the list are enclosed in quotes and separated by commas, as displayed above. Otherwise R will assume they’re all one long value or spit out an error!

Step 4: Strategize Data Cleaning

Once your data is in R you need to understand how R sees it. Just because you know a question has categorical or continuous data does not mean R sees it this way. There are multiple ways to understand how R sees your data. We share two below.

Step 4.1: Working with Objects

Now that you’ve got your data into R, let’s get some practice looking at it. There are a few ways to view a dataframe object like our new data file. One is to click it in the Environment tab – it will open in a new window that can do some useful stuff (e.g., sort by columns or filter). Another way is to use the head() command (and its opposite, the tail() command).

head(data)

##   Stu.ID Intent                    Q1                Q2                Q3
## 1   1001      8              Disagree          Disagree          Disagree
## 2   1002      6 Prefer not to respond          Disagree             Agree
## 3   1003      5     Somewhat disagree Somewhat disagree Somewhat disagree
## 4   1004      9                 Agree             Agree    Strongly Agree
## 5   1005      7        Strongly Agree    Strongly Agree    Strongly Agree
## 6   1006      8              Disagree          Disagree          Disagree
##               Q4             Q5                           Gender   Par1.Educ
## 1          Agree          Agree                          Agender High School
## 2          Agree          Agree Female &/or Feminine &/or Woman  High School
## 3          Agree          Agree Female &/or Feminine &/or Woman  High School
## 4 Strongly Agree Strongly Agree                               NA  Bachelor's
## 5 Strongly Agree Strongly Agree Female &/or Feminine &/or Woman  High School
## 6 Somewhat agree Somewhat agree    Male &/or Masculine &/or Man   Bachelor's
##        Par2.Educ Age                                Rac.Eth
## 1   Some College  18                                  White
## 2    High School  19 Middle Eastern or North African, White
## 3    High School  25    I prefer to identify as [Columbian]
## 4            PhD  16                                  White
## 5 Not applicable  19                                  White
## 6            PhD  18              Black or African American

tail(data)

##    Stu.ID Intent                    Q1                    Q2
## 19   1019      9                  <NA>                  <NA>
## 20   1020      8 Prefer not to respond Prefer not to respond
## 21   1021      5                 Agree                 Agree
## 22   1022      5              Disagree              Disagree
## 23   1023      7                 Agree                 Agree
## 24   1024      6                 Agree                 Agree
##                       Q3                    Q4                    Q5
## 19                  <NA>                  <NA>                  <NA>
## 20 Prefer not to respond Prefer not to respond Prefer not to respond
## 21        Strongly Agree        Strongly Agree        Strongly Agree
## 22              Disagree     Somewhat disagree     Somewhat disagree
## 23        Strongly Agree        Strongly Agree        Strongly Agree
## 24        Strongly Agree        Strongly Agree        Strongly Agree
##                              Gender   Par1.Educ   Par2.Educ Age
## 19                               NA  Bachelor's  Bachelor's  18
## 20                               NA          NA          NA  18
## 21    Male &/or Masculine &/or Man  High School High School  30
## 22 Female &/or Feminine &/or Woman  High School High School  19
## 23  I don't understand the question  Bachelor's  Bachelor's  18
## 24    Male &/or Masculine &/or Man   Bachelor's  Bachelor's  18
##                                Rac.Eth
## 19           Black or African American
## 20           Black or African American
## 21 Hispanic; Latino; or Spanish origin
## 22                               Asian
## 23                               Asian
## 24                               White

As you can see, these commands show you the first and last six rows of the specified dataframe. This is often a quick and easy way to see how your data looks without troubling to open it in a new window.
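Six rows is only the default. Both commands accept an ‘n’ argument if you want more or fewer rows, as in this sketch with a made-up dataframe (toy is a hypothetical name):

```r
# Made-up dataframe with ten rows
toy <- data.frame(
  Stu.ID = 1001:1010,
  Intent = c(8, 6, 5, 9, 7, 8, 4, 7, 8, 9)
)

head(toy, n = 3) # first three rows
tail(toy, n = 2) # last two rows
```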

Another useful command is the colnames() command. This shows you a list of all the columns in a dataframe, and can be very helpful if you have a lot of variables and need to see how one is spelled. But more than that, it can be useful if your current column names aren’t serving your purposes. For instance, maybe we want our column names to be lowercase:

colnames(data) # see what the current column names are

##  [1] "Stu.ID"    "Intent"    "Q1"        "Q2"        "Q3"        "Q4"       
##  [7] "Q5"        "Gender"    "Par1.Educ" "Par2.Educ" "Age"       "Rac.Eth"

old_columns <- colnames(data) # save the old column names in a new object
new_columns <- c("stu.id","intent","q1","q2","q3","q4","q5","gender","par1.educ","par2.educ","age","race.eth") # create the new column names and place them in a vector

colnames(data) <- new_columns # apply the new column names to the dataframe
colnames(data) # see the new column names

##  [1] "stu.id"    "intent"    "q1"        "q2"        "q3"        "q4"       
##  [7] "q5"        "gender"    "par1.educ" "par2.educ" "age"       "race.eth"

colnames(data) <- old_columns # apply the old column names to the dataframe
colnames(data) # see the restored old column names

##  [1] "Stu.ID"    "Intent"    "Q1"        "Q2"        "Q3"        "Q4"       
##  [7] "Q5"        "Gender"    "Par1.Educ" "Par2.Educ" "Age"       "Rac.Eth"
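If all you want is lowercase versions of the existing names, typing them out by hand isn’t strictly necessary. As a sketch of one alternative, the base R tolower() command can do the conversion for you (toy is a made-up dataframe for illustration):

```r
# Made-up dataframe with mixed-case column names
toy <- data.frame(
  Stu.ID  = 1001:1003,
  Intent  = c(8, 6, 5),
  Rac.Eth = c("White", "Asian", "White")
)

colnames(toy) <- tolower(colnames(toy)) # convert every column name to lowercase
colnames(toy) # "stu.id" "intent" "rac.eth"
```

Unlike the hand-typed vector, this keeps each name’s spelling exactly as it was and only changes the case.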

Sometimes you need to view or manipulate only part of an object, such as a single variable in a dataset. To do this you use the $ operator. If we wanted to preview the first six lines of the Q1 column in our data, we could use the code below.

head(data$Q1)

## [1] "Disagree"              "Prefer not to respond" "Somewhat disagree"    
## [4] "Agree"                 "Strongly Agree"        "Disagree"
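The $ operator also works for creating or changing a column, not just viewing one. A minimal sketch with a made-up dataframe (toy and Q1.agree are hypothetical names):

```r
# Made-up dataframe for illustration
toy <- data.frame(
  Stu.ID = c(1001, 1002, 1003),
  Q1     = c("Disagree", "Agree", "Agree")
)

toy$Q1      # the Q1 column on its own
toy[["Q1"]] # another way of writing the same thing

# Create a new column that is TRUE wherever Q1 is "Agree"
toy$Q1.agree <- toy$Q1 == "Agree"
toy
```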

Last but not least, you can use your new understanding of data structures in R to subset your data. Take our imported data, for instance. Suppose we want to use some of the variables (Stu.ID, Intent, and Age) but not the others, and find that they’re cluttering things up. Subsetting the current dataframe will let us create a new dataframe with the specified information without altering the previous dataframe.

data2 <- subset(data, select=c(Stu.ID,Intent,Age))
head(data2)

##   Stu.ID Intent Age
## 1   1001      8  18
## 2   1002      6  19
## 3   1003      5  25
## 4   1004      9  16
## 5   1005      7  19
## 6   1006      8  18

Note that you can refer to columns by their names or their position in the dataframe (although this can backfire if you change your dataframe and re-run old code without updating it, so it’s not recommended). The code below does the same thing as the code above, but refers to column numbers instead of names.

data2 <- subset(data, select=c(1,2,11))
head(data2)

##   Stu.ID Intent Age
## 1   1001      8  18
## 2   1002      6  19
## 3   1003      5  25
## 4   1004      9  16
## 5   1005      7  19
## 6   1006      8  18

Perhaps we want to filter our dataframe so that only 19 year olds are included and we only see the columns relating to IDs, intent, and age.

data3 <- subset(data, select=c(Stu.ID,Intent,Age), Age == "19")
head(data3)

##    Stu.ID             Intent Age
## 2    1002                  6  19
## 5    1005                  7  19
## 10   1010                  9  19
## 12   1012                 NA  19
## 15   1015 10-Definitely will  19
## 18   1018                  7  19

Or maybe we want to filter only to those over 19 and then get rid of the age column altogether, leaving the others untouched?

data4 <- subset(data, select=-c(Age), Age > "19")
head(data4)

##    Stu.ID Intent                Q1                Q2                Q3
## 3    1003      5 Somewhat disagree Somewhat disagree Somewhat disagree
## 8    1008      7    Somewhat agree    Somewhat agree             Agree
## 9    1009      8             Agree             Agree             Agree
## 13   1013      5          Disagree          Disagree          Disagree
## 21   1021      5             Agree             Agree    Strongly Agree
##                   Q4                Q5
## 3              Agree             Agree
## 8              Agree             Agree
## 9              Agree             Agree
## 13 Somewhat disagree Somewhat disagree
## 21    Strongly Agree    Strongly Agree
##                                        Gender    Par1.Educ   Par2.Educ
## 3            Female &/or Feminine &/or Woman   High School High School
## 8  Male &/or Masculine &/or Man, Transgender    Bachelor's  Bachelor's
## 9                                 Genderfluid          PhD  Bachelor's
## 13           Female &/or Feminine &/or Woman  I don't know  Bachelor's
## 21              Male &/or Masculine &/or Man   High School High School
##                                Rac.Eth
## 3  I prefer to identify as [Columbian]
## 8  Hispanic; Latino; or Spanish origin
## 9  Hispanic; Latino; or Spanish origin
## 13                               White
## 21 Hispanic; Latino; or Spanish origin
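One caution about the filters above: the comparisons are written against "19" in quotes, which means R compares the values as text rather than as numbers. Text is compared character by character, so the results can differ from what you’d expect with real numbers. The sketch below uses made-up ages to show the difference; whether this matters for the workshop data depends on which values are actually in the Age column.

```r
ages_chr <- c("18", "9", "25", "100") # made-up ages stored as text

ages_chr > "19" # text comparison:    FALSE  TRUE  TRUE FALSE

ages_num <- as.numeric(ages_chr)      # convert the text to numbers
ages_num > 19   # numeric comparison: FALSE FALSE  TRUE  TRUE
```

Keep in mind that as.numeric() turns anything that isn’t a plain number (for example, "10-Definitely will") into NA and gives a warning, so it’s worth checking a column before converting it.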

ON YOUR OWN. If you look at the data, you’ll notice we have a participant under the age of 18, which means we can’t use their data without their guardian’s permission, which we didn’t collect. Try subsetting your data to drop those below the age of 18 and see what happens!

Step 4.2: Determine Types of Data R Sees

The str() Command

One way to understand how R is reading your data is the str (structure) command.

str(data)

## 'data.frame':    24 obs. of  12 variables:
##  $ Stu.ID   : int  1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 ...
##  $ Intent   : chr  "8" "6" "5" "9" ...
##  $ Q1       : chr  "Disagree" "Prefer not to respond" "Somewhat disagree" "Agree" ...
##  $ Q2       : chr  "Disagree" "Disagree" "Somewhat disagree" "Agree" ...
##  $ Q3       : chr  "Disagree" "Agree" "Somewhat disagree" "Strongly Agree" ...
##  $ Q4       : chr  "Agree" "Agree" "Agree" "Strongly Agree" ...
##  $ Q5       : chr  "Agree" "Agree" "Agree" "Strongly Agree" ...
##  $ Gender   : chr  "Agender" "Female &/or Feminine &/or Woman " "Female &/or Feminine &/or Woman " "NA" ...
##  $ Par1.Educ: chr  "High School" "High School" "High School" "Bachelor's" ...
##  $ Par2.Educ: chr  "Some College" "High School" "High School" "PhD" ...
##  $ Age      : chr  "18" "19" "25" "16" ...
##  $ Rac.Eth  : chr  "White" "Middle Eastern or North African, White" "I prefer to identify as [Columbian]" "White" ...

It’s similar to the head() command, but provides two additional bits of information. Between the name of each column (e.g., Stu.ID) and the excerpt of the data in that column (e.g., 1001 1002 1003, etc) is a short code that tells you what type of data is in the column, or what type of variable it is. The Stu.ID variable is an integer type variable, while the remaining variables are character types. The most commonly used types are:

Character: includes single characters (“a”) and strings (“Agree”)
Integer: only allows integers
Numeric: allows all numbers (including integers)
Logical: a binary variable that can either be TRUE (T) or FALSE (F).
Factors: data organized into categories.
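
If you want to check a type for yourself, the class() command reports how R is storing a value. The short sketch below uses made-up values (they are not from the workshop dataset) to show one example of each type listed above.

```r
# Toy values, one per type (not from the workshop dataset)
class("Agree")                          # "character"
class(5L)                               # "integer" (the L suffix asks for an integer)
class(5.5)                              # "numeric"
class(TRUE)                             # "logical"
class(factor(c("Agree", "Disagree")))   # "factor"
```

You can run class() on a single column too, for example class(data$Age).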

The summary() Command

Another way to understand how R is reading your data is the summary() command. This function gives you a summary of each column of your data. For numeric data you see the spread of the data (minimum, quartiles, mean, and maximum), and for categorical data (factors) you see how many individuals are in each category. For character data you only get the length and class of the column. Generally, when you see character data you will want to convert it to categorical data (unless you have other plans for that data).

summary(data)

##      Stu.ID        Intent               Q1                 Q2           
##  Min.   :1001   Length:24          Length:24          Length:24         
##  1st Qu.:1007   Class :character   Class :character   Class :character  
##  Median :1012   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1012                                                           
##  3rd Qu.:1018                                                           
##  Max.   :1024                                                           
##       Q3                 Q4                 Q5               Gender         
##  Length:24          Length:24          Length:24          Length:24         
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   Par1.Educ          Par2.Educ             Age              Rac.Eth         
##  Length:24          Length:24          Length:24          Length:24         
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##
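
To see the difference between character and categorical summaries on a small scale, here is a minimal sketch using a made-up vector called answers (an assumption for illustration, not part of the workshop data).

```r
answers <- c("Agree", "Disagree", "Agree", "Agree")  # made-up responses

summary(answers)              # character: only Length, Class, and Mode
summary(as.factor(answers))   # factor: a count for each category
```

The second summary is usually the more useful one, which is why the next step covers converting columns to factors.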

Step 4.3: Change Data Types

You might find that your data has been imported as one type (e.g., character) when you want it to be something else (e.g., integer). There’s a very easy way to tell R to convert it from one type to the other!

data$Gender <- as.factor(data$Gender) # recode gender variable from character to factor

This code overwrites the Gender column in your existing dataframe, converting it from character data to a factor. If you rerun the summary() or str() commands you can see the change. Instead of Class :character, you should see a list of the categories in the column you converted, as well as how many participants chose each category.

Some columns may have so many categories that R doesn’t show them all when you run a summary of the whole dataframe. You can tell levels are missing if you see (Other) among your factors (for an example of this, look at Gender). To see all the levels, run the summary() command on that particular column.

summary(data$Gender)

##                                    Agender 
##                                          1 
##   Agender, Female &/or Feminine &/or Woman 
##                                          1 
##                                     female 
##                                          1 
##           Female &/or Feminine &/or Woman  
##                                          9 
##                                Genderfluid 
##                                          1 
##            I don't understand the question 
##                                          1 
##              Male &/or Masculine &/or Man  
##                                          6 
## Male &/or Masculine &/or Man, Transgender  
##                                          1 
##                                         NA 
##                                          3
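
The (Other) row appears because summary() on a whole dataframe only lists a limited number of categories per column. The sketch below, which uses a made-up dataframe called toy, shows two other ways to see everything: raising that limit with the maxsum argument, or asking for the levels directly.

```r
# Made-up dataframe with one factor column that has 8 categories
toy <- data.frame(pet = factor(c("cat", "dog", "fish", "bird",
                                 "frog", "newt", "crab", "mole")))

summary(toy)               # the default limit lumps the extra levels into (Other)
summary(toy, maxsum = 10)  # raising the limit lists every level
levels(toy$pet)            # levels() always returns the full set of categories
```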

To change multiple columns at the same time, you can use the lapply() function. It is powerful but also a bit beyond this workshop, so we won’t go into depth, but the code is presented here for your use. Instead of naming each column you want to convert, you can just give the column numbers. To find them, use the names() command, which prints the column names in order; the number in brackets at the start of each output line is the position of the first name on that line.

names(data)

##  [1] "Stu.ID"    "Intent"    "Q1"        "Q2"        "Q3"        "Q4"       
##  [7] "Q5"        "Gender"    "Par1.Educ" "Par2.Educ" "Age"       "Rac.Eth"

data[,3:7] <- lapply(data[,3:7],as.factor)

With this command we are telling R to make columns 3 through 7 (questions Q1 through Q5) factors. You can run the summary() command again to see your data.
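
If you would like to see what lapply() is doing before trying it on the real data, here is a small sketch on a made-up dataframe called toy. It checks the column types before and after the conversion.

```r
# Made-up dataframe: one ID column and two question columns
toy <- data.frame(id = 1:3,
                  Q1 = c("Agree", "Disagree", "Agree"),
                  Q2 = c("Disagree", "Disagree", "Agree"),
                  stringsAsFactors = FALSE)

sapply(toy, class)                           # before: Q1 and Q2 are character
toy[, 2:3] <- lapply(toy[, 2:3], as.factor)  # convert columns 2 and 3 only
sapply(toy, class)                           # after: Q1 and Q2 are factor, id is unchanged
```

Checking the types right after a conversion like this is a quick way to confirm you picked the right column numbers.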

ON YOUR OWN: Gender isn’t the only variable that would be better as a factor. What about Par1.Educ? Try converting it on your own and see how it goes!

Step 4.4: REFLECTION QUESTION:

Now that you have your data visualized, do you see any mistakes in the data that need to be fixed? Or do you see where further processing of the data might be necessary to reach the research goals?

Step 5: Cleaning the Data

There are numerous small things that go into cleaning your data. A few of the most important ones are included in the section below.

Step 5.1: Standardizing the Levels of a Factor

You might have noticed while reflecting on the data that the Gender column has one entry for ‘female’. Check it out using the summary() command.

summary(data$Gender)

##                                    Agender 
##                                          1 
##   Agender, Female &/or Feminine &/or Woman 
##                                          1 
##                                     female 
##                                          1 
##           Female &/or Feminine &/or Woman  
##                                          9 
##                                Genderfluid 
##                                          1 
##            I don't understand the question 
##                                          1 
##              Male &/or Masculine &/or Man  
##                                          6 
## Male &/or Masculine &/or Man, Transgender  
##                                          1 
##                                         NA 
##                                          3

This entry should be combined with ‘Female &/or Feminine &/or Woman’. Notice also that there is a space after the ‘n’ in ‘Woman’. These little differences can be the most challenging to diagnose. We will first remove all leading and trailing spaces (undesired spaces at the beginning or end of a string) and then recode the variable.

Step 5.1.1: Remove Leading and Trailing Spaces

The best way to remove leading and trailing spaces is one column at a time. In the code below, we use the trimws() command to remove the spaces and create a new column from the result. However, trimws() returns plain text, so the new column has the data type ‘character’ and we need to change it to a factor again. And ta-da! No more unwanted spaces.

data$Gender_recode <- trimws(data$Gender, which = c("both"))
data$Gender_recode <- as.factor(data$Gender_recode)
summary(data$Gender_recode)

##                                   Agender 
##                                         1 
##  Agender, Female &/or Feminine &/or Woman 
##                                         1 
##                                    female 
##                                         1 
##           Female &/or Feminine &/or Woman 
##                                         9 
##                               Genderfluid 
##                                         1 
##           I don't understand the question 
##                                         1 
##              Male &/or Masculine &/or Man 
##                                         6 
## Male &/or Masculine &/or Man, Transgender 
##                                         1 
##                                        NA 
##                                         3
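
To see why stray spaces matter, here is a minimal sketch with a made-up factor called gender (not the workshop column). R treats ‘Woman ’ and ‘Woman’ as two separate categories until the spaces are trimmed.

```r
gender <- factor(c("Woman ", " Man", "Woman"))  # made-up values with stray spaces
nlevels(gender)             # 3: "Woman " and "Woman" count as different categories

trimmed <- trimws(gender)   # trimws() hands back plain character strings
class(trimmed)              # "character"

trimmed <- as.factor(trimmed)
levels(trimmed)             # "Man" "Woman"
```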

Step 5.1.2: Recode A Variable Using the dplyr Package

The dplyr package is a popular and powerful package that you’ll probably use often. The recode() command from that package is straightforward and easy to use. Here, we use it to edit the data so that ‘female’ is added to the larger ‘Female &/or Feminine &/or Woman’ category and the result is saved as a new variable.

install.packages("dplyr")
library(dplyr)
data$Gender_recode2 <- recode(data$Gender_recode, female = "Female &/or Feminine &/or Woman")
summary(data$Gender_recode2)

##                                   Agender 
##                                         1 
##  Agender, Female &/or Feminine &/or Woman 
##                                         1 
##           Female &/or Feminine &/or Woman 
##                                        10 
##                               Genderfluid 
##                                         1 
##           I don't understand the question 
##                                         1 
##              Male &/or Masculine &/or Man 
##                                         6 
## Male &/or Masculine &/or Man, Transgender 
##                                         1 
##                                        NA 
##                                         3
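
If dplyr isn’t installed yet, the same kind of merge can be sketched in base R by renaming one level of the factor to match another. This is an alternative to recode(), not the approach used above, and the gender vector below is made up for illustration.

```r
gender <- factor(c("female", "Woman", "Woman", "Man"))  # made-up values

# Renaming a level to a name that already exists merges the two categories
levels(gender)[levels(gender) == "female"] <- "Woman"

summary(gender)   # "female" is gone and "Woman" now has 3 entries
```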

ON YOUR OWN: There’s a lot you can do with the recode command! Try checking out the Rac.Eth data. One of the written-in responses is ‘I prefer to identify as [black]’. This probably could be combined with the ‘Black or African American’ category. Try writing your own code to make that happen.

Step 5.2: Changing String Variables to Numbers

Often with Likert scale data you want to convert phrases to numbers (Strongly Disagree to 1, for example) so you can treat the variable as numeric or ordinal. To do this we will use the same command as when we were cleaning up the factors: recode(). This time we enter multiple levels that we want to recode. Because the new values are numbers, any levels that aren’t recoded (such as ‘Prefer not to respond’) will automatically be recoded as NA.

Last, we use the table() command to make sure that the recoded variable and original variable line up. For instance, we can see that there were 10 participants who selected ‘Agree’ in the original variable, and all 10 of them have a score of ‘5’ in the new variable, suggesting that the recoding worked and everyone ended up being assigned the correct value.

library(dplyr)
summary(data$Q1)
data$Q1_recode <- recode(data$Q1, "Strongly disagree" = 1,
                                       "Disagree" = 2,
                                       "Somewhat disagree" = 3,
                                       "Somewhat agree" = 4,
                                       "Agree" = 5,
                                       "Strongly Agree" = 6)
summary(data$Q1_recode)
table(data$Q1, data$Q1_recode)

##                 Agree              Disagree Prefer not to respond 
##                    10                     4                     2 
##        Somewhat agree     Somewhat disagree        Strongly Agree 
##                     1                     5                     1 
##                  NA's 
##                     1

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   2.000   3.000   5.000   3.952   5.000   6.000       3

##                        
##                          2  3  4  5  6
##   Agree                  0  0  0 10  0
##   Disagree               4  0  0  0  0
##   Prefer not to respond  0  0  0  0  0
##   Somewhat agree         0  0  1  0  0
##   Somewhat disagree      0  5  0  0  0
##   Strongly Agree         0  0  0  0  1
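
The idea behind the recoding is a lookup: each phrase has a score, and anything without a score becomes NA. The sketch below shows that idea in base R with a named vector. It is an illustration of the logic rather than a replacement for recode(), and the responses and key objects are made up for this example.

```r
# Made-up responses; "Prefer not to respond" is deliberately left out of the key
responses <- c("Agree", "Disagree", "Prefer not to respond", "Strongly Agree")

key <- c("Strongly disagree" = 1, "Disagree" = 2, "Somewhat disagree" = 3,
         "Somewhat agree" = 4, "Agree" = 5, "Strongly Agree" = 6)

# Look each response up by name (wrap a factor column in as.character() first)
scores <- unname(key[responses])
scores                                      # 5 2 NA 6

table(responses, scores, useNA = "ifany")   # the NA column shows what had no score
```

Adding useNA = "ifany" to table() is handy here, because by default table() leaves out missing values and you can’t see which responses were turned into NA.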

ON YOUR OWN: Look at the Rac.Eth data. One of the written-in responses is ‘I prefer to identify as [black]’. This probably could be combined with the ‘Black or African American’ category. Try recoding it on your own using the code examples above and see how it goes!

OPTIONAL: Recoding Multiple Variables Using dplyr package

The most straightforward, but also the most time-consuming, way to recode multiple variables is to use the above code and switch out the variable names. For instance, to recode column ‘Q2’ you could duplicate the above code, replace all instances of ‘Q1’ with ‘Q2’, and run it again. See below:

library(dplyr)
summary(data$Q2)
data$Q2_recode <- recode(data$Q2, "Strongly disagree" = 1,
                                       "Disagree" = 2,
                                       "Somewhat disagree" = 3,
                                       "Somewhat agree" = 4,
                                       "Agree" = 5,
                                       "Strongly Agree" = 6)
summary(data$Q2_recode)
table(data$Q2, data$Q2_recode)

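The other option is to use more complicated code. The code below will work, but we won’t be explaining it in depth in the current workshop. Note that if you already created Q2_recode with the block above, this code will add a second column with the same name, so run only one of the two approaches. Remember to check your recoding using the table() command and you should be safe, but if in doubt, use the lengthier but easier to understand script.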

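library(dplyr)
# Recode Q2 through Q5 in one step; the ~ lambda replaces funs(), which is deprecated in current dplyr
data2 <- data %>%
  transmute_at(c("Q2","Q3","Q4","Q5"), ~ recode(.x, "Strongly disagree" = 1,
                                       "Disagree" = 2,
                                       "Somewhat disagree" = 3,
                                       "Somewhat agree" = 4,
                                       "Agree" = 5,
                                       "Strongly Agree" = 6))
# Add a _recode suffix so the new columns don't clash with the originals
colnames(data2) <- paste0(colnames(data2),"_recode")
# Attach the recoded columns to the original dataframe
data <- cbind.data.frame(data, data2)
# Check the recoding against the original responses
table(data$Q2, data$Q2_recode)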

##                        
##                          2  3  4  5  6
##   Agree                  0  0  0 10  0
##   Disagree               6  0  0  0  0
##   Prefer not to respond  0  0  0  0  0
##   Somewhat agree         0  0  1  0  0
##   Somewhat disagree      0  4  0  0  0
##   Strongly Agree         0  0  0  0  1
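
For comparison, here is a base R sketch of the same ‘recode several columns at once’ idea, using lapply() and a named lookup vector instead of dplyr. The toy dataframe, key, and cols objects are all made up for this example, and this is only one of several ways to do it.

```r
# Made-up dataframe with two question columns
toy <- data.frame(Q2 = c("Agree", "Disagree"),
                  Q3 = c("Strongly Agree", "Prefer not to respond"),
                  stringsAsFactors = FALSE)

key <- c("Strongly disagree" = 1, "Disagree" = 2, "Somewhat disagree" = 3,
         "Somewhat agree" = 4, "Agree" = 5, "Strongly Agree" = 6)

cols <- c("Q2", "Q3")   # the columns to recode

# Look up every column in the key and store the results as Q2_recode, Q3_recode
toy[paste0(cols, "_recode")] <- lapply(toy[cols],
                                       function(x) unname(key[as.character(x)]))
toy
```

As with the dplyr version, it is worth running table(toy$Q2, toy$Q2_recode) afterwards to confirm the scores line up with the original responses.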

Step 5.3: Visualizing Numeric Variables

One of the benefits of converting a string-type variable into a numeric variable, as we did in the last step, is that you can plot it! The code below creates a simple histogram to help you visualize the distribution of your data.

hist(data$Q1_recode, breaks = 5)
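
When breaks is a single number, R treats it as a suggestion and chooses its own ‘pretty’ cut points, so the bars may not line up with the scores the way you expect. For a 1 to 6 scale, one option is to give the cut points yourself so that each score gets its own bar. The sketch below assumes a made-up vector called scores rather than the workshop data.

```r
scores <- c(2, 2, 3, 5, 5, 5, 6)   # made-up recoded responses on the 1-6 scale

# Cut points at 0.5, 1.5, ..., 6.5 put each whole-number score in its own bar
h <- hist(scores, breaks = seq(0.5, 6.5, by = 1),
          main = "Toy Likert scores", xlab = "Score")

h$counts   # number of responses for each score from 1 to 6
```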

There are some other variables that would benefit from similar treatment. On your own, see if you can make the ‘Intent’ variable numeric and visualize it in a histogram.

Step 5.4: Creating Factor Scores

The self-efficacy questions come from two scales: achievement (Q1, Q2, Q5) and mastery (Q3 and Q4). A two-question factor is really not great, but we will use it just as an example. The researchers don’t care about the individual questions but want to get a factor score for each scale by averaging student responses on each scale. To do this they need to make a new column for the factor score and fill that column in with the average score for the individual items. This is actually really easy!

data$Achieve <- (data$Q1_recode + data$Q2_recode + data$Q5_recode)/3
summary(data)

##      Stu.ID        Intent                              Q1    
##  Min.   :1001   Length:24          Agree                :10  
##  1st Qu.:1007   Class :character   Disagree             : 4  
##  Median :1012   Mode  :character   Prefer not to respond: 2  
##  Mean   :1012                      Somewhat agree       : 1  
##  3rd Qu.:1018                      Somewhat disagree    : 5  
##  Max.   :1024                      Strongly Agree       : 1  
##                                    NA's                 : 1  
##                      Q2                         Q3                        Q4   
##  Agree                :10   Agree                :6   Agree                :8  
##  Disagree             : 6   Disagree             :4   Disagree             :1  
##  Prefer not to respond: 1   Prefer not to respond:2   Prefer not to respond:2  
##  Somewhat agree       : 1   Somewhat disagree    :4   Somewhat agree       :1  
##  Somewhat disagree    : 4   Strongly Agree       :7   Somewhat disagree    :3  
##  Strongly Agree       : 1   NA's                 :1   Strongly Agree       :8  
##  NA's                 : 1                             NA's                 :1  
##                      Q5                                         Gender 
##  Agree                :8   Female &/or Feminine &/or Woman         :9  
##  Disagree             :2   Male &/or Masculine &/or Man            :6  
##  Prefer not to respond:2   NA                                      :3  
##  Somewhat agree       :1   Agender                                 :1  
##  Somewhat disagree    :2   Agender, Female &/or Feminine &/or Woman:1  
##  Strongly Agree       :8   female                                  :1  
##  NA's                 :1   (Other)                                 :3  
##   Par1.Educ          Par2.Educ             Age              Rac.Eth         
##  Length:24          Length:24          Length:24          Length:24         
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##                                   Gender_recode
##  Female &/or Feminine &/or Woman         :9    
##  Male &/or Masculine &/or Man            :6    
##  NA                                      :3    
##  Agender                                 :1    
##  Agender, Female &/or Feminine &/or Woman:1    
##  female                                  :1    
##  (Other)                                 :3    
##                                   Gender_recode2   Q1_recode    
##  Female &/or Feminine &/or Woman         :10     Min.   :2.000  
##  Male &/or Masculine &/or Man            : 6     1st Qu.:3.000  
##  NA                                      : 3     Median :5.000  
##  Agender                                 : 1     Mean   :3.952  
##  Agender, Female &/or Feminine &/or Woman: 1     3rd Qu.:5.000  
##  Genderfluid                             : 1     Max.   :6.000  
##  (Other)                                 : 2     NA's   :3      
##    Q2_recode       Q3_recode       Q4_recode       Q5_recode    
##  Min.   :2.000   Min.   :2.000   Min.   :2.000   Min.   :2.000  
##  1st Qu.:2.250   1st Qu.:3.000   1st Qu.:5.000   1st Qu.:5.000  
##  Median :4.500   Median :5.000   Median :5.000   Median :5.000  
##  Mean   :3.818   Mean   :4.381   Mean   :4.905   Mean   :4.857  
##  3rd Qu.:5.000   3rd Qu.:6.000   3rd Qu.:6.000   3rd Qu.:6.000  
##  Max.   :6.000   Max.   :6.000   Max.   :6.000   Max.   :6.000  
##  NA's   :2       NA's   :3       NA's   :3       NA's   :3      
##     Achieve     
##  Min.   :2.333  
##  1st Qu.:2.917  
##  Median :5.000  
##  Mean   :4.267  
##  3rd Qu.:5.333  
##  Max.   :6.000  
##  NA's   :4

Ta-da!
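
One thing to notice in the summary: Achieve has more NAs than any single item, because adding the columns together gives NA whenever a participant is missing even one of the three items. Whether to average over just the items a participant did answer is a research decision, but if you want to, rowMeans() with na.rm = TRUE is one way to sketch it. The toy dataframe below is made up for illustration.

```r
# Made-up scores for three participants; NA marks a skipped item
toy <- data.frame(Q1_recode = c(5, 2, NA),
                  Q2_recode = c(5, 3, 4),
                  Q5_recode = c(6, NA, NA))

# Adding and dividing: NA whenever any item is missing
(toy$Q1_recode + toy$Q2_recode + toy$Q5_recode) / 3

# rowMeans() with na.rm = TRUE: averages whatever items were answered
rowMeans(toy, na.rm = TRUE)
```

In this toy example the second participant gets NA with the first approach and 2.5 with the second, so it is worth deciding which behavior you want before building your scores.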

ON YOUR OWN: Give it a try on your own – try to make a new column called Master with the average of the two mastery items (Q3 and Q4). Make sure you use the recoded items!

Step 5.5: REFLECTION QUESTION

You made it! Whew! Regardless of how much of this code you were able to run, you have at least familiarized yourself with some of the language and the common commands for data cleaning. And you have seen some ways to think through the problems that arise in data cleaning.

Now we’d like you to go back and take a look at your data. What things do you need to do to get your data ready for analysis? What do you already have example code for? What do you still need code for?

Pre-Activity Instructions

Heather Perkins & Sarah Eddy

3/24/2021
