Step 5: Starting Data Cleaning

There are numerous small things that go into cleaning your data. A few of the most important ones are included in the section below.

Step 5.1: Standardizing the Levels of a Factor

You might have noticed while looking over the data that the Gender column contains one entry for ‘female’. Check it out using the summary() command.

data$Gender <- as.factor(data$Gender)
summary(data$Gender)
##                                    Agender 
##                                          1 
##   Agender, Female &/or Feminine &/or Woman 
##                                          1 
##                                     female 
##                                          1 
##           Female &/or Feminine &/or Woman  
##                                          9 
##                                Genderfluid 
##                                          1 
##            I don't understand the question 
##                                          1 
##              Male &/or Masculine &/or Man  
##                                          6 
## Male &/or Masculine &/or Man, Transgender  
##                                          1 
##                                         NA 
##                                          3

This entry should be combined with ‘Female &/or Feminine &/or Woman’. Notice also that some levels end with a space, for example after the ‘n’ in ‘Woman’. These little differences can be the most challenging to diagnose. We will first remove all leading and trailing spaces (undesired spaces at the beginning or end of a string) and then recode the variable.
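
Stray spaces are invisible in printed output, so it can help to ask R to flag them for you. The following is a minimal sketch on a made-up vector (toy_gender and its values are hypothetical, not part of the workshop data):

```r
# Toy vector (hypothetical values) with one trailing space hidden in it
toy_gender <- c("Woman", "Woman ", "Man", "female")

# TRUE wherever a value starts or ends with whitespace
has_stray_space <- grepl("^\\s|\\s$", toy_gender)
has_stray_space

# Wrapping the flagged values in brackets makes the extra space visible
flagged <- paste0("[", toy_gender[has_stray_space], "]")
flagged
```

The same check could be run on a real column by swapping toy_gender for something like as.character(data$Gender).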

Step 5.1.1: Remove Leading and Trailing Spaces

The best way to remove leading and trailing spaces is one column at a time. In the code below, we use the trimws() command to remove the spaces and create a new column from the result. However, upon viewing the result we can see that the new column has the data type ‘character’, so we need to change it to a factor again. And ta-da! No more unwanted spaces.

data$Gender_recode <- trimws(data$Gender, which = c("both"))
data$Gender_recode <- as.factor(data$Gender_recode)
summary(data$Gender_recode)
##                                   Agender 
##                                         1 
##  Agender, Female &/or Feminine &/or Woman 
##                                         1 
##                                    female 
##                                         1 
##           Female &/or Feminine &/or Woman 
##                                         9 
##                               Genderfluid 
##                                         1 
##           I don't understand the question 
##                                         1 
##              Male &/or Masculine &/or Man 
##                                         6 
## Male &/or Masculine &/or Man, Transgender 
##                                         1 
##                                        NA 
##                                         3

Step 5.1.2: Recode a Variable Using the dplyr Package

The dplyr package is a popular and powerful package that you’ll probably use often. The recode() command from that package is straightforward and easy to use. Here, we use it to edit the data so that ‘female’ is added to the larger ‘Female &/or Feminine &/or Woman’ category and the result is saved as a new variable.

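# Install dplyr (only needed the first time you use it on this computer)
install.packages("dplyr")
library(dplyr)
# Fold the lone 'female' entry into the larger category and save as a new variable
data$Gender_recode2 <- recode(data$Gender_recode, female = "Female &/or Feminine &/or Woman")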
summary(data$Gender_recode2)
##                                   Agender 
##                                         1 
##  Agender, Female &/or Feminine &/or Woman 
##                                         1 
##           Female &/or Feminine &/or Woman 
##                                        10 
##                               Genderfluid 
##                                         1 
##           I don't understand the question 
##                                         1 
##              Male &/or Masculine &/or Man 
##                                         6 
## Male &/or Masculine &/or Man, Transgender 
##                                         1 
##                                        NA 
##                                         3
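
If you would rather not load a package for a single change, base R can merge factor levels too. This is only a sketch on a made-up factor (toy and its values are hypothetical), offered as an alternative to recode() rather than the workshop's method:

```r
# Toy factor (hypothetical values) with a stray lowercase entry
toy <- factor(c("female", "Woman", "Woman", "Man"))

# Renaming a level to a name that already exists merges the two levels
levels(toy)[levels(toy) == "female"] <- "Woman"

# 'female' is gone and its count has been added to 'Woman'
summary(toy)
```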

ON YOUR OWN: There’s a lot you can do with the recode command! Try checking out the Rac.Eth data. One of the write-in responses is ‘I prefer to identify as [black]’. This probably could be combined with the ‘Black or African American’ category. Try writing your own code to make that happen.

Step 5.2: Changing String Variables to Numbers

Often with Likert-scale data you want to convert phrases to numbers (‘Strongly disagree’ to 1, for example) so you can treat the variable as numeric or ordinal. To do this we will use the same command as when we were cleaning up the factors: recode(). This time we enter multiple levels that we want to recode. Any level that isn’t listed (such as ‘Prefer not to respond’) is recoded as NA, because we set .default = NA_real_ in the call.

Last, we use the table() command to make sure that the recoded variable and original variable line up. For instance, we can see that 10 participants selected ‘Agree’ in the original variable, and all 10 of them have a score of ‘5’ in the new variable, suggesting that the recoding worked and everyone was assigned the correct value.

library(dplyr)
summary(data$Q1)
data$Q1_recode <- recode(data$Q1, "Strongly disagree" = 1,
                                       "Disagree" = 2,
                                       "Somewhat disagree" = 3,
                                       "Somewhat agree" = 4,
                                       "Agree" = 5,
                                       "Strongly Agree" = 6,
                         .default = NA_real_)
summary(data$Q1_recode)
table(data$Q1, data$Q1_recode, useNA = "always")
##    Length     Class      Mode 
##        24 character character
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   2.000   3.000   5.000   3.952   5.000   6.000       3
##                        
##                          2  3  4  5  6 <NA>
##   Agree                  0  0  0 10  0    0
##   Disagree               4  0  0  0  0    0
##   Prefer not to respond  0  0  0  0  0    2
##   Somewhat agree         0  0  1  0  0    0
##   Somewhat disagree      0  5  0  0  0    0
##   Strongly Agree         0  0  0  0  1    0
##   <NA>                   0  0  0  0  0    1
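
Because labels must match exactly, including capitalisation, a response typed slightly differently will quietly become NA. The sketch below uses a made-up vector and a base R lookup (toy_q, likert_key, and unmatched are hypothetical names) to list which labels ended up as NA, so you can decide whether that was intended:

```r
# Hypothetical responses, including a capitalisation variant and a missing value
toy_q <- c("Agree", "Strongly agree", "Disagree", "Prefer not to respond", NA)

# Same scoring key as in the recode() call above
likert_key <- c("Strongly disagree" = 1, "Disagree" = 2,
                "Somewhat disagree" = 3, "Somewhat agree" = 4,
                "Agree" = 5, "Strongly Agree" = 6)

# Indexing the key by name returns NA for any label that is not in the key
toy_score <- unname(likert_key[toy_q])

# Labels that were present in the data but received no score
unmatched <- unique(toy_q[is.na(toy_score) & !is.na(toy_q)])
unmatched
```

Here ‘Strongly agree’ is flagged because the key spells it ‘Strongly Agree’, which is a mismatch that the table() check would also reveal.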

OPTIONAL: Recoding Multiple Variables Using the dplyr Package

The most straightforward, but also the most time-consuming, way to recode multiple variables is to use the above code and switch out the variable names. For instance, to recode column ‘Q2’ you could duplicate the above code, replace all instances of ‘Q1’ with ‘Q2’, and run it again. See below:

library(dplyr)
summary(data$Q2)
data$Q2_recode <- recode(data$Q2, "Strongly disagree" = 1,
                                       "Disagree" = 2,
                                       "Somewhat disagree" = 3,
                                       "Somewhat agree" = 4,
                                       "Agree" = 5,
                                       "Strongly Agree" = 6,
                         .default = NA_real_)
summary(data$Q2_recode)
table(data$Q2, data$Q2_recode, useNA = "always")

The other option is to use more complicated code. The code below will work, but we won’t be explaining it in depth in the current workshop. Remember to check your recoding using the table() command and you should be safe, but if in doubt, use the lengthier but easier-to-understand script.

library(dplyr)
# Recode Q2 through Q5 in one step; transmute_at() keeps only the recoded columns
data2 <- data %>%
  transmute_at(c("Q2","Q3","Q4","Q5"), ~ recode(., "Strongly disagree" = 1,
                                       "Disagree" = 2,
                                       "Somewhat disagree" = 3,
                                       "Somewhat agree" = 4,
                                       "Agree" = 5,
                                       "Strongly Agree" = 6,
                                       .default = NA_real_))
colnames(data2) <- paste0(colnames(data2),"_recode")
data <- cbind.data.frame(data, data2)
table(data$Q2, data$Q2_recode, useNA = "always")
##                        
##                          2  3  4  5  6 <NA>
##   Agree                  0  0  0 10  0    0
##   Disagree               6  0  0  0  0    0
##   Prefer not to respond  0  0  0  0  0    1
##   Somewhat agree         0  0  1  0  0    0
##   Somewhat disagree      0  4  0  0  0    0
##   Strongly Agree         0  0  0  0  1    0
##   <NA>                   0  0  0  0  0    1
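
For comparison, the same idea can be written in base R with a lookup vector and lapply(). This is a sketch under stated assumptions, not the workshop's method: toy, likert_key, and items are hypothetical names, and the values are made up.

```r
# Toy data frame (hypothetical values) standing in for the survey data
toy <- data.frame(Q2 = c("Agree", "Disagree", NA),
                  Q3 = c("Strongly Agree", "Prefer not to respond", "Somewhat agree"),
                  stringsAsFactors = FALSE)

# Scoring key: labels that are not listed here become NA
likert_key <- c("Strongly disagree" = 1, "Disagree" = 2,
                "Somewhat disagree" = 3, "Somewhat agree" = 4,
                "Agree" = 5, "Strongly Agree" = 6)

items <- c("Q2", "Q3")

# Apply the same lookup to each item and store the results as new *_recode columns
toy[paste0(items, "_recode")] <- lapply(toy[items], function(x) unname(likert_key[x]))

# Check each recoded item against its original, as before
table(toy$Q2, toy$Q2_recode, useNA = "always")
```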

Step 5.3: Visualizing Numeric Variables

One advantage of converting a string-type variable into a numeric variable, as we did in the last step, is that you can plot it. The code below creates a simple histogram to help you visualize the distribution of your data.

hist(data$Q1_recode, breaks = 5)
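
With whole-number scores, breaks = 5 is only a suggestion to R about how many bins to use, so the bars may not line up one per score. One option, sketched here on made-up scores (toy_scores is hypothetical), is to place the bin edges halfway between the possible values:

```r
# Hypothetical recoded scores on the 1-6 scale, with one missing value
toy_scores <- c(2, 2, 3, 5, 5, 5, 6, NA)

# Edges at 0.5, 1.5, ..., 6.5 give exactly one bar per possible score
h <- hist(toy_scores,
          breaks = seq(0.5, 6.5, by = 1),
          main = "Distribution of toy scores",
          xlab = "Score")

# Number of responses in each bar, for scores 1 through 6
h$counts
```

Missing values are dropped by hist(), so the counts cover only the seven answered responses.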

ON YOUR OWN: There are some other variables that would benefit from similar treatment. On your own, see if you can make the ‘Intent’ variable numeric and visualize it in a histogram.

Step 5.4: Creating Factor Scores

The self-efficacy questions come from two scales: achievement (Q1, Q2, Q5) and mastery (Q3 and Q4). A two-question factor is not ideal, but we will use it as an example. The researchers don’t care about the individual questions but want a factor score for each scale, calculated by averaging student responses on that scale. To do this they need to make a new column for the factor score and fill that column in with the average score for the individual items. Note that a student who is missing any one of the items will get NA for the factor score, because adding NA to a number gives NA. This is actually really easy!

data$Achieve <- (data$Q1_recode + data$Q2_recode + data$Q5_recode)/3
summary(data)
##      Stu.ID        Intent               Q1                 Q2           
##  Min.   :1001   Length:24          Length:24          Length:24         
##  1st Qu.:1007   Class :character   Class :character   Class :character  
##  Median :1012   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1012                                                           
##  3rd Qu.:1018                                                           
##  Max.   :1024                                                           
##                                                                         
##       Q3                 Q4                 Q5           
##  Length:24          Length:24          Length:24         
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##                                       Gender   Par1.Educ        
##  Female &/or Feminine &/or Woman         :9   Length:24         
##  Male &/or Masculine &/or Man            :6   Class :character  
##  NA                                      :3   Mode  :character  
##  Agender                                 :1                     
##  Agender, Female &/or Feminine &/or Woman:1                     
##  female                                  :1                     
##  (Other)                                 :3                     
##   Par2.Educ             Age              Rac.Eth         
##  Length:24          Length:24          Length:24         
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##                                   Gender_recode
##  Female &/or Feminine &/or Woman         :9    
##  Male &/or Masculine &/or Man            :6    
##  NA                                      :3    
##  Agender                                 :1    
##  Agender, Female &/or Feminine &/or Woman:1    
##  female                                  :1    
##  (Other)                                 :3    
##                                   Gender_recode2   Q1_recode    
##  Female &/or Feminine &/or Woman         :10     Min.   :2.000  
##  Male &/or Masculine &/or Man            : 6     1st Qu.:3.000  
##  NA                                      : 3     Median :5.000  
##  Agender                                 : 1     Mean   :3.952  
##  Agender, Female &/or Feminine &/or Woman: 1     3rd Qu.:5.000  
##  Genderfluid                             : 1     Max.   :6.000  
##  (Other)                                 : 2     NA's   :3      
##    Q2_recode       Q3_recode       Q4_recode       Q5_recode    
##  Min.   :2.000   Min.   :2.000   Min.   :2.000   Min.   :2.000  
##  1st Qu.:2.250   1st Qu.:3.000   1st Qu.:5.000   1st Qu.:5.000  
##  Median :4.500   Median :5.000   Median :5.000   Median :5.000  
##  Mean   :3.818   Mean   :4.381   Mean   :4.905   Mean   :4.857  
##  3rd Qu.:5.000   3rd Qu.:6.000   3rd Qu.:6.000   3rd Qu.:6.000  
##  Max.   :6.000   Max.   :6.000   Max.   :6.000   Max.   :6.000  
##  NA's   :2       NA's   :3       NA's   :3       NA's   :3      
##     Achieve     
##  Min.   :2.333  
##  1st Qu.:2.917  
##  Median :5.000  
##  Mean   :4.267  
##  3rd Qu.:5.333  
##  Max.   :6.000  
##  NA's   :4

Ta-da!
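
Whether a student with a missing item should get a score at all is a judgment call for the researchers. The sketch below, on a made-up data frame (toy, items, and the Achieve_* column names are hypothetical), contrasts the strict average used above with a lenient one that averages whichever items were answered:

```r
# Toy recoded items (hypothetical values) for three students
toy <- data.frame(Q1_recode = c(5, 2, NA),
                  Q2_recode = c(5, 3, 4),
                  Q5_recode = c(6, NA, NA))
items <- c("Q1_recode", "Q2_recode", "Q5_recode")

# Strict: any missing item makes the score NA (same result as adding and dividing by 3)
toy$Achieve_strict <- rowMeans(toy[, items])

# Lenient: average only the items that were answered
toy$Achieve_lenient <- rowMeans(toy[, items], na.rm = TRUE)

toy
```

The lenient version keeps more students in the analysis, but their scores rest on fewer items, so it is worth reporting which approach was used.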

ON YOUR OWN: Give it a try on your own – try to make a new column called Master with the average of the two mastery items (Q3 and Q4). Make sure you use the recoded items!

Step 5.5: REFLECTION QUESTION

You made it! Whew! Regardless of how much of this code you were able to run, you have at least familiarized yourself with some of the language and the common commands for data cleaning. And you have seen some ways to think through the problems that arise in data cleaning.

Now we’d like you to go back and take a look at your data. What things do you need to do to get your data ready for analysis? What do you already have example code for? What do you still need code for?

ON YOUR OWN: Want to practice what you’ve learned? Try loading in your own data, or the provided CC data if you haven’t had a chance to gather data of your own yet. You’ll no doubt run into new and exciting issues working with a new dataset, so try to keep track of them and bring them to our meeting so we can all troubleshoot together. Hitting the wall and finding a way past it is an essential part of programming and the R experience, so don’t let yourself get too frustrated with any hurdles you encounter – they’re the best way to learn!