Step 4: Strategize Data Cleaning

Once your data is in R, you need to understand how R sees it. Just because you know a question produces categorical or continuous data does not mean R treats it that way. There are multiple ways to check how R sees your data. We share two below.

Step 4.1: Previewing Your Data

Now that you’ve got your data into R, let’s get some practice looking at it. There are a few ways to view a dataframe object like our new data file. One is to click it in the Environment tab; it will open in a new window that can do some useful things (e.g., sort by columns or filter). Another way is to use the head() command (and its opposite, the tail() command).

head(data)
##   Stu.ID Intent                    Q1                Q2                Q3
## 1   1001      8              Disagree          Disagree          Disagree
## 2   1002      6 Prefer not to respond          Disagree             Agree
## 3   1003      5     Somewhat disagree Somewhat disagree Somewhat disagree
## 4   1004      9                 Agree             Agree    Strongly Agree
## 5   1005      7        Strongly Agree    Strongly Agree    Strongly Agree
## 6   1006      8              Disagree          Disagree          Disagree
##               Q4             Q5                           Gender   Par1.Educ
## 1          Agree          Agree                          Agender High School
## 2          Agree          Agree Female &/or Feminine &/or Woman  High School
## 3          Agree          Agree Female &/or Feminine &/or Woman  High School
## 4 Strongly Agree Strongly Agree                               NA  Bachelor's
## 5 Strongly Agree Strongly Agree Female &/or Feminine &/or Woman  High School
## 6 Somewhat agree Somewhat agree    Male &/or Masculine &/or Man   Bachelor's
##        Par2.Educ Age                                Rac.Eth
## 1   Some College  18                                  White
## 2    High School  19 Middle Eastern or North African, White
## 3    High School  25    I prefer to identify as [Columbian]
## 4            PhD  16                                  White
## 5 Not applicable  19                                  White
## 6            PhD  18              Black or African American
tail(data)
##    Stu.ID Intent                    Q1                    Q2
## 19   1019      9                  <NA>                  <NA>
## 20   1020      8 Prefer not to respond Prefer not to respond
## 21   1021      5                 Agree                 Agree
## 22   1022      5              Disagree              Disagree
## 23   1023      7                 Agree                 Agree
## 24   1024      6                 Agree                 Agree
##                       Q3                    Q4                    Q5
## 19                  <NA>                  <NA>                  <NA>
## 20 Prefer not to respond Prefer not to respond Prefer not to respond
## 21        Strongly Agree        Strongly Agree        Strongly Agree
## 22              Disagree     Somewhat disagree     Somewhat disagree
## 23        Strongly Agree        Strongly Agree        Strongly Agree
## 24        Strongly Agree        Strongly Agree        Strongly Agree
##                              Gender   Par1.Educ   Par2.Educ Age
## 19                               NA  Bachelor's  Bachelor's  18
## 20                               NA          NA          NA  18
## 21    Male &/or Masculine &/or Man  High School High School  30
## 22 Female &/or Feminine &/or Woman  High School High School  19
## 23  I don't understand the question  Bachelor's  Bachelor's  18
## 24    Male &/or Masculine &/or Man   Bachelor's  Bachelor's  18
##                                Rac.Eth
## 19           Black or African American
## 20           Black or African American
## 21 Hispanic; Latino; or Spanish origin
## 22                               Asian
## 23                               Asian
## 24                               White

As you can see, these commands show you the first and last six rows of the specified dataframe. This is often a quick and easy way to see how your data looks without troubling to open it in a new window.
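
Six rows is only the default. As a minimal sketch, the n argument of head() and tail() lets you ask for a different number of rows. The toy dataframe below is hypothetical and only stands in for the workshop data.

```r
# Hypothetical toy dataframe standing in for the workshop data
toy <- data.frame(
  Stu.ID = 1001:1010,
  Age    = c(18, 19, 25, 16, 19, 18, 20, 21, 19, 22)
)

first_three <- head(toy, n = 3)  # first three rows instead of the default six
last_two    <- tail(toy, n = 2)  # last two rows

print(first_three)
print(last_two)
```

With your own data you would replace toy with the name of your dataframe, for example head(data, n = 3).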

Another useful command is the colnames() command. This shows you a list of all the columns in a dataframe, and can be very helpful if you have a lot of variables and need to see how one is spelled. But more than that, it can be useful if your current column names aren’t serving your purposes. For instance, maybe we want our column names to be lowercase:

colnames(data) # see what the current column names are
##  [1] "Stu.ID"    "Intent"    "Q1"        "Q2"        "Q3"        "Q4"       
##  [7] "Q5"        "Gender"    "Par1.Educ" "Par2.Educ" "Age"       "Rac.Eth"
old_columns <- colnames(data) # save the old column names in a new object
new_columns <- c("stu.id","intent","q1","q2","q3","q4","q5","gender","par1.educ","par2.educ","age","race.eth") # create the new column names and place them in a vector

colnames(data) <- new_columns # apply the new column names to the dataframe
colnames(data) # see the new column names
##  [1] "stu.id"    "intent"    "q1"        "q2"        "q3"        "q4"       
##  [7] "q5"        "gender"    "par1.educ" "par2.educ" "age"       "race.eth"
colnames(data) <- old_columns # apply the old column names to the dataframe
colnames(data) # confirm the original column names are restored
##  [1] "Stu.ID"    "Intent"    "Q1"        "Q2"        "Q3"        "Q4"       
##  [7] "Q5"        "Gender"    "Par1.Educ" "Par2.Educ" "Age"       "Rac.Eth"
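
If all you want is lowercase names, typing every name by hand is not the only option. The sketch below uses the base R tolower() function on a hypothetical toy dataframe. Note that tolower() only changes case, so it cannot also rename a column the way a hand-typed vector can.

```r
# Hypothetical toy dataframe with mixed-case column names
toy <- data.frame(
  Stu.ID    = c(1001, 1002),
  Par1.Educ = c("PhD", "High School"),
  Rac.Eth   = c("White", "Asian")
)

old_names <- colnames(toy)            # keep a copy so the change can be undone
colnames(toy) <- tolower(old_names)   # lowercase every column name at once

print(colnames(toy))
```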

Sometimes you need to view or manipulate only part of an object, such as a single variable in a dataset. To do this, you use the $ operator. If we wanted to preview the first six values of the Q1 column in our data, we could use the code below.

head(data$Q1)
## [1] "Disagree"              "Prefer not to respond" "Somewhat disagree"    
## [4] "Agree"                 "Strongly Agree"        "Disagree"
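
The $ operator has a close relative, double square brackets, which is handy when the column name is stored in another object. This is a small sketch on a hypothetical toy dataframe, not part of the workshop data.

```r
# Hypothetical toy dataframe with one survey column
toy <- data.frame(
  Q1 = c("Agree", "Disagree", "Agree"),
  stringsAsFactors = FALSE
)

by_dollar <- toy$Q1          # pull the column out with $
by_name   <- toy[["Q1"]]     # same column, using [[ ]] and a quoted name

col_wanted  <- "Q1"          # the column name stored in a variable
by_variable <- toy[[col_wanted]]

print(by_dollar)
```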

Last but not least, you can use your new understanding of data structures in R to subset your data. Take our imported data, for instance. Suppose we want to use some of the variables (Stu.ID, Intent, and Age) but not the others, and find that they’re cluttering things up. Subsetting the current dataframe will let us create a new dataframe with the specified information without altering the previous dataframe.

data2 <- subset(data, select=c(Stu.ID,Intent,Age))
head(data2)
##   Stu.ID Intent Age
## 1   1001      8  18
## 2   1002      6  19
## 3   1003      5  25
## 4   1004      9  16
## 5   1005      7  19
## 6   1006      8  18

Perhaps we want to filter our dataframe so that only 19-year-olds are included and we only see the columns relating to IDs, intent, and age.

data3 <- subset(data, Age == "19", select=c(Stu.ID,Intent,Age)) # keep rows where Age is "19" (Age is stored as text), then keep three columns
head(data3)
##    Stu.ID             Intent Age
## 2    1002                  6  19
## 5    1005                  7  19
## 10   1010                  9  19
## 12   1012                 NA  19
## 15   1015 10-Definitely will  19
## 18   1018                  7  19
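
The same kind of filtering can also be written with square brackets, in the form dataframe[rows, columns]. The sketch below assumes a hypothetical toy dataframe in which Age is stored as text, as it is in the imported data.

```r
# Hypothetical toy dataframe; Age is text, as in the imported data
toy <- data.frame(
  Stu.ID = c(1001, 1002, 1003, 1004),
  Intent = c("8", "6", "5", "9"),
  Q1     = c("Disagree", "Agree", "Agree", "Disagree"),
  Age    = c("18", "19", "25", "19"),
  stringsAsFactors = FALSE
)

keep_rows <- toy$Age == "19"               # TRUE for each row to keep
keep_cols <- c("Stu.ID", "Intent", "Age")  # columns to keep, by name

toy_19 <- toy[keep_rows, keep_cols]        # bracket version

# subset() version of the same request, for comparison
via_subset <- subset(toy, Age == "19", select = c(Stu.ID, Intent, Age))

print(toy_19)
```

One difference to keep in mind: if the column you filter on contains missing values, subset() drops those rows, while bracket indexing returns them as rows filled with NA.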

ON YOUR OWN. If you look at the data, you’ll notice we have a participant under the age of 18, which means we can’t use their data without their guardian’s permission, which we didn’t collect. Try subsetting your data to drop those below the age of 18 and see what happens!

Step 4.2: Determine Types of Data R Sees

The str() Command

One way to understand how R is reading your data is the str() (structure) command.

str(data)
## 'data.frame':    24 obs. of  12 variables:
##  $ Stu.ID   : int  1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 ...
##  $ Intent   : chr  "8" "6" "5" "9" ...
##  $ Q1       : chr  "Disagree" "Prefer not to respond" "Somewhat disagree" "Agree" ...
##  $ Q2       : chr  "Disagree" "Disagree" "Somewhat disagree" "Agree" ...
##  $ Q3       : chr  "Disagree" "Agree" "Somewhat disagree" "Strongly Agree" ...
##  $ Q4       : chr  "Agree" "Agree" "Agree" "Strongly Agree" ...
##  $ Q5       : chr  "Agree" "Agree" "Agree" "Strongly Agree" ...
##  $ Gender   : chr  "Agender" "Female &/or Feminine &/or Woman " "Female &/or Feminine &/or Woman " "NA" ...
##  $ Par1.Educ: chr  "High School" "High School" "High School" "Bachelor's" ...
##  $ Par2.Educ: chr  "Some College" "High School" "High School" "PhD" ...
##  $ Age      : chr  "18" "19" "25" "16" ...
##  $ Rac.Eth  : chr  "White" "Middle Eastern or North African, White" "I prefer to identify as [Columbian]" "White" ...

It’s similar to the head() command, but provides two additional bits of information. First, the top line tells you the kind of object and its size (here, a data.frame with 24 observations of 12 variables). Second, between the name of each column (e.g., Stu.ID) and the excerpt of the data in that column (e.g., 1001 1002 1003, etc.) is a short code that tells you what type of data is in the column, or what type of variable it is. Here, Stu.ID is an integer variable (int), while the remaining variables are character variables (chr). The most commonly used types are:

  • Character: includes single characters (“a”) and strings (“Agree”)
  • Integer: allows whole numbers only
  • Numeric: allows all numbers (including integers)
  • Logical: a binary variable that can be either TRUE (T) or FALSE (F)
  • Factor: data organized into categories (called levels)
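
To see these types in action, you can ask R for the class of a single value with the class() command. The values below are made up for illustration.

```r
# One made-up example value for each commonly used type
examples <- list(
  character = "Agree",
  integer   = 5L,     # the L suffix asks R for an integer
  numeric   = 5.5,
  logical   = TRUE,
  factor    = factor(c("Agree", "Disagree"))
)

# Ask R how it classifies each example
types <- sapply(examples, class)
print(types)
```

A plain 5 typed without the L suffix is stored as numeric, not integer, which is one reason the type R reports can differ from the type you expect.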

The summary() Command

Another way to understand how R is reading your data is the summary() command. This function gives you a summary of each column of your data. For numeric data you see the spread of the data (minimum, quartiles, mean, and maximum), and for categorical data (factors) you see how many individuals are in each category. For character data you only get the length, class, and mode of the column. Generally, when you see character data you will want to convert it to categorical data (unless you have other plans for that data).

summary(data)
##      Stu.ID        Intent               Q1                 Q2           
##  Min.   :1001   Length:24          Length:24          Length:24         
##  1st Qu.:1007   Class :character   Class :character   Class :character  
##  Median :1012   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1012                                                           
##  3rd Qu.:1018                                                           
##  Max.   :1024                                                           
##       Q3                 Q4                 Q5               Gender         
##  Length:24          Length:24          Length:24          Length:24         
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   Par1.Educ          Par2.Educ             Age              Rac.Eth         
##  Length:24          Length:24          Length:24          Length:24         
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
## 
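
To make the difference concrete, here is a small sketch that summarizes the same made-up answers twice, once as character data and once as a factor. It also shows table(), a base R command that counts categories without converting anything.

```r
# Made-up survey answers stored as character data
answers <- c("Agree", "Disagree", "Agree", "Agree")

chr_summary <- summary(answers)          # length, class, and mode only
fct_summary <- summary(factor(answers))  # number of responses per category
counts      <- table(answers)            # counts straight from character data

print(chr_summary)
print(fct_summary)
print(counts)
```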

Step 4.3: Change Data Types

You might find that your data has been imported as one type (e.g., character) when you want it to be something else (e.g., integer). There’s a very easy way to tell R to convert it from one type to the other!

data$Gender <- as.factor(data$Gender) # recode gender variable from character to factor

This code overwrites the column in your existing dataframe, converting Gender from character data to a factor. If you rerun the summary() or str() commands you can see the change. Instead of Class :character, you should see the categories for each column you converted, as well as how many participants chose each category.
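
Factors are not the only possible target. As a hedged sketch of the character-to-integer case, as.integer() converts text that looks like whole numbers, and turns anything it cannot read into NA (with a warning). The vectors below are made up for illustration.

```r
# Made-up ages stored as text
age_chr <- c("18", "19", "25", "16")
age_int <- as.integer(age_chr)   # now whole numbers R can do math with

print(class(age_int))
print(mean(age_int))

# Made-up column where one entry is not a clean number
mixed_chr <- c("8", "6", "ten")
mixed_int <- suppressWarnings(as.integer(mixed_chr))  # "ten" cannot be read as a number

print(mixed_int)
```

Because unreadable entries silently become NA once the warning is ignored, it is worth checking a column with summary() both before and after converting it.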

Some columns may have so many categories that R doesn’t show them all when you run a global summary. You can tell levels are missing if you see (Other) among your factors (for an example of this, look at Gender). To see all the levels, run the summary() command on that particular column.

summary(data$Gender)
##                                    Agender 
##                                          1 
##   Agender, Female &/or Feminine &/or Woman 
##                                          1 
##                                     female 
##                                          1 
##           Female &/or Feminine &/or Woman  
##                                          9 
##                                Genderfluid 
##                                          1 
##            I don't understand the question 
##                                          1 
##              Male &/or Masculine &/or Man  
##                                          6 
## Male &/or Masculine &/or Man, Transgender  
##                                          1 
##                                         NA 
##                                          3
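
If you only want the category names without the counts, the base R commands levels() and nlevels() are an option. This is a minimal sketch on a made-up factor.

```r
# Made-up factor with a few repeated categories
gender <- factor(c("Agender", "Genderfluid", "Agender", "Genderfluid", "Agender"))

level_names <- levels(gender)    # the distinct categories, in sorted order
level_count <- nlevels(gender)   # how many distinct categories there are

print(level_names)
print(level_count)
```

Scanning the output of levels() is also a quick way to spot categories that differ only in capitalization or spacing.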

To change multiple columns at the same time, you can use the following code, which uses the lapply() function. The lapply() function is powerful but is also a bit beyond this workshop, so we won’t go into depth, but the code is presented here for your use. Instead of naming each column you want it to look at, you can just give it the column numbers. To find the column numbers, use the names() command, which prints the column names in order; the number in brackets at the start of each output line is the position of the first name on that line.

names(data)
##  [1] "Stu.ID"    "Intent"    "Q1"        "Q2"        "Q3"        "Q4"       
##  [7] "Q5"        "Gender"    "Par1.Educ" "Par2.Educ" "Age"       "Rac.Eth"
data[,3:7] <- lapply(data[,3:7],as.factor)

With this command we are telling R to make columns 3 through 7 (questions Q1 through Q5) factors. You can run the summary() command again to see your data.
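
Column numbers change if columns are added, dropped, or reordered, so you may prefer to select by name. The sketch below shows one way to do that on a hypothetical toy dataframe. The object name likert_cols is made up for this example.

```r
# Hypothetical toy dataframe with two survey questions stored as text
toy <- data.frame(
  Stu.ID = 1001:1003,
  Q1     = c("Agree", "Disagree", "Agree"),
  Q2     = c("Disagree", "Disagree", "Agree"),
  stringsAsFactors = FALSE
)

likert_cols <- c("Q1", "Q2")                              # columns to convert, by name
toy[likert_cols] <- lapply(toy[likert_cols], as.factor)   # convert each one to a factor

print(sapply(toy, class))  # check the type of every column
```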

ON YOUR OWN: Gender isn’t the only variable that would be better as a factor. What about Par1.Educ? Try converting it on your own and see how it goes!

Step 4.4: REFLECTION QUESTION

After completing Step 3, what issues do you see in this data set that will need to be addressed before the data can be analyzed? Refer back to the project scenario, the survey below, the summaries you made in R, and the raw data if you need to.