Data cleaning in R

R is an object based language. So, you need to create the object first to use it later.

a <- 5
b <- 6
a + b
## [1] 11

The # sign is used for commenting in R. So, R ignores everything after the pound sign.

Libraries/Packages (a.k.a. Applications)

Libraries are similar to the applications on your smart phone. We will use the following libraries today.

#install.packages("tidyverse")
#install.packages("knitr")

library(tidyverse)
library(knitr)

Finding missing values

There are two types of missing data

  • NA is for general missing data
    • to check whether a value is NA, use is.na()
    • we will mainly focus on this one
  • NAN is for “Not a Number” (0/0)
    • to check whether a value is NAN, use is.nan()

Let’s create a vector called a and assign some values to it.

a<-c(0, NA, 2, 3, 4,NA,7)
a
## [1]  0 NA  2  3  4 NA  7

Using paranthesis () around the objects

(a<-c(0, NA, 2, 3, 4,NA,7))
## [1]  0 NA  2  3  4 NA  7

is.na is asking whether the object inside the parantheses includes missing cases. It checks all values in that object and returns TRUE (Yes) or FALSE (No).

is.na(a)
## [1] FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE

What if we want to check non-missing values?

!is.na(a)
## [1]  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE

How about where the missing cases are located?

which(is.na(a))
## [1] 2 6

How about where the non-missing cases are located?

which(!is.na(a))
## [1] 1 3 4 5 7

How many missing cases?

sum(is.na(a))
## [1] 2

R treats TRUE as 1, and FALSE as 0. This is very useful when determining how many cases each row or column has missing values in your data.

TRUE+TRUE+FALSE
## [1] 2

length is another useful gunctions that tells us the number of values in a vector.

length(a)
## [1] 7

Let’s calculate the percent missing.

(sum(is.na(a))/length(a))*100
## [1] 28.57143

Practice 1

Here is some data with missing values

set.seed(123)
b<-sample(x=c(1:6),size=500,replace=TRUE)
b[sample(x=c(1:500),size=12)]<-NA
  • How many missing values?
  • Which cases are missing?
  • What is the percent missing?

rowSums() and colSums()

We will use two new functions; rowSums and colSums (R is case sensitive) in the next section.

set.seed(12)
a<-as.data.frame(matrix(sample(1:5,20,replace=TRUE),nrow=5,ncol=4))
kable(a)
V1 V2 V3 V4
1 1 2 3
5 1 5 3
5 4 2 3
2 1 2 4
1 1 2 1
rowSums(a)
## [1]  7 14 14  9  5
colSums(a)
## V1 V2 V3 V4 
## 14  8 13 14
is.na(a)
##         V1    V2    V3    V4
## [1,] FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE

Let’s review what we’ve learned so far

  • is.na() checks whether a cell/case/value is missing
  • !is.na() checks whether a cell/case/value is NOT missing
    • is.na() returns TRUE or FALSE
    • TRUE is 1 and FALSE is 0.
  • ! negates the expression after
  • which() tells us index/location of a value
  • length() tells us how many cases in a vector
  • sum() is summation
  • rowSums() and colSums ()

Let’s put it all together with a simulated data

All the examples so far was with a vector, now we can work with a simulated data frame or just data. You can think of data frame as collection of vectors of same length.

Let’s create a dataset first.

N<-600
ageRange<- c(11:50)
likertScale<-c("Stongly Disagree","Disagree","Neutral","Agree","Stongly Agree")

set.seed(230)
mastID<-c(1001:(1000+N))
age<-sample(x=ageRange,size=N,replace=TRUE)
gender<-sample(x=c("Female","Male","Prefer Not to Answer",NA),
               size=N,replace=TRUE,prob=c(.45,.40,.1,.05))

mindset1<-sample(x=c(likertScale,NA),size=N,replace=TRUE,prob=c(.05,.15,.04,.40,.30,.06))
mindset2<-sample(x=c(1:5,NA),size=N,replace=TRUE,prob=c(.04,.14,.05,.35,.30,.12))
mindset3<-sample(x=c(1:5,NA),size=N,replace=TRUE,prob=c(.04,.11,.08,.35,.39,.03))

datRaw<-data.frame(mastID,age,gender,mindset1,mindset2,mindset3)
datRaw$mindset4<-NA
dat<-datRaw

kable(head(dat))
mastID age gender mindset1 mindset2 mindset3 mindset4
1001 27 Male Stongly Agree 4 4 NA
1002 15 Female Stongly Agree 4 5 NA
1003 32 Female Stongly Agree NA 2 NA
1004 22 Female NA 5 5 NA
1005 36 Prefer Not to Answer Stongly Agree NA 4 NA
1006 15 Male Stongly Agree 5 5 NA

Data frames have two dimensions; rows and columns. To check the dimensions of a data frame we use the dim() function.

dim(dat)
## [1] 600   7

The first value ALWAYS represents the rows and the second value always represents the columns. So, in the data file we just created, we have 600 rows and 7 columns.

nrow(dat)
## [1] 600
ncol(dat)
## [1] 7

We use brackes [,] to call a case/cell/value from a data frame. For example, dat[1,] returns the first row of the dat and dat[,1] returns the first column of the data.

dat[1,]
##   mastID age gender      mindset1 mindset2 mindset3 mindset4
## 1   1001  27   Male Stongly Agree        4        4       NA
head(dat[,1])
## [1] 1001 1002 1003 1004 1005 1006

Practice 2

  • How many variables have missing values in the first ROW (respondents)?
  • How many missing values in the seventh COLUMN (variable)?
  • Which variables are missing in the 125th row?
  • How many cases are missing for each variable?
  • How many variables are missing for each participant?

table() function and $ sign

To call a column from a data set, we can use the $ sign. For example, to call the mindset2 variable, we can write dat$mindset2.

dat$mindset2
##   [1]  4  4 NA  5 NA  5  5  2  4  4 NA  5  4 NA  4  5  5  5  2  5  4  4  4
##  [24]  5  4  3  5  4  4 NA  5  2  5  4  1  2  5  3  3  4  5  4  5  4  3  4
##  [47]  4  4 NA  4  2  5  4 NA  2  5  2  5  4 NA  2  2  5 NA  5  4  4  5  4
##  [70]  5  1  5  2 NA  5  5  5  4  2  4  4 NA  5  5  4  3  4  4  4 NA  5  5
##  [93]  5  5  4 NA  4  5  5  5  5  5  4  3  2  4  4  5  5  2  5  4  5 NA  2
## [116]  4  5  4  5  2  4  5  4  4  5  5  4  5  4  4  4  2  4  4  2  4  2  5
## [139]  5 NA  4  2  2  2  2  4  5  5  4 NA  4  2  2  4  2  5  2  4  3  4  5
## [162]  5  5  4  3 NA  5 NA  4  2  4  4  5  2  3  5  5  4  5  4  4 NA  2  1
## [185]  4 NA NA  5  2  5  4  2  5 NA  4  5  4  2  5  4  2  4  1  1  5  4  2
## [208]  4  5  4  4  2  4 NA  4 NA NA NA  2  5  2  4 NA  5 NA  5  5  4  4  4
## [231] NA  4  5  4  5  5  4  5 NA  2  5  4  5  4  5  5  2  4  5  4  4  5  4
## [254]  4 NA  4  5  5  5  5  1  5  4  4  2  5  3  2 NA NA  4  4  5  2  3  2
## [277]  4  2  5  4  4 NA  5 NA  4  1  4  4  5  2 NA  5  5  2  4  4  1 NA  5
## [300]  5 NA  5 NA  2  4  2  5  4  2  5  2  2 NA  2  4  5  2  2  4  5  5  5
## [323]  5  5  5  4 NA  4  5  3  4  5  4  4  4  2  5  4  5  5  5  4 NA  4  5
## [346]  4  4  4 NA  2  4  2  2 NA  2  4  2  5  3  4  5  5  5 NA  5  3  5  4
## [369]  4  4  4  4  4  4  2  4  4 NA  1 NA  2  2  4 NA NA  5  2 NA  5  5 NA
## [392]  2  5  2  4  4  2  4  2  5  5  2  5  3  5  1  4  2  5 NA  2  5  4  5
## [415]  4  4  1  5  5  4  4  4  5  4  4  4 NA  1  4  4  5  2  2  4  4  4  3
## [438]  5  4  4  4  2  4  2  4 NA  2  4  5  5  2  4  4  2  5  3 NA  4  2  5
## [461]  2  1  5  5  2  4 NA NA  4  4  4  4  2  2  4  4  4  5  4  5  3  5  4
## [484]  5  5  4 NA NA NA  4  4  1  4  2  4  4  5  4 NA  1  4 NA  1 NA  3  4
## [507]  5 NA  5  5  2  4  4  4  5  4  4  5  5  5  2  5  5  4  5  4  1  2  5
## [530]  2  4  5  5  4  5 NA  3 NA  2 NA  4  5  5  4  4  1  5  5  4  2  4  4
## [553]  5 NA  1  4  2  4  4  2  4  4  5  4  2  4  5  5  5  5  2  4  5  1 NA
## [576]  5 NA  4  4 NA  4  4  4  5  5  4  5  4  5  5 NA  5  5  2  4 NA  5  5
## [599]  4  4

Let’s create a frequency table.

table(dat$mindset2)
## 
##   1   2   3   4   5 
##  20  95  20 211 181
table(dat$mindset2,useNA="ifany")
## 
##    1    2    3    4    5 <NA> 
##   20   95   20  211  181   73

Renaming and/or recoding variables

is.factor(dat$mindset1)
## [1] TRUE
dat$ms1<-as.character(dat$mindset1)
dat$ms2<-as.character(dat$mindset1)

is.factor(dat$ms1)
## [1] FALSE
table(dat$ms1)
## 
##            Agree         Disagree          Neutral    Stongly Agree 
##              228              104               23              191 
## Stongly Disagree 
##               25
dat$ms1[which(dat$ms1=="Stongly Disagree")]<-1
dat$ms1[which(dat$ms1=="Disagree")]<-2
dat$ms1[which(dat$ms1=="Neutral")]<-3
dat$ms1[which(dat$ms1=="Agree")]<-4
dat$ms1[which(dat$ms1=="Stongly Agree")]<-5


dat$ms2<-with(dat,ifelse(is.na(ms2),NA,
                         ifelse(ms2=="Stongly Disagree",1,
                                ifelse(ms2=="Disagree",2,
                                       ifelse(ms2=="Neutral",3,
                                              ifelse(ms2=="Agree",4,
                                                     ifelse(ms2=="Stongly Agree",5,ms2)))))))


dat$ms3<-factor(dat$ms2,levels=c(1:5),labels=likertScale,ordered = TRUE)

table(dat$mindset1)
## 
##            Agree         Disagree          Neutral    Stongly Agree 
##              228              104               23              191 
## Stongly Disagree 
##               25
table(dat$ms1)
## 
##   1   2   3   4   5 
##  25 104  23 228 191
table(dat$ms2)
## 
##   1   2   3   4   5 
##  25 104  23 228 191
table(dat$ms3)
## 
## Stongly Disagree         Disagree          Neutral            Agree 
##               25              104               23              228 
##    Stongly Agree 
##              191

Reverse coding variables

If it is a numeric variable, that needs to be reverse coded. Otherwise, we can use the same methods in the previous section

range(as.numeric(dat$ms1),na.rm=TRUE)
## [1] 1 5
dat$ms4<- 6-as.numeric(dat$ms1)

table(dat$ms1,useNA = "ifany")
## 
##    1    2    3    4    5 <NA> 
##   25  104   23  228  191   29
table(dat$ms4,useNA = "ifany")
## 
##    1    2    3    4    5 <NA> 
##  191  228   23  104   25   29

Recoding Age Variable

dat$age1<-dat$age
for(which.row in 1:nrow(dat)){
  ageToCheck<-dat$age[which.row]
  if(is.na(ageToCheck)){
          dat$age1[which.row]<-NA
          } else if (ageToCheck>10 & ageToCheck<=20){
            dat$age1[which.row]<-"gr1"
          } else if (ageToCheck>20 & ageToCheck<=30){
           dat$age1[which.row]<-"gr2" 
          } else if (ageToCheck>30 & ageToCheck<=40){
           dat$age1[which.row]<-"gr3" 
          } else {
            dat$age1[which.row]<-"gr4" 
          } 
}
table(dat$age1)
## 
## gr1 gr2 gr3 gr4 
## 137 184 137 142
dat$age2<-dat$age
range(dat$age)
## [1] 11 50
dat$age2[dat$age>10 & dat$age<=20]<-"gr1"
dat$age2[dat$age>20 & dat$age<=30]<-"gr2"
dat$age2[dat$age>30 & dat$age<=40]<-"gr3"
dat$age2[dat$age>40 & dat$age<=50]<-"gr4"


dat$age3<-dat$age
dat$age3<-with(dat,ifelse(age>10 & age<=20,"gr1",
                    ifelse(age>20 & age<=30,"gr2",
                      ifelse(age>30 & age<=40,"gr3","gr4"))))

table(dat$age1)
## 
## gr1 gr2 gr3 gr4 
## 137 184 137 142
table(dat$age2)
## 
## gr1 gr2 gr3 gr4 
## 137 184 137 142
table(dat$age3)
## 
## gr1 gr2 gr3 gr4 
## 137 184 137 142

Let’s Practice 3

  • Pick one of the methods above and create three groups for age, name this variable age4
    • age 10 to 25 - “group 1”
    • age 26 to 39 - “group 2”
    • age 40 to 50 - “group 3”
  • Recode gender to numeric and name this variable gender1
    • Male to 1
    • Female to 2
    • Prefer Not to Answer to 3

Creating Aggregate Variables 1

mutate()

datAgg<-dat %>% 
    group_by(age3) %>%
    mutate(mindset3mean=mean(mindset3,na.rm=TRUE),n=n()) %>%
    ungroup()
kable(head(datAgg))
mastID age gender mindset1 mindset2 mindset3 mindset4 ms1 ms2 ms3 ms4 age1 age2 age3 mindset3mean n
1001 27 Male Stongly Agree 4 4 NA 5 5 Stongly Agree 1 gr2 gr2 gr2 4.097143 184
1002 15 Female Stongly Agree 4 5 NA 5 5 Stongly Agree 1 gr1 gr1 gr1 3.783582 137
1003 32 Female Stongly Agree NA 2 NA 5 5 Stongly Agree 1 gr3 gr3 gr3 4.000000 137
1004 22 Female NA 5 5 NA NA NA NA NA gr2 gr2 gr2 4.097143 184
1005 36 Prefer Not to Answer Stongly Agree NA 4 NA 5 5 Stongly Agree 1 gr3 gr3 gr3 4.000000 137
1006 15 Male Stongly Agree 5 5 NA 5 5 Stongly Agree 1 gr1 gr1 gr1 3.783582 137

Creating Aggregate Variables 2

summarize()

datAgg<-dat %>% 
    group_by(age3) %>%
    summarize(mindset3mean=mean(mindset3,na.rm=TRUE),n=n()) %>%
    ungroup()
kable(head(datAgg))
age3 mindset3mean n
gr1 3.783582 137
gr2 4.097143 184
gr3 4.000000 137
gr4 3.970588 142

Practice 4

  • Group by age3 and gender
    • add the new variables to the datAgg
    • calculate the mean (function is “mean()”) for mindset3 for each group
    • calculate the standard deviation (function is “sd()”) for mindset3 for each group
  • Group by gender and age
    • Create datAgg with only calculated variables
    • calculate the mean (function is “mean()”) for mindset3 for each group
    • calculate the standard deviation (function is “sd()”) for mindset3 for each group