Data cleaning in R

R is an object based language. So, you need to create the object first to use it later.

a <- 5
b <- 6
a + b

## [1] 11

The # sign is used for commenting in R. So, R ignores everything after the pound sign.

Libraries/Packages (a.k.a. Applications)

Libraries are similar to the applications on your smart phone. We will use the following libraries today.

#install.packages("tidyverse")
#install.packages("knitr")

library(tidyverse)
library(knitr)

Finding missing values

There are two types of missing data

NA is for general missing data
- to check whether a value is NA, use is.na()
- we will mainly focus on this one
NAN is for “Not a Number” (0/0)
- to check whether a value is NAN, use is.nan()

Let’s create a vector called a and assign some values to it.

a<-c(0, NA, 2, 3, 4,NA,7)
a

## [1]  0 NA  2  3  4 NA  7

Using paranthesis () around the objects

(a<-c(0, NA, 2, 3, 4,NA,7))

## [1]  0 NA  2  3  4 NA  7

is.na is asking whether the object inside the parantheses includes missing cases. It checks all values in that object and returns TRUE (Yes) or FALSE (No).

is.na(a)

## [1] FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE

What if we want to check non-missing values?

!is.na(a)

## [1]  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE

How about where the missing cases are located?

which(is.na(a))

## [1] 2 6

How about where the non-missing cases are located?

which(!is.na(a))

## [1] 1 3 4 5 7

How many missing cases?

sum(is.na(a))

## [1] 2

R treats TRUE as 1, and FALSE as 0. This is very useful when determining how many cases each row or column has missing values in your data.

TRUE+TRUE+FALSE

## [1] 2

length is another useful gunctions that tells us the number of values in a vector.

length(a)

## [1] 7

Let’s calculate the percent missing.

(sum(is.na(a))/length(a))*100

## [1] 28.57143

Practice 1

Here is some data with missing values

set.seed(123)
b<-sample(x=c(1:6),size=500,replace=TRUE)
b[sample(x=c(1:500),size=12)]<-NA

How many missing values?
Which cases are missing?
What is the percent missing?

rowSums() and colSums()

We will use two new functions; rowSums and colSums (R is case sensitive) in the next section.

set.seed(12)
a<-as.data.frame(matrix(sample(1:5,20,replace=TRUE),nrow=5,ncol=4))
kable(a)

V1	V2	V3	V4
1	1	2	3
5	1	5	3
5	4	2	3
2	1	2	4
1	1	2	1

rowSums(a)

## [1]  7 14 14  9  5

colSums(a)

## V1 V2 V3 V4 
## 14  8 13 14

is.na(a)

##         V1    V2    V3    V4
## [1,] FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE

Let’s review what we’ve learned so far

is.na() checks whether a cell/case/value is missing
!is.na() checks whether a cell/case/value is NOT missing
- is.na() returns TRUE or FALSE
- TRUE is 1 and FALSE is 0.
! negates the expression after
which() tells us index/location of a value
length() tells us how many cases in a vector
sum() is summation
rowSums() and colSums ()

Let’s put it all together with a simulated data

All the examples so far was with a vector, now we can work with a simulated data frame or just data. You can think of data frame as collection of vectors of same length.

Let’s create a dataset first.

N<-600
ageRange<- c(11:50)
likertScale<-c("Stongly Disagree","Disagree","Neutral","Agree","Stongly Agree")

set.seed(230)
mastID<-c(1001:(1000+N))
age<-sample(x=ageRange,size=N,replace=TRUE)
gender<-sample(x=c("Female","Male","Prefer Not to Answer",NA),
               size=N,replace=TRUE,prob=c(.45,.40,.1,.05))

mindset1<-sample(x=c(likertScale,NA),size=N,replace=TRUE,prob=c(.05,.15,.04,.40,.30,.06))
mindset2<-sample(x=c(1:5,NA),size=N,replace=TRUE,prob=c(.04,.14,.05,.35,.30,.12))
mindset3<-sample(x=c(1:5,NA),size=N,replace=TRUE,prob=c(.04,.11,.08,.35,.39,.03))

datRaw<-data.frame(mastID,age,gender,mindset1,mindset2,mindset3)
datRaw$mindset4<-NA
dat<-datRaw

kable(head(dat))

mastID	age	gender	mindset1	mindset2	mindset3	mindset4
1001	27	Male	Stongly Agree	4	4	NA
1002	15	Female	Stongly Agree	4	5	NA
1003	32	Female	Stongly Agree	NA	2	NA
1004	22	Female	NA	5	5	NA
1005	36	Prefer Not to Answer	Stongly Agree	NA	4	NA
1006	15	Male	Stongly Agree	5	5	NA

Data frames have two dimensions; rows and columns. To check the dimensions of a data frame we use the dim() function.

dim(dat)

## [1] 600   7

The first value ALWAYS represents the rows and the second value always represents the columns. So, in the data file we just created, we have 600 rows and 7 columns.

nrow(dat)

## [1] 600

ncol(dat)

## [1] 7

We use brackes [,] to call a case/cell/value from a data frame. For example, dat[1,] returns the first row of the dat and dat[,1] returns the first column of the data.

dat[1,]

##   mastID age gender      mindset1 mindset2 mindset3 mindset4
## 1   1001  27   Male Stongly Agree        4        4       NA

head(dat[,1])

## [1] 1001 1002 1003 1004 1005 1006

Practice 2

How many variables have missing values in the first ROW (respondents)?
How many missing values in the seventh COLUMN (variable)?
Which variables are missing in the 125th row?
How many cases are missing for each variable?
How many variables are missing for each participant?

table() function and $ sign

To call a column from a data set, we can use the $ sign. For example, to call the mindset2 variable, we can write dat$mindset2.

dat$mindset2

##   [1]  4  4 NA  5 NA  5  5  2  4  4 NA  5  4 NA  4  5  5  5  2  5  4  4  4
##  [24]  5  4  3  5  4  4 NA  5  2  5  4  1  2  5  3  3  4  5  4  5  4  3  4
##  [47]  4  4 NA  4  2  5  4 NA  2  5  2  5  4 NA  2  2  5 NA  5  4  4  5  4
##  [70]  5  1  5  2 NA  5  5  5  4  2  4  4 NA  5  5  4  3  4  4  4 NA  5  5
##  [93]  5  5  4 NA  4  5  5  5  5  5  4  3  2  4  4  5  5  2  5  4  5 NA  2
## [116]  4  5  4  5  2  4  5  4  4  5  5  4  5  4  4  4  2  4  4  2  4  2  5
## [139]  5 NA  4  2  2  2  2  4  5  5  4 NA  4  2  2  4  2  5  2  4  3  4  5
## [162]  5  5  4  3 NA  5 NA  4  2  4  4  5  2  3  5  5  4  5  4  4 NA  2  1
## [185]  4 NA NA  5  2  5  4  2  5 NA  4  5  4  2  5  4  2  4  1  1  5  4  2
## [208]  4  5  4  4  2  4 NA  4 NA NA NA  2  5  2  4 NA  5 NA  5  5  4  4  4
## [231] NA  4  5  4  5  5  4  5 NA  2  5  4  5  4  5  5  2  4  5  4  4  5  4
## [254]  4 NA  4  5  5  5  5  1  5  4  4  2  5  3  2 NA NA  4  4  5  2  3  2
## [277]  4  2  5  4  4 NA  5 NA  4  1  4  4  5  2 NA  5  5  2  4  4  1 NA  5
## [300]  5 NA  5 NA  2  4  2  5  4  2  5  2  2 NA  2  4  5  2  2  4  5  5  5
## [323]  5  5  5  4 NA  4  5  3  4  5  4  4  4  2  5  4  5  5  5  4 NA  4  5
## [346]  4  4  4 NA  2  4  2  2 NA  2  4  2  5  3  4  5  5  5 NA  5  3  5  4
## [369]  4  4  4  4  4  4  2  4  4 NA  1 NA  2  2  4 NA NA  5  2 NA  5  5 NA
## [392]  2  5  2  4  4  2  4  2  5  5  2  5  3  5  1  4  2  5 NA  2  5  4  5
## [415]  4  4  1  5  5  4  4  4  5  4  4  4 NA  1  4  4  5  2  2  4  4  4  3
## [438]  5  4  4  4  2  4  2  4 NA  2  4  5  5  2  4  4  2  5  3 NA  4  2  5
## [461]  2  1  5  5  2  4 NA NA  4  4  4  4  2  2  4  4  4  5  4  5  3  5  4
## [484]  5  5  4 NA NA NA  4  4  1  4  2  4  4  5  4 NA  1  4 NA  1 NA  3  4
## [507]  5 NA  5  5  2  4  4  4  5  4  4  5  5  5  2  5  5  4  5  4  1  2  5
## [530]  2  4  5  5  4  5 NA  3 NA  2 NA  4  5  5  4  4  1  5  5  4  2  4  4
## [553]  5 NA  1  4  2  4  4  2  4  4  5  4  2  4  5  5  5  5  2  4  5  1 NA
## [576]  5 NA  4  4 NA  4  4  4  5  5  4  5  4  5  5 NA  5  5  2  4 NA  5  5
## [599]  4  4

Let’s create a frequency table.

table(dat$mindset2)

## 
##   1   2   3   4   5 
##  20  95  20 211 181

table(dat$mindset2,useNA="ifany")

## 
##    1    2    3    4    5 <NA> 
##   20   95   20  211  181   73

Renaming and/or recoding variables

is.factor(dat$mindset1)

## [1] TRUE

dat$ms1<-as.character(dat$mindset1)
dat$ms2<-as.character(dat$mindset1)

is.factor(dat$ms1)

## [1] FALSE

table(dat$ms1)

## 
##            Agree         Disagree          Neutral    Stongly Agree 
##              228              104               23              191 
## Stongly Disagree 
##               25

dat$ms1[which(dat$ms1=="Stongly Disagree")]<-1
dat$ms1[which(dat$ms1=="Disagree")]<-2
dat$ms1[which(dat$ms1=="Neutral")]<-3
dat$ms1[which(dat$ms1=="Agree")]<-4
dat$ms1[which(dat$ms1=="Stongly Agree")]<-5


dat$ms2<-with(dat,ifelse(is.na(ms2),NA,
                         ifelse(ms2=="Stongly Disagree",1,
                                ifelse(ms2=="Disagree",2,
                                       ifelse(ms2=="Neutral",3,
                                              ifelse(ms2=="Agree",4,
                                                     ifelse(ms2=="Stongly Agree",5,ms2)))))))


dat$ms3<-factor(dat$ms2,levels=c(1:5),labels=likertScale,ordered = TRUE)

table(dat$mindset1)

## 
##            Agree         Disagree          Neutral    Stongly Agree 
##              228              104               23              191 
## Stongly Disagree 
##               25

table(dat$ms1)

## 
##   1   2   3   4   5 
##  25 104  23 228 191

table(dat$ms2)

## 
##   1   2   3   4   5 
##  25 104  23 228 191

table(dat$ms3)

## 
## Stongly Disagree         Disagree          Neutral            Agree 
##               25              104               23              228 
##    Stongly Agree 
##              191

Reverse coding variables

If it is a numeric variable, that needs to be reverse coded. Otherwise, we can use the same methods in the previous section

range(as.numeric(dat$ms1),na.rm=TRUE)

## [1] 1 5

dat$ms4<- 6-as.numeric(dat$ms1)

table(dat$ms1,useNA = "ifany")

## 
##    1    2    3    4    5 <NA> 
##   25  104   23  228  191   29

table(dat$ms4,useNA = "ifany")

## 
##    1    2    3    4    5 <NA> 
##  191  228   23  104   25   29

Recoding Age Variable

dat$age1<-dat$age
for(which.row in 1:nrow(dat)){
  ageToCheck<-dat$age[which.row]
  if(is.na(ageToCheck)){
          dat$age1[which.row]<-NA
          } else if (ageToCheck>10 & ageToCheck<=20){
            dat$age1[which.row]<-"gr1"
          } else if (ageToCheck>20 & ageToCheck<=30){
           dat$age1[which.row]<-"gr2" 
          } else if (ageToCheck>30 & ageToCheck<=40){
           dat$age1[which.row]<-"gr3" 
          } else {
            dat$age1[which.row]<-"gr4" 
          } 
}
table(dat$age1)

## 
## gr1 gr2 gr3 gr4 
## 137 184 137 142

dat$age2<-dat$age
range(dat$age)

## [1] 11 50

dat$age2[dat$age>10 & dat$age<=20]<-"gr1"
dat$age2[dat$age>20 & dat$age<=30]<-"gr2"
dat$age2[dat$age>30 & dat$age<=40]<-"gr3"
dat$age2[dat$age>40 & dat$age<=50]<-"gr4"


dat$age3<-dat$age
dat$age3<-with(dat,ifelse(age>10 & age<=20,"gr1",
                    ifelse(age>20 & age<=30,"gr2",
                      ifelse(age>30 & age<=40,"gr3","gr4"))))

table(dat$age1)

## 
## gr1 gr2 gr3 gr4 
## 137 184 137 142

table(dat$age2)

## 
## gr1 gr2 gr3 gr4 
## 137 184 137 142

table(dat$age3)

## 
## gr1 gr2 gr3 gr4 
## 137 184 137 142

Let’s Practice 3

Pick one of the methods above and create three groups for age, name this variable age4
- age 10 to 25 - “group 1”
- age 26 to 39 - “group 2”
- age 40 to 50 - “group 3”
Recode gender to numeric and name this variable gender1
- Male to 1
- Female to 2
- Prefer Not to Answer to 3

Creating Aggregate Variables 1

mutate()

datAgg<-dat %>% 
    group_by(age3) %>%
    mutate(mindset3mean=mean(mindset3,na.rm=TRUE),n=n()) %>%
    ungroup()
kable(head(datAgg))

mastID	age	gender	mindset1	mindset2	mindset3	mindset4	ms1	ms2	ms3	ms4	age1	age2	age3	mindset3mean	n
1001	27	Male	Stongly Agree	4	4	NA	5	5	Stongly Agree	1	gr2	gr2	gr2	4.097143	184
1002	15	Female	Stongly Agree	4	5	NA	5	5	Stongly Agree	1	gr1	gr1	gr1	3.783582	137
1003	32	Female	Stongly Agree	NA	2	NA	5	5	Stongly Agree	1	gr3	gr3	gr3	4.000000	137
1004	22	Female	NA	5	5	NA	NA	NA	NA	NA	gr2	gr2	gr2	4.097143	184
1005	36	Prefer Not to Answer	Stongly Agree	NA	4	NA	5	5	Stongly Agree	1	gr3	gr3	gr3	4.000000	137
1006	15	Male	Stongly Agree	5	5	NA	5	5	Stongly Agree	1	gr1	gr1	gr1	3.783582	137

Creating Aggregate Variables 2

summarize()

datAgg<-dat %>% 
    group_by(age3) %>%
    summarize(mindset3mean=mean(mindset3,na.rm=TRUE),n=n()) %>%
    ungroup()
kable(head(datAgg))

age3	mindset3mean	n
gr1	3.783582	137
gr2	4.097143	184
gr3	4.000000	137
gr4	3.970588	142

Practice 4

Group by age3 and gender
- add the new variables to the datAgg
- calculate the mean (function is “mean()”) for mindset3 for each group
- calculate the standard deviation (function is “sd()”) for mindset3 for each group
Group by gender and age
- Create datAgg with only calculated variables
- calculate the mean (function is “mean()”) for mindset3 for each group
- calculate the standard deviation (function is “sd()”) for mindset3 for each group

The Mind and Identity in Context Lab R Workshop

March 5, 2018