R is an object-based language, so you need to create an object before you can use it.
a <- 5
b <- 6
a + b
## [1] 11
The # sign marks a comment in R: R ignores everything after it on the same line.
Libraries (packages) are similar to the applications on your smartphone: you install them once and load them whenever you need them. We will use the following libraries today.
#install.packages("tidyverse")
#install.packages("knitr")
library(tidyverse)
library(knitr)
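If you are not sure whether a package is already installed, you can check before installing. The lines below are a minimal sketch using base R's requireNamespace(); the names pkgs, isInstalled and missingPkgs are only illustrative, and the install line is left commented out, as above.

```r
# Packages used in this tutorial
pkgs <- c("tidyverse", "knitr")

# requireNamespace() returns TRUE if a package can be loaded and FALSE otherwise
isInstalled <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that still need to be installed (may be empty)
missingPkgs <- pkgs[!isInstalled]
missingPkgs

# Uncomment to install only what is missing
# if (length(missingPkgs) > 0) install.packages(missingPkgs)
```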
There are two types of missing data
Let’s create a vector called a and assign some values to it.
a<-c(0, NA, 2, 3, 4,NA,7)
a
## [1] 0 NA 2 3 4 NA 7
Wrapping an assignment in parentheses () makes R print the assigned value as well.
(a<-c(0, NA, 2, 3, 4,NA,7))
## [1] 0 NA 2 3 4 NA 7
is.na() asks whether each value of the object inside the parentheses is missing. It checks every value in that object and returns TRUE (yes) or FALSE (no) for each one.
is.na(a)
## [1] FALSE TRUE FALSE FALSE FALSE TRUE FALSE
What if we want to check non-missing values?
!is.na(a)
## [1] TRUE FALSE TRUE TRUE TRUE FALSE TRUE
How about where the missing cases are located?
which(is.na(a))
## [1] 2 6
How about where the non-missing cases are located?
which(!is.na(a))
## [1] 1 3 4 5 7
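The same logical vector can go inside square brackets to keep only the non-missing values. This is a minimal sketch using the vector a from above; the name aComplete is only for illustration.

```r
a <- c(0, NA, 2, 3, 4, NA, 7)

# Keep only the values that are not missing
aComplete <- a[!is.na(a)]
aComplete
## [1] 0 2 3 4 7

# Many summary functions can skip missing values with na.rm = TRUE
mean(a)                # NA, because the vector contains missing values
mean(a, na.rm = TRUE)  # 3.2, the mean of the five observed values
```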
How many missing cases?
sum(is.na(a))
## [1] 2
R treats TRUE as 1 and FALSE as 0. This is very useful for counting how many missing values each row or column of your data has.
TRUE+TRUE+FALSE
## [1] 2
length is another useful function that tells us the number of values in a vector.
length(a)
## [1] 7
Let’s calculate the percent missing.
(sum(is.na(a))/length(a))*100
## [1] 28.57143
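Because TRUE counts as 1 and FALSE as 0, the mean of is.na(a) is already the proportion of missing values. The sketch below shows this shortcut as an alternative to dividing the sum by the length; pctMissing is an illustrative name.

```r
a <- c(0, NA, 2, 3, 4, NA, 7)

# mean() of a logical vector is the share of TRUE values
pctMissing <- mean(is.na(a)) * 100
round(pctMissing, 2)
## [1] 28.57
```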
Here is some data with missing values: 500 values sampled from 1 to 6, 12 of which are then replaced with NA.
set.seed(123)
b<-sample(x=c(1:6),size=500,replace=TRUE)
b[sample(x=c(1:500),size=12)]<-NA
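The functions from the previous examples work on this longer vector too. The sketch below repeats the two lines above so that it runs on its own, then counts the missing values. Because sample() draws the 12 positions without replacement by default, exactly 12 values are missing.

```r
set.seed(123)
b <- sample(x = c(1:6), size = 500, replace = TRUE)
b[sample(x = c(1:500), size = 12)] <- NA

# Number of missing values
sum(is.na(b))
## [1] 12

# Percent missing
mean(is.na(b)) * 100
## [1] 2.4
```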
We will use two new functions, rowSums and colSums, in the next section. Note the capital S: R is case sensitive.
set.seed(12)
a<-as.data.frame(matrix(sample(1:5,20,replace=TRUE),nrow=5,ncol=4))
kable(a)
| V1 | V2 | V3 | V4 |
|---|---|---|---|
| 1 | 1 | 2 | 3 |
| 5 | 1 | 5 | 3 |
| 5 | 4 | 2 | 3 |
| 2 | 1 | 2 | 4 |
| 1 | 1 | 2 | 1 |
rowSums(a)
## [1] 7 14 14 9 5
colSums(a)
## V1 V2 V3 V4
## 14 8 13 14
is.na(a)
## V1 V2 V3 V4
## [1,] FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE
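The data frame above has no missing values, so every cell is FALSE. To show how rowSums and colSums combine with is.na, here is a small sketch with a made-up data frame d that does contain missing values; d and its values are purely illustrative.

```r
# A small, made-up data frame with some missing values
d <- data.frame(V1 = c(1, NA, 3),
                V2 = c(NA, NA, 6),
                V3 = c(7, 8, 9))

# Number of missing values in each row: 1, 2, 0
missingPerRow <- rowSums(is.na(d))
missingPerRow

# Number of missing values in each column: V1 = 1, V2 = 2, V3 = 0
missingPerCol <- colSums(is.na(d))
missingPerCol
```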
All the examples so far used a vector. Now we can work with a simulated data frame, or simply data. You can think of a data frame as a collection of vectors of the same length.
Let’s create a dataset first.
N<-600
ageRange<- c(11:50)
likertScale<-c("Stongly Disagree","Disagree","Neutral","Agree","Stongly Agree")
set.seed(230)
mastID<-c(1001:(1000+N))
age<-sample(x=ageRange,size=N,replace=TRUE)
gender<-sample(x=c("Female","Male","Prefer Not to Answer",NA),
size=N,replace=TRUE,prob=c(.45,.40,.1,.05))
mindset1<-sample(x=c(likertScale,NA),size=N,replace=TRUE,prob=c(.05,.15,.04,.40,.30,.06))
mindset2<-sample(x=c(1:5,NA),size=N,replace=TRUE,prob=c(.04,.14,.05,.35,.30,.12))
mindset3<-sample(x=c(1:5,NA),size=N,replace=TRUE,prob=c(.04,.11,.08,.35,.39,.03))
datRaw<-data.frame(mastID,age,gender,mindset1,mindset2,mindset3)
datRaw$mindset4<-NA
dat<-datRaw
kable(head(dat))
| mastID | age | gender | mindset1 | mindset2 | mindset3 | mindset4 |
|---|---|---|---|---|---|---|
| 1001 | 27 | Male | Stongly Agree | 4 | 4 | NA |
| 1002 | 15 | Female | Stongly Agree | 4 | 5 | NA |
| 1003 | 32 | Female | Stongly Agree | NA | 2 | NA |
| 1004 | 22 | Female | NA | 5 | 5 | NA |
| 1005 | 36 | Prefer Not to Answer | Stongly Agree | NA | 4 | NA |
| 1006 | 15 | Male | Stongly Agree | 5 | 5 | NA |
Data frames have two dimensions: rows and columns. To check the dimensions of a data frame, we use the dim() function.
dim(dat)
## [1] 600 7
The first value always represents the number of rows and the second value always represents the number of columns. So, in the data frame we just created, we have 600 rows and 7 columns. The nrow() and ncol() functions return each of these values on its own.
nrow(dat)
## [1] 600
ncol(dat)
## [1] 7
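The relationship between dim(), nrow() and ncol() can be checked on any data frame. The sketch below uses a small made-up data frame called toy (an illustrative name, not part of the data above).

```r
# A made-up data frame with 4 rows and 2 columns
toy <- data.frame(id = 1:4, score = c(10, 20, NA, 40))

dim(toy)      # 4 2: rows first, columns second
dim(toy)[1]   # 4, the same as nrow(toy)
dim(toy)[2]   # 2, the same as ncol(toy)

# A data frame is a collection of column vectors, so length() counts columns
length(toy)   # 2
```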
We use brackets [,] to call a case/cell/value from a data frame: the row index goes before the comma and the column index after it. For example, dat[1,] returns the first row of dat and dat[,1] returns the first column.
dat[1,]
## mastID age gender mindset1 mindset2 mindset3 mindset4
## 1 1001 27 Male Stongly Agree 4 4 NA
head(dat[,1])
## [1] 1001 1002 1003 1004 1005 1006
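Rows and columns can be combined in the same brackets, and a logical condition can stand in for the row index. Here is a minimal sketch on a made-up data frame toy; the data frame and its column names are illustrative.

```r
toy <- data.frame(id = 1:4,
                  score = c(10, 20, NA, 40),
                  group = c("a", "b", "a", "b"))

toy[2, 2]                    # a single cell: row 2, column 2 (20)
toy[2, "score"]              # the same cell, using the column name
toy[1:2, c("id", "group")]   # first two rows of two columns

# Logical conditions select rows
groupA <- toy[toy$group == "a", ]        # rows where group is "a"
observed <- toy[!is.na(toy$score), ]     # rows with a non-missing score
```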
To call a column from a data set, we can use the $ sign. For example, to call the mindset2 variable, we can write dat$mindset2.
dat$mindset2
## [1] 4 4 NA 5 NA 5 5 2 4 4 NA 5 4 NA 4 5 5 5 2 5 4 4 4
## [24] 5 4 3 5 4 4 NA 5 2 5 4 1 2 5 3 3 4 5 4 5 4 3 4
## [47] 4 4 NA 4 2 5 4 NA 2 5 2 5 4 NA 2 2 5 NA 5 4 4 5 4
## [70] 5 1 5 2 NA 5 5 5 4 2 4 4 NA 5 5 4 3 4 4 4 NA 5 5
## [93] 5 5 4 NA 4 5 5 5 5 5 4 3 2 4 4 5 5 2 5 4 5 NA 2
## [116] 4 5 4 5 2 4 5 4 4 5 5 4 5 4 4 4 2 4 4 2 4 2 5
## [139] 5 NA 4 2 2 2 2 4 5 5 4 NA 4 2 2 4 2 5 2 4 3 4 5
## [162] 5 5 4 3 NA 5 NA 4 2 4 4 5 2 3 5 5 4 5 4 4 NA 2 1
## [185] 4 NA NA 5 2 5 4 2 5 NA 4 5 4 2 5 4 2 4 1 1 5 4 2
## [208] 4 5 4 4 2 4 NA 4 NA NA NA 2 5 2 4 NA 5 NA 5 5 4 4 4
## [231] NA 4 5 4 5 5 4 5 NA 2 5 4 5 4 5 5 2 4 5 4 4 5 4
## [254] 4 NA 4 5 5 5 5 1 5 4 4 2 5 3 2 NA NA 4 4 5 2 3 2
## [277] 4 2 5 4 4 NA 5 NA 4 1 4 4 5 2 NA 5 5 2 4 4 1 NA 5
## [300] 5 NA 5 NA 2 4 2 5 4 2 5 2 2 NA 2 4 5 2 2 4 5 5 5
## [323] 5 5 5 4 NA 4 5 3 4 5 4 4 4 2 5 4 5 5 5 4 NA 4 5
## [346] 4 4 4 NA 2 4 2 2 NA 2 4 2 5 3 4 5 5 5 NA 5 3 5 4
## [369] 4 4 4 4 4 4 2 4 4 NA 1 NA 2 2 4 NA NA 5 2 NA 5 5 NA
## [392] 2 5 2 4 4 2 4 2 5 5 2 5 3 5 1 4 2 5 NA 2 5 4 5
## [415] 4 4 1 5 5 4 4 4 5 4 4 4 NA 1 4 4 5 2 2 4 4 4 3
## [438] 5 4 4 4 2 4 2 4 NA 2 4 5 5 2 4 4 2 5 3 NA 4 2 5
## [461] 2 1 5 5 2 4 NA NA 4 4 4 4 2 2 4 4 4 5 4 5 3 5 4
## [484] 5 5 4 NA NA NA 4 4 1 4 2 4 4 5 4 NA 1 4 NA 1 NA 3 4
## [507] 5 NA 5 5 2 4 4 4 5 4 4 5 5 5 2 5 5 4 5 4 1 2 5
## [530] 2 4 5 5 4 5 NA 3 NA 2 NA 4 5 5 4 4 1 5 5 4 2 4 4
## [553] 5 NA 1 4 2 4 4 2 4 4 5 4 2 4 5 5 5 5 2 4 5 1 NA
## [576] 5 NA 4 4 NA 4 4 4 5 5 4 5 4 5 5 NA 5 5 2 4 NA 5 5
## [599] 4 4
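Let’s create a frequency table. By default, table() leaves out missing values; adding useNA="ifany" counts them in a separate <NA> column.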
table(dat$mindset2)
##
## 1 2 3 4 5
## 20 95 20 211 181
table(dat$mindset2,useNA="ifany")
##
## 1 2 3 4 5 <NA>
## 20 95 20 211 181 73
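Counts can be turned into percentages with prop.table(). This is a minimal sketch on a short made-up vector x, not on the simulated data.

```r
# A short, made-up vector of responses with two missing values
x <- c(4, 4, NA, 5, 2, NA, 5, 5)

# Frequency table that keeps the missing values
counts <- table(x, useNA = "ifany")
counts   # 2 occurs once, 4 twice, 5 three times, <NA> twice

# Percent of all responses in each category, including <NA>
pct <- round(prop.table(counts) * 100, 1)
pct      # 12.5, 25.0, 37.5, 25.0
```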
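The mindset1 column is stored as a factor here, so we convert it to character before recoding. (In R 4.0.0 and later, data.frame() keeps strings as character vectors by default, so is.factor() returns FALSE unless the data frame was created with stringsAsFactors=TRUE.)
is.factor(dat$mindset1)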
## [1] TRUE
dat$ms1<-as.character(dat$mindset1)
dat$ms2<-as.character(dat$mindset1)
is.factor(dat$ms1)
## [1] FALSE
table(dat$ms1)
##
## Agree Disagree Neutral Stongly Agree
## 228 104 23 191
## Stongly Disagree
## 25
# Method 1: replace each label in place; ms1 stays a character vector
dat$ms1[which(dat$ms1=="Stongly Disagree")]<-1
dat$ms1[which(dat$ms1=="Disagree")]<-2
dat$ms1[which(dat$ms1=="Neutral")]<-3
dat$ms1[which(dat$ms1=="Agree")]<-4
dat$ms1[which(dat$ms1=="Stongly Agree")]<-5
# Method 2: nested ifelse() calls; missing values are kept as NA
dat$ms2<-with(dat,ifelse(is.na(ms2),NA,
ifelse(ms2=="Stongly Disagree",1,
ifelse(ms2=="Disagree",2,
ifelse(ms2=="Neutral",3,
ifelse(ms2=="Agree",4,
ifelse(ms2=="Stongly Agree",5,ms2)))))))
# Turn the recoded values into an ordered factor that carries the original labels
dat$ms3<-factor(dat$ms2,levels=c(1:5),labels=likertScale,ordered = TRUE)
table(dat$mindset1)
##
## Agree Disagree Neutral Stongly Agree
## 228 104 23 191
## Stongly Disagree
## 25
table(dat$ms1)
##
## 1 2 3 4 5
## 25 104 23 228 191
table(dat$ms2)
##
## 1 2 3 4 5
## 25 104 23 228 191
table(dat$ms3)
##
## Stongly Disagree Disagree Neutral Agree
## 25 104 23 228
## Stongly Agree
## 191
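A shorter alternative to the recoding methods above is base R's match(), which returns the position of each response in likertScale. This is a sketch of a different technique, not the method used above. The labels are copied from the data as they are, and responses and recoded are illustrative names.

```r
likertScale <- c("Stongly Disagree", "Disagree", "Neutral", "Agree", "Stongly Agree")

# A few made-up responses, including a missing one
responses <- c("Agree", NA, "Stongly Disagree", "Neutral", "Stongly Agree")

# match() gives the position of each response in likertScale (1 to 5);
# missing responses stay NA
recoded <- match(responses, likertScale)
recoded   # 4 NA 1 3 5
```

Because the result is already numeric, no as.numeric() conversion is needed afterwards.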
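If the variable that needs to be reverse coded is numeric, we can subtract each value from the scale maximum plus the scale minimum (here 5 + 1 = 6). Otherwise, we can use the same recoding methods as in the previous section.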
range(as.numeric(dat$ms1),na.rm=TRUE)
## [1] 1 5
dat$ms4<- 6-as.numeric(dat$ms1)
table(dat$ms1,useNA = "ifany")
##
## 1 2 3 4 5 <NA>
## 25 104 23 228 191 29
table(dat$ms4,useNA = "ifany")
##
## 1 2 3 4 5 <NA>
## 191 228 23 104 25 29
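The same idea works for any scale: subtract each value from the sum of the scale's minimum and maximum. A minimal sketch on a made-up vector follows; scaleMin, scaleMax and xReversed are illustrative names.

```r
# Made-up responses on a 1 to 5 scale, with one missing value
x <- c(1, 2, NA, 4, 5)

# Use the theoretical end points of the scale, not the observed range
scaleMin <- 1
scaleMax <- 5

# Reverse code: 1 becomes 5, 2 becomes 4, and so on; NA stays NA
xReversed <- (scaleMax + scaleMin) - x
xReversed   # 5 4 NA 2 1
```

Using the theoretical end points avoids a wrong result when nobody in the sample chose the lowest or highest category.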
dat$age1<-dat$age
# Method 1: loop over the rows and assign an age group one row at a time
for(which.row in 1:nrow(dat)){
  ageToCheck<-dat$age[which.row]
  if(is.na(ageToCheck)){
    dat$age1[which.row]<-NA
  } else if (ageToCheck>10 && ageToCheck<=20){
    dat$age1[which.row]<-"gr1"
  } else if (ageToCheck>20 && ageToCheck<=30){
    dat$age1[which.row]<-"gr2"
  } else if (ageToCheck>30 && ageToCheck<=40){
    dat$age1[which.row]<-"gr3"
  } else {
    dat$age1[which.row]<-"gr4"
  }
}
table(dat$age1)
##
## gr1 gr2 gr3 gr4
## 137 184 137 142
# Method 2: logical indexing, one age band at a time
dat$age2<-dat$age
range(dat$age)
## [1] 11 50
dat$age2[dat$age>10 & dat$age<=20]<-"gr1"
dat$age2[dat$age>20 & dat$age<=30]<-"gr2"
dat$age2[dat$age>30 & dat$age<=40]<-"gr3"
dat$age2[dat$age>40 & dat$age<=50]<-"gr4"
# Method 3: nested ifelse() calls
dat$age3<-dat$age
dat$age3<-with(dat,ifelse(age>10 & age<=20,"gr1",
ifelse(age>20 & age<=30,"gr2",
ifelse(age>30 & age<=40,"gr3","gr4"))))
table(dat$age1)
##
## gr1 gr2 gr3 gr4
## 137 184 137 142
table(dat$age2)
##
## gr1 gr2 gr3 gr4
## 137 184 137 142
table(dat$age3)
##
## gr1 gr2 gr3 gr4
## 137 184 137 142
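All three methods give the same groups. Base R also has cut(), which bins a numeric variable in one call. The sketch below shows that alternative on a few made-up ages; it is not one of the methods used above. By default cut() builds intervals that are open on the left and closed on the right, which matches the >10 and <=20 style conditions.

```r
# Made-up ages at the edges of each band, plus one missing value
age <- c(11, 20, 21, 30, 31, 40, 41, 50, NA)

# Intervals are (10,20], (20,30], (30,40], (40,50]
ageGroup <- cut(age,
                breaks = c(10, 20, 30, 40, 50),
                labels = c("gr1", "gr2", "gr3", "gr4"))

as.character(ageGroup)
# "gr1" "gr1" "gr2" "gr2" "gr3" "gr3" "gr4" "gr4" NA
```

Note that cut() returns a factor, whereas the methods above produce character vectors.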
# mutate() keeps every row and adds the group mean and group size as new columns
datAgg<-dat %>%
group_by(age3) %>%
mutate(mindset3mean=mean(mindset3,na.rm=TRUE),n=n()) %>%
ungroup()
kable(head(datAgg))
| mastID | age | gender | mindset1 | mindset2 | mindset3 | mindset4 | ms1 | ms2 | ms3 | ms4 | age1 | age2 | age3 | mindset3mean | n |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1001 | 27 | Male | Stongly Agree | 4 | 4 | NA | 5 | 5 | Stongly Agree | 1 | gr2 | gr2 | gr2 | 4.097143 | 184 |
| 1002 | 15 | Female | Stongly Agree | 4 | 5 | NA | 5 | 5 | Stongly Agree | 1 | gr1 | gr1 | gr1 | 3.783582 | 137 |
| 1003 | 32 | Female | Stongly Agree | NA | 2 | NA | 5 | 5 | Stongly Agree | 1 | gr3 | gr3 | gr3 | 4.000000 | 137 |
| 1004 | 22 | Female | NA | 5 | 5 | NA | NA | NA | NA | NA | gr2 | gr2 | gr2 | 4.097143 | 184 |
| 1005 | 36 | Prefer Not to Answer | Stongly Agree | NA | 4 | NA | 5 | 5 | Stongly Agree | 1 | gr3 | gr3 | gr3 | 4.000000 | 137 |
| 1006 | 15 | Male | Stongly Agree | 5 | 5 | NA | 5 | 5 | Stongly Agree | 1 | gr1 | gr1 | gr1 | 3.783582 | 137 |
# summarize() collapses the data to one row per group
datAgg<-dat %>%
group_by(age3) %>%
summarize(mindset3mean=mean(mindset3,na.rm=TRUE),n=n()) %>%
ungroup()
kable(head(datAgg))
| age3 | mindset3mean | n |
|---|---|---|
| gr1 | 3.783582 | 137 |
| gr2 | 4.097143 | 184 |
| gr3 | 4.000000 | 137 |
| gr4 | 3.970588 | 142 |
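The difference between mutate() and summarize() can be cross-checked with base R: ave() returns one value per row, while tapply() returns one value per group. This is a sketch of base R equivalents on a made-up data frame toy, not the dplyr code used above.

```r
# Made-up data: two groups, one missing score
toy <- data.frame(grp = c("gr1", "gr1", "gr2", "gr2", "gr2"),
                  score = c(4, 2, 5, NA, 3))

# Like mutate(): the group mean is repeated on every row of the group
toy$grpMean <- ave(toy$score, toy$grp,
                   FUN = function(v) mean(v, na.rm = TRUE))
toy$grpMean   # 3 3 4 4 4

# Like summarize(): one mean per group
grpMeans <- tapply(toy$score, toy$grp, mean, na.rm = TRUE)
grpMeans      # gr1 = 3, gr2 = 4

# Group sizes, like n()
grpSizes <- table(toy$grp)
grpSizes      # gr1 = 2, gr2 = 3
```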