R Data Frames, Part 1
Creating Data Frames
To begin, let’s take another look at our simple data frame example from Section 1.4.5:
kids <- c("Jack","Jill")
ages <- c(12,10)
d <- data.frame(kids,ages,stringsAsFactors=FALSE)
d # matrix-like viewpoint
The first two arguments in the call to data.frame() are clear: we wish to produce a data frame from our two vectors, kids and ages. The third argument, stringsAsFactors=FALSE, requires more comment. If it is omitted, R versions before 4.0.0 convert character vectors such as kids to factors; since R 4.0.0 the default is FALSE, and the strings are kept as character vectors.
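To see what stringsAsFactors changes, the following minimal sketch builds the same data frame with both settings and inspects the class of the kids column. The names d_factor and d_char are illustrative and are not part of the original example.

```r
kids <- c("Jack", "Jill")
ages <- c(12, 10)

# Set the argument explicitly both ways so the result does not depend on the R version
d_factor <- data.frame(kids, ages, stringsAsFactors = TRUE)
d_char <- data.frame(kids, ages, stringsAsFactors = FALSE)

class(d_factor$kids)   # "factor"
class(d_char$kids)     # "character"
levels(d_factor$kids)  # "Jack" "Jill"
```

Setting the argument explicitly, as the original call does, keeps the behavior the same across R versions.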
Since d is a list, we can access it as such via component index values or component names.
d[[1]]
## [1] "Jack" "Jill"
d$kids
## [1] "Jack" "Jill"
We can treat it in a matrix-like fashion as well:
d[,1]
## [1] "Jack" "Jill"
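As a quick check that these access styles reach the same column, the sketch below rebuilds d and compares the results. It also shows one difference worth knowing: single brackets with a single index return a one-column data frame rather than a vector.

```r
d <- data.frame(kids = c("Jack", "Jill"), ages = c(12, 10),
                stringsAsFactors = FALSE)

# List-style and matrix-style access return the same character vector
identical(d[[1]], d$kids)       # TRUE
identical(d[, 1], d[["kids"]])  # TRUE

# Single brackets with one index keep the data frame structure
class(d[1])                     # "data.frame"

# Matrix-style indexing by row number and column name
d[2, "ages"]                    # 10
```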
This matrix-like quality is also seen when we take d apart using str().
str(d)
## 'data.frame': 2 obs. of 2 variables:
## $ kids: chr "Jack" "Jill"
## $ ages: num 12 10
R tells us here that d consists of two observations—our two rows—that store data on two variables—our two columns.
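The counts that str() reports can also be retrieved programmatically. The sketch below rebuilds d and uses base R functions to query its dimensions and column types.

```r
d <- data.frame(kids = c("Jack", "Jill"), ages = c(12, 10),
                stringsAsFactors = FALSE)

nrow(d)           # 2 observations (rows)
ncol(d)           # 2 variables (columns)
dim(d)            # 2 2
names(d)          # "kids" "ages"
sapply(d, class)  # kids is "character", ages is "numeric"
```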
Extended Example: Regression Analysis of Exam Grades Continued
getwd()
## [1] "/cloud/project"
examsquiz <- read.csv("ExamsQuiz.csv",sep=",",header=TRUE)
head(examsquiz)
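The file ExamsQuiz.csv is not reproduced here, so the sketch below writes a small stand-in file and reads it back the same way. The scores are made up for illustration, and the column names Exam2 and Quiz are assumptions (only Exam1 appears in the text).

```r
# Hypothetical scores standing in for the real ExamsQuiz.csv
scores <- data.frame(Exam1 = c(2.0, 3.3, 4.0, 2.3),
                     Exam2 = c(3.3, 2.0, 4.0, 0.0),
                     Quiz  = c(4.0, 3.7, 4.0, 3.3))

# Write to a temporary file, then read it back as the text does
path <- tempfile(fileext = ".csv")
write.csv(scores, path, row.names = FALSE)
examsquiz_demo <- read.csv(path, header = TRUE)

head(examsquiz_demo)
str(examsquiz_demo)

unlink(path)  # remove the temporary file
```

Since sep="," and header=TRUE are the defaults of read.csv(), passing them explicitly is optional.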
Other Matrix-Like Operations
We can extract subdata frames by rows or columns.
examsquiz[2:5,]
examsquiz[2:5,2]
## [1] 3.2 2.0 4.0 2.0
class(examsquiz[2:5,2])
## [1] "numeric"
examsquiz[2:5,2,drop=FALSE]
class(examsquiz[2:5,2,drop=FALSE])
## [1] "data.frame"
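The effect of drop=FALSE can be reproduced on a small made-up data frame. In this sketch the values and the Exam2 column name are hypothetical. Selecting a single column normally collapses the result to a vector, while drop=FALSE keeps it as a one-column data frame.

```r
# Hypothetical scores for illustration
scores <- data.frame(Exam1 = c(2.0, 3.3, 4.0, 2.3),
                     Exam2 = c(3.3, 2.0, 4.0, 0.0))

col_vec <- scores[2:3, 2]                # collapses to a numeric vector
col_df  <- scores[2:3, 2, drop = FALSE]  # stays a data frame

is.data.frame(col_vec)  # FALSE
is.data.frame(col_df)   # TRUE
dim(col_df)             # 2 1
names(col_df)           # "Exam2"
```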
We can also filter. Here we extract the subframe of all students whose first exam score was at least 3.8:
examsquiz[examsquiz$Exam1 >= 3.8,]
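Filtering of this kind works by building a logical vector with one entry per row and using it as the row index. The sketch below makes that intermediate step visible on made-up data. The values, the Exam2 column name, and the variable names keep, high, and both are illustrative.

```r
# Hypothetical scores for illustration
scores <- data.frame(Exam1 = c(2.0, 3.8, 4.0, 3.3),
                     Exam2 = c(3.3, 2.0, 4.0, 3.7))

# One TRUE/FALSE per row
keep <- scores$Exam1 >= 3.8
keep                  # FALSE TRUE TRUE FALSE

# Rows where keep is TRUE
high <- scores[keep, ]
nrow(high)            # 2

# Conditions can be combined with & (and) or | (or)
both <- scores[scores$Exam1 >= 3.8 & scores$Exam2 >= 3.8, ]
nrow(both)            # 1
```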
More on Treatment of NA Values
Suppose a score is missing, so that a line of data reads as follows:
2.0 NA 4.0
In any subsequent statistical analyses, R would do its best to cope with the missing data. However, in some situations, we need to set the option na.rm=TRUE, explicitly telling R to ignore NA values.
x <- c(2,NA,4)
mean(x)
## [1] NA
mean(x,na.rm=TRUE)
## [1] 3
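The same pattern applies to other summary functions. The sketch below shows it for sum(), along with is.na(), which can be used to locate or drop the missing entries by hand.

```r
x <- c(2, NA, 4)

sum(x)                # NA: a missing value propagates through the sum
sum(x, na.rm = TRUE)  # 6

is.na(x)              # FALSE TRUE FALSE
x[!is.na(x)]          # 2 4

# Dropping the NA manually gives the same mean as na.rm = TRUE
mean(x[!is.na(x)])    # 3
```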
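The subset() function offers another way to filter. Within subset(), the column names are taken in the context of the given data frame, so we can write Exam1 rather than examsquiz$Exam1. The two calls below select the same students, except that subset() also drops rows for which the condition evaluates to NA.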
examsquiz[examsquiz$Exam1 >= 3.8,]
subset(examsquiz,Exam1 >= 3.8)
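The two forms differ when the filtered column contains NA. The sketch below uses made-up scores with one missing Exam1 value. The data, the Quiz column name, and the names by_bracket and by_subset are illustrative.

```r
# Hypothetical scores; the second student's Exam1 is missing
scores <- data.frame(Exam1 = c(4.0, NA, 3.8, 2.0),
                     Quiz  = c(4.0, 3.3, 3.7, 3.0))

# Bracket filtering: the NA comparison produces a row filled with NA
by_bracket <- scores[scores$Exam1 >= 3.8, ]
nrow(by_bracket)  # 3

# subset(): rows where the condition is NA are dropped
by_subset <- subset(scores, Exam1 >= 3.8)
nrow(by_subset)   # 2
```

With bracket filtering, wrapping the condition in which() is one way to get the same rows as subset().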