R Data Frames, Part 1

Creating Data Frames

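To begin, let's take another look at our simple data frame example from Section 1.4.5:

kids <- c("Jack","Jill")
ages <- c(12,10)
d <- data.frame(kids,ages,stringsAsFactors=FALSE)
d # matrix-like viewpoint

The first two arguments in the call to data.frame() are clear: we wish to produce a data frame from our two vectors, kids and ages. The third argument, stringsAsFactors=FALSE, requires more comment. It controls whether character vectors such as kids are converted to factors; with FALSE they remain character vectors. The default was TRUE before R 4.0.0 and has been FALSE since, so stating it explicitly makes the code behave the same way on either side of that change.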
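
To see what stringsAsFactors actually changes, the following sketch builds the same data frame twice, once with each setting, and compares the class of the kids column. The names d_chr and d_fac are introduced here only for illustration.

```r
kids <- c("Jack", "Jill")
ages <- c(12, 10)

# Same input vectors, two settings of stringsAsFactors.
d_chr <- data.frame(kids, ages, stringsAsFactors = FALSE)
d_fac <- data.frame(kids, ages, stringsAsFactors = TRUE)

class(d_chr$kids)    # "character"
class(d_fac$kids)    # "factor"
levels(d_fac$kids)   # "Jack" "Jill"

# The numeric column is unaffected by the setting.
class(d_fac$ages)    # "numeric"
```

Keeping names as character vectors is usually the safer choice when the column holds free text rather than a fixed set of categories.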

Since d is a list, we can access its components by index value or by name, as in d[[1]] and d$kids. We can also treat it in a matrix-like fashion, as in d[,1].

d[[1]]
## [1] "Jack" "Jill"
d$kids
## [1] "Jack" "Jill"
d[,1]
## [1] "Jack" "Jill"
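
The list nature of a data frame can be checked directly. This short sketch rebuilds d so it runs on its own, then contrasts single-bracket and double-bracket indexing, which return different kinds of objects.

```r
d <- data.frame(kids = c("Jack", "Jill"), ages = c(12, 10),
                stringsAsFactors = FALSE)

is.list(d)      # TRUE: a data frame is a list of equal-length columns
length(d)       # 2: the number of components (columns)
names(d)        # "kids" "ages"

class(d[[1]])   # "character": double brackets extract the column itself
class(d[1])     # "data.frame": single brackets keep a one-column data frame
d[["ages"]]     # 12 10: components can also be pulled out by name
```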

This matrix-like quality is also seen when we take d apart using str().

str(d)
## 'data.frame':    2 obs. of  2 variables:
##  $ kids: chr  "Jack" "Jill"
##  $ ages: num  12 10

R tells us here that d consists of two observations—our two rows—that store data on two variables—our two columns.
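
The same row and column counts that str() reports can be retrieved programmatically with the usual matrix functions. The sketch below rebuilds d so that it is self-contained.

```r
d <- data.frame(kids = c("Jack", "Jill"), ages = c(12, 10),
                stringsAsFactors = FALSE)

dim(d)            # 2 2: rows (observations), then columns (variables)
nrow(d)           # 2
ncol(d)           # 2

# Matrix-style [row, column] indexing picks out single cells.
d[2, 1]           # "Jill"
d[2, "ages"]      # 10

# Unlike a matrix, each column keeps its own type.
sapply(d, class)  # kids: "character", ages: "numeric"
```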

Extended Example: Regression Analysis of Exam Grades Continued

getwd()
## [1] "/cloud/project"
examsquiz <- read.csv("ExamsQuiz.csv",sep=",",header=TRUE)
head(examsquiz)
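
The read.csv() call above depends on ExamsQuiz.csv being present in the working directory. As a stand-alone sketch, the code below writes a small temporary file with made-up scores (not the real ExamsQuiz.csv; only the column name Exam1 comes from the text, the others are assumed) and reads it back. It also shows that sep="," and header=TRUE are the defaults for read.csv().

```r
# Hypothetical scores written to a temporary file for illustration only.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Exam1,Exam2,Quiz",
             "2.0,3.3,4.0",
             "3.3,2.0,3.7",
             "4.0,4.0,4.0"),
           csv_path)

scores <- read.csv(csv_path)   # relies on the defaults
str(scores)                    # three numeric columns, three rows

# Spelling out the defaults gives the same result.
same <- identical(scores, read.csv(csv_path, sep = ",", header = TRUE))
same                           # TRUE

unlink(csv_path)               # remove the temporary file
```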

Other Matrix-Like Operations

We extracted subdata frames by rows or columns.

examsquiz[2:5,]
examsquiz[2:5,2]
## [1] 3.2 2.0 4.0 2.0
class(examsquiz[2:5,2])
## [1] "numeric"
examsquiz[2:5,2,drop=FALSE]
class(examsquiz[2:5,2,drop=FALSE])
## [1] "data.frame"
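
The effect of drop=FALSE is easiest to see side by side. This sketch uses a small stand-in data frame with made-up scores (the name scores and its values are assumptions, not the contents of ExamsQuiz.csv).

```r
# Stand-in data; values are invented for illustration.
scores <- data.frame(Exam1 = c(2.0, 3.3, 4.0, 2.3, 3.8),
                     Exam2 = c(3.3, 2.0, 4.0, 0.0, 1.0),
                     Quiz  = c(4.0, 3.7, 4.0, 3.3, 3.3))

v <- scores[2:4, 2]                 # single column: simplified to a vector
s <- scores[2:4, 2, drop = FALSE]   # same cells, kept as a data frame

class(v)    # "numeric"
class(s)    # "data.frame"
dim(s)      # 3 1
names(s)    # "Exam2": the column name survives only in the data frame
```

Keeping the data frame form matters when the result is passed to code that expects named columns.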

We also did filtering. We extracted the subframe of all students whose first exam score was at least 3.8:

examsquiz[examsquiz$Exam1 >= 3.8,]
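
Filtering works because the comparison produces a logical vector with one entry per row, and that vector is then used as the row index. The sketch below makes the intermediate step visible on a stand-in data frame with made-up scores (scores, keep, and top are illustrative names).

```r
# Stand-in data; values are invented for illustration.
scores <- data.frame(Exam1 = c(2.0, 3.3, 4.0, 2.3, 3.8),
                     Quiz  = c(4.0, 3.7, 4.0, 3.3, 3.3))

keep <- scores$Exam1 >= 3.8   # one TRUE/FALSE per row
keep                          # FALSE FALSE TRUE FALSE TRUE
sum(keep)                     # 2 rows satisfy the condition

top <- scores[keep, ]         # rows where keep is TRUE, all columns
top
```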

More on Treatment of NA Values

Suppose a value is missing and recorded as NA, as in the second of these three scores:

2.0 NA 4.0

In any subsequent statistical analyses, R would do its best to cope with the missing data. However, in some situations, we need to set the option na.rm=TRUE, explicitly telling R to ignore NA values.

x <- c(2,NA,4)
mean(x)
## [1] NA
mean(x,na.rm=TRUE)
## [1] 3
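
Before deciding whether to pass na.rm=TRUE, it can help to locate and count the missing values. The following sketch uses is.na() for that, and shows that dropping the NA entries by hand gives the same mean as na.rm=TRUE.

```r
x <- c(2, NA, 4)

is.na(x)             # FALSE TRUE FALSE
sum(is.na(x))        # 1 missing value
which(is.na(x))      # 2: position of the missing value

# Removing NA values manually matches mean(x, na.rm = TRUE).
mean(x[!is.na(x)])   # 3
sum(x, na.rm = TRUE) # 6: na.rm is accepted by sum() as well
```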

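The subset() function offers an alternative to bracket filtering. In a subset() call, the column names are taken in the context of the given data frame, so we can write Exam1 rather than examsquiz$Exam1. Unlike bracket filtering, subset() also drops rows for which the condition evaluates to NA.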

examsquiz[examsquiz$Exam1 >= 3.8,]
subset(examsquiz,Exam1 >= 3.8)
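
The two forms differ when the filtered column contains NA. The sketch below uses a stand-in data frame with one missing made-up score (all names and values here are assumptions for illustration) to compare bracket filtering, subset(), and bracket filtering wrapped in which().

```r
# Stand-in data with one missing Exam1 score; values are invented.
scores <- data.frame(Exam1 = c(2.0, NA, 4.0, 3.8),
                     Quiz  = c(4.0, 3.7, 4.0, 3.3))

# An NA in the logical index yields a row filled with NA.
by_bracket <- scores[scores$Exam1 >= 3.8, ]
nrow(by_bracket)   # 3

# subset() treats NA in the condition as FALSE and drops that row.
by_subset <- subset(scores, Exam1 >= 3.8)
nrow(by_subset)    # 2

# which() keeps only the positions that are TRUE, so it also drops the NA.
by_which <- scores[which(scores$Exam1 >= 3.8), ]
nrow(by_which)     # 2
```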