To begin, let’s take another look at our simple data frame example:
kids <- c("Jack","Jill")
ages <- c(12,10)
d <- data.frame(kids,ages,stringsAsFactors=FALSE)
d # matrix-like viewpoint
## [1] "Jack" "Jill"
## [1] "Jack" "Jill"
## [1] "Jack" "Jill"
## 'data.frame': 2 obs. of 2 variables:
## $ kids: chr "Jack" "Jill"
## $ ages: num 12 10
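The three identical outputs above are what one would expect from pulling out the first column in three different ways, and the final block matches the output of str(). The calls that produced them are not shown here, so the following is only a minimal sketch, assuming they were the usual list-like and matrix-like accessors:

```r
kids <- c("Jack", "Jill")
ages <- c(12, 10)
d <- data.frame(kids, ages, stringsAsFactors = FALSE)

d[[1]]   # list-like: first component of the underlying list
d$kids   # list-like: the same component, accessed by name
d[, 1]   # matrix-like: first column
str(d)   # compact summary of the data frame's structure
```

All three accessors return the same character vector, which is why the output repeats.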
| Midterm | Finals | Quiz |
|---|---|---|
| 2.0 | 3.3 | 4.0 |
| 3.3 | 2.0 | 3.7 |
| 4.0 | 4.3 | 4.0 |
| 2.3 | 0.0 | 3.3 |
| 2.3 | 1.0 | 3.3 |
| 3.3 | 3.7 | 4.0 |
Exam.1<-c(2.0,3.3,4.0,2.3,2.3,3.3)
Exam.2<-c(3.3,2.0,4.3,0.0,1.0,3.7)
Quiz<-c(4.0,3.7,4.0,3.3,3.3,4.0)
examsquiz<-data.frame(Exam.1, Exam.2, Quiz)
head(examsquiz)
As mentioned, a data frame can be viewed in row-and-column terms. In particular, we can extract subdata frames by rows or columns. Here’s an example:
## [1] 2.0 4.3 0.0 1.0
## [1] "numeric"
## [1] "data.frame"
Note that in that second call, since examsquiz[2:5,2] is a vector, R created a vector instead of another data frame. By specifying drop=FALSE, as described for the matrix case, we can keep it as a (one-column) data frame.
We can also do filtering. Here’s how to extract the subframe of all students whose first exam score was at least 3.8:
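A minimal, self-contained sketch of the subsetting and filtering steps just described, rebuilding examsquiz from the vectors given earlier. The variable name high is introduced here purely for illustration:

```r
Exam.1 <- c(2.0, 3.3, 4.0, 2.3, 2.3, 3.3)
Exam.2 <- c(3.3, 2.0, 4.3, 0.0, 1.0, 3.7)
Quiz   <- c(4.0, 3.7, 4.0, 3.3, 3.3, 4.0)
examsquiz <- data.frame(Exam.1, Exam.2, Quiz)

examsquiz[2:5, ]                         # rows 2-5, all columns: still a data frame
examsquiz[2:5, 2]                        # a single column is simplified to a vector
class(examsquiz[2:5, 2])                 # "numeric"
class(examsquiz[2:5, 2, drop = FALSE])   # "data.frame"

# Filtering: keep only the rows whose first exam score is at least 3.8
high <- examsquiz[examsquiz$Exam.1 >= 3.8, ]
high
```

The logical vector examsquiz$Exam.1 >= 3.8 selects rows in the same way it would for a matrix; with this data, only the third student qualifies.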
Suppose the second exam score for the first student had been missing. Then we would have typed the following into that line when we were preparing the data file:
## [1] 2 NA 4
In any subsequent statistical analyses, R would do its best to cope with the missing data. However, in some situations, we need to set the option na.rm=TRUE, explicitly telling R to ignore NA values. For instance, with the missing exam score, calling R’s mean() function with na.rm=TRUE to calculate the mean score on exam 2 would skip that first student in finding the mean. Otherwise, R would just report NA for the mean.
Here’s a little example:
## [1] NA
## [1] 3
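The two results above can be reproduced with a short vector containing one NA. This sketch assumes a vector named x holding the values 2, NA and 4:

```r
x <- c(2, NA, 4)
mean(x)                 # NA: the missing value propagates into the result
mean(x, na.rm = TRUE)   # 3: the NA is dropped before averaging
```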
## [1] TRUE FALSE TRUE FALSE
Cases 2 and 4 were incomplete; hence the FALSE values in the output of complete.cases(d4). We then use that output to select the intact rows.
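A minimal sketch of that selection step. The contents of d4 are not shown here, so the frame below is a hypothetical stand-in, built so that rows 2 and 4 each contain a missing value; the names ok and d5 are likewise illustrative:

```r
# Hypothetical stand-in for d4: rows 2 and 4 each have one NA
d4 <- data.frame(
  kids   = c("Jack", NA, "Jillian", "John"),
  states = c("CA", "MA", "MA", NA),
  stringsAsFactors = FALSE
)

ok <- complete.cases(d4)   # TRUE FALSE TRUE FALSE
d5 <- d4[ok, ]             # keep only the rows with no missing values
d5
```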
## [1] "data.frame"
## [1] 4.0 3.7 4.3 3.3 3.3 4.0
In the relational database world, one of the most important operations is that of a join, in which two tables can be combined according to the values of a common variable. In R, two data frames can be similarly combined using the merge() function. The simplest form is as follows:
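As a sketch of that simplest form, merge(x, y) joins two data frames on whatever column names they have in common. The frames d1 and d2 below are hypothetical, made up only to show the behavior:

```r
# Hypothetical frames that share the column "kids"
d1 <- data.frame(kids   = c("Jack", "Jill", "Jillian", "John"),
                 states = c("CA", "MA", "MA", "HI"),
                 stringsAsFactors = FALSE)
d2 <- data.frame(ages = c(10, 7, 12),
                 kids = c("Jill", "Lillian", "Jack"),
                 stringsAsFactors = FALSE)

m <- merge(d1, d2)   # join on the common column name, here "kids"
m
```

Only Jack and Jill appear in both frames, so the result has two rows; by default, rows without a match in the other frame are dropped.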
Keep in mind that data frames are special cases of lists, with the list components consisting of the data frame’s columns. Thus, if you call lapply() on a data frame with a specified function f(), then f() will be called on each of the frame’s columns, with the return values placed in a list.
For instance, with our previous example, we can use lapply as follows:
## $kids
## [1] "Jack" "Jill"
##
## $ages
## [1] 10 12
So, dl is a list consisting of two vectors, the sorted versions of kids and ages. Note that dl is just a list, not a data frame. We could coerce it to a data frame, like this:
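A minimal sketch of both steps, rebuilding d from the earlier example so that it runs on its own. The name dd is introduced here for illustration:

```r
kids <- c("Jack", "Jill")
ages <- c(12, 10)
d <- data.frame(kids, ages, stringsAsFactors = FALSE)

dl <- lapply(d, sort)     # sort() is applied to each column separately
class(dl)                 # "list"

dd <- as.data.frame(dl)   # coerce the list back into a data frame
dd
```

Because each column was sorted independently, the rows of dd no longer pair each child with the correct age (Jack now sits next to 10), so the coerced frame is not meaningful as data.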
abaloneDataURL<- 'https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data'
aba <- read.csv(abaloneDataURL,header=FALSE,as.is=TRUE)
names(aba)[1:9] <- c("Gender","Length","Diameter","Height","WholeWt","ShuckedWt","ViscWt","ShellWt","Rings")
abamf <- aba[aba$Gender != "I",] # exclude infants from the analysis
abamf$Sex_num <- ifelse(abamf$Gender=="M", 1, 0)
lftn <- function(clmn) {
glm(abamf$Sex_num ~ clmn, family=binomial)$coef}
loall <- sapply(abamf[,-1],lftn)
## Length Diameter Height WholeWt ShuckedWt ViscWt
## (Intercept) 1.275832 1.289130 1.027872 0.4300827 0.2855054 0.4829153
## clmn -1.962613 -2.533227 -5.643495 -0.2688070 -0.2941351 -1.4647507
## ShellWt Rings Sex_num
## (Intercept) 0.5103942 0.64823569 -26.56607
## clmn -1.2135496 -0.04509376 53.13213
## [1] "matrix" "array"