Introduction

  • On an intuitive level, a data frame is like a matrix, with a two-dimensional rows-and-columns structure. However, it differs from a matrix in that each column may have a different mode.
  • For instance, one column may consist of numbers, and another column might have character strings. In this sense, just as lists are the heterogeneous analogs of vectors in one dimension, data frames are the heterogeneous analogs of matrices for two-dimensional data.
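As a small illustration of this heterogeneity (the data frame below and its column names are made up for the sketch), we can check that a data frame is stored as a list of columns, that each column keeps its own mode, and that the object still has matrix-like dimensions:

```r
# Hypothetical two-column data frame, used only for illustration
df <- data.frame(name = c("a", "b"), score = c(1.5, 2.5),
                 stringsAsFactors = FALSE)
is.list(df)                    # a data frame is a list of columns
col_modes <- sapply(df, mode)  # mode of each column
col_modes
dim(df)                        # rows and columns, as for a matrix
```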

Creating Data Frames

To begin, let’s take another look at our simple data frame example:

kids <- c("Jack","Jill")
ages <- c(12,10)
d <- data.frame(kids,ages,stringsAsFactors=FALSE)
d # matrix-like viewpoint
  • The first two arguments in the call to data.frame() are clear: We wish to produce a data frame from our two vectors, kids and ages. However, that third argument, stringsAsFactors=FALSE, requires more comment.
  • Whether character vectors are converted to factors by default depends on the R version. Before R 4.0.0, stringsAsFactors defaulted to TRUE, so creating a data frame from a character vector (in this case, kids) converted that vector to a factor. Since R 4.0.0, the default is FALSE and character vectors are kept as they are. (In older versions, options() could be used to arrange the opposite default.) Because our work with character data will typically be with vectors rather than factors, we set stringsAsFactors to FALSE explicitly, which gives the same result in either case.
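To see what the argument changes, a minimal sketch builds the same data frame both ways, setting stringsAsFactors explicitly so that the result does not depend on the R version (the names d_chr and d_fac are introduced here for illustration):

```r
kids <- c("Jack", "Jill")
ages <- c(12, 10)
d_chr <- data.frame(kids, ages, stringsAsFactors = FALSE)
d_fac <- data.frame(kids, ages, stringsAsFactors = TRUE)
class(d_chr$kids)   # the character vector is kept as is
class(d_fac$kids)   # the character vector was converted to a factor
levels(d_fac$kids)  # the distinct values stored by the factor
```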

Accessing Data Frames

  • Now that we have a data frame, let’s explore a bit. Since d is a list, we can access it as such via component index values or component names:
d[[1]]
## [1] "Jack" "Jill"
d$kids
## [1] "Jack" "Jill"
  • But we can treat it in a matrix-like fashion as well. For example, we can view column 1:
d[,1]
## [1] "Jack" "Jill"
  • This matrix-like quality is also seen when we take d apart using str():
str(d)
## 'data.frame':    2 obs. of  2 variables:
##  $ kids: chr  "Jack" "Jill"
##  $ ages: num  12 10
  • R tells us here that d consists of two observations—our two rows—that store data on two variables—our two columns.
  • Consider three ways to access the first column of our data frame above: d[[1]], d[,1], and d$kids. Of these, the third would generally be considered clearer and, more importantly, safer than the first two, because it identifies the column by name and makes it less likely that you will reference the wrong column. But in writing general code (say, writing R packages), matrix-like notation d[,1] is needed, and it is especially handy if you are extracting subdata frames (as you’ll see when we talk about extracting subdata frames).
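In general code the column is often known only through a variable. The sketch below (col_name is a hypothetical variable) suggests how list-style double brackets and matrix-style indexing both accept such a variable and return a vector, while single brackets return a one-column data frame:

```r
d <- data.frame(kids = c("Jack", "Jill"), ages = c(12, 10),
                stringsAsFactors = FALSE)
col_name <- "ages"     # hypothetical: column chosen at run time
v1 <- d[[col_name]]    # list-style access, returns a vector
v2 <- d[, col_name]    # matrix-style access, also a vector
sub <- d[col_name]     # single brackets keep a data frame
identical(v1, v2)
class(sub)
```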

Example: Regression Analysis of Exam Grades Continued

  • Recall our course examination data set from the previous lesson. The Midterm and Finals columns are stored below as Exam.1 and Exam.2.
Midterm Finals Quiz
2.0 3.3 4.0
3.3 2.0 3.7
4.0 4.3 4.0
2.3 0.0 3.3
2.3 1.0 3.3
3.3 3.7 4.0
Exam.1<-c(2.0,3.3,4.0,2.3,2.3,3.3)
Exam.2<-c(3.3,2.0,4.3,0.0,1.0,3.7)
Quiz<-c(4.0,3.7,4.0,3.3,3.3,4.0)
examsquiz<-data.frame(Exam.1, Exam.2, Quiz)
head(examsquiz)

Other Matrix-Like Operations

  • Various matrix operations also apply to data frames. Most notably and usefully, we can do filtering to extract various subdata frames of interest.

Extracting Subdata Frames

As mentioned, a data frame can be viewed in row-and-column terms. In particular, we can extract subdata frames by rows or columns. Here’s an example:

examsquiz[2:5,]
examsquiz[2:5,2]
## [1] 2.0 4.3 0.0 1.0
class(examsquiz[2:5,2])
## [1] "numeric"
examsquiz[2:5,2,drop=FALSE]
class(examsquiz[2:5,2,drop=FALSE])
## [1] "data.frame"

Note that in that second call, since examsquiz[2:5,2] is a vector, R created a vector instead of another data frame. By specifying drop=FALSE, as described for the matrix case, we can keep it as a (one-column) data frame. We can also do filtering. Here’s how to extract the subframe of all students whose first exam score was at least 3.8:

examsquiz[examsquiz$Exam.1 >= 3.8,]
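Conditions can be combined with & and |, and a column selection can be given at the same time. As a sketch (the thresholds are arbitrary, and keep and high are names introduced here), this keeps only the Exam.1 and Quiz columns for students scoring at least 3.3 on the first exam and 4.0 on the quiz:

```r
Exam.1 <- c(2.0, 3.3, 4.0, 2.3, 2.3, 3.3)
Exam.2 <- c(3.3, 2.0, 4.3, 0.0, 1.0, 3.7)
Quiz   <- c(4.0, 3.7, 4.0, 3.3, 3.3, 4.0)
examsquiz <- data.frame(Exam.1, Exam.2, Quiz)
# rows: both conditions must hold; columns: selected by name
keep <- examsquiz$Exam.1 >= 3.3 & examsquiz$Quiz >= 4.0
high <- examsquiz[keep, c("Exam.1", "Quiz")]
high
```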

More on Treatment of NA Values

Suppose the second exam score for the first student had been missing. Then we would have typed the following into that line when we were preparing the data file:

c(2.0,NA,4.0)
## [1]  2 NA  4

In any subsequent statistical analyses, R would do its best to cope with the missing data. However, in some situations, we need to set the option na.rm=TRUE, explicitly telling R to ignore NA values. For instance, with the missing exam score, calculating the mean score on exam 2 by calling R’s mean() function with na.rm=TRUE would skip that first student in finding the mean. Otherwise, R would just report NA for the mean.

Here’s a little example:

x <- c(2,NA,4)
mean(x)
## [1] NA
mean(x,na.rm=TRUE)
## [1] 3
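The same argument is available in the column-wise summaries commonly used with data frames. A minimal sketch with a made-up three-student data frame (scores and the result names are hypothetical):

```r
scores <- data.frame(Exam.1 = c(2.0, 3.3, 4.0),
                     Exam.2 = c(NA, 2.0, 4.3))
m_all  <- colMeans(scores)                # Exam.2 mean is NA
m_narm <- colMeans(scores, na.rm = TRUE)  # the NA in Exam.2 is skipped
n_missing <- colSums(is.na(scores))       # missing values per column
m_all
m_narm
n_missing
```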
  • You were introduced to the subset() function, which saves you the trouble of handling NA values yourself: rows for which the condition evaluates to NA are dropped. You can apply it to data frames for row selection. The column names are taken in the context of the given data frame. In our example, instead of typing this:
examsquiz[examsquiz$Exam.1 >= 3.8,]
  • we could run this:
subset(examsquiz,Exam.1 >= 3.8)
  • Note that we do not need to write this:
subset(examsquiz,examsquiz$Exam.1 >= 3.8)
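The handling of NA is where subset() and bracket indexing differ. In this sketch (with a made-up scores data frame), a missing Exam.1 value makes the condition NA: bracket indexing returns an extra row filled with NA, while subset() drops it:

```r
scores <- data.frame(Exam.1 = c(2.0, NA, 4.0),
                     Quiz   = c(4.0, 3.7, 4.0))
by_index  <- scores[scores$Exam.1 >= 3.8, ]  # NA condition gives an all-NA row
by_subset <- subset(scores, Exam.1 >= 3.8)   # NA condition is treated as FALSE
by_index
by_subset
```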
  • In some cases, we may wish to rid our data frame of any observation that has at least one NA value. A handy function for this purpose is complete.cases().
d4<- data.frame( 
  kids= c("Jack", NA, "Jillian", "John"),
  states= c("CA", "MA", "MA", NA))
d4
complete.cases(d4)
## [1]  TRUE FALSE  TRUE FALSE
d5 <- d4[complete.cases(d4),]
d5

Cases 2 and 4 were incomplete; hence the FALSE values in the output of complete.cases(d4). We then use that output to select the intact rows.
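A related base R function, na.omit(), performs the same row removal in one step. The sketch below checks that it keeps the same rows as the complete.cases() approach (d6 is a name introduced here):

```r
d4 <- data.frame(
  kids = c("Jack", NA, "Jillian", "John"),
  states = c("CA", "MA", "MA", NA),
  stringsAsFactors = FALSE)
d5 <- d4[complete.cases(d4), ]
d6 <- na.omit(d4)   # drops every row that contains an NA
d6
identical(d5$kids, d6$kids)
```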

Using the rbind() and cbind() Functions and Alternatives

  • The rbind() and cbind() matrix functions introduced in Section 3.4 work with data frames, too, provided that you have compatible sizes, of course. For instance, you can use cbind() to add a new column that has the same length as the existing columns. In using rbind() to add a row, the added row is typically in the form of another data frame or list.
d
rbind(d,list("Laura",19))
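As a sketch of the other forms mentioned above, the added row can itself be a one-row data frame with matching column names, and cbind() can attach a new column of matching length (new_row, d_more, d_wide, and the grade values are made up for illustration):

```r
d <- data.frame(kids = c("Jack", "Jill"), ages = c(12, 10),
                stringsAsFactors = FALSE)
# a one-row data frame whose column names match those of d
new_row <- data.frame(kids = "Laura", ages = 19,
                      stringsAsFactors = FALSE)
d_more <- rbind(d, new_row)
# a named column with one value per row
d_wide <- cbind(d_more, grade = c(7, 5, 12))
d_wide
```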
  • You can also create new columns from old ones. For instance, we can add a variable that is the difference between exams 1 and 2:
eq <- cbind(examsquiz,examsquiz$Exam.2-examsquiz$Exam.1)
class(eq)
## [1] "data.frame"
head(eq)
  • The new name is rather unwieldy: It’s long, and it has embedded blanks. We could change it, using the names() function, but it would be better to exploit the list basis of data frames and add a column (of the same length) to the data frame for this result:
examsquiz$ExamDiff <- examsquiz$Exam.2 - examsquiz$Exam.1
head(examsquiz)
  • What happened here? Since one can add a new component to an already existing list at any time, we did so: We added a component ExamDiff to the list/data frame examsquiz.
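For comparison, here is a sketch of the two cbind()-based alternatives: renaming the unwieldy column afterward with names(), or naming it in the call itself (eq2 is a name introduced here):

```r
Exam.1 <- c(2.0, 3.3, 4.0, 2.3, 2.3, 3.3)
Exam.2 <- c(3.3, 2.0, 4.3, 0.0, 1.0, 3.7)
Quiz   <- c(4.0, 3.7, 4.0, 3.3, 3.3, 4.0)
examsquiz <- data.frame(Exam.1, Exam.2, Quiz)
# option 1: rename the fourth column after the fact
eq <- cbind(examsquiz, examsquiz$Exam.2 - examsquiz$Exam.1)
names(eq)[4] <- "ExamDiff"
# option 2: supply the name in the cbind() call
eq2 <- cbind(examsquiz, ExamDiff = examsquiz$Exam.2 - examsquiz$Exam.1)
names(eq)
names(eq2)
```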
  • We can even exploit recycling to add a column that is of a different length than those in the data frame:
d
d$one <- 1
d
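Recycling has a limit, though: the number of rows must be a multiple of the length of the new column. The sketch below uses a made-up four-row data frame; a length-2 vector is recycled, while a length-3 vector raises an error, which we catch with tryCatch():

```r
d <- data.frame(kids = c("Jack", "Jill", "Laura", "Lee"),
                ages = c(12, 10, 19, 8),
                stringsAsFactors = FALSE)
d$group <- c("a", "b")   # length 2 divides 4 rows, so it is recycled
# length 3 does not divide 4 rows, so the assignment fails
res <- tryCatch({ d$bad <- 1:3; "ok" },
                error = function(e) "error")
d
res
```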

Applying apply()

  • You can use apply() on data frames, if the columns are all of the same type. For instance, we can find the maximum grade for each student, as follows:
apply(examsquiz,1,max)
## [1] 4.0 3.7 4.3 3.3 3.3 4.0
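The same-type condition matters because apply() first converts the data frame to a matrix, and a matrix can hold only one mode. A minimal sketch with the small kids/ages data frame shows the coercion and one way around it, restricting apply() to the numeric columns (num_cols and row_max are names introduced here):

```r
d <- data.frame(kids = c("Jack", "Jill"), ages = c(12, 10),
                stringsAsFactors = FALSE)
m <- as.matrix(d)
mode(m)   # everything, including ages, has become character
# keep only the numeric columns before calling apply()
num_cols <- sapply(d, is.numeric)
row_max <- apply(d[, num_cols, drop = FALSE], 1, max)
row_max
```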

Example: A Salary Study

Merging Data Frames

In the relational database world, one of the most important operations is that of a join, in which two tables can be combined according to the values of a common variable. In R, two data frames can be similarly combined using the merge() function. The simplest form is as follows:

merge(x,y)
  • This merges data frames x and y. It assumes that the two data frames have one or more columns with names in common. Here’s an example:
d1 <- data.frame(
  kids=c("Jack", "Jill", "Jillian", "John"),
  states=c("CA", "MA","MA", "HI")
)
d1
d2 <- data.frame(
  ages=c(10,7,12),
  kids=c("Jill", "Lillian", "Jack")
)
d2
d <- merge(d1,d2)
d
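  • Both data frames have a column named kids, so merge() matched rows on it and kept only the children who appear in both, Jack and Jill. If the variable were called kids in one data frame and pals in the other, but meant to store the same information, the merge would still make sense: the by.x and by.y arguments of merge() let us name the matching columns explicitly.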
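As a minimal sketch of merging on differently named columns (d3 and its pals column are made up for illustration), by.x and by.y name the key in each data frame, and all=TRUE additionally keeps the rows that have no match:

```r
d1 <- data.frame(kids = c("Jack", "Jill", "Jillian", "John"),
                 states = c("CA", "MA", "MA", "HI"),
                 stringsAsFactors = FALSE)
d3 <- data.frame(ages = c(12, 10, 7),
                 pals = c("Jack", "Jill", "Lillian"),
                 stringsAsFactors = FALSE)
# match d1$kids against d3$pals; only common values are kept
inner <- merge(d1, d3, by.x = "kids", by.y = "pals")
# all = TRUE keeps unmatched rows too, filling the gaps with NA
outer <- merge(d1, d3, by.x = "kids", by.y = "pals", all = TRUE)
inner
outer
```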
  • Duplicate matches will appear in full in the result, possibly in undesirable ways.
d1
d2a <- rbind(d2,list(15,"Jill"))
d2a
merge(d1,d2a)
  • There are two Jills in d2a. There is a Jill in d1 who lives in Massachusetts and another Jill with unknown residence. In our previous example, merge(d1,d2), there was only one Jill, who was presumed to be the same person in both data frames. But here, in the call merge(d1,d2a), it may have been the case that only one of the Jills was a Massachusetts resident. It is clear from this little example that you must choose matching variables with great care.
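One simple precaution, sketched here under the assumption that kids is meant to be a unique key, is to look for repeated key values before merging. The d2a data frame is rebuilt directly so that the example is self-contained:

```r
d2a <- data.frame(ages = c(10, 7, 12, 15),
                  kids = c("Jill", "Lillian", "Jack", "Jill"),
                  stringsAsFactors = FALSE)
# index of the first repeated key, or 0 if all keys are unique
first_dup <- anyDuplicated(d2a$kids)
# every row whose key occurs more than once
dup_keys <- d2a$kids[duplicated(d2a$kids)]
dups <- d2a[d2a$kids %in% dup_keys, ]
first_dup
dups
```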

Applying Functions to Data Frames

  • As with lists, you can use the lapply() and sapply() functions with data frames.

Using lapply() and sapply() on Data Frames

  • Keep in mind that data frames are special cases of lists, with the list components consisting of the data frame’s columns. Thus, if you call lapply() on a data frame with a specified function f(), then f() will be called on each of the frame’s columns, with the return values placed in a list.

  • For instance, with our previous example, we can use lapply as follows:

d<- data.frame(
  kids=c("Jack", "Jill"),
  ages=c(12,10)
)
d
dl <- lapply(d,sort)
dl
## $kids
## [1] "Jack" "Jill"
## 
## $ages
## [1] 10 12

So, dl is a list consisting of two vectors, the sorted versions of kids and ages. Note that dl is just a list, not a data frame. We could coerce it to a data frame, like this:

as.data.frame(dl)
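Note that sorting each column separately breaks the link between a child and his or her age: in as.data.frame(dl), Jack is paired with 10 rather than 12. As a sketch of two related operations, sapply() can summarize each column (here, by its class), and order() sorts whole rows so that the pairing is preserved (d_sorted is a name introduced here):

```r
d <- data.frame(kids = c("Jack", "Jill"), ages = c(12, 10),
                stringsAsFactors = FALSE)
col_classes <- sapply(d, class)   # one value per column, as a named vector
d_sorted <- d[order(d$ages), ]    # reorder rows by age, keeping rows intact
col_classes
d_sorted
```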

Example: Applying Logistic Regression Models

  • Let’s run a logistic regression model on the abalone data, predicting gender from the other eight variables: height, weight, rings, and so on, one at a time.
  • The logistic model is used to predict a 0- or 1-valued random variable Y from one or more explanatory variables. The function value is the probability that Y = 1, given the explanatory variables. Let’s say we have just one of the latter, X. Then the model is as follows: \[ \Pr(Y=1 \mid X=t) = \frac{1}{1+\exp[-(\beta_0+\beta_1 t)]} \]
  • As with linear regression models, the βi values are estimated from the data, using the function glm() with the argument family=binomial.
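As a small sketch, the single-predictor logistic function with intercept b0 and slope b1, evaluated at X = t, can be written directly in R. The function name logistic() and the coefficient values are made up for illustration; plogis() is the built-in logistic distribution function from the stats package, used here only as a cross-check:

```r
# probability that Y = 1 when X = t, for given coefficients
logistic <- function(t, b0, b1) {
  1 / (1 + exp(-(b0 + b1 * t)))
}
p_mid <- logistic(0, b0 = 0, b1 = 1)      # linear predictor is 0
p_chk <- logistic(2, b0 = -1, b1 = 0.5)   # linear predictor is also 0
p_mid
all.equal(logistic(3, -1, 0.5), plogis(-1 + 0.5 * 3))
```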
  • We can use sapply() to fit eight single-predictor models—one for each of the eight variables other than gender in this data set—all in just one line of code.
abaloneDataURL<- 'https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data'
aba <- read.csv(abaloneDataURL,header=FALSE,as.is=TRUE)
names(aba)[1:9] <- c("Gender","Length","Diameter","Height","WholeWt","ShuckedWt","ViscWt","ShellWt","Rings")
head(aba) # show only the first few rows
abamf <- aba[aba$Gender != "I",] # exclude infants from the analysis
abamf$Sex_num <- ifelse(abamf$Gender=="M", 1, 0) # code males as 1, females as 0
lftn <- function(clmn) {
  # fit a one-predictor logistic model and return its two coefficients
  glm(abamf$Sex_num ~ clmn, family=binomial)$coef}
loall <- sapply(abamf[,2:9],lftn) # columns 2 through 9 are the eight predictors
  • After reading in the data frame, we exclude the observations for infants and code gender numerically in Sex_num (1 for male, 0 for female). We then call sapply() on the subdata frame made up of columns 2 through 9, which leaves out both Gender and the derived Sex_num column. (Passing abamf[,-1] instead would also fit a meaningless model that predicts Sex_num from itself.) In other words, this is an eight-column subframe consisting of our eight explanatory variables. Thus, lftn() is called on each column of that subframe.
  • Taking as input a column from the subframe, accessed via the formal argument clmn, the glm() call inside lftn() fits a logistic model that predicts gender from that column and hence from that explanatory variable. Recall from Section 1.5 that the ordinary regression function lm() returns a class “lm” object containing many components, one of which is coefficients, the vector of estimated \(\beta_i\). This component is also in the return value of glm(). Also recall that list component names can be abbreviated if there is no ambiguity. Here, we’ve shortened coefficients to coef.
loall
##                Length  Diameter    Height    WholeWt  ShuckedWt     ViscWt
## (Intercept)  1.275832  1.289130  1.027872  0.4300827  0.2855054  0.4829153
## clmn        -1.962613 -2.533227 -5.643495 -0.2688070 -0.2941351 -1.4647507
##                ShellWt       Rings
## (Intercept)  0.5103942  0.64823569
## clmn        -1.2135496 -0.04509376
class(loall)
## [1] "matrix" "array"
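Since the abalone file has to be downloaded, here is a self-contained sketch of the same sapply() pattern on simulated data. The toy data frame, its columns x1 and x2, and fit_one() are all invented for illustration and are not part of the abalone analysis. It also shows that transposing the result gives one row per predictor:

```r
set.seed(1)   # make the simulated data reproducible
toy <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
# simulated 0/1 response that depends on x1 only
toy$y <- rbinom(50, 1, plogis(0.5 * toy$x1))
# fit a one-predictor logistic model, return its coefficients
fit_one <- function(clmn) {
  coef(glm(toy$y ~ clmn, family = binomial))
}
res <- sapply(toy[, c("x1", "x2")], fit_one)
class(res)   # a matrix: one column per predictor
t(res)       # transposed: one row per predictor
```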