Activity 9:
Dataframes
On an intuitive level, a data frame is like a matrix, with a
two-dimensional rows-and-columns structure. However, it differs from a
matrix in that each column may have a different mode. For instance, one
column may consist of numbers, and another column might have character
strings. In this sense, just as lists are the heterogeneous analogs of
vectors in one dimension, data frames are the heterogeneous analogs of
matrices for two-dimensional data.
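To make the matrix/data frame contrast concrete, here is a minimal sketch. The names m and df and their values are made up for illustration and are not part of the activity's data.

```r
# A matrix holds a single mode, so mixing numbers and strings
# coerces every element to character.
m <- cbind(c(1, 2), c("a", "b"))
mode(m)            # "character"

# A data frame keeps each column's own mode.
df <- data.frame(n = c(1, 2), s = c("a", "b"), stringsAsFactors = FALSE)
sapply(df, class)  # n is "numeric", s is "character"

# Under the hood a data frame is a list of equal-length columns.
is.list(df)        # TRUE
```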
Creating Data Frames
#To begin, let’s take another look at our simple data frame example from Section 1.4.5:
kids <- c("RProgrammer","PythonProgrammer")
ages <- c(85212,6510)
d <- data.frame(kids,ages,stringsAsFactors=FALSE)
d # matrix-like viewpoint
-> we assign two vectors and combine them into a data frame
The first two arguments in the call to data.frame() are clear: We
wish to produce a data frame from our two vectors: kids and ages.
However, that third argument, stringsAsFactors=FALSE, requires more
comment.
If the named argument stringsAsFactors is not specified, the default
depends on the R version: before R 4.0.0 it was TRUE, and since R 4.0.0
it is FALSE. (In older versions you could also use options() to change
the default.) With stringsAsFactors=TRUE, if we create a data frame
from a character vector (in this case, kids), R will convert that vector
to a factor. Because our work with character data will typically be with
vectors rather than factors, we set stringsAsFactors to FALSE
explicitly, which behaves the same in every version. We’ll cover factors
in Chapter 6.
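The effect of the argument can be checked directly. The sketch below sets stringsAsFactors explicitly in both calls, so it should not depend on the installed R version. The names d_chr and d_fac are introduced here for illustration only.

```r
kids <- c("RProgrammer", "PythonProgrammer")
ages <- c(85212, 6510)

# Same vectors, built once with each setting.
d_chr <- data.frame(kids, ages, stringsAsFactors = FALSE)
d_fac <- data.frame(kids, ages, stringsAsFactors = TRUE)

class(d_chr$kids)   # "character"
class(d_fac$kids)   # "factor"
levels(d_fac$kids)  # the distinct strings, stored as factor levels
```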
Accessing Data Frames
Now that we have a data frame, let’s explore a bit. Since d is a
list, we can access it as such via component index values or component
names:
d[[1]]
[1] "RProgrammer" "PythonProgrammer"
d$kids
[1] "RProgrammer" "PythonProgrammer"
-> we can access the first column (kids) in two ways: by column
name or by index
But we can treat it in a matrix-like fashion as well. For example, we
can view column 1 by index, using matrix notation:
d[,1]
[1] "RProgrammer" "PythonProgrammer"
This matrix-like quality is also seen when we take d apart using
str():
str(d)
'data.frame': 2 obs. of 2 variables:
$ kids: chr "RProgrammer" "PythonProgrammer"
$ ages: num 85212 6510
-> the str() function gives us the structure:
   - data.frame as structure
   - kids column is character
   - ages column is numeric
R tells us here that d consists of two observations—our two rows—that
store data on two variables—our two columns.
Consider three ways to access the first column of our data frame
above: d[[1]], d[,1], and d$kids. Of these, the third would generally
be considered clearer and, more importantly, safer than the first
two. It better identifies the column and makes it less likely that you
will reference the wrong column. But in writing general code (say,
writing R packages), matrix-like notation d[,1] is needed, and it is
especially handy if you are extracting subdata frames.
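The following sketch compares the three access styles on the same d. The reordered copy d2 and the variable col are hypothetical additions, used only to show why name-based access is safer and how a column name held in a variable can be used in general code.

```r
d <- data.frame(kids = c("RProgrammer", "PythonProgrammer"),
                ages = c(85212, 6510),
                stringsAsFactors = FALSE)

# All three notations return the same character vector.
identical(d[[1]], d$kids)   # TRUE
identical(d[, 1], d$kids)   # TRUE

# If the column order changes, position 1 is no longer kids,
# but access by name still finds the right column.
d2 <- d[, c("ages", "kids")]
d2[[1]]    # now the ages column
d2$kids    # still the kids column

# When the column name is only known at run time, [[ ]] accepts a string.
col <- "kids"
d[[col]]
```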
Extended Example: Regression Analysis of Exam Grades Continued
print the working directory
getwd()
[1] "C:/Users/marxm/OneDrive/Documents/ProgrammingForData_R/acti9"
import the CSV file, using the comma as the separator; the file has a
header row
examsquiz <- read.csv("C:/Users/marxm/OneDrive/Documents/ProgrammingForData_R/acti9/ExamsQuiz.csv",sep=",",header=TRUE)
print the first 6 rows of the data frame (the default for head())
head(examsquiz)
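Since ExamsQuiz.csv sits at a machine-specific path, here is a self-contained sketch of the same import step that reads CSV text held in a string instead of a file. The column names Exam2 and Quiz and all the values are assumptions made for illustration; only Exam1 appears in the activity.

```r
# Hypothetical CSV content standing in for ExamsQuiz.csv.
csv_text <- "Exam1,Exam2,Quiz
2.0,3.3,4.0
3.3,2.0,3.7
4.0,4.3,4.0"

# read.csv() can parse a string through its text argument;
# header = TRUE takes the first line as column names.
demo <- read.csv(text = csv_text, header = TRUE)

str(demo)      # three numeric columns, three observations
head(demo, 2)  # head() takes an optional row count; the default is 6
```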
Other Matrix-Like Operations
Various matrix operations also apply to data frames. Most notably and
usefully, we can do filtering to extract various subdata frames of
interest.
Extracting Subdata Frames
As mentioned, a data frame can be viewed in
row-and-column terms. In particular, we can extract subdata frames by
rows or columns. Here’s an example:
print the second to fifth rows of all columns
examsquiz[2:5,]
return second to fifth rows of second column
examsquiz[2:5,2]
[1] 3.2 2.0 4.0 2.0
return the class of the second to fifth rows of second column ->
numeric
class(examsquiz[2:5,2])
[1] "numeric"
return the second to fifth rows of the second column as a data frame,
keeping the header
examsquiz[2:5,2,drop=FALSE]
return the class of the second to fifth rows of second column with
headers -> data.frame
class(examsquiz[2:5,2,drop=FALSE])
[1] "data.frame"
Note that in the second call, examsquiz[2:5,2] selects a single column,
so R returned a vector instead of another data frame. By specifying
drop=FALSE, as described for the matrix case in Section 3.6, we can keep
it as a (one-column) data frame.
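The drop behavior can be reproduced without the CSV file. This is a small sketch on a made-up data frame; scores, v, and s are names introduced here, and the values are illustrative.

```r
# Hypothetical scores standing in for examsquiz.
scores <- data.frame(Exam1 = c(2.0, 3.3, 4.0, 2.3, 3.8),
                     Exam2 = c(3.3, 2.0, 4.3, 0.0, 1.0))

v <- scores[2:4, 2]                # single column: dropped to a vector
s <- scores[2:4, 2, drop = FALSE]  # single column: kept as a data frame

class(v)   # "numeric"
class(s)   # "data.frame"
names(s)   # "Exam2" (the header survives)
dim(s)     # 3 rows, 1 column

# Selecting two or more columns returns a data frame either way.
class(scores[2:4, 1:2])   # "data.frame"
```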
We can also do filtering. Here’s how to extract the subframe of all
students whose first exam score was at least 3.8:
return all rows (and all columns) where the Exam1 score is at least
3.8:
examsquiz[examsquiz$Exam1 >= 3.8,]
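It can help to look at the logical vector that does the filtering. The sketch below uses a made-up scores data frame; keep, top, and both are illustrative names, not part of the activity.

```r
# Hypothetical scores standing in for examsquiz.
scores <- data.frame(Exam1 = c(2.0, 3.3, 4.0, 2.3, 3.8),
                     Exam2 = c(3.3, 2.0, 4.3, 0.0, 1.0))

# The comparison yields one TRUE/FALSE per row.
keep <- scores$Exam1 >= 3.8
keep                 # FALSE FALSE TRUE FALSE TRUE

# Rows with TRUE are kept; the empty column slot means "all columns".
top <- scores[keep, ]
top

# Conditions can be combined with & (and) or | (or).
both <- scores[scores$Exam1 >= 3.8 & scores$Exam2 >= 4, ]
both
```

Note that the original row numbers are kept as row names in the result.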
More on Treatment of NA Values
Suppose the second exam score for the first student had been missing.
Then we would have typed the following into that line when we were
preparing the data file:
#2.0 NA 4.0
In any subsequent statistical analyses, R would do its best to cope
with the missing data. However, in some situations, we need to set the
option na.rm=TRUE, explicitly telling R to ignore NA values. For
instance, with the missing exam score, calling R’s mean() function with
na.rm=TRUE to calculate the mean score on exam 2 would skip that first
student in finding the mean. Without it, R would just report NA for the
mean.
Here’s a little example:
create x with three values, one of them NA:
x <- c(2,NA,4)
mean(x)
[1] NA
-> the mean is NA because mean() cannot compute a result when a value is NA
remove the NA for the mean calculation:
mean(x,na.rm=TRUE)
[1] 3
-> the NA is removed and the mean is calculated
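The same idea applies to a data frame column. This is a sketch on made-up values, where the first student's Exam2 score is missing as in the scenario above; scores and m2 are illustrative names.

```r
# Hypothetical scores; the first student's Exam2 is missing.
scores <- data.frame(Exam1 = c(2.0, 3.3, 4.0),
                     Exam2 = c(NA, 2.0, 4.3))

mean(scores$Exam2)                      # NA
m2 <- mean(scores$Exam2, na.rm = TRUE)  # mean of 2.0 and 4.3 only
m2

# Counting the missing values first shows how many rows were skipped.
sum(is.na(scores$Exam2))                # 1

# colMeans() accepts na.rm as well, giving one mean per column.
colMeans(scores, na.rm = TRUE)
```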
In Section 2.8.2, you were introduced to the subset() function, which
drops rows where the condition evaluates to NA, saving you the trouble
of handling NA values yourself. You can apply it to data frames for row
selection. The column names are taken in the context of the given data
frame. In our example, instead of typing this:
same as before:
examsquiz[examsquiz$Exam1 >= 3.8,]
we could type this instead (same result, different method) -> subset the
examsquiz data frame
subset(examsquiz,Exam1 >= 3.8)
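The two forms agree when the filtered column has no missing values. When it does, they differ, as this sketch on made-up data suggests. The names scores, by_bracket, by_subset, and by_which are illustrative.

```r
# Hypothetical scores; the second student's Exam1 is missing.
scores <- data.frame(Exam1 = c(2.0, NA, 4.0, 3.8),
                     Exam2 = c(3.3, 2.0, 4.3, 1.0))

# Bracket filtering: the NA in the condition produces a row of NAs.
by_bracket <- scores[scores$Exam1 >= 3.8, ]
nrow(by_bracket)   # 3

# subset() treats an NA condition as "do not keep".
by_subset <- subset(scores, Exam1 >= 3.8)
nrow(by_subset)    # 2

# which() keeps only the positions that are TRUE, so brackets
# combined with which() match subset() here.
by_which <- scores[which(scores$Exam1 >= 3.8), ]
nrow(by_which)     # 2
```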