Data Frames
A data frame is like a matrix: it has a two-dimensional rows-and-columns
structure. The main difference is that different columns can be of
different modes. For example, one column may hold character strings and
another numbers.
Creating Data Frames
#To begin, let’s take another look at our simple data frame example from Section 1.4.5:
cars <- c("Ford","Ferrari")
horsepower <- c(120,500)
d <- data.frame(cars,horsepower,stringsAsFactors=FALSE)
d # matrix-like viewpoint
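stringsAsFactors is set to FALSE because, if it were TRUE, the character
columns would be stored as factors rather than as character vectors.
Since most of the time we work with character vectors and not with
factors, it is set explicitly here. (FALSE has been the default since
R 4.0.0; earlier versions defaulted to TRUE.)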
# The same example, but with horsepower stored as strings and stringsAsFactors=TRUE:
cars2 <- c("Ford","Ferrari")
horsepower2 <- c("120","500")
d2 <- data.frame(cars2,horsepower2,stringsAsFactors=TRUE)
d2 # matrix-like viewpoint
class(d2$horsepower2)
[1] "factor"
As can be seen above, the printed data frame looks the same whether a
column is a factor or a character vector, but the difference may cause
problems when working with the data.
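One such problem can be sketched as follows. This is a minimal illustration, assuming the horsepower strings from the example above end up as a factor; the variable names hp_factor, codes, and values are made up for this sketch.

```r
# Horsepower stored as strings and converted to a factor,
# as stringsAsFactors=TRUE would do
hp_factor <- factor(c("120", "500"))

# as.numeric() on a factor returns the internal level codes,
# not the numbers the labels spell out
codes <- as.numeric(hp_factor)

# Converting to character first recovers the original numbers
values <- as.numeric(as.character(hp_factor))

print(codes)
print(values)
```

Going through as.character() first is the usual way to turn factor labels back into numbers.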
Accessing Data Frames
Let's explore the data frame that has been created. d is a list, so we
can access its components by index or by name:
d[[1]]
[1] "Ford" "Ferrari"
d$cars
[1] "Ford" "Ferrari"
It can be indexed like a matrix too. Column 1 is shown below:
d[,1]
[1] "Ford" "Ferrari"
This matrix-like quality is also seen when we take d apart using
str():
str(d)
'data.frame':	2 obs. of  2 variables:
 $ cars      : chr  "Ford" "Ferrari"
 $ horsepower: num  120 500
This output shows that d has 2 observations (rows) and 2 variables
(columns).
Consider three ways to access the first column of our data frame
above: d[[1]], d[,1], and d$cars. Of these, the third would generally
be considered clearer and, more importantly, safer than the first
two. It better identifies the column and makes it less likely that you
will reference the wrong column. But in writing general code (say,
writing R packages), matrix-like notation such as d[,1] is needed, and
it is especially handy if you are extracting subdata frames.
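The equivalence of the three forms can be checked directly. The sketch below rebuilds d so that it runs on its own. The helper get_column is hypothetical and only illustrates general code in which the column name arrives in a variable, so $ cannot be used.

```r
d <- data.frame(cars = c("Ford", "Ferrari"),
                horsepower = c(120, 500),
                stringsAsFactors = FALSE)

# All three forms return the same character vector
same_list   <- identical(d[[1]], d$cars)
same_matrix <- identical(d[, 1], d$cars)
print(c(same_list, same_matrix))

# Hypothetical helper: the column is chosen at run time,
# so it is looked up with [[ ]] instead of $
get_column <- function(df, name) df[[name]]
hp <- get_column(d, "horsepower")
print(hp)
```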
Extended Example: Regression Analysis of Exam Grades Continued
The getwd() function returns the current working directory, which is where R looks for files given by a relative path.
getwd()
[1] "C:/Users/raulr/Downloads"
examsquiz <- read.csv("examsquiz.csv",sep=",",header=TRUE)
Warning: cannot open file 'examsquiz.csv': No such file or directory
Error in file(file, "rt") : cannot open the connection
The first 6 rows of the data frame can be shown with head():
head(examsquiz)
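The "cannot open file" error above occurs when the file is not in the working directory. A minimal sketch of a read that does not depend on getwd() is given below. It uses made-up scores rather than the real examsquiz.csv, and the column name Quiz and the names scores, csv_path, and examsquiz_demo are assumptions for illustration.

```r
# Hypothetical scores; the real examsquiz.csv is not reproduced here
scores <- data.frame(Exam1 = c(2.0, 3.3, 4.0),
                     Exam2 = c(3.3, 2.0, 4.0),
                     Quiz  = c(4.0, 3.7, 4.0))

# Write to a temporary file so the example works from any directory
csv_path <- tempfile(fileext = ".csv")
write.csv(scores, csv_path, row.names = FALSE)

# Check that the file exists before reading it
if (file.exists(csv_path)) {
  examsquiz_demo <- read.csv(csv_path, header = TRUE)
  print(head(examsquiz_demo))
} else {
  message("File not found: ", csv_path)
}
```

For a real data file, passing the full path to read.csv(), or calling setwd() first, serves the same purpose.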
Other Matrix-Like Operations
Some matrix operations can also be applied to data frames. With
filtering, subdata frames of interest can be extracted.
Extracting Subdata Frames
As mentioned, a data frame can be viewed in row-and-column terms. In
particular, we can extract subdata frames by rows or columns. Here’s an
example:
This piece of code returns rows 3 to 7 and all the columns:
examsquiz[3:7,]
Now, rows 3 to 7 of column number 2 (Exam2) are output:
examsquiz[3:7,2]
[1] 2 4 2 4 3
The class of rows 3 to 7 of column number 2 is shown below:
class(examsquiz[3:7,2])
[1] "numeric"
Changing drop from TRUE to FALSE changes the class of the output from
numeric to data frame:
examsquiz[3:7,2,drop=FALSE]
As can be seen below, the class of the output has changed to data
frame:
class(examsquiz[3:7,2,drop=FALSE])
[1] "data.frame"
Note that in the call without drop, since examsquiz[3:7,2] is a single
column, R returned a vector instead of another data frame. By specifying
drop=FALSE, as described for the matrix case in Section 3.6, we can keep
it as a (one-column) data frame.
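The effect of drop can be reproduced on a small stand-in data frame. This is a sketch with made-up scores; the names scores, as_vector, and as_frame are assumptions for illustration.

```r
scores <- data.frame(Exam1 = c(2.0, 3.3, 4.0, 2.3),
                     Exam2 = c(3.3, 2.0, 4.0, 1.0))

# Selecting a single column drops the result to a plain vector
as_vector <- scores[2:3, 2]

# drop=FALSE keeps the result as a one-column data frame
as_frame <- scores[2:3, 2, drop = FALSE]

print(class(as_vector))
print(class(as_frame))
print(dim(as_frame))
```

Keeping the data frame class matters when later code expects names() or column-wise operations to work on the result.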
We can also do filtering. Here’s how to extract the subframe of all
students whose first exam score was at least 2.9:
examsquiz[examsquiz$Exam1 >= 2.9,]
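How this kind of filtering works can be seen step by step on a small stand-in data frame. The scores are made up, and the names scores, keep, and high are assumptions for this sketch.

```r
scores <- data.frame(Exam1 = c(2.0, 3.3, 4.0, 2.9),
                     Exam2 = c(3.3, 2.0, 4.0, 1.0))

# The comparison yields one TRUE/FALSE value per row
keep <- scores$Exam1 >= 2.9
print(keep)

# Used as the row index, it keeps only the TRUE rows, with all columns
high <- scores[keep, ]
print(high)
```

Leaving the column index empty after the comma is what keeps every column in the result.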
More on Treatment of NA Values
Suppose the second exam score for the first student had been missing.
Then we would have typed the following into that line when we were
preparing the data file:
#2.0 NA 4.0
In any subsequent statistical analyses, R would do its best to cope
with the missing data. However, in some situations, we need to set the
option na.rm=TRUE, explicitly telling R to ignore NA values. For
instance, with the missing exam score, calling R’s mean() function with
na.rm=TRUE to calculate the mean score on exam 2 would skip that first
student. Without na.rm=TRUE, R would just report NA for the mean.
Here’s a little example:
Since one of the values is missing, R cannot compute the mean:
x <- c(3,NA,9,27)
mean(x)
[1] NA
With the missing value removed, the mean is (3+9+27)/3 = 13:
mean(x,na.rm=TRUE)
[1] 13
In Section 2.8.2, you were introduced to the subset() function, which
drops rows where the condition evaluates to NA and so saves you the
trouble of handling missing values yourself. You can apply it to data
frames for row selection. The column names are taken in the context of
the given data frame. In our example, instead of typing this:
examsquiz[examsquiz$Exam1 >= 3.8,]
we could run this:
subset(examsquiz,Exam1 >= 3.8)
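The two forms differ when the filtered column contains NA. The sketch below uses a small stand-in data frame with made-up scores; the names scores, by_bracket, and by_subset are assumptions for illustration.

```r
scores <- data.frame(Exam1 = c(2.0, NA, 4.0),
                     Exam2 = c(3.3, 2.0, 4.0))

# Bracket filtering: the NA in the condition produces an all-NA row
by_bracket <- scores[scores$Exam1 >= 3.8, ]

# subset(): rows where the condition is NA are dropped
by_subset <- subset(scores, Exam1 >= 3.8)

print(by_bracket)
print(by_subset)
```

With bracket filtering, wrapping the condition in which() is a common way to get the same NA-dropping behavior.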