Dataframes

A data frame is like a matrix in that it has a two-dimensional rows-and-columns structure. The main difference is that different columns can have different modes: for example, one column can hold character strings and another numbers.

Creating Data Frames

#To begin, let’s take another look at our simple data frame example from Section 1.4.5:
cars <- c("Ford","Ferrari")
horsepower <- c(120,500)
d <- data.frame(cars,horsepower,stringsAsFactors=FALSE)
d # matrix-like viewpoint

stringsAsFactors is set to FALSE because, if it were TRUE, the character columns would be converted to factors instead of being kept as character vectors. Since most of the time we work with character vectors rather than factors, this setting is usually what we want. (Since R 4.0.0, FALSE is the default.)

# The same example with stringsAsFactors=TRUE, for comparison:
cars2 <- c("Ford","Ferrari")
horsepower2 <- c("120","500")
d2 <- data.frame(cars2,horsepower2,stringsAsFactors=TRUE)
d2 # matrix-like viewpoint
class(d2$horsepower2)
[1] "factor"

When printed, the data frame looks the same whether a column is a factor or a character vector, but as class() shows above, the column is now a factor, and this may cause problems when working with the data.
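One common problem of this kind, sketched here under the assumption that the horsepower values were read in as a factor: converting the factor directly to numbers returns the internal level codes rather than the original values. The variable name hp is used only for this illustration.

```r
# A factor like the one stringsAsFactors=TRUE produces for horsepower2
hp <- factor(c("120", "500"))

# Direct conversion returns the level codes, not the horsepower values
codes <- as.numeric(hp)

# Converting through character first recovers the intended numbers
values <- as.numeric(as.character(hp))

print(codes)
print(values)
```

Converting through as.character() first is the usual workaround when a numeric-looking column has ended up as a factor.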

Accessing Data Frames

Let’s explore the data frame that has been created. Since d is a list, we can access its components by index or by name:

d[[1]]
[1] "Ford"    "Ferrari"
d$cars
[1] "Ford"    "Ferrari"

It can be treated in a matrix-like fashion too. For example, column 1 can be viewed this way:

d[,1]
[1] "Ford"    "Ferrari"

This matrix-like quality is also seen when we take d apart using str():

str(d)
'data.frame':   2 obs. of  2 variables:
 $ cars      : chr  "Ford" "Ferrari"
 $ horsepower: num  120 500

This output shows that d has 2 observations (rows) and 2 variables (columns).

Consider three ways to access the first column of our data frame above: d[[1]], d[,1], and d$cars. Of these, the third is generally considered clearer and, more importantly, safer than the first two, because it identifies the column by name and makes it less likely that you will reference the wrong column. But in writing general code (say, writing R packages), matrix-like notation such as d[,1] is needed, and it is especially handy if you are extracting subdata frames.
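As a quick check of the three access forms, the sketch below rebuilds d and confirms that they return the same vector. The last lines show one possible way to select a column when its name is stored in a variable; col_name is a hypothetical variable introduced only for this illustration.

```r
cars <- c("Ford", "Ferrari")
horsepower <- c(120, 500)
d <- data.frame(cars, horsepower, stringsAsFactors = FALSE)

# All three forms return the same character vector
same_list_matrix <- identical(d[[1]], d[, 1])
same_matrix_name <- identical(d[, 1], d$cars)
print(c(same_list_matrix, same_matrix_name))

# In general code the column name may only be known at run time;
# [[ ]] accepts a name held in a variable, which $ does not
col_name <- "horsepower"   # hypothetical variable
print(d[[col_name]])
```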

Extended Example: Regression Analysis of Exam Grades Continued

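The getwd() function returns the current working directory, which is where R looks for a file given by a relative path such as examsquiz.csv. If the file is not in that directory, read.csv() fails, as the output below shows.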

getwd()
[1] "C:/Users/raulr/Downloads"
examsquiz <- read.csv("examsquiz.csv",sep=",",header=TRUE)
Warning: cannot open file 'examsquiz.csv': No such file or directory
Error in file(file, "rt") : cannot open the connection
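The error above means the file was not found in the working directory. A minimal sketch of guarding the read with file.exists() follows. Because the real examsquiz.csv is not available here, the sketch first writes a small stand-in file to a temporary directory; its name, its two columns, and its scores are made up for illustration.

```r
# Stand-in data: these scores are invented, not the real exam data
csv_path <- file.path(tempdir(), "examsquiz_demo.csv")
demo <- data.frame(Exam1 = c(2.0, 3.3, 4.0), Exam2 = c(3.3, 2.0, 4.0))
write.csv(demo, csv_path, row.names = FALSE)

# Check that the file exists before reading, and report the
# working directory if it does not
if (file.exists(csv_path)) {
  exams_demo <- read.csv(csv_path, header = TRUE)
} else {
  stop("File not found: ", csv_path, " (working directory: ", getwd(), ")")
}

print(dim(exams_demo))
```

With a real file, either call setwd() to move to the directory that contains it or pass the full path to read.csv().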

The head() function returns the first 6 rows of the data frame:

head(examsquiz)

Other Matrix-Like Operations

Several matrix operations can also be applied to data frames. In particular, filtering can be used to extract subdata frames of interest.

Extracting Subdata Frames

As mentioned, a data frame can be viewed in row-and-column terms. In particular, we can extract subdata frames by rows or columns. Here’s an example:

This piece of code returns rows 3 to 7 and all the columns:

examsquiz[3:7,]

Now only column 2 (Exam2) of rows 3 to 7 is returned:

examsquiz[3:7,2]
[1] 2 4 2 4 3

The class of rows 3 to 7 of column 2 is shown below:

class(examsquiz[3:7,2])
[1] "numeric"

Setting drop=FALSE instead of the default TRUE keeps the result as a data frame rather than simplifying it to a numeric vector:

examsquiz[3:7,2,drop=FALSE]

As can be seen below, the class of the output has changed to data frame:

class(examsquiz[3:7,2,drop=FALSE])
[1] "data.frame"

Note that because examsquiz[3:7,2] selects a single column, R returned a vector instead of another data frame. By specifying drop=FALSE, as described for the matrix case in Section 3.6, we can keep it as a (one-column) data frame.
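The effect of drop can be reproduced on any data frame. The sketch below uses a small data frame named scores, with made-up values, as a stand-in for examsquiz.

```r
# Made-up stand-in for examsquiz
scores <- data.frame(Exam1 = c(2.0, 3.3, 4.0), Exam2 = c(3.3, 2.0, 4.0))

# Default (drop=TRUE): a single column is simplified to a vector
v <- scores[1:2, 2]
print(class(v))

# drop=FALSE: the result stays a one-column data frame
s <- scores[1:2, 2, drop = FALSE]
print(class(s))
print(dim(s))
```

Keeping the data frame structure matters when later code expects column names or calls functions such as nrow() on the result.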

We can also do filtering. Here’s how to extract the subframe of all students whose first exam score was at least 2.9:

examsquiz[examsquiz$Exam1 >= 2.9,]

More on Treatment of NA Values

Suppose the second exam score for the first student had been missing. Then we would have typed the following into that line when we were preparing the data file:

#2.0 NA 4.0

In any subsequent statistical analyses, R would do its best to cope with the missing data. However, in some situations, we need to set the option na.rm=TRUE, explicitly telling R to ignore NA values. For instance, with the missing exam score, calling R’s mean() function with na.rm=TRUE to calculate the mean score on exam 2 would skip that first student in finding the mean. Without it, R would just report NA for the mean.

Here’s a little example:

Since one of the values is missing, R cannot compute the mean and returns NA:

x <- c(3,NA,9,27)
mean(x)
[1] NA

With the missing value removed, the mean is (3+9+27)/3 = 13:

mean(x,na.rm=TRUE)
[1] 13
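The same idea applies to a column of a data frame. The sketch below assumes a small data frame named scores, with made-up values, in which the first student's second exam score is missing.

```r
# Made-up scores; the first student's Exam2 value is missing
scores <- data.frame(Exam1 = c(2.0, 3.3, 4.0), Exam2 = c(NA, 2.0, 4.0))

# Without na.rm the missing value propagates to the result
m_raw <- mean(scores$Exam2)

# With na.rm=TRUE the mean is taken over the two known scores
m_clean <- mean(scores$Exam2, na.rm = TRUE)

# Count how many values were skipped
n_missing <- sum(is.na(scores$Exam2))

print(m_raw)
print(m_clean)
print(n_missing)
```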

In Section 2.8.2, you were introduced to the subset() function, which drops rows where the condition evaluates to NA and so saves you the trouble of handling NA values yourself. You can apply it to data frames for row selection. The column names are taken in the context of the given data frame. In our example, instead of typing this:

examsquiz[examsquiz$Exam1 >= 3.8,]

we could run this:

subset(examsquiz,Exam1 >= 3.8)
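The two forms differ when the filtered column contains NA values. The sketch below uses a small data frame named scores, with made-up values, to show the difference: bracket filtering keeps a row of NAs for the student with the missing score, while subset() leaves it out.

```r
# Made-up scores; the second student's Exam1 value is missing
scores <- data.frame(Exam1 = c(4.0, NA, 3.0), Exam2 = c(3.7, 2.0, 3.3))

# Bracket filtering: the NA in the condition yields an all-NA row
by_bracket <- scores[scores$Exam1 >= 3.8, ]

# subset() treats an NA condition as FALSE and omits that row
by_subset <- subset(scores, Exam1 >= 3.8)

print(nrow(by_bracket))
print(nrow(by_subset))
```

To get the same result with brackets, the condition can be wrapped in which(), as in scores[which(scores$Exam1 >= 3.8), ].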