Data frames

A data frame is a matrix-like object with columns that can have different data types.

A data frame is different from a matrix (an R data structure that we will not cover). Matrices, like vectors, only accept columns with the same data type. This is a limitation because, what if we want the person’s name in column 1 and the person’s salary in column 2? A matrix cannot handle this, but a data frame can.

Example: Create a data frame with the presidents’ height data.

# Just copy the data in the file and paste it here. Then, run this code chunk to create the vectors

presid_name= c("Obama","Bush","Bush","Clinton","Clinton","Bush Father","Reagan","Reagan","Carter","Nixon","Nixon","Johnson","Kennedy","Eisenhower","Eisenhower","Truman")

winner = c(185, 182, 182, 188, 188, 188, 185, 185, 177, 182, 182, 193, 183, 179, 179, 175)

opponent = c(175, 193, 185, 187, 188, 173, 180, 177, 183, 185, 180, 180, 182, 178, 178, 173)

Let’s say we want to create another vector to store the election year. The year starts in 2008 and goes all the way back until 1948.

Use the next code chunk to create a vector called year with the year values that satisfy this problem.

year= seq (from= 2008, to= 1948, by=-4)

Create a vector called “isWinnerTaller”, which is True when Winner > Opponent and False otherwise. Then, use “isWinnerTaller” to count how many times the winner has been taller than the opponent.

isWinnerTaller= winner > opponent

isWinnerTaller
##  [1]  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE
## [13]  TRUE  TRUE  TRUE  TRUE
# Count how many times the winner has been taller than the opponent.

sum(isWinnerTaller)
## [1] 11

Use all the previous vectors to create a data frame with all the data. We can use the data.frame() function to create R data frames.

df_presidents= data.frame(presid_name, year, winner, opponent, isWinnerTaller)
df_presidents

Let’s learn some handy functions to work with data frames

head(df_presidents)
str (df_presidents)
## 'data.frame':    16 obs. of  5 variables:
##  $ presid_name   : chr  "Obama" "Bush" "Bush" "Clinton" ...
##  $ year          : num  2008 2004 2000 1996 1992 ...
##  $ winner        : num  185 182 182 188 188 188 185 185 177 182 ...
##  $ opponent      : num  175 193 185 187 188 173 180 177 183 185 ...
##  $ isWinnerTaller: logi  TRUE FALSE FALSE TRUE FALSE TRUE ...
colnames(df_presidents)
## [1] "presid_name"    "year"           "winner"         "opponent"      
## [5] "isWinnerTaller"
ncol(df_presidents)
## [1] 5
nrow(df_presidents)
## [1] 16

Let’s learn how to add a new column to an existing data frame.

For example, add a column named difference as the last column. This column stores the difference in height between winners and opponents.

df_presidents$difference = winner - opponent
df_presidents

Now, let’s learn how to delete a column by using the index of the column in the data frame.

For example, delete the difference column by using its index.

df_presidents [ , -6]
df_presidents

The right way is to re-assign the deletion to the data frame

df_presidents = df_presidents [, -6]
df_presidents

A better way of deleting the last column of a data frame IF you want to do it more programmatically is:

df_presidents = df_presidents [, -ncol(df_presidents)]

DO NOT RUN THIS STATEMENT. I just want you to know that this is an alternative way of deleting the the last column of a data frame.

Let’s add the column difference again, so we can delete it one more time using a different method.

df_presidents$difference = winner - opponent
df_presidents

Before, we deleted the column difference by using its index. Now, let’s delete by using its name:

df_presidents[ , colnames(df_presidents) != 'difference' ]
df_presidents= df_presidents[ , colnames(df_presidents) != 'difference' ]
df_presidents

Indexing a data frame

Get the second column

df_presidents[ , 2]
##  [1] 2008 2004 2000 1996 1992 1988 1984 1980 1976 1972 1968 1964 1960 1956 1952
## [16] 1948
# alternative 2: df_presidents [, 'year']

# alternative 3: df_presidents$year

Get the last column

df_presidents[ , ncol(df_presidents)]
##  [1]  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE
## [13]  TRUE  TRUE  TRUE  TRUE
# alternative: df_presidents[ , 5]

Get the winner’s and the opponent’s heights, only for the first three rows in the data set.

df_presidents [c(1,2,3), c(3,4)]
# alternative: df_presidents [1:3 , c(3,4)]

Some useful functions to work with data frames

The subset() function

Example: Use the subset() function to get the rows from the data frame where the Winner > Opponent

subset (df_presidents, df_presidents$isWinnerTaller==TRUE)
# or, more succinct: subset (df_presidents, isWinnerTaller)

Example: Get the winners’ names only for cases when Winner > Opponent

subset (df_presidents$presid_name, df_presidents$isWinnerTaller==TRUE)
##  [1] "Obama"       "Clinton"     "Bush Father" "Reagan"      "Reagan"     
##  [6] "Nixon"       "Johnson"     "Kennedy"     "Eisenhower"  "Eisenhower" 
## [11] "Truman"

Other useful functions are apply() and tapply(). To be able to practice these functions, lets’ add a column with the party of the winner (i.e., the party of the president).

party= c("Dem", "Rep", "Rep", "Dem", "Dem", "Rep", "Rep", "Rep", "Dem","Rep","Rep","Dem","Dem","Rep","Rep","Dem")
df_presidents$presid_party= party
df_presidents

Use tapply() to compute the mean height for the presidents from each party

tapply (df_presidents$winner, df_presidents$presid_party, mean)
##      Dem      Rep 
## 184.1429 182.6667

Use tapply() to compute the max height for the presidents from each party

tapply (df_presidents$winner, df_presidents$presid_party, max)
## Dem Rep 
## 193 188

Use apply() to compute the mean height for the winners and opponents

apply (df_presidents [, c("winner", "opponent")] ,2, mean)
##   winner opponent 
## 183.3125 181.0625
# The 2 means that we want to apply the mean by columns

We can also use the colMeans () function

colMeans (df_presidents [, c("winner", "opponent")])
##   winner opponent 
## 183.3125 181.0625

Use apply() to compute the sd of the height for the winners and opponents

apply (df_presidents [, c("winner", "opponent")] , 2, sd)
##   winner opponent 
## 4.629165 5.579352
sd(df_presidents [, "winner"])
## [1] 4.629165
sd(df_presidents [, "opponent"])
## [1] 5.579352