A data frame is a matrix-like object with columns that can have different data types.
A data frame differs from a matrix (an R data structure that we will not cover). A matrix, like a vector, can hold only one data type across all of its elements. This is a limitation: what if we want the person’s name in column 1 and the person’s salary in column 2? A matrix cannot handle this, but a data frame can.
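To make the difference concrete, here is a minimal sketch with made-up names and salaries. The vectors person and salary are illustrative assumptions, not part of the presidents data. Binding them into a matrix coerces every value to character, while a data frame keeps one type per column.

```r
# Hypothetical example data (not from the presidents data set)
person = c("Ana", "Ben", "Cleo")
salary = c(52000, 61000, 58000)

# cbind() builds a matrix, so the numbers are coerced to character strings
m = cbind(person, salary)
class(m[, "salary"])    # "character"

# data.frame() keeps one type per column
d = data.frame(person, salary)
class(d$salary)         # "numeric"
mean(d$salary)          # 57000
```

Because the salary column of the matrix holds text, mean() would not work on it without converting it back to numbers first.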
Example: Create a data frame with the presidents’ height data.
# Just copy the data in the file and paste it here. Then, run this code chunk to create the vectors
presid_name= c("Obama","Bush","Bush","Clinton","Clinton","Bush Father","Reagan","Reagan","Carter","Nixon","Nixon","Johnson","Kennedy","Eisenhower","Eisenhower","Truman")
winner = c(185, 182, 182, 188, 188, 188, 185, 185, 177, 182, 182, 193, 183, 179, 179, 175)
opponent = c(175, 193, 185, 187, 188, 173, 180, 177, 183, 185, 180, 180, 182, 178, 178, 173)
Let’s say we want to create another vector to store the election year. Elections take place every four years, so the values start at 2008 and go back to 1948 in steps of 4.
Use the next code chunk to create a vector called year that holds these values.
year= seq (from= 2008, to= 1948, by=-4)
Create a vector called isWinnerTaller, which is TRUE when winner > opponent and FALSE otherwise. Then, use isWinnerTaller to count how many times the winner has been taller than the opponent.
isWinnerTaller= winner > opponent
isWinnerTaller
## [1] TRUE FALSE FALSE TRUE FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE
## [13] TRUE TRUE TRUE TRUE
# Count how many times the winner has been taller than the opponent.
sum(isWinnerTaller)
## [1] 11
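Why does sum() count the TRUE values? R treats TRUE as 1 and FALSE as 0 in arithmetic. The following small sketch uses only the first four elections from the vectors above to show this, along with two related summaries.

```r
# First four elections from the data above
winner   = c(185, 182, 182, 188)
opponent = c(175, 193, 185, 187)
isWinnerTaller = winner > opponent

as.integer(isWinnerTaller)  # 1 0 0 1
sum(isWinnerTaller)         # 2: number of TRUE values
mean(isWinnerTaller)        # 0.5: proportion of TRUE values
table(isWinnerTaller)       # counts of FALSE and TRUE
```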
Use all the previous vectors to create a data frame with all the data. We can use the data.frame() function to create R data frames.
df_presidents= data.frame(presid_name, year, winner, opponent, isWinnerTaller)
df_presidents
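Let’s learn some handy functions for inspecting a data frame: head() shows the first rows, str() shows the structure, colnames() lists the column names, and ncol() and nrow() return the number of columns and rows.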
head(df_presidents)
str (df_presidents)
## 'data.frame': 16 obs. of 5 variables:
## $ presid_name : chr "Obama" "Bush" "Bush" "Clinton" ...
## $ year : num 2008 2004 2000 1996 1992 ...
## $ winner : num 185 182 182 188 188 188 185 185 177 182 ...
## $ opponent : num 175 193 185 187 188 173 180 177 183 185 ...
## $ isWinnerTaller: logi TRUE FALSE FALSE TRUE FALSE TRUE ...
colnames(df_presidents)
## [1] "presid_name" "year" "winner" "opponent"
## [5] "isWinnerTaller"
ncol(df_presidents)
## [1] 5
nrow(df_presidents)
## [1] 16
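A few more base R functions are often used in the same way. The sketch below applies them to a small data frame built from the first four elections. The name df is only illustrative.

```r
# First four elections from the data above; `df` is an illustrative name
df = data.frame(
  presid_name = c("Obama", "Bush", "Bush", "Clinton"),
  year        = c(2008, 2004, 2000, 1996),
  winner      = c(185, 182, 182, 188),
  opponent    = c(175, 193, 185, 187)
)

dim(df)        # number of rows and columns: 4 4
tail(df, 2)    # the last two rows
summary(df)    # a per-column summary
```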
Let’s learn how to add a new column to an existing data frame.
For example, add a column named difference as the last column. This column stores the difference in height between winners and opponents.
df_presidents$difference = winner - opponent
df_presidents
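Note that the statement above computes the difference from the standalone vectors winner and opponent. As a minimal sketch, using only the first three elections and an illustrative data frame named df, the new column can also be computed from the data frame's own columns. This keeps working even if the standalone vectors are later changed or removed.

```r
# First three elections from the data above; `df` is an illustrative name
df = data.frame(winner = c(185, 182, 182), opponent = c(175, 193, 185))

# Compute the new column from the data frame's own columns
df$difference = df$winner - df$opponent

# transform() does the same and returns a new data frame
df2 = transform(df[, c("winner", "opponent")], difference = winner - opponent)

df$difference   # 10 -11 -3
```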
Now, let’s learn how to delete a column by using the index of the column in the data frame.
For example, delete the difference column by using its index.
df_presidents [ , -6]
df_presidents
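The statement above only returns a copy without column 6; df_presidents itself still has the difference column. To actually delete the column, assign the result back to the data frame: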
df_presidents = df_presidents [, -6]
df_presidents
A more programmatic way of deleting the last column of a data frame, which does not require knowing its index, is:
df_presidents = df_presidents [, -ncol(df_presidents)]
DO NOT RUN THIS STATEMENT. The difference column has already been deleted, so running it now would remove isWinnerTaller instead. I just want you to know that this is an alternative way of deleting the last column of a data frame.
Let’s add the column difference again, so we can delete it one more time using a different method.
df_presidents$difference = winner - opponent
df_presidents
Before, we deleted the column difference by using its index. Now, let’s delete by using its name:
df_presidents[ , colnames(df_presidents) != 'difference' ]
df_presidents= df_presidents[ , colnames(df_presidents) != 'difference' ]
df_presidents
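Base R offers a few other ways to drop a column by name. The sketch below shows three of them on a small data frame. The names df, df_a, df_b and df_c are illustrative, and the values are the first two elections.

```r
# First two elections from the data above; all names here are illustrative
df = data.frame(winner = c(185, 182), opponent = c(175, 193), difference = c(10, -11))

# Option 1: assign NULL to the column (changes the data frame directly)
df_a = df
df_a$difference = NULL

# Option 2: subset() with a negative select
df_b = subset(df, select = -difference)

# Option 3: keep every column name except the unwanted one
df_c = df[, setdiff(colnames(df), "difference")]

colnames(df_a)   # "winner" "opponent"
```

All three leave the same two columns. Option 1 is the shortest when only one column has to go.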
Get the second column
df_presidents[ , 2]
## [1] 2008 2004 2000 1996 1992 1988 1984 1980 1976 1972 1968 1964 1960 1956 1952
## [16] 1948
# alternative 2: df_presidents [, 'year']
# alternative 3: df_presidents$year
Get the last column
df_presidents[ , ncol(df_presidents)]
## [1] TRUE FALSE FALSE TRUE FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE
## [13] TRUE TRUE TRUE TRUE
# alternative: df_presidents[ , 5]
Get the winner’s and the opponent’s heights, only for the first three rows in the data set.
df_presidents [c(1,2,3), c(3,4)]
# alternative: df_presidents [1:3 , c(3,4)]
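One detail worth knowing: selecting a single column with [ , ] returns a plain vector, not a data frame, as the outputs above show. The following sketch, on an illustrative three-row data frame named df, shows how drop = FALSE keeps the result as a data frame.

```r
# First three elections from the data above; `df` is an illustrative name
df = data.frame(year = c(2008, 2004, 2000), winner = c(185, 182, 182))

v = df[, 2]                 # a numeric vector
d = df[, 2, drop = FALSE]   # a one-column data frame

is.data.frame(v)   # FALSE
is.data.frame(d)   # TRUE

# Indexing with a single name and no comma also returns a data frame
df["winner"]
```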
The subset() function
Example: Use the subset() function to get the rows from the data frame where winner > opponent
subset (df_presidents, df_presidents$isWinnerTaller==TRUE)
# or, more succinct: subset (df_presidents, isWinnerTaller)
Example: Get the winners’ names only for the cases where winner > opponent
subset (df_presidents$presid_name, df_presidents$isWinnerTaller==TRUE)
## [1] "Obama" "Clinton" "Bush Father" "Reagan" "Reagan"
## [6] "Nixon" "Johnson" "Kennedy" "Eisenhower" "Eisenhower"
## [11] "Truman"
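The same selection can be written in other ways. The sketch below uses the first four elections and an illustrative data frame named df to show the select argument of subset() and plain logical indexing.

```r
# First four elections from the data above; `df` is an illustrative name
df = data.frame(
  presid_name = c("Obama", "Bush", "Bush", "Clinton"),
  winner      = c(185, 182, 182, 188),
  opponent    = c(175, 193, 185, 187)
)
df$isWinnerTaller = df$winner > df$opponent

# subset() with select: rows where the winner was taller, name column only.
# The result is a one-column data frame.
taller = subset(df, isWinnerTaller, select = presid_name)

# Logical indexing: the same names as a plain vector
names_vec = df$presid_name[df$isWinnerTaller]
names_vec   # "Obama" "Clinton"
```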
Other useful functions are apply() and tapply(). To practice these functions, let’s add a column with the party of the winner (i.e., the party of the president).
party= c("Dem", "Rep", "Rep", "Dem", "Dem", "Rep", "Rep", "Rep", "Dem","Rep","Rep","Dem","Dem","Rep","Rep","Dem")
df_presidents$presid_party= party
df_presidents
Use tapply() to compute the mean height for the presidents from each party
tapply (df_presidents$winner, df_presidents$presid_party, mean)
## Dem Rep
## 184.1429 182.6667
Use tapply() to compute the max height for the presidents from each party
tapply (df_presidents$winner, df_presidents$presid_party, max)
## Dem Rep
## 193 188
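tapply() returns a named vector with one value per group. As a sketch on the first six elections (df is an illustrative name), the code below also counts the rows per group with table() and computes the same group means with aggregate(), which returns a data frame instead.

```r
# First six elections from the data above; `df` is an illustrative name
df = data.frame(
  winner       = c(185, 182, 182, 188, 188, 188),
  presid_party = c("Dem", "Rep", "Rep", "Dem", "Dem", "Rep")
)

# Number of rows per party
table(df$presid_party)

# Group means as a named vector
means = tapply(df$winner, df$presid_party, mean)
means[["Dem"]]   # 187
means[["Rep"]]   # 184

# Group means as a data frame with one row per party
agg = aggregate(winner ~ presid_party, data = df, FUN = mean)
```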
Use apply() to compute the mean height for the winners and opponents
apply (df_presidents [, c("winner", "opponent")] ,2, mean)
## winner opponent
## 183.3125 181.0625
# The 2 is the MARGIN argument: 2 applies the function to each column (1 would apply it to each row)
We can also use the colMeans() function
colMeans (df_presidents [, c("winner", "opponent")])
## winner opponent
## 183.3125 181.0625
Use apply() to compute the sd of the height for the winners and opponents
apply (df_presidents [, c("winner", "opponent")] , 2, sd)
## winner opponent
## 4.629165 5.579352
sd(df_presidents [, "winner"])
## [1] 4.629165
sd(df_presidents [, "opponent"])
## [1] 5.579352
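A caveat about apply(): it converts the data frame to a matrix first, so if any selected column holds text, every value becomes character. That is why the examples above select only the winner and opponent columns. The sketch below, on the first four elections with illustrative names df and heights, shows the effect and uses sapply() as an alternative that works column by column.

```r
# First four elections from the data above; names here are illustrative
df = data.frame(
  presid_name = c("Obama", "Bush", "Bush", "Clinton"),
  winner      = c(185, 182, 182, 188),
  opponent    = c(175, 193, 185, 187)
)

# apply() on mixed columns: everything is coerced to character first
apply(df, 2, class)     # all "character"

# sapply() looks at each column separately, so types are preserved
sapply(df, class)       # "character" "numeric" "numeric"

# Column-wise statistics on the numeric columns only
heights   = df[, c("winner", "opponent")]
col_means = sapply(heights, mean)   # winner 184.25, opponent 185
col_sds   = sapply(heights, sd)
```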