Lab 4: Stacking data frames, comparing means, plotting, `ifelse()`, tabulating frequencies

Learning objectives:

appending (as opposed to merging) data frames using bind_rows() function
comparing means
plotting two variables
creating a new variable using ifelse() function
tabulating frequencies using table() and crosstab()

0. Introduction

Let’s begin by reading in our data on the 2016 NHL regular season and the 2016 playoffs. Check out the variables and the number of observations.

library(dplyr)
library(lubridate)
library(ggplot2)
library(stargazer)
nhl <- read.csv("https://www.dropbox.com/s/krlwr6z38ol6wer/NHLseason2016.csv?raw=1")
playoffs <- read.csv("https://www.dropbox.com/s/cmmhug6525m8w0u/NHLplayoffs2016.csv?raw=1")
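One way to check the variables and the number of observations is with base R's str(), names(), nrow() and ncol(). The sketch below uses a small made-up data frame (toy_games is hypothetical and only stands in for nhl) so that it runs without downloading anything. The same calls should work on nhl and playoffs.

```r
# Toy stand-in for the NHL data; the values are made up for illustration.
toy_games <- data.frame(
  goals_home  = c(3, 1, 4),
  goals_visit = c(2, 2, 1),
  attendance  = c(17500, 18200, 16900)
)

str(toy_games)             # variable names, types, and first few values
names(toy_games)           # just the variable names
n_obs  <- nrow(toy_games)  # number of observations (rows)
n_vars <- ncol(toy_games)  # number of variables (columns)
```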

1. Appending (as opposed to merging) data frames using `bind_rows()` function

In Lab 3 we learned how to merge data, i.e. putting two data frames side by side. If we wanted to bring the regular season games and the playoffs together into one data frame, we would want to stack one data frame on top of the other, i.e. append, or bind rows. However, before we do that, let’s create a new variable that will mark whether the game was part of the regular season or playoffs. (Let’s call it regvsplay.)

nhl$regvsplay <- "regular"
playoffs$regvsplay <- "playoffs"
nhl <- bind_rows(nhl, playoffs)
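To see what stacking does, here is a minimal sketch on two made-up data frames (toy_regular and toy_playoffs are hypothetical, not the lab data). It illustrates that bind_rows() matches columns by name rather than by position, and that a column present in only one data frame is filled with NA for the other's rows.

```r
library(dplyr)

# Made-up mini data frames standing in for nhl and playoffs.
toy_regular  <- data.frame(goals_home = c(3, 1), attendance = c(17500, 18200))
toy_playoffs <- data.frame(attendance = 19000, goals_home = 2, round = "final")

toy_regular$regvsplay  <- "regular"
toy_playoffs$regvsplay <- "playoffs"

# Columns are matched by name, not position; a column missing from one
# data frame (here `round`) is filled with NA for that frame's rows.
toy_all <- bind_rows(toy_regular, toy_playoffs)
toy_all
```

The column order differs between the two toy data frames on purpose: the playoff row still ends up with goals_home equal to 2 and attendance equal to 19000.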

2. Testing for differences in means

Suppose we are interested in the difference between average attendance at regular season games and at playoff games. We can feed stargazer the filtered NHL data to see the two means, but if we also want to know whether the difference is statistically significant, we can use t.test(). The argument t.test() takes is a formula. The syntax for entering a formula is y ~ x, where y is a numeric variable and x is a character variable that identifies the two groups.

stargazer(filter(nhl, regvsplay=="regular"), type="text")

## 
## ===================================================
## Statistic     N      Mean    St. Dev.   Min   Max  
## ---------------------------------------------------
## goals_home  1,230   2.815      1.597     0     8   
## goals_visit 1,230   2.609      1.551     0     9   
## attendance  1,229 17,514.940 2,587.211 9,021 32,767
## ---------------------------------------------------

stargazer(filter(nhl, regvsplay=="playoffs"), type="text")

## 
## =================================================
## Statistic   N     Mean    St. Dev.   Min    Max  
## -------------------------------------------------
## goals_home  91   2.670      1.484     0      6   
## goals_visit 91   2.582      1.585     0      6   
## attendance  91 18,512.510 1,243.239 15,795 22,260
## -------------------------------------------------

t.test(nhl$attendance ~ nhl$regvsplay)

## 
##  Welch Two Sample t-test
## 
## data:  nhl$attendance by nhl$regvsplay
## t = 6.6606, df = 155.8, p-value = 4.427e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   701.7184 1293.4082
## sample estimates:
## mean in group playoffs  mean in group regular 
##               18512.51               17514.94

Do you remember how to interpret a t-stat, or the p-value? Is the difference in attendance statistically significant?
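As a refresher on where the t-stat comes from, the sketch below recomputes the Welch statistic by hand from the means, standard deviations and sample sizes reported above (note that N for regular season attendance is 1,229, since one value is missing). Because the inputs are rounded, the results should land close to, but not exactly on, the t and df that t.test() reports. The variable names are just illustrative.

```r
# Summary statistics copied from the stargazer and t.test() output above.
mean_play <- 18512.51; sd_play <- 1243.239; n_play <- 91
mean_reg  <- 17514.94; sd_reg  <- 2587.211; n_reg  <- 1229

# Squared standard error of each group mean.
se2_play <- sd_play^2 / n_play
se2_reg  <- sd_reg^2  / n_reg

# Welch t statistic: difference in means over its standard error.
t_stat <- (mean_play - mean_reg) / sqrt(se2_play + se2_reg)

# Welch-Satterthwaite approximation for the degrees of freedom.
df_welch <- (se2_play + se2_reg)^2 /
  (se2_play^2 / (n_play - 1) + se2_reg^2 / (n_reg - 1))

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

round(c(t = t_stat, df = df_welch), 2)
```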

3. Plotting two lines on one graph (adding layers vs using a variable to map color)

In Lab 3 we plotted the SP500 and NASDAQ in one graph. We did that by including two geom_line() layers and asking each layer to map a different variable to the y axis (... + geom_line(aes(y=SP500)) + geom_line(aes(y=NASDQ))). We used two layers and two geom_ functions because we had two variables. Suppose instead we have just one variable, X, that we want to plot, but there is another variable, Y, that identifies different groups of observations. If we want to plot X separately for the groups identified by Y, we can map color to Y. For example, to plot attendance by whether the game is a regular season game or a playoff game, we can do the following.

ggplot(nhl) + geom_density(aes(x=attendance, color=regvsplay), adjust=3)

Notice that we included regvsplay in the aes() function, telling ggplot to map attendance to the x axis and regvsplay to color.

In contrast, suppose we wanted to plot the density of goals by home team vs the density of goals by visiting team. The way our data is organized we have two variables, so we are going to use two geom_ functions.

ggplot(nhl) + geom_density(aes(x=goals_home), color="blue", adjust=3) +
              geom_density(aes(x=goals_visit), color="red", adjust=3) +
              xlab("goals")

IN-CLASS EXERCISE:

Use function wday(nhl$Date, label=TRUE) from package lubridate to create a day-of-the-week variable in your NHL data.
Plot densities of attendance by day of the week. Which days have the highest attendance? Which days have the most reliable attendance?

4. Creating a new variable using the `ifelse()` function

Suppose I wanted to count wins by the home team. Right now, all we have is goals by each team. Let’s create a variable that equals 1 if the home team wins and 0 if the home team loses. (We would have to rethink this if there were ties in the NHL.) We can use the ifelse() function. Its first argument is a logical condition, the second argument is the value the function returns when the condition is true, and the third argument is the value it returns when the condition is false.

nhl$win <- ifelse(nhl$goals_home-nhl$goals_visit>0,1,0)
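The sketch below runs the same ifelse() call on a few made-up scores (not the lab data) to show that the condition is evaluated for each game separately. The last game is a hypothetical tie, included only to show why ties would need rethinking: it ends up coded as 0.

```r
# Made-up scores for four games; the last one is a hypothetical tie.
goals_home  <- c(3, 1, 4, 2)
goals_visit <- c(2, 2, 1, 2)

# ifelse() checks the condition for each game separately.
win <- ifelse(goals_home - goals_visit > 0, 1, 0)
win  # the tie is coded 0, i.e. treated like a home loss

# The returned values can also be character labels.
result <- ifelse(goals_home > goals_visit, "home win", "not a home win")
result
```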

5. Tabulating frequencies of different values a variable takes on (using `table()` and `crosstab()`)

Often we want to tabulate the values that a variable takes on. Function table() does that.

table(nhl$win)

## 
##   0   1 
## 624 697

Function crosstab() from package descr can do nice cross-tabulations (i.e. tabulating frequencies across two variables).

library(descr)
crosstab(nhl$win, nhl$regvsplay, plot=FALSE)

##    Cell Contents 
## |-------------------------|
## |                   Count | 
## |-------------------------|
## 
## =====================================
##            nhl$regvsplay
## nhl$win    playoffs   regular   Total
## -------------------------------------
## 0                45       579     624
## -------------------------------------
## 1                46       651     697
## -------------------------------------
## Total            91      1230    1321
## =====================================
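Counts are easier to compare across groups of different sizes once they are turned into shares. As a sketch, the code below re-enters the four counts from the cross-tabulation above by hand and uses base R's prop.table() to compute the share of home wins within each game type. With the actual data, the equivalent call would be prop.table(table(nhl$win, nhl$regvsplay), margin = 2).

```r
# Counts copied from the crosstab() output above.
counts <- matrix(
  c(45, 46,     # playoffs: home losses (0), home wins (1)
    579, 651),  # regular:  home losses (0), home wins (1)
  nrow = 2,
  dimnames = list(win = c("0", "1"),
                  regvsplay = c("playoffs", "regular"))
)

# margin = 2 divides each column by its column total, giving the
# share of home losses and home wins within each game type.
shares <- prop.table(counts, margin = 2)
round(shares, 3)
```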

Exercises

Load in the merged NYC weather and S&P 500 data from the previous lab. You can run your previous lab and add function write.csv() at the end to save your data. At the beginning of this lab, you can use read.csv() to load the data in. Make sure that you are in your business analytics project so that R knows which folder to read the data from.
Create a new variable that takes on value “yes” if there was some precipitation and “no” otherwise. What is the mean return on days when there was precipitation compared to when there was no precipitation?
Is the difference between mean S&P 500 return with or without precipitation statistically significant? (Hint: You have to do a difference in means test here. If you remember the formula, you can do it manually. If you don’t remember the formula or don’t feel like punching the numbers into the calculator, you can look up how to get R to do it for you.)
Plot the density of daily returns when there is no precipitation overlaid with the density of returns when there is precipitation. (Hint: Returns are volatile and can sometimes take extreme values. If you want your plot to display the density only within a certain range, you can do so by adding this to your plot: + coord_cartesian(xlim = c(-2, 2)).)
What do the two densities show? What is your interpretation of this analysis?
