ifelse(), tabulating frequenciesbind_rows() functionifelse() functiontable() and crosstab()Let’s begin by reading in our NHL regular 2016 season data and the 2016 playoffs. Check out the variables and the number of observations.
library(dplyr)
library(lubridate)
library(ggplot2)
library(stargazer)
nhl <- read.csv("https://www.dropbox.com/s/krlwr6z38ol6wer/NHLseason2016.csv?raw=1")
playoffs <- read.csv("https://www.dropbox.com/s/cmmhug6525m8w0u/NHLplayoffs2016.csv?raw=1")
bind_rows() functionIn Lab 3 we learned how to merge data, i.e. putting two data frames side by side. If we wanted to bring the regular season games and the playoffs together into one data frame, we would want to stack one data frame on top of the other, i.e. append, or bind rows. However, before we do that, let’s create a new variable that will mark whether the game was part of the regular season or playoffs. (Let’s call it regvsplay.)
nhl$regvsplay <- "regular"
playoffs$regvsplay <- "playoffs"
nhl <- bind_rows(nhl, playoffs)
Suppose we are interested in the difference between average attendance at regular season games vs attendance at playoff games. We can use feed stargazer filtered NHL data, but if we want to also see if the difference is statistically significant, we can use t.test(). The argument t.test() takes is a formula. The syntax for entering a formula is y ~ x where y is a numeric variable and x is character variable that identifies the two groups.
stargazer(filter(nhl, regvsplay=="regular"), type="text")
##
## ===================================================
## Statistic N Mean St. Dev. Min Max
## ---------------------------------------------------
## goals_home 1,230 2.815 1.597 0 8
## goals_visit 1,230 2.609 1.551 0 9
## attendance 1,229 17,514.940 2,587.211 9,021 32,767
## ---------------------------------------------------
stargazer(filter(nhl, regvsplay=="playoffs"), type="text")
##
## =================================================
## Statistic N Mean St. Dev. Min Max
## -------------------------------------------------
## goals_home 91 2.670 1.484 0 6
## goals_visit 91 2.582 1.585 0 6
## attendance 91 18,512.510 1,243.239 15,795 22,260
## -------------------------------------------------
t.test(nhl$attendance ~ nhl$regvsplay)
##
## Welch Two Sample t-test
##
## data: nhl$attendance by nhl$regvsplay
## t = 6.6606, df = 155.8, p-value = 4.427e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 701.7184 1293.4082
## sample estimates:
## mean in group playoffs mean in group regular
## 18512.51 17514.94
Do you remember how to interpret a t-stat, or the p-value? Is the difference in attendance statistically significant?
In Lab 3 we plotted the SP500 and NASDAQ in one graph. We did that by including two geom_line() layers, and asking each layer to use a different variable for what should be mapped to y axis (..+geom_line(aes(y=SP500)) + geom_line(aes(y=NASDQ))). We used two layers and two geom_ functions because we had two variables. Suppose we have just one variable, X, that we want to plot, but there is another variable, Y, that identifies different groups of observations. If we want to plot X by different groups identified by Y, we can map color to Y. For example, if we want to plot attendance by whether or not the game is a regular season game or a playoff game, we can do the following.
ggplot(nhl) + geom_density(aes(x=attendance, color=regvsplay), adjust=3)
Notice that we included regvsplay in the aes() function telling ggplot to map attendance to x axis and regvsplay to color.
In contrast, suppose we wanted to plot the density of goals by home team vs the density of goals by visiting team. The way our data is organized we have two variables, so we are going to use two geom_ functions.
ggplot(nhl) + geom_density(aes(x=goals_home), color="blue", adjust=3) +
geom_density(aes(x=goals_visit), color="red", adjust=3) +
xlab("goals")
wday(nhl$Date, label=TRUE) from package lubridate to create day of the week variable in your NHL data.ifelse() functionSuppose I wanted to count wins by the home team. Right now, all we have is goals by each team. Let’s create a variable that equals 1 if the home team wins and 0 if the home team loses. (We would have to rethink this if there were ties in NHL.) We can use the ifelse() function. Its first argument is a logical condition, the second argument is the value the function returns when the logical condition is true, the third argument is the value the function returns when the logical condition is false.
nhl$win <- ifelse(nhl$goals_home-nhl$goals_visit>0,1,0)
table() and crosstab())Often times we want to tabulate the values that a variable takes on. Function table() does that.
table(nhl$win)
##
## 0 1
## 624 697
Function crosstab() from package descr can do nice cross-tabulations (i.e. cross tabulating frequencies across two variables)
library(descr)
crosstab(nhl$win, nhl$regvsplay, plot=FALSE)
## Cell Contents
## |-------------------------|
## | Count |
## |-------------------------|
##
## =====================================
## nhl$regvsplay
## nhl$win playoffs regular Total
## -------------------------------------
## 0 45 579 624
## -------------------------------------
## 1 46 651 697
## -------------------------------------
## Total 91 1230 1321
## =====================================
Load in the merged NYC weather and S&P500 data from the previous lab. You can run your previous lab and add function write.csv() at the end to save your data. At the beginning of this lab, you can use read.csv() to load the data in. Make sure that you are in your business analytics project so that R know which folder to read the data from.
Create a new variable that takes on value “yes” if there was some precipitation and “no” otherwise. What is the mean return on days when there was precipitation compared to when there was no precipitation?
Is the difference between mean S&P 500 return with or without precipitation statistically significant? (Hint: You have to do a difference in means test here. If you remember the formula, you can do it manually. If you don’t remember the formula or don’t feel like punching the numbers into the calculator, you can look up how to get R do it for you.)
Plot the densities of daily returns when there is no precipitation overlayed with density of returns when there is precipitation. (Hint: Returns are volatile and sometimes can take extreme values. If you want your plot to only display density withing a certain range you can do so buy adding this to your plot + coord_cartesian(xlim = c(-2, 2))).
What do the two densities show - what is your interpretation of this analysis?