This code-through explores several functions in R that the econometrician will find useful when first beginning to analyze a new dataset. It is not intended to delve into the nuances of statistical analysis, but rather to show how basic tests commonly done in Stata or Excel can be done in R.
Throughout the lesson, we’ll explain and demonstrate how to visualize
data in a useful way, check for correlations among variables, conduct
t-tests, run linear regressions, and output those regressions into
shareable tables. The intention is to provide the basic building blocks
of statistical analysis. Once the basics are mastered, R offers several
options to increase the complexity and appeal of each function.
This topic is valuable because, in addition to providing a great format for visualizing data and creating interesting, interactive graphs, R is an extremely capable data processor and comes with several advanced statistical analysis packages and base functions. This combination of data analysis and publishing capabilities makes R extremely powerful for the data analyst.
Specifically, we’ll examine how to leverage the ggplot function to visualize data. We will use a simple visualization - the scatterplot - to look for relationships between variables. We will also learn to color code our output. We will then test for correlations using the cor function. Next, we will leverage the t.test function to compare the means of two samples. We will then use the lm function to set up simple, multiple, and binary regressions to more thoroughly understand the significance of relationships. Finally, we will execute the modelsummary function to output the results of our regressions.
Using the Lahman data set, we will run through the basic steps of statistical analysis.
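Before running the examples below, we load the packages used throughout the lesson (a minimal setup sketch; it assumes the tidyverse, Lahman, and modelsummary packages are already installed):

library(tidyverse)    # dplyr's filter/select and ggplot2 for plotting
library(Lahman)       # provides the Teams dataset used throughout
library(modelsummary) # regression tables at the end of the lesson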
When analysts are first exploring datasets, it is often easiest to
identify relationships by visualizing data. A common way to do so is to
plot two variables against one another and look for strength and
direction of a potential relationship. In R we use the
ggplot function and geom_point to
create a scatter plot of two variables: Wins and Hits. The result shows
a very strong, positive relationship between the two. This makes
intuitive sense: the more hits a team has, the more we would expect them
to win.
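A minimal sketch of that first plot, assuming the Teams data and ggplot2 are loaded as in the setup above:

ggplot(data = Teams) +
  geom_point(mapping = aes(x = H, y = W))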
We can also use data visualization to distinguish between samples. In this dataset, the teams are separated into divisions. By color coding each division we can more easily check for differences across division.
ggplot(data = Teams) +
geom_point(mapping = aes(x = H, y = W, color = divID))

After identifying two variables that seem to have a strong
interaction, we want to calculate the magnitude of that interaction. We
can do so by testing for correlation using the cor
function in R. A correlation of 1 indicates a perfect, positive
relationship. A -1 indicates a perfect, negative relationship. 0 means
no relationship at all.
In the first example we run a pairwise correlation between Hits and Wins and find a very strong, positive relationship. In the second example we create a correlation matrix of a new, filtered dataset. Note that cor does not handle missing values well, so we exclude them. We then limit our dataset to a few numeric variables.
cor(Teams$W, Teams$H)

## [1] 0.7368386
Teams_Small <- filter(Teams, !is.na(W) & !is.na(H) & !is.na(HR))
Teams_Small <- select(Teams_Small, W, H, HR)
cor(Teams_Small)

##            W         H        HR
## W  1.0000000 0.7368386 0.4772234
## H  0.7368386 1.0000000 0.4881452
## HR 0.4772234 0.4881452 1.0000000
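An alternative to filtering out the missing values by hand is cor's use argument, which tells cor how to handle them (a sketch that should produce the same matrix as above):

# Let cor() drop rows with missing values itself
cor(select(Teams, W, H, HR), use = "complete.obs")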
T-tests compare the means of two samples, with a null hypothesis that
the difference between the means is zero, indicating no significant
difference. If the test results reject the null hypothesis, then we can
claim that two groups are, indeed, different. In this example, we can
test the average number of Wins across division by using the
t.test function. Note that this form of t.test can only
operate on a binary grouping column, one with exactly two values. So before running the test, we will create
three new columns, one per division, and mark each team with a 1 if it
is in that division and a 0 otherwise. We then run the t.test on Wins by division.
In the first test, we find a p-value of 0.004, indicating a statistically significant difference in mean Wins between the Central division and its peers (Central teams average slightly fewer wins). In the second, we find a much higher p-value, and therefore cannot reject the null hypothesis that there is no difference in mean Wins between the West division and its peers.
Teams$Central <- ifelse(Teams$divID == "C", 1, 0)
Teams$East <- ifelse(Teams$divID == "E", 1, 0)
Teams$West <- ifelse(Teams$divID == "W", 1, 0)
t.test(W ~ Central, Teams)

##
## Welch Two Sample t-test
##
## data: W by Central
## t = 2.9284, df = 433.89, p-value = 0.003586
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## 0.9221352 4.6862166
## sample estimates:
## mean in group 0 mean in group 1
## 79.27536 76.47119
t.test(W ~ West, Teams)

##
## Welch Two Sample t-test
##
## data: W by West
## t = -0.17738, df = 1265.1, p-value = 0.8592
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -1.609656 1.342718
## sample estimates:
## mean in group 0 mean in group 1
## 78.65957 78.79304
After establishing a relationship between variables using visualization and correlations, we may want to develop a model that allows us to use that information to make predictions. To do that we can employ the lm function to create a linear regression. In this example, we will regress number of Wins on Hits. We can see that the result is a coefficient of about +0.06, indicating that every hit is worth roughly 6% of a win.
reg <- lm(W ~ H, data = Teams)
summary(reg)

##
## Call:
## lm(formula = W ~ H, data = Teams)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.242 -8.178 0.322 8.511 42.734
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.3163050 1.3113783 -1.766 0.0774 .
## H 0.0574333 0.0009648 59.526 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.17 on 2983 degrees of freedom
## Multiple R-squared: 0.5429, Adjusted R-squared: 0.5428
## F-statistic: 3543 on 1 and 2983 DF, p-value: < 2.2e-16
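As a quick sketch of how the fitted model can be used for prediction, we can pass new data to predict (the hit total of 1,400 here is hypothetical, chosen only for illustration):

# Expected number of Wins for a team with 1,400 hits, based on the fitted model
predict(reg, newdata = data.frame(H = 1400))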
We often want to regress on several variables at once, in order to identify the impact of a given variable while controlling for others. To do this, we simply add a + sign to the list of independent variables. In this example, we will add Home Runs to our original model. Here we find that the effect of a hit has diminished slightly, but is still positive. We also see that Home Runs have a positive effect on Wins.
lm(W ~ H + HR, data = Teams)

##
## Call:
## lm(formula = W ~ H + HR, data = Teams)
##
## Coefficients:
## (Intercept) H HR
## 0.95295 0.05156 0.04340
Sometimes, our dependent variable is, or needs to be, binary. In that case, we will use the glm function with the binomial family. As an example, we first create a new binary variable based on whether or not a team has a higher-than-average number of Wins. We then regress this new outcome on our independent variables Hits and Home Runs.
Teams$HighWin <- ifelse(Teams$W > mean(Teams$W), 1, 0)
glm(HighWin ~ H + HR, family = binomial(link = "logit"), data = Teams)

##
## Call: glm(formula = HighWin ~ H + HR, family = binomial(link = "logit"),
## data = Teams)
##
## Coefficients:
## (Intercept) H HR
## -9.190521 0.006569 0.005021
##
## Degrees of Freedom: 2984 Total (i.e. Null); 2982 Residual
## Null Deviance: 4082
## Residual Deviance: 3246 AIC: 3252
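The logistic coefficients above are on the log-odds scale. To read them as odds ratios, we can exponentiate them (a minimal sketch that refits and stores the model first):

# Refit and store the logistic model, then convert coefficients to odds ratios
logit_model <- glm(HighWin ~ H + HR, family = binomial(link = "logit"), data = Teams)
exp(coef(logit_model))  # multiplicative change in the odds of a high-win season per additional H or HR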
Finally, we can leverage a function called modelsummary to output the results of several regressions in one table. We simply name and run several regressions, store them in a list (called models in this example), and then pass that list to the modelsummary function.
models <- list(
"Base" = lm(W ~ H, data = Teams),
"W/ Home Runs" = lm(W ~ H + HR, data = Teams),
"W/ At Bats" = lm(W ~ H + HR + AB, data = Teams)
)
modelsummary(models)

|             | Base       | W/ Home Runs | W/ At Bats |
|-------------|------------|--------------|------------|
| (Intercept) | −2.316     | 0.953        | −0.543     |
|             | (1.311)    | (1.319)      | (1.489)    |
| H           | 0.057      | 0.052        | 0.047      |
|             | (0.001)    | (0.001)      | (0.003)    |
| HR          |            | 0.043        | 0.041      |
|             |            | (0.004)      | (0.004)    |
| AB          |            |              | 0.002      |
|             |            |              | (0.001)    |
| Num.Obs.    | 2985       | 2985         | 2985       |
| R2          | 0.543      | 0.561        | 0.562      |
| R2 Adj.     | 0.543      | 0.561        | 0.561      |
| AIC         | 23393.0    | 23274.1      | 23271.5    |
| BIC         | 23411.0    | 23298.2      | 23301.5    |
| Log.Lik.    | −11693.505 | −11633.073   | −11630.747 |
| F           | 3543.369   | 1905.885     | 1273.695   |
| RMSE        | 12.16      | 11.92        | 11.91      |
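modelsummary can also write the table directly to a file via its output argument (a sketch; the file name is just an example, and some output formats require additional supporting packages):

# Save the same comparison table as a Word document instead of printing it
modelsummary(models, output = "regression_table.docx")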
Learn more about statistical analysis in R with the following:
Tidyverse.org: Graphing in ggplot
Data-flair.com: T-tests
Statology.org: Linear Modeling
CRAN.org: Model Summary