Introduction

This code-through explores several functions in R that the econometrician will find useful when first beginning to analyze a new dataset. It is not intended to delve into the nuances of statistical analysis, but rather to show how basic tests commonly done in Stata or Excel can be done in R.


Content Overview

Throughout the lesson, we’ll explain and demonstrate how to visualize data in a useful way, check for correlations among variables, conduct t-tests, run linear regressions, and output those regressions into shareable tables. The intention is to provide the basic building blocks of statistical analysis. Once the basics are mastered, R offers several options to increase the complexity and appeal of each function.

This topic is valuable because in addition to providing a great format for visualizing data and creating interesting, interactive graphs, R is an extremely capable data processor and comes with several advanced statistical analysis packages and base functions. This combination of data analysis and publishing makes R extremely powerful for the data analyst.


Learning Objectives

Specifically, we’ll examine how to leverage the ggplot function to visualize data. We will use a simple visualization - the scatterplot - to look for relationships between variables, and we will learn to color-code our output. We will then test for correlations using the cor function. Next, we will leverage the t.test function to compare the means of two samples. We will then use the lm function to set up simple, multiple, and binary regressions to more thoroughly understand the significance of relationships. Finally, we will execute the modelsummary function to output the results of our regressions.



Basic Econometrics in R

Using the Teams table from the Lahman baseball dataset, we will run through the basic steps of statistical analysis.


Data Visualization

When analysts first explore a dataset, it is often easiest to identify relationships by visualizing the data. A common way to do so is to plot two variables against one another and look for the strength and direction of a potential relationship. In R, we use the ggplot function with geom_point to create a scatterplot of two variables: Wins and Hits. The result shows a very strong, positive relationship between the two. This makes intuitive sense: the more hits a team has, the more we would expect it to win.
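
A minimal version of that first scatterplot might look like the sketch below, assuming the Teams data frame comes from the Lahman package and ggplot2 is available:

library(ggplot2)   # plotting
library(Lahman)    # provides the Teams data frame

ggplot(data = Teams) +
  geom_point(mapping = aes(x = H, y = W))   # Hits on the x-axis, Wins on the y-axis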

We can also use data visualization to distinguish between samples. In this dataset, the teams are separated into divisions. By color-coding each division, we can more easily check for differences across divisions.

ggplot(data = Teams) +
  geom_point(mapping = aes(x = H, y = W, color = divID))


Correlation

After identifying two variables that appear to have a strong relationship, we want to measure the strength of that relationship. We can do so by testing for correlation using the cor function in R. A correlation of 1 indicates a perfect, positive relationship; a -1 indicates a perfect, negative relationship; and 0 indicates no linear relationship at all.

In the first example we run a pairwise correlation between Wins and Hits and find a very strong, positive relationship. In the second example we create a correlation matrix from a new, filtered dataset. Note that cor does not handle missing values well, so we first exclude rows containing them and then limit the dataset to a few numeric variables.

cor(Teams$W, Teams$H)
## [1] 0.7368386
Teams_Small <- filter(Teams, !is.na(W) & !is.na(H) & !is.na(HR))
Teams_Small <- select(Teams_Small, W, H, HR)
cor(Teams_Small)
##            W         H        HR
## W  1.0000000 0.7368386 0.4772234
## H  0.7368386 1.0000000 0.4881452
## HR 0.4772234 0.4881452 1.0000000
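
As an aside, cor can also drop incomplete rows on its own through its use argument, which makes the separate filtering step optional. A minimal sketch, equivalent in spirit to the matrix above:

cor(select(Teams, W, H, HR), use = "complete.obs")   # use only rows with no missing values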


T-tests

T-tests compare the means of two samples, with a null hypothesis that the difference between the means is zero, indicating no significant difference. If the test rejects the null hypothesis, then we can claim that the two groups are, indeed, different. In this example, we test the average number of Wins across divisions using the t.test function. Note that when t.test is given a formula, the grouping variable must have exactly two levels. So before running the test, we create three new indicator columns, one per division, marking each team with a 1 if it belongs to that division and a 0 otherwise. We then run the t.test on Wins by division.

In the first test, we find a p-value of 0.004, indicating a statistically significant difference in average Wins between the Central division and its peers. In the second, we find a much higher p-value, and therefore cannot reject the null hypothesis that the West division’s mean Wins is the same as that of its peers.

Teams$Central <- ifelse(Teams$divID == "C", 1, 0)
Teams$East <- ifelse(Teams$divID == "E", 1, 0)
Teams$West <- ifelse(Teams$divID == "W", 1, 0)

t.test(W ~ Central, Teams)
## 
##  Welch Two Sample t-test
## 
## data:  W by Central
## t = 2.9284, df = 433.89, p-value = 0.003586
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  0.9221352 4.6862166
## sample estimates:
## mean in group 0 mean in group 1 
##        79.27536        76.47119
t.test(W ~ West, Teams)
## 
##  Welch Two Sample t-test
## 
## data:  W by West
## t = -0.17738, df = 1265.1, p-value = 0.8592
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -1.609656  1.342718
## sample estimates:
## mean in group 0 mean in group 1 
##        78.65957        78.79304
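
If you only need the p-value itself, say to compare several divisions quickly, the object returned by t.test can be indexed directly. A minimal sketch:

t.test(W ~ East, data = Teams)$p.value   # extract just the p-value for the East comparison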


Linear Regression

After establishing a relationship between variables using visualization and correlations, we may want to develop a model that allows us to use that information to make predictions. To do that we can employ the lm function to create a linear regression. In this example, we will regress the number of Wins on Hits. The result is a coefficient of roughly +0.06 on Hits, indicating that each additional hit is associated with about six-hundredths of an additional win.

reg <- lm(W ~ H, data = Teams)
summary(reg)
## 
## Call:
## lm(formula = W ~ H, data = Teams)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.242  -8.178   0.322   8.511  42.734 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.3163050  1.3113783  -1.766   0.0774 .  
## H            0.0574333  0.0009648  59.526   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.17 on 2983 degrees of freedom
## Multiple R-squared:  0.5429, Adjusted R-squared:  0.5428 
## F-statistic:  3543 on 1 and 2983 DF,  p-value: < 2.2e-16
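
Because the stated goal is prediction, it is worth noting that a fitted lm object can be combined with predict to generate fitted values for new data. A minimal sketch, where the 1,500-hit figure is purely illustrative:

predict(reg, newdata = data.frame(H = 1500))   # predicted Wins for a team with 1,500 hits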


Multiple Regression

We often want to regress on several variables at once in order to identify the impact of a given variable while controlling for others. To do this, we simply add additional independent variables with a + sign. In this example, we add Home Runs to our original model. Here we find that the coefficient on Hits has decreased slightly but remains positive, and that Home Runs also have a positive association with Wins.

lm(W ~ H + HR, data = Teams)
## 
## Call:
## lm(formula = W ~ H + HR, data = Teams)
## 
## Coefficients:
## (Intercept)            H           HR  
##     0.95295      0.05156      0.04340
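
As with the simple regression, wrapping the call in summary would report the standard errors and significance levels for the expanded model:

summary(lm(W ~ H + HR, data = Teams))   # full output with standard errors and p-values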


Binomial Regression

Sometimes our dependent variable is, or needs to be, binary. In that case, we use the glm function with the binomial family to run a logistic regression. As an example, we first create a new binary variable indicating whether or not a team has an above-average number of Wins. We then regress this new outcome on our independent variables, Hits and Home Runs.

Teams$HighWin <- ifelse(Teams$W > mean(Teams$W), 1, 0)

glm(HighWin ~ H + HR, family = binomial(link = "logit"), data = Teams)
## 
## Call:  glm(formula = HighWin ~ H + HR, family = binomial(link = "logit"), 
##     data = Teams)
## 
## Coefficients:
## (Intercept)            H           HR  
##   -9.190521     0.006569     0.005021  
## 
## Degrees of Freedom: 2984 Total (i.e. Null);  2982 Residual
## Null Deviance:       4082 
## Residual Deviance: 3246  AIC: 3252
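
Because logit coefficients are expressed on the log-odds scale, a common next step is to exponentiate them into odds ratios, which are easier to interpret. A minimal sketch, storing the model first:

logit_model <- glm(HighWin ~ H + HR, family = binomial(link = "logit"), data = Teams)
exp(coef(logit_model))   # multiplicative change in the odds of a high-win season per one-unit increase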


Model Summary

Finally, we can leverage a function called modelsummary to output the results of several regressions in one table. We simply name and run several regressions, store them in a list (called models in this example), and then pass that list to the modelsummary function.

models <- list(
  "Base" = lm(W ~ H, data = Teams),
  "W/ Home Runs" = lm(W ~ H + HR, data = Teams),
  "W/ At Bats" = lm(W ~ H + HR + AB, data = Teams)
)
modelsummary(models)
##                      Base      W/ Home Runs   W/ At Bats
## (Intercept)         −2.316        0.953         −0.543
##                     (1.311)      (1.319)        (1.489)
## H                    0.057        0.052          0.047
##                     (0.001)      (0.001)        (0.003)
## HR                                0.043          0.041
##                                  (0.004)        (0.004)
## AB                                               0.002
##                                                 (0.001)
## Num.Obs.             2985         2985           2985
## R2                   0.543        0.561          0.562
## R2 Adj.              0.543        0.561          0.561
## AIC                 23393.0      23274.1        23271.5
## BIC                 23411.0      23298.2        23301.5
## Log.Lik.          −11693.505   −11633.073     −11630.747
## F                   3543.369     1905.885       1273.695
## RMSE                 12.16        11.92          11.91
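
If the table needs to live outside of R, modelsummary can also write it straight to a file through its output argument. A minimal sketch (the filename is just an example; see the package documentation for supported formats):

modelsummary(models, output = "regressions.html")   # save the table as an HTML file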



Further Resources

Learn more about statistical analysis in R with the following: