You may be asking yourselves: why are we going to talk about statistics when studying social media and political participation? The answer is multifaceted.
The goal of this lecture is to get us comfortable answering questions with data. In general, we will assume that we have some relationship of interest between an independent variable (\(x\)) and a dependent variable (\(y\)), which are thought to be bound up in some causal relationship.
\[ x \rightarrow y\]
What is the relationship between the unemployment rate and the percentage of votes for the incumbent? I.e.
\[ \text{Unemployment Rate} \rightarrow \text{% Vote for Incumbent ?} \] A quotation of interest: “No American president since Franklin Delano Roosevelt has won a second term in office when the unemployment rate on Election Day topped 7.2 percent.” Binyamin Appelbaum (NYT) 2011.
So what is the effect of the unemployment rate on incumbent vote share? Do we even need to ask? How might we go about answering?
First, we need to establish what data we would need to collect to take a first pass at the relationship.
Lucky for us, this data was made easily available by FiveThirtyEight! Let’s load it into our current R environment:
unemp <- read.csv("https://www.dropbox.com/s/k4wq9idqoxa2ha0/unemployment.csv?dl=1")
head(unemp)
## year incumbent_party incumbent_president nominee unemployment_rate_start
## 1 1912 R Taft Taft 5.1
## 2 1916 D Wilson Wilson 4.9
## 3 1920 D Wilson Cox 5.2
## 4 1924 R Harding Coolidge 8.7
## 5 1928 R Coolidge Hoover 4.9
## 6 1932 R Hoover Hoover 4.6
## unemployment_rate_election election_margin
## 1 5.3 -18.6
## 2 5.6 3.1
## 3 5.2 -26.2
## 4 5.8 25.2
## 5 5.0 17.4
## 6 18.8 -17.7
Let’s take a look at a scatterplot of the two variables of interest, the unemployment rate at the time of the election versus the election margin:
library(ggplot2)
ggplot(unemp, aes(x=unemployment_rate_election, y=election_margin)) +
geom_point() +
ggtitle("Unemployment Rate and Incumbent Vote Margin") +
xlab("Unemployment (%) at Time of Election") +
ylab("Margin of Victory or Defeat (%)")
At least visually, there doesn’t appear to be a particularly clear relationship. What if we looked at the change in unemployment instead?
unemp$unemployment_change <- unemp$unemployment_rate_election - unemp$unemployment_rate_start
ggplot(unemp, aes(x = unemployment_change, y = election_margin)) +
geom_point() +
ggtitle("Change in Unemployment Rate and Incumbent Vote Margin") +
xlab("Change in Unemployment (%) at Time of Election vs Start of Term") +
ylab("Margin of Victory or Defeat (%)")
Obviously neither relationship is particularly clear visually. What can we do to take a more rigorous look?
Linear regression is the work-horse statistical model within political science and beyond. It models the relationship between a dependent variable and one (or more) independent variables as linear, i.e. adequately described by a straight line. The goal, in other terms, is to find the “best-fitting” line for the data as described by the equation
\[ y = X \beta + \epsilon \]
where \(y\) is the dependent variable, \(X\) is the (matrix of) independent variable(s), and \(\beta\) is the (vector of) slope coefficient(s) describing the impact of \(X\) on \(y\), recognizing that there will be some \(\epsilon\) (random) error in the model.
Before going back to the unemployment example, it is useful to think of this in a simulated setting. First, let’s remind ourselves what a perfect linear relationship might look like:
vote_share <- seq(100,0,length=6)
unemp_rate <- seq(0,100,length=6)
ex1 <- data.frame(vote_share,
unemp_rate,
ex = "ex1")
ggplot(ex1, aes(x=unemp_rate,y=vote_share)) +
geom_point() +
geom_line()
Intuitively, the relationship given here takes the form
\[ y = 100 - 1 x \]
where the slope/coefficient value is -1 and the intercept value is 100. We might also imagine some other relationship as follows…
vote_share <- seq(100,0,length=6)
unemp_rate <- seq(0,50,length=6)
ex2 <- data.frame(vote_share,
unemp_rate,
ex = "ex2")
ex <- rbind(ex1,ex2)
ggplot(ex, aes(x=unemp_rate,y=vote_share, color = ex)) +
geom_point() +
geom_line()
where now the (blue) relationship is
\[ y = 100 - 2 x \]
where the slope/coefficient value is -2 and the intercept value is 100. How might these two relationships be interpreted?
A regression coefficient tells us how changing the independent variable \(x\) by one unit changes the dependent variable \(y\). Technically, it is the slope of the best-fitting line that we can draw through all data points where we minimize the sum of squared errors of that regression line.
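As a quick check (this snippet is not part of the original exercise), fitting lm() to the simulated ex2 data should recover exactly these values, since the points lie perfectly on a line:
# Fit a simple regression to the second simulated example; because the points
# fall exactly on the line y = 100 - 2x, lm() recovers an intercept of 100 and a slope of -2
lm(vote_share ~ unemp_rate, data = ex2)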
Here is a bit more of a “realistic” simulated example.
nobs <- 150
x <- rnorm(nobs)
y <- 70 - 1.3*x + rnorm(nobs)
ex <- data.frame(y,x)
ggplot(ex, aes(x=x,y=y)) +
geom_point()
Here we have a clear negative relationship but, supposing we did not generate the data ourselves, have no immediate knowledge of the function which generated it. What to do? Minimize the sum of squared residuals! You won’t have to do this by hand in the course, but the solution to this minimization problem is of the form
\[ \hat{\beta} = (X'X)^{-1}X'y \]
which can be solved for directly in R:
X <- cbind(1,ex$x)
beta <- solve(t(X) %*% X) %*% t(X) %*% ex$y
beta
## [,1]
## [1,] 70.020312
## [2,] -1.383377
Not a bad estimate!
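Equivalently, as a sanity check, the built-in lm() function should return the same two numbers as the matrix algebra above:
# lm() solves the same least-squares problem as the closed-form solution
coef(lm(y ~ x, data = ex))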
ggplot(ex, aes(x=x, y=y)) +
geom_point() +
geom_abline(slope = beta[2,1], intercept = beta[1,1])
We can even verify (at least approximately, over a grid of candidate values) that these parameters minimize the sum of squared errors!
intercepts <- seq(60,80,length=10)
slopes <- seq(0,-2,length=100)
params <- expand.grid(intercepts = intercepts,
slopes = slopes)
out <- list()
for(i in 1:nrow(params)){
betas <- c(params[i,"intercepts"],params[i,"slopes"])
errs <- y - X %*% betas
out[[i]] <- sum(errs^2)
}
params$errs <- unlist(out)
params$grp <- "Not Minimum"
params[which.min(params$errs),"grp"] <- "Minimum"
params[which.min(params$errs),]
## intercepts slopes errs grp
## 676 71.11111 -1.353535 295.1866 Minimum
install.packages("plotly", repos = "http://cran.us.r-project.org")
library(plotly)
plot_ly(params, x = ~intercepts, y = ~slopes, z= ~errs, color= ~grp, colors = c('#BF382A', '#0C4B8E')) %>%
add_markers()
Pretty cool! But what is the relationship between these (slope) coefficients and correlations?
If we rescale both variables by their standard deviations, the regression slope becomes the linear (Pearson) correlation coefficient:
ex$std_x <- ex$x/sd(ex$x)
ex$std_y <- ex$y/sd(ex$y)
lm(std_y ~ std_x, data = ex)
##
## Call:
## lm(formula = std_y ~ std_x, data = ex)
##
## Coefficients:
## (Intercept) std_x
## 43.2332 -0.8372
cor(ex$std_x,ex$std_y)
## [1] -0.8371839
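Another way to see the link (a small aside, not required for the lab): in a simple regression the slope is just the correlation rescaled by the ratio of standard deviations, so the two quantities below should agree:
# The simple-regression slope equals cor(x, y) * sd(y) / sd(x)
coef(lm(y ~ x, data = ex))["x"]
with(ex, cor(x, y) * sd(y) / sd(x))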
This seems nice, but herein lies danger! These sorts of summary statistics, like correlation coefficients, can be generated by a number of fundamentally different relationships! Take The Datasaurus Dozen as a prime example:
Each of the above images has the exact same basic summary statistics, and so would have the same regression line, but the visual relationships are fundamentally different! Here are some additional examples from Wikipedia:
Now that we have those basics out of the way, what do we learn from the first relationship?
library(ggplot2)
ggplot(unemp, aes(x=unemployment_rate_election, y=election_margin)) +
geom_point() +
ggtitle("Unemployment Rate and Incumbent Vote Margin") +
xlab("Unemployment (%) at Time of Election") +
ylab("Margin of Victory or Defeat (%)") +
geom_smooth(method="lm")
What is the direction of the correlation? The magnitude? How close does the data fit the line?
lm(election_margin ~ unemployment_rate_election, data = unemp)
##
## Call:
## lm(formula = election_margin ~ unemployment_rate_election, data = unemp)
##
## Coefficients:
## (Intercept) unemployment_rate_election
## 4.15243 -0.07257
cor(unemp$election_margin,unemp$unemployment_rate_election)
## [1] -0.02119523
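One way to put a number on “how close the data fit the line” is \(R^2\), which in a simple regression like this one is just the squared correlation (a quick check, not part of the original code):
# R-squared from the fitted model equals the squared correlation in simple regression
summary(lm(election_margin ~ unemployment_rate_election, data = unemp))$r.squared
cor(unemp$election_margin, unemp$unemployment_rate_election)^2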
What about for the change in unemployment rate?
unemp$unemployment_change <- unemp$unemployment_rate_election - unemp$unemployment_rate_start
ggplot(unemp, aes(x = unemployment_change, y = election_margin)) +
geom_point() +
ggtitle("Change in Unemployment Rate and Incumbent Vote Margin") +
xlab("Change in Unemployment (%) at Time of Election vs Start of Term") +
ylab("Margin of Victory or Defeat (%)") +
geom_smooth(method="lm")
## `geom_smooth()` using formula = 'y ~ x'
lm(election_margin ~ unemployment_change, data = unemp)
##
## Call:
## lm(formula = election_margin ~ unemployment_change, data = unemp)
##
## Coefficients:
## (Intercept) unemployment_change
## 3.704 -1.144
cor(unemp$election_margin, unemp$unemployment_change)
## [1] -0.349543
So what else could be going on?
Mathematically, this leads to multiple regression, which takes the form
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n + \epsilon = X \beta + \epsilon \] Now each coefficient shows the effect of a one-unit change in the corresponding \(x\) on \(y\) “irrespective of” or “controlling for” the other variables. Let’s think about this through simulation for a moment…
library(MASS)
nobs <- 100
corm <- matrix(c(1,0.5,0.5,1),nrow=2,byrow=T)
X <- mvrnorm(nobs,rep(0,2),corm)
betas <- c(-1,3)
y <- rnorm(nobs, X %*% betas)
d <- data.frame(y,X)
X <- cbind(1,X)
solve(t(X) %*% X) %*% t(X) %*% y
## [,1]
## [1,] 0.003932194
## [2,] -0.892947797
## [3,] 2.987222819
lm(y ~ ., data = d)
##
## Call:
## lm(formula = y ~ ., data = d)
##
## Coefficients:
## (Intercept) X1 X2
## 0.003932 -0.892948 2.987223
Great, we are able to get the correct coefficients! What happens if we don’t include X2?
lm(y ~ X1, data = d)
##
## Call:
## lm(formula = y ~ X1, data = d)
##
## Coefficients:
## (Intercept) X1
## 0.3046 0.6605
Oh no! That’s terribly wrong: our first pitfall, and a great example of omitted variable bias, a topic for a different course.
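Though the details belong to that other course, a minimal sketch of the standard omitted-variable-bias decomposition (illustrative only, reusing the simulated data frame d from above) shows where the distorted estimate comes from:
# The short-regression slope on X1 equals the long-regression slope on X1
# plus the long-regression slope on X2 times the slope from regressing X2 on X1
long  <- coef(lm(y ~ X1 + X2, data = d))
delta <- coef(lm(X2 ~ X1, data = d))["X1"]
long["X1"] + long["X2"] * delta  # reproduces the misleading slope from lm(y ~ X1)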
Our second pitfall has to do with the nature of the data itself: statistical analysis assumes that there is some random noise in our data, which generates the error term. This has a direct bearing on how confident we can be in our estimates! Consider the following example:
set.seed(12345)
nobs <- 1000
x <- rnorm(nobs)
y1 <- rnorm(nobs, x, sd = 1)
y2 <- rnorm(nobs, x, sd = 100)
d <- data.frame(y1,y2,x)
ggplot(d,aes(y=y1,x=x)) +
geom_point() +
geom_smooth(method="lm")
## `geom_smooth()` using formula = 'y ~ x'
summary(lm(y1~x,data=d))
##
## Call:
## lm(formula = y1 ~ x, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2315 -0.6975 0.0011 0.7156 3.3577
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.03220 0.03191 -1.009 0.313
## x 1.03915 0.03193 32.540 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.008 on 998 degrees of freedom
## Multiple R-squared: 0.5148, Adjusted R-squared: 0.5143
## F-statistic: 1059 on 1 and 998 DF, p-value: < 2.2e-16
ggplot(d,aes(y=y2,x=x)) +
geom_point() +
geom_smooth(method="lm")
## `geom_smooth()` using formula = 'y ~ x'
summary(lm(y2~x,data=d))
##
## Call:
## lm(formula = y2 ~ x, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -311.384 -64.608 1.067 64.018 278.335
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.4982 3.0301 -0.824 0.410
## x -0.1649 3.0321 -0.054 0.957
##
## Residual standard error: 95.72 on 998 degrees of freedom
## Multiple R-squared: 2.963e-06, Adjusted R-squared: -0.000999
## F-statistic: 0.002957 on 1 and 998 DF, p-value: 0.9566
In both cases, the “true” relationship between \(x\) and \(y\) is exactly the same; the only difference is how noisy \(y\) is. In the first analysis, we are able to accurately estimate the target coefficient. In the second, however, the estimated coefficient isn’t even in the correct direction!
This leads us directly to the question of how confident we can be in our results, and how we assess statistical significance.
This question is relevant because…
The basic solution to this problem comes in the notion of a standard error. This has the basic form:
\[ SE_{est} = \sqrt{\frac{\text{Sum of Squared Residuals}}{(\text{Sample Size - Number of Parameters}) * \sum_i(x_i - \bar{x})^2 }} \]
for the \(x\) variable of interest. Consider the last simulated example:
m <- lm(y1 ~ x, data = d)
summary(m)
##
## Call:
## lm(formula = y1 ~ x, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2315 -0.6975 0.0011 0.7156 3.3577
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.03220 0.03191 -1.009 0.313
## x 1.03915 0.03193 32.540 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.008 on 998 degrees of freedom
## Multiple R-squared: 0.5148, Adjusted R-squared: 0.5143
## F-statistic: 1059 on 1 and 998 DF, p-value: < 2.2e-16
Let’s calculate the term by hand…
se1 <- sqrt(sum(m$residuals^2)/((nobs - 2)*sum((d$x - mean(d$x))^2)))
se1
## [1] 0.03193485
Excellent! But what does this quantity really represent?
I think that the best way to understand this quantity is by simulating the process we are trying to account for. In particular, we will take 1000 random samples from the above process, estimate the coefficient in each, and calculate the standard deviation of those coefficient estimates. That sampling variability is exactly what the standard error is estimating! First, let’s write a simple function:
set.seed(12345)
one_sim <- function(nobs,beta,sd_e){
x <- rnorm(nobs)
y <- rnorm(nobs, beta*x, sd_e)
X <- cbind(1,x)
b <- solve( t(X) %*% X) %*% t(X) %*% y
b[2,1]
}
Now let’s get 1000 coefficient estimates for each of our cases:
out1 <- sapply(1:1000,function(x)one_sim(1000,1,1))
out2 <- sapply(1:1000,function(x)one_sim(1000,1,100))
sims <- data.frame(process1 = out1, process2 = out2)
Here is the sampling distribution of the coefficient for the first process:
ggplot(sims, aes(x = process1)) +
geom_density()
sd(sims$process1)
## [1] 0.03132773
mean(sims$process1)
## [1] 0.9996647
We see a few things…
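One of those things is worth checking explicitly (a quick comparison, not in the original code): the spread of this simulated sampling distribution should be very close to the analytic standard error se1 computed above:
# Standard deviation of the simulated coefficients vs. the analytic standard error
c(simulated = sd(sims$process1), analytic = se1)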
Now the second scenario…
ggplot(sims, aes(x = process2)) +
geom_density()
sd(sims$process2)
## [1] 3.22746
mean(sims$process2)
## [1] 0.7846619
We see a few things…
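Again, as a quick and purely illustrative comparison: the much larger spread here is in the same ballpark as the standard error that lm() reports for the noisier outcome:
# Spread of the simulated coefficients vs. the reported standard error for y2
c(simulated = sd(sims$process2),
  analytic = coef(summary(lm(y2 ~ x, data = d)))["x", "Std. Error"])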
This leads us directly to our notions of statistical hypothesis testing. Generally speaking, we ask: if the true value of the coefficient were 0 in the world, what is the chance that my sample data would have produced a coefficient value like this? This is the null hypothesis against which we will compare our results.
Mechanically, this is how it works. First, we take the estimated coefficient value and divide it by its standard error. This gives the t-statistic of the effect.
s <- summary(m)
s
##
## Call:
## lm(formula = y1 ~ x, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2315 -0.6975 0.0011 0.7156 3.3577
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.03220 0.03191 -1.009 0.313
## x 1.03915 0.03193 32.540 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.008 on 998 degrees of freedom
## Multiple R-squared: 0.5148, Adjusted R-squared: 0.5143
## F-statistic: 1059 on 1 and 998 DF, p-value: < 2.2e-16
t1 <- s$coefficients[,1]/s$coefficients[,2]
t1
## (Intercept) x
## -1.009025 32.539627
Next, we take this t-statistic and ask ourselves “for a t-distribution with mean zero and degrees of freedom equal to our regression degrees of freedom, with what probability would we observe a t-statistic this extreme or more extreme?” If this probability is sufficiently small, we declare a “significant” result. Otherwise we call it a “null” result. Here is what a t-distribution looks like with 998 degrees of freedom:
curve(dt(x,df=nobs-2),from=-4,to=4)
And so because the t-statistic of our first example is massive, the effect is highly significant. For a more illustrative example, consider calculating the p-value of the intercept coefficient entirely by hand (which you won’t have to do, but it’s good to see that it is simple!):
# Calculate Regression Coefficients
coefs <- solve(t(cbind(1,x)) %*% cbind(1,x)) %*% t(cbind(1,x)) %*% y1
# Calculate Sum of Squared Residuals
ssr <- sum((y1 - cbind(1,x) %*% coefs)^2)
# Calculate Standard Errors
ses <- sqrt(ssr/(nobs-2) * diag(solve(t(cbind(1,x)) %*% cbind(1,x))))
# Calculate the t-statistic
t_stat <- coefs/ses
# Calculate the p-values
(1-pt(abs(t_stat), nobs-2)) * 2
## [,1]
## 0.313207
## x 0.000000
Essentially what we have done is calculate the shaded area of the curve below and find it to be too large to be “significant”:
ggplot(data.frame(x = c(4,-4)),aes(x)) +
stat_function(fun = dt, args=list(df=nobs-2)) +
stat_function(fun = dt, args=list(df=nobs-2),
xlim = c(abs(t_stat)[1],4),
geom = "area",
fill = "red") +
stat_function(fun = dt, args=list(df=nobs-2),
xlim = c(-abs(t_stat)[1],-4),
geom = "area",
fill = "red") +
theme_bw()
What about for the null effect we found?
m2 <- lm(y2 ~ x, data = d)
s2 <- summary(m2)
s2
##
## Call:
## lm(formula = y2 ~ x, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -311.384 -64.608 1.067 64.018 278.335
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.4982 3.0301 -0.824 0.410
## x -0.1649 3.0321 -0.054 0.957
##
## Residual standard error: 95.72 on 998 degrees of freedom
## Multiple R-squared: 2.963e-06, Adjusted R-squared: -0.000999
## F-statistic: 0.002957 on 1 and 998 DF, p-value: 0.9566
First let’s replicate the results by hand…
# Calculate Regression Coefficients
coefs <- solve(t(cbind(1,x)) %*% cbind(1,x)) %*% t(cbind(1,x)) %*% y2
# Calculate Sum of Squared Residuals
ssr <- sum((y2 - cbind(1,x) %*% coefs)^2)
# Calculate Standard Errors
ses <- sqrt(ssr/(nobs-2) * diag(solve(t(cbind(1,x)) %*% cbind(1,x))))
# Calculate the t-statistic
t_stat <- coefs/ses
# Calculate the p-values
(1-pt(abs(t_stat), nobs-2)) * 2
## [,1]
## 0.4098616
## x 0.9566473
And visualize…
ggplot(data.frame(x = c(4,-4)),aes(x)) +
stat_function(fun = dt, args=list(df=nobs-2)) +
stat_function(fun = dt, args=list(df=nobs-2),
xlim = c(abs(t_stat)[1],4),
geom = "area",
fill = "red") +
stat_function(fun = dt, args=list(df=nobs-2),
xlim = c(-abs(t_stat)[1],-4),
geom = "area",
fill = "red") +
theme_bw() +
ggtitle("Null Effect of Model 2 Intercept")
ggplot(data.frame(x = c(4,-4)),aes(x)) +
stat_function(fun = dt, args=list(df=nobs-2)) +
stat_function(fun = dt, args=list(df=nobs-2),
xlim = c(abs(t_stat)[2],4),
geom = "area",
fill = "red") +
stat_function(fun = dt, args=list(df=nobs-2),
xlim = c(-abs(t_stat)[2],-4),
geom = "area",
fill = "red") +
theme_bw() +
ggtitle("Null Effect of Model 2 x-variable")
Excellent. As a final thought, let’s think about what exactly these confidence intervals say, since there are many common misinterpretations. We have taken as tacit, following convention, that we are looking for a 95% confidence interval which, insofar as zero is not within the interval, is interpreted as finding a significant difference at the 5% level.
So what does this actually mean? Perhaps the most straightforward interpretation is the following: were we to re-sample the data and repeat the analysis infinitely many times, the ‘true’ value of the parameter will be captured by confidence intervals constructed in such a way 95% of the time. This is known as the ‘coverage rate’ and is straightforward to demonstrate:
one_sim <- function(nobs,truth){
x <- rnorm(nobs)
y <- truth*x + rnorm(nobs)
m <- lm(y ~ x)
interval <- confint(m)[2,]
interval
}
set.seed(1234)
nsims <- 5000
nobs <- 2000
truth <- 1
sims <- lapply(1:nsims,
function(x)one_sim(nobs,truth))
out <- do.call("rbind",sims)
out <- as.data.frame(out)
names(out) <- c("lower","upper")
out$truth_in_interval <- ifelse(out$lower < truth & out$upper > truth,TRUE,FALSE)
mean(out$truth_in_interval)
## [1] 0.95
Pretty cool!
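For reference (a minimal sketch only, since confint() already does this for you), the interval is just the point estimate plus or minus the appropriate t quantile times the standard error; using the earlier model m:
# A 95% confidence interval "by hand" for the slope of model m;
# the result should match confint(m)["x", ]
est <- coef(summary(m))["x", "Estimate"]
se <- coef(summary(m))["x", "Std. Error"]
est + c(-1, 1) * qt(0.975, df = m$df.residual) * se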
Before we get ahead of ourselves, we want to make sure that you have fundamentals in order. Do the following:
Write a script which…
Save and email your working R script to aae322@nyu.edu by the end of the lab session.
Hint! These data are saved as .csv files so you’ll have to read them in with something like the following:
d <- read.csv("https://www.dropbox.com/s/kyzun4zl7c12j5z/lab2_nyu_data.csv?dl=1")
This lab is partially based on materials provided by Sean Kates, Pablo Barbera, and Drew Dimmery.