Lab 2: Statistics in R1

Introduction

The goal of this lecture is to get us working with how to answer questions from data. In general, we will think that we have some relationship of interest between an independent variable (\(x\)) and a dependent variable (\(y\)) which are thought to be bound up in some causal relationship.

\[ x \rightarrow y\]

  1. We want to move from concrete questions to the appropriate data
  2. The primary workhorse we will work up to is linear regression
    • If we change \(x\), what happens to \(y\)?
    • What is the (partial) correlation between these variables?
  3. Confidence and Hypothesis Testing
    • What can we say (and what can we not say) about any found relationship?
  4. Can we go back to the data to find better answers after learning empirical relationships?
  5. Causality: Does \(x\) cause \(y\) or are they only correlated?

A Concrete Example

What is the relationship between the unemployment rate and the % votes for an incumbent? I.e.

\[ \text{Unemployment Rate} \rightarrow \text{% Vote for Incumbent ?} \] A quotation of interest: “No American president since Franklin Delano Roosevelt has won a second term in office when the unemployment rate on Election Day topped 7.2 percent.” Binyamin Appelbaum (NYT) 2011.

So what is the effect of the unemployment rate on incumbent vote share? Do we even need to ask? How might we go about answering?

FIRST we need to establish what data we need to collect to take a first pass at the relationship.

  • Dependent Variable: Incumbent Vote Share
  • Independent Variable: Unemployment Rate

Lucky for us, this data was made easily available by FiveThirtyEight! Let’s load it into our current R environment:

unemp <- read.csv("https://www.dropbox.com/s/k4wq9idqoxa2ha0/unemployment.csv?dl=1")
head(unemp)
##   year incumbent_party incumbent_president  nominee unemployment_rate_start
## 1 1912               R                Taft     Taft                     5.1
## 2 1916               D              Wilson   Wilson                     4.9
## 3 1920               D              Wilson      Cox                     5.2
## 4 1924               R             Harding Coolidge                     8.7
## 5 1928               R            Coolidge   Hoover                     4.9
## 6 1932               R              Hoover   Hoover                     4.6
##   unemployment_rate_election election_margin
## 1                        5.3           -18.6
## 2                        5.6             3.1
## 3                        5.2           -26.2
## 4                        5.8            25.2
## 5                        5.0            17.4
## 6                       18.8           -17.7

Let’s take a look at a scatterplot of the two variables of interest, the unemployment rate at the start of the election versus the election margin:

library(ggplot2)
ggplot(unemp, aes(x=unemployment_rate_election, y=election_margin)) +
  geom_point() +
  ggtitle("Unemployment Rate and Incumbent Vote Margin") +
  xlab("Unemployment (%) at Time of Election") +
  ylab("Margin of Victory or Defeat (%)")

At least visually, there doesn’t appear to be a particularly clear relationship. What if we looked at the change in unemployment instead?

unemp$unemployment_change <- unemp$unemployment_rate_election - unemp$unemployment_rate_start

ggplot(unemp, aes(x = unemployment_change, y = election_margin)) +
  geom_point() +
  ggtitle("Change in Unemployment Rate and Incumbent Vote Margin") +
  xlab("Change in Unemployment (%) at Time of Election vs Start of Term") +
  ylab("Margin of Victory or Defeat (%)")

Obviously neither relationship is particularly clear visually. What can we do to take a more rigorous look?

Linear Regression

Linear regression is the work-horse statistical model within political science and beyond. It essentially models the relationship between a dependent variable and one (or more) independent variables as being linear, i.e. able to be adequately modeled by a straight line. The goal, in other terms, is to find the “best-fitting” line for the data as described by the equation

\[ y = X \beta + \epsilon \]

where \(y\) is the dependent variable, \(X\) is the (matrix of) independent variable(s), and \(\beta\) is the (vector of) slop coefficient(s) describing the impact of \(X\) on \(y\), recognizing that there will be some \(\epsilon\) (random) error in the model.

Before going back to the unemployment example, it is useful to think of this in a simulated setting. First, let’s remind ourselves what a perfect linear relationship might look like:

vote_share <- seq(100,0,length=6)
unemp_rate <- seq(0,100,length=6)

ex1 <- data.frame(vote_share,
                 unemp_rate,
                 ex = "ex1")

ggplot(ex1, aes(x=unemp_rate,y=vote_share)) +
  geom_point() +
  geom_line()

Intuitively, the relationship given here takes the form

\[ y = 100 - 1 x \]

where the slope/coefficient value is -1 and the intercept value is 100. We might also imagine some other relationship as follows…

vote_share <- seq(100,0,length=6)
unemp_rate <- seq(0,50,length=6)

ex2 <- data.frame(vote_share,
                 unemp_rate,
                 ex = "ex2")

ex <- rbind(ex1,ex2)

ggplot(ex, aes(x=unemp_rate,y=vote_share, color = ex)) +
  geom_point() +
  geom_line()

where now the (blue) relationship is

\[ y = 100 - 2 x \]

where the slope/coefficient value is -2 and the intercept value is 100. How might these two relationships be interpreted?

Interpretation

A regression coefficient tells us how changing the independent variable \(x\) by one unit changes the dependent variable \(y\). Technically, it is the slope of the best-fitting line that we can draw through all data points where we minimize the sum of squared errors of that regression line.

Here is a bit more of a “realistic” simulated example.

nobs <- 150
x <- rnorm(nobs)
y <- 70 - 1.3*x + rnorm(nobs)

ex <- data.frame(y,x)

ggplot(ex, aes(x=x,y=y)) +
  geom_point()

install.packages("plotly", repos = "http://cran.us.r-project.org")
## 
## The downloaded binary packages are in
##  /var/folders/4n/jrzd6fkx7mn9rgtyb7vz7c4c0000gn/T//RtmpBCiNvW/downloaded_packages

Now that we have those basics out of the way, what do we learn from the first relationship?

library(ggplot2)
ggplot(unemp, aes(x=unemployment_rate_election, y=election_margin)) +
  geom_point() +
  ggtitle("Unemployment Rate and Incumbent Vote Margin") +
  xlab("Unemployment (%) at Time of Election") +
  ylab("Margin of Victory or Defeat (%)") +
  geom_smooth(method="lm")

What is the direction of the correlation? The magnitude? How close does the data fit the line?

lm(election_margin ~ unemployment_rate_election, data = unemp)
## 
## Call:
## lm(formula = election_margin ~ unemployment_rate_election, data = unemp)
## 
## Coefficients:
##                (Intercept)  unemployment_rate_election  
##                    4.15243                    -0.07257
cor(unemp$election_margin,unemp$unemployment_rate_election)
## [1] -0.02119523

What about for the change in unemployment rate?

unemp$unemployment_change <- unemp$unemployment_rate_election - unemp$unemployment_rate_start

ggplot(unemp, aes(x = unemployment_change, y = election_margin)) +
  geom_point() +
  ggtitle("Change in Unemployment Rate and Incumbent Vote Margin") +
  xlab("Change in Unemployment (%) at Time of Election vs Start of Term") +
  ylab("Margin of Victory or Defeat (%)") +
  geom_smooth(method="lm")
## `geom_smooth()` using formula = 'y ~ x'

my_own_creation <- lm(election_margin ~ unemployment_change, data = unemp)
cor(unemp$election_margin, unemp$unemployment_change)
## [1] -0.349543

Multiple Regression (and some pitfalls)

So what else could be going on?

  • We may have omitted a relevant independent variable!
    • Cant (easily) plot the relationship…
    • BUT … we can still calculate slope coefficients, correlations, etc

Mathematically, this leads to multiple regression which takes the form

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n + \epsilon = X \beta + \epsilon \] Now each coefficient shows the effect of a one unit change in \(x\) on \(y\) “irrespective of” or “controlling for” the other variables. Let’s think about this through simulation for a moment…

library(MASS)
nobs <- 100
corm <- matrix(c(1,0.5,0.5,1),nrow=2,byrow=T)

X <- mvrnorm(nobs,rep(0,2),corm)
betas <- c(-1,3)

y <- rnorm(nobs, X %*% betas)

d <- data.frame(y,X)

X <- cbind(1,X)

solve(t(X) %*% X) %*% t(X) %*% y
##             [,1]
## [1,] -0.09773007
## [2,] -0.96591876
## [3,]  3.05344238
summary(lm(y ~ ., data = d))
## 
## Call:
## lm(formula = y ~ ., data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.34068 -0.78611  0.04115  0.54040  2.80682 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.09773    0.10080   -0.97    0.335    
## X1          -0.96592    0.11167   -8.65  1.1e-13 ***
## X2           3.05344    0.10649   28.67  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9821 on 97 degrees of freedom
## Multiple R-squared:  0.8984, Adjusted R-squared:  0.8963 
## F-statistic: 428.9 on 2 and 97 DF,  p-value: < 2.2e-16

Great, we are able to get the correct coefficients! What happens if we don’t include X2?

lm(y ~ X1, data = d)
## 
## Call:
## lm(formula = y ~ X1, data = d)
## 
## Coefficients:
## (Intercept)           X1  
##     -0.5763       0.5838

Oh no! That’s terribly wrong, our first pitfall and a great example of omitted variable bias a topic for a different course.

Our second pitfall has to do with the nature of the data itself – with statistical analysis we believe that there is some random noise in our data generating the error term. This has a direct bearing on how confident we are in our estimates!

Exercises

Before we get ahead of ourselves, we want to make sure that you have fundamentals in order. Do the following:

Write a script which…

  1. Downloads the following data:
  2. Run different commands that help you answer the following questions:
    • How many status updates has NYU Abu Dhabi posted on its page?
    • What is the average number of likes and comments that its post receive?
    • What is the maximum number of likes and comments that its posts receive?
    • What was the content of the last status update of 2014?
  3. Download the following data:
  4. Run different commands that help you answer the following questions:
    • What type of post (photo, status, link, video) is the most common on Donald Trump’s Facebook page?
    • Using linear regression, do posts that contain photos receive more likes on average than links?
    • Using linear regression, are more liked posts also more shared?
    • Create a plot that displays the total number of comments on the page each month. Do you notice any trends? Come up with an explanation (and support it with data).

Save and email your working R script to aae322@nyu.edu by the end of the lab session.

Hint! These data are saved as .csv files so you’ll have to read them in with something like the following:

d <- read.csv("https://www.dropbox.com/s/kyzun4zl7c12j5z/lab2_nyu_data.csv?dl=1")

  1. This lab is partially based off of materials provided by Sean Kates, Pablo Barbera, and Drew Dimmery.↩︎

---
title: ""
author: "Christopher Schwarz, edits by Alia ElKattan"
pages:
  extra: true
output: 
  html_document:
    toc: true
    toc_depth: 3
    toc_float: true
    code_download: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Lab 2: Statistics in R^[This lab is partially based off of materials provided by Sean Kates, Pablo Barbera, and Drew Dimmery.]

## Introduction

The goal of this lecture is to get us working with how to answer questions from data.   In general, we will think that we have some relationship of interest between an **independent variable** ($x$) and a **dependent variable** ($y$) which are thought to be bound up in some causal relationship.

$$ x \rightarrow y$$

1. We want to move from concrete questions to the appropriate data
2. The primary workhorse we will work up to is *linear regression*
    * If we change $x$, what happens to $y$?
    * What is the (partial) correlation between these variables?
3. Confidence and Hypothesis Testing
    * What can we say (and what can we not say) about any found relationship?
4. Can we go back to the data to find better answers after learning empirical relationships?
5. Causality: Does $x$ cause $y$ or are they only correlated?


## A Concrete Example

What is the relationship between the unemployment rate and the % votes for an incumbent?  I.e.

$$ \text{Unemployment Rate} \rightarrow \text{% Vote for Incumbent ?} $$
A quotation of interest: "No American president since Franklin Delano Roosevelt has won a second term in office when the unemployment rate on Election Day topped 7.2 percent."  [Binyamin Appelbaum (NYT) 2011](https://www.nytimes.com/2011/06/02/business/economy/02jobs.html).

So what is the effect of the unemployment rate on incumbent vote share?  Do we even need to ask?  How might we go about answering?

**FIRST** we need to establish what *data* we need to collect to take a first pass at the relationship.

* **Dependent Variable**: Incumbent Vote Share
* **Independent Variable**: Unemployment Rate

Lucky for us, this data was made easily available by [FiveThirtyEight](https://fivethirtyeight.com/features/on-the-maddeningly-inexact-relationship-between-unemployment-and-re-election/)!  Let's load it into our current R environment:

```{r}
unemp <- read.csv("https://www.dropbox.com/s/k4wq9idqoxa2ha0/unemployment.csv?dl=1")
head(unemp)
```
Let's take a look at a scatterplot of the two variables of interest, the unemployment rate at the start of the election versus the election margin:

```{r message=FALSE, warning=FALSE}
library(ggplot2)
ggplot(unemp, aes(x=unemployment_rate_election, y=election_margin)) +
  geom_point() +
  ggtitle("Unemployment Rate and Incumbent Vote Margin") +
  xlab("Unemployment (%) at Time of Election") +
  ylab("Margin of Victory or Defeat (%)")
```

At least visually, there doesn't appear to be a particularly clear relationship.  What if we looked at the change in unemployment instead?

```{r}
unemp$unemployment_change <- unemp$unemployment_rate_election - unemp$unemployment_rate_start

ggplot(unemp, aes(x = unemployment_change, y = election_margin)) +
  geom_point() +
  ggtitle("Change in Unemployment Rate and Incumbent Vote Margin") +
  xlab("Change in Unemployment (%) at Time of Election vs Start of Term") +
  ylab("Margin of Victory or Defeat (%)")
```

Obviously neither relationship is particularly clear visually.  What can we do to take a more rigorous look?

## Linear Regression

Linear regression is the work-horse statistical model within political science and beyond. It essentially models the relationship between a dependent variable and one (or more) independent variables as being linear, i.e. able to be adequately modeled by a straight line. The goal, in other terms, is to find the "best-fitting" line for the data as described by the equation

$$ y = X \beta + \epsilon $$

where $y$ is the dependent variable, $X$ is the (matrix of) independent variable(s), and $\beta$ is the (vector of) slop coefficient(s) describing the impact of $X$ on $y$, recognizing that there will be some $\epsilon$ (random) error in the model.

Before going back to the unemployment example, it is useful to think of this in a simulated setting.  First, let's remind ourselves what a perfect linear relationship might look like:

```{r}
vote_share <- seq(100,0,length=6)
unemp_rate <- seq(0,100,length=6)

ex1 <- data.frame(vote_share,
                 unemp_rate,
                 ex = "ex1")

ggplot(ex1, aes(x=unemp_rate,y=vote_share)) +
  geom_point() +
  geom_line()

```

Intuitively, the relationship given here takes the form

$$ y = 100 - 1 x $$

where the slope/coefficient value is -1 and the intercept value is 100.  We might also imagine some other relationship as follows...

```{r}
vote_share <- seq(100,0,length=6)
unemp_rate <- seq(0,50,length=6)

ex2 <- data.frame(vote_share,
                 unemp_rate,
                 ex = "ex2")

ex <- rbind(ex1,ex2)

ggplot(ex, aes(x=unemp_rate,y=vote_share, color = ex)) +
  geom_point() +
  geom_line()

```

where now the (blue) relationship is 

$$ y = 100 - 2 x $$

where the slope/coefficient value is -2 and the intercept value is 100.  How might these two relationships be interpreted?

### Interpretation

A regression coefficient tells us how changing the independent variable $x$ **by one unit** changes the dependent variable $y$.  Technically, it is the **slope** of the best-fitting line that we can draw through all data points where we minimize the **sum of squared errors** of that regression line.

Here is a bit more of a "realistic" simulated example.

```{r}
nobs <- 150
x <- rnorm(nobs)
y <- 70 - 1.3*x + rnorm(nobs)

ex <- data.frame(y,x)

ggplot(ex, aes(x=x,y=y)) +
  geom_point()

```



```{r message=FALSE, warning=FALSE}
install.packages("plotly", repos = "http://cran.us.r-project.org")
```

Now that we have those basics out of the way, what do we learn from the first relationship?

```{r message=FALSE, warning=FALSE}
library(ggplot2)
ggplot(unemp, aes(x=unemployment_rate_election, y=election_margin)) +
  geom_point() +
  ggtitle("Unemployment Rate and Incumbent Vote Margin") +
  xlab("Unemployment (%) at Time of Election") +
  ylab("Margin of Victory or Defeat (%)") +
  geom_smooth(method="lm")
```

What is the direction of the correlation?  The magnitude?  How close does the data fit the line?

```{r}
lm(election_margin ~ unemployment_rate_election, data = unemp)
cor(unemp$election_margin,unemp$unemployment_rate_election)
```

What about for the change in unemployment rate?

```{r}
unemp$unemployment_change <- unemp$unemployment_rate_election - unemp$unemployment_rate_start

ggplot(unemp, aes(x = unemployment_change, y = election_margin)) +
  geom_point() +
  ggtitle("Change in Unemployment Rate and Incumbent Vote Margin") +
  xlab("Change in Unemployment (%) at Time of Election vs Start of Term") +
  ylab("Margin of Victory or Defeat (%)") +
  geom_smooth(method="lm")
```

```{r}
my_own_creation <- lm(election_margin ~ unemployment_change, data = unemp)
cor(unemp$election_margin, unemp$unemployment_change)
```

### Multiple Regression (and some pitfalls)

So what else could be going on?

+ We may have omitted a relevant independent variable!
    + Cant (easily) plot the relationship...
    + BUT ... we can still calculate slope coefficients, correlations, etc
    
Mathematically, this leads to **multiple regression** which takes the form

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n + \epsilon = X \beta + \epsilon $$
Now each coefficient shows the effect of a one unit change in $x$ on $y$ "irrespective of" or "controlling for" the other variables.  Let's think about this through simulation for a moment...

```{r message=FALSE, warning=FALSE}
library(MASS)
nobs <- 100
corm <- matrix(c(1,0.5,0.5,1),nrow=2,byrow=T)

X <- mvrnorm(nobs,rep(0,2),corm)
betas <- c(-1,3)

y <- rnorm(nobs, X %*% betas)

d <- data.frame(y,X)

X <- cbind(1,X)

solve(t(X) %*% X) %*% t(X) %*% y

summary(lm(y ~ ., data = d))

```

Great, we are able to get the correct coefficients!  What happens if we don't include X2?

```{r}
lm(y ~ X1, data = d)
```

Oh no!  That's terribly wrong, our **first pitfall** and a great example of **omitted variable bias** a topic for a different course.  


Our **second pitfall** has to do with the nature of the data itself -- with statistical analysis we believe that there is some random noise in our data generating the error term.  This has a direct bearing on how confident we are in our estimates! 


## Exercises

Before we get ahead of ourselves, we want to make sure that you have fundamentals in order.  Do the following:

Write a script which...

1. Downloads the following data:
    + https://www.dropbox.com/s/kyzun4zl7c12j5z/lab2_nyu_data.csv?dl=1
2. Run different commands that help you answer the following questions:
    + How many status updates has NYU Abu Dhabi posted on its page?
    + What is the average number of likes and comments that its post receive?
    + What is the maximum number of likes and comments that its posts receive?
    + What was the content of the last status update of 2014?
3. Download the following data:
    + https://www.dropbox.com/s/hnqmb8fyqla3egx/lab2_trump_data.csv?dl=1
4. Run different commands that help you answer the following questions:
    + What type of post (photo, status, link, video) is the most common on Donald Trump's Facebook page?
    + Using linear regression, do posts that contain photos receive more likes on average than links?
    + Using linear regression, are more liked posts also more shared?  
    + Create a plot that displays the total number of comments on the page each month.  Do you notice any trends?  Come up with an explanation (and support it with data).
    
Save and email your working R script to aae322\@nyu.edu by the end of the lab session.


Hint!  These data are saved as .csv files so you'll have to read them in with something like the following:
```{r}
d <- read.csv("https://www.dropbox.com/s/kyzun4zl7c12j5z/lab2_nyu_data.csv?dl=1")


```

