Bivariate Linear Regression

Joe Ripberger

So far, we have covered…

Two-sample t-test (difference of means)
- DV: continuous variable
- IV: dichotomous categorical variable
Two-proportion z-test (difference of proportions)
- DV: dichotomous categorical variable
- IV: dichotomous categorical variable
Covariance and correlation
- DV: continuous variable
- IV: continuous variable

Today we add…

Two-sample t-test (difference of means)
- DV: continuous variable
- IV: dichotomous categorical variable
Two-proportion z-test (difference of proportions)
- DV: dichotomous categorical variable
- IV: dichotomous categorical variable
Covariance and correlation
- DV: continuous variable
- IV: continuous variable
Bivariate linear (simple) regression
- DV: continuous variable
- IV: continuous variable

Example Research

Research question: why are some countries democracies whereas others are not?
Theory: economic development causes democratization
- Modernization theory (Lipset 1963; Przeworski et al. 2000)
Hypothesis: there is a positive relationship between economic development and democracy—more “developed” countries will be more democratic than less developed countries (and vice versa)
Data:
- Democracy (dependent variable): 0-100 rating based on the Freedom House composite index
- Economic development (independent variable): per capita GDP (in US $), based on UN statistics
Unit of Analysis: Country in 2000 (n = 188)

Descriptive Statistics

# A tibble: 2 × 8
  name       min   max   mean median     sd skewness kurtosis
  <chr>    <dbl> <dbl>  <dbl>  <dbl>  <dbl>    <dbl>    <dbl>
1 fh_index  14.3   100   64.6   67.8   28.3   -0.271     1.68
2 gdp       92   45001 6060.  1728   9297.     2.00      6.24

Descriptive Statistics

# A tibble: 3 × 8
  name       min     max    mean  median      sd skewness kurtosis
  <chr>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>    <dbl>    <dbl>
1 fh_index 14.3    100     64.6    67.8    28.3    -0.271     1.68
2 gdp      92    45001   6060.   1728    9297.      2.00      6.24
3 log_gdp   4.52    10.7    7.55    7.45    1.62    0.167     2.00

Development and Democracy

ds %>% 
  summarise(cov = cov(fh_index,  log_gdp), 
            cor = cor(fh_index,  log_gdp))

# A tibble: 1 × 2
    cov   cor
  <dbl> <dbl>
1  25.1 0.550

cor.test(ds$fh_index,  ds$log_gdp)


    Pearson's product-moment correlation

data:  ds$fh_index and ds$log_gdp
t = 8.9739, df = 186, p-value = 0.0000000000000003101
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4412788 0.6422632
sample estimates:
      cor 
0.5496762

Development and Democracy

On average, there is a positive and statistically significant relationship between development and democracy—as development increases, democracy seems to increase.

By how much?

Bivariate Linear Regression

We use bivariate (simple) linear regression to estimate the slope (and intercept) of the line that “best fits” the data; we use this information to:
1. Identify the presence (or absence) of a relationship between variables
2. Identify the direction of the relationship (positive or negative)
3. Identify the strength of the relationship—how much does y change when x changes?
4. Make predictions—if x is 7, what should y be?

Bivariate Linear Regression

Bivariate linear regression models take the following form: y_i=\alpha+\beta{x}_i+\varepsilon_i, where:
- y is the dependent variable
- i indexes the units of analysis (observations)
- \alpha is the intercept
- \beta is the slope
- x is the independent variable
- \varepsilon is the error term

Bivariate Linear Regression: Slope and Intercept

y_i=\alpha+\beta{x}_i
- \beta = \frac{\Delta{y}}{\Delta{x}} = \frac{change_y}{change_x} = \frac{rise}{run}

Bivariate Linear Regression: Residuals (Errors)

y_i = \alpha + \beta{x}_i + \varepsilon_i
- \varepsilon_i = y_i-\hat{y}_i = y_i-(\alpha+\beta{x}_i)

Bivariate Linear Regression: Estimating the Slope and Intercept

The goal of bivariate linear regression is to estimate a line (slope and intercept) that minimizes the errors (residuals)
We accomplish this using the ordinary least squares (OLS) method to find the \hat{\alpha} and \hat{\beta} that minimize the sum of squared errors (SSE)
\min_{\hat{\alpha}, \hat{\beta}} \,\operatorname{SSE}\left(\hat{\alpha}, \hat{\beta}\right) \equiv \min_{\hat{\alpha}, \hat{\beta}} \sum_{i=1}^n \left(y_{i} - (\hat{\alpha} + \hat{\beta}x_i)\right)^2
- \hat{\beta}=\frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^n(x_i-\bar{x})^2}=\frac{Cov(x,y)}{Var(x)}
- \hat{\alpha}=\bar{y}-\hat{\beta}{\bar{x}}
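As a quick numerical check of these formulas, the sketch below simulates a small data set, computes \hat{\alpha} and \hat{\beta} in closed form, and confirms that nudging either estimate only increases the SSE. The "true" intercept (2), slope (0.5), sample size, and seed are arbitrary choices for illustration, not values from the lecture.

```r
# Simulated data; the "true" values below are arbitrary illustration choices
set.seed(42)
x <- runif(50, min = 0, max = 10)
y <- 2 + 0.5 * x + rnorm(50)

# Closed-form OLS estimates
beta_hat  <- cov(x, y) / var(x)
alpha_hat <- mean(y) - beta_hat * mean(x)

# Sum of squared errors for any candidate line
sse <- function(alpha, beta) sum((y - (alpha + beta * x))^2)

sse_ols <- sse(alpha_hat, beta_hat)

# Moving away from the OLS estimates in either direction raises the SSE
sse_nudged <- c(
  sse(alpha_hat + 0.1, beta_hat),
  sse(alpha_hat - 0.1, beta_hat),
  sse(alpha_hat, beta_hat + 0.1),
  sse(alpha_hat, beta_hat - 0.1)
)
all(sse_nudged > sse_ols)

# lm() returns the same estimates as the closed-form expressions
coef(lm(y ~ x))
```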

Bivariate Linear Regression: Try it Out!

What are the slope and intercept of this line?
- x = \{1.5, 3.0, 4.0, 6.2, 9.4\}
- y = \{1.2, 2.5, 8.0, 7.4, 8.3\}

Bivariate Linear Regression: Try it Out!

What are the slope and intercept of this line?

data <- data.frame(x = c(1.5, 3.0, 4.0, 6.2, 9.4), y = c(1.2, 2.5, 8.0, 7.4, 8.3))
(beta <- cov(data$x, data$y) / var(data$x))

[1] 0.8744721

(alpha <- mean(data$y) - (beta * mean(data$x)))

[1] 1.265044

lm(y ~ x, data = data)


Call:
lm(formula = y ~ x, data = data)

Coefficients:
(Intercept)            x  
     1.2650       0.8745

\hat{y} = 1.27 + 0.87x
- If x is 7, what is the prediction for y?
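One way to answer the prediction question is to plug x = 7 into the fitted line, either by hand or with predict(). The sketch below rebuilds the toy data from the exercise; the object name fit_toy is just a label chosen here.

```r
# Toy data from the exercise
data <- data.frame(x = c(1.5, 3.0, 4.0, 6.2, 9.4),
                   y = c(1.2, 2.5, 8.0, 7.4, 8.3))
fit_toy <- lm(y ~ x, data = data)

# Prediction by hand: alpha-hat + beta-hat * 7
pred_hand <- unname(coef(fit_toy)[1] + coef(fit_toy)[2] * 7)

# Same prediction with predict()
pred_fun <- unname(predict(fit_toy, newdata = data.frame(x = 7)))

round(c(pred_hand, pred_fun), 2)  # both about 7.39
```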

Development and Democracy

On average, there is a positive and statistically significant relationship between development and democracy—as development increases, democracy seems to increase.

By how much?

Development and Democracy

(beta <- cov(ds$log_gdp, ds$fh_index) / var(ds$log_gdp))

[1] 9.607476

(alpha <- mean(ds$fh_index) - (beta * mean(ds$log_gdp)))

[1] -7.871157

lm(fh_index ~ log_gdp, data = ds)


Call:
lm(formula = fh_index ~ log_gdp, data = ds)

Coefficients:
(Intercept)      log_gdp  
     -7.871        9.607

\hat{y} = -7.87 + 9.61x
There is a positive relationship between development and democracy—the estimate indicates that a one unit increase in log GDP corresponds with a 9.61 point increase on the Freedom House index.
If log_gdp is 5, what is the prediction for fh_index?
- What about when log_gdp is 10?
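These predictions can be worked out directly from the estimated line. The sketch below uses the coefficients printed above rather than the original data, and it assumes log_gdp is a natural log, so exp() converts it back to dollars.

```r
# Coefficients as reported on the slide
alpha <- -7.871157
beta  <- 9.607476

log_gdp <- c(5, 10)
fh_hat  <- alpha + beta * log_gdp   # predicted Freedom House scores

data.frame(log_gdp = log_gdp,
           gdp     = exp(log_gdp),  # back to US $ (assumes natural log)
           fh_hat  = round(fh_hat, 1))
```

The predicted scores come out to roughly 40.2 and 88.2.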

Bivariate Linear Regression: Model Fit (RSE)

How well does the line “fit” (describe) the data?
- To answer this question, we analyze the error term (residuals)—how close are the values we observe to the values we predict (the regression line)?
- Residual standard error (RSE): SE(\hat{\varepsilon})=\sqrt{\frac{1}{n-2}\sum_{i=1}^n(y_i-\hat{y}_i)^2}

Bivariate Linear Regression: Model Fit (r^2)

Coefficient of determination: r^2=1-\frac{SS_{res}}{SS_{tot}}
- Residual sum of squares: SS_{res}=\sum_{i=1}^n(y_i-\hat{y}_i)^2
- Total sum of squares: SS_{tot}=\sum_{i=1}^n(y_i-\bar{y})^2
Measure of how well the regression line approximates the real data points
- r^2 of 1 indicates that the regression line perfectly fits the data
Often interpreted as the proportion of variation in y that is “explained” by x; an r^2 of 1 indicates that x explains 100% of the variation in y; why?
- Ask Sal Khan from Khan Academy
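Both fit measures can be computed by hand from the residuals. The sketch below does this for the five-point toy data from the try-it-out exercise and compares the results with what summary() reports; fit_toy is a name chosen here for illustration.

```r
# Toy data from the try-it-out exercise
data <- data.frame(x = c(1.5, 3.0, 4.0, 6.2, 9.4),
                   y = c(1.2, 2.5, 8.0, 7.4, 8.3))
fit_toy <- lm(y ~ x, data = data)
n <- nrow(data)

y_hat  <- fitted(fit_toy)
ss_res <- sum((data$y - y_hat)^2)         # residual sum of squares
ss_tot <- sum((data$y - mean(data$y))^2)  # total sum of squares

rse <- sqrt(ss_res / (n - 2))  # residual standard error
r2  <- 1 - ss_res / ss_tot     # coefficient of determination

# Compare with summary(); in a bivariate model r^2 is also the squared correlation
c(rse = rse, sigma = summary(fit_toy)$sigma)
c(r2 = r2, r.squared = summary(fit_toy)$r.squared, cor_sq = cor(data$x, data$y)^2)
```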

Bivariate Linear Regression: Model Fit

RSE = 0.00; R^2 = 1.00
RSE = 2.87; R^2 = 0.87
RSE = 6.72; R^2 = 0.50
RSE = 43.42; R^2 = 0.06

Development and Democracy

\widehat{FH}_i = -7.87 + 9.61 \times \log(GDP_i)
How well does the model fit the data?

Development and Democracy

fit <- lm(fh_index ~ log_gdp, data = ds)
summary(fit)


Call:
lm(formula = fh_index ~ log_gdp, data = ds)

Residuals:
    Min      1Q  Median      3Q     Max 
-64.778 -15.198   5.812  16.900  46.838 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept)   -7.871      8.263  -0.953               0.342    
log_gdp        9.607      1.071   8.974 0.00000000000000031 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 23.67 on 186 degrees of freedom
Multiple R-squared:  0.3021,    Adjusted R-squared:  0.2984 
F-statistic: 80.53 on 1 and 186 DF,  p-value: 0.0000000000000003101
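Two pieces of this output tie back to the earlier correlation test. In a bivariate regression, R-squared is the squared Pearson correlation and the F-statistic is the squared t-statistic of the slope. A small check using only the numbers printed on the slides:

```r
r      <- 0.5496762  # correlation reported by cor.test()
t_beta <- 8.9739     # t statistic reported by cor.test() (8.974 in summary())

r^2       # compare with Multiple R-squared: 0.3021
t_beta^2  # compare with F-statistic: 80.53
```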

Development and Democracy

There is a positive relationship between development and democracy: a one-unit increase in log(GDP), which corresponds to multiplying per capita GDP by about 2.7, is associated with a 9.6-point increase in Freedom House score.
The model fits the data reasonably well: approximately 30% of the variation in Freedom House scores is “explained” by log(GDP), and, if the residuals are roughly normally distributed, about 68% of the Freedom House scores that we observe in the data fall within 24 points (one residual standard error) of the score that the model predicts.

Bivariate Linear Regression: Inference

Goal: estimate unknown population parameters using sample statistics as point estimates for the unknown population parameters
In bivariate linear regression, the intercept (\alpha) and slope (\beta) of the regression line are the unknown population parameters that we have to estimate; as always, we estimate these parameters in two steps:
1. Calculate point estimates for \alpha (\hat{\alpha}) and \beta (\hat{\beta})
2. Quantify the uncertainty around the point estimates by calculating standard errors for \hat{\alpha} and \hat{\beta}
With this information, we can calculate CIs, p-values, and test hypotheses
- H_0:\beta=0
- H_A:\beta \neq 0

Bivariate Linear Regression: Standard Errors

SE(\hat{\varepsilon})=\sqrt{\frac{1}{n-2}\sum_{i=1}^n(y_i-\hat{y}_i)^2}
SE(\hat{\beta})=\sqrt{\frac{1}{n-2}\frac{\sum_{i=1}^n(y_i-\hat{y}_i)^2}{\sum_{i=1}^n (x_i -\bar{x})^2}}
SE(\hat{\alpha})=SE(\hat{\beta})\sqrt{\frac{1}{n}\sum_{i=1}^n x_i^2}
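These formulas can be checked against lm(). The sketch below applies them to the five-point toy data from the try-it-out exercise and compares the results with the standard errors that summary() reports; fit_toy and the se_* names are labels chosen here.

```r
# Toy data from the try-it-out exercise
data <- data.frame(x = c(1.5, 3.0, 4.0, 6.2, 9.4),
                   y = c(1.2, 2.5, 8.0, 7.4, 8.3))
fit_toy <- lm(y ~ x, data = data)
n <- nrow(data)

# Residual standard error
rse <- sqrt(sum(resid(fit_toy)^2) / (n - 2))

# Standard errors of the slope and intercept
se_beta  <- rse / sqrt(sum((data$x - mean(data$x))^2))
se_alpha <- se_beta * sqrt(sum(data$x^2) / n)

# By hand vs. summary()
c(se_alpha = se_alpha, se_beta = se_beta)
coef(summary(fit_toy))[, "Std. Error"]
```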

Bivariate Linear Regression: Inference

To draw inference (and test hypotheses) using bivariate regression coefficients, we follow these steps:
1. Calculate the point estimate (\hat{\beta})
2. Calculate the standard error of the point estimate (SE(\hat{\beta}))
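3. Calculate the confidence interval: 95\%CI = \hat{\beta} \pm t_{0.975,\,n-2} \times SE(\hat{\beta}), where t_{0.975,\,n-2} is the critical value of the t distribution with n-2 degrees of freedom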
4. Calculate the t-statistic: t=\frac{\hat{\beta}-0}{SE(\hat{\beta})}
5. Calculate the p-value
Is the coefficient statistically different from zero (“statistically significant”)?
- H_0:\beta=0
- H_A:\beta \neq 0

Development and Democracy

\widehat{FH}_i = -7.87 + 9.61 \times \log(GDP_i)
How well does the model fit the data?

Development and Democracy: Inference

(beta <- 9.607)

[1] 9.607

(beta_se <- 1.071)

[1] 1.071

(beta_ci <- c(beta - 1.96 * beta_se, beta + 1.96 * beta_se))

[1]  7.50784 11.70616

(beta_t <- (beta - 0) / beta_se)

[1] 8.970121

(pnorm(beta_t, lower.tail = FALSE))

[1] 0.0000000000000000001480939
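The calculation above uses the normal approximation (1.96 as the critical value and an upper-tail pnorm()). With 186 degrees of freedom that is close, but summary() and confint() use the t distribution and a two-sided p-value. A minimal sketch of the t-based version, using the rounded estimate and standard error from the slide:

```r
beta    <- 9.607
beta_se <- 1.071
df      <- 186  # n - 2 = 188 - 2

# Critical value from the t distribution instead of 1.96
t_crit  <- qt(0.975, df = df)
beta_ci <- c(beta - t_crit * beta_se, beta + t_crit * beta_se)

# t statistic and two-sided p-value for H_0: beta = 0
beta_t  <- (beta - 0) / beta_se
p_value <- 2 * pt(abs(beta_t), df = df, lower.tail = FALSE)

beta_ci
p_value
```

Because the inputs are rounded, the interval should agree with confint(fit) only to about two decimal places.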

Development and Democracy: Inference

fit <- lm(fh_index ~ log_gdp, data = ds)
summary(fit)


Call:
lm(formula = fh_index ~ log_gdp, data = ds)

Residuals:
    Min      1Q  Median      3Q     Max 
-64.778 -15.198   5.812  16.900  46.838 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept)   -7.871      8.263  -0.953               0.342    
log_gdp        9.607      1.071   8.974 0.00000000000000031 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 23.67 on 186 degrees of freedom
Multiple R-squared:  0.3021,    Adjusted R-squared:  0.2984 
F-statistic: 80.53 on 1 and 186 DF,  p-value: 0.0000000000000003101

confint(fit)

                 2.5 %    97.5 %
(Intercept) -24.172589  8.430274
log_gdp       7.495388 11.719564

Development and Democracy: Prediction

We use bivariate (simple) linear regression to estimate the slope (and intercept) of the line that “best fits” the data; we use this information to:
1. Identify the presence (or absence) of a relationship between variables
2. Identify the direction of the relationship (positive or negative)
3. Identify the strength of the relationship
4. Make predictions

Development and Democracy: Prediction

fit <- lm(fh_index ~ log_gdp, data = ds)
augment(fit, newdata = tibble(log_gdp = c(5, 10)), se_fit = TRUE) %>% 
  mutate(gdp = exp(log_gdp))

# A tibble: 2 × 4
  log_gdp .fitted .se.fit    gdp
    <dbl>   <dbl>   <dbl>  <dbl>
1       5    40.2    3.23   148.
2      10    88.2    3.14 22026.

augment(fit, newdata = tibble(log_gdp = 4:11), se_fit = TRUE) %>% 
  mutate(gdp = exp(log_gdp))

# A tibble: 8 × 4
  log_gdp .fitted .se.fit     gdp
    <int>   <dbl>   <dbl>   <dbl>
1       4    30.6    4.17    54.6
2       5    40.2    3.23   148. 
3       6    49.8    2.39   403. 
4       7    59.4    1.82  1097. 
5       8    69.0    1.79  2981. 
6       9    78.6    2.32  8103. 
7      10    88.2    3.14 22026. 
8      11    97.8    4.08 59874.

Development and Democracy: Prediction

Development and Democracy: Findings

Consistent with H1, these findings indicate that there is a positive relationship between development and democracy. The relationship is both statistically and substantively significant. On average, countries that have a low per capita GDP (log GDP = 5, about $148) score relatively low on the Freedom House index (\hat{y} = 40.2), whereas countries that have a high per capita GDP (log GDP = 10, about $22,026) score significantly higher (\hat{y} = 88.2), a predicted difference of roughly 48 points on the 100-point scale of democracy. These results suggest that economic development (or lack thereof) may explain why some countries are democracies whereas others are not.