Bivariate (Ecological) Regression

Ray Block Jr., Penn State University

Winter 2021

Overview

Last class, we tested for differences in voting patterns across precinct types (homogeneous vs. not)

Difference tests are analogous to association tests

We will therefore talk extensively about associations today. Specifically, we cover:

The logic of correlation analysis
The logic of (bivariate) regression
Some examples (in RStudio) using Handley’s data

1. The Logic of Correlation

The logic of correlation

Bivariate correlations determine:

Whether 2 variables are related
If they are related, how they are related

The goal: make predictions from one variable to another

The logic of correlation

Theoretically, correlation \(=\) association

Knowing something about one variable tells you something about another variable

Statistically, correlation \(=\) co-variation

Variables are co-related if they co-vary
Changes in one accompany changes in the other

The logic of correlation

How correlations work

The logic of correlation

It goes without saying, but correlation \(\neq\) causation!

Correlations only detect a relationship between variables
They do not determine which variable (if either) causes the other

The logic of correlation

Pearson’s correlation coefficient (r_XY)

The most commonly used measure of (linear) association
r_XY = Numerical index representing the strength and direction of relatedness between X and Y
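
As a minimal sketch of what this index computes: Pearson's r is the covariance of X and Y divided by the product of their standard deviations. The toy vectors below are made up for illustration and are not taken from the Handley data.

```r
# Toy data (hypothetical values, for illustration only)
x <- c(10, 20, 30, 40, 50)
y <- c(12, 25, 29, 47, 52)

# Pearson's r "by hand": covariance scaled by both standard deviations
r_by_hand <- cov(x, y) / (sd(x) * sd(y))

# Built-in version (Pearson is the default method for cor())
r_builtin <- cor(x, y)

print(c(by_hand = r_by_hand, builtin = r_builtin))
```

The two numbers should agree, which is a quick way to check that the built-in function is doing what the definition says.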

The logic of correlation

Pearson’s correlation coefficient (r_XY)

r_XY always ranges from -1 to +1
- -1 \(=\) perfect inverse correlation between X and Y
- 0 \(=\) no linear relationship between X and Y
- +1 \(=\) perfect positive correlation between X and Y
r_XY can take on any value between these extremes
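
The range of r_XY can be illustrated with a short sketch. The vectors are invented for this example: two exact straight lines (one rising, one falling) and one line with hand-picked noise added.

```r
x <- 1:10

# Exact increasing line: r equals +1 (up to floating-point rounding)
r_pos <- cor(x, 3 + 2 * x)

# Exact decreasing line: r equals -1 (up to floating-point rounding)
r_neg <- cor(x, 3 - 2 * x)

# Same increasing line plus hand-picked noise: r falls between 0 and +1
y_noisy <- 3 + 2 * x + c(4, -5, 6, -3, 5, -6, 2, -4, 5, -4)
r_mid <- cor(x, y_noisy)

print(c(r_pos = r_pos, r_neg = r_neg, r_mid = r_mid))
```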

The logic of correlation

Pearson’s correlation coefficient (r_XY)

r_XY does not change if the independent (X) and dependent (Y) variables are interchanged
The ordering of variables does not matter because this is a measure of correlation, not causation
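
A small sketch of this symmetry, using made-up toy vectors: swapping the arguments to cor() leaves r unchanged, whereas a regression slope (covered in the next section) does depend on which variable is treated as Y.

```r
# Toy data (hypothetical values, for illustration only)
x <- c(10, 20, 30, 40, 50)
y <- c(12, 25, 29, 47, 52)

# Correlation is symmetric: the order of the variables does not matter
r_xy <- cor(x, y)
r_yx <- cor(y, x)

# Regression is not symmetric: the slope changes when X and Y swap roles
slope_y_on_x <- unname(coef(lm(y ~ x))["x"])
slope_x_on_y <- unname(coef(lm(x ~ y))["y"])

print(c(r_xy = r_xy, r_yx = r_yx))
print(c(slope_y_on_x = slope_y_on_x, slope_x_on_y = slope_x_on_y))
```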

The logic of correlation

Interpreting r_XY:

Sign represents the direction of the relationship
Absolute value represents strength of the relationship

The logic of correlation

Interpreting r_XY:

Scatterplots are great for visualizing correlations

2. The Logic of (Bivariate) Regression

The Logic of (Bivariate) Regression

Applying correlation statistics to prediction problems

r_XY tells us the strength and direction of the relationship between two variables.
- If r_XY is strong, then we can use information about values of X to predict values of Y.
- If r_XY has a positive (negative) sign, then the direction of relationship is positive (inverse).

The Logic of (Bivariate) Regression

Applying correlation statistics to prediction problems

The shape of the relationship modeled in r_XY is linear
- r_XY measures how well a straight line describes the values of the Y variable across the range of X values
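
Because r_XY only captures straight-line patterns, it can miss a strong relationship that is curved. The sketch below uses an invented example in which Y is completely determined by X (Y is X squared), yet the correlation is zero.

```r
# X is symmetric around zero; Y is an exact (but non-linear) function of X
x <- -5:5
y <- x^2

# Pearson's r is zero: no straight line describes this U-shaped pattern
r_curved <- cor(x, y)
print(r_curved)
```

This is one reason to look at a scatterplot before relying on the coefficient alone.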

The Logic of (Bivariate) Regression

Applying correlation statistics to prediction problems

If the absolute value of r_XY is close to 1, then the observed Y values all lie close to the best-fitting line
- As a result, we can use a line to predict what the values of the Y variable will be for any given value of X
- To make such a prediction, we need to know how to create the best-fitting (regression) line

The Logic of (Bivariate) Regression

Applying correlation statistics to prediction problems

X (horizontal axis) is a predictor of Y (vertical axis)
The best-fitting line minimizes the differences between the data points and the straight line
We call the vertical gaps between the data points and the line residuals

The Logic of (Bivariate) Regression

Applying correlation statistics to prediction problems

Residual: gap between the actual value of Y and the value predicted by the model
Data points above the line have positive residuals (the model under-predicts Y at that value of X)
Data points below the line have negative residuals (the model over-predicts Y at that value of X)
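
To make the sign convention concrete, here is a minimal sketch with made-up toy data. It fits a line with lm(), computes each residual as actual minus predicted, and labels whether the point sits above or below the line.

```r
# Toy data (hypothetical values, for illustration only)
x <- c(10, 20, 30, 40, 50)
y <- c(12, 25, 29, 47, 52)

fit <- lm(y ~ x)

# Residual = actual Y minus the Y predicted by the fitted line
predicted <- fitted(fit)
res <- y - predicted

# Positive residual: point above the line (under-predicted)
# Negative residual: point below the line (over-predicted)
position <- ifelse(res > 0, "above line", "below line")

print(data.frame(x, y, predicted, residual = res, position))
```

The hand-computed residuals match what residuals(fit) returns, and for a least-squares line with an intercept they sum to zero.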

The Logic of (Bivariate) Regression

Line that fits best \(\equiv\) line that makes the residuals as small as possible overall. The standard way to find such a line is the method of least squares, which minimizes the sum of squared residuals:

\[\hat{Y} = b + mX\]

where,

\(\hat{Y}\) represents values of Y predicted by the linear model
b is intercept (the value of \(\hat{Y}\) when X \(=\) 0)
m is the slope of the regression line (i.e., the change in \(\hat{Y}\) associated with a one-unit shift in X)
X is the value of the predictor variable
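
In the bivariate case, the least-squares slope and intercept have simple closed forms: \(m = \mathrm{cov}(X, Y) / \mathrm{var}(X)\) and \(b = \bar{Y} - m\bar{X}\). The sketch below, using made-up toy data, computes both by hand and compares them with what lm() reports.

```r
# Toy data (hypothetical values, for illustration only)
x <- c(10, 20, 30, 40, 50)
y <- c(12, 25, 29, 47, 52)

# Least-squares slope and intercept "by hand"
m <- cov(x, y) / var(x)
b <- mean(y) - m * mean(x)

# The same quantities from R's linear model function
fit <- lm(y ~ x)

print(c(b_by_hand = b, m_by_hand = m))
print(coef(fit))
```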

3. Some Examples of Correlation and Regression (Using the Handley Data)

Some Examples

First things first: load the necessary packages

library(readr)
library(dplyr)
library(ggplot2)

readr is for importing datasets (called “dataframes” in R)
dplyr is for manipulating variables and running analyses
ggplot2 is for data visualization

Some Examples

Now, get the Handley dataset

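library(readr)
# Import the re-coded Handley data (change the path to wherever your copy is saved)
my_data <- read_csv("C:/Users/rjb6233/Google Drive/Political Science Classes (Taught or Being Prepped)/PSU Courses/RPV Crash Course/Data/Handley/PracticeData-ReCoded.csv")
# Open the dataframe in the RStudio data viewer
View(my_data)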

I saved a re-coded version to my computer (filename: “PracticeData-ReCoded.csv”) and uploaded that
I put the re-coded data into an object called “my_data,” and I will use that from now on

Some Examples

Scatterplots (Candidate A votes by Black vs. White VAP)

ggplot(my_data, aes(x = pBlackVAP, y = pVoteA)) +
    geom_point()

ggplot(my_data, aes(x = pWhiteVAP, y = pVoteA)) +
    geom_point()

Some Examples

Scatterplots (Candidate B votes by Black vs. White VAP)

ggplot(my_data, aes(x = pBlackVAP, y = pVoteB)) +
    geom_point()

ggplot(my_data, aes(x = pWhiteVAP, y = pVoteB)) +
    geom_point()

Some Examples

Calculating correlation coefficients

H₀: r_XY \(=\) 0
H₁: r_XY \(\neq\) 0

where,

X represents our precinct characteristics variables {Black VAP vs. White VAP}
Y represents our voting patterns variables {Candidate A vs. Candidate B}

Some Examples

Calculating correlation coefficients (for Candidate A only)

cor.test(my_data$pVoteA, my_data$pBlackVAP)

cor.test(my_data$pVoteA, my_data$pWhiteVAP)
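
The object returned by cor.test() stores the pieces needed to evaluate H₀. Since the Handley file is not bundled here, the sketch below runs on simulated precincts: sim_data and the numbers that generate it are assumptions made for illustration, and only the column names mirror the real data.

```r
# Simulated stand-in for the Handley data (hypothetical values)
set.seed(2021)
n <- 50
sim_data <- data.frame(pBlackVAP = runif(n, 0, 100))
sim_data$pVoteA <- pmin(100, pmax(0, 5 + 0.9 * sim_data$pBlackVAP + rnorm(n, sd = 2)))

# Same call as on the slide, applied to the simulated data
ct <- cor.test(sim_data$pVoteA, sim_data$pBlackVAP)

ct$estimate   # the sample correlation coefficient
ct$p.value    # p-value for the test of H0: correlation = 0
ct$conf.int   # 95% confidence interval for the correlation
```

A p-value below the chosen significance level (conventionally .05) is the basis for rejecting H₀.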

Some Examples

Based on the correlation analyses, we can reject H₀

There is a statistically significant relationship between precinct characteristics and voting patterns
Put differently, our analyses indicate that r_XY \(\ne\) 0

Since our correlations are mirror images of each other (the results for Candidate A are the polar opposite of the results for Candidate B), from now on we will focus on the relationship between Black VAP levels and support for Candidate A.
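
One way to see why mirror-image results can arise is sketched below on simulated precincts. It rests on two assumptions that may not hold exactly in the Handley data: the two candidates' vote shares sum to 100, and the Black and White VAP percentages sum to 100. All values are invented.

```r
# Simulated precincts (hypothetical values)
set.seed(2021)
n <- 50
pBlackVAP <- runif(n, 0, 100)
pVoteA <- pmin(100, pmax(0, 5 + 0.9 * pBlackVAP + rnorm(n, sd = 2)))

# Assumption: a two-candidate contest, so the shares sum to 100
pVoteB <- 100 - pVoteA
# Assumption: only two groups, so the VAP percentages sum to 100
pWhiteVAP <- 100 - pBlackVAP

cors <- c(
  A_black = cor(pVoteA, pBlackVAP),
  B_black = cor(pVoteB, pBlackVAP),
  A_white = cor(pVoteA, pWhiteVAP),
  B_white = cor(pVoteB, pWhiteVAP)
)
print(round(cors, 3))
```

Under these assumptions each variable is a perfect linear reversal of its counterpart, so swapping either one flips the sign of r and leaves its magnitude unchanged.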

Some Examples

Let’s re-write the regression equation:

\[Y = \beta_{0} + \beta_{1}X + \epsilon\]

where,

Y represents the observed values of the outcome variable (vote share for Candidate A); the values predicted by the linear model are \(\hat{Y} = \beta_{0} + \beta_{1}X\)
\(\beta_{0}\) is intercept (the value of \(\hat{Y}\) when X \(=\) 0)
\(\beta_{1}\) is the slope of the regression line (i.e., the change in \(\hat{Y}\) associated with a one-unit shift in X)
X is the value of the predictor variable (% Black VAP)
\(\epsilon\) is the error (“residual”) term, i.e., the gap between Y and \(\hat{Y}\)

Some Examples

We can now test new hypotheses

H₀: \(\beta_{1}\) \(=\) 0
H₁: \(\beta_{1}\) \(\neq\) 0

The goal: Check to see if there is a statistically significant relationship between % Black voting-age population in a precinct and the vote share for A in that precinct

Some Examples

Adding a regression line to the scatterplot

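# Scatterplot of Candidate A's vote share against % Black VAP, with a fitted line
ggplot(my_data, aes(x = pBlackVAP, y = pVoteA)) +
  geom_point() +
  geom_smooth(method = "lm") +  # least-squares line (shaded confidence band by default)
  labs(title = "Support for Candidate A", x = "% Black VAP", y = "Vote Share")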

“geom_smooth” tells R to add a fitted line to the scatterplot
Setting the method to “lm” (“linear model”) makes that line the least-squares regression line

Some Examples

Calculating regression coefficients

reg_model <- lm(pVoteA ~ pBlackVAP, data = my_data)
summary(reg_model)
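
The printed summary contains everything needed for the hypothesis test, but the individual pieces can also be pulled out directly. The sketch below does this on simulated precincts, since the Handley file is not bundled here: sim_data, sim_model, and the generating numbers are assumptions for illustration, not the real results.

```r
# Simulated stand-in for the Handley data (hypothetical values)
set.seed(2021)
n <- 50
sim_data <- data.frame(pBlackVAP = runif(n, 0, 100))
sim_data$pVoteA <- pmin(100, pmax(0, 5 + 0.9 * sim_data$pBlackVAP + rnorm(n, sd = 2)))

# Same model formula as on the slide, fitted to the simulated data
sim_model <- lm(pVoteA ~ pBlackVAP, data = sim_data)

coef(sim_model)                # intercept and slope estimates
coef(summary(sim_model))       # estimates, standard errors, t values, p-values
summary(sim_model)$r.squared   # share of variance in Y explained by X
confint(sim_model)             # 95% confidence intervals for both coefficients

# p-value for the test of H0: slope = 0
slope_p <- coef(summary(sim_model))["pBlackVAP", "Pr(>|t|)"]
print(slope_p)
```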

Some Examples

Interpretation: reject H₀ (i.e., that \(\beta_{1}\) \(=\) 0)

\(\beta_{0}\): when the Black VAP in a precinct is zero percent, the predicted vote share for Candidate A is 6.03 percent
\(\beta_{1}\): for every one-percentage-point increase in Black VAP, the predicted vote share for Candidate A increases by .92 percentage points
\(R^2\): roughly 97% of the variation in the vote share variable for Candidate A can be explained by the Black VAP variable
- In the bivariate case: \(R^2\) \(=\) \(r_{XY}\) \(\times\) \(r_{XY}\)
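
As a worked example, the reported estimates can be plugged back into the regression equation. The sketch below uses the rounded values from this slide (6.03, .92, and 97%) and assumes both variables are measured on a 0 to 100 percentage scale, so the results are approximate.

```r
# Rounded estimates reported on this slide
b0 <- 6.03          # intercept
b1 <- 0.92          # slope
r_squared <- 0.97   # R-squared

# Predicted vote share for Candidate A at the two ends of the X scale
pred_0   <- b0 + b1 * 0     # precinct with 0% Black VAP
pred_100 <- b0 + b1 * 100   # precinct with 100% Black VAP

# In the bivariate case R^2 = r^2, so |r| is the square root of R^2;
# the sign of r matches the sign of the slope
r_xy <- sign(b1) * sqrt(r_squared)

print(c(pred_0 = pred_0, pred_100 = pred_100, r_xy = round(r_xy, 3)))
```

With these rounded inputs, the fitted line runs from about 6 percent support for Candidate A in a precinct with no Black voting-age residents to about 98 percent in an all-Black precinct. That reading leans on the straight line holding across the whole range of X.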
