Bivariate (Ecological) Regression

Ray Block Jr., Penn State University

Winter 2021

Overview

Last class, we tested for differences in voting patterns across precinct types (homogeneous vs. not)

Difference tests are analogous to association tests

We will therefore talk extensively about associations today. Specifically, we cover:

  1. The logic of correlation analysis

  2. The logic of (bivariate) regression

  3. Some examples (in RStudio) using Handley’s data

1. The Logic of Correlation

The logic of correlation

Bivariate correlations determine:

  • Whether two variables are related

  • If they are related, how they are related

The goal: make predictions from one variable to another

The logic of correlation

Theoretically, correlation \(=\) association

  • Knowing something about one variable tells you something about another variable

Statistically, correlation \(=\) co-variation

  • Variables are co-related if they co-vary

  • Changes in one accompany changes in the other

The logic of correlation

How correlations work

The logic of correlation

It goes without saying, but correlation \(\neq\) causation!

  • Correlations only identify a relationship between variables

  • They do not determine which variable causes the other

The logic of correlation

Pearson’s correlation coefficient (rXY)

  • The most commonly used measure of (linear) association

  • rXY = Numerical index representing the strength and direction of relatedness between X and Y

The logic of correlation

Pearson’s correlation coefficient (rXY)

  • rXY always ranges from -1 to 1

    • -1 \(=\) perfect inverse correlation between X and Y
    • 0 \(=\) no relationship between X and Y
    • +1 \(=\) perfect positive correlation between X and Y
  • rXY can take on any value between these extremes

The logic of correlation

Pearson’s correlation coefficient (rXY)

  • rXY does not change if the independent (X) and dependent (Y) variables are interchanged

  • The ordering of variables does not matter because this is a measure of correlation, not causation

The logic of correlation

Interpreting rXY:

  • The sign represents the direction of the relationship

  • The absolute value represents the strength of the relationship

The logic of correlation

Interpreting rXY:

Scatterplots are great for visualizing correlations

2. The Logic of (Bivariate) Regression

The Logic of (Bivariate) Regression

Applying correlation statistics to prediction problems

  • rXY tells us the strength and direction of the relationship between two variables.

    • If rXY is strong, then we can use information about values of X to predict values of Y.

    • If rXY has a positive (negative) sign, then the direction of relationship is positive (inverse).

The Logic of (Bivariate) Regression

Applying correlation statistics to prediction problems

  • The shape of the relationship modeled in rXY is linear

    • rXY captures how well a straight line describes the values of the Y variable across the range of X values

The Logic of (Bivariate) Regression

Applying correlation statistics to prediction problems

  • If the absolute value of rXY is close to 1, then the observed Y values all lie close to the best-fitting line

    • As a result, we can use a line to predict what the values of the Y variable will be for any given value of X

    • To make such a prediction, we need to know how to create the best-fitting (regression) line

The Logic of (Bivariate) Regression

Applying correlation statistics to prediction problems

  • X (horizontal axis) is a predictor of Y (vertical axis)

  • The best-fitting line minimizes the differences between the data points and the straight line

  • Points rarely fall directly on the line; the gaps that remain are called residuals

The Logic of (Bivariate) Regression

Applying correlation statistics to prediction problems

  • Residual: the gap between the value predicted by the model and the actual value of the outcome variable

  • Data points above the line have positive residuals (the model under-predicts Y at those values of X)

  • Data points below the line have negative residuals (the model over-predicts Y)

The Logic of (Bivariate) Regression

Line that fits best \(\equiv\) line that minimizes the (sum of squared) residuals. One way to draw such a line is the method of least squares (the standard formulas appear after the definitions below):

\[\hat{Y} = b + mX\]

where,

  • \(\hat{Y}\) represents values of Y predicted by the linear model

  • b is the intercept (the value of \(\hat{Y}\) when X \(=\) 0)

  • m is the slope of the regression line (i.e., the change in \(\hat{Y}\) associated with a one-unit shift in X)

  • X is the value of the predictor variable
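For reference, least squares gives closed-form solutions for the slope and intercept (standard formulas, stated here using the notation defined above):

\[m = \frac{\sum_{i}(X_{i} - \bar{X})(Y_{i} - \bar{Y})}{\sum_{i}(X_{i} - \bar{X})^{2}}, \qquad b = \bar{Y} - m\bar{X}\]

where \(\bar{X}\) and \(\bar{Y}\) are the sample means. These are the values of m and b that minimize the sum of squared residuals.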

3. Some Examples of Correlation and Regression (Using the Handley Data)

Some Examples

First things first: load the necessary packages

  • readr is for importing datasets (called “data frames” in R)

  • dplyr is for manipulating variables and running analyses

  • ggplot2 is for data visualization
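A minimal setup sketch, assuming the three packages are already installed:

```r
# Load the packages used in the examples below
library(readr)    # importing datasets ("data frames")
library(dplyr)    # manipulating variables and running analyses
library(ggplot2)  # data visualization
```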

Some Examples

Now, get the Handley dataset

  • I saved a re-coded version to my computer (filename: “PracticeData-ReCoded.csv”) and uploaded that

  • I put the re-coded data into an object called “my_data,” and I will use that from now on
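A sketch of the import step; the filename matches the slide above, but the path will depend on where the file is saved:

```r
# Import the re-coded Handley data into an object called my_data
my_data <- read_csv("PracticeData-ReCoded.csv")

# Quick look at the variables that were imported
glimpse(my_data)
```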

Some Examples

Scatterplots (Candidate A votes by Black vs. White VAP)
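A sketch of one panel of this plot; black_vap and cand_a_share are placeholder column names, since the actual variable names in my_data are not shown on the slides. The Candidate B scatterplots on the next slide follow the same pattern with the variables swapped.

```r
# Scatterplot: Candidate A vote share by % Black VAP
# (black_vap and cand_a_share are hypothetical column names)
ggplot(my_data, aes(x = black_vap, y = cand_a_share)) +
  geom_point() +
  labs(x = "% Black VAP", y = "Vote share for Candidate A (%)")
```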

Some Examples

Scatterplots (Candidate B votes by Black vs. White VAP)

Some Examples

Calculating correlation coefficients

  • H0: rXY \(=\) 0

  • H1: rXY \(\neq\) 0

where,

  • X represents our precinct characteristics variables {Black VAP vs. White VAP}

  • Y represents our voting patterns variables {Candidate A vs. Candidate B}

Some Examples

Calculating correlation coefficients (for Candidate A only)
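A sketch of how this test could be run in R, reusing the placeholder column names from the scatterplot sketch:

```r
# Pearson correlation between % Black VAP and Candidate A vote share
# H0: r_XY = 0; H1: r_XY != 0
cor.test(my_data$black_vap, my_data$cand_a_share, method = "pearson")

# The coefficient by itself
cor(my_data$black_vap, my_data$cand_a_share)
```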

Some Examples

Based on the correlation analyses, we can reject H0

  • There is a statistically significant relationship between precinct characteristics and voting patterns

  • Put differently, our analyses confirm that rXY \(\ne\) 0

Since our correlations are mirror images of each other (the results for Candidate A are the polar opposite of those for Candidate B), from now on we will focus on the relationship between Black VAP levels and support for Candidate A.

Some Examples

Let’s re-write the regression equation:

\[Y = \beta_{0} + \beta_{1}X + \epsilon\]

where,

  • Y represents the observed values of the outcome variable; \(\hat{Y}\) denotes the values predicted by the linear model

  • \(\beta_{0}\) is the intercept (the value of \(\hat{Y}\) when X \(=\) 0)

  • \(\beta_{1}\) is the slope of the regression line (i.e., the change in \(\hat{Y}\) associated with a one-unit shift in X)

  • X is the value of the predictor variable (% Black VAP)

  • \(\epsilon\) is the error (“residual”) term

Some Examples

We can now test new hypotheses

  • H0: \(\beta_{1}\) \(=\) 0

  • H1: \(\beta_{1}\) \(\neq\) 0

The goal: check whether there is a statistically significant relationship between the % Black voting-age population in a precinct and the vote share for Candidate A in that precinct

Some Examples

Adding a regression line to the scatterplot

  • “geom_smooth” tells R to add the regression line
  • the smoothing method R uses is the “linear model” (“lm”), as sketched below
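A sketch of that plotting code (placeholder column names again):

```r
# Scatterplot with the fitted regression line overlaid
ggplot(my_data, aes(x = black_vap, y = cand_a_share)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +  # fit and draw a linear model
  labs(x = "% Black VAP", y = "Vote share for Candidate A (%)")
```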

Some Examples

Adding a regression line to the scatterplot

Some Examples

Calculating regression coefficients
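A sketch of the estimation step; the coefficients interpreted on the next slide would come from output like this (column names are still placeholders):

```r
# Bivariate regression: Candidate A vote share on % Black VAP
model_a <- lm(cand_a_share ~ black_vap, data = my_data)

# Coefficients, standard errors, t-tests, and R-squared
summary(model_a)
```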

Some Examples

Interpretation: reject H0 (i.e., that \(\beta_{1}\) \(=\) 0)

  • \(\beta_{0}\): when the Black VAP in a precinct is zero percent, the average vote share for Candidate A is 6.03 percent

  • \(\beta_{1}\): for every one-percentage-point increase in Black VAP, the vote share for Candidate A increases by 0.92 percentage points.

  • \(R^2\): roughly 97% of the variation in the vote share variable for Candidate A can be explained by the Black VAP variable

    • In the bivariate case: \(R^2 = r_{XY} \times r_{XY}\) (demonstrated in the sketch below)
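One way to see the last point is to compare the two quantities directly (a sketch, reusing the objects defined above):

```r
# In the bivariate case, R-squared equals the squared Pearson correlation
r_xy <- cor(my_data$black_vap, my_data$cand_a_share)
all.equal(summary(model_a)$r.squared, r_xy^2)  # should return TRUE
```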
