Last class, we tested for differences in voting patterns across precinct types (homogeneous vs. not)
Difference tests are analogous to association tests
We will therefore talk extensively about associations today. Specifically, we cover:
The logic of correlation analysis
The logic of (bivariate) regression
Some examples (in RStudio) using Handley’s data
Bivariate correlations determine:
Whether two variables are related
If they are related, how they are related
The goal: make predictions from one variable to another
Theoretically, correlation \(=\) association
Statistically, correlation \(=\) co-variation
Variables are co-related if they co-vary
Changes in one accompany changes in the other
How correlations work
It goes without saying, but correlation \(\neq\) causation!
Correlations only look for a relationship between variables
They do not determine which variable causes the other
Pearson’s correlation coefficient (rXY)
The most commonly used measure of (linear) association
rXY = Numerical index representing the strength and direction of relatedness between X and Y
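For reference, the standard textbook formula for Pearson's coefficient (not shown on the original slides) divides the co-variation of X and Y by the product of their individual variation. Here \(\bar{X}\) and \(\bar{Y}\) are the sample means and n is the number of observations:

\[r_{XY} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}}\]

The numerator is positive when X and Y tend to sit on the same side of their means and negative when they tend to sit on opposite sides. The denominator rescales the result so that it is unit-free.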
rXY always ranges from -1 to 1
rXY can take on any value between these extremes
rXY does not change if the independent (X) and dependent (Y) variables are interchanged
The ordering of variables does not matter because this is a measure of correlation, not causation
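A minimal sketch of these properties, using base R's cor() on two made-up vectors (x and y are hypothetical values, not the Handley data):

```r
# Hypothetical toy data (not the Handley data)
x <- c(10, 20, 30, 40, 50)
y <- c(12, 25, 29, 41, 55)

# cor() computes Pearson's r by default
r_xy <- cor(x, y)
r_yx <- cor(y, x)

# Swapping X and Y leaves the coefficient unchanged
print(isTRUE(all.equal(r_xy, r_yx)))

# The coefficient always falls between -1 and 1
print(r_xy >= -1 && r_xy <= 1)
```

Both checks should print TRUE for any pair of numeric vectors with non-zero variance.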
Interpreting rXY:
Sign represents the direction of the relationship
Absolute value represents strength of the relationship
Scatterplots are great for visualizing correlations
Applying correlation statistics to prediction problems
rXY tells us the strength and direction of the relationship between two variables.
If rXY is strong, then we can use information about values of X to predict values of Y.
If rXY has a positive (negative) sign, then the direction of relationship is positive (inverse).
The shape of the relationship modeled in rXY is linear
If the absolute value of rXY is close to 1, then the observed Y values all lie close to the best-fitting line
As a result, we can use a line to predict what the values of the Y variable will be for any given value of X
To make such a prediction, we need to know how to create the best-fitting (regression) line
X (horizontal axis) is a predictor of Y (vertical axis)
The best-fitting line minimizes the overall distance between the data points and the line
The vertical gaps between the data points and the line are called residuals
Residual: the gap between the actual value of Y and the value predicted by the model
Data points above the line have positive residuals (the model under-predicts Y at that value of X)
Data points below the line have negative residuals (the model over-predicts Y at that value of X)
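The sign convention for residuals can be checked with a few lines of R. This is only a sketch: the data and the line (intercept b = 1, slope m = 2) are made up for illustration and are not fitted to anything.

```r
# Hypothetical toy data and a hand-picked line (b = 1, m = 2), for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(3.5, 4.0, 7.5, 9.0, 10.5)
b <- 1
m <- 2

# Values predicted by the line
y_hat <- b + m * x

# Residual = actual value minus predicted value
residual <- y - y_hat

# Positive residual: the point lies above the line (under-predicted)
# Negative residual: the point lies below the line (over-predicted)
print(data.frame(x, y, y_hat, residual))
```

A residual of zero means the point falls exactly on the line.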
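The line that fits best \(\equiv\) the line that minimizes the residuals overall. The usual way to find such a line is the method of least squares, which minimizes the sum of the squared residuals. The resulting line has the form: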
\[\hat{Y} = b + mX\]
where,
\(\hat{Y}\) represents values of Y predicted by the linear model
b is the intercept (the value of \(\hat{Y}\) when X \(=\) 0)
m is the slope of the regression line (i.e., the change in \(\hat{Y}\) associated with a one-unit shift in X)
X is the value of the predictor variable
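For reference, least squares picks b and m to make \(\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2\) as small as possible. The standard closed-form solution (a textbook result, not shown on the original slides) is:

\[m = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}, \qquad b = \bar{Y} - m\bar{X}\]

where \(\bar{X}\) and \(\bar{Y}\) are the sample means. The second expression implies that the fitted line always passes through the point \((\bar{X}, \bar{Y})\).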
First things first: load the necessary packages
readr is for importing datasets (stored as “data frames” in R)
dplyr is for manipulating data (e.g., filtering rows, creating and summarizing variables)
ggplot2 is for data visualization
Now, get the Handley dataset
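library(readr)    # importing CSV files
library(dplyr)    # data manipulation
library(ggplot2)  # data visualization
# Read the re-coded Handley practice data into a data frame
my_data <- read_csv("C:/Users/rjb6233/Google Drive/Political Science Classes (Taught or Being Prepped)/PSU Courses/RPV Crash Course/Data/Handley/PracticeData-ReCoded.csv")
# Open the data frame in the RStudio viewer
View(my_data)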
I saved a re-coded version to my computer (filename: “PracticeData-ReCoded.csv”) and loaded that file
I put the re-coded data into an object called “my_data,” and I will use that from now on
Scatterplots (Candidate A votes by Black vs. White VAP)
Scatterplots (Candidate B votes by Black vs. White VAP)
Calculating correlation coefficients
H0: rXY \(=\) 0
H1: rXY \(\neq\) 0
where,
X represents our precinct characteristics variables {Black VAP vs. White VAP}
Y represents our voting patterns variables {Candidate A vs. Candidate B}
Calculating correlation coefficients (for Candidate A only)
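A minimal sketch of how such a test can be run, using base R's cor.test(). The data frame toy_data and its values are made up for illustration. Only the column names pBlackVAP and pVoteA come from the plotting code in these notes. With the real data, my_data would take the place of toy_data.

```r
# Hypothetical toy precinct data; the values are made up for illustration
toy_data <- data.frame(
  pBlackVAP = c(5, 15, 30, 45, 60, 75, 90),
  pVoteA    = c(10, 22, 31, 50, 58, 77, 88)
)

# cor.test() estimates Pearson's r and tests H0: r = 0
result <- cor.test(toy_data$pBlackVAP, toy_data$pVoteA)

# Sample correlation coefficient
print(result$estimate)

# p-value: reject H0 when it falls below the chosen threshold (e.g., 0.05)
print(result$p.value)
```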
Based on the correlation analyses, we can reject H0
There is a statistically significant relationship between precinct characteristics and voting patterns
Put differently, our analyses indicate that rXY \(\ne\) 0
Since our correlations are mirror images of each other (the results for Candidate A are the polar opposite of the results for Candidate B), from now on we will focus on the relationship between Black VAP levels and support for Candidate A.
Let’s re-write the regression equation:
\[Y = \beta_{0} + \beta_{1}X + \epsilon\]
where,
Y represents the observed values of the outcome variable (vote share for Candidate A)
\(\hat{Y} = \beta_{0} + \beta_{1}X\) represents the values of Y predicted by the linear model
\(\beta_{0}\) is the intercept (the value of \(\hat{Y}\) when X \(=\) 0)
\(\beta_{1}\) is the slope of the regression line (i.e., the change in \(\hat{Y}\) associated with a one-unit shift in X)
X is the value of the predictor variable (% Black VAP)
\(\epsilon\) is the error (“residual”) term, i.e., the gap between Y and \(\hat{Y}\)
We can now test new hypotheses
H0: \(\beta_{1}\) \(=\) 0
H1: \(\beta_{1}\) \(\neq\) 0
The goal: Check whether there is a statistically significant relationship between the % Black voting-age population in a precinct and the vote share for A in that precinct
Adding a regression line to the scatterplot
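# Scatterplot of vote share for A against % Black VAP, with a fitted regression line
ggplot(my_data, aes(x = pBlackVAP, y = pVoteA)) +
  geom_point() +
  geom_smooth(method = "lm") +  # least-squares line with a confidence band
  labs(title = "Support for A", x = "% Black VAP", y = "Vote Share")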
Calculating regression coefficients
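A minimal sketch of how the coefficients can be estimated, using base R's lm(). The data frame toy_data and its values are made up for illustration, so its output will not match the estimates reported in these notes. With the real data, my_data would take the place of toy_data.

```r
# Hypothetical toy precinct data; the values are made up for illustration
toy_data <- data.frame(
  pBlackVAP = c(5, 15, 30, 45, 60, 75, 90),
  pVoteA    = c(10, 22, 31, 50, 58, 77, 88)
)

# Fit the bivariate regression of vote share on % Black VAP
fit <- lm(pVoteA ~ pBlackVAP, data = toy_data)

# Coefficient table: estimates, standard errors, t statistics, p-values
print(coef(summary(fit)))

# R-squared: share of the variation in pVoteA explained by pBlackVAP
r_squared <- summary(fit)$r.squared
print(r_squared)

# In a bivariate regression, R-squared equals the squared correlation
print(isTRUE(all.equal(r_squared, cor(toy_data$pBlackVAP, toy_data$pVoteA)^2)))
```

The row labelled pBlackVAP in the coefficient table holds the slope estimate and the p-value used to test H0: \(\beta_{1}\) \(=\) 0. The row labelled (Intercept) holds the estimate of \(\beta_{0}\).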
Interpretation: reject H0 (i.e., that \(\beta_{1}\) \(=\) 0)
\(\beta_{0}\): when the Black VAP in a precinct is zero percent, the predicted vote share for Candidate A is 6.03 percent
\(\beta_{1}\): for every one-percentage-point increase in Black VAP, the predicted vote share for Candidate A increases by 0.92 percentage points
\(R^2\): roughly 97% of the variation in the vote share for Candidate A is explained by the Black VAP variable
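As a worked illustration using the (rounded) estimates above, the predicted vote share for Candidate A in a hypothetical precinct where the Black VAP is 50 percent would be:

\[\hat{Y} = 6.03 + 0.92 \times 50 = 6.03 + 46.00 = 52.03\]

That is, the model predicts that Candidate A would receive roughly 52 percent of the vote in such a precinct.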