Liu provides (among other things) an intellectual history of ecological inference (EI) in racially polarized voting (RPV) analysis.
I talk briefly about RPV analyses in RStudio, and I go through a step-by-step example.
I pass the mic to Liu, who will talk specifically about EI software.
Confessions of an SPSS and Stata user…
read in (i.e., import) the data…
library(readxl)
Morgan <- read_excel("C:/Users/rjb6233/Google Drive/Political Science Classes (Taught or Being Prepped)/PSU Courses/RPV Crash Course/Data/Morgan County (Alabama)/Morgan.xlsx")
View(Morgan)
Install the readxl package and load it to your “library”
Find the “Morgan.xlsx” file
Use the “read_excel” command to translate the file into R
Use the “View” command to check that the data imported correctly
Generally, we are looking for relationships between jurisdiction characteristics and voting patterns
Some ways of checking for these relationships:
What is it?
How do you do it?
Jurisdiction characteristics variables:
These are theoretically-grounded measures of “racial political context” (see, e.g., McClerking 2008)
Voting patterns variables:
Manipulating variables
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Use dplyr’s “mutate” and “ifelse” commands to create homogeneous precinct variables
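Morgan_RC <- Morgan %>%
  # Racial composition: percent of voting-age population (VAP)
  mutate(pBlackVAP = (blkvap / vap) * 100) %>%
  mutate(pWhiteVAP = (whtvap / vap) * 100) %>%
  # Candidate vote shares (source columns black_percent and white_percent), times 100
  mutate(Obama_Vote = black_percent * 100) %>%
  mutate(McCain_Vote = white_percent * 100) %>%
  # Homogeneous-precinct indicators (1 = yes, 0 = no)
  mutate(MostlyBlack = ifelse(pBlackVAP >= 70, 1, 0),
         MostlyWhite = ifelse(pWhiteVAP >= 90, 1, 0))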
data(Morgan_RC)
I saved these new variables in a data file called “Morgan_RC” (“RC” stands for “re-coded”)
Use the “filter” command (in dplyr) to sort through the data
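A minimal sketch of how “filter” might be used at this step, assuming indicator variables like the “MostlyBlack” and “MostlyWhite” recodes created earlier. The toy data frame is made up for illustration (it is not the Morgan County data), so only the pattern matters, not the numbers.

```r
library(dplyr)

# Hypothetical toy precincts (NOT the real Morgan County data)
toy <- data.frame(
  precinct    = c("A", "B", "C", "D"),
  pBlackVAP   = c(85, 5, 40, 3),
  pWhiteVAP   = c(12, 93, 55, 95),
  Obama_Vote  = c(88, 20, 47, 15),
  McCain_Vote = c(12, 80, 53, 85)
) %>%
  mutate(MostlyBlack = ifelse(pBlackVAP >= 70, 1, 0),
         MostlyWhite = ifelse(pWhiteVAP >= 90, 1, 0))

# Keep only the homogeneous precincts of each type
black_precincts <- toy %>% filter(MostlyBlack == 1)
white_precincts <- toy %>% filter(MostlyWhite == 1)

# Compare average candidate support across the two groups
mean(black_precincts$Obama_Vote)
mean(white_precincts$McCain_Vote)
```

Comparing the two averages is the whole analysis: if the mostly Black and mostly White precincts back different candidates by wide margins, that is the pattern read as polarization.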
Interpretation: the evidence suggests racial polarization
residents in mostly Black precincts prefer Obama
residents in mostly White precincts prefer McCain
Limitations of Homogeneous Precinct Analysis
Strict threshold: In many jurisdictions there are no precincts that can be classified as homogeneous (based on a 90% cutoff point)
Limited information: Homogeneous precincts are often only a small, possibly unrepresentative, sample of the population
Homogeneous precinct analyses test for differences in voting patterns across jurisdiction characteristics (homogeneously Black vs. White)
Analyzing differences is analogous to analyzing associations
Correlations are tests of associations
Bi-variate correlations determine:
if 2 variables are related
if related, how they are related
The goal: make predictions from one variable to another
Theoretically, correlation \(=\) association
Statistically, correlation \(=\) co-variation
variables are co-related if they co-vary
changes in one accompany changes in the other
How correlations work
It goes without saying, but correlation \(\neq\) causation!
correlations only look for a relationship between variables
they do not determine which variable causes the other
Pearson’s correlation coefficient (\(r_{XY}\))
most commonly used measure of (linear) association
\(r_{XY}\) = index representing the strength and direction of relatedness between X and Y
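For reference, the usual sample formula behind this index, in standard textbook notation rather than anything taken from these slides (\(\bar{X}\) and \(\bar{Y}\) are the sample means, and \(n\) is the number of precincts):
\[r_{XY} = \frac{\sum_{i=1}^{n}(X_{i}-\bar{X})(Y_{i}-\bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_{i}-\bar{X})^{2}}\,\sqrt{\sum_{i=1}^{n}(Y_{i}-\bar{Y})^{2}}}\]
The numerator captures co-variation; the denominator rescales it so the result always falls between -1 and +1.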
Pearson’s correlation coefficient (\(r_{XY}\))
\(r_{XY}\) always ranges from -1 to +1
\(r_{XY}\) can take on any value between these extremes
Pearson’s correlation coefficient (\(r_{XY}\))
\(r_{XY}\) does not change if the independent (X) and dependent (Y) variables are interchanged
ordering of variables does not matter (this is a measure of correlation, not causation)
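A quick sketch of these properties using base R’s “cor” function, which computes Pearson’s coefficient by default. The numbers are made up for illustration and are not drawn from any real precincts.

```r
# Hypothetical precinct-level percentages (made up for illustration)
pBlackVAP  <- c(5, 10, 20, 35, 50, 70, 85)
Obama_Vote <- c(18, 22, 30, 41, 55, 72, 86)

# Pearson's r is the default method in cor()
r_xy <- cor(pBlackVAP, Obama_Vote)
r_yx <- cor(Obama_Vote, pBlackVAP)

r_xy                   # sign gives direction, absolute value gives strength
all.equal(r_xy, r_yx)  # swapping X and Y leaves the coefficient unchanged
```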
Interpreting \(r_{XY}\):
sign represents the direction of relationship
absolute value represents strength of relationship
Interpreting \(r_{XY}\):
scatterplots are great for visualizing correlations
Creating scatterplots (Obama votes by Black vs. White VAP)
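One way such scatterplots might be drawn with base R graphics. This is a sketch on made-up data, with White VAP defined as the remainder of Black VAP purely to keep the toy example small; real precincts include other groups.

```r
# Hypothetical precinct-level percentages (made up for illustration)
precincts <- data.frame(
  pBlackVAP  = c(5, 10, 20, 35, 50, 70, 85),
  Obama_Vote = c(18, 22, 30, 41, 55, 72, 86)
)
# Toy assumption: only two groups, so White VAP is the remainder
precincts$pWhiteVAP <- 100 - precincts$pBlackVAP

# Two panels side by side: Obama vote share against each group's VAP share
old_par <- par(mfrow = c(1, 2))
plot(precincts$pBlackVAP, precincts$Obama_Vote,
     xlab = "% Black VAP", ylab = "% Obama vote", main = "Black VAP")
plot(precincts$pWhiteVAP, precincts$Obama_Vote,
     xlab = "% White VAP", ylab = "% Obama vote", main = "White VAP")
par(old_par)  # restore the previous plotting layout
```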
Interpretation: racial difference in candidate preference
as the Black VAP increases, so does Obama’s vote share
Obama’s vote share decreases with rising White VAP
Limitations:
correlations are based on a continuous (rather than dichotomous) measure of jurisdiction characteristics
but conclusions are still based on aggregate patterns (can’t really talk about what individuals are doing)
Applying correlation statistics to prediction problems
\(r_{XY}\) tells us the strength and direction of the relationship between two variables
If \(r_{XY}\) is strong, then we can use information about values of X to predict values of Y
If \(r_{XY}\) has a positive (negative) sign, then the direction of the relationship is positive (inverse)
Applying correlation statistics to prediction problems
The shape of the relationship modeled in \(r_{XY}\) is linear
Applying correlation statistics to prediction problems
If the absolute value of \(r_{XY}\) is close to 1, then the observed Y values all lie close to the best-fitting line
As a result, we can use a line to predict what the values of the Y variable will be for any given value of X
To make such a prediction, we need to know how to create the best-fitting (regression) line
Applying correlation statistics to prediction problems
X (horizontal axis) is a predictor of Y (vertical axis)
The best-fitting line minimizes the differences between the data points and the straight line
We call the vertical gaps between the data points and the line residuals
Applying correlation statistics to prediction problems
Residual: the gap between the value predicted by the model and the actual value of Y
Data points above the line have positive residuals (the model under-predicts Y at that X)
Data points below the line have negative residuals (the model over-predicts Y at that X)
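To make the sign convention concrete, here is a small sketch that computes residuals as actual minus predicted values. It uses base R’s “lm” function to draw the line and made-up data, so the specific numbers carry no meaning.

```r
# Hypothetical data (made up for illustration)
x <- c(5, 10, 20, 35, 50, 70, 85)
y <- c(18, 22, 30, 41, 55, 72, 86)

fit <- lm(y ~ x)            # best-fitting straight line

predicted <- fitted(fit)    # values on the line at each observed x
res       <- y - predicted  # residual = actual minus predicted

# Positive residual: the point sits above the line (model under-predicts)
# Negative residual: the point sits below the line (model over-predicts)
data.frame(x, y, predicted = round(predicted, 2), residual = round(res, 2))
```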
The line that fits best \(\equiv\) the line that minimizes the residuals. One way to draw such a line is the method of least squares, which minimizes the sum of squared residuals:
\[\hat{Y} = b + mX\]
where,
\(\hat{Y}\) represents values of Y predicted by the linear model
b is the intercept (the value of \(\hat{Y}\) when X \(=\) 0)
m is the slope of the regression line (i.e., the change in \(\hat{Y}\) associated with a one-unit shift in X)
X is the value of the predictor variable
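The slope and intercept can be computed by hand and checked against “lm”. The sketch below uses the standard least-squares formulas (slope = covariance of X and Y divided by the variance of X; intercept = mean of Y minus slope times mean of X) on made-up data.

```r
# Hypothetical data (made up for illustration)
x <- c(5, 10, 20, 35, 50, 70, 85)
y <- c(18, 22, 30, 41, 55, 72, 86)

# Least-squares slope (m) and intercept (b) from the textbook formulas
m <- cov(x, y) / var(x)
b <- mean(y) - m * mean(x)

# The same line as estimated by lm()
fit <- lm(y ~ x)
coef(fit)
c(b = b, m = m)

# Predicted Y for a new value of X, using Y-hat = b + m * X
y_hat_at_40 <- b + m * 40
y_hat_at_40
```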
Let’s re-write the regression equation:
\[Y = \hat{Y} + \epsilon = \beta_{0} + \beta_{1}X + \epsilon\]
where,
\(\hat{Y}\) represents values of Y predicted by the linear model
\(\beta_{0}\) is the intercept (the value of \(\hat{Y}\) when X \(=\) 0)
\(\beta_{1}\) is the slope of the regression line (i.e., the change in \(\hat{Y}\) associated with a one-unit shift in X)
X is the value of the predictor variable (% Black VAP)
\(\epsilon\) is the error (“residual”) term
The Goodman approach makes a lot of assumptions (and has its flaws), but the general idea is that:
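the intercept (\(\beta_{0}\)) predicts how Whites in a precinct vote (it is the predicted vote share when X \(=\) 0, i.e., a precinct with no Black VAP)
slope + intercept (\(\beta_{0}\) + \(\beta_{1}\)) predicts how Blacks vote (it is the predicted vote share in an all-Black precinct when X is measured as a proportion from 0 to 1; if X is on a 0 to 100 percent scale, use \(\beta_{0}\) + 100\(\beta_{1}\))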
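A minimal sketch of this logic in R, assuming X is the Black share of VAP expressed as a proportion (0 to 1) and a two-group electorate. The data are simulated so that the “true” White and Black support levels are known in advance (about 0.20 and 0.90); none of it comes from real precincts, and the variable names are hypothetical.

```r
# Hypothetical precincts (made up). X is a PROPORTION (0 to 1), not a percentage.
prop_black <- c(0.05, 0.10, 0.20, 0.35, 0.50, 0.70, 0.85)

# Toy vote shares built so that Whites support Obama at about 0.20
# and Blacks at about 0.90, plus a little noise
noise       <- c(0.01, -0.01, 0.02, -0.02, 0.01, -0.01, 0.00)
obama_share <- 0.20 * (1 - prop_black) + 0.90 * prop_black + noise

# Goodman-style regression: vote share on Black share of VAP
goodman <- lm(obama_share ~ prop_black)
b0 <- unname(coef(goodman)[1])  # intercept
b1 <- unname(coef(goodman)[2])  # slope

white_support <- b0       # predicted Obama share where prop_black = 0
black_support <- b0 + b1  # predicted Obama share where prop_black = 1
c(white = white_support, black = black_support)
```

Because the toy data were constructed from known support levels, the two estimates should land near the values that were built in; with real data there is no such check, which is part of why the assumptions matter.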
Here’s a brief discussion of the logic behind all this…
Voting rights litigation relies on the Gingles test, which requires the following conditions to be satisfied to establish a Voting Rights Act violation:
The group of minority voters is sufficiently large and geographically compact.
Minority voters are politically cohesive in supporting their candidate of choice.
The majority votes sufficiently as a bloc that it usually defeats the minority’s preferred candidate.
The data we typically have gives us information about:
who voted (often obtained from the voter file), and
which candidates received votes (we learn this from the election results).
Knowing these totals isn’t good enough to meet the requirements of the Gingles test
We actually need to know which candidates the voters in each racial group supported
Why? So we can determine whether White voters and minority voters were cohesive blocs supporting different candidates (conditions 2 and 3)
Ecological inference means estimating the unobserved quantities of interest (\(\beta_{i}^{b}\) and \(\beta_{i}^{w}\))
from the aggregate observed variables (turnout per precinct, Black VAP, total VAP, etc.)
The issue with Goodman’s regression
Assumes an equal number of residents across precincts (\(N_{i}\))
If \(N\) differs across \(i\), Goodman’s regression cannot estimate the correct quantities of interest
Widely acknowledged to be an improvement over Goodman’s approach