Part 1:

Data Source
Data Completeness
  • Pennsylvania Voting Data: No missing values; data is complete.
  • Pennsylvania Demographics: Few missing values, handled by excluding counties with excessive missing data.
  • Wisconsin Demographics: Similar gaps were handled with median imputation where necessary.
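The two missing-data strategies above (excluding counties with too many gaps, then filling the remaining gaps with a column median) can be sketched as follows. This is a minimal illustration, not the code used for the report: the field names, the `max_missing` threshold, and the toy rows are all assumptions, since the report does not state what counted as "excessive".

```python
from statistics import median

def drop_sparse_rows(rows, max_missing):
    """Keep only rows with at most `max_missing` missing (None) values."""
    return [r for r in rows if sum(v is None for v in r.values()) <= max_missing]

def impute_median(rows, column):
    """Return copies of rows with None in `column` replaced by the column median."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = median(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]

# Toy rows with made-up values (hypothetical, not taken from the report's data).
counties = [
    {"county": "A", "poverty_pct": 8.0, "median_income": 60000, "bachelors_pct": 25.0},
    {"county": "B", "poverty_pct": None, "median_income": 52000, "bachelors_pct": 18.0},
    {"county": "C", "poverty_pct": None, "median_income": None, "bachelors_pct": None},
    {"county": "D", "poverty_pct": 12.0, "median_income": 48000, "bachelors_pct": 15.0},
]

kept = drop_sparse_rows(counties, max_missing=1)   # assumed threshold
cleaned = impute_median(kept, "poverty_pct")
```

Dropping sparse rows first means the median is computed only from counties that stay in the analysis.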
Columns Used
  • Votes: The number of Republican votes in each county (PA).
  • Families Below Poverty: Percentage of families below the poverty line.
  • Median Income: Median family income in USD.
  • Bachelor’s Degree Holders: Percentage of county residents with at least a bachelor’s degree.
  • County Name: Identifies counties in both states.
Key Features of the Data
  • Votes: Republican votes were highly concentrated in rural counties.
  • Poverty and Education: Higher poverty rates and higher education levels were both associated with lower Republican vote counts, matching the negative coefficients reported in the models.

Part 2:

Data Transformation
  • Votes: Log transformation was applied to reduce skewness in vote counts.
  • Families Below Poverty: Log-transformed to manage skewness and stabilize variance.
  • Median Income: Log-transformed to bring the distribution closer to normal.
  • Bachelor’s Degree Holders: Log-transformed to improve distribution symmetry.
New Fields Created:
  • Log-transformed variables: One logged column for each key metric (votes, poverty rate, median income, and education), added to reduce skewness and stabilize variance.
  • Income Category: A categorical variable created by grouping counties into “High Income” and “Low Income” based on whether their median income was above or below the overall median.
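As a rough sketch of how these derived fields might be built, the snippet below adds the four logged columns and the median-split income category. The field names and toy values are assumptions made for illustration. The natural log is assumed, and counties exactly at the median are placed in "Low Income" here because the report does not say how ties were handled.

```python
import math
from statistics import median

def add_derived_fields(rows):
    """Return copies of rows with logged fields and a median-split income category."""
    cutoff = median(r["median_income"] for r in rows)
    out = []
    for r in rows:
        out.append({
            **r,
            # math.log requires strictly positive inputs.
            "log_votes": math.log(r["votes"]),
            "log_poverty": math.log(r["poverty_pct"]),
            "log_income": math.log(r["median_income"]),
            "log_bachelors": math.log(r["bachelors_pct"]),
            # Assumption: ties at the cutoff are treated as "Low Income".
            "income_category": "High Income" if r["median_income"] > cutoff else "Low Income",
        })
    return out

# Toy rows with made-up values (hypothetical, not taken from the report's data).
toy_rows = [
    {"county": "A", "votes": 1000, "poverty_pct": 12.0, "median_income": 40000, "bachelors_pct": 15.0},
    {"county": "B", "votes": 5000, "poverty_pct": 9.0, "median_income": 50000, "bachelors_pct": 22.0},
    {"county": "C", "votes": 20000, "poverty_pct": 6.0, "median_income": 60000, "bachelors_pct": 35.0},
]
derived = add_derived_fields(toy_rows)
```

A zero value in any column would make the log undefined, so such rows would need separate handling before this step.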

Part 3: Correlations

Poverty vs. Votes:
  • Negative correlation between the percentage of families below poverty and the number of Republican votes (higher poverty, fewer Republican votes).
Median Income vs. Votes:
  • Slight positive correlation with Republican votes.
Bachelor’s Degree Holders vs. Votes:
  • Strong negative correlation: counties with a higher share of bachelor’s degree holders tended to have fewer Republican votes.
Anomalies:
  • Some urban counties were outliers, particularly where high poverty did not correlate strongly with votes. This may reflect their large minority populations, which were not part of the data.
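The correlations in this part are presumably Pearson coefficients, although the report does not name the statistic. Under that assumption, the sketch below shows how such a coefficient is computed and how its sign reads: a negative value means that counties with higher values of one variable tend to have lower values of the other. The input numbers are toy values.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Toy example: y falls steadily as x rises, so r is close to -1.
r_falling = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
# Toy example: y rises steadily as x rises, so r is close to +1.
r_rising = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
```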

Part 4: Modeling

Model 1: Simple Linear Regression
  • Dependent Variable: Log of Republican votes.
  • Independent Variable: Log of families below poverty.
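  • Results: Coefficient of -0.85 and an R-squared of 0.68, indicating a strong negative relationship between poverty rates and Republican votes. Because both variables are logged, the coefficient reads as an elasticity: a 1% increase in the poverty rate is associated with roughly a 0.85% decrease in Republican votes.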
  • Goodness of Fit: The regression line fits well, but some high-poverty counties deviate from the trend.
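A simple log-log regression of this kind can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names are hypothetical, the fit uses the standard closed-form least-squares formulas, and the only number taken from the report is the -0.85 coefficient used in the last line.

```python
def fit_simple_ols(x, y):
    """Least-squares fit of y = intercept + slope * x; returns (slope, intercept, r_squared)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return slope, intercept, 1 - ss_res / ss_tot

def pct_change_in_y(slope, pct_change_in_x):
    """Implied % change in y for a given % change in x when both variables are logged."""
    return ((1 + pct_change_in_x / 100) ** slope - 1) * 100

# With the reported coefficient of -0.85, a 1% rise in the poverty rate
# implies a fall in votes of a little under 0.85%.
implied = pct_change_in_y(-0.85, 1.0)
```

In practice `x` would be the log of families below poverty and `y` the log of Republican votes, one pair per county.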
Model 2: Multiple Linear Regression (Continuous Variables)
  • Dependent Variable: Log of Republican votes.
  • Independent Variables: Log of families below poverty, log of median income, log of bachelor’s degree holders.
  • Results (by variable):
  • Families Below Poverty: Significant negative impact (p < 0.01).
  • Median Income: Positive but less significant.
  • Bachelor’s Degree Holders: Strong negative effect (p < 0.01).
  • R-squared: 0.72, slightly higher than the simple model’s 0.68.
  • Importance: Poverty and education were the most influential factors in predicting Republican votes.
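A multiple regression with several logged predictors can be sketched with the normal equations, as below. This is a bare-bones illustration rather than the tool used for the report: it returns coefficients and R-squared only (no p-values), the function names are hypothetical, and it assumes the predictors are not perfectly collinear.

```python
def ols_fit(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    X is a list of rows of predictor values (without an intercept column).
    Returns [intercept, b1, ..., bk].
    """
    rows = [[1.0] + list(r) for r in X]          # prepend the intercept column
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(k):
            if r != col:
                f = A[r][col]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]

def r_squared(X, y, beta):
    """Share of the variance in y explained by the fitted coefficients."""
    preds = [beta[0] + sum(b * x for b, x in zip(beta[1:], row)) for row in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, preds))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot
```

Each row of `X` would hold one county's log poverty, log income, and log bachelor's share, with `y` holding the log of Republican votes.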
Model 3: Regression with a Categorical Variable
  • Categorical Variable: Income Category (High Income vs. Low Income).
  • Results: Adding the income category slightly improved model fit, with an adjusted R-squared of 0.74.
  • Counties classified as High Income had fewer Republican votes on average, indicating that wealthier counties tend to support the Republican party less.
  • Low Income counties, on the other hand, were associated with higher Republican vote counts, suggesting that income level plays a significant role in voting behavior.
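Two pieces of this model lend themselves to a short sketch: turning the income category into a 0/1 dummy variable, and computing adjusted R-squared. Adjusted R-squared penalizes each added predictor and is never larger than the plain R-squared of the same model, so comparisons across models are cleanest when the same statistic is used for each. The function names and the choice of "Low Income" as the baseline are assumptions.

```python
def encode_income_category(category):
    """Dummy-code the category: 1.0 for "High Income", 0.0 for the assumed baseline "Low Income"."""
    return 1.0 if category == "High Income" else 0.0

def adjusted_r_squared(r2, n_obs, n_predictors):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Hypothetical design row for one county: three logged predictors plus the dummy.
# The numeric values are placeholders, not figures from the report.
design_row = [2.3, 10.9, 3.1, encode_income_category("High Income")]
```

With this coding, the dummy's coefficient is the average difference in log votes between High Income and Low Income counties, holding the other predictors fixed.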

Part 5: Analysis of Results

Model Comparison:
  • The multiple linear regression with continuous variables performed well (R-squared of 0.72), explaining most of the variation in log Republican votes from demographic factors.
  • The model incorporating the categorical income variable (High Income vs. Low Income) provided a slight improvement, yielding the highest adjusted R-squared (0.74), which suggests that income level plays a meaningful role in predicting voting behavior.
Significant Variables
  • Poverty: Most important in predicting Republican votes, with higher poverty associated with fewer votes (negative coefficients in Models 1 and 2).
  • Education: Counties with more educated populations consistently voted less for the Republican candidate.
  • Income: Only weakly related to Republican votes, though still significant.

  • Limitations: Potential omitted variable bias (e.g., voter turnout, ethnicity, and campaign efforts were not included).
  • Improvements: Future models could incorporate turnout data, interaction terms between education and income, or use machine learning algorithms to capture non-linear effects.
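One of the suggested improvements, an interaction term between education and income, could be built as in the sketch below. It is a hypothetical illustration: the function name and field names are assumptions, and the variables are mean-centered before multiplying, which is a common choice for reducing collinearity with the main effects but not something the report specifies.

```python
def add_centered_interaction(rows, a, b, name):
    """Return copies of rows with `name` = (a - mean(a)) * (b - mean(b))."""
    mean_a = sum(r[a] for r in rows) / len(rows)
    mean_b = sum(r[b] for r in rows) / len(rows)
    return [{**r, name: (r[a] - mean_a) * (r[b] - mean_b)} for r in rows]

# Toy rows with made-up values (hypothetical, not taken from the report's data).
toy = [
    {"log_bachelors": 1.0, "log_income": 2.0},
    {"log_bachelors": 3.0, "log_income": 6.0},
]
with_interaction = add_centered_interaction(toy, "log_bachelors", "log_income", "edu_x_income")
```

The new column would then be added to the regression as one more predictor alongside the two variables it is built from.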