Voting Data Analysis Assignment

Part 1: Data Source

What datasets are you using? Give the source URLs

I am using the following datasets:

https://www.cnn.com/election/2016/results/president

https://projects.fivethirtyeight.com/2016-election-forecast/

https://www.cnn.com/election/2020/results/president

https://projects.fivethirtyeight.com/polls/president-general/2020/national/

https://projects.fivethirtyeight.com/polls/president-general/2024/national/

Review each dataset for completeness, checking for missing values and outliers.

The data has many missing values for McMullin and some missing values for Johnson.

Also, the percentages per poll often do not add up to 100.

I will need to remove unrealistic outliers and incomplete data.

The percentages per poll often do not add up to 100, so I will need to adjust for this issue.

Also, there are some outdated candidates and outliers that need to be adjusted for.

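A minimal sketch of these completeness checks in R. The table below is made up for illustration and only borrows the 2016 column names listed in this report; the name polls_2016 is hypothetical.

```r
# Hypothetical rows using the 2016 column names from this report;
# the numbers are invented for illustration only.
polls_2016 <- data.frame(
  state               = c("US", "Utah", "Ohio"),
  percentage_clinton  = c(45.0, 27.0, 44.0),
  percentage_trump    = c(41.0, 35.0, 45.0),
  percentage_johnson  = c(6.0, 5.0, NA),
  percentage_mcmullin = c(NA, 28.0, NA),
  stringsAsFactors    = FALSE
)

# Count the missing values in each column.
na_counts <- colSums(is.na(polls_2016))

# Sum the candidate percentages for each poll, ignoring missing values.
pct_cols   <- grep("^percentage_", names(polls_2016), value = TRUE)
poll_total <- rowSums(polls_2016[, pct_cols], na.rm = TRUE)

# Flag the polls whose percentages do not add up to 100.
not_100 <- abs(poll_total - 100) > 1e-8

print(na_counts)
print(data.frame(state = polls_2016$state, poll_total, not_100))
```

The flagged rows are the ones that would need adjusting before the percentages can be compared with actual results.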

Summarize in bullet list each column you plan on using.

- year

- candidate

- actual

- cycle

- state

- percentage_clinton

- percentage_trump

- percentage_johnson

- percentage_mcmullin

- candidate_name

- actual_percentage

- cycle

- state_name

- pollster_id

- candidate_id

- candidate_name

- percentage

- cycle

- state_name

- pollster_id

- candidate_id

- candidate_name

- percentage

Describe in a single sentence the column

- year: The year in which the presidential election was held.

- candidate: The name of the presidential candidate.

- cycle: The year in which the presidential election was held.

- state: The state for which the poll was conducted.

- percentage_clinton: The percentage of respondents in a poll who supported Clinton.

- percentage_trump: The percentage of respondents in a poll who supported Trump.

- percentage_johnson: The percentage of respondents in a poll who supported Johnson.

- percentage_mcmullin: The percentage of respondents in a poll who supported McMullin.

- candidate_name: The name of the presidential candidate.

- cycle: The year in which the presidential election was held.

- state_name: The state for which the poll was conducted.

- pollster_id: The id of the pollster.

- candidate_id: The specific id of the candidate.

- candidate_name: The name of the presidential candidate.

- cycle: The year in which the presidential election was held.

- state_name: The state for which the poll was conducted.

- pollster_id: The id of the pollster.

- candidate_id: The specific id of the candidate.

- candidate_name: The name of the presidential candidate.

The 2016 and 2020 actual results for the presidential election are exact and complete numbers.

- Republican: 46.4%

- Democrat: 48.5%

- Other: 5.1%

- Republican: 46.9%

- Democrat: 51.3%

- Other: 1.8%

- Republican: 39.6%

- Democrat: 44.0%

- Other: 16.4%

- Republican: 42.1%

- Democrat: 50.2%

- Other: 7.7%

- Republican: 43.3%

- Democrat: 45.4%

- Other: 11.3%

Part 2: Data Transformation

Describe any transformations of the data.

Have one bullet point per field.

For example, missing values, binned data, 1/0 conversions, normalizing continuous data, creating categorical features, etc…

Here are the transformations of the data:

- I cleaned every dataset with the janitor package.

- In each tibble, I selected only the columns of data that I wanted.

- I took all state names that were NA and changed them to “US”.
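The three steps above can be sketched roughly as follows. This is a base-R stand-in, not my exact pipeline: the name cleaning is imitated with tolower() and gsub() rather than janitor, and the table raw with its columns is made up for illustration.

```r
# Made-up raw table with untidy column names and one missing state.
raw <- data.frame(
  `State Name`     = c("Ohio", NA, "Utah"),
  `Candidate Name` = c("Biden", "Trump", "Biden"),
  Pct              = c(48, 46, 38),
  Notes            = c("a", "b", "c"),
  check.names      = FALSE,
  stringsAsFactors = FALSE
)

# Step 1: tidy the column names (lower case, underscores instead of spaces).
names(raw) <- gsub(" ", "_", tolower(names(raw)))

# Step 2: keep only the columns needed for the analysis.
polls <- raw[, c("state_name", "candidate_name", "pct")]

# Step 3: treat a missing state as a national ("US") poll.
polls$state_name[is.na(polls$state_name)] <- "US"

print(polls)
```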

Describe any new fields that you created.

For example, converting from Biden / Trump votes to an over/under field.

Here are the new fields that I created:

- I converted all percentages to decimal proportions (for example, 48.5% becomes 0.485) to simplify calculations.

- I grouped every candidate who was not the main Republican or Democratic candidate into a single "other" candidate.

- I converted NAs in the percentage of votes to 0 so that I could perform calculations.

- The other candidate's share was calculated as 1 minus the sum of the Republican and Democratic shares.
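A short worked example of these new fields, using made-up numbers and hypothetical column names (pct_dem, pct_rep, dem, rep, other):

```r
# Made-up poll percentages; the third poll has a missing Democratic value.
polls <- data.frame(
  pct_dem = c(50, 48, NA),
  pct_rep = c(42, 44, 45)
)

# Replace missing percentages with 0 so arithmetic does not return NA.
polls$pct_dem[is.na(polls$pct_dem)] <- 0
polls$pct_rep[is.na(polls$pct_rep)] <- 0

# Convert percentages to decimal proportions.
polls$dem <- polls$pct_dem / 100
polls$rep <- polls$pct_rep / 100

# Everything not assigned to the two major parties becomes "other".
polls$other <- 1 - (polls$dem + polls$rep)

print(polls)
```

In the third row the missing value turned into 0 pushes the other share up to 0.55, so this choice may inflate the other category wherever a major-party value was missing.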

Part 3: Correlations

Describe the results of relationships between key variables (e.g., voter turnout vs. age, education).

In each year, the FiveThirtyEight (538) polling data appears to overestimate support for other candidates compared to the actual results.

Also, when other candidates are overestimated, Republican candidates are underestimated the most.

There is a clear negative correlation between the predicted vote share for other candidates and the predicted vote share for the Republican candidate.
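A correlation like this can be checked with cor(). The sketch below uses made-up shares and hypothetical column names, so the value it prints says nothing about the real polls. Because the other share is defined as 1 minus the two major-party shares, some negative correlation is built into the definition.

```r
# Made-up predicted shares for five polls.
polls <- data.frame(
  rep = c(0.40, 0.42, 0.44, 0.38, 0.45),
  dem = c(0.48, 0.49, 0.50, 0.47, 0.50)
)

# "Other" is whatever is left after the two major parties.
polls$other <- 1 - (polls$rep + polls$dem)

# Pearson correlation between the other share and the Republican share.
r_other_rep <- cor(polls$other, polls$rep)
print(r_other_rep)
```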

Identify any anomalies or patterns that could affect analysis (e.g., outliers, multicollinearity).

My data shows lower percentages of Democratic and Republican votes, and higher percentages of other votes, than is realistic.

I will need to be aware of this and adjust the data to get useful results.

Part 4: Modeling

Document the following models:

Model 1: Simple Linear Regression

Select a single continuous variable (e.g., age) to predict voter turnout.

Run the regression and document the results, including coefficient and R-squared values.

Visualize the regression line and assess the goodness of fit.

lm_mod_1 <- lm(actual_percentages ~ percentage, data = t_president_polls_US_2020)

summary(lm_mod_1)
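A minimal sketch of how the coefficient, the R-squared value and the regression line could be pulled out and plotted. It runs on a small made-up table (toy_polls) that only reuses the column names from the model above; the plot is written to a temporary PDF file so the example also works outside an interactive session.

```r
# Made-up polled and actual shares, for illustration only.
toy_polls <- data.frame(
  percentage         = c(0.50, 0.52, 0.49, 0.43, 0.41, 0.44, 0.05, 0.08),
  actual_percentages = c(0.51, 0.53, 0.50, 0.46, 0.45, 0.47, 0.03, 0.04)
)

# Simple linear regression of actual share on polled share.
fit <- lm(actual_percentages ~ percentage, data = toy_polls)

# Extract the slope and the R-squared value from the fitted model.
slope     <- coef(fit)[["percentage"]]
r_squared <- summary(fit)$r.squared
print(c(slope = slope, r_squared = r_squared))

# Scatter plot with the fitted regression line, saved to a temporary file.
plot_file <- file.path(tempdir(), "regression_fit.pdf")
pdf(plot_file)
plot(toy_polls$percentage, toy_polls$actual_percentages,
     xlab = "Polled share", ylab = "Actual share",
     main = "Polled vs. actual share (toy data)")
abline(fit, col = "blue")
invisible(dev.off())
```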

Model 2: Multiple Linear Regression (Continuous Variables)

Include several continuous variables (e.g., age, income, education) in the model to predict voter turnout.

Analyze the coefficients, p-values, and R-squared of the model.

Evaluate the importance of each variable and the overall model fit.

Run the linear regression model

model2 <- lm(actual_percentages ~ percentage + pollster_id, data = t_president_polls_US_2020)

Summarize the model results

summary(model2)

Interpret the model coefficients and metrics

The coefficient for percentage remains consistent with the first model, while the intercept has turned negative.

The coefficient for pollster_id is small, suggesting it has little effect on the prediction; its p-value in the summary output is the better guide to whether it is significant.

The R-squared value is 0.2982, indicating a modest fit: the model explains about 30% of the variance in actual_percentages.

However, the scatter around the line of best fit suggests that there are many outliers in the data.
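The coefficients and p-values can be read directly from the coefficient table of the model summary. The sketch below uses simulated data with the same column names as model2, so its numbers are not my results.

```r
# Simulated data: the actual share depends on the polled share, not on pollster_id.
set.seed(42)
n <- 60
toy <- data.frame(
  percentage  = runif(n, 0.30, 0.60),
  pollster_id = sample(1:20, n, replace = TRUE)
)
toy$actual_percentages <- 0.10 + 0.8 * toy$percentage + rnorm(n, sd = 0.05)

# Multiple linear regression with two predictors.
fit2 <- lm(actual_percentages ~ percentage + pollster_id, data = toy)

# The coefficient table holds estimates, standard errors, t values and p-values.
coef_table <- coef(summary(fit2))
p_values   <- coef_table[, "Pr(>|t|)"]

print(coef_table)
print(p_values)
```

A small p-value for a predictor is evidence that its coefficient differs from zero; the size of the coefficient alone does not show that, because it depends on the units of the predictor.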

Model 3: Regression with a Categorical Variable

Add a categorical variable (e.g., region, gender) into the regression along with continuous variables.

Interpret the results, focusing on the impact of the categorical variable on voting behavior.

Compare this model’s performance with previous ones using adjusted R-squared and other metrics.

Run the linear regression model with a categorical variable

model3 <- lm(actual_percentages ~ percentage + pollster_id + candidate_name, data = t_president_polls_US_2020)

Summarize the model results

summary(model3)

Interpret the model coefficients and metrics

The coefficient for percentage remains roughly the same, and the intercept is now positive.

The coefficient for pollster_id is still small, suggesting limited significance.

The R-squared value is 1, indicating a perfect fit, which is unrealistic and suggests overfitting. A likely cause is that actual_percentages takes a single value per candidate, so candidate_name alone determines the response.

The line of best fit shows an unnaturally strong correlation.
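One possible explanation for an R-squared of exactly 1 is that the actual percentage is the same for every poll of a given candidate, so the candidate's name predicts it perfectly. The sketch below reproduces that situation with made-up candidates "A", "B" and "C"; whether this is what happened in my data is an assumption.

```r
# Made-up polls: four per candidate, with varying polled shares.
toy <- data.frame(
  candidate_name = rep(c("A", "B", "C"), each = 4),
  percentage     = c(0.50, 0.52, 0.49, 0.51,
                     0.43, 0.41, 0.44, 0.42,
                     0.05, 0.08, 0.06, 0.07),
  stringsAsFactors = FALSE
)

# The actual result is a single number per candidate.
actual_by_candidate    <- c(A = 0.51, B = 0.47, C = 0.02)
toy$actual_percentages <- unname(actual_by_candidate[toy$candidate_name])

# With candidate_name as a predictor, the response is reproduced exactly.
fit <- lm(actual_percentages ~ percentage + candidate_name, data = toy)

# summary() warns about an essentially perfect fit, so the warning is silenced here.
r_squared <- suppressWarnings(summary(fit)$r.squared)
print(r_squared)
```

Here the categorical variable carries the answer itself, so the perfect fit says nothing about how well polls predict results.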

Part 5: Analysis of Results

Compare the performance of the three models based on key metrics (e.g., R-squared, residuals).

The first two models describe a modest relationship between predicted and actual percentages, while the third fits the data perfectly.

However, the final model’s R-squared value of 1 indicates overfitting rather than a true relationship.

The first model had an R-squared value of approximately 0.2974, and the second model improved only slightly to 0.2982.

These values reflect a modest relationship, with many outliers affecting both models.
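Plain R-squared never decreases when a predictor is added, so adjusted R-squared, AIC and an F-test between nested models are fairer ways to compare them. A minimal sketch on simulated data (the models m1 and m2 and all values are illustrative, not my results):

```r
# Simulated data in which pollster_id has no real effect.
set.seed(7)
n <- 80
toy <- data.frame(
  percentage  = runif(n, 0.30, 0.60),
  pollster_id = sample(1:20, n, replace = TRUE)
)
toy$actual_percentages <- 0.10 + 0.8 * toy$percentage + rnorm(n, sd = 0.05)

# Two nested models: m2 adds pollster_id to m1.
m1 <- lm(actual_percentages ~ percentage, data = toy)
m2 <- lm(actual_percentages ~ percentage + pollster_id, data = toy)

# Collect the comparison metrics side by side.
comparison <- data.frame(
  model         = c("m1", "m2"),
  r_squared     = c(summary(m1)$r.squared, summary(m2)$r.squared),
  adj_r_squared = c(summary(m1)$adj.r.squared, summary(m2)$adj.r.squared),
  aic           = c(AIC(m1), AIC(m2))
)
print(comparison)

# F-test of whether the extra predictor improves the fit.
print(anova(m1, m2))
```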

Interpret the significance of each variable across the models. What factors seem most important in predicting voter turnout?

In the first model, the percentage coefficient was significant, showing a positive correlation with actual_percentages.

The second model maintained this significance but introduced pollster_id, which had minimal impact.

The third model added candidate_name, but its R-squared of 1 makes the result hard to interpret and does not show any real gain in predictive ability.

Discuss any limitations or potential biases in the models (e.g., omitted variable bias, overfitting).

Limitations include potential omitted variable bias, as not all relevant factors affecting the actual vote percentages are included.

The perfect fit in the third model suggests overfitting, capturing noise rather than a true relationship.

Outliers in the data may also skew results, leading to misleading interpretations.

Suggest improvements or further steps for analysis (e.g., adding interaction terms, using different algorithms).

To improve the models, I could add interaction terms to explore how the predictors influence each other when predicting the actual vote percentages.

Exploring different algorithms, such as decision trees or random forests, could offer additional insights and help reduce overfitting.

Implementing cross-validation could further test the robustness of the models and check that the findings generalize.
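As a rough illustration of the cross-validation idea, here is a 5-fold sketch in base R on simulated data. The data, the number of folds and the use of RMSE as the error measure are all assumptions made for the example.

```r
# Simulated data with the same column names as the models above.
set.seed(123)
n <- 100
toy <- data.frame(percentage = runif(n, 0.30, 0.60))
toy$actual_percentages <- 0.10 + 0.8 * toy$percentage + rnorm(n, sd = 0.05)

# Randomly assign each row to one of k folds.
k    <- 5
fold <- sample(rep(1:k, length.out = n))

# Fit on k - 1 folds, predict the held-out fold, and record its error.
rmse <- numeric(k)
for (i in 1:k) {
  train   <- toy[fold != i, ]
  test    <- toy[fold == i, ]
  fit     <- lm(actual_percentages ~ percentage, data = train)
  pred    <- predict(fit, newdata = test)
  rmse[i] <- sqrt(mean((test$actual_percentages - pred)^2))
}

# Average out-of-sample error across the folds.
cv_rmse <- mean(rmse)
print(rmse)
print(cv_rmse)
```

A model that overfits would show a much larger error on the held-out folds than on the data it was fitted to.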