Voting Data Analysis Assignment
Part 1: Data Source
What datasets are you using? Give the source URLs
I am using the following datasets:
2016 popular vote actual results:
2016 538 popular vote forecast:
2020 popular vote actual results:
2020 538 popular vote forecast:
2024 538 popular vote forecast:
Review each dataset for completeness, checking for missing values
and outliers.
2016 popular vote actual results:
The data is complete. This was easy to verify because the dataset
contains only the actual popular vote of the 2016 presidential
election.
2016 538 popular vote forecast:
The data has many missing values for McMullin and some missing
values for Johnson.
Also, the percentages in each poll often do not add up to 100.
I will need to remove unrealistic outliers and incomplete rows.
2020 popular vote actual results:
The data is complete. This was easy to verify because the dataset
contains only the actual popular vote of the 2020 presidential
election.
2020 538 popular vote forecast:
The percentages in each poll often do not add up to 100, so I will
need to account for this.
There are also candidates who were no longer running, as well as
outliers, that I will need to handle.
2024 538 popular vote forecast:
The percentages in each poll often do not add up to 100, so I will
need to account for this.
There are also candidates who were no longer running, as well as
outliers, that I will need to handle.
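The completeness checks described above can be sketched in R roughly as follows. This is a minimal illustration, not the code used for this report: the data frame `polls_2016` and its values are made up, and only the column names come from the column list in this report.

```r
# Hypothetical sample of national 2016 poll rows (values are made up).
polls_2016 <- data.frame(
  percentage_clinton  = c(45, 44, 48),
  percentage_trump    = c(41, 40, 43),
  percentage_johnson  = c(7, NA, 5),
  percentage_mcmullin = c(NA, NA, 1)
)

# Count the missing values in each column.
missing_per_column <- colSums(is.na(polls_2016))

# Sum each poll's percentages, ignoring NAs, to see how far it is from 100.
poll_totals <- rowSums(polls_2016, na.rm = TRUE)

# Flag polls whose total differs from 100 by more than a small tolerance.
incomplete_poll <- abs(poll_totals - 100) > 0.5

print(missing_per_column)
print(poll_totals)
print(incomplete_poll)
```

The 0.5 tolerance is an arbitrary choice for the sketch; rounding in published polls may justify a looser one.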
Summarize in bullet list each column you plan on using.
2016 popular vote actual results:
- year
- candidate
- actual
2016 538 popular vote forecast:
- cycle
- state
- percentage_clinton
- percentage_trump
- percentage_johnson
- percentage_mcmullin
2020 popular vote actual results:
- candidate_name
- actual_percentage
2020 538 popular vote forecast:
- cycle
- state_name
- pollster_id
- candidate_id
- candidate_name
- percentage
2024 538 popular vote forecast:
- cycle
- state_name
- pollster_id
- candidate_id
- candidate_name
- percentage
Describe each column in a single sentence.
2016 popular vote actual results:
- year: The year in which the presidential election was held.
- candidate: The name of the presidential candidate.
- actual: The actual percentage of the popular vote that the
presidential candidate received.
2016 538 popular vote forecast:
- cycle: The year in which the presidential election was held.
- state: The state in which the poll was conducted.
- percentage_clinton: The percentage of poll respondents who
supported Clinton.
- percentage_trump: The percentage of poll respondents who
supported Trump.
- percentage_johnson: The percentage of poll respondents who
supported Johnson.
- percentage_mcmullin: The percentage of poll respondents who
supported McMullin.
2020 popular vote actual results:
- candidate_name: The name of the presidential candidate.
- actual_percentage: The actual percentage of the popular vote that
the presidential candidate received.
2020 538 popular vote forecast:
- cycle: The year in which the presidential election was held.
- state_name: The state in which the poll was conducted.
- pollster_id: The ID of the pollster.
- candidate_id: The unique ID of the candidate.
- candidate_name: The name of the presidential candidate.
- percentage: The forecasted percentage of the popular vote that
the candidate will receive.
2024 538 popular vote forecast:
- cycle: The year in which the presidential election was held.
- state_name: The state in which the poll was conducted.
- pollster_id: The ID of the pollster.
- candidate_id: The unique ID of the candidate.
- candidate_name: The name of the presidential candidate.
- percentage: The forecasted percentage of the popular vote that
the candidate will receive.
Describe key features of the data (e.g., distributions, averages,
correlations).
The actual 2016 and 2020 presidential election results are exact
and complete.
In 2016, the popular vote was distributed as follows:
- Republican: 46.4%
- Democrat: 48.5%
- Other: 5.1%
In 2020, the popular vote was distributed as follows:
- Republican: 46.9%
- Democrat: 51.3%
- Other: 1.8%
For 2016, using only national polls, the average forecasted share
of the popular vote was as follows:
- Republican: 39.6%
- Democrat: 44.0%
- Other: 16.4%
For 2020, using only national polls, the average forecasted share
of the popular vote was as follows:
- Republican: 42.1%
- Democrat: 50.2%
- Other: 7.7%
For 2024, using only national polls, the average forecasted share
of the popular vote was as follows:
- Republican: 43.3%
- Democrat: 45.4%
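Averages like the ones above can be computed by keeping only the national rows and taking the mean per candidate. The sketch below shows the idea in base R; the data frame `polls` and its values are made up for illustration, and the real datasets may need extra filtering first.

```r
# Hypothetical long-format poll rows (values are made up).
polls <- data.frame(
  state_name     = c("US", "US", "US", "US", "Ohio", "Ohio"),
  candidate_name = c("Biden", "Trump", "Biden", "Trump", "Biden", "Trump"),
  percentage     = c(50, 42, 52, 44, 46, 48)
)

# Keep only the national polls.
national <- polls[polls$state_name == "US", ]

# Average forecasted percentage per candidate.
avg_by_candidate <- aggregate(percentage ~ candidate_name,
                              data = national, FUN = mean)
print(avg_by_candidate)
```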
Have one bullet point per field.
For example, missing values, binned data, 1/0 conversions,
normalizing continuous data, creating categorical features, etc…
Here are the transformations of the data:
- I cleaned all of the datasets with the janitor package.
- I created a tibble of the actual popular vote for 2016 and for
2020.
- I computed the average forecasted popular vote for the
Republican, Democratic, and other candidates for 2016, 2020, and
2024.
- In each tibble, I selected only the columns that I needed.
- I replaced every state name that was NA with “US”.
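The column selection and the NA replacement above could look roughly like this. The report used janitor for cleaning; this sketch sticks to base R so it runs without extra packages, and the data frame `raw_polls`, its `notes` column, and all values are hypothetical.

```r
# Hypothetical raw rows; a missing state is assumed to mean a national poll.
raw_polls <- data.frame(
  cycle          = c(2020, 2020, 2020),
  state_name     = c(NA, "Ohio", NA),
  candidate_name = c("Biden", "Trump", "Trump"),
  percentage     = c(50, 48, 42),
  notes          = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# Keep only the columns used in the analysis.
polls <- raw_polls[, c("cycle", "state_name", "candidate_name", "percentage")]

# Replace missing state names with "US".
polls$state_name[is.na(polls$state_name)] <- "US"
print(polls)
```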
Describe any new fields that you created.
For example, converting from Biden / Trump votes to an over/under
field.
Here are the new fields that I created:
- I converted all percentages to decimal proportions to simplify
calculations.
- I grouped every candidate other than the main Republican or
Democratic candidate into a single “other” candidate.
- I converted NA values in the vote percentages to 0 so that I
could perform calculations.
- I computed the other candidate's share as 1 minus the sum of the
Republican and Democratic shares.
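The new fields listed above can be illustrated with a short base R sketch. The data frame `poll_avg` and its numbers are made up; the column names `republican`, `democrat`, and `other` are assumptions chosen for the example.

```r
# Hypothetical national poll averages in percent (values are made up).
poll_avg <- data.frame(
  cycle      = c(2016, 2020),
  republican = c(40, NA),
  democrat   = c(44, 50)
)

# Treat a missing percentage as 0 so that arithmetic does not return NA.
poll_avg$republican[is.na(poll_avg$republican)] <- 0
poll_avg$democrat[is.na(poll_avg$democrat)] <- 0

# Convert percentages to decimal proportions.
poll_avg$republican <- poll_avg$republican / 100
poll_avg$democrat   <- poll_avg$democrat / 100

# Everything not assigned to the two major candidates becomes "other".
poll_avg$other <- 1 - (poll_avg$republican + poll_avg$democrat)
print(poll_avg)
```

Note that replacing a missing major-party value with 0 moves that share into `other`, which may be one reason an `other` share ends up larger than expected.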
Part 3: Correlations
Describe the results of relationships between key variables (e.g.,
voter turnout vs. age, education).
The 538 data appears to overestimate the share for other
candidates in each year compared to the actual results.
Also, when other candidates are overestimated, Republican
candidates are underestimated the most.
There is a fairly strong negative correlation between the predicted
share for other candidates and the predicted share for the
Republican candidate.
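A correlation like the one described above can be measured with `cor()`. The sketch below uses made-up shares in a hypothetical data frame `pred`; it only shows the mechanics and does not reproduce the values from this analysis.

```r
# Hypothetical predicted shares per poll (values are made up).
pred <- data.frame(
  republican = c(0.38, 0.40, 0.42, 0.44, 0.41),
  democrat   = c(0.44, 0.45, 0.45, 0.46, 0.44)
)

# The other share is defined as the remainder, as in this report.
pred$other <- 1 - (pred$republican + pred$democrat)

# Pearson correlation between the other share and each major-party share.
cor_other_rep <- cor(pred$other, pred$republican)
cor_other_dem <- cor(pred$other, pred$democrat)

print(cor_other_rep)
print(cor_other_dem)
```

Because `other` is computed as 1 minus the two major-party shares, some negative correlation with those shares is built in by construction, so it should be interpreted with care.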
Identify any anomalies or patterns that could affect analysis (e.g.,
outliers, multicollinearity).
My poll averages show lower Democratic and Republican shares and a
higher other share than is plausible given the actual results.
I will need to account for this and adjust the data before it is
useful for analysis.
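One possible adjustment, shown here only as an assumption and not necessarily the method used in this report, is to rescale each poll's shares so that they sum to 1. The data frame `polls` and its values are made up.

```r
# Hypothetical poll shares in percent that do not sum to 100 (made up).
polls <- data.frame(
  republican = c(40, 42),
  democrat   = c(44, 50),
  other      = c(6, 4)
)

# Total of the reported shares for each poll (one value per row).
shares <- as.matrix(polls)
totals <- rowSums(shares)

# Divide every row by its own total so that each poll sums to 1.
normalized <- shares / totals
print(normalized)
```

Rescaling assumes the unreported remainder (for example, undecided respondents) splits in proportion to the reported shares, which may not hold in practice.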
Part 4: Modeling
Document the following models:
Model 1: Simple Linear Regression
Select a single continuous variable (e.g., age) to predict voter
turnout.
Run the regression and document the results, including coefficient
and R-squared values.
Visualize the regression line and assess the goodness of fit.
lm_mod_1 <- lm(actual_percentages ~ percentage,
               data = t_president_polls_US_2020)
summary(lm_mod_1)
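To document the coefficient and R-squared values and to visualize the line, the pieces of a fitted model can be pulled out as sketched below. The data frame `toy` and its values are made up, so the numbers it produces are not the results of this report.

```r
# Hypothetical polled vs. actual percentages (values are made up).
toy <- data.frame(
  percentage         = c(40, 42, 44, 48, 50, 52),
  actual_percentages = c(46, 46.5, 47, 50, 51, 52)
)

fit <- lm(actual_percentages ~ percentage, data = toy)
fit_summary <- summary(fit)

# Intercept and slope of the fitted line.
print(coef(fit))

# Share of variance explained, plain and adjusted.
print(fit_summary$r.squared)
print(fit_summary$adj.r.squared)

# Scatter plot with the fitted regression line drawn on top.
plot(toy$percentage, toy$actual_percentages,
     xlab = "Poll percentage", ylab = "Actual percentage")
abline(fit)
```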
Model 2: Multiple Linear Regression (Continuous Variables)
Include several continuous variables (e.g., age, income, education)
in the model to predict voter turnout.
Analyze the coefficients, p-values, and R-squared of the model.
Evaluate the importance of each variable and the overall model
fit.
# Run the linear regression model
model2 <- lm(actual_percentages ~ percentage + pollster_id,
             data = t_president_polls_US_2020)
# Summarize the model results
summary(model2)
Interpret the model coefficients and metrics
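The coefficient for percentage is close to its value in Model 1,
while the intercept has turned negative.
The coefficient for pollster_id is small, suggesting it contributes
little to the prediction. Because pollster_id is an identifier
rather than a measurement, its numeric coefficient has no natural
interpretation.
The R-squared value is 0.2982, indicating a modest fit; the
positive coefficient for percentage indicates a positive
relationship.
However, the plot of the line of best fit shows many points far
from the line, which suggests many outliers in the data.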
Model 3: Regression with a Categorical Variable
Add a categorical variable (e.g., region, gender) into the
regression along with continuous variables.
Interpret the results, focusing on the impact of the categorical
variable on voting behavior.
Compare this model’s performance with previous ones using adjusted
R-squared and other metrics.
# Run the linear regression model with a categorical variable
model3 <- lm(actual_percentages ~ percentage + pollster_id +
               candidate_name,
             data = t_president_polls_US_2020)
# Summarize the model results
summary(model3)
Interpret the model coefficients and metrics
The coefficient for percentage remains roughly the same, and the
intercept is now positive.
The coefficient for pollster_id is still small, suggesting it
contributes little.
The R-squared value is 1, indicating a perfect fit. This is
unrealistic and suggests overfitting. A likely cause is that each
candidate_name corresponds to a single actual_percentages value, so
the candidate alone determines the response.
The line of best fit shows an unnaturally strong correlation.
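One way an R-squared of 1 can arise is when a categorical predictor determines the response exactly. The sketch below demonstrates this on made-up poll rows in a hypothetical data frame `toy`, where every row for a candidate carries the same actual result (51.3 and 46.9, the 2020 actual shares listed earlier). Whether this is what happened in model3 is an assumption.

```r
# Hypothetical polls: each candidate's actual result repeats on every row.
toy <- data.frame(
  candidate_name     = rep(c("Biden", "Trump"), each = 4),
  percentage         = c(50, 52, 49, 51, 42, 44, 41, 43),
  actual_percentages = rep(c(51.3, 46.9), each = 4)
)

fit_no_name   <- lm(actual_percentages ~ percentage, data = toy)
fit_with_name <- lm(actual_percentages ~ percentage + candidate_name,
                    data = toy)

r2_no_name <- summary(fit_no_name)$r.squared

# summary() warns about an essentially perfect fit; the warning is
# silenced here because the perfect fit is the point of the example.
r2_with_name <- suppressWarnings(summary(fit_with_name))$r.squared

print(r2_no_name)
print(r2_with_name)
```

In this toy case the candidate dummy reproduces the response on its own, so the perfect fit says nothing about how well polls predict results.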
Part 5: Analysis of Results
Compare the performance of the three models based on key metrics
(e.g., R-squared, residuals).
Each model added to my understanding of the relationship between
predicted and actual percentages.
The first model had an R-squared value of approximately 0.2974, and
the second model improved only slightly to 0.2982, a gain of about
0.0008.
These values reflect a modest relationship, with many outliers
affecting both models.
The final model’s R-squared value of 1 indicates overfitting rather
than a genuine improvement.
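A comparison like this can be collected into one table. The sketch below uses simulated data in a hypothetical data frame `toy`, so its numbers are not the results reported above; it mainly shows that plain R-squared cannot decrease when a predictor is added, which is why adjusted R-squared is the fairer comparison.

```r
set.seed(1)

# Simulated data (made up): pollster_id is unrelated to the response.
toy <- data.frame(
  percentage  = runif(40, 40, 55),
  pollster_id = sample(1:500, 40)
)
toy$actual_percentages <- 30 + 0.4 * toy$percentage + rnorm(40, sd = 2)

m1 <- lm(actual_percentages ~ percentage, data = toy)
m2 <- lm(actual_percentages ~ percentage + pollster_id, data = toy)

# Side-by-side metrics for the two nested models.
comparison <- data.frame(
  model         = c("m1", "m2"),
  r_squared     = c(summary(m1)$r.squared, summary(m2)$r.squared),
  adj_r_squared = c(summary(m1)$adj.r.squared, summary(m2)$adj.r.squared),
  resid_se      = c(sigma(m1), sigma(m2))
)
print(comparison)

# F-test of whether the extra predictor improves the fit.
print(anova(m1, m2))
```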
Interpret the significance of each variable across the models. What
factors seem most important in predicting voter turnout?
In the first model, the percentage coefficient was significant and
positively related to actual_percentages.
The second model maintained this significance but introduced
pollster_id, which had minimal impact.
The third model added candidate_name, which pushed R-squared to 1
without adding genuine predictive ability, and this makes its
coefficients difficult to interpret.
Discuss any limitations or potential biases in the models (e.g.,
omitted variable bias, overfitting).
Limitations include potential omitted variable bias, as not all
relevant factors affecting the actual vote share are included.
The perfect fit in the third model suggests overfitting, capturing
noise rather than a true relationship.
Outliers in the data may also skew the results and lead to
misleading interpretations.
Suggest improvements or further steps for analysis (e.g., adding
interaction terms, using different algorithms).
To improve the models, I could add interaction terms to explore how
the predictors jointly relate to the outcome.
Exploring different algorithms, such as decision trees or random
forests, could offer additional insights and help reduce
overfitting.
Implementing cross-validation would further test the robustness of
the models and how well the findings generalize.
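As a rough sketch of the cross-validation idea, k-fold validation of the simple model can be written in base R as below. The data are simulated in a hypothetical data frame `toy`, and the choice of 5 folds and RMSE as the metric are assumptions for the example.

```r
set.seed(42)

# Simulated data (made up) standing in for the poll dataset.
n <- 60
toy <- data.frame(percentage = runif(n, 40, 55))
toy$actual_percentages <- 30 + 0.4 * toy$percentage + rnorm(n, sd = 2)

# Randomly assign each row to one of k folds.
k <- 5
fold <- sample(rep(1:k, length.out = n))

# Fit on k - 1 folds, predict the held-out fold, and record its RMSE.
rmse_per_fold <- sapply(1:k, function(i) {
  train <- toy[fold != i, ]
  test  <- toy[fold == i, ]
  fit   <- lm(actual_percentages ~ percentage, data = train)
  pred  <- predict(fit, newdata = test)
  sqrt(mean((test$actual_percentages - pred)^2))
})

# Average out-of-sample error across the folds.
cv_rmse <- mean(rmse_per_fold)
print(rmse_per_fold)
print(cv_rmse)
```

A model that fits the training data perfectly but has a large out-of-sample error would show up here, which is what makes this a useful check against overfitting.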