Project 2 Outline

Part 1: Data Source

https://projects.fivethirtyeight.com/polls/

I used the presidential polls CSV file, which contains the current presidential general election polls. I also need to include the equivalent dataset from 2020, which will support the regression and give me a better basis for a prediction.

I am planning to use the website provided, FiveThirtyEight. From that website, I aim to find the New York Times polls and track them for any potential bias, especially in swing states. The data I used for my first draft missed a good portion of the New York Times surveys and only captured the first NYT survey I could find. This happened because I filtered on the url column instead of the display_name column, which limited the size of my sample.
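The difference between the two filters can be sketched on a small made-up table. The rows, the example.com URLs, and the pollster label "The New York Times/Siena College" are assumptions for illustration only; the real display_name values should be checked in the CSV before relying on this.

```r
library(dplyr)

# Toy stand-in for the polls file. Every value here is made up for illustration,
# including the pollster label and the URLs.
polls <- tibble(
  display_name = c("The New York Times/Siena College",
                   "The New York Times/Siena College",
                   "Emerson College",
                   "The New York Times/Siena College"),
  url = c("https://example.com/poll-a", "https://example.com/poll-b",
          "https://example.com/poll-c", "https://example.com/poll-d"),
  state = c("Pennsylvania", "Wisconsin", "Pennsylvania", "Arizona"),
  answer = c("Harris", "Trump", "Harris", "Trump"),
  pct = c(49, 47, 48, 50)
)

# Matching on one URL keeps only the poll published at that single address.
by_url <- polls %>%
  filter(url == "https://example.com/poll-a")

# Matching on the pollster name keeps every NYT poll, whatever its URL.
by_name <- polls %>%
  filter(grepl("New York Times", display_name, fixed = TRUE))

nrow(by_url)
nrow(by_name)
```

Filtering on the name is less fragile because each poll release tends to have its own URL, while the pollster label stays the same across releases.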

I plan on using many of the columns from this dataset, including the following:

state: The state that the survey comes from.
display_name: The name of the polling organization that ran the survey.
answer or candidate_name: The candidate that the row refers to, given as either the last name (answer) or the full name (candidate_name).
pct: The percentage of poll respondents who chose that candidate.
nyt_state_avg_2024: A column I created that averages pct for Trump and for Harris in each state, across every poll in the dataset regardless of date.

There are other columns that I may use before the final submission, including…

methodology: The method used to deliver the survey (e.g., Live Phone, Text, Online Panel).
sample_size: The number of responses the survey received.
start_date and end_date: The start and end date of each survey.

Part 2: Data Transformation

Describe any transformations of the data. Have one bullet point per field.
- For example, missing values, binned data, 1/0 conversions, normalizing continuous data, creating categorical features, etc…
Describe any new fields that you created. For example, converting from Biden / Trump votes to an over/under field.

There are a lot of NA values throughout the dataset. In the first few rows of president_polls, you can see NA values in sponsor and sponsor ID. These values are NA because the polls do not have sponsors, so there is no sponsor to assign an ID to. I decided not to use the sponsor columns because sponsors and sponsor IDs are recorded so irregularly. I also noticed many NAs in other sponsor-related columns, especially the “endorsed” columns (_party, _candidate, etc). I imagine that most sponsors (or at least the ones that are included) would not want to endorse a candidate, as it may skew their results. The poll may reach the wrong people, gather data they aren’t looking for, and highlight findings they may not be happy with.
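One way to back up the decision to drop the sponsor columns is to measure how much of each column is missing. The sketch below uses made-up rows, and the column names sponsor and sponsor_id are placeholders that may not match the names in the real file. The 50% cutoff is an arbitrary choice for the example.

```r
library(dplyr)

# Toy rows with made-up values. sponsor and sponsor_id are placeholder names.
polls <- tibble(
  state = c("Texas", "Florida", NA, "Montana"),
  sponsor = c(NA, "Example Sponsor", NA, NA),
  sponsor_id = c(NA, 101, NA, NA),
  pct = c(51, 50, 48, 55)
)

# Share of missing values in each column, sorted from most to least missing.
na_share <- sort(colMeans(is.na(polls)), decreasing = TRUE)
na_share

# Keep only the columns where at most half of the values are missing.
# The 0.5 cutoff is an assumption for this sketch, not a rule.
keep <- names(na_share)[na_share <= 0.5]
polls_trimmed <- polls %>%
  select(all_of(keep))

names(polls_trimmed)
```

Running the same check on the full president_polls table would show which other columns, such as the endorsed ones, are too sparse to be useful.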

When I first formulated the t_state tibble, the state of Nebraska showed up as “Nebraska CD-2”, so I used a mutate statement to change the name from Nebraska CD-2 to plain old “Nebraska”.

mutate(state = ifelse(state == "Nebraska CD-2", "Nebraska", state))

I also found the polling averages of both Trump and Harris for each state. This condenses the data points and, after combing through the data, helps show that there are few to no outliers that would affect the correlations and regressions. I used summarize with the mean function to find the average for each state.

summarize(nyt_state_avg_2024 = mean(pct, na.rm = TRUE))

One issue with this column is that it is not limited to a date range. Pollsters release new results continuously, so the average keeps shifting as polls are added. It may be a good idea to lock the data down to a span of a couple of months, then do the same with the 2020 file so the two years can be compared over matching periods.
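The recode, the date window, and the per-state average can be chained into one pipeline. This is a minimal sketch on made-up rows: the window dates are arbitrary, and end_date is assumed to already be a Date column (the real file may store dates as text that has to be parsed first).

```r
library(dplyr)

# Toy rows with made-up values.
polls <- tibble(
  state = c("Nebraska CD-2", "Nebraska", "Nebraska",
            "Pennsylvania", "Pennsylvania", "Pennsylvania"),
  answer = c("Harris", "Harris", "Trump", "Harris", "Trump", "Harris"),
  pct = c(52, 40, 55, 49, 47, 44),
  end_date = as.Date(c("2024-09-20", "2024-09-25", "2024-09-25",
                       "2024-10-05", "2024-10-05", "2024-03-01"))
)

# Hypothetical window: only polls that ended between these two dates are kept.
window_start <- as.Date("2024-08-01")
window_end <- as.Date("2024-10-31")

t_state <- polls %>%
  # Fold the district-level label into the statewide label.
  mutate(state = ifelse(state == "Nebraska CD-2", "Nebraska", state)) %>%
  # Lock the data down to the chosen window.
  filter(end_date >= window_start, end_date <= window_end) %>%
  # One average per state and candidate.
  group_by(state, answer) %>%
  summarize(nyt_state_avg_2024 = mean(pct, na.rm = TRUE), .groups = "drop")

t_state
```

Without group_by, summarize would return a single overall mean instead of one per state and candidate. In this toy data the recode also changes the Nebraska average for Harris, because the district poll and the statewide poll are averaged together, which is worth keeping in mind when reading the Nebraska numbers.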

Part 3: Correlations

At this point in my project, I do not have many variables that correlate with each other. Sometime this week I hope to bring in the polling data from the same time period in the 2020 election to see if there are any trends. That comparison is what should show the correlation that supports my thesis statement.

Based on my knowledge of elections, the candidates, and the current state of the United States, I do not expect to see drastic changes between the two elections. This means that the 2020 and 2024 figures should be somewhat similar and will hopefully show a solid correlation, which I am aiming to have near the 0.25 level.
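Once both years are summarized by state, the comparison comes down to a join and a call to cor. The sketch below assumes two hypothetical tables of per-state Trump averages with made-up numbers; the table and column names are placeholders.

```r
library(dplyr)

# Hypothetical per-state averages for each year. All numbers are made up.
avg_2024 <- tibble(
  state = c("Florida", "Montana", "Pennsylvania", "Texas", "Wisconsin"),
  trump_avg_2024 = c(51, 56, 47, 52, 47)
)
avg_2020 <- tibble(
  state = c("Florida", "Pennsylvania", "Texas", "Wisconsin", "Ohio"),
  trump_avg_2020 = c(47, 45, 49, 44, 48)
)

# inner_join keeps only the states that were polled in both years.
both_years <- inner_join(avg_2024, avg_2020, by = "state")

# Pearson correlation between the 2020 and 2024 averages.
r <- cor(both_years$trump_avg_2024, both_years$trump_avg_2020)

both_years
r
```

States that appear in only one year drop out of the join, so the number of rows in both_years is worth checking before reading too much into the correlation.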

It might be beneficial to rework or recreate some of the tibbles in my project. It is hard to establish predictions because of the limited number of columns and rows in my t_state tibble. I may end up dropping the New York Times filter to get more data, as New York Times polls account for only 310 of the 15,536 records (about 2%). I have also limited my final table to 3 variables, down from 52, which is only about 6% of the columns in the CSV file that I imported.

Part 4: Modeling

Document the following models:

My models aren’t working at this time. I am looking for suggestions on which models I could run, and I will work on this during the day on 10/23, but it has been somewhat difficult to find time for research. I am leaving the model descriptions below so I can reference them when I work on the models. My idea is to take the same months from both 2020 and 2024, compare the regressions for the two years, and draw my predictions from the differences.

Model 1: Simple Linear Regression
- Select a single continuous variable (e.g., age) to predict voter turnout.
- Run the regression and document the results, including coefficient and R-squared values.
- Visualize the regression line and assess the goodness of fit.

Model 2: Multiple Linear Regression (Continuous Variables)
- Include several continuous variables (e.g., age, income, education) in the model to predict voter turnout.
- Analyze the coefficients, p-values, and R-squared of the model.
- Evaluate the importance of each variable and the overall model fit.

Model 3: Regression with a Categorical Variable
- Add a categorical variable (e.g., region, gender) into the regression along with continuous variables.
- Interpret the results, paying attention to the impact of the categorical variable on voting behavior.
- Compare this model’s performance with the previous ones, using adjusted R-squared and other metrics.
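As a starting point, the three model types can be fit with lm from base R. This is only a sketch under assumptions: the data frame is made up, and the choice of avg_2020, sample_size, and methodology as predictors of avg_2024 is one possible adaptation of the template to the polling data, not a settled design.

```r
# Made-up per-state rows. Column names and values are placeholders.
states <- data.frame(
  avg_2020 = c(47, 45, 49, 44, 52, 41, 55, 48),
  sample_size = c(800, 1200, 950, 700, 1100, 600, 1000, 900),
  methodology = factor(c("Live Phone", "Online Panel", "Live Phone", "Text",
                         "Online Panel", "Text", "Live Phone", "Online Panel")),
  avg_2024 = c(49, 46, 51, 45, 54, 42, 56, 49)
)

# Model 1: one continuous predictor.
model_1 <- lm(avg_2024 ~ avg_2020, data = states)

# Model 2: several continuous predictors.
model_2 <- lm(avg_2024 ~ avg_2020 + sample_size, data = states)

# Model 3: continuous predictors plus a categorical one.
# lm turns the factor into indicator columns automatically.
model_3 <- lm(avg_2024 ~ avg_2020 + sample_size + methodology, data = states)

# Coefficients and R-squared for the simple model.
coef(model_1)
summary(model_1)$r.squared

# Adjusted R-squared for comparing the three models.
adj_r2 <- sapply(
  list(model_1 = model_1, model_2 = model_2, model_3 = model_3),
  function(m) summary(m)$adj.r.squared
)
adj_r2
```

Calling summary on any of the models also prints the p-values for each coefficient, and plot(states$avg_2020, states$avg_2024) followed by abline(model_1) draws the regression line for the simple model.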

Part 5: Analysis of Results

From what I have so far, it looks like Trump has the advantage in states such as Montana, Florida, and Texas. The nyt_state_avg_2024 column in the t_state tibble shows that Trump has a marginal lead over Kamala Harris in these states. However, there are other key states, such as Nebraska, Wisconsin, and my home state of Pennsylvania, where Harris has a substantial lead. Other states have much closer polling averages over time.

One detail that makes this analysis more interesting is how Nebraska allocates its electoral votes and how they can be split within the state. Instead of giving all of its electoral votes to the statewide popular-vote winner, Nebraska (like Maine) awards two electoral votes to the statewide winner and one to the winner of each congressional district. Nebraska was split in 2020, with most of its electoral votes going Red, but could Kamala potentially flip the state of Nebraska based on New York Times polls? The New York Times polls show that she has had a lead over the past few months. Could this be the result of bias, considering Trump won the majority of electoral votes in Nebraska? This is what I hope to predict for next Tuesday’s submission.