https://rpubs.com/schen0181/1235785

Part 1: Data Source

Files: pca_states.csv, president_polls.csv, president_polls_historical.csv, pollster-ratings-combined.csv

Links: https://projects.fivethirtyeight.com/polls/president-primary-r/2024/national/ and https://www.fec.gov/resources/cms-content/documents/2020presgeresults.pdf

I plan to use the pca_states data as independent variables that influence the voting prediction. The variables I will primarily use are “over_15_and_never_married”, “bachelors_degree_or_higher”, and “households_number”. I believe that marital status and level of education affect how someone votes. I’m still testing whether other variables would work better once randomness and chance are taken into account.

I plan to use the president_polls data as the testing data, because the current election result is what needs to be predicted.

I plan to use president_polls_historical as the training data. It also supplies the dependent variable, since it records the party and the percentage of polled respondents who supported each party. The columns that I’ll use are: “party”, “state”, “pct”

The state_wins data shows the result of the 2020 election, with the number of votes for each candidate by state. The columns are: “state”, “votes”, “party”

Part 2: Data Transformation

-A new field, dem_win, shows whether the Democrats won the state (1) or not (0). Having a 1/0 field makes the analysis easier to build.
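As a rough illustration of how a dem_win field could be derived from the “state”, “votes”, and “party” columns, here is a minimal Python sketch. The function name add_dem_win, the party label "DEM", and the sample rows are assumptions made for illustration; they are not taken from the actual data.

```python
def add_dem_win(rows, dem_label="DEM"):
    """Return {state: 1 or 0}; 1 means the party with the most votes is dem_label."""
    leader = {}  # state -> (votes, party) of the row with the most votes so far
    for row in rows:
        state, votes, party = row["state"], row["votes"], row["party"]
        if state not in leader or votes > leader[state][0]:
            leader[state] = (votes, party)
    return {state: int(party == dem_label) for state, (_, party) in leader.items()}


# Hypothetical rows for illustration only, not real 2020 totals.
sample = [
    {"state": "A", "votes": 120, "party": "DEM"},
    {"state": "A", "votes": 100, "party": "REP"},
    {"state": "B", "votes": 80, "party": "DEM"},
    {"state": "B", "votes": 95, "party": "REP"},
]
dem_win = add_dem_win(sample)
```

The resulting 1/0 values can then be joined back to the state-level table by state name.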

Part 3: Correlations

The relationship between the percent that chose Democrats when polled and the two predictors, bachelors/higher and over_15/never_married, is strong: R-squared is 0.7338 and the p-values are less than 0.05. An R-squared of 0.7338 means the model accounts for about 73% of the variation in the response, where 1 would be a perfect fit. The p-values mean that, if there were no real relationship, a result this strong would be unlikely to arise from randomness alone.

Maine CD and the District of Columbia are outliers, and I deleted them from the data because they would skew the analysis. Removing them matters little for the result, since Maine already has its own entry and the District of Columbia isn’t really a state.

Part 4: Modeling

Document the following models:

Model 1: Simple Linear Regression - Select a single continuous variable (e.g., age) to predict voter turnout. - Run the regression and document the results, including coefficient and R-squared values. - Visualize the regression line and assess the goodness of fit.

I chose education level to predict voter turnout. The coefficient is positive, so turnout rises with education level, and the R-squared of 0.6 shows a reasonably good model: it accounts for about 60% of the variation in the response variable. The regression line is linear with a positive slope.
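To make the coefficient and R-squared concrete, the following Python sketch fits a one-variable least-squares line by hand. The function name fit_simple_ols and the numbers are hypothetical and only illustrate the calculation; they are not the project's data or the code behind the results reported here.

```python
def fit_simple_ols(x, y):
    """Fit y = intercept + slope * x by least squares; return (slope, intercept, r2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R-squared = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1 - ss_res / ss_tot
    return slope, intercept, r2


# Hypothetical values: percent with a bachelor's degree vs. percent polled Democratic.
bachelors = [20, 25, 30, 35, 40]
pct_dem = [42, 45, 51, 52, 58]
slope, intercept, r2 = fit_simple_ols(bachelors, pct_dem)
```

A positive slope corresponds to the positive relationship described above, and r2 is the share of the variation in the response that the line accounts for.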

Model 2: Multiple Linear Regression (Continuous Variables) - Include several continuous variables (e.g., age, income, education) in the model to predict voter turnout. - Analyze the coefficients, p-values, and R-squared of the model. - Evaluate the importance of each variable and the overall model fit.

-The variables are households_number and votes; the summary also reports a p-value for bachelors. The p-values for votes are smaller than 0.05, so that effect is unlikely to be due to randomness, but the p-value for bachelors is 0.09, so it is not significant at the 0.05 level. R-squared is 0.3921, which means the model accounts for only about 39% of the variance, so it’s an okay model rather than a strong one. The relationship is positive and linear, but the fit is only moderate.

Model 3: Regression with a Categorical Variable - Add a categorical variable (e.g., region, gender) into the regression along with continuous variables. - Interpret the results, paying attention to the impact of the categorical variable on voting behavior. - Compare this model’s performance with the previous ones, using adjusted R-squared and other metrics.

The variables used are bachelors, marital status, and dem_win. This summary is unreliable. The p-values are all greater than 0.05, meaning that the results are not statistically significant and could plausibly come from random chance, while R-squared is 0.066, which is too low for a good model: it accounts for under 7% of the variance.

Part 5: Analysis of Results

Performance is best for the simple linear regression because its p-value is less than 0.05 and its R-squared is 0.6, which means that it accounts for most of the variance (about 60%). The factors that are most important in predicting voter turnout are education and marital status.
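Since the models use different numbers of predictors, adjusted R-squared is a fairer basis for comparing them than plain R-squared. The Python sketch below applies the standard formula, 1 - (1 - R²)(n - 1)/(n - p - 1). The number of observations n = 50 is an assumption made for illustration, because the sample size is not stated here.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared: penalizes R-squared for each extra predictor."""
    if n_obs - n_predictors - 1 <= 0:
        raise ValueError("need more observations than predictors + 1")
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# n_obs = 50 is a hypothetical sample size, not taken from the data.
simple_model = adjusted_r2(0.6, n_obs=50, n_predictors=1)
```

Adjusted R-squared is never larger than R-squared, and the gap widens as predictors are added without improving the fit.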

I’m not sure, but I believe dem_win might cause limitations because it only takes the values 0 or 1. That could skew a linear model, and I might need to use logistic regression to get a better picture for analysis. I still need to do the training and testing because I ran into some errors when coding it. I also need to give more detailed reasoning.
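For a 0/1 outcome such as dem_win, one option is logistic regression, which models the probability of a 1 rather than the value itself. Below is a minimal Python sketch of a one-predictor logistic regression fitted by plain gradient descent. The function names, learning rate, step count, and data are all hypothetical; a real analysis would more likely use a statistics package than this hand-written loop.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(x, y, lr=0.1, steps=5000):
    """Fit P(y = 1) = sigmoid(w * x + b) by gradient descent on the log-loss."""
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        errors = [sigmoid(w * xi + b) - yi for xi, yi in zip(x, y)]
        grad_w = sum(e * xi for e, xi in zip(errors, x)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical centered predictor (e.g. education level) and 0/1 outcome.
x = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
y = [0, 0, 0, 1, 0, 1, 1, 1]
w, b = fit_logistic(x, y)
prob_high = sigmoid(w * 2.0 + b)   # predicted probability at a high predictor value
prob_low = sigmoid(w * -2.0 + b)   # predicted probability at a low predictor value
```

The fitted model returns a probability for each state, which can be turned into a predicted win or loss by applying a cutoff such as 0.5.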