hw10_regression

Part 1: Data Source

What datasets are you using? Give the source URLs

Files: pca_states.csv, president_polls.csv, president_polls_historical.csv, pollster-ratings-combined.csv
Source URLs:
- https://projects.fivethirtyeight.com/polls/president-primary-r/2024/national/
- https://www.fec.gov/resources/cms-content/documents/2020presgeresults.pdf

Review each dataset for completeness, checking for missing values and outliers.
Summarize in bullet list each column you plan on using.
Describe in a single sentence the column
Describe key features of the data (e.g., distributions, averages, correlations).

For the pca_states data, I plan to use its columns as independent variables that influence the voting prediction. The variables that I will primarily be using are: “over_15_and_never_married”, “bachelors_degree_or_higher”, and “households_number”. I believe that marital status and level of education affect the way that someone votes. I’m still testing whether other variables would work better once randomness and chance are taken into account.

For the president_polls data, I plan to use this as the testing data because the current election result needs to be predicted.

For the president_polls_historical data, I plan to use it as training data. It also supplies the dependent variable, since it records the party and the percentage of polled people who chose each party. The columns that I’ll use are: “party”, “state”, and “pct”.

The state_wins data shows the result of the 2020 election, with the number of votes for each candidate by state. The columns that I’ll use are: “state”, “votes”, and “party”.

Part 2: Data Transformation

Describe any transformations of the data. Have one bullet point per field.
- For example, missing values, binned data, 1/0 conversions, normalizing continuous data, creating categorical features, etc…
- I deleted rows with NA or missing values. The missing value is normally in a string column like state, because the poll was a national poll.
- If Democratic votes are greater than Republican votes, the new column shows 1; otherwise it shows 0.
- Removed the Maine CDs and the District of Columbia because the other datasets did not include them.
- Averaged the poll percentages by state because it is harder to work with many different percentages per state.
- Set state to 0 after the merge because the cor function only accepts numeric input.
- Inner joined the datasets by state.
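The cleaning steps listed above can be sketched roughly as follows. This is only an illustration in plain Python, not the code used for the homework. The rows are made-up toy data, and the "Maine CD-1" label and the "DEM"/"REP" party codes are assumptions about how the files encode those values.

```python
from collections import defaultdict

# Toy rows standing in for president_polls_historical.csv (all values made up).
polls = [
    {"state": "State A", "party": "DEM", "pct": 45.0},
    {"state": "State A", "party": "DEM", "pct": 47.0},
    {"state": "State A", "party": "REP", "pct": 50.0},
    {"state": None, "party": "DEM", "pct": 51.0},  # national poll, no state
    {"state": "Maine CD-1", "party": "DEM", "pct": 55.0},  # assumed label
    {"state": "District of Columbia", "party": "DEM", "pct": 90.0},
    {"state": "State B", "party": "DEM", "pct": 44.0},
]

# Toy rows standing in for pca_states.csv (all values made up).
demographics = [
    {"state": "State A", "bachelors_degree_or_higher": 0.30},
    {"state": "State C", "bachelors_degree_or_higher": 0.35},
]


def clean(rows):
    """Drop rows with missing values, Maine CDs, and the District of Columbia."""
    kept = []
    for row in rows:
        if any(value is None for value in row.values()):
            continue
        state = row["state"]
        if state.startswith("Maine CD") or state == "District of Columbia":
            continue
        kept.append(row)
    return kept


def average_pct(rows):
    """Average pct for every (state, party) pair."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["state"], row["party"])].append(row["pct"])
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


def inner_join(averages, demo_rows, party="DEM"):
    """Keep only states present in both inputs (an inner join on state)."""
    demo_by_state = {row["state"]: row for row in demo_rows}
    merged = []
    for (state, row_party), pct in averages.items():
        if row_party == party and state in demo_by_state:
            merged.append({**demo_by_state[state], "pct": pct})
    return merged


merged = inner_join(average_pct(clean(polls)), demographics)
print(merged)
```

Only "State A" survives the join here, because "State B" has no demographic row and "State C" has no poll rows. That is the same reason an inner join can silently shrink the real dataset.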
Describe any new fields that you created. For example, converting from Biden / Trump votes to a over/under field.

- The new field is dem_win, which shows whether the Democrats won the state. Having a 1/0 value makes the analysis easier.
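A minimal sketch of how a dem_win field could be derived from state-level vote counts. It is written in plain Python for illustration. The vote counts are made up, and the "DEM"/"REP" codes are assumed rather than taken from the actual file.

```python
from collections import defaultdict

# Toy rows standing in for the state_wins data (vote counts are made up).
state_wins = [
    {"state": "State A", "party": "DEM", "votes": 120},
    {"state": "State A", "party": "REP", "votes": 100},
    {"state": "State B", "party": "DEM", "votes": 80},
    {"state": "State B", "party": "REP", "votes": 95},
]


def add_dem_win(rows):
    """Return {state: 1 if Democratic votes exceed Republican votes else 0}."""
    votes = defaultdict(dict)
    for row in rows:
        votes[row["state"]][row["party"]] = row["votes"]
    return {
        state: int(by_party.get("DEM", 0) > by_party.get("REP", 0))
        for state, by_party in votes.items()
    }


dem_win = add_dem_win(state_wins)
print(dem_win)
```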

Part 3: Correlations

Describe the results of relationships between key variables (e.g., voter turnout vs. age, education).

The relationship between the percent that favored Democrats when polled and the two predictors, bachelors/higher and over_15/never_married, is strong: R-squared is 0.7338 and the p-values are less than 0.05. An R-squared of 0.7338 means that the model accounts for about 73% of the variation in the polled percent, where 1 would be a perfect model. The p-values below 0.05 mean that there is a low chance that the result is due to randomness.
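To make the idea of pairwise relationships concrete, here is a small sketch of the Pearson correlation computed by hand in plain Python. The column names come from the write-up, but the numbers are made-up toy values, so the printed correlations say nothing about the real data.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Toy columns (made-up values) named after the fields used in the analysis.
columns = {
    "pct": [42.0, 48.0, 51.0, 55.0, 60.0],
    "bachelors_degree_or_higher": [0.25, 0.28, 0.33, 0.36, 0.41],
    "over_15_and_never_married": [0.30, 0.29, 0.33, 0.35, 0.38],
}

# Correlation for every pair of columns, including predictor vs. predictor.
pairs = {
    (a, b): pearson(columns[a], columns[b])
    for a, b in combinations(columns, 2)
}
for (a, b), r in pairs.items():
    print(f"{a} vs {b}: r = {r:.3f}")
```

A high correlation between the two predictors themselves would be a hint of multicollinearity, which is worth checking before reading too much into the individual p-values.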

Identify any anomalies or patterns that could affect analysis (e.g., outliers, multicollinearity).

The main outliers are the Maine CDs and the District of Columbia, which I deleted from the data because they would skew the analysis. Removing them should have little effect on the result, because Maine as a whole already has its own row and the District of Columbia isn’t really a state.

Part 4: Modeling

Document the following models:

Model 1: Simple Linear Regression
- Select a single continuous variable (e.g., age) to predict voter turnout.
- Run the regression and document the results, including coefficient and R-squared values.
- Visualize the regression line and assess the goodness of fit.

I chose education level to predict voter turnout. The coefficient is positive, which shows a positive relationship, and the R-squared of 0.6 means the model explains about 60% of the variance, so it has a good chance of predicting the value of the response variable. The fitted regression line is linear with a positive slope.
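For reference, the slope, intercept, and R-squared of a one-variable regression can be computed directly from their definitions. The sketch below uses plain Python and made-up toy numbers, not the homework data, so its outputs are not the results reported above.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(x, y, slope, intercept):
    """Share of the variance in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1 - ss_res / ss_tot


# Toy data (made up): x could be an education measure, y a polled percent.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]

slope, intercept = fit_line(x, y)
r2 = r_squared(x, y, slope, intercept)
print(f"slope={slope:.3f} intercept={intercept:.3f} R^2={r2:.3f}")
```

A positive slope corresponds to the "positive, linear" line described above, while R-squared says how tightly the points cluster around that line.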

Model 2: Multiple Linear Regression (Continuous Variables)
- Include several continuous variables (e.g., age, income, education) in the model to predict voter turnout.
- Analyze the coefficients, p-values, and R-squared of the model.
- Evaluate the importance of each variable and the overall model fit.

- The variables are households_number and votes. The p-value for votes is smaller than 0.05, so there is a low chance that its effect is due to randomness, but the p-value for bachelors is 0.09, so its effect could be due to randomness. R-squared is 0.3921, which means that it’s an okay model that accounts for about 39% of the variance, less than half. The relationship is positive and linear, but the fit is only moderate.
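As an illustration of what a multiple regression fit does under the hood, the sketch below solves the normal equations in plain Python. The data are made up and deliberately noise-free (y = 1 + 2*x1 + 3*x2), so the fit recovers those coefficients exactly. Real data would not behave this cleanly, and p-values are left out because they need a t-distribution that the standard library does not provide.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(rows, y):
    """Return [intercept, b1, b2, ...] from the normal equations X'X b = X'y."""
    design = [[1.0] + list(row) for row in rows]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


def r_squared(rows, y, coefs):
    """Share of the variance in y explained by the fitted model."""
    preds = [coefs[0] + sum(c * v for c, v in zip(coefs[1:], row)) for row in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - p) ** 2 for yi, p in zip(y, preds))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Toy predictors (x1, x2) and a noise-free response y = 1 + 2*x1 + 3*x2.
rows = [(1, 0), (0, 1), (1, 1), (2, 1), (1, 2)]
y = [3, 4, 6, 8, 9]

coefs = fit_ols(rows, y)
r2 = r_squared(rows, y, coefs)
print(coefs, r2)
```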

Model 3: Regression with a Categorical Variable
- Add a categorical variable (e.g., region, gender) into the regression along with continuous variables.
- Interpret the results, paying attention to the impact of the categorical variable on voting behavior.
- Compare this model’s performance with the previous ones, using adjusted R-squared and other metrics.

The variables used are bachelors, marital status, and dem_win. This summary is unreliable. The p-values are all greater than 0.05, meaning that the results are consistent with what’s expected from random chance, and R-squared is 0.066, which is too low for a good model because it explains only about 7% of the variance.
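Since dem_win is already a 0/1 value, it can enter a regression as is. A categorical variable with more than two levels, such as the region example in the prompt, would first need dummy coding. Below is a minimal sketch in plain Python. The region labels are hypothetical and do not appear in the datasets above.

```python
def one_hot(values, baseline=None):
    """Dummy-code a categorical column, dropping one baseline level.

    Returns the kept level names and one 0/1 row per input value.
    Dropping a baseline avoids perfect collinearity with the intercept.
    """
    levels = sorted(set(values))
    if baseline is None:
        baseline = levels[0]
    kept = [level for level in levels if level != baseline]
    rows = [[1 if value == level else 0 for level in kept] for value in values]
    return kept, rows


# Hypothetical region labels, one per state row.
regions = ["South", "West", "South", "Northeast", "West"]
kept, dummies = one_hot(regions)
print(kept)
print(dummies)

# The dummy columns can then be placed next to the continuous predictors.
bachelors = [0.25, 0.33, 0.28, 0.41, 0.36]  # made-up values
design_rows = [[b] + d for b, d in zip(bachelors, dummies)]
```

Each dummy coefficient is then read as the difference from the baseline level, holding the continuous predictors fixed.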

Part 5: Analysis of Results

Compare the performance of the three models based on key metrics (e.g., R-squared, residuals).
Interpret the significance of each variable across the models. What factors seem most important in predicting voter turnout?

Performance is best for the simple linear regression because its p-value is less than 0.05 and its R-squared was 0.6, which means that it accounts for most of the variance. The factors that are most important in predicting voter turnout are education and marital status.
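Because the three models use different numbers of predictors, adjusted R-squared is a fairer basis for comparison than raw R-squared. The sketch below applies the standard formula to the R-squared values reported in this write-up. The number of observations is not stated anywhere in the write-up, so `N_OBS` is a hypothetical placeholder and the printed values are illustrative only.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 for a model with k predictors fitted on n observations."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


N_OBS = 50  # HYPOTHETICAL: replace with the real number of rows after cleaning

# (reported R-squared, number of predictors) for each model in the write-up.
models = {
    "Model 1 (simple)": (0.6, 1),
    "Model 2 (multiple)": (0.3921, 2),
    "Model 3 (categorical)": (0.066, 3),
}

adjusted = {
    name: adjusted_r_squared(r2, N_OBS, k) for name, (r2, k) in models.items()
}
for name, value in adjusted.items():
    print(f"{name}: adjusted R^2 = {value:.4f}")
```

The adjustment always lowers the value slightly, and the penalty grows with the number of predictors. A model that adds variables without explaining more variance therefore scores worse.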

Discuss any limitations or potential biases in the models (e.g., omitted variable bias, overfitting).
Suggest improvements or further steps for analysis (e.g., adding interaction terms, using different algorithms).

I’m not sure, but I believe dem_win might cause limitations because it’s a 0 or 1. A binary variable could skew a linear regression, so I might need to use logistic regression to get a better picture for analysis. I still need to do the training and testing split because I ran into some errors when coding it. I also need to explain my reasoning in more detail.
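As a rough sketch of those two next steps, the code below does a train/test split and fits a one-variable logistic regression for a 0/1 outcome such as dem_win, using plain gradient descent in plain Python. The data are made up, the predictor is assumed to be centered on its mean, and the learning rate and epoch count are arbitrary illustrative choices.

```python
import math
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=1.0, epochs=5000):
    """Fit P(y=1) = sigmoid(w*x + b) by gradient descent on the log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def predict(x, w, b):
    """Predict 1 when the modelled probability is at least 0.5."""
    return int(sigmoid(w * x + b) >= 0.5)


# Toy data (made up): (centered predictor, dem_win).
data = [(-0.4, 0), (-0.3, 0), (-0.2, 0), (-0.1, 0),
        (0.1, 1), (0.2, 1), (0.3, 1), (0.4, 1)]

train, test = train_test_split(data)
w, b = fit_logistic([x for x, _ in train], [y for _, y in train])
accuracy = sum(predict(x, w, b) == y for x, y in test) / len(test)
print(f"w={w:.2f} b={b:.2f} test accuracy={accuracy:.2f}")
```

Unlike a linear model, the logistic model keeps its predictions between 0 and 1, which is the main reason it suits a win/lose outcome better.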