This analysis uses popular vote data from 2016 and 2020 to predict the popular vote per candidate for the 2024 election. The data includes the election year, the candidate, the grade of the pollster, and the popular vote prediction per pollster. If the Project 538 pollster data overpredicts the third-party candidate and underpredicts the Republican and Democratic candidates, I expect the actual popular vote to show fewer votes for the third-party candidate and more votes for the Republican and Democratic candidates. Proportionally, though, the Republican candidate should pick up more of the third-party votes than the Democratic candidate. If instead the predictions for all three candidates are close to the actual results, the Project 538 pollster data can be used as is.
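One way to check the hypothesis above is to compute the signed error (predicted minus actual) for each party and average it. The sketch below is illustrative only: the data frame, its column names (`party`, `predicted`, `actual`), and every number in it are made up, not taken from the Project 538 data.

```r
# Made-up numbers for illustration only; not real poll or election results.
# Column names are hypothetical.
polls <- data.frame(
  party     = c("DEM", "REP", "THIRD", "DEM", "REP", "THIRD"),
  predicted = c(46, 42, 8, 47, 43, 6),   # predicted vote share (percent)
  actual    = c(48, 46, 4, 48, 46, 4),   # actual vote share (percent)
  stringsAsFactors = FALSE
)

# Signed error: positive means the polls over-predicted that party,
# negative means they under-predicted it.
polls$error <- polls$predicted - polls$actual

# Average signed error per party.
mean_error <- aggregate(error ~ party, data = polls, FUN = mean)
print(mean_error)
```

Under the hypothesis, the third-party row would show a positive mean error and both major-party rows a negative one, with the Republican error larger in magnitude.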
Project 538 provided more than thirty thousand rows of data from the 2016, 2020, and 2024 presidential elections. Twenty thousand rows related to specific states were excluded.
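Excluding the state-specific rows could look roughly like the following. This is a minimal sketch in base R on a toy data frame; the column names (`cycle`, `state`, `candidate`, `grade`, `pct`) are hypothetical, and it assumes that national polls are the rows with no state recorded, which may not match how the original data marks them.

```r
# Toy stand-in for the poll data; column names and values are hypothetical.
polls <- data.frame(
  cycle     = c(2016, 2016, 2020, 2020, 2024),
  state     = c(NA, "Ohio", NA, "Texas", NA),
  candidate = c("Candidate A", "Candidate A", "Candidate B",
                "Candidate C", "Candidate D"),
  grade     = c("A", "B", "A+", "B", "A"),
  pct       = c(46.0, 44.5, 51.2, 49.0, 48.3),
  stringsAsFactors = FALSE
)

# Assumption: a missing state marks a national poll.
# Keep only those rows and the variables used in the analysis.
national <- polls[is.na(polls$state), c("cycle", "candidate", "grade", "pct")]
print(national)
```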
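Key variables included the election year, the candidate, the pollster's grade, and each pollster's popular vote prediction.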
Call:
lm(formula = proficiency ~ enroll + tlocrev + ppcstot + unemployed +
    at_least_bachelor_education + household_income, data = t_train)

Residuals:
    Min      1Q  Median      3Q     Max
-6.7360 -2.9337 -0.2146  2.3756  8.5571

Coefficients:
                               Estimate  Std. Error t value Pr(>|t|)
(Intercept)                  8.59405726 10.62009737   0.809  0.42418
enroll                       0.00045645  0.00052416   0.871  0.39014
tlocrev                     -0.00001838  0.00008596  -0.214  0.83201
ppcstot                      0.00070458  0.00044490   1.584  0.12281
unemployed                  -0.19554527  0.27793175  -0.704  0.48663
at_least_bachelor_education  0.39826973  0.13013574   3.060  0.00437 **
household_income             0.00007607  0.00012306   0.618  0.54074
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.912 on 33 degrees of freedom
Multiple R-squared:  0.5629,	Adjusted R-squared:  0.4834
F-statistic: 7.082 on 6 and 33 DF,  p-value: 0.00006584
Root Mean Squared Error (RMSE): 4.939345
R-squared: 0.3152698
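The RMSE and R-squared reported above are the kind of metrics one gets by scoring a fitted model on held-out rows. The sketch below shows one common way to compute them. It uses the built-in `mtcars` data as a stand-in, since the original `t_train` and `t_test` are not available here, and the R-squared formula (1 minus SSE over SST) is an assumption; the original may have used the squared correlation instead.

```r
# Stand-in data: built-in mtcars, not the school proficiency data.
set.seed(1)
train_idx <- sample(nrow(mtcars), size = 24)
t_train <- mtcars[train_idx, ]
t_test  <- mtcars[-train_idx, ]

# Fit on the training rows, predict on the held-out rows.
fit  <- lm(mpg ~ wt + hp, data = t_train)
pred <- predict(fit, newdata = t_test)

# Root mean squared error on the test set.
rmse <- sqrt(mean((t_test$mpg - pred)^2))

# Out-of-sample R-squared: 1 - SSE / SST (one of several possible definitions).
sse <- sum((t_test$mpg - pred)^2)
sst <- sum((t_test$mpg - mean(t_test$mpg))^2)
r2  <- 1 - sse / sst

cat("Root Mean Squared Error (RMSE):", rmse, "\n")
cat("R-squared:", r2, "\n")
```

A test R-squared well below the training value, as in the output above (about 0.32 against 0.56), usually suggests the model fits the training rows better than it generalizes.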
We don’t have an equal number of spam and non-spam messages. We can use bootstrapping (resampling with replacement) to create a more balanced dataset.
Note that we reused the function from earlier to return a t_train and a t_test from our bootstrapped sample.
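The balancing step can be sketched as follows. This is a minimal base R illustration, not the original code: the `messages` data frame, its `label` column, and the `split_data()` helper are hypothetical stand-ins for the real data and for the function from earlier, and the 80/20 split is an assumption.

```r
set.seed(42)

# Toy imbalanced data: 20 spam and 80 non-spam ("ham") messages.
messages <- data.frame(
  id    = 1:100,
  label = c(rep("spam", 20), rep("ham", 80)),
  stringsAsFactors = FALSE
)

# Resample every class with replacement up to the size of the largest class.
n_per_class   <- max(table(messages$label))
rows_by_class <- split(seq_len(nrow(messages)), messages$label)
boot_idx <- unname(unlist(lapply(rows_by_class, function(rows) {
  rows[sample.int(length(rows), size = n_per_class, replace = TRUE)]
})))
boot <- messages[boot_idx, ]
print(table(boot$label))

# Hypothetical stand-in for the earlier splitting function.
split_data <- function(df, train_frac = 0.8) {
  train_idx <- sample(nrow(df), size = floor(train_frac * nrow(df)))
  list(t_train = df[train_idx, ], t_test = df[-train_idx, ])
}

parts   <- split_data(boot)
t_train <- parts$t_train
t_test  <- parts$t_test
```

One caveat with this ordering: because the bootstrap duplicates rows before the split, copies of the same message can land in both t_train and t_test, which tends to make test metrics look better than they are. Splitting first and bootstrapping only the training rows avoids that.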