This document contains four new tables that build on the configuration 3 table from previous updates. The new configurations (4-7) are:
Configuration 4 - uses own_party_vote_share_jacknife instead of the win record used in configuration 3. This is a party's average vote share in a state over all years excluding the current one.
Configuration 5 - uses own_party_vote_share_ratio, the ratio of the candidate's own party's average vote share to the opponent party's average vote share in a state, over all years excluding the current one.
Configuration 6 - uses both own_party_vote_share_jacknife and own_party_vote_share_ratio in the same model.
Configuration 7 - uses non-jackknife features. I compute own_party_vote_share_total as a party's average vote share over the entire sample in training_df, then take those values for each state-party grouping and merge them onto validation_df. The feature therefore covers a party's vote share over all years for a particular state, but is computed on data outside validation_df. I repeat the same procedure to get own_party_vote_share_ratio_total. This will still overfit, but hopefully not as blatantly as computing these features directly on validation_df.
NOTE I also plot the densities of these features conditional on did_win. This offers a potential explanation for their lack of explanatory power (their adjusted R-squared values are all quite low): none of the distributions differ much between win outcomes.
NOTE Throughout this analysis I split the tables to predict both (1) win and (2) vote share. This should be clear from the column-group sub-headings. For predicting vote_share the baseline model is linear.
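Some of the adjusted R-squared values in the tables below are slightly negative. As a reminder of why that can happen, here is a minimal sketch of the standard adjustment formula. The function name and the example numbers are illustrative assumptions, not values taken from this analysis.

```python
def adjusted_r_squared(r_squared: float, n_obs: int, n_predictors: int) -> float:
    """Standard adjustment: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n_obs - n_predictors - 1 <= 0:
        raise ValueError("need n_obs > n_predictors + 1")
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical case: a weak feature set (R^2 = 0.005) with 5 predictors and 500 rows.
# The penalty for the extra predictors outweighs the tiny fit, so the result is negative.
weak_fit = adjusted_r_squared(0.005, n_obs=500, n_predictors=5)
print(f"adjusted R^2 for the weak fit: {weak_fit:.4f}")
```

A negative value therefore just means the features explain less variance than their degrees-of-freedom penalty costs.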
Before jumping into regression output, I will put all definitions of important terms here:
Vote-Share : In all tables below, when vote share is the LHS variable it means the vote share of a specific candidate in their election, i.e. how much of the total vote that candidate received.
Own-Party-Vote-Share-Jacknife : This is a feature, computed as the “average vote share of my party in this state over all years excluding the current”
Own-Party-Vote-Share-Ratio : This is a feature, computed as the “average vote share of my party in this state over all years excluding the current / opponent party's average vote share in this state over all years excluding the current”
Own-Party-Vote-Share-Total : In configuration (7) I compute a non-jackknife version of the vote-share feature. This is computed as the “average vote share of my party in this state over all years including the current”. However, it is computed on the train_df and NOT directly on the val_df. I then transfer these values to the val_df by their (state, party) key. This is meant to reduce the amount of overfitting.
Own-Party-Vote-Share-Ratio-Total : In configuration (7) this is computed with the same method as item (4) above, now taking the ratio of “average vote share of my party in this state over all years / opponent party's average vote share in this state over all years”
NOTE I am keeping the definition of variables above each table configuration as before. This section will be in all new markdowns and is meant to make it easier to find definitions.
Configuration 4 replaces own_party_win_rate with own_party_vote_share_jacknife in the election LM.
Own-Party-Vote-Share-Jacknife is computed as: “average vote share of my party in this state over all years excluding the current”
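The leave-current-year-out average described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code used for the tables: the row layout (dicts with state, party, year, vote_share), the function name, and the choice to average over candidate-level rows are all hypothetical.

```python
from collections import defaultdict


def add_jackknife_vote_share(rows):
    """Return copies of rows with a leave-current-year-out party average per state.

    Assumed row layout (hypothetical): dicts with state, party, year, vote_share.
    """
    group_sum, group_n = defaultdict(float), defaultdict(int)
    year_sum, year_n = defaultdict(float), defaultdict(int)
    for r in rows:
        key = (r["state"], r["party"])
        group_sum[key] += r["vote_share"]
        group_n[key] += 1
        year_sum[key + (r["year"],)] += r["vote_share"]
        year_n[key + (r["year"],)] += 1

    out = []
    for r in rows:
        key = (r["state"], r["party"])
        year_key = key + (r["year"],)
        # Drop every observation from the current year before averaging.
        n_other = group_n[key] - year_n[year_key]
        sum_other = group_sum[key] - year_sum[year_key]
        jack = sum_other / n_other if n_other > 0 else None
        out.append({**r, "own_party_vote_share_jacknife": jack})
    return out


# Made-up example rows, not data from this analysis.
example_rows = [
    {"state": "TX", "party": "D", "year": 2000, "vote_share": 0.40},
    {"state": "TX", "party": "D", "year": 2002, "vote_share": 0.50},
    {"state": "TX", "party": "D", "year": 2004, "vote_share": 0.60},
    {"state": "TX", "party": "R", "year": 2000, "vote_share": 0.55},
]
with_jack = add_jackknife_vote_share(example_rows)
```

A (state, party) group observed in only one year has nothing left after the exclusion, so the sketch returns None there rather than guessing a value.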
Table 01 - Version 04 - Election Regressions. Fit measured in adjusted R squared and AUC.

| Model Configuration | Election Outcome: Adjusted R Squared | Election Outcome: ROC AUC | Vote Share: Adjusted R Squared |
|---|---|---|---|
| Single Variable Model | | | |
| Election LM | 0.0258 | 0.6015 | 0.0163 |
| Lower 95% C.I. | 0.0128 | 0.5665 | 0.0053 |
| Upper 95% C.I. | 0.0444 | 0.6364 | 0.0334 |
| Sex | 0.0153 | 0.5605 | 0.0181 |
| Lower 95% C.I. | 0.0052 | 0.5315 | 0.0067 |
| Upper 95% C.I. | 0.0312 | 0.5895 | 0.0352 |
| Skin-Tone | −0.0049 | 0.5454 | −0.0055 |
| Lower 95% C.I. | −0.0013 | 0.5107 | −0.0015 |
| Upper 95% C.I. | 0.0259 | 0.5801 | 0.0278 |
| MTurk Features | 0.0017 | 0.5434 | 0.0026 |
| Lower 95% C.I. | −0.0016 | 0.5078 | −0.0014 |
| Upper 95% C.I. | 0.0156 | 0.5790 | 0.0172 |
| P_hat_cnn | 0.0972 | 0.6815 | 0.1016 |
| Lower 95% C.I. | 0.0677 | 0.6487 | 0.0743 |
| Upper 95% C.I. | 0.1284 | 0.7142 | 0.1313 |
| Combined Variable Model | | | |
| Election LM + P_hat_cnn | 0.1172 | 0.7005 | 0.1185 |
| Lower 95% C.I. | 0.0872 | 0.6685 | 0.0918 |
| Upper 95% C.I. | 0.1541 | 0.7325 | 0.1535 |
| Election LM + Sex | 0.0388 | 0.6215 | 0.0367 |
| Lower 95% C.I. | 0.0218 | 0.5870 | 0.0201 |
| Upper 95% C.I. | 0.0643 | 0.6559 | 0.0595 |
| Election LM + Sex + P_hat_cnn | 0.1197 | 0.7042 | 0.1244 |
| Lower 95% C.I. | 0.0918 | 0.6722 | 0.0949 |
| Upper 95% C.I. | 0.1554 | 0.7361 | 0.1578 |
| Election LM + Sex + Skin-Tone | 0.0326 | 0.6330 | 0.0304 |
| Lower 95% C.I. | 0.0307 | 0.5989 | 0.0266 |
| Upper 95% C.I. | 0.0728 | 0.6672 | 0.0775 |
| Election LM + Sex + Skin-Tone + P_hat_cnn | 0.1150 | 0.7123 | 0.1205 |
| Lower 95% C.I. | 0.1015 | 0.6807 | 0.1076 |
| Upper 95% C.I. | 0.1666 | 0.7439 | 0.1736 |
| Election LM + Sex + Skin-Tone + MTurk | 0.0332 | 0.6359 | 0.0326 |
| Lower 95% C.I. | 0.0315 | 0.6019 | 0.0326 |
| Upper 95% C.I. | 0.0808 | 0.6700 | 0.0825 |
| Election LM + Sex + Skin-Tone + MTurk + P_hat_cnn | 0.1158 | 0.7161 | 0.1221 |
| Lower 95% C.I. | 0.1030 | 0.6847 | 0.1104 |
| Upper 95% C.I. | 0.1726 | 0.7475 | 0.1819 |

Configuration 5 replaces own_party_win_rate with own_party_vote_share_ratio in the election LM.
Own-Party-Vote-Share-Ratio is computed as: “average vote share of my party in this state over all years excluding the current / opponent party's average vote share in this state over all years excluding the current”
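Given the jackknife averages, the ratio is a lookup of the opposing party's value followed by a division. The sketch below assumes a two-party mapping (OPPONENT_OF) and rows that already carry own_party_vote_share_jacknife. Both the mapping and the function name are hypothetical, and how third parties or uncontested races are handled in the real pipeline is not stated in this document.

```python
# Hypothetical two-party mapping; the actual opponent definition may differ.
OPPONENT_OF = {"D": "R", "R": "D"}


def add_vote_share_ratio(rows, opponent_of=OPPONENT_OF):
    """Attach own / opponent jackknife average for each (state, party, year) row."""
    lookup = {
        (r["state"], r["party"], r["year"]): r["own_party_vote_share_jacknife"]
        for r in rows
    }
    out = []
    for r in rows:
        own = r["own_party_vote_share_jacknife"]
        opp = lookup.get((r["state"], opponent_of.get(r["party"]), r["year"]))
        # Leave the ratio missing when either side is unavailable or the denominator is zero.
        ratio = own / opp if own is not None and opp else None
        out.append({**r, "own_party_vote_share_ratio": ratio})
    return out


# Made-up jackknife values, not data from this analysis.
ratio_rows = add_vote_share_ratio([
    {"state": "TX", "party": "D", "year": 2004, "own_party_vote_share_jacknife": 0.45},
    {"state": "TX", "party": "R", "year": 2004, "own_party_vote_share_jacknife": 0.50},
    {"state": "TX", "party": "I", "year": 2004, "own_party_vote_share_jacknife": 0.05},
])
```

A ratio above 1 means the candidate's party has historically outpolled its opponent in that state, and a ratio below 1 means the reverse.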
Table 01 - Version 05 - Election Regressions. Fit measured in adjusted R squared and AUC.

| Model Configuration | Election Outcome: Adjusted R Squared | Election Outcome: ROC AUC | Vote Share: Adjusted R Squared |
|---|---|---|---|
| Single Variable Model | | | |
| Election LM | 0.0243 | 0.5959 | 0.0079 |
| Lower 95% C.I. | 0.0114 | 0.5608 | 0.0006 |
| Upper 95% C.I. | 0.0421 | 0.6310 | 0.0216 |
| Sex | 0.0153 | 0.5605 | 0.0181 |
| Lower 95% C.I. | 0.0051 | 0.5315 | 0.0074 |
| Upper 95% C.I. | 0.0312 | 0.5895 | 0.0369 |
| Skin-Tone | −0.0049 | 0.5454 | −0.0055 |
| Lower 95% C.I. | −0.0010 | 0.5107 | −0.0015 |
| Upper 95% C.I. | 0.0260 | 0.5801 | 0.0272 |
| MTurk Features | 0.0017 | 0.5434 | 0.0026 |
| Lower 95% C.I. | −0.0017 | 0.5078 | −0.0010 |
| Upper 95% C.I. | 0.0163 | 0.5790 | 0.0179 |
| P_hat_cnn | 0.0972 | 0.6815 | 0.1016 |
| Lower 95% C.I. | 0.0696 | 0.6487 | 0.0747 |
| Upper 95% C.I. | 0.1301 | 0.7142 | 0.1332 |
| Combined Variable Model | | | |
| Election LM + P_hat_cnn | 0.1177 | 0.6993 | 0.1115 |
| Lower 95% C.I. | 0.0925 | 0.6672 | 0.0824 |
| Upper 95% C.I. | 0.1516 | 0.7314 | 0.1464 |
| Election LM + Sex | 0.0375 | 0.6202 | 0.0278 |
| Lower 95% C.I. | 0.0205 | 0.5856 | 0.0128 |
| Upper 95% C.I. | 0.0610 | 0.6547 | 0.0486 |
| Election LM + Sex + P_hat_cnn | 0.1202 | 0.7027 | 0.1171 |
| Lower 95% C.I. | 0.0940 | 0.6708 | 0.0882 |
| Upper 95% C.I. | 0.1554 | 0.7347 | 0.1522 |
| Election LM + Sex + Skin-Tone | 0.0311 | 0.6350 | 0.0214 |
| Lower 95% C.I. | 0.0267 | 0.6009 | 0.0202 |
| Upper 95% C.I. | 0.0745 | 0.6692 | 0.0657 |
| Election LM + Sex + Skin-Tone + P_hat_cnn | 0.1152 | 0.7113 | 0.1131 |
| Lower 95% C.I. | 0.1020 | 0.6796 | 0.0997 |
| Upper 95% C.I. | 0.1668 | 0.7429 | 0.1658 |
| Election LM + Sex + Skin-Tone + MTurk | 0.0318 | 0.6381 | 0.0235 |
| Lower 95% C.I. | 0.0294 | 0.6041 | 0.0235 |
| Upper 95% C.I. | 0.0789 | 0.6721 | 0.0726 |
| Election LM + Sex + Skin-Tone + MTurk + P_hat_cnn | 0.1162 | 0.7153 | 0.1147 |
| Lower 95% C.I. | 0.1043 | 0.6839 | 0.1050 |
| Upper 95% C.I. | 0.1729 | 0.7468 | 0.1712 |

After seeing how weak the election_lm is with both the own_party_vote_share_jacknife and own_party_vote_share_ratio features, I plot their conditional densities below. There does not seem to be enough variation between the did_win_conditional == True and == False groups for these features to have real explanatory power.
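A numeric companion to the density plots is a standardized mean difference between the two did_win groups. This is a swapped-in summary statistic (Cohen's d with a pooled standard deviation), not something computed in the original analysis, and the function name and example values are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance


def standardized_mean_difference(values, did_win):
    """Cohen's d between winners and losers, using the pooled sample variance."""
    won = [v for v, w in zip(values, did_win) if w]
    lost = [v for v, w in zip(values, did_win) if not w]
    if len(won) < 2 or len(lost) < 2:
        raise ValueError("need at least two observations in each group")
    pooled_var = (
        (len(won) - 1) * variance(won) + (len(lost) - 1) * variance(lost)
    ) / (len(won) + len(lost) - 2)
    if pooled_var == 0:
        raise ValueError("no variation within groups")
    return (mean(won) - mean(lost)) / sqrt(pooled_var)


# Made-up feature values, not data from this analysis.
feature = [0.52, 0.55, 0.50, 0.49, 0.51, 0.47]
won_flag = [True, True, True, False, False, False]
effect = standardized_mean_difference(feature, won_flag)
print(f"standardized mean difference: {effect:.2f}")
```

Values near zero would match what the plots suggest: heavily overlapping distributions and little separation between win outcomes.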
I now include both own_party_vote_share_jacknife and own_party_vote_share_ratio in the election_lm such that the model becomes:
y = party + vote_share_jacknife + vote_share_ratio
Table 01 - Version 06 - Election Regressions. Fit measured in adjusted R squared and AUC.

| Model Configuration | Election Outcome: Adjusted R Squared | Election Outcome: ROC AUC | Vote Share: Adjusted R Squared |
|---|---|---|---|
| Single Variable Model | | | |
| Election LM | 0.0253 | 0.5988 | 0.0172 |
| Lower 95% C.I. | 0.0113 | 0.5639 | 0.0059 |
| Upper 95% C.I. | 0.0441 | 0.6338 | 0.0337 |
| Sex | 0.0153 | 0.5605 | 0.0181 |
| Lower 95% C.I. | 0.0043 | 0.5315 | 0.0070 |
| Upper 95% C.I. | 0.0307 | 0.5895 | 0.0358 |
| Skin-Tone | −0.0049 | 0.5454 | −0.0055 |
| Lower 95% C.I. | −0.0015 | 0.5107 | −0.0016 |
| Upper 95% C.I. | 0.0260 | 0.5801 | 0.0278 |
| MTurk Features | 0.0017 | 0.5434 | 0.0026 |
| Lower 95% C.I. | −0.0014 | 0.5078 | −0.0012 |
| Upper 95% C.I. | 0.0152 | 0.5790 | 0.0160 |
| P_hat_cnn | 0.0972 | 0.6815 | 0.1016 |
| Lower 95% C.I. | 0.0702 | 0.6487 | 0.0740 |
| Upper 95% C.I. | 0.1277 | 0.7142 | 0.1325 |
| Combined Variable Model | | | |
| Election LM + P_hat_cnn | 0.1177 | 0.7000 | 0.1192 |
| Lower 95% C.I. | 0.0897 | 0.6679 | 0.0909 |
| Upper 95% C.I. | 0.1536 | 0.7320 | 0.1535 |
| Election LM + Sex | 0.0384 | 0.6213 | 0.0376 |
| Lower 95% C.I. | 0.0217 | 0.5868 | 0.0211 |
| Upper 95% C.I. | 0.0630 | 0.6557 | 0.0614 |
| Election LM + Sex + P_hat_cnn | 0.1202 | 0.7036 | 0.1251 |
| Lower 95% C.I. | 0.0913 | 0.6716 | 0.0960 |
| Upper 95% C.I. | 0.1553 | 0.7355 | 0.1627 |
| Election LM + Sex + Skin-Tone | 0.0321 | 0.6336 | 0.0314 |
| Lower 95% C.I. | 0.0283 | 0.5994 | 0.0276 |
| Upper 95% C.I. | 0.0759 | 0.6677 | 0.0782 |
| Election LM + Sex + Skin-Tone + P_hat_cnn | 0.1153 | 0.7118 | 0.1212 |
| Lower 95% C.I. | 0.0992 | 0.6802 | 0.1071 |
| Upper 95% C.I. | 0.1687 | 0.7434 | 0.1751 |
| Election LM + Sex + Skin-Tone + MTurk | 0.0328 | 0.6367 | 0.0335 |
| Lower 95% C.I. | 0.0311 | 0.6027 | 0.0317 |
| Upper 95% C.I. | 0.0798 | 0.6707 | 0.0838 |
| Election LM + Sex + Skin-Tone + MTurk + P_hat_cnn | 0.1163 | 0.7156 | 0.1228 |
| Lower 95% C.I. | 0.1057 | 0.6841 | 0.1120 |
| Upper 95% C.I. | 0.1735 | 0.7470 | 0.1819 |

I now compute a non-jackknifed version of own_party_vote_share and own_party_vote_share_ratio on the train_df. Then I transfer those values, keyed on (state, party) groups, to the validation set. The hope is that the overfitting will be limited because the validation set contains ‘technically’ new observations.
This new election_lm thus uses the own_party_vote_share_total and own_party_vote_share_ratio_total variables as interactions, neither of which excludes the current year. However, as outlined above, these features are computed on different data.
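The transfer step amounts to fitting group averages on the training rows only and then looking them up for each validation row. A minimal sketch, assuming rows are dicts with state, party, and vote_share keys (the layout and function names are hypothetical, and this is not the code used to produce the table below):

```python
from collections import defaultdict


def fit_party_totals(train_rows):
    """Average vote share per (state, party), computed on training rows only."""
    sums, counts = defaultdict(float), defaultdict(int)
    for r in train_rows:
        key = (r["state"], r["party"])
        sums[key] += r["vote_share"]
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


def transfer_totals(val_rows, totals):
    """Attach the training-set averages to validation rows by (state, party)."""
    return [
        {**r, "own_party_vote_share_total": totals.get((r["state"], r["party"]))}
        for r in val_rows
    ]


# Made-up rows, not data from this analysis.
train_example = [
    {"state": "TX", "party": "D", "year": 2000, "vote_share": 0.40},
    {"state": "TX", "party": "D", "year": 2002, "vote_share": 0.50},
]
val_example = [
    {"state": "TX", "party": "D", "year": 2004, "vote_share": 0.58},
    {"state": "CA", "party": "D", "year": 2004, "vote_share": 0.61},
]
totals = fit_party_totals(train_example)
val_with_totals = transfer_totals(val_example, totals)
```

Validation rows whose (state, party) pair never appears in training come back as None in this sketch; how such rows are treated in the actual pipeline is not stated here.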
Table 01 - Version 07 - Election Regressions. Fit measured in adjusted R squared and AUC.

| Model Configuration | Election Outcome: Adjusted R Squared | Election Outcome: ROC AUC | Vote Share: Adjusted R Squared |
|---|---|---|---|
| Single Variable Model | | | |
| Election LM | 0.0590 | 0.6395 | 0.0589 |
| Lower 95% C.I. | 0.0365 | 0.6054 | 0.0391 |
| Upper 95% C.I. | 0.0853 | 0.6735 | 0.0837 |
| Sex | 0.0153 | 0.5605 | 0.0181 |
| Lower 95% C.I. | 0.0050 | 0.5315 | 0.0063 |
| Upper 95% C.I. | 0.0314 | 0.5895 | 0.0336 |
| Skin-Tone | −0.0049 | 0.5454 | −0.0055 |
| Lower 95% C.I. | −0.0008 | 0.5107 | −0.0022 |
| Upper 95% C.I. | 0.0243 | 0.5801 | 0.0274 |
| MTurk Features | 0.0017 | 0.5434 | 0.0026 |
| Lower 95% C.I. | −0.0015 | 0.5078 | −0.0011 |
| Upper 95% C.I. | 0.0159 | 0.5790 | 0.0166 |
| P_hat_cnn | 0.0972 | 0.6815 | 0.1016 |
| Lower 95% C.I. | 0.0682 | 0.6487 | 0.0731 |
| Upper 95% C.I. | 0.1302 | 0.7142 | 0.1332 |
| Combined Variable Model | | | |
| Election LM + P_hat_cnn | 0.1422 | 0.7197 | 0.1499 |
| Lower 95% C.I. | 0.1118 | 0.6885 | 0.1186 |
| Upper 95% C.I. | 0.1750 | 0.7509 | 0.1854 |
| Election LM + Sex | 0.0692 | 0.6544 | 0.0756 |
| Lower 95% C.I. | 0.0480 | 0.6208 | 0.0524 |
| Upper 95% C.I. | 0.0998 | 0.6880 | 0.1056 |
| Election LM + Sex + P_hat_cnn | 0.1437 | 0.7220 | 0.1543 |
| Lower 95% C.I. | 0.1161 | 0.6908 | 0.1203 |
| Upper 95% C.I. | 0.1807 | 0.7531 | 0.1914 |
| Election LM + Sex + Skin-Tone | 0.0620 | 0.6622 | 0.0687 |
| Lower 95% C.I. | 0.0532 | 0.6288 | 0.0603 |
| Upper 95% C.I. | 0.1109 | 0.6956 | 0.1186 |
| Election LM + Sex + Skin-Tone + P_hat_cnn | 0.1380 | 0.7279 | 0.1497 |
| Lower 95% C.I. | 0.1219 | 0.6970 | 0.1314 |
| Upper 95% C.I. | 0.1917 | 0.7587 | 0.2065 |
| Election LM + Sex + Skin-Tone + MTurk | 0.0623 | 0.6646 | 0.0706 |
| Lower 95% C.I. | 0.0573 | 0.6314 | 0.0641 |
| Upper 95% C.I. | 0.1153 | 0.6979 | 0.1231 |
| Election LM + Sex + Skin-Tone + MTurk + P_hat_cnn | 0.1386 | 0.7314 | 0.1510 |
| Lower 95% C.I. | 0.1267 | 0.7008 | 0.1349 |
| Upper 95% C.I. | 0.1971 | 0.7621 | 0.2098 |

I repeat the same plots for the new non-Jacknife own_party_vote_share_total and own_party_vote_share_ratio_total variables.