library(car)
library(readxl)
library(dplyr)
library(RWeka)
## Warning: package 'RWeka' was built under R version 4.4.2
This exercise uses data from the 2024 American National Election Studies Pilot Survey to train, test, and evaluate the performance of regression and M5 machine learning models in predicting respondents’ self-placement on a 5-point political ideology scale.
a24p<-read_excel("C:\\Users\\Jaire\\OneDrive\\Desktop\\Exploratory Research\\Data\\a24pconv.xlsx")
a24psub<-dplyr::select(a24p, group_maga, group_demsoc, group_fascists, group_kkk, group_blm, group_atheists, group_prolife, group_prochoice, group_gunrights, group_lgbtrights, group_feminists, group_whitesup, group_antigay, group_socialist, group_christfund, group_laborun, group_gayles, group_transppl, group_christians, group_police, group_journal, group_nra, group_ruralam, group_planparent, group_colprofs, group_black, group_white, group_hispanic, group_transmen, group_transwom, group_jews, group_muslims, foreign_israel, foreign_hamas, foreign_israelis, foreign_palest, party_repub, party_democ, ideo5)
The “ideo5” feature is a self-response measure indicating respondents’ perception of their placement on a political ideological scale from 1 to 5 (1-Very Liberal, 2-Liberal, 3-Moderate, 4-Conservative, 5-Very Conservative). All other features are feeling thermometers; higher values indicate more positive ratings by respondents of each group’s activists and members.
table(a24psub$ideo5)
##
## 1 2 3 4 5
## 182 331 801 393 202
# Remove rows with any NA's
clean_a24psub <- a24psub[rowSums(is.na(a24psub)) <= 0, ]
summary(clean_a24psub)
## group_maga group_demsoc group_fascists group_kkk
## Min. : 0.0 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 20.0 Median : 32.00 Median : 1.00 Median : 0.000
## Mean : 33.1 Mean : 35.51 Mean : 14.87 Mean : 9.213
## 3rd Qu.: 57.0 3rd Qu.: 60.00 3rd Qu.: 24.00 3rd Qu.: 5.000
## Max. :100.0 Max. :100.00 Max. :100.00 Max. :100.000
## group_blm group_atheists group_prolife group_prochoice
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 3.00 1st Qu.: 5.00 1st Qu.: 10.00
## Median : 39.00 Median : 50.00 Median : 50.00 Median : 50.00
## Mean : 39.16 Mean : 40.87 Mean : 44.17 Mean : 48.85
## 3rd Qu.: 71.00 3rd Qu.: 58.00 3rd Qu.: 71.00 3rd Qu.: 84.00
## Max. :100.00 Max. :100.00 Max. :100.00 Max. :100.00
## group_gunrights group_lgbtrights group_feminists group_whitesup
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.00
## 1st Qu.: 4.00 1st Qu.: 1.00 1st Qu.: 9.0 1st Qu.: 0.00
## Median : 50.00 Median : 42.00 Median : 50.0 Median : 1.00
## Mean : 44.16 Mean : 40.55 Mean : 44.7 Mean : 13.24
## 3rd Qu.: 72.00 3rd Qu.: 70.00 3rd Qu.: 70.0 3rd Qu.: 14.00
## Max. :100.00 Max. :100.00 Max. :100.0 Max. :100.00
## group_antigay group_socialist group_christfund group_laborun
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 2.00 1st Qu.: 19.00
## Median : 11.00 Median : 33.00 Median : 47.00 Median : 51.00
## Mean : 26.32 Mean : 33.84 Mean : 39.89 Mean : 49.57
## 3rd Qu.: 50.00 3rd Qu.: 55.00 3rd Qu.: 63.00 3rd Qu.: 75.00
## Max. :100.00 Max. :100.00 Max. :100.00 Max. :100.00
## group_gayles group_transppl group_christians group_police
## Min. : 0 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 8 1st Qu.: 2.00 1st Qu.: 41.00 1st Qu.: 39.00
## Median : 50 Median : 49.00 Median : 68.00 Median : 65.00
## Mean : 48 Mean : 41.25 Mean : 60.67 Mean : 58.19
## 3rd Qu.: 79 3rd Qu.: 69.00 3rd Qu.: 91.00 3rd Qu.: 85.00
## Max. :100 Max. :100.00 Max. :100.00 Max. :100.00
## group_journal group_nra group_ruralam group_planparent
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 15.00 1st Qu.: 1.00 1st Qu.: 50.00 1st Qu.: 4.00
## Median : 50.00 Median : 49.00 Median : 70.00 Median : 52.00
## Mean : 46.61 Mean : 41.28 Mean : 63.29 Mean : 50.19
## 3rd Qu.: 72.00 3rd Qu.: 71.00 3rd Qu.: 90.00 3rd Qu.: 85.00
## Max. :100.00 Max. :100.00 Max. :100.00 Max. :100.00
## group_colprofs group_black group_white group_hispanic
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 25.00 1st Qu.: 50.00 1st Qu.: 50.00 1st Qu.: 50.00
## Median : 52.00 Median : 70.00 Median : 70.00 Median : 68.00
## Mean : 50.44 Mean : 63.18 Mean : 62.85 Mean : 62.41
## 3rd Qu.: 75.00 3rd Qu.: 90.00 3rd Qu.: 88.00 3rd Qu.: 89.00
## Max. :100.00 Max. :100.00 Max. :100.00 Max. :100.00
## group_transmen group_transwom group_jews group_muslims
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.0
## 1st Qu.: 1.00 1st Qu.: 1.00 1st Qu.: 50.00 1st Qu.: 21.0
## Median : 41.00 Median : 43.00 Median : 69.00 Median : 50.0
## Mean : 38.22 Mean : 39.28 Mean : 62.34 Mean : 48.9
## 3rd Qu.: 63.00 3rd Qu.: 66.00 3rd Qu.: 88.00 3rd Qu.: 71.0
## Max. :100.00 Max. :100.00 Max. :100.00 Max. :100.0
## foreign_israel foreign_hamas foreign_israelis foreign_palest
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 20.00 1st Qu.: 0.00 1st Qu.: 30.00 1st Qu.: 10.00
## Median : 51.00 Median : 2.00 Median : 53.00 Median : 50.00
## Mean : 50.91 Mean : 16.95 Mean : 53.11 Mean : 41.47
## 3rd Qu.: 80.00 3rd Qu.: 33.00 3rd Qu.: 81.00 3rd Qu.: 60.00
## Max. :100.00 Max. :100.00 Max. :100.00 Max. :100.00
## party_repub party_democ ideo5
## Min. : 0.00 Min. : 0.00 Min. :1.000
## 1st Qu.: 2.00 1st Qu.: 1.00 1st Qu.:2.000
## Median : 40.00 Median : 40.00 Median :3.000
## Mean : 39.33 Mean : 39.26 Mean :3.053
## 3rd Qu.: 68.00 3rd Qu.: 70.00 3rd Qu.:4.000
## Max. :100.00 Max. :100.00 Max. :5.000
nrow(clean_a24psub)
## [1] 1909
table(clean_a24psub$ideo5)
##
## 1 2 3 4 5
## 182 331 801 393 202
library(RWeka)
# create training (%75) and test data (%25)
ideo_train <- clean_a24psub[1:1125, ]
ideo_test <- clean_a24psub[1126:1500, ]
# create and view model
library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.4.2
ideo.rpart <- rpart(ideo5 ~ ., data = ideo_train)
summary(ideo.rpart)
## Call:
## rpart(formula = ideo5 ~ ., data = ideo_train)
## n= 1125
##
## CP nsplit rel error xerror xstd
## 1 0.23581204 0 1.0000000 1.0005177 0.03816929
## 2 0.08619864 1 0.7641880 0.7817530 0.03328216
## 3 0.08455985 2 0.6779893 0.7252345 0.03235649
## 4 0.02450426 3 0.5934295 0.6240213 0.03011342
## 5 0.02424716 4 0.5689252 0.6231018 0.03049406
## 6 0.01589449 5 0.5446780 0.6090065 0.02980377
## 7 0.01202238 6 0.5287836 0.6026834 0.03015003
## 8 0.01049292 7 0.5167612 0.5899096 0.02930940
## 9 0.01000000 8 0.5062682 0.5837806 0.02870807
##
## Variable importance
## group_demsoc party_democ group_blm group_socialist
## 13 10 10 9
## group_planparent group_christfund group_journal group_maga
## 8 8 8 7
## group_nra group_prolife group_gunrights group_antigay
## 5 5 3 3
## foreign_israelis group_transmen foreign_israel group_transwom
## 2 2 1 1
## group_feminists group_transppl group_gayles group_lgbtrights
## 1 1 1 1
## party_repub group_christians
## 1 1
##
## Node number 1: 1125 observations, complexity param=0.235812
## mean=3.026667, MSE=1.124622
## left son=2 (545 obs) right son=3 (580 obs)
## Primary splits:
## group_demsoc < 38.5 to the right, improve=0.2358120, (0 missing)
## party_democ < 33.5 to the right, improve=0.2266821, (0 missing)
## party_repub < 48.5 to the left, improve=0.2150226, (0 missing)
## group_blm < 31.5 to the right, improve=0.2108504, (0 missing)
## group_lgbtrights < 53.5 to the right, improve=0.2085210, (0 missing)
## Surrogate splits:
## group_blm < 29.5 to the right, agree=0.867, adj=0.725, (0 split)
## party_democ < 33.5 to the right, agree=0.860, adj=0.712, (0 split)
## group_socialist < 37.5 to the right, agree=0.844, adj=0.679, (0 split)
## group_planparent < 49.5 to the right, agree=0.801, adj=0.589, (0 split)
## group_journal < 46.5 to the right, agree=0.788, adj=0.563, (0 split)
##
## Node number 2: 545 observations, complexity param=0.08455985
## mean=2.495413, MSE=0.8775019
## left son=4 (226 obs) right son=5 (319 obs)
## Primary splits:
## group_christfund < 36.5 to the left, improve=0.2237066, (0 missing)
## party_repub < 32.5 to the left, improve=0.2098515, (0 missing)
## group_maga < 27.5 to the left, improve=0.1973309, (0 missing)
## group_antigay < 5.5 to the left, improve=0.1969671, (0 missing)
## group_nra < 21.5 to the left, improve=0.1888299, (0 missing)
## Surrogate splits:
## group_nra < 29.5 to the left, agree=0.817, adj=0.558, (0 split)
## group_prolife < 31.5 to the left, agree=0.807, adj=0.535, (0 split)
## group_maga < 13.5 to the left, agree=0.798, adj=0.513, (0 split)
## group_gunrights < 17.5 to the left, agree=0.793, adj=0.500, (0 split)
## group_antigay < 7.5 to the left, agree=0.769, adj=0.442, (0 split)
##
## Node number 3: 580 observations, complexity param=0.08619864
## mean=3.525862, MSE=0.8424346
## left son=6 (387 obs) right son=7 (193 obs)
## Primary splits:
## group_maga < 58.5 to the left, improve=0.2232006, (0 missing)
## group_christfund < 43.5 to the left, improve=0.2095804, (0 missing)
## party_repub < 56.5 to the left, improve=0.1867628, (0 missing)
## group_christians < 57.5 to the left, improve=0.1709261, (0 missing)
## group_nra < 47.5 to the left, improve=0.1708726, (0 missing)
## Surrogate splits:
## foreign_israelis < 76.5 to the left, agree=0.790, adj=0.368, (0 split)
## group_nra < 67.5 to the left, agree=0.783, adj=0.347, (0 split)
## foreign_israel < 69.5 to the left, agree=0.771, adj=0.311, (0 split)
## group_prolife < 72.5 to the left, agree=0.767, adj=0.301, (0 split)
## group_christfund < 53.5 to the left, agree=0.762, adj=0.285, (0 split)
##
## Node number 4: 226 observations, complexity param=0.02450426
## mean=1.969027, MSE=0.6406336
## left son=8 (105 obs) right son=9 (121 obs)
## Primary splits:
## group_transmen < 83.5 to the right, improve=0.2141325, (0 missing)
## group_socialist < 55.5 to the right, improve=0.2095422, (0 missing)
## group_demsoc < 79.5 to the right, improve=0.2088198, (0 missing)
## group_transwom < 57 to the right, improve=0.2057537, (0 missing)
## group_transppl < 56.5 to the right, improve=0.1974737, (0 missing)
## Surrogate splits:
## group_transwom < 84.5 to the right, agree=0.925, adj=0.838, (0 split)
## group_transppl < 83.5 to the right, agree=0.907, adj=0.800, (0 split)
## group_gayles < 84.5 to the right, agree=0.845, adj=0.667, (0 split)
## group_lgbtrights < 84.5 to the right, agree=0.832, adj=0.638, (0 split)
## group_feminists < 82.5 to the right, agree=0.765, adj=0.495, (0 split)
##
## Node number 5: 319 observations, complexity param=0.01589449
## mean=2.868339, MSE=0.709938
## left son=10 (199 obs) right son=11 (120 obs)
## Primary splits:
## party_repub < 54.5 to the left, improve=0.08879626, (0 missing)
## group_prolife < 41.5 to the left, improve=0.06760915, (0 missing)
## group_maga < 27 to the left, improve=0.06358133, (0 missing)
## group_planparent < 69.5 to the right, improve=0.05253754, (0 missing)
## group_prochoice < 76.5 to the right, improve=0.04673697, (0 missing)
## Surrogate splits:
## group_maga < 52.5 to the left, agree=0.721, adj=0.258, (0 split)
## group_fascists < 59.5 to the left, agree=0.699, adj=0.200, (0 split)
## group_whitesup < 65.5 to the left, agree=0.687, adj=0.167, (0 split)
## group_christfund < 64.5 to the left, agree=0.687, adj=0.167, (0 split)
## foreign_hamas < 52.5 to the left, agree=0.683, adj=0.158, (0 split)
##
## Node number 6: 387 observations, complexity param=0.02424716
## mean=3.219638, MSE=0.7192009
## left son=12 (244 obs) right son=13 (143 obs)
## Primary splits:
## group_christfund < 38.5 to the left, improve=0.11021960, (0 missing)
## party_democ < 66.5 to the right, improve=0.08334673, (0 missing)
## group_prolife < 30.5 to the left, improve=0.08211000, (0 missing)
## party_repub < 58 to the left, improve=0.08002590, (0 missing)
## group_nra < 16.5 to the left, improve=0.07876842, (0 missing)
## Surrogate splits:
## group_prolife < 39.5 to the left, agree=0.824, adj=0.524, (0 split)
## group_christians < 52.5 to the left, agree=0.817, adj=0.503, (0 split)
## group_antigay < 4.5 to the left, agree=0.793, adj=0.441, (0 split)
## group_gunrights < 29.5 to the left, agree=0.788, adj=0.427, (0 split)
## group_nra < 41 to the left, agree=0.786, adj=0.420, (0 split)
##
## Node number 7: 193 observations
## mean=4.139896, MSE=0.5244705
##
## Node number 8: 105 observations
## mean=1.571429, MSE=0.4163265
##
## Node number 9: 121 observations
## mean=2.31405, MSE=0.5790588
##
## Node number 10: 199 observations, complexity param=0.01049292
## mean=2.673367, MSE=0.581753
## left son=20 (69 obs) right son=21 (130 obs)
## Primary splits:
## group_feminists < 62.5 to the right, improve=0.11467370, (0 missing)
## group_kkk < 63.5 to the right, improve=0.08251836, (0 missing)
## group_socialist < 54.5 to the right, improve=0.07402137, (0 missing)
## group_atheists < 53.5 to the right, improve=0.07366350, (0 missing)
## group_journal < 74.5 to the right, improve=0.06852082, (0 missing)
## Surrogate splits:
## group_transmen < 59.5 to the right, agree=0.764, adj=0.319, (0 split)
## group_lgbtrights < 63 to the right, agree=0.739, adj=0.246, (0 split)
## group_transwom < 56.5 to the right, agree=0.739, adj=0.246, (0 split)
## group_journal < 74.5 to the right, agree=0.734, adj=0.232, (0 split)
## group_gayles < 60.5 to the right, agree=0.729, adj=0.217, (0 split)
##
## Node number 11: 120 observations
## mean=3.191667, MSE=0.7549306
##
## Node number 12: 244 observations, complexity param=0.01202238
## mean=3.004098, MSE=0.6926062
## left son=24 (21 obs) right son=25 (223 obs)
## Primary splits:
## party_democ < 57.5 to the right, improve=0.09000645, (0 missing)
## group_lgbtrights < 84.5 to the right, improve=0.08730407, (0 missing)
## group_transppl < 95.5 to the right, improve=0.08315653, (0 missing)
## group_feminists < 64 to the right, improve=0.07454337, (0 missing)
## group_transmen < 97 to the right, improve=0.06877034, (0 missing)
## Surrogate splits:
## group_prochoice < 93 to the right, agree=0.922, adj=0.095, (0 split)
##
## Node number 13: 143 observations
## mean=3.587413, MSE=0.5500513
##
## Node number 20: 69 observations
## mean=2.318841, MSE=0.5070363
##
## Node number 21: 130 observations
## mean=2.861538, MSE=0.5192899
##
## Node number 24: 21 observations
## mean=2.190476, MSE=0.5351474
##
## Node number 25: 223 observations
## mean=3.080717, MSE=0.6392246
# visualize decision tree
rpart.plot(ideo.rpart, digits = 3)
rpart.plot(ideo.rpart, digits = 4, fallen.leaves = TRUE,
type = 3, extra = 101)
# create predictions on test data, check estimates
pred_ideo.rpart <- predict(ideo.rpart, ideo_test)
summary(pred_ideo.rpart)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.571 2.319 3.081 3.083 3.587 4.140
summary(ideo_test$ideo)
## Warning: Unknown or uninitialised column: `ideo`.
## Length Class Mode
## 0 NULL NULL
cor(pred_ideo.rpart, ideo_test$ideo5)
## [1] 0.6167006
A correlation of 0.62 is good. However, correlation only indicates the strength of the relationship between the predictions and the true values; it does not measure how far the predictions deviate from the true values.
# create mean of the absolute value of the errors function
MAE <- function(actual, predicted) {
mean(abs(actual - predicted))}
# adjust model to correctly ID outliers, create and assess mean of the absolute value of the errors
summary(pred_ideo.rpart)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.571 2.319 3.081 3.083 3.587 4.140
MAE(pred_ideo.rpart, ideo_test$ideo5)
## [1] 0.6798532
mean(ideo_train$ideo5)
## [1] 3.026667
MAE(3.03, ideo_test$ideo5)
## [1] 0.8404
MAE measures the average size of prediction errors, ignoring direction, and is a simple metric for model accuracy. An MAE of 0.68 indicates the model’s predictions are, on average, 0.68 units off from the true quality scores.
If the model consistently predicted a mean respondent self-placement of 3.03 (politically moderate), the imputed MAE would be about 0.84.
The regression tree (MAE = 0.68) is slightly closer to the true quality scores than the imputed MAE (MAE = 0.84), but the difference (0.16) is small, suggesting room for improvement.
# fit model using
i.m5p <- M5P(ideo5 ~ ., data = ideo_train)
# view model
print(i.m5p)
## M5 pruned model tree:
## (using smoothed linear models)
## LM1 (1125/67.167%)
##
## LM num: 1
## ideo5 =
## 0.0028 * group_maga
## - 0.0045 * group_demsoc
## + 0.0026 * group_prolife
## - 0.0023 * group_prochoice
## - 0.0029 * group_feminists
## + 0.0016 * group_antigay
## - 0.0032 * group_socialist
## + 0.0033 * group_christfund
## + 0.0016 * group_christians
## + 0.0044 * group_police
## - 0.0026 * group_planparent
## - 0.0021 * group_muslims
## + 0.0018 * foreign_palest
## + 0.0033 * party_repub
## - 0.0046 * party_democ
## + 3.0246
##
## Number of Rules : 1
The coefficient of 0.0028 for group_maga indicates that a 1-unit increase in the feeling thermometer rating for MAGA activists predicts a 0.0028 increase in the respondent’s political ideology score, suggesting a slight shift toward a more conservative self-placement.
Interpret the remaining features in a similar fashion keeping in mind the sign direction and 5-point scale.
# model diagnostics
summary(i.m5p)
##
## === Summary ===
##
## Correlation coefficient 0.7408
## Mean absolute error 0.5315
## Root mean squared error 0.7123
## Relative absolute error 69.2029 %
## Root relative squared error 67.1673 %
## Total Number of Instances 1125
# assess performance on unseen data
e.m5p <- predict(i.m5p, ideo_test)
# view value range, correlation, and MAE
summary(e.m5p)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.182 2.680 3.025 3.110 3.740 4.904
cor(e.m5p, ideo_test$ideo5)
## [1] 0.7171848
MAE(ideo_test$ideo5, e.m5p)
## [1] 0.5690298
mean(ideo_test$ideo5)
## [1] 3.170667
MAE(3.17, ideo_test$ideo5)
## [1] 0.8796
The correlation between predictions and true values is higher for the M5 model (0.72) than the regression tree (0.62). The M5 model’s MAE of 0.57 indicates its predictions are, on average, 0.57 units off from the true self-placement of political ideology scores, showing slight improvement.
If the model consistently predicted a mean respondent self-placement of 3.17 (politically moderate), the MAE would be about 0.88.
In this practice exercise, I observed that the M5 model (MAE = 0.57) is somewhat more accurate than the regression tree (MAE = 0.68) and the imputed MAE (MAE = 0.88).