Problem Set 3 Submission

Question 1 Data Set up

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(haven)
library(caret)

## Loading required package: ggplot2

## Loading required package: lattice

boatright <- read_dta("C:/Users/amyaz/OneDrive/Documents/Grad School/Spring 2025/Data Science/Problem Sets/PSet 3/boatright_subset.dta")
glimpse(boatright)

## Rows: 8,024
## Columns: 32
## $ person_id               <dbl> 21377, 28779, 33685, 9112, 24709, 29101, 29544…
## $ short_name              <chr> "Jim McGreevey", "Bret Schundler", "Mark Warne…
## $ state_postal            <chr> "NJ", "NJ", "VA", "VA", "AL", "AL", "AL", "AL"…
## $ primary_party           <dbl+lbl> 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, …
## $ district_age_18_to_24   <dbl> 0.08126166, 0.08126166, 0.09203869, 0.09203869…
## $ district_black          <dbl> 0.12494519, 0.12494519, 0.21841812, 0.21841812…
## $ district_blue_collar    <dbl> 0.2208841, 0.2208841, 0.2364317, 0.2364317, 0.…
## $ district_carpool        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ district_college_educ   <dbl> 0.3329969, 0.3329969, 0.3346924, 0.3346924, 0.…
## $ district_dirty_industry <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ district_foreign_born   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 0.0198, 0.0198…
## $ district_hs_educ        <dbl> 0.8636644, 0.8636644, 0.8660130, 0.8660130, 0.…
## $ district_land_area      <dbl> 8722.58, 8722.58, 42774.93, 42774.93, 52420.07…
## $ district_latino         <dbl> 0.135879159, 0.135879159, 0.041843250, 0.04184…
## $ district_median_income  <dbl> 54568, 54568, 49631, 49631, 37603, 37603, 3760…
## $ district_mm             <dbl+lbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ district_nonwhite       <dbl> 0.31038362, 0.31038362, 0.31110424, 0.31110424…
## $ district_over_65        <dbl> 0.12279768, 0.12279768, 0.11148047, 0.11148047…
## $ district_pop            <dbl> 8552643, 8552643, 7286873, 7286873, 4480089, 4…
## $ district_pop_density    <dbl> 980.5175781, 980.5175781, 170.3538208, 170.353…
## $ district_poverty        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 0.17, 0.17, 0.…
## $ district_service        <dbl> 0.1432318, 0.1432318, 0.1501231, 0.1501231, 0.…
## $ district_unemployment   <dbl> 0.05938078, 0.05938078, 0.03861475, 0.03861475…
## $ district_vap            <dbl> 6448445, 6448445, 5536511, 5536511, 3371451, 3…
## $ district_veteran        <dbl> 0.10051504, 0.10051504, 0.13645324, 0.13645324…
## $ district_white          <dbl> 0.6896164, 0.6896164, 0.6888958, 0.6888958, 0.…
## $ district_white_collar   <dbl> 0.6358841, 0.6358841, 0.6134452, 0.6134452, 0.…
## $ district_pvi            <dbl> 7, 7, -5, -5, -8, -8, -8, -8, -13, -13, -15, -…
## $ district_dpres          <dbl> 0.5821141, 0.5821141, 0.4585277, 0.4585277, 0.…
## $ district_dpres_lagged   <dbl> 0.5821141, 0.5821141, 0.4585277, 0.4585277, 0.…
## $ incumbent               <dbl+lbl> 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, …
## $ Gvote_share             <dbl> 0.5643287, 0.4167514, 0.5216336, 0.4702518, 0.…

set.seed(123)
boatright_clean <- boatright %>%
  filter(!is.na(Gvote_share))
train_index <- createDataPartition(boatright_clean$Gvote_share, 
                                   p = 0.80, 
                                   list = FALSE)
train_data <- boatright_clean %>% slice(train_index)

## Warning: Slicing with a 1-column matrix was deprecated in dplyr 1.1.0.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

test_data  <- boatright_clean %>% slice(-train_index)
nrow(train_data)

## [1] 6420

nrow(test_data)

## [1] 1604

Question a:

train_clean <- train_data %>%
  select(Gvote_share, primary_party, district_pvi, 
         district_nonwhite, incumbent, district_dpres_lagged) %>%
  na.omit()

test_clean <- test_data %>%
  select(Gvote_share, primary_party, district_pvi, 
         district_nonwhite, incumbent, district_dpres_lagged) %>%
  na.omit()

set.seed(123)
tree_model <- train(
  Gvote_share ~ primary_party + district_pvi + 
                district_nonwhite + incumbent + district_dpres_lagged,
  data   = train_clean,
  method = "rpart"
)

## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.

print(tree_model)

## CART 
## 
## 5679 samples
##    5 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 5679, 5679, 5679, 5679, 5679, 5679, ... 
## Resampling results across tuning parameters:
## 
##   cp          RMSE       Rsquared   MAE       
##   0.02671824  0.1315874  0.5606821  0.09655974
##   0.03495925  0.1363476  0.5282605  0.10207945
##   0.51275267  0.1748325  0.5044932  0.13855113
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.02671824.

tree_model$finalModel

## n= 5679 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
## 1) root 5679 223.346400 0.5007596  
##   2) incumbent< 0.5 3176  62.491020 0.3746937 *
##   3) incumbent>=0.5 2503  46.333900 0.6607218  
##     6) district_pvi< 16.5 2171  33.847100 0.6388804 *
##     7) district_pvi>=16.5 332   4.678782 0.8035457 *

The least populated leaf node contains 332 observations and is reached through the following decision logic: First, the tree splits on incumbency status — candidates who are incumbents (incumbent >= 0.5) are sent to the right branch. Among incumbents, the tree then splits on the district’s partisan voting index — those in districts with a PVI of 16.5 or higher are sent to the right branch, arriving at the least populated leaf node. Candidates reaching this node are predicted to receive approximately 80% of the general election vote share (yval = 0.8035), suggesting that incumbent candidates in strongly partisan districts perform significantly better than other candidates.

Question b:

set.seed(123)
bag_model <- train(
  Gvote_share ~ primary_party + district_pvi + 
                district_nonwhite + incumbent + district_dpres_lagged,
  data   = train_clean,
  method = "treebag"
)

# Print the model summary
print(bag_model)

## Bagged CART 
## 
## 5679 samples
##    5 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 5679, 5679, 5679, 5679, 5679, 5679, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE       
##   0.1246528  0.6059557  0.09022699

The bagged model performs better than the single decision tree from part (a). The bagged model produced a lower RMSE (0.125) compared to the single decision tree (0.132), indicating that its predictions are closer to the actual values on average. This improvement makes sense given how bagging works — by repeatedly bootstrapping the training data and averaging predictions across 25 trees, the bagged model reduces the variance that comes with relying on a single tree. A single decision tree is sensitive to the specific data it is trained on, but averaging across many trees produces more stable and accurate predictions.

Question c:

tree_predictions <- predict(tree_model, newdata = test_clean)
bag_predictions  <- predict(bag_model,  newdata = test_clean)
set.seed(123)
ols_model <- train(
  Gvote_share ~ primary_party + district_pvi + 
                district_nonwhite + incumbent + district_dpres_lagged,
  data   = train_clean,
  method = "lm"
)
ols_predictions <- predict(ols_model, newdata = test_clean)
tree_rmse <- RMSE(tree_predictions, test_clean$Gvote_share)
bag_rmse  <- RMSE(bag_predictions,  test_clean$Gvote_share)
ols_rmse  <- RMSE(ols_predictions,  test_clean$Gvote_share)
tree_rmse

## [1] 0.1298257

bag_rmse

## [1] 0.1194772

ols_rmse

## [1] 0.133701

The bagged model performs best on the hold-out test set with the lowest RMSE of 0.1195, followed by the decision tree (0.1298), and then OLS (0.1337). The bagged model’s superior performance is consistent with what we saw in Q1b — averaging predictions across many bootstrapped trees reduces variance and produces more accurate out-of-sample predictions compared to a single tree. The OLS model performs worst of the three. This is likely because OLS assumes a linear relationship between the predictors and vote share, which may not fully capture the more complex, non-linear patterns in the data. The tree-based models, being non-parametric, are more flexible and better able to capture these patterns without assuming a specific functional form.

Question d:

prediction_comparison <- test_clean %>%
  mutate(
    predicted    = bag_predictions,
    error        = abs(Gvote_share - predicted)
  ) %>%
  arrange(desc(error))
prediction_comparison %>%
  select(Gvote_share, predicted, error, incumbent, 
         district_pvi, district_nonwhite, 
         primary_party, district_dpres_lagged) %>%
  slice(1:3)

## # A tibble: 3 × 8
##   Gvote_share predicted error incumbent           district_pvi district_nonwhite
##         <dbl>     <dbl> <dbl> <dbl+lbl>                  <dbl>             <dbl>
## 1      0.0934     0.801 0.708 1 [Incumbent]                 41             0.690
## 2      1.000      0.405 0.594 0 [Not an incumben…          -18             0.191
## 3      0.909      0.386 0.523 0 [Not an incumben…           43             0.600
## # ℹ 2 more variables: primary_party <dbl+lbl>, district_dpres_lagged <dbl>

Case 1: This case has the largest prediction error in the entire hold-out set. The candidate is an incumbent (incumbent = 1) in a strongly Republican district (district_pvi = 41) with a high proportion of nonwhite residents (0.69) and a high Democratic presidential vote share (0.90). The model likely struggled here because this is an unusual combination — an incumbent in a heavily Republican district with demographics that typically favor Democrats. This contradiction between district partisanship and demographic composition may represent a rare edge case that the model has not seen enough of in training to predict accurately. Case 2: This candidate received nearly 100% of the vote share, yet the model predicted only 40%. The candidate is a non-incumbent (incumbent = 0) in a strongly Democratic-leaning district (district_pvi = -18) with a relatively low nonwhite population (0.19). The model likely struggled because uncontested or near-uncontested races — where a candidate wins essentially all votes — are rare and extreme outcomes that tree-based models tend to underestimate by averaging across many more typical observations. Case 3: This candidate received over 90% of the vote yet the model predicted only 39%. Like Case 2, this is a non-incumbent in a strongly partisan district (district_pvi = 43) with a high Democratic presidential vote lagged (0.93). The model appears to systematically underpredict very high vote shares for non-incumbents, suggesting it has learned that non-incumbents typically perform more modestly. These three cases together suggest the model’s key limitation is handling extreme or unusual outcomes that fall outside the typical patterns seen in training.

Question 2 Data Set up

wine <- read.csv("C:/Users/amyaz/OneDrive/Documents/Grad School/Spring 2025/Data Science/Problem Sets/PSet 3/wine_quality.csv")
glimpse(wine)

## Rows: 1,599
## Columns: 12
## $ fixed.acidity        <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5…
## $ volatile.acidity     <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660, 0.600, …
## $ citric.acid          <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, 0…
## $ residual.sugar       <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1,…
## $ chlorides            <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069, …
## $ free.sulfur.dioxide  <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15, 17, 16…
## $ total.sulfur.dioxide <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, 65, 102,…
## $ density              <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978, 0…
## $ pH                   <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3…
## $ sulphates            <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0…
## $ alcohol              <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.…
## $ premium              <int> 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1…

set.seed(123)
wine_train_index <- createDataPartition(wine$alcohol, 
                                        p    = 0.80, 
                                        list = FALSE)
wine_train <- wine %>% slice(wine_train_index)
wine_test  <- wine %>% slice(-wine_train_index)
nrow(wine_train)

## [1] 1280

nrow(wine_test)

## [1] 319

Example response to question 1 here.

Question a:

library(ggplot2)
set.seed(123)
wine_ols <- train(
  alcohol ~ fixed.acidity + residual.sugar + pH + premium,
  data   = wine_train,
  method = "lm"
)
print(wine_ols)

## Linear Regression 
## 
## 1280 samples
##    4 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 1280, 1280, 1280, 1280, 1280, 1280, ... 
## Resampling results:
## 
##   RMSE       Rsquared  MAE     
##   0.9355896  0.219996  0.734495
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

wine_ols_predictions <- predict(wine_ols, newdata = wine_test)
wine_ols_rmse <- RMSE(wine_ols_predictions, wine_test$alcohol)
wine_ols_rmse

## [1] 0.9238032

ggplot(data = wine_test, aes(x = wine_ols_predictions)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  labs(
    title = "Histogram of Predicted Alcohol Values",
    x     = "Predicted Alcohol",
    y     = "Count"
  )

The linear regression model was fit using fixed.acidity, residual.sugar, pH, and premium as predictors of alcohol content. The model produced an RMSE of 0.924 on the hold-out test set, meaning that on average the model’s predictions are off by about 0.924 percentage points of alcohol content. The histogram of predicted values shows a bimodal distribution — there appear to be two distinct clusters of predictions, one centered around 10% alcohol and another around 11%. This pattern likely reflects the influence of the premium variable, which splits wines into two groups and pulls predictions in two different directions. Overall, the relatively high RMSE suggests that these four predictors alone do not explain alcohol content very well, and that a more complex model may be needed.

Question b:

set.seed(123)
knn_model <- train(
  alcohol ~ fixed.acidity + residual.sugar + pH + premium,
  data      = wine_train,
  method    = "knn",
  tuneGrid  = expand.grid(k = c(1, 5, 10, 20)),
  preProcess = c("center", "scale")
)
print(knn_model)

## k-Nearest Neighbors 
## 
## 1280 samples
##    4 predictor
## 
## Pre-processing: centered (4), scaled (4) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 1280, 1280, 1280, 1280, 1280, 1280, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE       Rsquared   MAE      
##    1  1.1006471  0.1937649  0.7412739
##    5  0.9670722  0.2349779  0.7197882
##   10  0.9176695  0.2703444  0.6967703
##   20  0.8962088  0.2861909  0.6919724
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 20.

plot(knn_model)

knn_predictions <- predict(knn_model, newdata = wine_test)
knn_rmse <- RMSE(knn_predictions, wine_test$alcohol)
knn_rmse

## [1] 0.8850283

new_wine <- data.frame(
  fixed.acidity  = 7.1,
  residual.sugar = 4.2,
  pH             = 2.8,
  premium        = 0
)

predict(knn_model, newdata = new_wine)

## [1] 9.365

K=1 gives us RMSE of 1.1006, K=5 with RMSE = 0.9671, K=10 with RMSE = 0.9177, K=20 with RMSE = 0.8962. Overall K = 20 performs best as it produces the lowest RMSE of 0.8962 across the bootstrapped resamples, and also the lowest RMSE of 0.885 on the hold-out test set. As shown in the plot, RMSE consistently decreases as K increases from 1 to 20, suggesting that averaging across more neighbors produces more stable and accurate predictions. This is consistent with the bias-variance tradeoff discussed in class — smaller values of K like K = 1 produce high variance models that are very sensitive to individual data points, while larger values of K smooth out those irregularities. One important assumption made in this model is that the predictor variables were centered and scaled before fitting. This is necessary for KNN because the algorithm relies on measuring distances between observations — without scaling, variables with larger numeric ranges would dominate the distance calculation over variables with smaller ranges. Using the best model (K = 20), the predicted alcohol content for a wine with fixed.acidity = 7.1, residual.sugar = 4.2, pH = 2.8, and premium = 0 is 9.365%.

Question c:

set.seed(123)
wine_tree <- train(
  alcohol ~ .,
  data   = wine_train,
  method = "rpart"
)

## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.

print(wine_tree)

## CART 
## 
## 1280 samples
##   11 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 1280, 1280, 1280, 1280, 1280, 1280, ... 
## Resampling results across tuning parameters:
## 
##   cp          RMSE       Rsquared   MAE      
##   0.04882132  0.8490307  0.3611027  0.6543731
##   0.07195399  0.8854062  0.3027272  0.6937088
##   0.29007284  0.9646649  0.2608825  0.7754662
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.04882132.

wine_tree_predictions <- predict(wine_tree, newdata = wine_test)
wine_tree_rmse <- RMSE(wine_tree_predictions, wine_test$alcohol)
wine_tree_rmse

## [1] 0.8204915

The kitchen sink regression tree used all 11 available variables as predictors and produced a test RMSE of 0.8205, which is lower than the best KNN model’s test RMSE of 0.8850. This means the kitchen sink regression tree outperforms the KNN model on the hold-out test set. This result makes intuitive sense — by including all available variables, the regression tree has access to more information about each wine and can therefore make more accurate predictions. The additional predictors beyond the four used in the KNN model appear to contain useful information about alcohol content that the KNN model was not able to capture.

Question d:

When K=1, the KNN model predicts each training observation by finding its single nearest neighbor — which is the observation itself. This means the model simply returns the exact observed value for every training observation, resulting in a perfect RMSE of 0 on the training set. However, this does not indicate a good model, it is a classic example of overfitting, where the model has memorized the training data rather than learning any meaningful patterns. The problem becomes clear when we evaluate the model on the hold-out test set. When presented with new, unseen observations, the model can no longer find an exact match and instead relies on the single closest neighbor in the training data, which may not be very similar at all. As we saw in our results, K=1 produced the highest RMSE of 1.1006 across all K values we tested, confirming that perfect training performance does not translate to good out-of-sample predictions. This is precisely why we set aside a hold-out test set, to evaluate how well our model generalizes to new data rather than how well it memorizes the data it was trained on.

Problem Set 3 Submission

Amy Azania (Adrianne Sugden)

2026-03-18

Question 1 Data Set up

Question 2 Data Set up