Instructions

This is the R portion of your final exam.

Follow the instructions carefully and write your R code in the provided chunks. You will be graded on the correctness of your code, the quality of your analysis, and your interpretation of the results.

Submission: Please make sure the RMD is knittable and submit the RMD file along with the generated HTML report.

Troubleshooting: If you find errors in your code that prevent the RMD file from knitting, please comment them out (add # before the code). I will give you partial credit based on your logic.

Good luck!

Part I: Logistic Regression

0. Background

Context: You have been hired by a retail consulting firm to analyze the sales performance of a company selling child car seats. The company wants to identify the key drivers of high sales performance to optimize their marketing and store layout strategies.

They have provided you with a dataset (store_sales.csv) containing data from 400 different store locations. Your goal is to build classification models to predict whether a store will have “High Sales” (Yes) or not.

Data Dictionary:

  • High_Sales (Target): Factor with levels Yes and No. Indicates if the store sold more than 8,000 units.
  • CompPrice: Price charged by the nearest competitor at each location.
  • Income: Community income level (in thousands of dollars).
  • Advertising: Local advertising budget for the company at each location.
  • Population: Population size of the region (in thousands).
  • Price: Price charged for the car seats at each site.
  • ShelveLoc: A factor indicating the quality of the shelving location for the car seats at the site (Good, Bad, or Medium).
  • Age: Average age of the local population.
  • Education: Education level at each location.
  • Urban: Factor (Yes/No) indicating if the store is in an urban location.
  • US: Factor (Yes/No) indicating if the store is in the US.

1. Data Preparation (0.5 point)

  1. Load the data from the file store_sales.csv and name it store_sales.
  2. Split the data into training (70%) and test (30%) sets. Use set.seed(2025) to ensure reproducibility.
# a) Load data
# Your code here
store_sales <- read.csv("store_sales.csv")
store_sales$High_Sales <- factor(store_sales$High_Sales,
                                 levels = c(0, 1),
                                 labels = c("No", "Yes"))

# b) Split data into training and test sets
set.seed(2025)
# Your code here
n_store <- nrow(store_sales)
train_size_store <- floor(0.7 * n_store)

train_index_store <- sample(1:n_store, size = train_size_store)

store_train <- store_sales[train_index_store, ]
store_test  <- store_sales[-train_index_store, ]
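
As an optional sanity check, a minimal sketch confirming the 70/30 split sizes and that both pieces keep a similar Yes/No mix (all objects are those created above):

# Optional sanity check (not required): split sizes and class balance
dim(store_train)
dim(store_test)
prop.table(table(store_train$High_Sales))
prop.table(table(store_test$High_Sales))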

2. Logistic Regression (5 points)

  1. Fit a logistic regression model to predict High_Sales using all other variables as predictors. Please use the training dataset.
# Your code here
logit_model <- glm(High_Sales ~ .,
                   data = store_train,
                   family = binomial)

logit_model
## 
## Call:  glm(formula = High_Sales ~ ., family = binomial, data = store_train)
## 
## Coefficients:
##     (Intercept)        CompPrice           Income      Advertising  
##       -6.787364         0.201653         0.034553         0.371516  
##      Population            Price    ShelveLocGood  ShelveLocMedium  
##       -0.002566        -0.179454         9.338531         4.270635  
##             Age        Education         UrbanYes            USYes  
##       -0.087756        -0.133422         0.140172        -1.626954  
## 
## Degrees of Freedom: 279 Total (i.e. Null);  268 Residual
## Null Deviance:       375.2 
## Residual Deviance: 106.2     AIC: 130.2
  2. Use the summary() function to examine your fitted model. What is the estimated coefficient for Price? Is it statistically significant? Please interpret the coefficient using the odds ratio.
# Your code here
logit_summary <- summary(logit_model)
logit_summary
## 
## Call:
## glm(formula = High_Sales ~ ., family = binomial, data = store_train)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -6.787364   3.390753  -2.002  0.04531 *  
## CompPrice        0.201653   0.034124   5.909 3.43e-09 ***
## Income           0.034553   0.010828   3.191  0.00142 ** 
## Advertising      0.371516   0.075899   4.895 9.84e-07 ***
## Population      -0.002566   0.001927  -1.332  0.18296    
## Price           -0.179454   0.026173  -6.856 7.06e-12 ***
## ShelveLocGood    9.338531   1.407041   6.637 3.20e-11 ***
## ShelveLocMedium  4.270635   0.891648   4.790 1.67e-06 ***
## Age             -0.087756   0.019324  -4.541 5.59e-06 ***
## Education       -0.133422   0.103935  -1.284  0.19924    
## UrbanYes         0.140172   0.594386   0.236  0.81357    
## USYes           -1.626954   0.830909  -1.958  0.05023 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 375.21  on 279  degrees of freedom
## Residual deviance: 106.23  on 268  degrees of freedom
## AIC: 130.23
## 
## Number of Fisher Scoring iterations: 7
price_coef <- logit_summary$coefficients["Price", ]
price_coef
##      Estimate    Std. Error       z value      Pr(>|z|) 
## -1.794535e-01  2.617334e-02 -6.856347e+00  7.064380e-12
price_or <- exp(coef(logit_model)["Price"])
price_ci <- exp(confint(logit_model, "Price"))
## Waiting for profiling to be done...
price_or
##     Price 
## 0.8357268
price_ci
##     2.5 %    97.5 % 
## 0.7886684 0.8748646

Comments: The estimated coefficient for Price is −0.1795 (SE = 0.0262), with a p-value of 7.06e-12, far below 0.05, so Price is statistically significant. The corresponding odds ratio is exp(−0.1795) ≈ 0.836 (95% CI: 0.789 to 0.875). Because the odds ratio is below 1, higher prices are associated with lower odds of High Sales: holding the other variables constant, each one-unit increase in Price multiplies the odds of High Sales by about 0.836, i.e., it reduces the odds by roughly (1 − 0.836) × 100% ≈ 16.4%.
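
To make the (OR − 1) × 100% arithmetic concrete, a minimal sketch reusing the price_or object computed above:

# Express the Price odds ratio as a percent change in odds.
# With OR ≈ 0.836, each one-unit price increase lowers the odds
# of High Sales by roughly 16.4%, holding other predictors fixed.
pct_change_odds <- (price_or - 1) * 100
pct_change_odds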

  3. Generate predicted probabilities on the testing dataset. Then, obtain the predicted classes using 0.6 as the cutoff value (threshold).
# Your code here
prob_test <- predict(logit_model,
                     newdata = store_test,
                     type = "response")

# Classify as "Yes" when the predicted probability is at least 0.6
pred_class_0.6 <- ifelse(prob_test >= 0.6, "Yes", "No")

pred_class_0.6 <- factor(pred_class_0.6,
                         levels = levels(store_test$High_Sales))

head(prob_test)
##          1          3          7          9         11         17 
## 0.11573246 0.86791690 0.03602412 0.06718566 0.51958266 0.46516148
head(pred_class_0.6)
##   1   3   7   9  11  17 
##  No Yes  No  No  No  No 
## Levels: No Yes
  4. Create a confusion matrix using the testing dataset and report the Misclassification Rate (MR).
# Your code here
conf_mat <- table(Predicted = pred_class_0.6,
                  Actual = store_test$High_Sales)
conf_mat
##          Actual
## Predicted No Yes
##       No  63  16
##       Yes  3  38
misclass_rate <- mean(pred_class_0.6 != store_test$High_Sales)
misclass_rate
## [1] 0.1583333

Comments: The confusion matrix shows 63 true negatives, 38 true positives, 3 false positives, and 16 false negatives out of 120 test stores. The Misclassification Rate is (3 + 16) / 120 ≈ 0.158, meaning about 15.8% of test stores are misclassified (accuracy ≈ 84.2%). Given the business context, where a misclassification misdirects marketing or shelving decisions rather than causing a serious loss, this MR seems acceptable; note, however, that 16 of the 54 truly High-Sales stores are missed, which reflects the relatively strict 0.6 cutoff.
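
As a cross-check, the same MR, plus the per-class rates behind it, can be read straight off conf_mat; a short sketch:

# Off-diagonal cells are the misclassified stores: (3 + 16) / 120
mr_from_table <- 1 - sum(diag(conf_mat)) / sum(conf_mat)
mr_from_table  # should match misclass_rate

# Per-class performance at the 0.6 cutoff ("Yes" = positive class)
sens <- conf_mat["Yes", "Yes"] / sum(conf_mat[, "Yes"])  # 38 / 54
spec <- conf_mat["No", "No"]   / sum(conf_mat[, "No"])   # 63 / 66
c(sensitivity = sens, specificity = spec)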

  5. Draw an ROC curve and calculate the AUC (Area Under the Curve) for this logistic regression model on the testing dataset. Do you think the prediction performance is acceptable?
# Your code here

library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
roc_obj <- roc(response = store_test$High_Sales,
               predictor = prob_test,
               levels = c("No", "Yes"),  # specify order: control, case
               direction = "<")

plot(roc_obj, main = "ROC Curve for Logistic Regression (Test Data)")

auc_value <- auc(roc_obj)
auc_value
## Area under the curve: 0.9374

Comments: The ROC curve shows the trade-off between the True Positive Rate (TPR) and the False Positive Rate (FPR) across all classification thresholds. The AUC (printed as auc_value) measures the model's overall ability to discriminate between High Sales and Not High Sales stores:

  • AUC ≈ 0.5 → no better than random guessing.
  • AUC between about 0.7 and 0.8 → reasonable / acceptable.
  • AUC above 0.8 → strong discrimination.

Here AUC ≈ 0.937, well above 0.8, so the model discriminates very well on the test data and the prediction performance is clearly acceptable.
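
For more than a point estimate, pROC can also attach a confidence interval to the AUC and report the threshold that maximizes Youden's J; a brief sketch using the roc_obj built above:

# 95% confidence interval for the AUC (DeLong method by default)
ci.auc(roc_obj)

# Cutoff maximizing sensitivity + specificity - 1 (Youden's J),
# an alternative to the fixed 0.6 threshold used earlier
coords(roc_obj, "best", ret = c("threshold", "sensitivity", "specificity"))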

Part II: Regression Tree

0. Background

You have been hired by a health insurance company to improve their pricing strategy. They want to understand which factors contribute most to high individual medical costs.

Data Dictionary:

  • charges (Target): Individual medical costs billed by health insurance.
  • age: Age of primary beneficiary.
  • sex: Insurance contractor gender (female, male).
  • bmi: Body mass index (providing an understanding of body weights that are relatively high or low relative to height).
  • children: Number of children covered by health insurance / Number of dependents.
  • smoker: Smoking status (yes, no).
  • region: The beneficiary’s residential area in the US (northeast, southeast, southwest, northwest).

1. Data Preparation (0.5 point)

  1. Load the data from the file insurance.csv and name it insurance.
  2. Split the data into training (70%) and test (30%) sets. Use set.seed(2025) to ensure reproducibility. Hint: use the floor() function to retain only the integer part of a number.
# a) Load data
insurance <- read.csv("insurance.csv")

insurance$sex    <- factor(insurance$sex)
insurance$smoker <- factor(insurance$smoker)
insurance$region <- factor(insurance$region)

dim(insurance)
## [1] 1338    7
head(insurance)
##   age    sex    bmi children smoker    region   charges
## 1  19 female 27.900        0    yes southwest 16884.924
## 2  18   male 33.770        1     no southeast  1725.552
## 3  28   male 33.000        3     no southeast  4449.462
## 4  33   male 22.705        0     no northwest 21984.471
## 5  32   male 28.880        0     no northwest  3866.855
## 6  31 female 25.740        0     no southeast  3756.622
# b) Split data into training (70%) and test (30%) sets
set.seed(2025)

n_ins <- nrow(insurance)
train_size_ins <- floor(0.7 * n_ins)   # use floor to avoid oversampling

train_index_ins <- sample(seq_len(n_ins),
                          size    = train_size_ins,
                          replace = FALSE)

ins_train <- insurance[train_index_ins, ]
ins_test  <- insurance[-train_index_ins, ]

dim(ins_train)
## [1] 936   7
dim(ins_test)
## [1] 402   7
2. Regression Tree

  a. Fit a regression tree based on the training data, using charges as the response variable and all other variables as predictors. Then visualize the tree using rpart.plot.
library(rpart)
library(rpart.plot)

tree_model <- rpart(charges ~ .,
                    data = ins_train,
                    method = "anova")

rpart.plot(tree_model,
           main = "Regression Tree for Insurance Charges",
           type = 2, extra = 101)

  b. Pruning: Fit a large regression tree with cp = 0.001, and then use the plotcp() function to view the complexity parameter plot. Based on this plot, what value of cp would you choose to prune the tree, and why?
# Your code here

tree_big <- rpart(charges ~ .,
                  data = ins_train,
                  method = "anova",
                  control = rpart.control(cp = 0.001))

plotcp(tree_big, main = "CP Plot for Regression Tree")

tree_big$cptable
##             CP nsplit rel error    xerror       xstd
## 1  0.636487668      0 1.0000000 1.0022350 0.06372409
## 2  0.131491557      1 0.3635123 0.3673720 0.02215802
## 3  0.066828548      2 0.2320208 0.2383865 0.01746178
## 4  0.010333490      3 0.1651922 0.1754024 0.01595723
## 5  0.010057707      4 0.1548587 0.1713478 0.01641777
## 6  0.007485914      5 0.1448010 0.1627650 0.01604445
## 7  0.005999863      6 0.1373151 0.1547192 0.01617404
## 8  0.003067693      7 0.1313153 0.1511808 0.01616544
## 9  0.002239249      8 0.1282476 0.1548680 0.01616360
## 10 0.001905540     10 0.1237691 0.1554141 0.01639511
## 11 0.001669467     12 0.1199580 0.1541179 0.01632603
## 12 0.001544252     13 0.1182885 0.1530116 0.01629801
## 13 0.001360973     14 0.1167443 0.1527338 0.01631195
## 14 0.001170351     15 0.1153833 0.1498842 0.01618482
## 15 0.001111418     16 0.1142129 0.1490967 0.01615289
## 16 0.001000000     17 0.1131015 0.1502781 0.01621959
best_cp <- tree_big$cptable[which.min(tree_big$cptable[, "xerror"]), "CP"]
best_cp
## [1] 0.001111418

Comments: The CP plot (plotcp) shows the cross-validated error (xerror) as the tree grows and cp decreases. A common rule is to choose the cp that minimizes xerror, or, by the one-standard-error rule, the simplest tree whose xerror is within one standard error of that minimum (see the sketch below). In the table (tree_big$cptable), the minimum cross-validated error occurs at cp ≈ 0.00111 with 16 splits, which is the value stored in best_cp and used for pruning in the next part.
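
The one-standard-error rule mentioned above is easy to apply programmatically; a sketch based on the cptable printed earlier:

# One-SE rule: simplest tree whose xerror is within one standard
# error of the minimum cross-validated error
cp_tab    <- tree_big$cptable
min_row   <- which.min(cp_tab[, "xerror"])
se_cutoff <- cp_tab[min_row, "xerror"] + cp_tab[min_row, "xstd"]

# cptable rows run from smallest to largest tree, so the first
# qualifying row gives the simplest acceptable tree
one_se_row <- min(which(cp_tab[, "xerror"] <= se_cutoff))
cp_1se     <- cp_tab[one_se_row, "CP"]
cp_1se

On the table above this selects cp ≈ 0.0075 (5 splits), a much smaller tree than the min-xerror choice of cp ≈ 0.00111 (16 splits), at the cost of a slightly higher cross-validated error.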

  c. Refit the regression tree using the cp selected in question (b). Then obtain the predictions on the testing data.
# Your code here

tree_pruned <- prune(tree_big, cp = best_cp)

pred_tree_initial <- predict(tree_model, newdata = ins_test)
pred_tree_pruned <- predict(tree_pruned, newdata = ins_test)

head(pred_tree_initial)
##         4         5         6         7         9        12 
##  5322.270  5322.270  5322.270  9852.798  5322.270 21535.288
head(pred_tree_pruned)
##         4         5         6         7         9        12 
##  5511.906  5511.906  5511.906  9852.798 10060.859 25294.248
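
Before comparing errors, it helps to see how differently sized the two trees are; a small sketch counting terminal nodes (rpart labels leaves "<leaf>" in the frame component):

# Number of terminal nodes (leaves) in each tree
c(initial = sum(tree_model$frame$var == "<leaf>"),
  pruned  = sum(tree_pruned$frame$var == "<leaf>"))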
  d. Obtain the out-of-sample MSE (Mean Squared Error) for both the initial tree model in (a) and the pruned tree model in (c). Which model is preferred for this data, and why?
# Your code here

y_test <- ins_test$charges

mse_initial <- mean((y_test - pred_tree_initial)^2)

mse_pruned <- mean((y_test - pred_tree_pruned)^2)

mse_initial
## [1] 30134418
mse_pruned
## [1] 27660845

Comments: The pruned tree is preferred: its out-of-sample MSE (≈ 27.66 million) is lower than that of the initial tree (≈ 30.13 million), so it generalizes better to new data. Note that "pruned" here means pruned back from the large cp = 0.001 tree; with best_cp ≈ 0.00111 it actually has more splits than the small default tree from (a), but the cross-validation evidence indicates the extra structure is real signal rather than overfitting. Had the pruned tree instead been only slightly worse on MSE while being much simpler, one could still argue for it on interpretability grounds.
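
Because squared dollars are hard to interpret, the RMSE puts the same comparison back on the scale of charges; a quick sketch:

# RMSE in dollars: about 5,490 (initial) vs 5,260 (pruned)
c(initial = sqrt(mse_initial), pruned = sqrt(mse_pruned))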


End of Exam. Please double-check that your RMD file knits successfully. Submit both the RMD and the generated HTML report.

Reminder: If a specific chunk causes an error, comment it out to allow the file to knit. Failure to submit an HTML report may result in a point deduction.