This is the R portion of your final exam.
Follow the instructions carefully and write your R code in the provided chunks. You will be graded on the correctness of your code, the quality of your analysis, and your interpretation of the results.
Submission: Please make sure the RMD is knittable and submit the RMD file along with the generated HTML report.
Troubleshooting: If you find errors in your code that prevent the RMD file from knitting, please comment them out (add # before the code). I will give you partial credit based on your logic.
Good luck!
Context: You have been hired by a retail consulting firm to analyze the sales performance of a company selling child car seats. The company wants to identify the key drivers of high sales performance to optimize their marketing and store layout strategies.
They have provided you with a dataset (store_sales.csv) containing data from 400 different store locations. Your goal is to build classification models to predict whether a store will have “High Sales” (Yes) or not.
Data Dictionary:
High_Sales (Target): Factor with levels Yes and No. Indicates if the store sold more than 8,000 units.
CompPrice: Price charged by the nearest competitor at each location.
Income: Community income level (in thousands of dollars).
Advertising: Local advertising budget for the company at each location.
Population: Population size of the region (in thousands).
Price: Price charged for the car seats at each site.
ShelveLoc: A factor indicating the quality of the shelving location for the car seats at the site (Good, Bad, or Medium).
Age: Average age of the local population.
Education: Education level at each location.
Urban: Factor (Yes/No) indicating if the store is in an urban location.
US: Factor (Yes/No) indicating if the store is in the US.
Task: (a) Load the data from store_sales.csv and name it store_sales. (b) Split the data into training (70%) and test (30%) sets. Use set.seed(2025) to ensure reproducibility.
# a) Load data
# Your code here
store_sales <- read.csv("store_sales.csv")
store_sales$High_Sales <- factor(store_sales$High_Sales,
                                 levels = c(0, 1),
                                 labels = c("No", "Yes"))
# b) Split data into training and test sets
set.seed(2025)
# Your code here
n_store <- nrow(store_sales)
train_size_store <- floor(0.7 * n_store)
train_index_store <- sample(1:n_store, size = train_size_store)
store_train <- store_sales[train_index_store, ]
store_test <- store_sales[-train_index_store, ]
Task: Fit a logistic regression model for High_Sales using all other variables as predictors. Please use the training dataset.
# Your code here
logit_model <- glm(High_Sales ~ .,
                   data = store_train,
                   family = binomial)
logit_model
##
## Call: glm(formula = High_Sales ~ ., family = binomial, data = store_train)
##
## Coefficients:
## (Intercept) CompPrice Income Advertising
## -6.787364 0.201653 0.034553 0.371516
## Population Price ShelveLocGood ShelveLocMedium
## -0.002566 -0.179454 9.338531 4.270635
## Age Education UrbanYes USYes
## -0.087756 -0.133422 0.140172 -1.626954
##
## Degrees of Freedom: 279 Total (i.e. Null); 268 Residual
## Null Deviance: 375.2
## Residual Deviance: 106.2 AIC: 130.2
Task: Use the summary() function to examine your fitted model. What is the estimated coefficient for Price? Is it statistically significant? Please interpret the number using the odds ratio.
# Your code here
logit_summary <- summary(logit_model)
logit_summary
##
## Call:
## glm(formula = High_Sales ~ ., family = binomial, data = store_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.787364 3.390753 -2.002 0.04531 *
## CompPrice 0.201653 0.034124 5.909 3.43e-09 ***
## Income 0.034553 0.010828 3.191 0.00142 **
## Advertising 0.371516 0.075899 4.895 9.84e-07 ***
## Population -0.002566 0.001927 -1.332 0.18296
## Price -0.179454 0.026173 -6.856 7.06e-12 ***
## ShelveLocGood 9.338531 1.407041 6.637 3.20e-11 ***
## ShelveLocMedium 4.270635 0.891648 4.790 1.67e-06 ***
## Age -0.087756 0.019324 -4.541 5.59e-06 ***
## Education -0.133422 0.103935 -1.284 0.19924
## UrbanYes 0.140172 0.594386 0.236 0.81357
## USYes -1.626954 0.830909 -1.958 0.05023 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 375.21 on 279 degrees of freedom
## Residual deviance: 106.23 on 268 degrees of freedom
## AIC: 130.23
##
## Number of Fisher Scoring iterations: 7
price_coef <- logit_summary$coefficients["Price", ]
price_coef
## Estimate Std. Error z value Pr(>|z|)
## -1.794535e-01 2.617334e-02 -6.856347e+00 7.064380e-12
price_or <- exp(coef(logit_model)["Price"])
price_ci <- exp(confint(logit_model, "Price"))
## Waiting for profiling to be done...
price_or
## Price
## 0.8357268
price_ci
## 2.5 % 97.5 %
## 0.7886684 0.8748646
Comments: The estimated coefficient for Price is -0.1795 (price_coef[1] in the table above), with a p-value of 7.06e-12, far below 0.05, so Price is statistically significant. The odds ratio is exp(-0.1795) ≈ 0.836 (95% CI: 0.789 to 0.875). Because the odds ratio is below 1, higher prices are associated with lower odds of High Sales: holding the other variables constant, each one-unit increase in Price multiplies the odds of High Sales by about 0.836, i.e., reduces the odds by roughly (1 − 0.836) × 100% ≈ 16.4%.
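To make the (OR − 1) × 100% interpretation concrete, a quick optional check reusing logit_model from above (the name pct_change_odds is just illustrative):
# Percent change in the odds of High Sales per one-unit increase in Price
pct_change_odds <- (exp(coef(logit_model)["Price"]) - 1) * 100
pct_change_odds  # about -16.4: each one-unit price increase lowers the odds by ~16.4%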
Task: Use the fitted model to predict the probability of High Sales for each store in the test set, and classify a store as Yes when the predicted probability is at least 0.6.
# Your code here
prob_test <- predict(logit_model,
                     newdata = store_test,
                     type = "response")
pred_class_0.6 <- ifelse(prob_test >= 0.6, "Yes", "No")
pred_class_0.6 <- factor(pred_class_0.6,
                         levels = levels(store_test$High_Sales))
head(prob_test)
## 1 3 7 9 11 17
## 0.11573246 0.86791690 0.03602412 0.06718566 0.51958266 0.46516148
head(pred_class_0.6)
## 1 3 7 9 11 17
## No Yes No No No No
## Levels: No Yes
Task: Build the confusion matrix for the test set using the 0.6 threshold, and compute the misclassification rate.
# Your code here
conf_mat <- table(Predicted = pred_class_0.6,
                  Actual = store_test$High_Sales)
conf_mat
## Actual
## Predicted No Yes
## No 63 16
## Yes 3 38
misclass_rate <- mean(pred_class_0.6 != store_test$High_Sales)
misclass_rate
## [1] 0.1583333
Comments: The confusion matrix above shows the counts of correctly and incorrectly classified stores: 63 stores correctly predicted No and 38 correctly predicted Yes, while 16 true High Sales stores were missed (predicted No) and 3 stores were falsely flagged as High Sales. The misclassification rate is 19/120 ≈ 0.158, i.e., about 15.8% of test stores are misclassified (accuracy about 84%), which seems reasonably good. Whether it is acceptable depends on the relative cost of errors: at the 0.6 cutoff the model misses far more true High Sales stores (16) than it falsely flags (3), so if missing a high-performing store is the costlier mistake, a lower threshold may be preferable.
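Beyond the overall misclassification rate, sensitivity and specificity follow directly from the same table; a small optional sketch using conf_mat as defined above (the metric names are mine, not from the exam):
# conf_mat has predicted classes in rows and actual classes in columns
accuracy    <- sum(diag(conf_mat)) / sum(conf_mat)              # (63 + 38) / 120, equals 1 - misclass_rate
sensitivity <- conf_mat["Yes", "Yes"] / sum(conf_mat[, "Yes"])  # TPR: 38 / 54
specificity <- conf_mat["No", "No"] / sum(conf_mat[, "No"])     # TNR: 63 / 66
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)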
Task: Plot the ROC curve for the test data and compute the AUC.
# Your code here
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
roc_obj <- roc(response = store_test$High_Sales,
               predictor = prob_test,
               levels = c("No", "Yes"), # specify order: control, case
               direction = "<")
plot(roc_obj, main = "ROC Curve for Logistic Regression (Test Data)")
auc_value <- auc(roc_obj)
auc_value
## Area under the curve: 0.9374
Comments: The ROC curve shows the tradeoff between the True Positive Rate (TPR) and the False Positive Rate (FPR) across all possible thresholds. The AUC (printed as auc_value) measures the model's overall ability to discriminate between High Sales and Not High Sales: AUC ≈ 0.5 is no better than random guessing, roughly 0.7–0.8 is reasonable/acceptable, and above 0.8 indicates strong discrimination. Here the AUC is about 0.937, so the model discriminates very well on the test data.
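If a data-driven cutoff were preferred over the fixed 0.6 threshold, pROC can suggest one; an optional sketch using roc_obj from above (coords() with "best" maximizes Youden's J by default):
# Threshold maximizing sensitivity + specificity - 1 (Youden's J)
coords(roc_obj, "best", ret = c("threshold", "sensitivity", "specificity"))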
You have been hired by a health insurance company to improve their pricing strategy. They want to understand which factors contribute most to high individual medical costs.
Data Dictionary:
charges (Target): Individual medical costs billed by health insurance.
age: Age of primary beneficiary.
sex: Insurance contractor gender (female, male).
bmi: Body mass index (providing an understanding of body weights that are relatively high or low relative to height).
children: Number of children covered by health insurance / number of dependents.
smoker: Smoking status (yes, no).
region: The beneficiary's residential area in the US (northeast, southeast, southwest, northwest).
Task: (a) Load the data from insurance.csv and name it insurance. (b) Split the data into training (70%) and test (30%) sets. Use set.seed(2025) to ensure reproducibility. Hint: use the round() function to retain only the integer part of a number.
# a) Load data
insurance <- read.csv("insurance.csv")
insurance$sex <- factor(insurance$sex)
insurance$smoker <- factor(insurance$smoker)
insurance$region <- factor(insurance$region)
dim(insurance)
## [1] 1338 7
head(insurance)
## age sex bmi children smoker region charges
## 1 19 female 27.900 0 yes southwest 16884.924
## 2 18 male 33.770 1 no southeast 1725.552
## 3 28 male 33.000 3 no southeast 4449.462
## 4 33 male 22.705 0 no northwest 21984.471
## 5 32 male 28.880 0 no northwest 3866.855
## 6 31 female 25.740 0 no southeast 3756.622
# b) Split data into training (70%) and test (30%) sets
set.seed(2025)
n_ins <- nrow(insurance)
train_size_ins <- floor(0.7 * n_ins) # use floor to avoid oversampling
train_index_ins <- sample(seq_len(n_ins),
size = train_size_ins,
replace = FALSE)
ins_train <- insurance[train_index_ins, ]
ins_test <- insurance[-train_index_ins, ]
dim(ins_train)
## [1] 936 7
dim(ins_test)
## [1] 402 7
Task: Fit a regression tree to predict charges using all other variables on the training set, and visualize it.
# Your code here
library(rpart)
library(rpart.plot)
tree_model <- rpart(charges ~ .,
                    data = ins_train,
                    method = "anova")
rpart.plot(tree_model,
           main = "Regression Tree for Insurance Charges",
           type = 2, extra = 101)
Task: Re-grow the tree with cp = 0.001, and then use the plotcp() function to view the complexity parameter plot. Based on this plot, what value of cp would you choose to prune the tree, and why?
# Your code here
tree_big <- rpart(charges ~ .,
                  data = ins_train,
                  method = "anova",
                  control = rpart.control(cp = 0.001))
plotcp(tree_big, main = "CP Plot for Regression Tree")
tree_big$cptable
## CP nsplit rel error xerror xstd
## 1 0.636487668 0 1.0000000 1.0022350 0.06372409
## 2 0.131491557 1 0.3635123 0.3673720 0.02215802
## 3 0.066828548 2 0.2320208 0.2383865 0.01746178
## 4 0.010333490 3 0.1651922 0.1754024 0.01595723
## 5 0.010057707 4 0.1548587 0.1713478 0.01641777
## 6 0.007485914 5 0.1448010 0.1627650 0.01604445
## 7 0.005999863 6 0.1373151 0.1547192 0.01617404
## 8 0.003067693 7 0.1313153 0.1511808 0.01616544
## 9 0.002239249 8 0.1282476 0.1548680 0.01616360
## 10 0.001905540 10 0.1237691 0.1554141 0.01639511
## 11 0.001669467 12 0.1199580 0.1541179 0.01632603
## 12 0.001544252 13 0.1182885 0.1530116 0.01629801
## 13 0.001360973 14 0.1167443 0.1527338 0.01631195
## 14 0.001170351 15 0.1153833 0.1498842 0.01618482
## 15 0.001111418 16 0.1142129 0.1490967 0.01615289
## 16 0.001000000 17 0.1131015 0.1502781 0.01621959
best_cp <- tree_big$cptable[which.min(tree_big$cptable[, "xerror"]), "CP"]
best_cp
## [1] 0.001111418
Comments: The CP plot (plotcp) shows the cross-validated relative error (xerror) as the tree grows and cp shrinks. A common rule is to choose the cp that minimizes xerror, or the simplest tree whose xerror is within one standard error of the minimum (xerror + xstd). From the table (tree_big$cptable), best_cp ≈ 0.00111 minimizes the cross-validated error (xerror ≈ 0.149 at 16 splits), so I choose that value to prune the tree.
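For comparison, the one-standard-error rule mentioned above can be applied directly to the CP table; an optional sketch assuming tree_big from the chunk above (cp_one_se is an illustrative name):
cptab <- tree_big$cptable
min_row <- which.min(cptab[, "xerror"])
# Cutoff: minimum xerror plus its standard error
se_cutoff <- cptab[min_row, "xerror"] + cptab[min_row, "xstd"]
# Simplest tree (fewest splits) whose xerror falls under the cutoff
one_se_row <- min(which(cptab[, "xerror"] <= se_cutoff))
cp_one_se <- cptab[one_se_row, "CP"]
cp_one_se  # here this selects CP ≈ 0.0075, a 5-split tree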
Task: Prune the tree using your chosen cp value, then generate predictions on the test set from both the initial tree and the pruned tree.
# Your code here
tree_pruned <- prune(tree_big, cp = best_cp)
pred_tree_initial <- predict(tree_model, newdata = ins_test)
pred_tree_pruned <- predict(tree_pruned, newdata = ins_test)
head(pred_tree_initial)
## 4 5 6 7 9 12
## 5322.270 5322.270 5322.270 9852.798 5322.270 21535.288
head(pred_tree_pruned)
## 4 5 6 7 9 12
## 5511.906 5511.906 5511.906 9852.798 10060.859 25294.248
Task: Compute the test MSE for both trees. Which tree do you prefer, and why?
# Your code here
y_test <- ins_test$charges
mse_initial <- mean((y_test - pred_tree_initial)^2)
mse_pruned <- mean((y_test - pred_tree_pruned)^2)
mse_initial
## [1] 30134418
mse_pruned
## [1] 27660845
Comments: The pruned tree has a lower test MSE (about 27.66 million) than the initial tree (about 30.13 million), so it has better out-of-sample performance and is the tree I prefer. Note that the pruned tree here (grown with cp = 0.001 and pruned at cp ≈ 0.00111, 16 splits) is actually larger than the initial default tree, so the improvement comes from letting cross-validation choose the complexity rather than from simplification; if the two MSEs had been close, the smaller tree could still be preferred on interpretability grounds.
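Since MSE is in squared dollars, an optional conversion to RMSE makes the gap easier to read (using mse_initial and mse_pruned from above):
# RMSE: typical prediction error in dollars
sqrt(c(initial = mse_initial, pruned = mse_pruned))  # roughly 5490 vs 5260 dollars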
End of Exam. Please double-check that your RMD file knits successfully. Submit both the RMD and the generated HTML report.
Reminder: If a specific chunk causes an error, comment it out to allow the file to knit. Failure to submit an HTML report may result in a point deduction.