This is the R portion of your final exam.
Follow the instructions carefully and write your R code in the provided chunks. You will be graded on the correctness of your code, the quality of your analysis, and your interpretation of the results.
Submission: Please make sure the RMD is knittable and submit the RMD file along with the generated HTML report.
Troubleshooting: If you find errors in your code that prevent the RMD file from knitting, please comment them out (add # before the code). I will give you partial credit based on your logic.
Good luck!
Context: You have been hired by a retail consulting firm to analyze the sales performance of a company selling child car seats. The company wants to identify the key drivers of high sales performance to optimize their marketing and store layout strategies.
They have provided you with a dataset (store_sales.csv) containing data from 400 different store locations. Your goal is to build classification models to predict whether a store will have “High Sales” (Yes) or not.
Data Dictionary:
High_Sales (Target): Factor with levels Yes and No. Indicates if the store sold more than 8,000 units.
CompPrice: Price charged by the nearest competitor at each location.
Income: Community income level (in thousands of dollars).
Advertising: Local advertising budget for the company at each location.
Population: Population size of the region (in thousands).
Price: Price charged for the car seats at each site.
ShelveLoc: A factor indicating the quality of the shelving location for the car seats at the site (Good, Bad, or Medium).
Age: Average age of the local population.
Education: Education level at each location.
Urban: Factor (Yes/No) indicating if the store is in an urban location.
US: Factor (Yes/No) indicating if the store is in the US.
a) Load the dataset store_sales.csv and name it store_sales. b) Split the data into training (70%) and test (30%) sets, using set.seed(2025) to ensure reproducibility.
# a) Load data
store_sales <- read.csv("store_sales.csv")
str(store_sales)
## 'data.frame': 400 obs. of 11 variables:
## $ CompPrice : int 138 111 113 117 141 124 115 136 132 132 ...
## $ Income : int 73 48 35 100 64 113 105 81 110 113 ...
## $ Advertising: int 11 16 10 4 3 13 0 15 0 0 ...
## $ Population : int 276 260 269 466 340 501 45 425 108 131 ...
## $ Price : int 120 83 80 97 128 72 108 120 124 124 ...
## $ ShelveLoc : chr "Bad" "Good" "Medium" "Medium" ...
## $ Age : int 42 65 59 55 38 78 71 67 76 76 ...
## $ Education : int 17 10 12 14 13 16 15 10 10 17 ...
## $ Urban : chr "Yes" "Yes" "Yes" "Yes" ...
## $ US : chr "Yes" "Yes" "Yes" "Yes" ...
## $ High_Sales : int 1 1 1 0 0 1 0 1 0 0 ...
# b) Split data into training and test sets
set.seed(2025)
n <- nrow(store_sales)
train_idx <- sample(1:n, size = round(0.7 * n))
store_train <- store_sales[train_idx, ]
store_test <- store_sales[-train_idx, ]
Fit a logistic regression model with High_Sales as the response variable, using all other variables as predictors. Please use the training dataset.
library(stats)  # glm() lives in the stats package, which is attached by default
store_train$High_Sales <- as.factor(store_train$High_Sales)
store_test$High_Sales <- as.factor(store_test$High_Sales)
logit_model <- glm(High_Sales ~ ., data = store_train, family = binomial)
summary(logit_model)
##
## Call:
## glm(formula = High_Sales ~ ., family = binomial, data = store_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.787364 3.390753 -2.002 0.04531 *
## CompPrice 0.201653 0.034124 5.909 3.43e-09 ***
## Income 0.034553 0.010828 3.191 0.00142 **
## Advertising 0.371516 0.075899 4.895 9.84e-07 ***
## Population -0.002566 0.001927 -1.332 0.18296
## Price -0.179454 0.026173 -6.856 7.06e-12 ***
## ShelveLocGood 9.338531 1.407041 6.637 3.20e-11 ***
## ShelveLocMedium 4.270635 0.891648 4.790 1.67e-06 ***
## Age -0.087756 0.019324 -4.541 5.59e-06 ***
## Education -0.133422 0.103935 -1.284 0.19924
## UrbanYes 0.140172 0.594386 0.236 0.81357
## USYes -1.626954 0.830909 -1.958 0.05023 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 375.21 on 279 degrees of freedom
## Residual deviance: 106.23 on 268 degrees of freedom
## AIC: 130.23
##
## Number of Fisher Scoring iterations: 7
Use the summary() function to examine your fitted model. What is the estimated coefficient for Price? Is it statistically significant? Please interpret the number using the odds ratio.
summary(logit_model)$coefficients["Price", ]
## Estimate Std. Error z value Pr(>|z|)
## -1.794535e-01 2.617334e-02 -6.856347e+00 7.064380e-12
exp(coef(logit_model)["Price"])
## Price
## 0.8357268
Comments:
The estimated coefficient for Price is -0.1795 with p-value 7.06e-12, so it is highly statistically significant. The odds ratio is exp(-0.1795) ≈ 0.836: holding all other variables constant, each one-dollar increase in price multiplies the odds of high sales by about 0.836, a roughly 16.4% decrease in the odds. Higher prices are therefore associated with lower odds of high sales.
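As an optional sanity check, the odds ratio can be turned into a percent change in odds straight from the fitted model; a minimal sketch reusing the logit_model object fitted above:
# Odds ratio for a one-dollar increase in Price
or_price <- exp(coef(logit_model)["Price"])
# Percent change in the odds of high sales per $1 price increase (about -16.4%)
(or_price - 1) * 100
# A $10 increase multiplies the odds by exp(10 * beta), roughly 0.17
exp(10 * coef(logit_model)["Price"])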
# Predict probabilities on the test set and classify with a 0.6 cutoff
test_prob <- predict(logit_model, newdata = store_test, type = "response")
# The predicted labels must match the factor levels of High_Sales ("0"/"1");
# "Yes"/"No" labels would turn every comparison below into NA
test_pred_class <- ifelse(test_prob >= 0.6, "1", "0")
test_pred_class <- factor(test_pred_class, levels = levels(store_test$High_Sales))
cm_test <- table(Predicted = test_pred_class, Actual = store_test$High_Sales)
cm_test
MR_test <- mean(test_pred_class != store_test$High_Sales)
MR_test
Comments:
MR_test is the misclassification rate: the proportion of test observations whose predicted class disagrees with the actual class. Coding the predictions as "Yes"/"No" while High_Sales has factor levels "0"/"1" makes every comparison NA, which is why the confusion matrix and MR originally came out empty and NA; matching the prediction labels to the factor levels fixes both.
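To gauge how sensitive the error rate is to the chosen cutoff, one optional check is to sweep a few thresholds; a minimal sketch reusing test_prob and the factor levels from above (the 0.4/0.5/0.6 grid is an arbitrary illustration):
# Misclassification rate at several probability cutoffs
for (cut in c(0.4, 0.5, 0.6)) {
  pred <- factor(ifelse(test_prob >= cut, "1", "0"),
                 levels = levels(store_test$High_Sales))
  cat("cutoff =", cut, " MR =", mean(pred != store_test$High_Sales), "\n")
}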
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
roc_obj <- roc(store_test$High_Sales, test_prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_obj, col = "blue", main = "ROC Curve – Logistic Regression (Test)")
auc(roc_obj)
## Area under the curve: 0.9374
Comments:
The AUC on the test set is 0.9374, indicating very strong ability to distinguish high-sales stores from the rest: a randomly chosen high-sales store receives a higher predicted probability than a randomly chosen non-high-sales store about 94% of the time.
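pROC can also report the cutoff on this curve that best balances sensitivity and specificity; a minimal sketch using the roc_obj computed above (coords() with "best" defaults to Youden's J criterion):
# Threshold maximizing sensitivity + specificity (Youden's J)
coords(roc_obj, "best", ret = c("threshold", "sensitivity", "specificity"))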
Context: You have been hired by a health insurance company to improve their pricing strategy. They want to understand which factors contribute most to high individual medical costs.
Data Dictionary:
charges (Target): Individual medical costs billed by health insurance.
age: Age of primary beneficiary.
sex: Insurance contractor gender (female, male).
bmi: Body mass index (providing an understanding of body weights that are relatively high or low relative to height).
children: Number of children covered by health insurance / number of dependents.
smoker: Smoking status (yes, no).
region: The beneficiary's residential area in the US (northeast, southeast, southwest, northwest).
a) Load the dataset insurance.csv and name it insurance. b) Split the data into training (70%) and test (30%) sets, using set.seed(2025) to ensure reproducibility. Hint: use the round() function to retain only the integer part of a number.
insurance <- read.csv("insurance.csv")
str(insurance)
## 'data.frame': 1338 obs. of 7 variables:
## $ age : int 19 18 28 33 32 31 46 37 37 60 ...
## $ sex : chr "female" "male" "male" "male" ...
## $ bmi : num 27.9 33.8 33 22.7 28.9 ...
## $ children: int 0 1 3 0 0 0 1 3 2 0 ...
## $ smoker : chr "yes" "no" "no" "no" ...
## $ region : chr "southwest" "southeast" "southeast" "northwest" ...
## $ charges : num 16885 1726 4449 21984 3867 ...
insurance$sex <- as.factor(insurance$sex)
insurance$smoker <- as.factor(insurance$smoker)
insurance$region <- as.factor(insurance$region)
# b) Split data into training and test sets
set.seed(2025)
n2 <- nrow(insurance)
train_idx2 <- sample(1:n2, size = round(0.7 * n2))
ins_train <- insurance[train_idx2, ]
ins_test <- insurance[-train_idx2, ]
Fit a regression tree with charges as the response variable and all other variables as predictors. Then visualize the tree using rpart.plot.
library(rpart)
library(rpart.plot)
tree_model <- rpart(charges ~ ., data = ins_train, method = "anova")
rpart.plot(tree_model, main = "Initial Regression Tree for Charges")
Grow a large regression tree with cp = 0.001, and then use the plotcp() function to view the complexity parameter plot. Based on this plot, what value of cp would you choose to prune the tree, and why?
big_tree <- rpart(
charges ~ .,
data = ins_train,
method = "anova",
control = rpart.control(cp = 0.001)
)
plotcp(big_tree)
printcp(big_tree)
##
## Regression tree:
## rpart(formula = charges ~ ., data = ins_train, method = "anova",
## control = rpart.control(cp = 0.001))
##
## Variables actually used in tree construction:
## [1] age bmi children region smoker
##
## Root node error: 1.3144e+11/937 = 140282620
##
## n= 937
##
## CP nsplit rel error xerror xstd
## 1 0.6365601 0 1.00000 1.00299 0.063864
## 2 0.1314088 1 0.36344 0.36608 0.022202
## 3 0.0669267 2 0.23203 0.23279 0.017152
## 4 0.0103270 3 0.16510 0.16991 0.015611
## 5 0.0100514 4 0.15478 0.16636 0.015667
## 6 0.0074812 5 0.14473 0.15817 0.015599
## 7 0.0059708 6 0.13724 0.15142 0.015665
## 8 0.0030658 7 0.13127 0.14455 0.015318
## 9 0.0022378 8 0.12821 0.14392 0.015363
## 10 0.0019148 10 0.12373 0.14334 0.015288
## 11 0.0016684 12 0.11990 0.14223 0.015084
## 12 0.0015433 13 0.11823 0.14340 0.015226
## 13 0.0013601 14 0.11669 0.14353 0.015322
## 14 0.0011696 15 0.11533 0.14525 0.015444
## 15 0.0011223 16 0.11416 0.14710 0.015615
## 16 0.0010000 17 0.11304 0.14551 0.015438
Comments:
Based on the cp table, the cross-validated error (xerror) is minimized at cp ≈ 0.00167 (12 splits, xerror = 0.14223), so I prune with the cp that achieves the minimum xerror rather than the cp = 0.001 used only to grow the large tree. This gives a good trade-off between model complexity and predictive performance; a slightly larger cp within one standard error of the minimum would yield an even simpler tree with comparable error.
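The one-standard-error rule mentioned above can be computed directly from the cp table; a minimal sketch using the big_tree object fitted above:
# Row with the minimum cross-validated error
cp_tab <- big_tree$cptable
min_row <- which.min(cp_tab[, "xerror"])
# Largest acceptable xerror: the minimum plus one standard error
threshold <- cp_tab[min_row, "xerror"] + cp_tab[min_row, "xstd"]
# Simplest tree (fewest splits) whose xerror is within that threshold
cp_1se <- cp_tab[which(cp_tab[, "xerror"] <= threshold)[1], "CP"]
cp_1se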
# Prune at the cp with the lowest cross-validated error
best_cp <- big_tree$cptable[which.min(big_tree$cptable[, "xerror"]), "CP"]
pruned_tree <- prune(big_tree, cp = best_cp)
# Compare test-set MSE of the initial and pruned trees
pred_initial <- predict(tree_model, newdata = ins_test)
pred_pruned <- predict(pruned_tree, newdata = ins_test)
MSE_initial <- mean((ins_test$charges - pred_initial)^2)
MSE_pruned <- mean((ins_test$charges - pred_pruned)^2)
MSE_initial
## [1] 30205107
MSE_pruned
## [1] 28961204
Comments:
The pruned tree has a lower test MSE (about 28.96 million vs 30.21 million), so selecting the tree size by cross-validation improved out-of-sample predictive performance relative to the default tree.
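Because squared dollars are hard to interpret, converting the MSEs to root-mean-squared error puts the comparison back on the scale of charges; a short sketch reusing the objects above:
# RMSE in dollars: typical size of a prediction error per policyholder
sqrt(MSE_initial)  # about $5,496
sqrt(MSE_pruned)   # about $5,382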
End of Exam. Please double-check that your RMD file knits successfully. Submit both the RMD and the generated HTML report.
Reminder: If a specific chunk causes an error, comment it out to allow the file to knit. Failure to submit an HTML report may result in a point deduction.