This is the R portion of your final exam.
Follow the instructions carefully and write your R code in the provided chunks. You will be graded on the correctness of your code, the quality of your analysis, and your interpretation of the results.
Submission: Please make sure the RMD is knittable and submit the RMD file along with the generated HTML report.
Troubleshooting: If you find errors in your code
that prevent the RMD file from knitting, please comment them
out (add # before the code). I will give you
partial credit based on your logic.
Good luck!
Context: You have been hired by a retail consulting firm to analyze the sales performance of a company selling child car seats. The company wants to identify the key drivers of high sales performance to optimize their marketing and store layout strategies.
They have provided you with a dataset (store_sales.csv) containing data from 400 different store locations. Your goal is to build classification models to predict whether a store will have “High Sales” (Yes) or not.
Data Dictionary:
High_Sales (Target): Factor with levels Yes and
No. Indicates if the store sold more than 8,000 units.CompPrice: Price charged by the nearest competitor at
each location.Income: Community income level (in thousands of
dollars).Advertising: Local advertising budget for the company
at each location.Population: Population size of the region (in
thousands).Price: Price charged for the car seats at each
site.ShelveLoc: A factor indicating the quality of the
shelving location for the car seats at the site (Good, Bad, or
Medium).Age: Average age of the local population.Education: Education level at each location.Urban: Factor (Yes/No) indicating if the store is in an
urban location.US: Factor (Yes/No) indicating if the store is in the
US.store_sales.csv and name it
store_sales.set.seed(2025) to ensure reproducibility.# a) Load data
# Your code here
store_sales <- read.csv("store_sales.csv")
# b) Split data into training and test sets
set.seed(2025)
# Your code here
sample_size <- floor(0.7 * nrow(store_sales))
train_indices <- sample(1:nrow(store_sales), sample_size)
train_data <- store_sales[train_indices, ]
test_data <- store_sales[-train_indices, ]
High_Sales
using all other variables as predictors. Please use the training
dataset.# Your code here
model <- glm(High_Sales ~ ., data = train_data, family = "binomial")
summary() function to examine your fitted
model. What is the estimated coefficient for Price? Is it
statistically significant? Please interpret the number using the odds
ratio?# Your code here
summary(model)
##
## Call:
## glm(formula = High_Sales ~ ., family = "binomial", data = train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.787364 3.390753 -2.002 0.04531 *
## CompPrice 0.201653 0.034124 5.909 3.43e-09 ***
## Income 0.034553 0.010828 3.191 0.00142 **
## Advertising 0.371516 0.075899 4.895 9.84e-07 ***
## Population -0.002566 0.001927 -1.332 0.18296
## Price -0.179454 0.026173 -6.856 7.06e-12 ***
## ShelveLocGood 9.338531 1.407041 6.637 3.20e-11 ***
## ShelveLocMedium 4.270635 0.891648 4.790 1.67e-06 ***
## Age -0.087756 0.019324 -4.541 5.59e-06 ***
## Education -0.133422 0.103935 -1.284 0.19924
## UrbanYes 0.140172 0.594386 0.236 0.81357
## USYes -1.626954 0.830909 -1.958 0.05023 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 375.21 on 279 degrees of freedom
## Residual deviance: 106.23 on 268 degrees of freedom
## AIC: 130.23
##
## Number of Fisher Scoring iterations: 7
Comments:Price coefficient = -0.179454, p-value = 7.06e-12 (significant). Odds ratio = 0.836. Each $1 price increase reduces the odds of high sales by 16.4%
# Your code here
pred_probs <- predict(model, newdata = test_data, type = "response")
pred_classes <- ifelse(pred_probs >= 0.6, 1, 0)
# Your code here
conf_matrix <- table(Predicted = pred_classes, Actual = test_data$High_Sales)
conf_matrix
## Actual
## Predicted 0 1
## 0 63 16
## 1 3 38
misclass_rate <- mean(pred_classes != test_data$High_Sales)
misclass_rate
## [1] 0.1583333
Comments: Confusion matrix: 63 True Negatives, 38 True Positives, 16 False Negatives, 3 False Positives. Misclassification Rate = 15.8%. Model is 84.2% accurate
# Your code here
order_index <- order(pred_probs, decreasing = TRUE)
actual_sorted <- test_data$High_Sales[order_index]
tpr <- cumsum(actual_sorted == 1) / sum(actual_sorted == 1)
fpr <- cumsum(actual_sorted == 0) / sum(actual_sorted == 0)
plot(fpr, tpr, type = "l", main = "ROC Curve")
abline(0, 1, lty = 2)
sum(diff(fpr) * (tpr[-1] + tpr[-length(tpr)]) / 2)
## [1] 0.9374299
Comments: ROC curve is well above the diagonal line, showing good performance. AUC = 0.937 (above 0.8 means good predictive performance)
You have been hired by a health insurance company to improve their pricing strategy. They want to understand which factors contribute most to high individual medical costs.
Data Dictionary:
charges (Target): Individual medical costs billed by
health insurance.age: Age of primary beneficiary.sex: Insurance contractor gender (female, male).bmi: Body mass index (providing an understanding of
body weights that are relatively high or low relative to height).children: Number of children covered by health
insurance / Number of dependents.smoker: Smoking status (yes, no).region: The beneficiary’s residential area in the US
(northeast, southeast, southwest, northwest).insurance.csv and name it
insurance.set.seed(2025) to ensure reproducibility. Hint: use
round() function to retain only the integer part of a
number.# Your code here
insurance <- read.csv("insurance.csv")
# b) Split data into training and test sets
set.seed(2025)
# Your code here
train_size <- round(0.7 * nrow(insurance))
train_indices <- sample(1:nrow(insurance), train_size)
train_data <- insurance[train_indices, ]
test_data <- insurance[-train_indices, ]
charges as response variable, and all other
variables as predictors. Then visualize the tree using
rpart.plot.# Your code here
library(rpart)
library(rpart.plot)
tree_model <- rpart(charges ~ ., data = train_data)
rpart.plot(tree_model)
cp = 0.001,
and then use plotcp() function to view the complexity
parameter plot. Based on this plot, what value of cp you would you
choose to prune the tree, and why?# Your code here
large_tree <- rpart(charges ~ ., data = train_data, cp = 0.001)
plotcp(large_tree)
Comments: I choose cp = 0.01 because that’s where the error line flattens out.This prevents overfitting while keeping good accuracy.
# Your code here
final_tree <- rpart(charges ~ ., data = train_data, cp = 0.01)
predictions <- predict(final_tree, newdata = test_data)
# d) Obtain the out-of-sample MSE (Mean Squared Error) for both the initial tree model in (a) and the pruned tree model in (c). Which model is preferred for this data, and why?
#```{r}
# Your code here
pred_initial <- predict(tree_model, newdata = test_data)
mse_initial <- mean((test_data$charges - pred_initial)^2)
mse_initial
## [1] 30205107
pred_pruned <- predict(final_tree, newdata = test_data)
mse_pruned <- mean((test_data$charges - pred_pruned)^2)
mse_pruned
## [1] 30205107
Comments: - Initial tree MSE = 30205107 - Pruned tree MSE = 30205107 Both models have the same MSE, so either model is acceptable.The pruned tree is slightly preferred because it’s simpler (fewer splits) while maintaining the same accuracy.
End of Exam. Please double-check that your RMD file knits successfully. Submit both the RMD and the generated HTML report.
Reminder: If a specific chunk causes an error, comment it out to allow the file to knit. Failure to submit an HTML report may result in a point deduction.