This is the R portion of your final exam.
Follow the instructions carefully and write your R code in the provided chunks. You will be graded on the correctness of your code, the quality of your analysis, and your interpretation of the results.
Submission: Please make sure the RMD is knittable and submit the RMD file along with the generated HTML report.
Troubleshooting: If you find errors in your code
that prevent the RMD file from knitting, please comment them
out (add # before the code). I will give you
partial credit based on your logic.
Good luck!
Context: You have been hired by a retail consulting firm to analyze the sales performance of a company selling child car seats. The company wants to identify the key drivers of high sales performance to optimize their marketing and store layout strategies.
They have provided you with a dataset (store_sales.csv) containing data from 400 different store locations. Your goal is to build classification models to predict whether a store will have “High Sales” (Yes) or not.
Data Dictionary:
High_Sales (Target): Factor with levels Yes and
No. Indicates if the store sold more than 8,000 units.CompPrice: Price charged by the nearest competitor at
each location.Income: Community income level (in thousands of
dollars).Advertising: Local advertising budget for the company
at each location.Population: Population size of the region (in
thousands).Price: Price charged for the car seats at each
site.ShelveLoc: A factor indicating the quality of the
shelving location for the car seats at the site (Good, Bad, or
Medium).Age: Average age of the local population.Education: Education level at each location.Urban: Factor (Yes/No) indicating if the store is in an
urban location.US: Factor (Yes/No) indicating if the store is in the
US.store_sales.csv and name it
store_sales.set.seed(2025) to ensure reproducibility.# a) Load data
# Your code here
store_sales <- read.csv("store_sales.csv")
dim(store_sales)
## [1] 400 11
# b) Split data into training and test sets
set.seed(2025)
# Your code here
training <- sample(1:400, size = 0.7 * 400)
sales_train <- store_sales[training, ]
sales_test <- store_sales[-training, ]
High_Sales
using all other variables as predictors. Please use the training
dataset.# Your code here
sales_log_model <- glm(High_Sales ~ ., data = sales_train, family = binomial)
summary() function to examine your fitted
model. What is the estimated coefficient for Price? Is it
statistically significant? Please interpret the number using the odds
ratio?# Your code here
summary(sales_log_model)
##
## Call:
## glm(formula = High_Sales ~ ., family = binomial, data = sales_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.787364 3.390753 -2.002 0.04531 *
## CompPrice 0.201653 0.034124 5.909 3.43e-09 ***
## Income 0.034553 0.010828 3.191 0.00142 **
## Advertising 0.371516 0.075899 4.895 9.84e-07 ***
## Population -0.002566 0.001927 -1.332 0.18296
## Price -0.179454 0.026173 -6.856 7.06e-12 ***
## ShelveLocGood 9.338531 1.407041 6.637 3.20e-11 ***
## ShelveLocMedium 4.270635 0.891648 4.790 1.67e-06 ***
## Age -0.087756 0.019324 -4.541 5.59e-06 ***
## Education -0.133422 0.103935 -1.284 0.19924
## UrbanYes 0.140172 0.594386 0.236 0.81357
## USYes -1.626954 0.830909 -1.958 0.05023 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 375.21 on 279 degrees of freedom
## Residual deviance: 106.23 on 268 degrees of freedom
## AIC: 130.23
##
## Number of Fisher Scoring iterations: 7
exp(-0.179454)
## [1] 0.8357264
Comments: The negative and statistically significant coefficient for price tells us higher prices reduce the odds of achieving higher sales
# Your code here
pred_prob <- predict(sales_log_model, newdata = sales_test, type = "response")
pred_classes <- ifelse(pred_prob >= 0.6, 1, 0)
# Your code here
confusion_matrix <- table(predicted = pred_classes, Actual = sales_test$High_Sales)
confusion_matrix
## Actual
## predicted 0 1
## 0 63 16
## 1 3 38
MR <- 1 - sum(diag(confusion_matrix)) / sum(confusion_matrix)
MR
## [1] 0.1583333
Comments: About 15.83% of the observations from the testing data were misclassified when using a 0.6 cutoff.
# Your code here
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
pred_probs <- predict(sales_log_model, newdata = sales_test, type = "response")
roc_sales <- roc(sales_test$High_Sales, pred_probs)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_sales, main = "ROC Curve for Logistic Regression")
auc_sales <- auc(roc_sales)
auc_sales
## Area under the curve: 0.9374
Comments: AUC is .9374 which means The model is strong and adding the MR of 15.8% to it tells us the model’s predictions are highly acceptable
You have been hired by a health insurance company to improve their pricing strategy. They want to understand which factors contribute most to high individual medical costs.
Data Dictionary:
charges (Target): Individual medical costs billed by
health insurance.age: Age of primary beneficiary.sex: Insurance contractor gender (female, male).bmi: Body mass index (providing an understanding of
body weights that are relatively high or low relative to height).children: Number of children covered by health
insurance / Number of dependents.smoker: Smoking status (yes, no).region: The beneficiary’s residential area in the US
(northeast, southeast, southwest, northwest).insurance.csv and name it
insurance.set.seed(2025) to ensure reproducibility. Hint: use
round() function to retain only the integer part of a
number.# Your code here
insurance <- read.csv("insurance.csv")
dim(insurance)
## [1] 1338 7
# b) Split data into training and test sets
set.seed(2025)
# Your code here
training_insurance <- sample(1:1338, size = round(0.7 * 1338))
insurance_train <- insurance[training_insurance, ]
insurance_test <- insurance[-training_insurance, ]
charges as response variable, and all other
variables as predictors. Then visualize the tree using
rpart.plot.# Your code here
library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.5.2
insurance_tree <- rpart(charges~ ., data = insurance_train, method = "anova")
rpart.plot(insurance_tree, main = "Regression Tree for Insurance Charges")
b) Pruning: Fit a large regression tree with
cp = 0.001,
and then use plotcp() function to view the complexity
parameter plot. Based on this plot, what value of cp you would you
choose to prune the tree, and why?
# Your code here
large_regression_tree <- rpart(charges ~ ., data = insurance_train, method = "anova", control = rpart.control(cp = 0.001))
plotcp(large_regression_tree)
Comments: I would choose to prune using cp = 0.0035 because it keeps the tree small and simple.
# Your code here
pruned_tree <- rpart(charges ~ ., data = insurance_train, method = "anova", control = rpart.control(cp = 0.0035))
rpart.plot(pruned_tree, main = "Refitted Regression Tree for Insurance")
insurance_test_predictions <- predict(pruned_tree, newdata = insurance_test)
head(insurance_test_predictions)
## 4 6 7 9 12 14
## 6193.401 6193.401 9852.798 6193.401 25294.248 13833.721
# Your code here
pred_initial <- predict(insurance_tree, newdata = insurance_test)
pred_pruned <- predict(pruned_tree, newdata = insurance_test)
mse_initial <- mean((insurance_test$charges - pred_initial)^2)
mse_pruned <- mean((insurance_test$charges - pred_pruned)^2)
mse_initial
## [1] 30205107
mse_pruned
## [1] 28572965
Comments: Initial = 24,817,484 and the pruned = 21,385,831 which means the pruned tree is preferred as it has a lower MSE giving us better predictions on average
End of Exam. Please double-check that your RMD file knits successfully. Submit both the RMD and the generated HTML report.
Reminder: If a specific chunk causes an error, comment it out to allow the file to knit. Failure to submit an HTML report may result in a point deduction.