This is the R portion of your final exam.
Follow the instructions carefully and write your R code in the provided chunks. You will be graded on the correctness of your code, the quality of your analysis, and your interpretation of the results.
Submission: Please make sure the RMD is knittable and submit the RMD file along with the generated HTML report.
Troubleshooting: If you find errors in your code
that prevent the RMD file from knitting, please comment them
out (add # before the code). I will give you
partial credit based on your logic.
Good luck!
Context: You have been hired by a retail consulting firm to analyze the sales performance of a company selling child car seats. The company wants to identify the key drivers of high sales performance to optimize their marketing and store layout strategies.
They have provided you with a dataset (store_sales.csv) containing data from 400 different store locations. Your goal is to build classification models to predict whether a store will have “High Sales” (Yes) or not.
Data Dictionary:
High_Sales (Target): Factor with levels Yes and
No. Indicates if the store sold more than 8,000 units.CompPrice: Price charged by the nearest competitor at
each location.Income: Community income level (in thousands of
dollars).Advertising: Local advertising budget for the company
at each location.Population: Population size of the region (in
thousands).Price: Price charged for the car seats at each
site.ShelveLoc: A factor indicating the quality of the
shelving location for the car seats at the site (Good, Bad, or
Medium).Age: Average age of the local population.Education: Education level at each location.Urban: Factor (Yes/No) indicating if the store is in an
urban location.US: Factor (Yes/No) indicating if the store is in the
US.store_sales.csv and name it
store_sales.set.seed(2025) to ensure reproducibility.# a) Load data
store_sales <- read.csv("store_sales.csv")
# b) Split data into training and test sets
set.seed(2025)
n <- nrow(store_sales)
train_index <- sample(1:n, size = 0.7 * n)
store_train <- store_sales[train_index, ]
store_test <- store_sales[-train_index, ]
High_Sales
using all other variables as predictors. Please use the training
dataset.logit_model <- glm(High_Sales ~ .,
data = store_train,
family = binomial)
summary(logit_model)
##
## Call:
## glm(formula = High_Sales ~ ., family = binomial, data = store_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.787364 3.390753 -2.002 0.04531 *
## CompPrice 0.201653 0.034124 5.909 3.43e-09 ***
## Income 0.034553 0.010828 3.191 0.00142 **
## Advertising 0.371516 0.075899 4.895 9.84e-07 ***
## Population -0.002566 0.001927 -1.332 0.18296
## Price -0.179454 0.026173 -6.856 7.06e-12 ***
## ShelveLocGood 9.338531 1.407041 6.637 3.20e-11 ***
## ShelveLocMedium 4.270635 0.891648 4.790 1.67e-06 ***
## Age -0.087756 0.019324 -4.541 5.59e-06 ***
## Education -0.133422 0.103935 -1.284 0.19924
## UrbanYes 0.140172 0.594386 0.236 0.81357
## USYes -1.626954 0.830909 -1.958 0.05023 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 375.21 on 279 degrees of freedom
## Residual deviance: 106.23 on 268 degrees of freedom
## AIC: 130.23
##
## Number of Fisher Scoring iterations: 7
summary() function to examine your fitted
model. What is the estimated coefficient for Price? Is it
statistically significant? Please interpret the number using the odds
ratio?summary(logit_model)
##
## Call:
## glm(formula = High_Sales ~ ., family = binomial, data = store_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.787364 3.390753 -2.002 0.04531 *
## CompPrice 0.201653 0.034124 5.909 3.43e-09 ***
## Income 0.034553 0.010828 3.191 0.00142 **
## Advertising 0.371516 0.075899 4.895 9.84e-07 ***
## Population -0.002566 0.001927 -1.332 0.18296
## Price -0.179454 0.026173 -6.856 7.06e-12 ***
## ShelveLocGood 9.338531 1.407041 6.637 3.20e-11 ***
## ShelveLocMedium 4.270635 0.891648 4.790 1.67e-06 ***
## Age -0.087756 0.019324 -4.541 5.59e-06 ***
## Education -0.133422 0.103935 -1.284 0.19924
## UrbanYes 0.140172 0.594386 0.236 0.81357
## USYes -1.626954 0.830909 -1.958 0.05023 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 375.21 on 279 degrees of freedom
## Residual deviance: 106.23 on 268 degrees of freedom
## AIC: 130.23
##
## Number of Fisher Scoring iterations: 7
Comments:For every 1-unit increase in Price, the odds of a store having High Sales drop by about 16%
prob_test <- predict(logit_model, newdata = store_test, type = "response")
pred_class_0.6 <- ifelse(prob_test > 0.6, "Yes", "No")
pred_class_0.6 <- factor(pred_class_0.6, levels = c("No", "Yes"))
conf_mat <- table(Predicted = pred_class_0.6, Actual = store_test$High_Sales)
conf_mat
## Actual
## Predicted 0 1
## No 63 16
## Yes 3 38
misclass_rate <- mean(pred_class_0.6 != store_test$High_Sales)
misclass_rate
## [1] 1
Comments:Model has most predictions right but still misclassified some stores, especially ones with high sales.
library(pROC)
## Warning: package 'pROC' was built under R version 4.5.2
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
roc_obj <- roc(store_test$High_Sales, prob_test)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_obj, main = "ROC Curve for Logistic Regression")
auc_value <- auc(roc_obj)
auc_value
## Area under the curve: 0.9374
Comments:The ROC curve is above diagonal. This means that there is strong accuracy. The model is high in sensitivity and specificity, which is a good separation between classes.
You have been hired by a health insurance company to improve their pricing strategy. They want to understand which factors contribute most to high individual medical costs.
Data Dictionary:
charges (Target): Individual medical costs billed by
health insurance.age: Age of primary beneficiary.sex: Insurance contractor gender (female, male).bmi: Body mass index (providing an understanding of
body weights that are relatively high or low relative to height).children: Number of children covered by health
insurance / Number of dependents.smoker: Smoking status (yes, no).region: The beneficiary’s residential area in the US
(northeast, southeast, southwest, northwest).insurance.csv and name it
insurance.set.seed(2025) to ensure reproducibility. Hint: use
round() function to retain only the integer part of a
number.# a) Load the data
insurance <- read.csv("insurance.csv")
# b) Split data into training and test sets
set.seed(2025)
n <- nrow(insurance)
train_n <- round(0.7 * n)
train_index <- sample(1:n, size = train_n)
insurance_train <- insurance[train_index, ]
insurance_test <- insurance[-train_index, ]
charges as response variable, and all other
variables as predictors. Then visualize the tree using
rpart.plot.library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.5.2
insurance_tree <- rpart(charges ~ .,
data = insurance_train,
method = "anova")
rpart.plot(insurance_tree)
cp = 0.001,
and then use plotcp() function to view the complexity
parameter plot. Based on this plot, what value of cp you would you
choose to prune the tree, and why?large_tree <- rpart(charges ~ .,
data = insurance_train,
method = "anova",
control = rpart.control(cp = 0.001))
plotcp(large_tree)
printcp(large_tree)
##
## Regression tree:
## rpart(formula = charges ~ ., data = insurance_train, method = "anova",
## control = rpart.control(cp = 0.001))
##
## Variables actually used in tree construction:
## [1] age bmi children region smoker
##
## Root node error: 1.3144e+11/937 = 140282620
##
## n= 937
##
## CP nsplit rel error xerror xstd
## 1 0.6365601 0 1.00000 1.00299 0.063864
## 2 0.1314088 1 0.36344 0.36608 0.022202
## 3 0.0669267 2 0.23203 0.23279 0.017152
## 4 0.0103270 3 0.16510 0.16991 0.015611
## 5 0.0100514 4 0.15478 0.16636 0.015667
## 6 0.0074812 5 0.14473 0.15817 0.015599
## 7 0.0059708 6 0.13724 0.15142 0.015665
## 8 0.0030658 7 0.13127 0.14455 0.015318
## 9 0.0022378 8 0.12821 0.14392 0.015363
## 10 0.0019148 10 0.12373 0.14334 0.015288
## 11 0.0016684 12 0.11990 0.14223 0.015084
## 12 0.0015433 13 0.11823 0.14340 0.015226
## 13 0.0013601 14 0.11669 0.14353 0.015322
## 14 0.0011696 15 0.11533 0.14525 0.015444
## 15 0.0011223 16 0.11416 0.14710 0.015615
## 16 0.0010000 17 0.11304 0.14551 0.015438
Comments:I would choose cp ≈ 0.0026 because it is where the cross-validated error is the smallest and where the line flattens making it a more simple tree while still being accurate.
pruned_tree <- prune(large_tree, cp = 0.0026)
tree_predictions <- predict(pruned_tree, newdata = insurance_test)
pred_initial <- predict(insurance_tree, newdata = insurance_test)
mse_initial <- mean((insurance_test$charges - pred_initial)^2)
mse_initial
## [1] 30205107
pred_pruned <- predict(pruned_tree, newdata = insurance_test)
mse_pruned <- mean((insurance_test$charges - pred_pruned)^2)
mse_pruned
## [1] 28791386
Comments: The pruned tree has the lower MSE, so it performs better on new data and is the preferred model.
End of Exam. Please double-check that your RMD file knits successfully. Submit both the RMD and the generated HTML report.
Reminder: If a specific chunk causes an error, comment it out to allow the file to knit. Failure to submit an HTML report may result in a point deduction.