This is the R portion of your final exam.
Follow the instructions carefully and write your R code in the provided chunks. You will be graded on the correctness of your code, the quality of your analysis, and your interpretation of the results.
Submission: Please make sure the RMD is knittable and submit the RMD file along with the generated HTML report.
Troubleshooting: If you find errors in your code
that prevent the RMD file from knitting, please comment them
out (add # before the code). I will give you
partial credit based on your logic.
Good luck!
Context: You have been hired by a retail consulting firm to analyze the sales performance of a company selling child car seats. The company wants to identify the key drivers of high sales performance to optimize their marketing and store layout strategies.
They have provided you with a dataset (store_sales.csv) containing data from 400 different store locations. Your goal is to build classification models to predict whether a store will have “High Sales” (Yes) or not.
Data Dictionary:
High_Sales (Target): Factor with levels Yes and
No. Indicates if the store sold more than 8,000 units.CompPrice: Price charged by the nearest competitor at
each location.Income: Community income level (in thousands of
dollars).Advertising: Local advertising budget for the company
at each location.Population: Population size of the region (in
thousands).Price: Price charged for the car seats at each
site.ShelveLoc: A factor indicating the quality of the
shelving location for the car seats at the site (Good, Bad, or
Medium).Age: Average age of the local population.Education: Education level at each location.Urban: Factor (Yes/No) indicating if the store is in an
urban location.US: Factor (Yes/No) indicating if the store is in the
US.store_sales.csv and name it
store_sales.set.seed(2025) to ensure reproducibility.# a) Load data
# Your code here
getwd()
## [1] "C:/Users/benfa/Downloads"
store_sales <- read.csv("store_sales.csv")
str(store_sales)
## 'data.frame': 400 obs. of 11 variables:
## $ CompPrice : int 138 111 113 117 141 124 115 136 132 132 ...
## $ Income : int 73 48 35 100 64 113 105 81 110 113 ...
## $ Advertising: int 11 16 10 4 3 13 0 15 0 0 ...
## $ Population : int 276 260 269 466 340 501 45 425 108 131 ...
## $ Price : int 120 83 80 97 128 72 108 120 124 124 ...
## $ ShelveLoc : chr "Bad" "Good" "Medium" "Medium" ...
## $ Age : int 42 65 59 55 38 78 71 67 76 76 ...
## $ Education : int 17 10 12 14 13 16 15 10 10 17 ...
## $ Urban : chr "Yes" "Yes" "Yes" "Yes" ...
## $ US : chr "Yes" "Yes" "Yes" "Yes" ...
## $ High_Sales : int 1 1 1 0 0 1 0 1 0 0 ...
# b) Split data into training and test sets
set.seed(2025)
# Your code here
sample_index_ss <- sample(1:nrow(store_sales),
round(.7 * nrow(store_sales)))
store_sales_train <- store_sales[sample_index_ss, ]
store_sales_test <- store_sales[- sample_index_ss, ]
High_Sales
using all other variables as predictors. Please use the training
dataset.# Your code here
store_sales_glm <- glm(High_Sales ~ .,
data = store_sales,
family = "binomial")
summary() function to examine your fitted
model. What is the estimated coefficient for Price? Is it
statistically significant? Please interpret the number using the odds
ratio?# Your code here
summary(store_sales_glm)
##
## Call:
## glm(formula = High_Sales ~ ., family = "binomial", data = store_sales)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.0889857 2.4067745 -2.114 0.0345 *
## CompPrice 0.1724750 0.0233456 7.388 1.49e-13 ***
## Income 0.0341248 0.0079002 4.319 1.56e-05 ***
## Advertising 0.2957492 0.0512506 5.771 7.90e-09 ***
## Population -0.0009173 0.0014190 -0.646 0.5180
## Price -0.1674933 0.0196266 -8.534 < 2e-16 ***
## ShelveLocGood 8.2408876 1.0328293 7.979 1.48e-15 ***
## ShelveLocMedium 3.5970462 0.6520385 5.517 3.46e-08 ***
## Age -0.0793793 0.0142195 -5.582 2.37e-08 ***
## Education -0.0576359 0.0763170 -0.755 0.4501
## UrbanYes -0.3827861 0.4392739 -0.871 0.3835
## USYes -0.6986095 0.5949922 -1.174 0.2403
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 541.49 on 399 degrees of freedom
## Residual deviance: 181.48 on 388 degrees of freedom
## AIC: 205.48
##
## Number of Fisher Scoring iterations: 7
exp(-.1674933) - 1
## [1] -0.1542177
Comments: yes it is significant beacsue the p value of is below .05
# Your code here
pred_prob_ss_test_glm <- predict(store_sales_glm,
newdata = store_sales_test,
type = "response")
head(pred_prob_ss_test_glm)
## 1 3 7 9 11 17
## 0.21575375 0.88564177 0.04360418 0.07951665 0.70645231 0.44478259
pred_class_ss_test_glm <- as.numeric(pred_prob_ss_test_glm > .6)
# Your code here
confusion_mat_ss_test_glm <- table(actual = store_sales_test$High_Sales,
pred = pred_class_ss_test_glm)
confusion_mat_ss_test_glm
## pred
## actual 0 1
## 0 63 3
## 1 14 40
1 - sum(diag(confusion_mat_ss_test_glm))/sum(confusion_mat_ss_test_glm)
## [1] 0.1416667
Comments: this 14.46% of predictions are incorret
# Your code here
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
roc_store_sales_test_glm <- roc(store_sales_test$High_Sales,
pred_prob_ss_test_glm)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_store_sales_test_glm)
auc(roc_store_sales_test_glm)
## Area under the curve: 0.9512
Comments: An AUC of .9512 means we have a very high discriminatory curve
You have been hired by a health insurance company to improve their pricing strategy. They want to understand which factors contribute most to high individual medical costs.
Data Dictionary:
charges (Target): Individual medical costs billed by
health insurance.age: Age of primary beneficiary.sex: Insurance contractor gender (female, male).bmi: Body mass index (providing an understanding of
body weights that are relatively high or low relative to height).children: Number of children covered by health
insurance / Number of dependents.smoker: Smoking status (yes, no).region: The beneficiary’s residential area in the US
(northeast, southeast, southwest, northwest).insurance.csv and name it
insurance.set.seed(2025) to ensure reproducibility. Hint: use
round() function to retain only the integer part of a
number.# Your code here
getwd()
## [1] "C:/Users/benfa/Downloads"
insurance <- read.csv("insurance.csv")
str(insurance)
## 'data.frame': 1338 obs. of 7 variables:
## $ age : int 19 18 28 33 32 31 46 37 37 60 ...
## $ sex : chr "female" "male" "male" "male" ...
## $ bmi : num 27.9 33.8 33 22.7 28.9 ...
## $ children: int 0 1 3 0 0 0 1 3 2 0 ...
## $ smoker : chr "yes" "no" "no" "no" ...
## $ region : chr "southwest" "southeast" "southeast" "northwest" ...
## $ charges : num 16885 1726 4449 21984 3867 ...
# b) Split data into training and test sets
set.seed(2025)
# Your code here
training_index_insurnace <- sample(1 : nrow(insurance), round(0.7 * nrow(insurance)))
insurance_training <- insurance[training_index_insurnace, ]
insurance_testing <- insurance[-training_index_insurnace, ]
dim(insurance_training)
## [1] 937 7
dim(insurance_testing)
## [1] 401 7
insurance$sex <- as.factor(insurance$sex)
insurance$smoker <- as.factor(insurance$smoker)
insurance$region <- as.factor(insurance$region)
charges as response variable, and all other
variables as predictors. Then visualize the tree using
rpart.plot.# Your code here
library(rpart)
## Warning: package 'rpart' was built under R version 4.5.2
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.5.2
insurance_tree <- rpart(charges ~ .,
data = insurance_training,
method = "anova")
rpart.plot(insurance_tree)
cp = 0.001,
and then use plotcp() function to view the complexity
parameter plot. Based on this plot, what value of cp you would you
choose to prune the tree, and why?# Your code here
insurance_tree_big <- rpart(
charges ~ .,
data = insurance_training,
method = "anova",
control = rpart.control(cp = 0.001))
plotcp(insurance_tree_big)
Comments: from this chart we can conclude that the proper cp value is .012
# Your code here
insurance_tree_pruned <- rpart(charges ~ .,
data = insurance_training,
method = "anova",
control = rpart.control(cp = .012))
pred_test_insurance <- predict(insurance_tree_pruned, new_data = insurance_testing)
head(pred_test_insurance)
## 909 460 932 279 187 1290
## 12157.543 5318.788 5318.788 12157.543 5318.788 12157.543
# Your code here
pred_test_a <- predict(insurance_tree,
newdata = insurance_testing)
pred_test_b <- predict(insurance_tree_pruned,
newdata = insurance_testing)
MSE_a <- mean((insurance_testing$charges - pred_test_a)^2)
MSE_b <- mean((insurance_testing$charges - pred_test_b)^2)
MSE_a
## [1] 30205107
MSE_b
## [1] 31288947
Comments: since the initial tree mse_a is lower than mse_b, mse_a is preferred. It is a less complex tree
End of Exam. Please double-check that your RMD file knits successfully. Submit both the RMD and the generated HTML report.
Reminder: If a specific chunk causes an error, comment it out to allow the file to knit. Failure to submit an HTML report may result in a point deduction.