This is the R portion of your final exam.
Follow the instructions carefully and write your R code in the provided chunks. You will be graded on the correctness of your code, the quality of your analysis, and your interpretation of the results.
Submission: Please make sure the RMD is knittable and submit the RMD file along with the generated HTML report.
Troubleshooting: If you find errors in your code
that prevent the RMD file from knitting, please comment them
out (add # before the code). I will give you
partial credit based on your logic.
Good luck!
Context: You have been hired by a retail consulting firm to analyze the sales performance of a company selling child car seats. The company wants to identify the key drivers of high sales performance to optimize their marketing and store layout strategies.
They have provided you with a dataset (store_sales.csv) containing data from 400 different store locations. Your goal is to build classification models to predict whether a store will have “High Sales” (Yes) or not.
Data Dictionary:
High_Sales (Target): Factor with levels Yes and
No. Indicates if the store sold more than 8,000 units.CompPrice: Price charged by the nearest competitor at
each location.Income: Community income level (in thousands of
dollars).Advertising: Local advertising budget for the company
at each location.Population: Population size of the region (in
thousands).Price: Price charged for the car seats at each
site.ShelveLoc: A factor indicating the quality of the
shelving location for the car seats at the site (Good, Bad, or
Medium).Age: Average age of the local population.Education: Education level at each location.Urban: Factor (Yes/No) indicating if the store is in an
urban location.US: Factor (Yes/No) indicating if the store is in the
US.store_sales.csv and name it
store_sales.set.seed(2025) to ensure reproducibility.# a) Load data
# Your code here
store_sales <- read.csv("store_sales.csv")
str(store_sales)
## 'data.frame': 400 obs. of 11 variables:
## $ CompPrice : int 138 111 113 117 141 124 115 136 132 132 ...
## $ Income : int 73 48 35 100 64 113 105 81 110 113 ...
## $ Advertising: int 11 16 10 4 3 13 0 15 0 0 ...
## $ Population : int 276 260 269 466 340 501 45 425 108 131 ...
## $ Price : int 120 83 80 97 128 72 108 120 124 124 ...
## $ ShelveLoc : chr "Bad" "Good" "Medium" "Medium" ...
## $ Age : int 42 65 59 55 38 78 71 67 76 76 ...
## $ Education : int 17 10 12 14 13 16 15 10 10 17 ...
## $ Urban : chr "Yes" "Yes" "Yes" "Yes" ...
## $ US : chr "Yes" "Yes" "Yes" "Yes" ...
## $ High_Sales : int 1 1 1 0 0 1 0 1 0 0 ...
# b) Split data into training and test sets
set.seed(2025)
# Your code here
sample_index <- sample(1:nrow(store_sales), round(0.7 * nrow(store_sales)))
store_sales_train <- store_sales[sample_index, ]
store_sales_test <- store_sales[-sample_index, ]
High_Sales
using all other variables as predictors. Please use the training
dataset.# Your code here
store_sales_glm <- glm(High_Sales ~., data = store_sales_train, family = "binomial")
summary() function to examine your fitted
model. What is the estimated coefficient for Price? Is it
statistically significant? Please interpret the number using the odds
ratio?# Your code here
summary(store_sales_glm)
##
## Call:
## glm(formula = High_Sales ~ ., family = "binomial", data = store_sales_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.787364 3.390753 -2.002 0.04531 *
## CompPrice 0.201653 0.034124 5.909 3.43e-09 ***
## Income 0.034553 0.010828 3.191 0.00142 **
## Advertising 0.371516 0.075899 4.895 9.84e-07 ***
## Population -0.002566 0.001927 -1.332 0.18296
## Price -0.179454 0.026173 -6.856 7.06e-12 ***
## ShelveLocGood 9.338531 1.407041 6.637 3.20e-11 ***
## ShelveLocMedium 4.270635 0.891648 4.790 1.67e-06 ***
## Age -0.087756 0.019324 -4.541 5.59e-06 ***
## Education -0.133422 0.103935 -1.284 0.19924
## UrbanYes 0.140172 0.594386 0.236 0.81357
## USYes -1.626954 0.830909 -1.958 0.05023 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 375.21 on 279 degrees of freedom
## Residual deviance: 106.23 on 268 degrees of freedom
## AIC: 130.23
##
## Number of Fisher Scoring iterations: 7
Comments: The estimated coefficient is 7.06e-12, which means that it is not statistically significant because the p-value > 0.
# Your code here
pred_prob_test_glm <- predict(store_sales_glm, newdata = store_sales_test, type = "response")
pred_class_test_glm <- as.numeric(pred_prob_test_glm > 0.6)
# Your code here
table(actual = store_sales_test$High_Sales, pred = pred_class_test_glm)
## pred
## actual 0 1
## 0 63 3
## 1 16 38
Comments:The MR is 0.1583, which means that 15.83% of our predictions are misclassified.
# Your code here
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
#roc_glm <- store_sales_test$High_Sales, pred_prob_test_glm)
#plot(roc_glm)
#auc(roc_glm)
auc(store_sales_test$High_Sales, pred_prob_test_glm)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.9374
Comments:I got an error message saying that there was an unexpected ‘,’ in “roc_glm <- store_sales_test$High_Sales,”
You have been hired by a health insurance company to improve their pricing strategy. They want to understand which factors contribute most to high individual medical costs.
Data Dictionary:
charges (Target): Individual medical costs billed by
health insurance.age: Age of primary beneficiary.sex: Insurance contractor gender (female, male).bmi: Body mass index (providing an understanding of
body weights that are relatively high or low relative to height).children: Number of children covered by health
insurance / Number of dependents.smoker: Smoking status (yes, no).region: The beneficiary’s residential area in the US
(northeast, southeast, southwest, northwest).insurance.csv and name it
insurance.set.seed(2025) to ensure reproducibility. Hint: use
round() function to retain only the integer part of a
number.# Your code here
insurance <- read.csv("insurance.csv")
str(insurance)
## 'data.frame': 1338 obs. of 7 variables:
## $ age : int 19 18 28 33 32 31 46 37 37 60 ...
## $ sex : chr "female" "male" "male" "male" ...
## $ bmi : num 27.9 33.8 33 22.7 28.9 ...
## $ children: int 0 1 3 0 0 0 1 3 2 0 ...
## $ smoker : chr "yes" "no" "no" "no" ...
## $ region : chr "southwest" "southeast" "southeast" "northwest" ...
## $ charges : num 16885 1726 4449 21984 3867 ...
# b) Split data into training and test sets
set.seed(2025)
# Your code here
ins_sample_index <- sample(1:nrow(insurance), round(0.7 * nrow(insurance)))
insurance_train <- insurance[ins_sample_index, ]
insurance_test <- insurance[-ins_sample_index, ]
charges as response variable, and all other
variables as predictors. Then visualize the tree using
rpart.plot.# Your code here
library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.5.2
#insurance_rpart <- rpart(charges ~., data = insurance_train, method = "class")
#rpart.plot(insurance_rpart)
Comment: I am getting an error message saying “Error in rpart.plot(insurance_rpart) : could not find function”rpart.plot”
cp = 0.001,
and then use plotcp() function to view the complexity
parameter plot. Based on this plot, what value of cp you would you
choose to prune the tree, and why?# Your code here
Comments:
# Your code here
# Your code here
Comments:
End of Exam. Please double-check that your RMD file knits successfully. Submit both the RMD and the generated HTML report.
Reminder: If a specific chunk causes an error, comment it out to allow the file to knit. Failure to submit an HTML report may result in a point deduction.