This is the R portion of your final exam.
Follow the instructions carefully and write your R code in the provided chunks. You will be graded on the correctness of your code, the quality of your analysis, and your interpretation of the results.
Submission: Please make sure the RMD is knittable and submit the RMD file along with the generated HTML report.
Troubleshooting: If you find errors in your code
that prevent the RMD file from knitting, please comment them
out (add # before the code). I will give you
partial credit based on your logic.
Good luck!
Context: You have been hired by a retail consulting firm to analyze the sales performance of a company selling child car seats. The company wants to identify the key drivers of high sales performance to optimize their marketing and store layout strategies.
They have provided you with a dataset (store_sales.csv) containing data from 400 different store locations. Your goal is to build classification models to predict whether a store will have “High Sales” (Yes) or not.
Data Dictionary:
High_Sales (Target): Factor with levels Yes and
No. Indicates if the store sold more than 8,000 units.CompPrice: Price charged by the nearest competitor at
each location.Income: Community income level (in thousands of
dollars).Advertising: Local advertising budget for the company
at each location.Population: Population size of the region (in
thousands).Price: Price charged for the car seats at each
site.ShelveLoc: A factor indicating the quality of the
shelving location for the car seats at the site (Good, Bad, or
Medium).Age: Average age of the local population.Education: Education level at each location.Urban: Factor (Yes/No) indicating if the store is in an
urban location.US: Factor (Yes/No) indicating if the store is in the
US.store_sales.csv and name it
store_sales.set.seed(2025) to ensure reproducibility.# a) Load data
store_sales <- read.csv("store_sales.csv")
# b) Split data into training and test sets
set.seed(2025)
training_index <- sample(1 : nrow(store_sales), round(0.7 * nrow(store_sales)))
store_sales_train <- store_sales[training_index, ]
store_sales_test <- store_sales[-training_index, ]
High_Sales
using all other variables as predictors. Please use the training
dataset.store_sales_glm <- glm(High_Sales ~ .,data = store_sales_train, family = "binomial")
store_sales_glm
##
## Call: glm(formula = High_Sales ~ ., family = "binomial", data = store_sales_train)
##
## Coefficients:
## (Intercept) CompPrice Income Advertising
## -6.787364 0.201653 0.034553 0.371516
## Population Price ShelveLocGood ShelveLocMedium
## -0.002566 -0.179454 9.338531 4.270635
## Age Education UrbanYes USYes
## -0.087756 -0.133422 0.140172 -1.626954
##
## Degrees of Freedom: 279 Total (i.e. Null); 268 Residual
## Null Deviance: 375.2
## Residual Deviance: 106.2 AIC: 130.2
summary() function to examine your fitted
model. What is the estimated coefficient for Price? Is it
statistically significant? Please interpret the number using the odds
ratio?summary(store_sales_glm)
##
## Call:
## glm(formula = High_Sales ~ ., family = "binomial", data = store_sales_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.787364 3.390753 -2.002 0.04531 *
## CompPrice 0.201653 0.034124 5.909 3.43e-09 ***
## Income 0.034553 0.010828 3.191 0.00142 **
## Advertising 0.371516 0.075899 4.895 9.84e-07 ***
## Population -0.002566 0.001927 -1.332 0.18296
## Price -0.179454 0.026173 -6.856 7.06e-12 ***
## ShelveLocGood 9.338531 1.407041 6.637 3.20e-11 ***
## ShelveLocMedium 4.270635 0.891648 4.790 1.67e-06 ***
## Age -0.087756 0.019324 -4.541 5.59e-06 ***
## Education -0.133422 0.103935 -1.284 0.19924
## UrbanYes 0.140172 0.594386 0.236 0.81357
## USYes -1.626954 0.830909 -1.958 0.05023 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 375.21 on 279 degrees of freedom
## Residual deviance: 106.23 on 268 degrees of freedom
## AIC: 130.23
##
## Number of Fisher Scoring iterations: 7
exp(-0.179454) - 1
## [1] -0.1642736
Comments:
The estimated coefficient for “price” is -0.179454 and is absolutely statistically significant because its p-value is lower than .001.
Estimated coefficient for variable “price” decreases the odds of being a student by 16% which means it would also decrease the probability.
pred_prob_test_glmsales <- predict(store_sales_glm, newdata = store_sales_test, type = "response")
pred_prob_test_glmsales
## 1 3 7 9 11 17
## 1.157325e-01 8.679169e-01 3.602412e-02 6.718566e-02 5.195827e-01 4.651615e-01
## 20 30 33 34 35 36
## 7.428930e-01 1.044761e-02 1.668735e-02 4.724336e-01 1.345831e-04 9.970722e-01
## 46 48 54 55 62 69
## 4.162308e-04 9.575538e-03 1.698010e-02 2.720861e-02 1.244826e-02 9.998689e-01
## 72 75 76 80 83 84
## 2.229800e-01 1.460593e-01 2.101886e-01 6.159187e-01 9.953430e-01 1.005983e-03
## 85 86 93 97 101 102
## 9.516365e-06 3.397414e-01 9.028151e-02 9.213292e-01 8.167088e-02 4.648729e-01
## 103 105 108 111 112 113
## 1.259301e-02 1.215173e-03 8.615046e-01 4.432965e-01 9.223529e-02 3.766606e-01
## 115 123 124 125 128 129
## 6.091053e-01 1.084842e-01 6.983759e-02 9.738395e-01 2.468652e-02 9.532738e-04
## 133 138 139 141 146 149
## 4.917953e-01 2.742818e-03 9.710408e-01 1.671238e-01 9.789576e-01 1.179054e-02
## 150 157 161 162 165 173
## 9.991060e-01 1.149129e-01 1.122862e-02 1.088691e-04 1.620102e-01 9.941107e-01
## 176 178 181 183 189 190
## 1.304405e-01 9.881844e-01 1.381137e-03 6.038800e-03 2.606047e-02 9.978266e-01
## 191 192 193 194 197 205
## 9.867588e-01 2.610870e-01 1.264157e-02 9.999553e-01 4.856875e-06 9.751246e-01
## 208 211 217 219 220 221
## 2.315668e-03 4.409301e-05 1.077275e-02 9.877998e-01 9.942278e-01 9.947318e-01
## 223 225 226 227 229 233
## 1.792816e-01 6.047329e-03 6.653986e-03 1.860492e-01 6.734906e-03 9.998652e-01
## 234 235 237 239 244 245
## 9.697155e-01 3.550736e-01 9.792052e-01 9.946575e-03 5.976173e-01 7.761091e-01
## 249 251 257 258 261 265
## 8.165277e-03 2.204428e-01 3.526629e-03 1.778538e-01 7.520191e-01 1.273507e-02
## 271 272 282 283 286 291
## 9.993047e-01 1.270966e-03 9.918927e-01 8.312449e-01 8.157933e-01 8.692320e-01
## 297 301 309 313 316 319
## 9.315542e-01 2.435940e-01 8.320895e-01 1.090878e-02 1.114910e-02 6.350951e-01
## 328 329 331 335 337 348
## 8.175433e-02 2.227808e-04 2.010502e-03 2.404372e-02 2.340082e-03 1.836541e-01
## 350 351 354 355 358 360
## 9.972800e-01 9.621623e-01 8.722449e-01 3.175979e-04 8.569082e-01 3.954597e-04
## 362 378 380 381 383 385
## 6.907155e-01 1.094528e-01 1.011670e-02 2.784145e-01 1.469050e-01 9.995331e-01
pred_class_test_glmsales <- as.numeric(pred_prob_test_glmsales > 0.6)
pred_class_test_glmsales
## [1] 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0 0 0 1
## [38] 0 0 1 0 0 0 0 1 0 1 0 1 0 0 0 0 1 0 1 0 0 0 1 1 0 0 1 0 1 0 0 0 1 1 1 0 0
## [75] 0 0 0 1 1 0 1 0 0 1 0 0 0 0 1 0 1 0 1 1 1 1 1 0 1 0 0 1 0 0 0 0 0 0 1 1 1
## [112] 0 1 0 1 0 0 0 0 1
confusion_mat_test_glmsales <- table(actual = store_sales_test$High_Sales, pred = pred_class_test_glmsales)
confusion_mat_test_glmsales
## pred
## actual 0 1
## 0 63 3
## 1 16 38
MR_rate <- (3+16)/(63+3+16+38)
MR_rate
## [1] 0.1583333
Comments: 15.83% of the time results are misclassified
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
roc_glm <- roc(store_sales_test$High_Sales, pred_prob_test_glmsales)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_glm)
auc(roc_glm)
## Area under the curve: 0.9374
Comments: AUC reports 0.9374 which means around 94% of the model is explained by our data and our analysis…it is a very good model and accounts for all the necessary parts
You have been hired by a health insurance company to improve their pricing strategy. They want to understand which factors contribute most to high individual medical costs.
Data Dictionary:
charges (Target): Individual medical costs billed by
health insurance.age: Age of primary beneficiary.sex: Insurance contractor gender (female, male).bmi: Body mass index (providing an understanding of
body weights that are relatively high or low relative to height).children: Number of children covered by health
insurance / Number of dependents.smoker: Smoking status (yes, no).region: The beneficiary’s residential area in the US
(northeast, southeast, southwest, northwest).insurance.csv and name it
insurance.set.seed(2025) to ensure reproducibility. Hint: use
round() function to retain only the integer part of a
number.insurance <- read.csv("insurance.csv")
# b) Split data into training and test sets
set.seed(2025)
training_index1 <- sample(1 : nrow(insurance), round(0.7 * nrow(insurance)))
insurance_train <- insurance[training_index1, ]
insurance_test <- insurance[-training_index1, ]
charges as response variable, and all other
variables as predictors. Then visualize the tree using
rpart.plot.library(rpart)
library(rpart.plot)
str(insurance)
## 'data.frame': 1338 obs. of 7 variables:
## $ age : int 19 18 28 33 32 31 46 37 37 60 ...
## $ sex : chr "female" "male" "male" "male" ...
## $ bmi : num 27.9 33.8 33 22.7 28.9 ...
## $ children: int 0 1 3 0 0 0 1 3 2 0 ...
## $ smoker : chr "yes" "no" "no" "no" ...
## $ region : chr "southwest" "southeast" "southeast" "northwest" ...
## $ charges : num 16885 1726 4449 21984 3867 ...
insurance_rpart <- rpart(charges ~ ., data = insurance_train, method = "anova")
rpart.plot(insurance_rpart)
cp = 0.001,
and then use plotcp() function to view the complexity
parameter plot. Based on this plot, what value of cp you would you
choose to prune the tree, and why?insurance_tree_large <- rpart(charges ~. , data = insurance_train,
method = "anova", cp = 0.001)
rpart.plot(insurance_tree_large, extra = 1, digits = 3)
plotcp(insurance_tree_large)
Comments: I would choose the cp value of .094 because it is at the decline of the graph
insurance_tree_large <- rpart(charges ~. , data = insurance_train,
method = "anova", cp = 0.094)
# pred_prob_test1 <- predict(insurance_tree_large, newdata = insurance_test, type = "prob")
MSE_model <- mean((insurance_tree_large$residuals)^2)
MSE_model
## [1] NaN
Comments: The first model is preferred
End of Exam. Please double-check that your RMD file knits successfully. Submit both the RMD and the generated HTML report.
Reminder: If a specific chunk causes an error, comment it out to allow the file to knit. Failure to submit an HTML report may result in a point deduction.