Instructions

This is the R portion of your final exam.

Follow the instructions carefully and write your R code in the provided chunks. You will be graded on the correctness of your code, the quality of your analysis, and your interpretation of the results.

Submission: Please make sure the RMD is knittable and submit the RMD file along with the generated HTML report.

Troubleshooting: If you find errors in your code that prevent the RMD file from knitting, please comment them out (add # before the code). I will give you partial credit based on your logic.

Good luck!

Part I: Logistic Regression

0. Background

Context: You have been hired by a retail consulting firm to analyze the sales performance of a company selling child car seats. The company wants to identify the key drivers of high sales performance to optimize their marketing and store layout strategies.

They have provided you with a dataset (store_sales.csv) containing data from 400 different store locations. Your goal is to build classification models to predict whether a store will have “High Sales” (Yes) or not.

Data Dictionary:

  • High_Sales (Target): Factor with levels Yes and No. Indicates if the store sold more than 8,000 units.
  • CompPrice: Price charged by the nearest competitor at each location.
  • Income: Community income level (in thousands of dollars).
  • Advertising: Local advertising budget for the company at each location.
  • Population: Population size of the region (in thousands).
  • Price: Price charged for the car seats at each site.
  • ShelveLoc: A factor indicating the quality of the shelving location for the car seats at the site (Good, Bad, or Medium).
  • Age: Average age of the local population.
  • Education: Education level at each location.
  • Urban: Factor (Yes/No) indicating if the store is in an urban location.
  • US: Factor (Yes/No) indicating if the store is in the US.

1. Data Preparation (0.5 point)

  1. Load the data from the file store_sales.csv and name it store_sales.
  2. Split the data into training (70%) and test (30%) sets. Use set.seed(2025) to ensure reproducibility.
# a) Load data
store_sales <- read.csv("store_sales.csv")

# b) Split data into training and test sets
set.seed(2025)
training_index <- sample(1:nrow(store_sales), round(0.7 * nrow(store_sales)))

store_sales_train <- store_sales[training_index, ]
store_sales_test <- store_sales[-training_index, ]
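
As an optional sanity check on the split (with 400 stores, a 70/30 split should give about 280 training and 120 test rows):

# Check the number of rows in each split
nrow(store_sales_train)
nrow(store_sales_test)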

2. Logistic Regression (5 points)

  a. Fit a logistic regression model to predict High_Sales using all other variables as predictors. Please use the training dataset.
store_sales_glm <- glm(High_Sales ~ ., data = store_sales_train, family = "binomial")
store_sales_glm
## 
## Call:  glm(formula = High_Sales ~ ., family = "binomial", data = store_sales_train)
## 
## Coefficients:
##     (Intercept)        CompPrice           Income      Advertising  
##       -6.787364         0.201653         0.034553         0.371516  
##      Population            Price    ShelveLocGood  ShelveLocMedium  
##       -0.002566        -0.179454         9.338531         4.270635  
##             Age        Education         UrbanYes            USYes  
##       -0.087756        -0.133422         0.140172        -1.626954  
## 
## Degrees of Freedom: 279 Total (i.e. Null);  268 Residual
## Null Deviance:       375.2 
## Residual Deviance: 106.2     AIC: 130.2
  b. Use the summary() function to examine your fitted model. What is the estimated coefficient for Price? Is it statistically significant? Please interpret the number using the odds ratio.
summary(store_sales_glm)
## 
## Call:
## glm(formula = High_Sales ~ ., family = "binomial", data = store_sales_train)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -6.787364   3.390753  -2.002  0.04531 *  
## CompPrice        0.201653   0.034124   5.909 3.43e-09 ***
## Income           0.034553   0.010828   3.191  0.00142 ** 
## Advertising      0.371516   0.075899   4.895 9.84e-07 ***
## Population      -0.002566   0.001927  -1.332  0.18296    
## Price           -0.179454   0.026173  -6.856 7.06e-12 ***
## ShelveLocGood    9.338531   1.407041   6.637 3.20e-11 ***
## ShelveLocMedium  4.270635   0.891648   4.790 1.67e-06 ***
## Age             -0.087756   0.019324  -4.541 5.59e-06 ***
## Education       -0.133422   0.103935  -1.284  0.19924    
## UrbanYes         0.140172   0.594386   0.236  0.81357    
## USYes           -1.626954   0.830909  -1.958  0.05023 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 375.21  on 279  degrees of freedom
## Residual deviance: 106.23  on 268  degrees of freedom
## AIC: 130.23
## 
## Number of Fisher Scoring iterations: 7
# Percent change in the odds per one-unit increase in Price: exp(beta) - 1
exp(-0.179454) - 1
## [1] -0.1642736

Comments:

The estimated coefficient for Price is -0.179454, and it is statistically significant: its p-value (7.06e-12) is far below 0.001.

In odds-ratio terms, holding the other predictors fixed, a one-unit increase in Price multiplies the odds of High Sales by exp(-0.179454) ≈ 0.836. That is, each one-unit price increase lowers the odds of high sales by about 16.4%, which also lowers the predicted probability of high sales.
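
As an optional check, the odds ratios for all coefficients (not just Price) can be read off the fitted model by exponentiating; this minimal sketch uses the store_sales_glm object fitted above:

# Odds ratios for all predictors (exponentiated coefficients)
exp(coef(store_sales_glm))
# Profile-likelihood confidence intervals on the odds-ratio scale
exp(confint(store_sales_glm))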

  c. Generate predicted probabilities on the testing dataset. Then, obtain the predicted classes using 0.6 as the cutoff value (threshold).
pred_prob_test_glmsales <- predict(store_sales_glm, newdata = store_sales_test, type = "response")
pred_prob_test_glmsales
##            1            3            7            9           11           17 
## 1.157325e-01 8.679169e-01 3.602412e-02 6.718566e-02 5.195827e-01 4.651615e-01 
##           20           30           33           34           35           36 
## 7.428930e-01 1.044761e-02 1.668735e-02 4.724336e-01 1.345831e-04 9.970722e-01 
##           46           48           54           55           62           69 
## 4.162308e-04 9.575538e-03 1.698010e-02 2.720861e-02 1.244826e-02 9.998689e-01 
##           72           75           76           80           83           84 
## 2.229800e-01 1.460593e-01 2.101886e-01 6.159187e-01 9.953430e-01 1.005983e-03 
##           85           86           93           97          101          102 
## 9.516365e-06 3.397414e-01 9.028151e-02 9.213292e-01 8.167088e-02 4.648729e-01 
##          103          105          108          111          112          113 
## 1.259301e-02 1.215173e-03 8.615046e-01 4.432965e-01 9.223529e-02 3.766606e-01 
##          115          123          124          125          128          129 
## 6.091053e-01 1.084842e-01 6.983759e-02 9.738395e-01 2.468652e-02 9.532738e-04 
##          133          138          139          141          146          149 
## 4.917953e-01 2.742818e-03 9.710408e-01 1.671238e-01 9.789576e-01 1.179054e-02 
##          150          157          161          162          165          173 
## 9.991060e-01 1.149129e-01 1.122862e-02 1.088691e-04 1.620102e-01 9.941107e-01 
##          176          178          181          183          189          190 
## 1.304405e-01 9.881844e-01 1.381137e-03 6.038800e-03 2.606047e-02 9.978266e-01 
##          191          192          193          194          197          205 
## 9.867588e-01 2.610870e-01 1.264157e-02 9.999553e-01 4.856875e-06 9.751246e-01 
##          208          211          217          219          220          221 
## 2.315668e-03 4.409301e-05 1.077275e-02 9.877998e-01 9.942278e-01 9.947318e-01 
##          223          225          226          227          229          233 
## 1.792816e-01 6.047329e-03 6.653986e-03 1.860492e-01 6.734906e-03 9.998652e-01 
##          234          235          237          239          244          245 
## 9.697155e-01 3.550736e-01 9.792052e-01 9.946575e-03 5.976173e-01 7.761091e-01 
##          249          251          257          258          261          265 
## 8.165277e-03 2.204428e-01 3.526629e-03 1.778538e-01 7.520191e-01 1.273507e-02 
##          271          272          282          283          286          291 
## 9.993047e-01 1.270966e-03 9.918927e-01 8.312449e-01 8.157933e-01 8.692320e-01 
##          297          301          309          313          316          319 
## 9.315542e-01 2.435940e-01 8.320895e-01 1.090878e-02 1.114910e-02 6.350951e-01 
##          328          329          331          335          337          348 
## 8.175433e-02 2.227808e-04 2.010502e-03 2.404372e-02 2.340082e-03 1.836541e-01 
##          350          351          354          355          358          360 
## 9.972800e-01 9.621623e-01 8.722449e-01 3.175979e-04 8.569082e-01 3.954597e-04 
##          362          378          380          381          383          385 
## 6.907155e-01 1.094528e-01 1.011670e-02 2.784145e-01 1.469050e-01 9.995331e-01
# Classify as 1 (High Sales) when the predicted probability exceeds the 0.6 cutoff
pred_class_test_glmsales <- as.numeric(pred_prob_test_glmsales > 0.6)
pred_class_test_glmsales
##   [1] 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0 0 0 1
##  [38] 0 0 1 0 0 0 0 1 0 1 0 1 0 0 0 0 1 0 1 0 0 0 1 1 0 0 1 0 1 0 0 0 1 1 1 0 0
##  [75] 0 0 0 1 1 0 1 0 0 1 0 0 0 0 1 0 1 0 1 1 1 1 1 0 1 0 0 1 0 0 0 0 0 0 1 1 1
## [112] 0 1 0 1 0 0 0 0 1
  d. Create a confusion matrix using the testing dataset and report the Misclassification Rate (MR).
confusion_mat_test_glmsales <- table(actual = store_sales_test$High_Sales, pred = pred_class_test_glmsales)
confusion_mat_test_glmsales
##       pred
## actual  0  1
##      0 63  3
##      1 16 38
# Misclassification rate: off-diagonal counts divided by the total test count
MR_rate <- (confusion_mat_test_glmsales[1, 2] + confusion_mat_test_glmsales[2, 1]) /
  sum(confusion_mat_test_glmsales)
MR_rate
## [1] 0.1583333

Comments: The misclassification rate on the test set is 15.83%; at the 0.6 cutoff, about one in six test-set stores is assigned to the wrong class.
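
For context, the same confusion matrix also gives accuracy, sensitivity, and specificity. This optional sketch assumes the layout printed above (rows = actual, columns = predicted, with 1 = High Sales):

# Accuracy, sensitivity (true-positive rate), and specificity (true-negative rate)
accuracy    <- (63 + 38) / sum(confusion_mat_test_glmsales)  # equals 1 - MR_rate
sensitivity <- 38 / (16 + 38)  # High-Sales stores correctly identified
specificity <- 63 / (63 + 3)   # non-High-Sales stores correctly identified
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)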

  e. Draw an ROC curve and calculate the AUC (Area Under the Curve) for this logistic regression model on the testing dataset. Do you think the prediction performance is acceptable?
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
roc_glm <- roc(store_sales_test$High_Sales, pred_prob_test_glmsales)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_glm)

auc(roc_glm)
## Area under the curve: 0.9374

Comments: The AUC is 0.9374, meaning that about 94% of the time a randomly chosen High-Sales store receives a higher predicted probability than a randomly chosen non-High-Sales store. Since this is far above 0.5 (random guessing) and close to 1, the prediction performance on the test set is clearly acceptable.
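
If a data-driven cutoff were wanted instead of the fixed 0.6, pROC can suggest one from the ROC curve. This optional sketch uses the roc_glm object above, with Youden's J statistic (sensitivity + specificity - 1) as the selection rule:

# Threshold maximizing Youden's J, with its sensitivity and specificity
coords(roc_glm, "best", best.method = "youden")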

Part II: Regression Tree

0. Background

You have been hired by a health insurance company to improve their pricing strategy. They want to understand which factors contribute most to high individual medical costs.

Data Dictionary:

  • charges (Target): Individual medical costs billed by health insurance.
  • age: Age of primary beneficiary.
  • sex: Insurance contractor gender (female, male).
  • bmi: Body mass index (a measure of body weight relative to height).
  • children: Number of children covered by health insurance / Number of dependents.
  • smoker: Smoking status (yes, no).
  • region: The beneficiary’s residential area in the US (northeast, southeast, southwest, northwest).

1. Data Preparation (0.5 point)

  1. Load the data from the file insurance.csv and name it insurance.
  2. Split the data into training (70%) and test (30%) sets. Use set.seed(2025) to ensure reproducibility. Hint: use round() function to retain only the integer part of a number.
# a) Load data
insurance <- read.csv("insurance.csv")

# b) Split data into training and test sets
set.seed(2025)
training_index1 <- sample(1:nrow(insurance), round(0.7 * nrow(insurance)))

insurance_train <- insurance[training_index1, ]
insurance_test <- insurance[-training_index1, ]

2. Regression Tree (4 points)

  a. Fit a regression tree based on the training data, using charges as the response variable and all other variables as predictors. Then visualize the tree using rpart.plot.
library(rpart)
library(rpart.plot)
str(insurance)
## 'data.frame':    1338 obs. of  7 variables:
##  $ age     : int  19 18 28 33 32 31 46 37 37 60 ...
##  $ sex     : chr  "female" "male" "male" "male" ...
##  $ bmi     : num  27.9 33.8 33 22.7 28.9 ...
##  $ children: int  0 1 3 0 0 0 1 3 2 0 ...
##  $ smoker  : chr  "yes" "no" "no" "no" ...
##  $ region  : chr  "southwest" "southeast" "southeast" "northwest" ...
##  $ charges : num  16885 1726 4449 21984 3867 ...
insurance_rpart <- rpart(charges ~ ., data = insurance_train, method = "anova")
rpart.plot(insurance_rpart)
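
Optionally, the fitted rpart object also stores variable importance scores, which summarize numerically which predictors drive the splits shown in the plot:

# Variable importance for the initial tree (larger = more influential in the splits)
insurance_rpart$variable.importance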

  b. Pruning: Fit a large regression tree with cp = 0.001, and then use the plotcp() function to view the complexity parameter plot. Based on this plot, what value of cp would you choose to prune the tree, and why?
insurance_tree_large <- rpart(charges ~ ., data = insurance_train,
                              method = "anova", cp = 0.001)
rpart.plot(insurance_tree_large, extra = 1, digits = 3)

plotcp(insurance_tree_large)

Comments: I would choose cp ≈ 0.094: on the complexity parameter plot, the cross-validated relative error has essentially stopped decreasing by that point, so a smaller cp (a larger tree) would add complexity without meaningfully improving the cross-validated error.
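
As an alternative to reading the plot by eye, the cross-validation results behind plotcp() are stored in the model's cptable, so a candidate cp can also be chosen programmatically. This optional sketch applies one common rule, picking the cp with the smallest cross-validated error:

# Pick the cp value minimizing the cross-validated error (xerror)
cp_table <- insurance_tree_large$cptable
cp_table[which.min(cp_table[, "xerror"]), "CP"]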

  c. Refit the regression tree using the cp selected in question (b). Then obtain the predictions on the testing data.
insurance_tree_pruned <- rpart(charges ~ ., data = insurance_train,
                               method = "anova", cp = 0.094)
# For an anova (regression) tree, predict() returns numeric predictions directly;
# type = "prob" applies only to classification trees
pred_test_pruned <- predict(insurance_tree_pruned, newdata = insurance_test)
  d. Obtain the out-of-sample MSE (Mean Squared Error) for both the initial tree model in (a) and the pruned tree model in (c). Which model is preferred for this data, and why?
# Out-of-sample MSE: mean squared gap between actual and predicted charges
pred_test_initial <- predict(insurance_rpart, newdata = insurance_test)
MSE_initial <- mean((insurance_test$charges - pred_test_initial)^2)
MSE_pruned <- mean((insurance_test$charges - pred_test_pruned)^2)
c(initial = MSE_initial, pruned = MSE_pruned)

Comments: The model with the lower out-of-sample MSE is preferred; here, the initial tree from (a) achieves the lower test MSE and is preferred for this data.
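
Because the MSE is in squared dollars, reporting the RMSE (same units as charges) is often easier to communicate. This optional follow-up uses the MSE objects computed above:

# RMSE: typical size of the prediction error, in dollars
sqrt(c(initial = MSE_initial, pruned = MSE_pruned))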


End of Exam. Please double-check that your RMD file knits successfully. Submit both the RMD and the generated HTML report.

Reminder: If a specific chunk causes an error, comment it out to allow the file to knit. Failure to submit an HTML report may result in a point deduction.