HW17

helped by 111078513

insurance <- read.csv("C:/R-language/BACS/insurance.csv")

Question 1) Create some explanatory models to learn more about charges:

1-a) Create an OLS regression model and report which factors are significantly related to charges

insu_lm <- lm(charges ~ age + sex + bmi + children + smoker + region, data = insurance)
summary(insu_lm)

## 
## Call:
## lm(formula = charges ~ age + sex + bmi + children + smoker + 
##     region, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11304.9  -2848.1   -982.1   1393.9  29992.8 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11938.5      987.8 -12.086  < 2e-16 ***
## age                256.9       11.9  21.587  < 2e-16 ***
## sexmale           -131.3      332.9  -0.394 0.693348    
## bmi                339.2       28.6  11.860  < 2e-16 ***
## children           475.5      137.8   3.451 0.000577 ***
## smokeryes        23848.5      413.1  57.723  < 2e-16 ***
## regionnorthwest   -353.0      476.3  -0.741 0.458769    
## regionsoutheast  -1035.0      478.7  -2.162 0.030782 *  
## regionsouthwest   -960.0      477.9  -2.009 0.044765 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7494 
## F-statistic: 500.8 on 8 and 1329 DF,  p-value: < 2.2e-16

1-b) Create a decision tree (specifically, a regression tree) with default parameters to rpart().

1-b-i) Plot a visual representation of the tree structure

library(rpart)

## Warning: 套件 'rpart' 是用 R 版本 4.2.3 來建造的

library(rpart.plot)

## Warning: 套件 'rpart.plot' 是用 R 版本 4.2.3 來建造的

insu_rpart <- rpart(charges ~ age + sex + bmi + children + smoker + region, data = insurance)
rpart.plot(insu_rpart)

1-b-ii) How deep is the tree (see nodes with “decisions” – ignore the leaves at the bottom)

According to the plot, it is noticed that the depth of the tree equals to two.

1-b-iii) How many leaf groups does it suggest to bin the data into?

According to the plot, it suggests to bin the data into four leaf groups.

1-b-iv) What conditions (combination of decisions) describe each leaf group?

rules <- rpart.rules(insu_rpart)
print(rules)

##  charges                                           
##     5399 when smoker is  no & age <  43            
##    12300 when smoker is  no & age >= 43            
##    21369 when smoker is yes             & bmi <  30
##    41693 when smoker is yes             & bmi >= 30

Question 2) Let’s use LOOCV to see how how our models perform predictively overall

mse <- function(errs) mean(errs^2)
mse_in <- function(model) mse(residuals(model))
mse_out <- function(actual, predicted) mse(actual - predicted)

fold_i_pe <- function(i, k, model, dataset, outcome) {
  folds <- cut(1:nrow(dataset), breaks=k, labels=FALSE)
  test_indices <- which(folds==i)
  test_set <- dataset[test_indices, ]
  train_set <- dataset[-test_indices, ]
  trained_model <- update(model, data = train_set)
  predictions <- predict(trained_model, test_set)
  dataset[test_indices, outcome] - predictions
}

k_fold_mse <- function(model, dataset, outcome, k=nrow(dataset)) {
  shuffled_indicies <- sample(1:nrow(dataset))
  dataset <- dataset[shuffled_indicies,]
  fold_pred_errors <- sapply(1:k, \(kth) {
    fold_i_pe(kth, k, model, dataset, outcome)
  })
  pred_errors <- unlist(fold_pred_errors)
  mse(pred_errors)
}

2-a) What is the RMSEout for the OLS regression model?

RMSE_out_lm <- sqrt(k_fold_mse(insu_lm, insurance, "charges"))
cat("The RMSEout for the OLS regression model is", RMSE_out_lm)

## The RMSEout for the OLS regression model is 6087.388

2-b) What is the RMSEout for the decision tree model?

RMSE_out_rpart <- sqrt(k_fold_mse(insu_rpart, insurance,"charges"))
cat("The RMSEout for the decision tree model is", RMSE_out_rpart)

## The RMSEout for the decision tree model is 5135.175

Question 3) Let’s see if bagging helps our models

For bagging and boosting, we will only use split-sample testing to save time: partition the data to create training and test sets using an 80:20 split. Use the regression model and decision tree you created in Question 1 for bagging and boosting.

shuffled_indicies <- sample(1:nrow(insurance))[1:(nrow(insurance)*0.8)]
train_set <- insurance[shuffled_indicies, ]
test_set <- insurance[-shuffled_indicies, ]

3-a) Implement the bagged_learn(…) and bagged_predict(…) functions using the hints in the class notes and help from your classmates on Teams. Feel free to share your code on Teams to get feedback, or ask others for help.

bagged_learn <- function(model, dataset, b=100) {
  lapply(1:b, \(i) {
    n = nrow(dataset)
    train_set <- dataset[sample(1:n, n, replace = TRUE),]
    update(model, data = train_set)
  })
}
bagged_predict <- function(bagged_models, new_data) {
  predictions <- lapply(bagged_models, \(model){
    predict(model, new_data)
  }) # get b predictions of new_data
  as.data.frame(predictions) |> apply(X= _,1,mean) # apply a mean over the columns of predictions
}
RMSE_out <- function(actuals,preds){
  sqrt(mean( (actuals - preds) ^ 2))
}

3-b) What is the RMSEout for the bagged OLS regression?

RMSE_out_baglm <- bagged_learn(insu_lm, train_set) |> bagged_predict(test_set) |>
  RMSE_out(test_set$charges,preds = _)
cat("The RMSEout for the bagged OLS regression",RMSE_out_baglm)

## The RMSEout for the bagged OLS regression 5558.484

3-c) What is the RMSEout for the bagged decision tree?

RMSE_out_bagrpart <- bagged_learn(insu_rpart, train_set) |> bagged_predict(test_set) |>
  RMSE_out(test_set$charges,preds = _)
cat("The RMSEout for the bagged OLS regression",RMSE_out_bagrpart)

## The RMSEout for the bagged OLS regression 4451.156

Question 4) Let’s see if boosting helps our models. You can use a learning rate of 0.1 and adjust it if you find a better rate.

4-a) Write boosted_learn(…) and boosted_predict(…) functions using the hints in the class notes and help from your classmates on Teams.

boost_learn <- function(model, dataset, outcome, n=100, rate=0.1) {
  predictors <- dataset[, !(colnames(dataset)== outcome)] # get data frame of only predictor variables
  # Initialize residuals and models
  res <- dataset[, c(outcome)] # set res to vector of actuals (y) to start
  models <- list()
  for (i in 1:n) {
    this_model <- update(model, data = cbind(predictors,charges = res))
    res <- res - rate*predict(this_model,predictors)  # update residuals with learning rate
    models[[i]] <- this_model # Store model
  }
  list(models=models, rate=rate)
}

boost_predict <- function(boosted_learning, new_data) {
  boosted_models <- boosted_learning$models
  rate <- boosted_learning$rate
  n <- nrow(new_data)
  predictions <- lapply(boosted_models,\(model){predict(model,new_data)}) # get predictions of new_data from each model
  pred_frame <- as.data.frame(predictions) |> unname()
  apply(pred_frame, 1, \(predict){0.1*sum(predict)}) # apply a sum over the columns of predictions, weighted by learning rate
}

4-b) What is the RMSEout for the boosted OLS regression?

RMSE_out_bootlm <- boost_learn(insu_lm, train_set, outcome="charges", n=1000) |>
  boost_predict(test_set) |>
  RMSE_out(test_set$charges, preds = _)
cat("The RMSEout for the boosted OLS regression is",RMSE_out_bootlm)

## The RMSEout for the boosted OLS regression is 5561.18

4-c) What is the RMSEout for the boosted decision tree?

RMSE_out_bootrpart <- boost_learn(insu_rpart, train_set, outcome = "charges", n=1000) |>
  boost_predict(test_set) |>
  RMSE_out(test_set$charges, pred = _)
cat("The RMSEout for the boosted decision tree is",RMSE_out_bootrpart)

## The RMSEout for the boosted decision tree is 4103.057

Question 5) Let’s engineer the best predictive decision trees. Let’s repeat the bagging and boosting decision tree several times to see what kind of base tree helps us learn the fastest. But this time, split the data 70:20:10 — use 70% for training, 20% for fine-tuning, and use the last 10% to report the final RMSEout.

shuffled_indicies <- sample(1:nrow(insurance))
n = nrow(insurance)
train_set <- insurance[shuffled_indicies[1:(n*0.7)], ]
tuning_set <- insurance[shuffled_indicies[(n*0.7):(n*0.9)], ]
test_set <- insurance[shuffled_indicies[(n*0.9):n], ]

5-a) Repeat the bagging of the decision tree, using a base tree of maximum depth 1, 2, … n, keep training on the 70% training set while the RMSEout of your 20% set keeps dropping; stop when the RMSEout has started increasing again (show prediction error at each depth). Report the final RMSEout using the final 10% of the data as your test set.

repeat_bagging <- function(model, train_set, tuning_set){
  this_model <- update(model, control = rpart.control(maxdepth = 1))
  rmse_out_rep <- bagged_learn(this_model, train_set, b=100) |>
    bagged_predict(tuning_set) |>
    RMSE_out(tuning_set$charges, preds = _)
  k <- 1
  print(paste("--- k =", k, "---"))
  print(rmse_out_rep)
  while(TRUE){
    k <- k + 1
    print(paste("--- k =", k, "---"))
    control_text <- paste("update(this_model, control = rpart.control(maxdepth =",
                          k, "))")
    temp_model <- eval(parse(text = control_text))
    rmse_out_temp <- bagged_learn(temp_model, train_set, b=100) |>
      bagged_predict(tuning_set) |>
      RMSE_out(tuning_set$charges, preds = _)
    if (rmse_out_temp > rmse_out_rep){
      print(paste(rmse_out_temp, "(Stop)"))
      result = list(model = this_model, k = k-1)
      return(result)
    }
    rmse_out_rep <- rmse_out_temp
    this_model <- temp_model
    print(rmse_out_rep)
  }
}
best_tree_bagging <- repeat_bagging(insu_rpart, train_set, tuning_set)

## [1] "--- k = 1 ---"
## [1] 7754.462
## [1] "--- k = 2 ---"
## [1] 5127.367
## [1] "--- k = 3 ---"
## [1] 4874.015
## [1] "--- k = 4 ---"
## [1] 4817.473
## [1] "--- k = 5 ---"
## [1] "4859.24071940965 (Stop)"

RMSE_out_repeat_bagged <- bagged_learn(best_tree_bagging$model, train_set, b=100) |>
  bagged_predict(test_set) |>
  RMSE_out(test_set$charges, preds = _)
print(paste("The RMSE_out for the Repeat Bagged decision tree is ",
            round(RMSE_out_repeat_bagged, 2),
            ", where max depth = ", best_tree_bagging$k,
            sep = ""))

## [1] "The RMSE_out for the Repeat Bagged decision tree is 5237.44, where max depth = 4"

5-b) Repeat the boosting of the decision tree, using a base tree of maximum depth 1, 2, … n, keep training on the 70% training set while the RMSEout of your 20% set keeps dropping; stop when the RMSEout has started increasing again (show prediction error at each depth). Report the final RMSEout using the final 10% of the data as your test set.

repeat_boosting <- function(model, train_set, tuning_set, maxdepth = 1){
  this_model <- update(model, control = rpart.control(maxdepth = 1))
  rmse_out_rep <- boost_learn(this_model, train_set, outcome="charges", n=1000) |>
    boost_predict(tuning_set) |>
    RMSE_out(tuning_set$charges, preds = _)
  k <- 1
  print(paste("--- k =", k, "---"))
  print(rmse_out_rep)
  while(TRUE){
    k <- k + 1
    print(paste("--- k =", k, "---"))
    control_text <- paste("update(this_model, control = rpart.control(maxdepth =",
                          k, "))")
    temp_model <- eval(parse(text = control_text))
    rmse_out_temp <- boost_learn(temp_model, train_set, outcome="charges", n=1000) |>
      boost_predict(tuning_set) |>
      RMSE_out(tuning_set$charges, preds = _)
    if (rmse_out_temp >= rmse_out_rep){
      print(paste(rmse_out_temp, "(Stop)"))
      result = list(model = this_model, k = k-1)
      return(result)
    }
    rmse_out_rep <- rmse_out_temp
    this_model <- temp_model
    print(rmse_out_rep)
  }
}
best_tree_boosting <- repeat_boosting(insu_rpart, train_set, tuning_set)

## [1] "--- k = 1 ---"
## [1] 6413.797
## [1] "--- k = 2 ---"
## [1] 4578.489
## [1] "--- k = 3 ---"
## [1] 4488.325
## [1] "--- k = 4 ---"
## [1] "4499.68781636166 (Stop)"

RMSE_out_repeat_boost <-
  boost_learn(best_tree_boosting$model, train_set, outcome="charges", n=1000) |>
boost_predict(test_set) |>
  RMSE_out(test_set$charges, preds = _)
print(paste("The RMSE_out for the Repeat Boosted decision tree is ",
            round(RMSE_out_repeat_boost, 2),
            ", where max depth = ", best_tree_boosting$k, sep = ""))

## [1] "The RMSE_out for the Repeat Boosted decision tree is 5075.44, where max depth = 3"

HW17

111078517

2023-06-11

helped by 111078513

Question 1) Create some explanatory models to learn more about charges:

1-b) Create a decision tree (specifically, a regression tree) with default parameters to rpart().

1-b-i) Plot a visual representation of the tree structure

1-b-ii) How deep is the tree (see nodes with “decisions” – ignore the leaves at the bottom)

According to the plot, it is noticed that the depth of the tree equals to two.

1-b-iii) How many leaf groups does it suggest to bin the data into?

According to the plot, it suggests to bin the data into four leaf groups.

1-b-iv) What conditions (combination of decisions) describe each leaf group?

Question 2) Let’s use LOOCV to see how how our models perform predictively overall

2-a) What is the RMSEout for the OLS regression model?

2-b) What is the RMSEout for the decision tree model?

Question 3) Let’s see if bagging helps our models

For bagging and boosting, we will only use split-sample testing to save time: partition the data to create training and test sets using an 80:20 split. Use the regression model and decision tree you created in Question 1 for bagging and boosting.

3-b) What is the RMSEout for the bagged OLS regression?

3-c) What is the RMSEout for the bagged decision tree?

Question 4) Let’s see if boosting helps our models. You can use a learning rate of 0.1 and adjust it if you find a better rate.

4-a) Write boosted_learn(…) and boosted_predict(…) functions using the hints in the class notes and help from your classmates on Teams.

4-b) What is the RMSEout for the boosted OLS regression?

4-c) What is the RMSEout for the boosted decision tree?

HW17

111078517

2023-06-11

helped by 111078513

Question 1) Create some explanatory models to learn more about charges:

1-a) Create an OLS regression model and report which factors are significantly related to charges

According to the results, age,bmi,children,smoker-yes(compared to smoker-no),region northeast, region southwest(compared to northeast) are significantly related to charges.

1-b) Create a decision tree (specifically, a regression tree) with default parameters to rpart().

1-b-i) Plot a visual representation of the tree structure

1-b-ii) How deep is the tree (see nodes with “decisions” – ignore the leaves at the bottom)

According to the plot, it is noticed that the depth of the tree equals to two.

1-b-iii) How many leaf groups does it suggest to bin the data into?

According to the plot, it suggests to bin the data into four leaf groups.

1-b-iv) What conditions (combination of decisions) describe each leaf group?

Question 2) Let’s use LOOCV to see how how our models perform predictively overall

2-a) What is the RMSEout for the OLS regression model?

2-b) What is the RMSEout for the decision tree model?

Question 3) Let’s see if bagging helps our models

For bagging and boosting, we will only use split-sample testing to save time: partition the data to create training and test sets using an 80:20 split. Use the regression model and decision tree you created in Question 1 for bagging and boosting.

3-a) Implement the bagged_learn(…) and bagged_predict(…) functions using the hints in the class notes and help from your classmates on Teams. Feel free to share your code on Teams to get feedback, or ask others for help.

3-b) What is the RMSEout for the bagged OLS regression?

3-c) What is the RMSEout for the bagged decision tree?

Question 4) Let’s see if boosting helps our models. You can use a learning rate of 0.1 and adjust it if you find a better rate.

4-a) Write boosted_learn(…) and boosted_predict(…) functions using the hints in the class notes and help from your classmates on Teams.

4-b) What is the RMSEout for the boosted OLS regression?

4-c) What is the RMSEout for the boosted decision tree?