Homework 5

RMI 4226 Risk and Insurance Data Analytics – Spring 2025

Objective: This exercise helps students practice fitting, interpreting, and evaluating simple linear regression models.

Load necessary libraries

library(dplyr)
library(readr)
library(caret)
library(ggplot2)
library(stargazer)

1. Import the medical_insurance_expense.csv data, remove the duplicated observations, create dummy variables for the character variables, and save the result as “medical_data”.

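medical_data <- read_csv("medical_insurance_expense.csv")
medical_data <- distinct(medical_data)  # drop duplicated rows
medical_data <- medical_data %>%
  mutate(across(where(is.character), ~ as.factor(.))) %>%
  # One 0/1 indicator per factor, equal to 1 for the factor's second level.
  # A factor with more than two levels therefore gets only this single indicator.
  mutate(across(where(is.factor), ~ as.numeric(. == levels(.)[2]), .names = "dummy_{col}"))
head(medical_data)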

2. Find the number of observations (N), mean, standard deviation, min, quartiles and max for all numerical/integer variables: age, bmi, children, charges, and the dummy variables created in question 1. Save it as a data frame called summary_statistics.

numeric_cols <- medical_data %>%
  select(age, bmi, children, charges, starts_with("dummy_"))
summary_statistics <- numeric_cols %>%
  summarise(across(everything(), list(
    N = ~ n(),
    mean = ~ mean(.),
    sd = ~ sd(.),
    min = ~ min(.),
    Q1 = ~ quantile(., 0.25),
    median = ~ median(.),
    Q3 = ~ quantile(., 0.75),
    max = ~ max(.)
  )))
summary_statistics

3. Split medical_data into training and test data sets with a 70-30 split and save them as train_strat and test_strat. Set your seed to 123. When splitting the data, make sure each data set has the same proportion of smokers versus non-smokers. Run a simple linear regression of charges on bmi using train_strat, save it as “model_SL”, and summarize the output.

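set.seed(123)
# Stratify on smoker so both sets keep the same smoker/non-smoker proportion.
train_index <- createDataPartition(medical_data$smoker, p = 0.7, list = FALSE)
train_index <- as.vector(train_index)  # plain vector for row-indexing a tibble
train_strat <- medical_data[train_index, ]
test_strat <- medical_data[-train_index, ]
model_SL <- lm(charges ~ bmi, data = train_strat)
summary(model_SL)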
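One way to confirm that the split is stratified is to compare the smoker proportions in the two sets. The sketch below uses simulated stand-in data (the data frame sim and its 20% smoker share are assumptions made for illustration, not values from the course file); with the real data, the same two prop.table() calls would be applied to train_strat$smoker and test_strat$smoker.

```r
library(caret)  # provides createDataPartition()

# Hypothetical stand-in data with roughly 20% smokers.
set.seed(123)
sim <- data.frame(
  smoker = factor(sample(c("no", "yes"), size = 1000, replace = TRUE,
                         prob = c(0.8, 0.2)))
)

# Same stratified 70-30 split as in the homework, applied to the stand-in data.
idx <- as.vector(createDataPartition(sim$smoker, p = 0.7, list = FALSE))
train_sim <- sim[idx, , drop = FALSE]
test_sim <- sim[-idx, , drop = FALSE]

# Share of each smoker level in the two sets; the rows should be nearly equal.
prop_train <- prop.table(table(train_sim$smoker))
prop_test <- prop.table(table(test_sim$smoker))
rbind(train = prop_train, test = prop_test)
```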

Regression formula

charges = β0 + β1 * bmi + ε
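For reference, the ordinary least squares estimates that lm() reports for this model have a standard closed form. In the sketch below, the index i and the sample size n are notation introduced here, and the bars denote sample means over the training data.

```latex
\hat{\beta}_1
  = \frac{\sum_{i=1}^{n} \left(\mathrm{bmi}_i - \overline{\mathrm{bmi}}\right)
          \left(\mathrm{charges}_i - \overline{\mathrm{charges}}\right)}
         {\sum_{i=1}^{n} \left(\mathrm{bmi}_i - \overline{\mathrm{bmi}}\right)^2},
\qquad
\hat{\beta}_0 = \overline{\mathrm{charges}} - \hat{\beta}_1 \, \overline{\mathrm{bmi}}
```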

4. Plot charges against bmi, together with the fitted line and its 95% confidence band.

ggplot(train_strat, aes(x = bmi, y = charges)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE, color = "blue")  # se = TRUE draws the band at the default 95% level

5. Can you write down the estimated linear regression equation? How do you interpret the parameters and their statistics, as well as the overall model evaluation?

coef(model_SL)

6. Test the hypothesis that medical insurance charges are positively associated with the policyholder’s bmi. Do the training data support this hypothesis, and why?

summary(model_SL)$coefficients
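The p-value printed by summary() is two-sided, while the hypothesis in question 6 is one-sided (β1 > 0). A minimal sketch of the conversion follows; it runs on simulated stand-in data (sim and fit are hypothetical names, and the simulated coefficients are arbitrary), so with the real data fit would be replaced by model_SL.

```r
# Hypothetical stand-in data; the real analysis would use train_strat.
set.seed(1)
sim <- data.frame(bmi = rnorm(200, mean = 30, sd = 6))
sim$charges <- 1000 + 400 * sim$bmi + rnorm(200, sd = 10000)

fit <- lm(charges ~ bmi, data = sim)
coefs <- summary(fit)$coefficients

t_bmi <- coefs["bmi", "t value"]         # t statistic for H0: beta1 = 0
p_two_sided <- coefs["bmi", "Pr(>|t|)"]  # two-sided p-value from summary()

# One-sided p-value for H1: beta1 > 0, using the residual degrees of freedom.
p_one_sided <- pt(t_bmi, df = df.residual(fit), lower.tail = FALSE)

c(t = t_bmi, two_sided = p_two_sided, one_sided = p_one_sided)
```

When the estimated slope is positive, the one-sided p-value is half the two-sided one; when it is negative, the data cannot support a positive association regardless of the two-sided p-value.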

7. Find the 95% confidence intervals for the model parameters (approximately \(\hat{\beta}_1 \pm 2 \cdot SE(\hat{\beta}_1)\)).

confint(model_SL, level = 0.95)
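The "plus or minus 2 standard errors" rule is an approximation of what confint() computes, which uses the exact t quantile for the residual degrees of freedom. The sketch below compares the two on simulated stand-in data (sim and fit are hypothetical names used only for illustration).

```r
# Hypothetical stand-in data; the real analysis would use train_strat.
set.seed(1)
sim <- data.frame(bmi = rnorm(200, mean = 30, sd = 6))
sim$charges <- 1000 + 400 * sim$bmi + rnorm(200, sd = 10000)
fit <- lm(charges ~ bmi, data = sim)

est <- coef(fit)
se <- summary(fit)$coefficients[, "Std. Error"]
t_crit <- qt(0.975, df = df.residual(fit))  # exact multiplier, close to 2

# Interval built by hand with the exact t quantile.
ci_exact <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
# Interval built with the rule-of-thumb multiplier of 2.
ci_rule_of_2 <- cbind(lower = est - 2 * se, upper = est + 2 * se)

ci_exact
ci_rule_of_2
confint(fit, level = 0.95)  # should match ci_exact
```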

8. Plot a histogram of the residuals. How do you evaluate your data based on it?

residuals_SL <- residuals(model_SL)
hist(residuals_SL, main = "Histogram of Residuals", xlab = "Residuals", col = "lightblue")

9. Predict the expected charges, add them to train_strat and test_strat, and find the mean squared error for each data set.

train_strat$predicted_charges <- predict(model_SL, newdata = train_strat)
test_strat$predicted_charges <- predict(model_SL, newdata = test_strat)
mse_train <- mean((train_strat$charges - train_strat$predicted_charges)^2)
mse_test <- mean((test_strat$charges - test_strat$predicted_charges)^2)
mse_train
mse_test
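Written out, the quantity computed for each data set is the average squared prediction error. In the sketch below, n is the number of rows in the data set being evaluated (notation introduced here), and the coefficients come from the training fit in both cases, so the test value measures out-of-sample error.

```latex
\mathrm{MSE}
  = \frac{1}{n} \sum_{i=1}^{n}
    \left(\mathrm{charges}_i - \widehat{\mathrm{charges}}_i\right)^2,
\qquad
\widehat{\mathrm{charges}}_i = \hat{\beta}_0 + \hat{\beta}_1 \, \mathrm{bmi}_i
```

Because mean() divides by n, the training value is slightly smaller than the squared residual standard error reported by summary(), which divides by n - 2.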

10. Check the linear relationship between bmi and charges in test_strat by plotting the data with a fitted line and its 95% confidence band.

# Note: geom_smooth() fits a new regression line to test_strat; it does not draw model_SL.
ggplot(test_strat, aes(x = bmi, y = charges)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE, color = "red")

11. Predict the expected medical charges for a policyholder with a bmi of 30 using the estimated simple linear regression model.

predict(model_SL, newdata = data.frame(bmi = 30))
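The prediction is the estimated equation evaluated at bmi = 30, and predict() can also return an interval around it. The sketch below illustrates both on simulated stand-in data (sim and fit are hypothetical names); with the real data, fit would be replaced by model_SL.

```r
# Hypothetical stand-in data; the real analysis would use train_strat.
set.seed(1)
sim <- data.frame(bmi = rnorm(200, mean = 30, sd = 6))
sim$charges <- 1000 + 400 * sim$bmi + rnorm(200, sd = 10000)
fit <- lm(charges ~ bmi, data = sim)

new_obs <- data.frame(bmi = 30)

# Plug bmi = 30 into the estimated equation by hand.
by_hand <- unname(coef(fit)["(Intercept)"] + coef(fit)["bmi"] * 30)
from_predict <- unname(predict(fit, newdata = new_obs))

# 95% interval for the mean charges at bmi = 30.
predict(fit, newdata = new_obs, interval = "confidence", level = 0.95)
# 95% interval for one individual policyholder's charges (wider).
predict(fit, newdata = new_obs, interval = "prediction", level = 0.95)
```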