Homework 5
RMI 4226 Risk and Insurance Data Analytics – Spring 2025
Objective: This exercise helps students practice using simple linear regression models.
Load necessary libraries
library(dplyr)
library(readr)
library(caret)
library(ggplot2)
library(stargazer)
1. Import the medical_insurance_expense.csv data, remove the duplicated observations, create dummy variables for the character variables, and save the result as “medical_data”.
medical_data <- read_csv("medical_insurance_expense.csv")
medical_data <- distinct(medical_data)  # drop duplicated rows
# Convert character columns to factors, then add one 0/1 column per factor that flags
# its second level (a factor with more than two levels gets only this single indicator)
medical_data <- medical_data %>%
  mutate(across(where(is.character), ~ as.factor(.))) %>%
  mutate(across(where(is.factor), ~ as.numeric(. == levels(.)[2]), .names = "dummy_{col}"))
head(medical_data)
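To see what the dummy-coding expression in question 1 does, here is a minimal sketch on two small hand-made factors. The values of `smoker` and `region` below are hypothetical stand-ins, not rows from the homework file, and `model.matrix()` is shown only as one possible way to get a full set of indicators when a factor has more than two levels.

```r
# Hypothetical toy factors (not taken from medical_insurance_expense.csv)
smoker <- factor(c("no", "yes", "no", "yes"))
region <- factor(c("northeast", "northwest", "southeast", "southwest"))

# Factor levels are sorted alphabetically, so the second level of smoker is "yes"
smoker_dummy <- as.numeric(smoker == levels(smoker)[2])
smoker_dummy

# With four levels, the same expression flags only the second level ("northwest")
region_dummy <- as.numeric(region == levels(region)[2])
region_dummy

# One alternative: model.matrix() builds an indicator for every non-reference level
region_full <- model.matrix(~ region)[, -1]
region_full
```

For a two-level variable the single indicator carries all the information; for a multi-level variable it collapses the remaining levels into the zero group.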
2. Find the number of observations (N), mean, standard deviation, min, quartiles and max for all numerical/integer variables: age, bmi, children, charges, and the dummy variables created in question 1. Save it as a data frame called summary_statistics.
numeric_cols <- medical_data %>%
  select(age, bmi, children, charges, starts_with("dummy_"))
summary_statistics <- numeric_cols %>%
  summarise(across(everything(), list(
    N = ~ n(),
    mean = ~ mean(.),
    sd = ~ sd(.),
    min = ~ min(.),
    Q1 = ~ quantile(., 0.25),
    median = ~ median(.),
    Q3 = ~ quantile(., 0.75),
    max = ~ max(.)
  )))
summary_statistics
3. Split medical_data into training and test data sets with a 70-30 split and save them as train_strat and test_strat. Set your seed to 123. When splitting the data, make sure each data set has the same proportion of smokers versus non-smokers. Run a simple linear regression model using train_strat, save it as “model_SL”, and summarize the output.
set.seed(123)
# Stratify on smoker so both subsets keep the same smoker/non-smoker mix
train_index <- createDataPartition(medical_data$smoker, p = 0.7, list = FALSE)
train_strat <- medical_data[train_index, ]
test_strat <- medical_data[-train_index, ]
model_SL <- lm(charges ~ bmi, data = train_strat)
summary(model_SL)
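One way to confirm that a stratified split kept the smoker proportions aligned is to tabulate them in both subsets. The sketch below uses simulated data (the `toy` data frame and its values are hypothetical, not the homework file) with the same `createDataPartition()` call pattern.

```r
library(caret)

# Simulated stand-in for medical_data (hypothetical values)
set.seed(123)
toy <- data.frame(
  smoker = factor(sample(c("no", "yes"), 200, replace = TRUE, prob = c(0.8, 0.2))),
  bmi = rnorm(200, mean = 30, sd = 6)
)

# 70-30 split stratified on smoker
idx <- createDataPartition(toy$smoker, p = 0.7, list = FALSE)
toy_train <- toy[idx, ]
toy_test <- toy[-idx, ]

# Share of non-smokers and smokers in each subset; the two rows should be close
split_check <- rbind(
  train = prop.table(table(toy_train$smoker)),
  test = prop.table(table(toy_test$smoker))
)
split_check
```

The same `prop.table(table(...))` comparison can be run on train_strat and test_strat.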
Regression formula
charges = β0 + β1 * bmi + ε
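For reference, the ordinary least squares estimates that lm() computes for this model have a closed form. Writing \(x_i\) for bmi and \(y_i\) for charges over the \(n\) training observations (notation introduced here for illustration, not part of the assignment):

```latex
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
```

The slope is the sample covariance of bmi and charges divided by the sample variance of bmi, and the intercept forces the fitted line through the point of means.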
4. Plot charges against bmi, together with the fitted line and its 95% confidence band.
ggplot(train_strat, aes(x = bmi, y = charges)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE, color = "blue")
5. Can you write down the estimated linear regression equation? How do you interpret the parameters and their statistics, as well as the overall model evaluation?
coef(model_SL)
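As a rough sketch of how the estimated equation can be assembled from the coefficients, the code below fits the same kind of model on simulated data. The `toy` data frame, its true intercept of 1000, and its true slope of 400 are assumptions made up for illustration, not estimates from the homework data.

```r
# Simulated stand-in for train_strat (hypothetical values)
set.seed(1)
toy <- data.frame(bmi = rnorm(100, mean = 30, sd = 6))
toy$charges <- 1000 + 400 * toy$bmi + rnorm(100, sd = 5000)

toy_model <- lm(charges ~ bmi, data = toy)
b <- coef(toy_model)

# Write the fitted line as text: predicted charges = b0 + b1 * bmi
equation <- sprintf("charges_hat = %.2f + %.2f * bmi", b[["(Intercept)"]], b[["bmi"]])
equation

# Overall fit: share of variance explained and the residual standard error
summary(toy_model)$r.squared
summary(toy_model)$sigma
```

The slope reads as the expected change in charges for a one-unit increase in bmi; the intercept is the expected charge at bmi = 0, which usually lies outside the observed range.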
6. Test the hypothesis that medical insurance charges are positively associated with the policyholder’s bmi. Do the training data support this hypothesis, and why?
summary(model_SL)$coefficients
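The coefficient table reports a two-sided p-value, while the hypothesis in question 6 is one-sided (H0: β1 ≤ 0 against H1: β1 > 0). A minimal sketch of the one-sided calculation, on simulated data whose values are hypothetical:

```r
# Simulated stand-in for train_strat (hypothetical values)
set.seed(1)
toy <- data.frame(bmi = rnorm(100, mean = 30, sd = 6))
toy$charges <- 1000 + 400 * toy$bmi + rnorm(100, sd = 5000)
toy_model <- lm(charges ~ bmi, data = toy)

coefs <- summary(toy_model)$coefficients
t_stat <- coefs["bmi", "t value"]
df_resid <- df.residual(toy_model)

# One-sided p-value: probability of a t statistic at least this large under H0
p_one_sided <- pt(t_stat, df = df_resid, lower.tail = FALSE)

# Two-sided p-value as reported by summary()
p_two_sided <- coefs["bmi", "Pr(>|t|)"]

c(t = t_stat, one_sided = p_one_sided, two_sided = p_two_sided)
```

When the estimated slope is positive, the one-sided p-value is half the two-sided one.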
7. Find the confidence intervals for the model parameters at the 95% confidence level (approximately \(\hat{\beta}_1 \pm 2 \cdot SE(\hat{\beta}_1)\)).
confint(model_SL, level = 0.95)
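The ±2·SE rule in the question is an approximation; confint() uses the exact t quantile for the residual degrees of freedom. The sketch below compares the two on simulated data (the `toy` values are hypothetical).

```r
# Simulated stand-in for train_strat (hypothetical values)
set.seed(1)
toy <- data.frame(bmi = rnorm(100, mean = 30, sd = 6))
toy$charges <- 1000 + 400 * toy$bmi + rnorm(100, sd = 5000)
toy_model <- lm(charges ~ bmi, data = toy)

coefs <- summary(toy_model)$coefficients
est <- coefs["bmi", "Estimate"]
se <- coefs["bmi", "Std. Error"]

# Exact interval: estimate plus or minus the 97.5% t quantile times the standard error
t_crit <- qt(0.975, df = df.residual(toy_model))
exact_ci <- est + c(-1, 1) * t_crit * se

# Rule-of-thumb interval from the question
approx_ci <- est + c(-1, 1) * 2 * se

rbind(exact = exact_ci, approx = approx_ci)
```

With a large sample the t quantile is close to 1.96, so the two intervals nearly coincide.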
8. Check the histogram of residuals. How do you evaluate your data?
residuals_SL <- residuals(model_SL)
hist(residuals_SL, main = "Histogram of Residuals", xlab = "Residuals", col = "lightblue")
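Beyond the histogram, two other base R plots are commonly used to judge residuals: a normal Q-Q plot and a residuals-versus-fitted plot. This is an optional sketch on simulated data (the `toy` values are hypothetical), not something the question requires.

```r
# Simulated stand-in for train_strat (hypothetical values)
set.seed(1)
toy <- data.frame(bmi = rnorm(100, mean = 30, sd = 6))
toy$charges <- 1000 + 400 * toy$bmi + rnorm(100, sd = 5000)
toy_model <- lm(charges ~ bmi, data = toy)
res <- residuals(toy_model)

# Normal Q-Q plot: points far from the reference line suggest non-normal residuals
qqnorm(res)
qqline(res, col = "red")

# Residuals vs fitted values: a funnel shape would suggest non-constant variance
plot(fitted(toy_model), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

A strongly right-skewed histogram or a Q-Q plot that bends away from the line would indicate that the normal-error assumption is questionable.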
9. Predict expected charges, add them to train_strat and test_strat, and find the mean squared error for each.
train_strat$predicted_charges <- predict(model_SL, newdata = train_strat)
test_strat$predicted_charges <- predict(model_SL, newdata = test_strat)
mse_train <- mean((train_strat$charges - train_strat$predicted_charges)^2)
mse_test <- mean((test_strat$charges - test_strat$predicted_charges)^2)
mse_train
mse_test
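The mean squared error is the average of the squared differences between actual and predicted values. A small worked example with made-up numbers, using a helper named `mse` that is introduced here for illustration:

```r
# Hypothetical helper: average squared difference between actual and predicted values
mse <- function(actual, predicted) {
  mean((actual - predicted)^2)
}

# Made-up values: errors are -2, 2, -3, so squared errors are 4, 4, 9
actual <- c(10, 20, 30)
predicted <- c(12, 18, 33)

mse_value <- mse(actual, predicted)   # (4 + 4 + 9) / 3
rmse_value <- sqrt(mse_value)         # same units as the response
mse_value
rmse_value
```

Because MSE is in squared units of charges, its square root (RMSE) is often easier to interpret when comparing training and test error.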
10. Plot the linear relationship between bmi and charges, and the fitted line with 95% confidence intervals, for test_strat.
ggplot(test_strat, aes(x = bmi, y = charges)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE, color = "red")
11. Predict the expected medical charges when a policyholder has bmi at 30 using the estimated simple linear regression model.
predict(model_SL, newdata = data.frame(bmi = 30))
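predict() can also return interval estimates alongside the point prediction. The sketch below, on simulated data with hypothetical values, contrasts a confidence interval for the mean charge at bmi = 30 with a prediction interval for a single policyholder.

```r
# Simulated stand-in for train_strat (hypothetical values)
set.seed(1)
toy <- data.frame(bmi = rnorm(100, mean = 30, sd = 6))
toy$charges <- 1000 + 400 * toy$bmi + rnorm(100, sd = 5000)
toy_model <- lm(charges ~ bmi, data = toy)

new_policyholder <- data.frame(bmi = 30)

# Interval for the expected (average) charge at bmi = 30
conf_int <- predict(toy_model, newdata = new_policyholder,
                    interval = "confidence", level = 0.95)

# Interval for one new policyholder's charge at bmi = 30 (wider: includes error variance)
pred_int <- predict(toy_model, newdata = new_policyholder,
                    interval = "prediction", level = 0.95)

rbind(confidence = conf_int[1, ], prediction = pred_int[1, ])
```

Both intervals share the same point prediction, the fitted intercept plus 30 times the fitted slope.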
12. Print out the model using the stargazer package.
stargazer(model_SL, type = "text")