1 Introduction
The dataset used in this R Markdown is downloaded from https://www.kaggle.com/mirichoi0218/insurance/version/1. It consists of 1,338 observations with the following variables, as described on the source website:
age: age of primary beneficiary
sex: insurance contractor gender, female, male
bmi: body mass index (kg/m^2), an objective index of body weight relative to height; the ideal range is 18.5 to 24.9
children: Number of children covered by health insurance / Number of dependents
smoker: smoking status (yes/no)
region: the beneficiary’s residential area in the US, northeast, southeast, southwest, northwest.
charges: Individual medical costs billed by health insurance
The goal is to predict insurance charges using linear regression.
2 Read Data
Load the required libraries, read the dataset, and check the data types.
# Packages used in this analysis; RMSE() is assumed to come from caret
library(dplyr)    # glimpse(), mutate_if()
library(magrittr) # the %<>% assignment pipe
library(GGally)   # ggcorr()
library(caret)    # RMSE()
library(lmtest)   # bptest()
library(car)      # vif()

insurance <- read.csv("data/insurance.csv")
glimpse(insurance)
## Rows: 1,338
## Columns: 7
## $ age <int> 19, 18, 28, 33, 32, 31, 46, 37, 37, 60, 25, 62, 23, 56, 27, 1~
## $ sex <chr> "female", "male", "male", "male", "male", "female", "female",~
## $ bmi <dbl> 27.900, 33.770, 33.000, 22.705, 28.880, 25.740, 33.440, 27.74~
## $ children <int> 0, 1, 3, 0, 0, 0, 1, 3, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0~
## $ smoker <chr> "yes", "no", "no", "no", "no", "no", "no", "no", "no", "no", ~
## $ region <chr> "southwest", "southeast", "southeast", "northwest", "northwes~
## $ charges <dbl> 16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622,~
3 Data Wrangling
Convert character columns to factors.
insurance %<>% mutate_if(is.character, as.factor)
glimpse(insurance)
## Rows: 1,338
## Columns: 7
## $ age <int> 19, 18, 28, 33, 32, 31, 46, 37, 37, 60, 25, 62, 23, 56, 27, 1~
## $ sex <fct> female, male, male, male, male, female, female, female, male,~
## $ bmi <dbl> 27.900, 33.770, 33.000, 22.705, 28.880, 25.740, 33.440, 27.74~
## $ children <int> 0, 1, 3, 0, 0, 0, 1, 3, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0~
## $ smoker <fct> yes, no, no, no, no, no, no, no, no, no, no, yes, no, no, yes~
## $ region <fct> southwest, southeast, southeast, northwest, northwest, southe~
## $ charges <dbl> 16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622,~
Check whether any NA values exist.
anyNA(insurance)
## [1] FALSE
4 EDA
Check summary.
summary(insurance)
##       age            sex           bmi           children     smoker    
##  Min.   :18.00   female:662   Min.   :15.96   Min.   :0.000   no :1064  
##  1st Qu.:27.00   male  :676   1st Qu.:26.30   1st Qu.:0.000   yes: 274  
##  Median :39.00                Median :30.40   Median :1.000             
##  Mean   :39.21                Mean   :30.66   Mean   :1.095             
##  3rd Qu.:51.00                3rd Qu.:34.69   3rd Qu.:2.000             
##  Max.   :64.00                Max.   :53.13   Max.   :5.000             
##        region       charges     
##  northeast:324   Min.   : 1122  
##  northwest:325   1st Qu.: 4740  
##  southeast:364   Median : 9382  
##  southwest:325   Mean   :13270  
##                  3rd Qu.:16640  
##                  Max.   :63770  
Check for outliers in charges.
boxplot(insurance$charges)
hist(insurance$charges)
Remove outliers using the IQR rule: drop observations with charges above Q3 + 1.5 × IQR.
q.charges <- quantile(insurance$charges)  # 0%, 25%, 50%, 75%, 100%
insurance_clean <- insurance %>%
  filter(charges <= q.charges[4] + 1.5 * (q.charges[4] - q.charges[2]))
dataset <- list()
dataset[[1]] <- insurance        # original data, outliers kept
dataset[[2]] <- insurance_clean  # outliers removed
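As a quick sanity check (a minimal sketch using the quantities computed above), we can verify the share of observations dropped by the IQR rule:
# Share of observations above the upper IQR fence for charges
iqr.charges <- q.charges[4] - q.charges[2]      # IQR = Q3 - Q1
upper.fence <- q.charges[4] + 1.5 * iqr.charges
mean(insurance$charges > upper.fence)           # proportion flagged, roughly 10%
nrow(insurance) - nrow(insurance_clean)         # number of rows removed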
Analyze the correlation between charges and the other variables.
ggcorr(insurance, label = T)
## Warning in ggcorr(insurance, label = T): data in column(s) 'sex', 'smoker',
## 'region' are not numeric and were ignored
plot(insurance$sex, insurance$charges)
plot(insurance$smoker, insurance$charges)
plot(insurance$region, insurance$charges)
Analysis:
- About 10% of observations are outliers with respect to charges (see the check above); both versions of the dataset are kept so the models can be compared later.
- Among the numeric variables, age shows the strongest correlation with charges, while among the categorical variables only smoker shows a clear relationship with charges (see the quick check below).
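Both claims can be confirmed numerically; a minimal sketch using base R on the data loaded above:
# Correlation of each numeric predictor with charges
cor(insurance[, c("age", "bmi", "children")], insurance$charges)
# Mean charges by smoking status, the gap visible in the boxplot above
tapply(insurance$charges, insurance$smoker, mean)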
5 Modelling
5.1 Manual Feature Selection
Create linear regression models using the age and smoker predictors, one for each dataset.
model <- list()
for (item in dataset) {
  model[[length(model) + 1]] <- lm(charges ~ age + smoker, item)
}
Evaluate the models.
i <- 0
for (item in model) {
  i <- i + 1
  print(paste("Model No.", i))
  print("------------------")
  print(paste("Adj R Squared:", summary(item)$adj.r.squared))
  print(paste("RMSE:", RMSE(item$fitted.values, dataset[[i]]$charges)))
  print(paste("Shapiro-Wilk Test P-Value:", shapiro.test(item$residuals)$p.value))
  print(paste("Breusch-Pagan Test P-Value:", bptest(item)$p.value))
  print(paste("VIF:", vif(item)))
  print("------------------")
}
## [1] "Model No. 1"
## [1] "------------------"
## [1] "Adj R Squared: 0.720983449018106"
## [1] "RMSE: 6389.57695722642"
## [1] "Shapiro-Wilk Test P-Value: 9.86277258793249e-37"
## [1] "Breusch-Pagan Test P-Value: 3.64483356704953e-54"
## [1] "VIF: 1.00062632997212" "VIF: 1.00062632997212"
## [1] "------------------"
## [1] "Model No. 2"
## [1] "------------------"
## [1] "Adj R Squared: 0.591731480145691"
## [1] "RMSE: 4621.0108443498"
## [1] "Shapiro-Wilk Test P-Value: 1.94644919027387e-48"
## [1] "Breusch-Pagan Test P-Value: 0.0182165772533092"
## [1] "VIF: 1.00456224187403" "VIF: 1.00456224187403"
## [1] "------------------"
The results show that the model built on the dataset with outliers is better in terms of adjusted R-squared. However, both models fail the Shapiro-Wilk and Breusch-Pagan tests at the 0.05 level, so neither has normally distributed nor homoscedastic residuals.
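A common remedy for right-skewed, heteroscedastic cost data is to model log(charges) instead of charges; the sketch below is only an illustration of that idea and is not part of the original analysis:
# Hypothetical variant: refit the age + smoker model on log-transformed charges
model_log <- lm(log(charges) ~ age + smoker, insurance)
shapiro.test(model_log$residuals)$p.value   # re-check residual normality
bptest(model_log)$p.value                   # re-check homoscedasticity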
5.2 Step-wise Regression
Fit models using forward, backward, and both-direction step-wise selection, then evaluate them with the same metrics as before.
model_none <- lm(charges ~ 1, insurance)  # intercept-only model (lower scope bound)
model_all <- lm(charges ~ ., insurance)   # full model with all predictors (upper scope bound)
model_step <- list()
model_step[[1]] <- step(model_none, direction = "forward", trace = F, scope = list(upper = model_all, lower = model_none))
model_step[[2]] <- step(model_all, direction = "backward", trace = F)
model_step[[3]] <- step(model_all, direction = "both", trace = F, scope = list(upper = model_all))
i <- 0
for (item in model_step) {
  i <- i + 1
  print(paste("Model No.", i))
  print("------------------")
  print(paste("Adj R Squared:", summary(item)$adj.r.squared))
  print(paste("RMSE:", RMSE(item$fitted.values, insurance$charges)))
  print(paste("Shapiro-Wilk Test P-Value:", shapiro.test(item$residuals)$p.value))
  print(paste("Breusch-Pagan Test P-Value:", bptest(item)$p.value))
  print(paste("VIF:", vif(item)))
  print("------------------")
}
## [1] "Model No. 1"
## [1] "------------------"
## [1] "Adj R Squared: 0.749572742711622"
## [1] "RMSE: 6042.0332153941"
## [1] "Shapiro-Wilk Test P-Value: 8.73942161741846e-29"
## [1] "Breusch-Pagan Test P-Value: 5.13405122905244e-23"
## [1] "VIF: 1.00636872563314" "VIF: 1.01618764565747" "VIF: 1.1041974319208"
## [4] "VIF: 1.00371409080625" "VIF: 1.09886952082173" "VIF: 1"
## [7] "VIF: 1" "VIF: 1" "VIF: 1"
## [10] "VIF: 3" "VIF: 1.0031793088143" "VIF: 1.00806133030558"
## [13] "VIF: 1.05080799003472" "VIF: 1.00185532428902" "VIF: 1.01583776589367"
## [1] "------------------"
## [1] "Model No. 2"
## [1] "------------------"
## [1] "Adj R Squared: 0.749572742711622"
## [1] "RMSE: 6042.0332153941"
## [1] "Shapiro-Wilk Test P-Value: 8.73942161741812e-29"
## [1] "Breusch-Pagan Test P-Value: 5.13405122905222e-23"
## [1] "VIF: 1.01618764565747" "VIF: 1.1041974319208" "VIF: 1.00371409080625"
## [4] "VIF: 1.00636872563314" "VIF: 1.09886952082173" "VIF: 1"
## [7] "VIF: 1" "VIF: 1" "VIF: 1"
## [10] "VIF: 3" "VIF: 1.00806133030558" "VIF: 1.05080799003472"
## [13] "VIF: 1.00185532428902" "VIF: 1.0031793088143" "VIF: 1.01583776589367"
## [1] "------------------"
## [1] "Model No. 3"
## [1] "------------------"
## [1] "Adj R Squared: 0.749572742711622"
## [1] "RMSE: 6042.0332153941"
## [1] "Shapiro-Wilk Test P-Value: 8.73942161741812e-29"
## [1] "Breusch-Pagan Test P-Value: 5.13405122905222e-23"
## [1] "VIF: 1.01618764565747" "VIF: 1.1041974319208" "VIF: 1.00371409080625"
## [4] "VIF: 1.00636872563314" "VIF: 1.09886952082173" "VIF: 1"
## [7] "VIF: 1" "VIF: 1" "VIF: 1"
## [10] "VIF: 3" "VIF: 1.00806133030558" "VIF: 1.05080799003472"
## [13] "VIF: 1.00185532428902" "VIF: 1.0031793088143" "VIF: 1.01583776589367"
## [1] "------------------"
The results show that all three selection directions picked the same features, as the three models report identical metrics. Step-wise regression still produces models whose residuals are neither normally distributed nor homoscedastic.
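That the three search directions converged on the same model can be verified directly (a minimal check on the fitted objects above):
# All three step-wise searches should report the same selected formula
lapply(model_step, formula)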
6 Conclusion
- Outliers have little influence on the model, so they can be kept.
- The best model comes from step-wise regression, which explains about 75% of the variance in charges (adjusted R-squared ≈ 0.75).
- None of the models have normally distributed or homoscedastic residuals.
- No model shows multicollinearity; all VIF values are well below 10.