# Introduction
A Medical Insurance Company Has Released Data For Almost 1000 Customers. Create A Model That Predicts The Yearly Medical Cover Cost. The Data Is Voluntarily Given By Customers. We ara going to use this dataset to build a model.
We are going to make prediction from Medical Insurance. We want to know relationship between Premium Price with other variable. From that, we are going to predict Premium Price based on historical data.
# Data Preparation
Load packages that we need
```r
library(tidyverse)
library(caret)
library(plotly)
library(data.table)
library(GGally)
library(tidymodels)
library(car)
library(scales)
library(lmtest)
options(scipen = 100, max.print = 1e+06)
Read and check the dataset
med_ins <- read.csv("Medicalpremium.csv")
med_ins %>% head()# Check Data Structure
str(med_ins)#> 'data.frame': 986 obs. of 11 variables:
#> $ Age : int 45 60 36 52 38 30 33 23 48 38 ...
#> $ Diabetes : int 0 1 1 1 0 0 0 0 1 0 ...
#> $ BloodPressureProblems : int 0 0 1 1 0 0 0 0 0 0 ...
#> $ AnyTransplants : int 0 0 0 0 0 0 0 0 0 0 ...
#> $ AnyChronicDiseases : int 0 0 0 1 1 0 0 0 0 0 ...
#> $ Height : int 155 180 158 183 166 160 150 181 169 182 ...
#> $ Weight : int 57 73 59 93 88 69 54 79 74 93 ...
#> $ KnownAllergies : int 0 0 0 0 0 1 0 1 1 0 ...
#> $ HistoryOfCancerInFamily: int 0 0 0 0 0 0 0 0 0 0 ...
#> $ NumberOfMajorSurgeries : int 0 0 1 2 1 1 0 0 0 0 ...
#> $ PremiumPrice : int 25000 29000 23000 28000 23000 23000 21000 15000 23000 23000 ...
The data has 986 rows and 11 columns. Our target variable is
PremiumPrice. We can see that the datatype still not right.
For example Diabetes. It is a yes or no answer. So, we must change to
factor. Let’s change the data type so we can use it further.
# Change data type
med_ins <- med_ins %>% mutate(
Diabetes = as.factor(Diabetes),
BloodPressureProblems = as.factor(BloodPressureProblems),
AnyTransplants = as.factor(AnyTransplants),
AnyChronicDiseases = as.factor(AnyChronicDiseases),
KnownAllergies = as.factor(KnownAllergies),
HistoryOfCancerInFamily = as.factor(HistoryOfCancerInFamily),
)
str(med_ins)#> 'data.frame': 986 obs. of 11 variables:
#> $ Age : int 45 60 36 52 38 30 33 23 48 38 ...
#> $ Diabetes : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 1 ...
#> $ BloodPressureProblems : Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 1 1 1 ...
#> $ AnyTransplants : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
#> $ AnyChronicDiseases : Factor w/ 2 levels "0","1": 1 1 1 2 2 1 1 1 1 1 ...
#> $ Height : int 155 180 158 183 166 160 150 181 169 182 ...
#> $ Weight : int 57 73 59 93 88 69 54 79 74 93 ...
#> $ KnownAllergies : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 2 2 1 ...
#> $ HistoryOfCancerInFamily: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
#> $ NumberOfMajorSurgeries : int 0 0 1 2 1 1 0 0 0 0 ...
#> $ PremiumPrice : int 25000 29000 23000 28000 23000 23000 21000 15000 23000 23000 ...
Before we try to make model, we check all data variable. We see if any corellation between them.
ggcorr(med_ins, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)We can see that only Age has strong correlation to
PremiumPrice
To avoid any chance of overfitting, we are going to hold out some data from data set, that we are going to use for test. We will use 70% of the data as the training and 30% for testing data.
set.seed(70)
samplesize <- round(0.7 * nrow(med_ins), 0)
index <- sample(seq_len(nrow(med_ins)), size = samplesize)
data_train <- med_ins[index, ]
data_test <- med_ins[-index, ]We are going to make model with all varible. Then see how the model works.
set.seed(70)
med_lm <- lm(PremiumPrice ~ ., data = data_train)
summary(med_lm)#>
#> Call:
#> lm(formula = PremiumPrice ~ ., data = data_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -12905.7 -2086.7 -386.2 1828.3 24240.7
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 6510.42 2516.51 2.587 0.00989 **
#> Age 333.22 11.99 27.795 < 0.0000000000000002 ***
#> Diabetes1 -593.53 302.37 -1.963 0.05006 .
#> BloodPressureProblems1 114.19 304.91 0.375 0.70814
#> AnyTransplants1 7314.00 641.76 11.397 < 0.0000000000000002 ***
#> AnyChronicDiseases1 2595.70 383.83 6.763 0.0000000000293 ***
#> Height -10.48 14.42 -0.727 0.46763
#> Weight 62.84 10.33 6.084 0.0000000019585 ***
#> KnownAllergies1 570.91 363.99 1.568 0.11724
#> HistoryOfCancerInFamily1 1989.55 470.52 4.228 0.0000267579690 ***
#> NumberOfMajorSurgeries -500.62 232.67 -2.152 0.03178 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 3784 on 679 degrees of freedom
#> Multiple R-squared: 0.6344, Adjusted R-squared: 0.629
#> F-statistic: 117.8 on 10 and 679 DF, p-value: < 0.00000000000000022
With all predictor, we get Adjusted R-squared: 0.629. It’s already good. But it can be improved. Let’s try with stepwise regression. First, we try forward selection.
# stepwise regression: forward selection
model_med_none <- lm(PremiumPrice ~ 1, data = med_ins)
model_forward <- stats::step(object = model_med_none,
direction = "forward",
scope = list(upper= med_lm),
trace=F)
summary(model_forward)#>
#> Call:
#> lm(formula = PremiumPrice ~ Age + AnyTransplants + AnyChronicDiseases +
#> Weight + HistoryOfCancerInFamily + NumberOfMajorSurgeries +
#> Diabetes, data = med_ins)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -13573.5 -2200.4 -284.9 1861.9 24070.4
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 4634.138 762.032 6.081 0.000000001707741629 ***
#> Age 329.744 9.722 33.917 < 0.0000000000000002 ***
#> AnyTransplants1 7895.560 521.212 15.148 < 0.0000000000000002 ***
#> AnyChronicDiseases1 2646.353 312.959 8.456 < 0.0000000000000002 ***
#> Weight 69.343 8.386 8.269 0.000000000000000437 ***
#> HistoryOfCancerInFamily1 2348.779 383.734 6.121 0.000000001345479527 ***
#> NumberOfMajorSurgeries -615.646 182.566 -3.372 0.000775 ***
#> Diabetes1 -434.314 249.733 -1.739 0.082330 .
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 3751 on 978 degrees of freedom
#> Multiple R-squared: 0.6422, Adjusted R-squared: 0.6397
#> F-statistic: 250.8 on 7 and 978 DF, p-value: < 0.00000000000000022
Second, we try backward selection.
# stepwise regression: backward elimination
model_backward <- stats::step(object = med_lm,
direction = "backward",
trace = F)
summary(model_backward)#>
#> Call:
#> lm(formula = PremiumPrice ~ Age + Diabetes + AnyTransplants +
#> AnyChronicDiseases + Weight + KnownAllergies + HistoryOfCancerInFamily +
#> NumberOfMajorSurgeries, data = data_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -12831.8 -2084.0 -375.2 1833.3 24142.1
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 4850.16 935.86 5.183 0.0000002887492 ***
#> Age 332.77 11.83 28.130 < 0.0000000000000002 ***
#> Diabetes1 -575.36 300.89 -1.912 0.0563 .
#> AnyTransplants1 7316.01 640.69 11.419 < 0.0000000000000002 ***
#> AnyChronicDiseases1 2596.02 382.25 6.791 0.0000000000242 ***
#> Weight 62.21 10.29 6.046 0.0000000024462 ***
#> KnownAllergies1 572.69 362.51 1.580 0.1146
#> HistoryOfCancerInFamily1 1980.66 469.45 4.219 0.0000278521360 ***
#> NumberOfMajorSurgeries -483.02 228.87 -2.110 0.0352 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 3780 on 681 degrees of freedom
#> Multiple R-squared: 0.634, Adjusted R-squared: 0.6297
#> F-statistic: 147.5 on 8 and 681 DF, p-value: < 0.00000000000000022
Finally, we try both selection.
model_both <- stats::step(object = model_med_none,
direction = "both",
scope = list(upper= med_lm),trace=F)
summary(model_both)#>
#> Call:
#> lm(formula = PremiumPrice ~ Age + AnyTransplants + AnyChronicDiseases +
#> Weight + HistoryOfCancerInFamily + NumberOfMajorSurgeries +
#> Diabetes, data = med_ins)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -13573.5 -2200.4 -284.9 1861.9 24070.4
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 4634.138 762.032 6.081 0.000000001707741629 ***
#> Age 329.744 9.722 33.917 < 0.0000000000000002 ***
#> AnyTransplants1 7895.560 521.212 15.148 < 0.0000000000000002 ***
#> AnyChronicDiseases1 2646.353 312.959 8.456 < 0.0000000000000002 ***
#> Weight 69.343 8.386 8.269 0.000000000000000437 ***
#> HistoryOfCancerInFamily1 2348.779 383.734 6.121 0.000000001345479527 ***
#> NumberOfMajorSurgeries -615.646 182.566 -3.372 0.000775 ***
#> Diabetes1 -434.314 249.733 -1.739 0.082330 .
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 3751 on 978 degrees of freedom
#> Multiple R-squared: 0.6422, Adjusted R-squared: 0.6397
#> F-statistic: 250.8 on 7 and 978 DF, p-value: < 0.00000000000000022
We can see that model forward and model both has the same adjusted R-Squared with highest value among others. So we choose one of them. I choose model forward.
We are going to evaluate our model performance with RMSE (Root Mean Squared Error). The less RMSE is the better.
lm_pred <- predict(model_forward, newdata = data_test %>% select(-PremiumPrice))
# RMSE of train dataset
RMSE(pred = model_forward$fitted.values, obs = data_train$PremiumPrice)#> [1] 8061.498
# RMSE of test dataset
RMSE(pred = lm_pred, obs = data_test$PremiumPrice)#> [1] 3646.988
The difference between train and test are too much. Let’s try Model Backward.
lm_pred2 <- predict(model_backward, newdata = data_test %>% select(-PremiumPrice))
# RMSE of train dataset
RMSE(pred = model_backward$fitted.values, obs = data_train$PremiumPrice)#> [1] 3755.45
# RMSE of test dataset
RMSE(pred = lm_pred2, obs = data_test$PremiumPrice)#> [1] 3719.442
The result is better. We use model backward.
resact <- data.frame(residual = model_backward$residuals, fitted = model_backward$fitted.values)
resact %>% ggplot(aes(fitted, residual)) + geom_point() + geom_hline(aes(yintercept = 0)) +
geom_smooth() + theme(panel.grid = element_blank(), panel.background = element_blank())It’s not good because the error has pattern.
shapiro.test(model_backward$residuals)#>
#> Shapiro-Wilk normality test
#>
#> data: model_backward$residuals
#> W = 0.91629, p-value < 0.00000000000000022
with p-value < 0,05, our residual is not normally distributed
set.seed(1)
durbinWatsonTest(model_backward)#> lag Autocorrelation D-W Statistic p-value
#> 1 0.02338776 1.952094 0.55
#> Alternative hypothesis: rho != 0
With p-value > 0.05, we can conclude that autocorrelation is not present.
bptest(model_backward)#>
#> studentized Breusch-Pagan test
#>
#> data: model_backward
#> BP = 50.46, df = 8, p-value = 0.00000003334
resact %>% ggplot(aes(fitted, residual)) + geom_point() + geom_hline(aes(yintercept = 0)) +
theme(panel.grid = element_blank(), panel.background = element_blank())With p-value < 0.05, we can conclude that heterocesdasticity is present.
vif(model_backward)#> Age Diabetes AnyTransplants
#> 1.290261 1.075947 1.005869
#> AnyChronicDiseases Weight KnownAllergies
#> 1.020258 1.004052 1.042539
#> HistoryOfCancerInFamily NumberOfMajorSurgeries
#> 1.090759 1.315478
There is no variable that over 10. So, there is no Multicollinearity.
We can see that this model is not good enough to predict Premium Price. My opinion is we can invest more to get more data or obsevartion. So we can build a better model. This model has error pattern, we can get the conclusion that our model didn’t work good enough for prediciton beacuse some data is not represented in our model.