# Introduction

A Medical Insurance Company Has Released Data For Almost 1000 Customers. Create A Model That Predicts The Yearly Medical Cover Cost. The Data Is Voluntarily Given By Customers. We ara going to use this dataset to build a model.

We are going to make prediction from Medical Insurance. We want to know relationship between Premium Price with other variable. From that, we are going to predict Premium Price based on historical data. 

# Data Preparation

Load packages that we need


```r
library(tidyverse)
library(caret)
library(plotly)
library(data.table)
library(GGally)
library(tidymodels)
library(car)
library(scales)
library(lmtest)

options(scipen = 100, max.print = 1e+06)

Read and check the dataset

med_ins <- read.csv("Medicalpremium.csv")
med_ins %>% head()
# Check Data Structure
str(med_ins)
#> 'data.frame':    986 obs. of  11 variables:
#>  $ Age                    : int  45 60 36 52 38 30 33 23 48 38 ...
#>  $ Diabetes               : int  0 1 1 1 0 0 0 0 1 0 ...
#>  $ BloodPressureProblems  : int  0 0 1 1 0 0 0 0 0 0 ...
#>  $ AnyTransplants         : int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ AnyChronicDiseases     : int  0 0 0 1 1 0 0 0 0 0 ...
#>  $ Height                 : int  155 180 158 183 166 160 150 181 169 182 ...
#>  $ Weight                 : int  57 73 59 93 88 69 54 79 74 93 ...
#>  $ KnownAllergies         : int  0 0 0 0 0 1 0 1 1 0 ...
#>  $ HistoryOfCancerInFamily: int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ NumberOfMajorSurgeries : int  0 0 1 2 1 1 0 0 0 0 ...
#>  $ PremiumPrice           : int  25000 29000 23000 28000 23000 23000 21000 15000 23000 23000 ...

The data has 986 rows and 11 columns. Our target variable is PremiumPrice. We can see that the datatype still not right. For example Diabetes. It is a yes or no answer. So, we must change to factor. Let’s change the data type so we can use it further.

# Change data type
med_ins <- med_ins %>% mutate(
  Diabetes = as.factor(Diabetes),
  BloodPressureProblems = as.factor(BloodPressureProblems),
  AnyTransplants = as.factor(AnyTransplants),
  AnyChronicDiseases = as.factor(AnyChronicDiseases),
  KnownAllergies = as.factor(KnownAllergies),
  HistoryOfCancerInFamily = as.factor(HistoryOfCancerInFamily),
)

str(med_ins)
#> 'data.frame':    986 obs. of  11 variables:
#>  $ Age                    : int  45 60 36 52 38 30 33 23 48 38 ...
#>  $ Diabetes               : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 1 ...
#>  $ BloodPressureProblems  : Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 1 1 1 ...
#>  $ AnyTransplants         : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ AnyChronicDiseases     : Factor w/ 2 levels "0","1": 1 1 1 2 2 1 1 1 1 1 ...
#>  $ Height                 : int  155 180 158 183 166 160 150 181 169 182 ...
#>  $ Weight                 : int  57 73 59 93 88 69 54 79 74 93 ...
#>  $ KnownAllergies         : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 2 2 1 ...
#>  $ HistoryOfCancerInFamily: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ NumberOfMajorSurgeries : int  0 0 1 2 1 1 0 0 0 0 ...
#>  $ PremiumPrice           : int  25000 29000 23000 28000 23000 23000 21000 15000 23000 23000 ...

Exploratory Data Analysis

Before we try to make model, we check all data variable. We see if any corellation between them.

ggcorr(med_ins, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)

We can see that only Age has strong correlation to PremiumPrice

Modeling

Holdout Data

To avoid any chance of overfitting, we are going to hold out some data from data set, that we are going to use for test. We will use 70% of the data as the training and 30% for testing data.

set.seed(70)
samplesize <- round(0.7 * nrow(med_ins), 0)
index <- sample(seq_len(nrow(med_ins)), size = samplesize)

data_train <- med_ins[index, ]
data_test <- med_ins[-index, ]

Linear Regression

We are going to make model with all varible. Then see how the model works.

set.seed(70)
med_lm <- lm(PremiumPrice ~ ., data = data_train)

summary(med_lm)
#> 
#> Call:
#> lm(formula = PremiumPrice ~ ., data = data_train)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -12905.7  -2086.7   -386.2   1828.3  24240.7 
#> 
#> Coefficients:
#>                          Estimate Std. Error t value             Pr(>|t|)    
#> (Intercept)               6510.42    2516.51   2.587              0.00989 ** 
#> Age                        333.22      11.99  27.795 < 0.0000000000000002 ***
#> Diabetes1                 -593.53     302.37  -1.963              0.05006 .  
#> BloodPressureProblems1     114.19     304.91   0.375              0.70814    
#> AnyTransplants1           7314.00     641.76  11.397 < 0.0000000000000002 ***
#> AnyChronicDiseases1       2595.70     383.83   6.763      0.0000000000293 ***
#> Height                     -10.48      14.42  -0.727              0.46763    
#> Weight                      62.84      10.33   6.084      0.0000000019585 ***
#> KnownAllergies1            570.91     363.99   1.568              0.11724    
#> HistoryOfCancerInFamily1  1989.55     470.52   4.228      0.0000267579690 ***
#> NumberOfMajorSurgeries    -500.62     232.67  -2.152              0.03178 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 3784 on 679 degrees of freedom
#> Multiple R-squared:  0.6344, Adjusted R-squared:  0.629 
#> F-statistic: 117.8 on 10 and 679 DF,  p-value: < 0.00000000000000022

With all predictor, we get Adjusted R-squared: 0.629. It’s already good. But it can be improved. Let’s try with stepwise regression. First, we try forward selection.

# stepwise regression: forward selection
model_med_none <- lm(PremiumPrice ~ 1, data = med_ins)

model_forward <- stats::step(object = model_med_none,
                      direction = "forward",
                      scope = list(upper= med_lm),
                      trace=F)

summary(model_forward)
#> 
#> Call:
#> lm(formula = PremiumPrice ~ Age + AnyTransplants + AnyChronicDiseases + 
#>     Weight + HistoryOfCancerInFamily + NumberOfMajorSurgeries + 
#>     Diabetes, data = med_ins)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -13573.5  -2200.4   -284.9   1861.9  24070.4 
#> 
#> Coefficients:
#>                          Estimate Std. Error t value             Pr(>|t|)    
#> (Intercept)              4634.138    762.032   6.081 0.000000001707741629 ***
#> Age                       329.744      9.722  33.917 < 0.0000000000000002 ***
#> AnyTransplants1          7895.560    521.212  15.148 < 0.0000000000000002 ***
#> AnyChronicDiseases1      2646.353    312.959   8.456 < 0.0000000000000002 ***
#> Weight                     69.343      8.386   8.269 0.000000000000000437 ***
#> HistoryOfCancerInFamily1 2348.779    383.734   6.121 0.000000001345479527 ***
#> NumberOfMajorSurgeries   -615.646    182.566  -3.372             0.000775 ***
#> Diabetes1                -434.314    249.733  -1.739             0.082330 .  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 3751 on 978 degrees of freedom
#> Multiple R-squared:  0.6422, Adjusted R-squared:  0.6397 
#> F-statistic: 250.8 on 7 and 978 DF,  p-value: < 0.00000000000000022

Second, we try backward selection.

# stepwise regression: backward elimination
model_backward <- stats::step(object = med_lm,  
                       direction = "backward", 
                       trace = F) 

summary(model_backward)
#> 
#> Call:
#> lm(formula = PremiumPrice ~ Age + Diabetes + AnyTransplants + 
#>     AnyChronicDiseases + Weight + KnownAllergies + HistoryOfCancerInFamily + 
#>     NumberOfMajorSurgeries, data = data_train)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -12831.8  -2084.0   -375.2   1833.3  24142.1 
#> 
#> Coefficients:
#>                          Estimate Std. Error t value             Pr(>|t|)    
#> (Intercept)               4850.16     935.86   5.183      0.0000002887492 ***
#> Age                        332.77      11.83  28.130 < 0.0000000000000002 ***
#> Diabetes1                 -575.36     300.89  -1.912               0.0563 .  
#> AnyTransplants1           7316.01     640.69  11.419 < 0.0000000000000002 ***
#> AnyChronicDiseases1       2596.02     382.25   6.791      0.0000000000242 ***
#> Weight                      62.21      10.29   6.046      0.0000000024462 ***
#> KnownAllergies1            572.69     362.51   1.580               0.1146    
#> HistoryOfCancerInFamily1  1980.66     469.45   4.219      0.0000278521360 ***
#> NumberOfMajorSurgeries    -483.02     228.87  -2.110               0.0352 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 3780 on 681 degrees of freedom
#> Multiple R-squared:  0.634,  Adjusted R-squared:  0.6297 
#> F-statistic: 147.5 on 8 and 681 DF,  p-value: < 0.00000000000000022

Finally, we try both selection.

model_both <- stats::step(object = model_med_none,
                      direction = "both",
                      scope = list(upper= med_lm),trace=F)
summary(model_both)
#> 
#> Call:
#> lm(formula = PremiumPrice ~ Age + AnyTransplants + AnyChronicDiseases + 
#>     Weight + HistoryOfCancerInFamily + NumberOfMajorSurgeries + 
#>     Diabetes, data = med_ins)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -13573.5  -2200.4   -284.9   1861.9  24070.4 
#> 
#> Coefficients:
#>                          Estimate Std. Error t value             Pr(>|t|)    
#> (Intercept)              4634.138    762.032   6.081 0.000000001707741629 ***
#> Age                       329.744      9.722  33.917 < 0.0000000000000002 ***
#> AnyTransplants1          7895.560    521.212  15.148 < 0.0000000000000002 ***
#> AnyChronicDiseases1      2646.353    312.959   8.456 < 0.0000000000000002 ***
#> Weight                     69.343      8.386   8.269 0.000000000000000437 ***
#> HistoryOfCancerInFamily1 2348.779    383.734   6.121 0.000000001345479527 ***
#> NumberOfMajorSurgeries   -615.646    182.566  -3.372             0.000775 ***
#> Diabetes1                -434.314    249.733  -1.739             0.082330 .  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 3751 on 978 degrees of freedom
#> Multiple R-squared:  0.6422, Adjusted R-squared:  0.6397 
#> F-statistic: 250.8 on 7 and 978 DF,  p-value: < 0.00000000000000022

We can see that model forward and model both has the same adjusted R-Squared with highest value among others. So we choose one of them. I choose model forward.

Evaluation

We are going to evaluate our model performance with RMSE (Root Mean Squared Error). The less RMSE is the better.

lm_pred <- predict(model_forward, newdata = data_test %>% select(-PremiumPrice))

# RMSE of train dataset
RMSE(pred = model_forward$fitted.values, obs =  data_train$PremiumPrice)
#> [1] 8061.498
# RMSE of test dataset
RMSE(pred = lm_pred, obs = data_test$PremiumPrice)
#> [1] 3646.988

The difference between train and test are too much. Let’s try Model Backward.

lm_pred2 <- predict(model_backward, newdata = data_test %>% select(-PremiumPrice))

# RMSE of train dataset
RMSE(pred = model_backward$fitted.values, obs =  data_train$PremiumPrice)
#> [1] 3755.45
# RMSE of test dataset
RMSE(pred = lm_pred2, obs = data_test$PremiumPrice)
#> [1] 3719.442

The result is better. We use model backward.

Assumption

Linearity

resact <- data.frame(residual = model_backward$residuals, fitted = model_backward$fitted.values)

resact %>% ggplot(aes(fitted, residual)) + geom_point() + geom_hline(aes(yintercept = 0)) + 
    geom_smooth() + theme(panel.grid = element_blank(), panel.background = element_blank())

It’s not good because the error has pattern.

Normality Test

shapiro.test(model_backward$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  model_backward$residuals
#> W = 0.91629, p-value < 0.00000000000000022

with p-value < 0,05, our residual is not normally distributed

Autocorrelation

set.seed(1)
durbinWatsonTest(model_backward)
#>  lag Autocorrelation D-W Statistic p-value
#>    1      0.02338776      1.952094    0.55
#>  Alternative hypothesis: rho != 0

With p-value > 0.05, we can conclude that autocorrelation is not present.

Heterocedasticity

bptest(model_backward)
#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  model_backward
#> BP = 50.46, df = 8, p-value = 0.00000003334
resact %>% ggplot(aes(fitted, residual)) + geom_point() + geom_hline(aes(yintercept = 0)) + 
    theme(panel.grid = element_blank(), panel.background = element_blank())

With p-value < 0.05, we can conclude that heterocesdasticity is present.

Multicollinearity

vif(model_backward)
#>                     Age                Diabetes          AnyTransplants 
#>                1.290261                1.075947                1.005869 
#>      AnyChronicDiseases                  Weight          KnownAllergies 
#>                1.020258                1.004052                1.042539 
#> HistoryOfCancerInFamily  NumberOfMajorSurgeries 
#>                1.090759                1.315478

There is no variable that over 10. So, there is no Multicollinearity.

Conclusion

We can see that this model is not good enough to predict Premium Price. My opinion is we can invest more to get more data or obsevartion. So we can build a better model. This model has error pattern, we can get the conclusion that our model didn’t work good enough for prediciton beacuse some data is not represented in our model.