Intro

Regression

      Regression is a technique used to predict the value of one variable (the dependent variable) on the basis of other variables (the independent variables). It is parametric in nature because it makes certain assumptions about the data set. If the data set satisfies those assumptions, regression gives excellent results; otherwise, it struggles to provide convincing accuracy.

      Regression is a part of supervised learning, which means that we train our model on given training data and the model tries to relate the dependent variable to the independent variables. It does this using a function that maps the independent variables to the dependent variable. Once the model is trained and the error is minimised, we are able to make predictions on testing data as well.
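In symbols, a regression model learns a mapping of the general form (a generic formulation, not specific to this data set):

\[ y = f(x_1, x_2, \ldots, x_p) + \varepsilon \]

where y is the dependent variable, x1, …, xp are the independent variables, and ε is the error term.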

Linear Regression

      Linear regression is a supervised learning algorithm used when the target / dependent variable is a continuous real number. It establishes the relationship between a dependent variable y and one or more independent variables x using a best-fit line. It works on the principle of ordinary least squares (OLS) / mean squared error (MSE). In statistics, OLS is a method for estimating the unknown parameters of a linear regression function; its goal is to minimize the sum of squared differences between the observed dependent variable in the given data set and the values predicted by the linear regression function.
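Written out, the OLS estimate is the set of coefficients that minimizes the residual sum of squares (the standard formulation):

\[ \hat{\beta} = \underset{\beta}{\text{argmin}} \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_{i1} - \dots - \beta_p x_{ip} \right)^2 \]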

Health Insurance

      Health insurance is insurance that covers the whole or a part of the risk of a person incurring medical expenses, spreading the risk over a large number of persons. By estimating the overall risk of health care and health system expenses over the risk pool, an insurer can develop a routine finance structure, such as a monthly premium or payroll tax, to provide the money to pay for the health care benefits specified in the insurance agreement. The benefit is administered by a central organization such as a government agency, private business, or not-for-profit entity.

      According to the Health Insurance Association of America, health insurance is defined as “coverage that provides for the payments of benefits as a result of sickness or injury. It includes insurance for losses from accident, medical expense, disability, or accidental death and dismemberment.”

The goal of this project is to predict insurance costs accurately. In addition, we want to know which factors affect the price the most.

Data Preparation

Library

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(DT)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(tidymodels)
## Warning: package 'tidymodels' was built under R version 4.0.3
## -- Attaching packages -------------------------------------- tidymodels 0.1.1 --
## v broom     0.7.0      v recipes   0.1.13
## v dials     0.0.9      v rsample   0.0.8 
## v infer     0.5.3      v tune      0.1.1 
## v modeldata 0.0.2      v workflows 0.2.1 
## v parsnip   0.1.4      v yardstick 0.0.7
## Warning: package 'dials' was built under R version 4.0.3
## Warning: package 'infer' was built under R version 4.0.3
## Warning: package 'modeldata' was built under R version 4.0.3
## Warning: package 'rsample' was built under R version 4.0.3
## Warning: package 'tune' was built under R version 4.0.3
## Warning: package 'workflows' was built under R version 4.0.3
## Warning: package 'yardstick' was built under R version 4.0.3
## -- Conflicts ----------------------------------------- tidymodels_conflicts() --
## x scales::discard() masks purrr::discard()
## x plotly::filter()  masks dplyr::filter(), stats::filter()
## x recipes::fixed()  masks stringr::fixed()
## x dplyr::lag()      masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step()   masks stats::step()
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
library(scales)
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(corrplot)
## corrplot 0.84 loaded
library(cowplot)
## 
## ********************************************************
## Note: As of version 1.0.0, cowplot does not change the
##   default ggplot2 theme anymore. To recover the previous
##   behavior, execute:
##   theme_set(theme_cowplot())
## ********************************************************
library(MLmetrics)
## 
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
## 
##     Recall

Dataset

This dataset has 7 variables:

  1. Age: age of primary beneficiary.

  2. Sex: insurance contractor gender, female, male.

  3. BMI: Body mass index, an objective index of body weight relative to height (kg/m^2), indicating whether weight is relatively high or low for a given height; ideally 18.5 to 24.9 (see the formula after this list).

  4. Children: Number of children covered by health insurance/Number of dependents.

  5. Smoker: Is the person a smoker or not.

  6. Region: the beneficiary’s residential area in the US, northeast, southeast, southwest, northwest.

  7. Charges: Individual medical costs billed by health insurance.
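For reference, the BMI mentioned in item 3 is computed as:

\[ BMI = \frac{\text{weight (kg)}}{\text{height (m)}^2} \]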

insurance <- read.csv("data_input/insurance.csv")
datatable(insurance, colnames = c('Age', 'Sex', 'BMI', 'Children', 'Smoker', 'Region', 'Charges'))

glimpse(insurance)
## Rows: 1,338
## Columns: 7
## $ age      <int> 19, 18, 28, 33, 32, 31, 46, 37, 37, 60, 25, 62, 23, 56, 27...
## $ sex      <chr> "female", "male", "male", "male", "male", "female", "femal...
## $ bmi      <dbl> 27.900, 33.770, 33.000, 22.705, 28.880, 25.740, 33.440, 27...
## $ children <int> 0, 1, 3, 0, 0, 0, 1, 3, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0...
## $ smoker   <chr> "yes", "no", "no", "no", "no", "no", "no", "no", "no", "no...
## $ region   <chr> "southwest", "southeast", "southeast", "northwest", "north...
## $ charges  <dbl> 16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.6...

The data has 1338 rows and 7 columns. We’re working with a rather small data set, but the interesting part is the charges variable, which is what we are going to try to predict.

insurance %>% 
  is.na() %>% 
  colSums()
##      age      sex      bmi children   smoker   region  charges 
##        0        0        0        0        0        0        0

There are no missing values.

Adjust Data Type

# convert the character columns (sex, smoker, region) and children to factors
insurance <- insurance %>% 
  mutate_if(is.character, as.factor) %>% 
  mutate(children = as.factor(children))
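To double-check that the conversion produced the intended column types, a quick optional check (not part of the original analysis) is:

# column classes after the type adjustment
sapply(insurance, class)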

Exploratory Data Analysis

Exploratory Data Analysis (EDA) is part of the data science process. EDA is very important before doing feature engineering and modeling because at this stage we have to understand the data first.

In statistics, dependence or association is any statistical relationship, whether causal or not, between two random variables or two sets of data. Correlation is any of a broad class of statistical relationships involving dependence, though in common usage it most often refers to the extent to which two variables have a linear relationship with each other.

Correlation plots are a great way of exploring data and seeing whether any pairs of numeric variables are strongly related.

corrplot(cor(select(insurance,-c(sex,
                                 children,
                                 smoker,
                                 region))), method = "number")

x <- ggplot(insurance, aes(age, charges)) +
  geom_jitter(color = "blue", alpha = 0.5) +
    theme_light()

y <- ggplot(insurance, aes(bmi, charges)) +
  geom_jitter(color = "green", alpha = 0.5) +
  theme_light()

p <- plot_grid(x, y) 
title <- ggdraw() + draw_label("1. Correlation between Charges and Age / BMI", fontface='bold')
plot_grid(title, p, ncol=1, rel_heights=c(0.1, 1))

x <- ggplot(insurance, aes(sex, charges)) +
  geom_jitter(aes(color = sex), alpha = 0.7) +
  theme_light()

y <- ggplot(insurance, aes(children, charges)) +
  geom_jitter(aes(color = children), alpha = 0.7) +
  theme_light()

p <- plot_grid(x, y) 
title <- ggdraw() + draw_label("2. Correlation between Charges and Sex / Children covered by insurance", fontface='bold')
plot_grid(title, p, ncol=1, rel_heights=c(0.1, 1))

x <- ggplot(insurance, aes(smoker, charges)) +
  geom_jitter(aes(color = smoker), alpha = 0.7) +
  theme_light()

y <- ggplot(insurance, aes(region, charges)) +
  geom_jitter(aes(color = region), alpha = 0.7) +
  theme_light()

p <- plot_grid(x, y) 
title <- ggdraw() + draw_label("3. Correlation between Charges and Smoker / Region", fontface='bold')
plot_grid(title, p, ncol=1, rel_heights=c(0.1, 1))

Modeling

Data Splitting

Before we make the model, we need to split the data into a train dataset and a test dataset. We will use the train dataset to train the linear regression model. The test dataset will be used as a comparison, to see whether the model overfits and cannot predict new data that has not been seen during the training phase. We will use 80% of the data as training data and the rest as testing data.

set.seed(123)
samplesize <- round(0.8 * nrow(insurance), 0)
index <- sample(nrow(insurance), size = samplesize)

data_train <- insurance[index, ]
data_test <- insurance[-index, ]
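Since tidymodels (which includes rsample) is already loaded, an equivalent split could also be produced with initial_split(). This is just an alternative sketch; the base-R indexing above is what the rest of the analysis uses.

# alternative split with rsample (not used further in this analysis)
set.seed(123)
insurance_split <- initial_split(insurance, prop = 0.8)
data_train_alt <- training(insurance_split)
data_test_alt <- testing(insurance_split)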

Linear Regression Model

set.seed(123)
insurance_lm <- lm(charges ~ ., data = insurance)

summary(insurance_lm)
## 
## Call:
## lm(formula = charges ~ ., data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11689.4  -2902.6   -943.7   1492.2  30042.7 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11927.17     993.66 -12.003  < 2e-16 ***
## age                257.19      11.91  21.587  < 2e-16 ***
## sexmale           -128.16     332.83  -0.385 0.700254    
## bmi                336.91      28.61  11.775  < 2e-16 ***
## children1          390.98     421.35   0.928 0.353619    
## children2         1635.78     466.67   3.505 0.000471 ***
## children3          964.34     548.10   1.759 0.078735 .  
## children4         2947.37    1239.16   2.379 0.017524 *  
## children5         1116.04    1456.02   0.767 0.443514    
## smokeryes        23836.41     414.14  57.557  < 2e-16 ***
## regionnorthwest   -380.04     476.56  -0.797 0.425318    
## regionsoutheast  -1033.14     479.14  -2.156 0.031245 *  
## regionsouthwest   -952.89     478.15  -1.993 0.046483 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6059 on 1325 degrees of freedom
## Multiple R-squared:  0.7519, Adjusted R-squared:  0.7497 
## F-statistic: 334.7 on 12 and 1325 DF,  p-value: < 2.2e-16

We get more coefficients than the number of variables we have. This is because categorical variables are converted into dummy variables. We can inspect how the children variable is turned into dummy variables using contrasts().

contrasts(x = insurance$children)
##   1 2 3 4 5
## 0 0 0 0 0 0
## 1 1 0 0 0 0
## 2 0 1 0 0 0
## 3 0 0 1 0 0
## 4 0 0 0 1 0
## 5 0 0 0 0 1

From the output above, it can be seen that the children variable is converted into dummy variables according to its levels. The way to read the dummy coding is: if children1, children2, children3, children4 and children5 are all 0, it means that the owner of the insurance has 0 children, in other words they do not have children.

The summary of the insurance_lm model shows a lot of information. Below is the meaning behind each section of summary().

  1. Call: This is an R feature that shows what function and parameters were used to create the model.

  2. Residuals: the differences between what the model predicted and the actual values of y. You can calculate the Residuals section like so:

# residuals = actual y minus fitted values
summary(insurance$charges - insurance_lm$fitted.values)

  3. Coefficients: These are the weights that minimize the sum of the squares of the errors.
  • Estimate is the coefficient of each variable.
  • Std. Error is the Residual Standard Error divided by the square root of the sum of squares of that particular x variable.
  • t value is the Estimate divided by the Std. Error.
  • Pr(>|t|) is the significance level of the variable in the model. If the value is below 0.05, we can safely assume that the variable has a significant effect on the model (meaning that the estimated coefficient is significantly different from 0), and vice versa. Thus, we can build a simpler model by removing variables with p-value > 0.05, since they do not have a significant effect on our model.

As an example of interpreting a coefficient: every 1-unit increase in bmi contributes to an increase in medical cost of about 336.91, holding the other variables constant (the full fitted equation is written out after this list).

  4. Residual Standard Error: the square root of the residual sum of squares divided by the residual degrees of freedom.

  5. Multiple R-Squared: also called the coefficient of determination, this is an oft-cited measurement of how well the model fits the data.

  6. Adjusted R-Squared: compensates for the addition of variables and only increases if a new predictor improves the model more than would be expected by chance; conversely, it decreases when a predictor improves the model less than expected by chance.

  7. The F-Statistic is a “global” test that checks whether at least one of the coefficients is nonzero. The reason for this test is that if you run multiple hypothesis tests (namely, on your coefficients), you are likely to include a variable that is not actually significant.
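Putting the coefficient table above together, the fitted equation of the full model is approximately (showing only the intercept, age, bmi and the smoker dummy; the sex, children and region dummies enter in the same additive way):

\[ \widehat{charges} \approx -11927.17 + 257.19 \cdot age + 336.91 \cdot bmi + 23836.41 \cdot smoker_{yes} + \ldots \]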

Simpler Model

insurance2 <- insurance %>% 
  select(age,
         bmi,
         smoker,
         charges)

data_train2 <- insurance2[index, ]
data_test2 <- insurance2[-index, ]

set.seed(123)
insurance_lm2 <- lm(charges ~ ., data = data_train2)

summary(insurance_lm2)
## 
## Call:
## lm(formula = charges ~ ., data = data_train2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12468.6  -3091.6   -920.8   1721.2  28807.8 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -11794.84    1075.13  -10.97   <2e-16 ***
## age            245.79      13.70   17.94   <2e-16 ***
## bmi            341.34      31.27   10.92   <2e-16 ***
## smokeryes    24156.36     460.79   52.42   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6179 on 1066 degrees of freedom
## Multiple R-squared:  0.7494, Adjusted R-squared:  0.7487 
## F-statistic:  1063 on 3 and 1066 DF,  p-value: < 2.2e-16

We have removed the non-significant variables. To see whether this affects our model, we can compare the Adjusted R-Squared values of the two models. The first model, with the complete set of variables, has an adjusted R-squared of 0.7497, meaning that it can explain 74.97% of the variance in the target variable (charges). Meanwhile, our simpler model has an adjusted R-squared of 0.7487, not much different from the first model. This shows that it is safe to remove the variables with non-significant coefficients.
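For a quick side-by-side comparison, both values can also be pulled directly from the model summaries (a minimal sketch using the two models fitted above):

# adjusted R-squared of the full model vs. the simpler model
c(full   = summary(insurance_lm)$adj.r.squared,
  simple = summary(insurance_lm2)$adj.r.squared)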

Model Evaluation

Model Performance

The performance of our model (how well it predicts the target variable) can be measured using the root mean squared error (RMSE). RMSE is preferred over MAE (mean absolute error) here because RMSE squares the differences between the actual and predicted values, so predictions with larger errors are penalized more heavily. This metric is often used to compare two or more alternative models, even though it is harder to interpret than MAE. We can use the RMSE() function from the MLmetrics package.
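For reference, the RMSE is defined as (the standard formula):

\[ RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2} \]

Below is the performance of the first model (with the complete set of variables).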

lm_pred <- predict(insurance_lm, newdata = data_test %>% select(-charges))

# RMSE of train dataset
RMSE(y_pred = insurance_lm$fitted.values, 
     y_true = data_train$charges)
## [1] 16320.54
# RMSE of test dataset
RMSE(y_pred = lm_pred, 
     y_true = data_test$charges)
## [1] 5662.34

Note that insurance_lm was fitted with data = insurance (the full data set, see the lm() call above), so its fitted values do not line up row by row with data_train; the train RMSE above should therefore be read with caution. Below is the performance of the second model (with the non-significant variables removed).

lm_pred2 <- predict(insurance_lm2, newdata = data_test2 %>% select(-charges))

# RMSE of train dataset
RMSE(y_pred = insurance_lm2$fitted.values, 
     y_true = data_train2$charges)
## [1] 6167.181
# RMSE of test dataset
RMSE(y_pred = lm_pred2, 
     y_true = data_test2$charges)
## [1] 5764.014

Assumptions

Linear regression is a parametric model, meaning that in order for the model equation to be valid, the model should follow several classical assumptions. A linear regression that does not follow the assumptions may be misleading, or simply have biased estimators. In this section, we only check the second model (the model with the non-significant variables removed).

  1. Linearity

The linear regression model assumes that there is a straight-line relationship between the predictors and the response. If the true relationship is far from linear, then virtually all of the conclusions that we draw from the fit are suspect. In addition, the prediction accuracy of the model can be significantly reduced.

Residual plots are a useful graphical tool for identifying non-linearity. If there is a pattern in the residual plot, it means that the model can still be improved or that it does not meet the linearity assumption. The plot shows the relationship between the residuals/errors and the predicted/fitted values.

resact <- data.frame(residual = insurance_lm2$residuals, fitted = insurance_lm2$fitted.values)

resact %>% ggplot(aes(fitted, residual)) + geom_point() + geom_smooth() + geom_hline(aes(yintercept = 0)) + 
    theme(panel.grid = element_blank(), panel.background = element_blank())
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

There is a pattern in the residuals: they become more negative as the fitted values increase, before rising again. This pattern indicates that our model may not be linear enough.

  2. Normality Test

The second assumption in linear regression is that the residuals follow a normal distribution. We can easily check this using the Shapiro-Wilk normality test.

shapiro.test(insurance_lm2$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  insurance_lm2$residuals
## W = 0.9087, p-value < 2.2e-16

The null hypothesis is that the residuals follow a normal distribution. With a p-value < 0.05, we reject the null hypothesis and conclude that our residuals do not follow a normal distribution.
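As a complementary visual check (a minimal sketch, not part of the original analysis), we could also plot the distribution of the residuals:

# histogram of the residuals of the simpler model
ggplot(data.frame(residual = insurance_lm2$residuals), aes(residual)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  geom_vline(xintercept = 0, linetype = "dashed") +
  theme_light()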

  3. Autocorrelation

The standard errors that are computed for the estimated regression coefficients or the fitted values are based on the assumption of uncorrelated error terms (no autocorrelation). If in fact there is correlation among the error terms, then the estimated standard errors will tend to underestimate the true standard errors. As a result, confidence and prediction intervals will be narrower than they should be. For example, a 95% confidence interval may in reality have a much lower probability than 0.95 of containing the true value of the parameter. In addition, p-values associated with the model will be lower than they should be; this could cause us to erroneously conclude that a parameter is statistically significant. In short, if the error terms are correlated, we may have an unwarranted sense of confidence in our model.

Autocorrelation can be detected using the Durbin-Watson test, with the null hypothesis that there is no autocorrelation.

durbinWatsonTest(insurance_lm2)
##  lag Autocorrelation D-W Statistic p-value
##    1      0.01695841      1.964946   0.578
##  Alternative hypothesis: rho != 0

The result shows that we fail to reject the null hypothesis (p-value = 0.578), meaning that our residuals have no autocorrelation.

  4. Heteroscedasticity

Heteroscedasticity means that the variances of the error terms are non-constant. One way to identify non-constant variance in the errors is the presence of a funnel shape in the residual plot, the same plot used for the linearity check.

bptest(insurance_lm2)
## 
##  studentized Breusch-Pagan test
## 
## data:  insurance_lm2
## BP = 90.332, df = 3, p-value < 2.2e-16
resact %>% ggplot(aes(fitted, residual)) + geom_point() + theme_light() + geom_hline(aes(yintercept = 0))

We can observe that at lower fitted values the residuals are concentrated around 0, and as the fitted values increase the residuals also get bigger. A second way to detect heteroscedasticity is the Breusch-Pagan test, with the null hypothesis that there is no heteroscedasticity. With a p-value < 0.05, we conclude that heteroscedasticity is present in our model.

  5. Multicollinearity

Multicollinearity means that there is correlation between the independent variables/predictors. To check for multicollinearity, we can measure the variance inflation factor (VIF). As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity.

vif(insurance_lm2)
##      age      bmi   smoker 
## 1.012672 1.011733 1.001029

Conclusion

The variables that are useful for explaining the variance in medical cost are age, bmi and smoker. All three predictors have VIF values close to 1, so multicollinearity is not a concern, and the residuals show no autocorrelation; however, the linearity, normality and homoscedasticity assumptions are not satisfied. The adjusted R-squared of the model is not too high: the model explains 74.87% of the variance in medical cost. The accuracy of the model in predicting charges is measured with RMSE: the training data has an RMSE of 6167.181 and the testing data has an RMSE of 5764.014, which are close to each other, indicating that the model does not overfit.

We have already learned how to build a linear regression model and what needs to be considered when building one. As a next step, we can still improve the model to reduce the error and to satisfy all of the classical assumptions.