Description of dataset

The dataset for our regression machine learning project consists of various attributes of individuals, including the total amount of medical expenses incurred over one year.

For this project we are going to estimate the total medical charges of an individual by using different models.

To do that, we use a set of individual characteristics contained in the Insurance dataset. It was taken from Kaggle and contains 1338 observations (rows) and 7 features (columns):

  • Age = age of individuals

  • Bmi = Body Mass Index, Weight(Kg)/(Height(m)²)

  • Children = number of children

  • Expenses = our target = Total medical expense charged to the plan for the calendar year

  • Sex = gender of individuals

  • Smoker = whether the insurance contractor is a smoker or not

  • Region = residential area in US: Northeast, Southeast, Northwest, Southwest

## Rows: 1,338
## Columns: 7
## $ age      <int> 19, 18, 28, 33, 32, 31, 46, 37, 37, 60, 25, 62, 23, 56, 27...
## $ sex      <chr> "female", "male", "male", "male", "male", "female", "femal...
## $ bmi      <dbl> 27.9, 33.8, 33.0, 22.7, 28.9, 25.7, 33.4, 27.7, 29.8, 25.8...
## $ children <int> 0, 1, 3, 0, 0, 0, 1, 3, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0...
## $ smoker   <chr> "yes", "no", "no", "no", "no", "no", "no", "no", "no", "no...
## $ region   <chr> "southwest", "southeast", "southeast", "northwest", "north...
## $ expenses <dbl> 16884.92, 1725.55, 4449.46, 21984.47, 3866.86, 3756.62, 82...

Analysis of the data

Insurance companies are extremely interested in predicting the future.

Accurate estimation offers the organization a chance to reduce its financial losses. Payment mistakes made by insurance providers when making claims are a significant cause of increased costs.

Thus, the purpose of this project is to create several models using different machine learning techniques to predict the future medical expenses of individuals, helping medical insurers decide how to charge premiums.

We examine how the machine learning algorithms we develop improve prediction accuracy in terms of Mean Squared Error, and how the importance of the subjects' characteristics changes across the models.

At the end of the analysis, we will try to predict the medical expenditure of 3 future customers.

Clean data and Factorization of variables

After making sure there are no missing values, we explore our target variable in order to detect outliers. Using the interquartile-range rule, we exclude the values that are considered outliers.
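A minimal sketch of these two checks, assuming the data are stored in a data frame called Insurance with the target column expenses (the helper name is_outlier is illustrative):

sum(is.na(Insurance))                              # count of missing values

q1  <- quantile(Insurance$expenses, 0.25)
q3  <- quantile(Insurance$expenses, 0.75)
iqr <- q3 - q1

# flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
is_outlier <- as.integer(Insurance$expenses < q1 - 1.5 * iqr |
                         Insurance$expenses > q3 + 1.5 * iqr)
table(is_outlier)                                  # 0 = kept, 1 = excluded

Insurance <- Insurance[is_outlier == 0, ]          # drop the outlying rows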

## [1] 0

##    0    1 
## 1199  139

As we can see from the structure shown above, there are 3 nominal features that can be converted into factors, with a numerical value assigned to each level.

Insurance$sex    <- as.factor(Insurance$sex)      # convert nominal features to factors
Insurance$smoker <- as.factor(Insurance$smoker)
Insurance$region <- as.factor(Insurance$region)
Insurance     <- Insurance %>% as_tibble()
Insurance_cor <- Insurance                        # copy kept for the correlation section

First view of variables

Target variables

The main continuous variable is Expenses, which represents the total medical expense.

This variable is shown below with histogram and density plots.
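A possible way to produce such a plot, assuming ggplot2 is loaded (the binwidth is arbitrary):

library(ggplot2)

ggplot(Insurance, aes(x = expenses)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30,
                 fill = "steelblue", alpha = 0.6) +
  geom_density(colour = "darkred") +
  labs(title = "Distribution of medical expenses",
       x = "Expenses", y = "Density")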

Continuous variables

We create a barplot to show the distribution of age.

The most frequent values are found among individuals under 20 years of age.

We show a histogram to illustrate the distribution of Body Mass Index.

Categorical variables

The barplots below show the different categories inside the factorial variables.

Correlation

The following plot shows that the variable smoker is the one most correlated with expenses.

Medical expenses are also highly correlated with age.

Training/test Insurance division

In order to split the data into train and test we used the function createDataPartition.

We are going to assign:

  • 70% to our training set, i.e. 840 observations
  • 30% to our test set, i.e. the remaining observations
set.seed(123)
training_obs <- createDataPartition(Insurance$expenses, 
                                    p = 0.7, 
                                    list = FALSE) 
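The index returned above can then be used to materialise the split, for instance (the raw-split object names here are illustrative):

Insurance.train.raw <- Insurance[training_obs, ]    # ~70% of the rows
Insurance.test.raw  <- Insurance[-training_obs, ]   # remaining ~30%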

Pre-processing: transformation of variables

Before starting to create a model, we are going to rescale our dataset.

To do so, we first split the data into continuous variables and factor variables.

Continuous variables

For the continuous-variable dataset, which we call Insurance2, we rebuild the train and test subsets.
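A sketch of how Insurance2 and its train/test parts could be built, assuming dplyr is loaded (the structure of the training part is printed below):

Insurance2 <- Insurance %>% dplyr::select(age, bmi, children, expenses)

Insurance.train2 <- Insurance2[training_obs, ]
Insurance.test2  <- Insurance2[-training_obs, ]
str(Insurance.train2)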

## tibble [840 x 4] (S3: tbl_df/tbl/data.frame)
##  $ age     : int [1:840] 18 33 37 37 25 62 23 19 52 23 ...
##  $ bmi     : num [1:840] 33.8 22.7 27.7 29.8 26.2 26.3 34.4 24.6 30.8 23.8 ...
##  $ children: int [1:840] 1 0 3 2 0 0 0 1 1 0 ...
##  $ expenses: num [1:840] 1726 21984 7282 6406 2721 ...

Factor variables

On the other hand, we need to transform the categorical variables into a numerical format (dummy variables).
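One possible encoding uses model.matrix to create 0/1 dummy variables; the resulting column names match the structure printed below (the object names are assumptions):

Insurance3 <- as.data.frame(
  model.matrix(~ sex + smoker + region, data = Insurance)[, -1]   # drop intercept
)

Insurance.train3 <- Insurance3[training_obs, ]
Insurance.test3  <- Insurance3[-training_obs, ]
str(Insurance.train3)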

## 'data.frame':    840 obs. of  5 variables:
##  $ sexmale        : num  1 1 0 1 1 0 1 1 0 1 ...
##  $ smokeryes      : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ regionnorthwest: num  0 1 1 0 0 0 0 0 0 0 ...
##  $ regionsoutheast: num  1 0 0 0 0 1 0 0 0 0 ...
##  $ regionsouthwest: num  0 0 0 0 0 0 1 1 0 0 ...

Finally, we merge the continuous and dummy-coded variables into a single tibble.

Insurance.train <- as_tibble(cbind(Insurance.train2, Insurance.train3))
Insurance.test <- as_tibble(cbind(Insurance.test2, Insurance.test3))
str(Insurance.train)
## tibble [840 x 9] (S3: tbl_df/tbl/data.frame)
##  $ age            : int [1:840] 18 33 37 37 25 62 23 19 52 23 ...
##  $ bmi            : num [1:840] 33.8 22.7 27.7 29.8 26.2 26.3 34.4 24.6 30.8 23.8 ...
##  $ children       : int [1:840] 1 0 3 2 0 0 0 1 1 0 ...
##  $ expenses       : num [1:840] 1726 21984 7282 6406 2721 ...
##  $ sexmale        : num [1:840] 1 1 0 1 1 0 1 1 0 1 ...
##  $ smokeryes      : num [1:840] 0 0 0 0 0 1 0 0 0 0 ...
##  $ regionnorthwest: num [1:840] 0 1 1 0 0 0 0 0 0 0 ...
##  $ regionsoutheast: num [1:840] 1 0 0 0 0 1 0 0 0 0 ...
##  $ regionsouthwest: num [1:840] 0 0 0 0 0 0 1 1 0 0 ...

Definition of Formula and train/test targets

We decide to keep all variables in our formula in order to create the models for our prediction.

We also define the targets on the train and test data created before.
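A plausible definition of the formula and of the targets, consistent with the objects used later in the report (model1.formula appears in the GBM fit, train_data and train_target in the CatBoost pool):

model1.formula <- expenses ~ age + bmi + children + sexmale + smokeryes +
  regionnorthwest + regionsoutheast + regionsouthwest

train_target <- Insurance.train$expenses
test_target  <- Insurance.test$expenses

train_data <- Insurance.train %>% dplyr::select(-expenses)
test_data  <- Insurance.test  %>% dplyr::select(-expenses)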

Boosting technique

In data science activities, boosting algorithms are one of the most commonly used algorithms.

Boosting strengthens machine learning models in order to increase their prediction accuracy.

To form a strong rule, it combines weak learners (also called base learners): each time the base learning algorithm is applied to a differently weighted distribution of the data, a new weak prediction rule is produced.

This is an iterative process.

After many iterations, the boosting algorithm combines these weak rules into a single strong prediction rule.

Catboost

The name CatBoost comes from two words: Category and Boosting.

The library works well with multiple categories of data, such as audio, text, and image data, as well as historical data.

"Boost" comes from the gradient boosting machine learning algorithm.

Gradient boosting is a powerful machine learning algorithm that is widely applied to multiple types of business challenges.

The main reasons we use CatBoost are that it is simple to use and efficient.


First of all, we build the pool object from the training dataset.

Then, we define the parameters used to create our model.

train_pool <- catboost.load_pool(data = train_data, label = train_target)

params <- list(iterations = 1000,
               learning_rate = 0.075,
               depth = 3,
               loss_function = 'RMSE',
               eval_metric = 'RMSE',
               random_seed = 560,
               od_type = 'Iter',
               metric_period = 50,
               od_wait = 10,
               use_best_model = F)

Now we are able to use catboost.train() function.

model <- catboost.train(train_pool, params = params)
## 0:   learn: 6967.8796647 total: 158ms    remaining: 2m 38s
## 50:  learn: 4192.6268491 total: 183ms    remaining: 3.4s
## 100: learn: 4003.1411708 total: 204ms    remaining: 1.82s
## 150: learn: 3850.1678379 total: 225ms    remaining: 1.26s
## 200: learn: 3734.7629777 total: 251ms    remaining: 996ms
## 250: learn: 3638.0830109 total: 272ms    remaining: 810ms
## 300: learn: 3551.6747193 total: 293ms    remaining: 681ms
## 350: learn: 3480.0781804 total: 316ms    remaining: 584ms
## 400: learn: 3408.5103516 total: 338ms    remaining: 505ms
## 450: learn: 3346.2493683 total: 359ms    remaining: 437ms
## 500: learn: 3295.3611096 total: 384ms    remaining: 383ms
## 550: learn: 3250.8091025 total: 405ms    remaining: 330ms
## 600: learn: 3212.0174926 total: 426ms    remaining: 283ms
## 650: learn: 3163.5060664 total: 450ms    remaining: 241ms
## 700: learn: 3107.7856513 total: 472ms    remaining: 201ms
## 750: learn: 3068.2684286 total: 493ms    remaining: 164ms
## 800: learn: 3032.9748026 total: 515ms    remaining: 128ms
## 850: learn: 2987.7922286 total: 538ms    remaining: 94.3ms
## 900: learn: 2947.9912981 total: 559ms    remaining: 61.4ms
## 950: learn: 2913.9574050 total: 581ms    remaining: 30ms
## 999: learn: 2879.4208732 total: 604ms    remaining: 0us

Finally, we are ready to compare the predictions with the actual values on the test dataset.
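A sketch of how the test predictions and the metrics below could be computed (catboost.predict is the prediction function of the catboost R package; the metric helpers are written out explicitly and their names are illustrative):

test_pool <- catboost.load_pool(data = test_data)
pred_cat  <- catboost.predict(model, test_pool)

mse_cat <- mean((test_target - pred_cat)^2)
r2_cat  <- 1 - sum((test_target - pred_cat)^2) /
               sum((test_target - mean(test_target))^2)
print(paste("MSE is:", round(mse_cat, 2)))
print(paste("R-Squared is:", round(r2_cat, 4)))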

The predictions follow the actual values roughly linearly, although some observations are over- or under-fitted by the model.

The Mean Squared Error and R-squared are shown below:

## [1] "MSE is: 23278041.53"
## [1] "R-Squared is: 0.552"

Let’s now analyse the variable importance plot.
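One way to extract the values behind such a plot, assuming catboost.get_feature_importance from the catboost R package:

importance_cat <- catboost.get_feature_importance(model, pool = train_pool)
importance_cat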

It is easy to see that the most important variable is smokeryes, i.e. whether the insurance contractor is a smoker.

Gradient Boosting Machine

Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.

We look for the optimal parameters for our model, so we create a tuning grid that is applied to our GBM model (a sketch of the tuning call follows the list below). As we can see from the output, the selected values are:

  • n.trees = 500,
  • interaction.depth = 3,
  • shrinkage = 0.01
  • n.minobsinnode = 5
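A sketch of how this tuning could be run with caret (object names such as gbm.grid are illustrative; the grid reproduces the 48 combinations listed in the output below, and caret's default bootstrap resampling matches the 25 repetitions reported):

library(caret)

gbm.grid <- expand.grid(shrinkage         = c(0.01, 0.1),
                        interaction.depth = c(3, 7),
                        n.minobsinnode    = c(5, 7, 10, 15),
                        n.trees           = c(100, 500, 1000))

set.seed(123)
Insurance.gbm.tuned <- train(model1.formula,
                             data     = Insurance.train,
                             method   = "gbm",
                             tuneGrid = gbm.grid,
                             verbose  = FALSE)
Insurance.gbm.tuned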
## Stochastic Gradient Boosting 
## 
## 840 samples
##   8 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 840, 840, 840, 840, 840, 840, ... 
## Resampling results across tuning parameters:
## 
##   shrinkage  interaction.depth  n.minobsinnode  n.trees  RMSE      Rsquared 
##   0.01       3                   5               100     4959.746  0.6036794
##   0.01       3                   5               500     4303.580  0.6383572
##   0.01       3                   5              1000     4352.506  0.6292665
##   0.01       3                   7               100     4965.604  0.6003298
##   0.01       3                   7               500     4328.052  0.6339811
##   0.01       3                   7              1000     4371.553  0.6262249
##   0.01       3                  10               100     4969.368  0.5977144
##   0.01       3                  10               500     4364.088  0.6277912
##   0.01       3                  10              1000     4409.682  0.6202580
##   0.01       3                  15               100     4970.217  0.5959084
##   0.01       3                  15               500     4398.804  0.6219976
##   0.01       3                  15              1000     4428.930  0.6175659
##   0.01       7                   5               100     4801.145  0.6330853
##   0.01       7                   5               500     4327.434  0.6329325
##   0.01       7                   5              1000     4461.238  0.6134293
##   0.01       7                   7               100     4811.889  0.6289383
##   0.01       7                   7               500     4355.383  0.6284763
##   0.01       7                   7              1000     4481.095  0.6100913
##   0.01       7                  10               100     4832.496  0.6221768
##   0.01       7                  10               500     4395.304  0.6221102
##   0.01       7                  10              1000     4507.997  0.6059607
##   0.01       7                  15               100     4853.676  0.6138957
##   0.01       7                  15               500     4427.314  0.6171570
##   0.01       7                  15              1000     4517.725  0.6044803
##   0.10       3                   5               100     4394.436  0.6226106
##   0.10       3                   5               500     4725.962  0.5759602
##   0.10       3                   5              1000     4896.720  0.5532574
##   0.10       3                   7               100     4413.035  0.6195957
##   0.10       3                   7               500     4741.269  0.5725511
##   0.10       3                   7              1000     4912.335  0.5495099
##   0.10       3                  10               100     4429.947  0.6175066
##   0.10       3                  10               500     4717.013  0.5773895
##   0.10       3                  10              1000     4879.575  0.5557203
##   0.10       3                  15               100     4468.327  0.6113256
##   0.10       3                  15               500     4695.503  0.5793620
##   0.10       3                  15              1000     4855.344  0.5574418
##   0.10       7                   5               100     4536.568  0.6019040
##   0.10       7                   5               500     4918.567  0.5505316
##   0.10       7                   5              1000     5099.137  0.5274952
##   0.10       7                   7               100     4555.910  0.5993252
##   0.10       7                   7               500     4926.964  0.5489927
##   0.10       7                   7              1000     5118.336  0.5239142
##   0.10       7                  10               100     4547.307  0.6003118
##   0.10       7                  10               500     4924.214  0.5489045
##   0.10       7                  10              1000     5121.154  0.5241119
##   0.10       7                  15               100     4544.352  0.6008735
##   0.10       7                  15               500     4892.240  0.5534057
##   0.10       7                  15              1000     5094.395  0.5279048
##   MAE     
##   3424.435
##   2498.249
##   2476.415
##   3419.830
##   2520.905
##   2497.755
##   3419.446
##   2540.164
##   2528.833
##   3412.284
##   2578.518
##   2568.821
##   3312.765
##   2442.944
##   2524.449
##   3311.658
##   2468.153
##   2556.214
##   3313.960
##   2503.867
##   2601.695
##   3311.391
##   2556.999
##   2653.283
##   2518.066
##   2823.696
##   3015.259
##   2538.535
##   2861.464
##   3058.019
##   2550.029
##   2876.314
##   3061.384
##   2613.351
##   2875.338
##   3057.908
##   2597.807
##   3019.066
##   3191.808
##   2642.162
##   3064.963
##   3250.070
##   2662.824
##   3086.041
##   3284.989
##   2710.620
##   3102.520
##   3305.208
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 500, interaction.depth =
##  3, shrinkage = 0.01 and n.minobsinnode = 5.

Now we apply the best parameters and fit the GBM model again.

set.seed(123)
Insurance.gbm <- gbm(model1.formula,
                   data = Insurance.train,
                   distribution = "gaussian",
                   n.trees = 500,
                   interaction.depth = 3,
                   shrinkage = 0.01,
                   n.minobsinnode = 5,
                   verbose = F)

Now we can see how well the predictions fit.
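A sketch of the GBM test-set prediction and of the metrics reported below (helper names are illustrative):

pred_gbm <- predict(Insurance.gbm, newdata = Insurance.test, n.trees = 500)

mse_gbm <- mean((test_target - pred_gbm)^2)
r2_gbm  <- 1 - sum((test_target - pred_gbm)^2) /
               sum((test_target - mean(test_target))^2)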

The Gradient Boosting Machine Mean Squared Error and R-squared are shown below:

## [1] "MSE is: 20285845.51"
## [1] "R-Squared is: 0.6096"

Let’s now analyse the variable importance plot.

It is easy to see that the most important variable is again smokeryes.

Compared to the CatBoost model, the variable age appears more important here.

Then we compare the Mean Squared Error between Catboost and GBM.

Mean Squared Error
Catboost 23278042
GBM 20285846

In terms of MSE, the model built with the GBM technique performs better than the one built with CatBoost.

Linear model

Before proceeding with neural networks, we decide to fit a classical linear model.

Our dependent variable remains the same: expenses, which stands for the medical charges.

We calculate predictions with the linear model and plot them.
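A plausible fit and prediction for this step (object names are assumptions):

Insurance.lm <- lm(model1.formula, data = Insurance.train)
pred_lm      <- predict(Insurance.lm, newdata = Insurance.test)

mse_lm <- mean((test_target - pred_lm)^2)
r2_lm  <- 1 - sum((test_target - pred_lm)^2) /
              sum((test_target - mean(test_target))^2)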

The linear model's Mean Squared Error and R-squared are shown below:

## [1] "MSE is: 23155410.89"
## [1] "R-Squared is: 0.5544"

Let’s now analyse the variable importance plot.

The importance is more evenly spread across all variables:

smokeryes remains the most important, although its value decreases.

Neural Network

Neural networks are a set of algorithms, loosely modeled after the human brain, designed to recognize patterns.

The patterns they recognize are numerical, contained in vectors, to which all data in the real world must be translated.

A stacked neural network is a network composed of several layers.

The layers are made of nodes.

A node combines data input with a set of coefficients or weights that either amplifies or dampens that input, thereby assigning significance to inputs for the task that the algorithm is trying to learn.

The output of each layer is at the same time the input of the next layer, starting from the initial input layer that receives the data.

Insurance scaled data

In order to scale all variables, we use the maximum and minimum values of each variable (min-max scaling).
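A sketch of this min-max scaling (the scaled object names are assumptions; the training structure is printed below):

maxs <- apply(Insurance.train, 2, max)
mins <- apply(Insurance.train, 2, min)

Insurance.train.scaled <- as.data.frame(scale(Insurance.train,
                                              center = mins,
                                              scale  = maxs - mins))
Insurance.test.scaled  <- as.data.frame(scale(Insurance.test,
                                              center = mins,
                                              scale  = maxs - mins))
str(Insurance.train.scaled)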

## 'data.frame':    840 obs. of  9 variables:
##  $ age            : num  0 0.326 0.413 0.413 0.152 ...
##  $ bmi            : num  0.48 0.181 0.315 0.372 0.275 ...
##  $ children       : num  0.2 0 0.6 0.4 0 0 0 0.2 0.2 0 ...
##  $ expenses       : num  0.0181 0.6255 0.1847 0.1585 0.048 ...
##  $ sexmale        : num  1 1 0 1 1 0 1 1 0 1 ...
##  $ smokeryes      : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ regionnorthwest: num  0 1 1 0 0 0 0 0 0 0 ...
##  $ regionsoutheast: num  1 0 0 0 0 1 0 0 0 0 ...
##  $ regionsouthwest: num  0 0 0 0 0 0 1 1 0 0 ...

Using the function nnet with the same seed chosen before, we create the model with 1 hidden neuron.
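A sketch of this fit on the scaled training data built above (linout = TRUE because this is a regression problem; with 8 inputs and 1 hidden neuron the network has the 11 weights reported in the output below):

library(nnet)

set.seed(123)
Insurance.nn1 <- nnet(model1.formula,
                      data   = Insurance.train.scaled,
                      size   = 1,
                      linout = TRUE)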

The parts of the plot below are classified as follows:

  • l = layer, h = hidden, b = bias, o = output
  • each layer line represents a connection between layers and its weight
  • each bias line represents the bias of the model at that step
  • green stands for positive values, red for negative values
  • the thickness increases as the absolute value of the weight increases
## # weights:  11
## initial  value 49.490228 
## iter  10 value 16.321008
## iter  20 value 14.841199
## iter  30 value 14.832719
## iter  40 value 14.827985
## iter  50 value 14.821719
## final  value 14.821718 
## converged

We unscale the predictions in order to compute the MSE and R-squared, and plot them.
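A sketch of the unscaling step, using the training minimum and maximum of the target (helper names are illustrative):

pred_nn1_scaled <- predict(Insurance.nn1, Insurance.test.scaled)
pred_nn1 <- pred_nn1_scaled * (maxs["expenses"] - mins["expenses"]) +
            mins["expenses"]

mse_nn1 <- mean((test_target - pred_nn1)^2)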

The neural network Mean Squared Error and R-squared are shown below:

## [1] "MSE is: 22706292.9348"
## [1] "R-Squared is: 0.563"

Then we analyse the variable importance plot.

The difference, compared to the previous models, is the further growth in importance of smokeryes and age.

Now let’s try to increase the number of hidden neurons to 2.

## # weights:  21
## initial  value 71.132800 
## iter  10 value 15.850134
## iter  20 value 14.450364
## iter  30 value 14.306039
## iter  40 value 14.064247
## iter  50 value 13.991688
## iter  60 value 13.837119
## iter  70 value 13.728204
## iter  80 value 13.693852
## iter  90 value 13.680662
## iter 100 value 13.674768
## final  value 13.674768 
## stopped after 100 iterations

As before, we unscale the predictions in order to compute the MSE and R-squared, and plot them.

The neural network Mean Squared Error and R-squared are shown below:

## [1] "MSE is: 21987377.75"
## [1] "R-Squared is: 0.5769"

The table below compares the linear model and the neural network models:

Mean Squared Error
lm 23155411
nn1 22706293
nn2 21981012

The best model is the neural network with 2 hidden neurons.

Increasing the number of hidden neurons reduces the MSE.

Then we analyse the variable importance plot for the neural network model with 2 hidden neurons.

It is easy to see that the most important variable becomes Body Mass Index, followed by smokeryes.

Cross validation

Cross-validation is a procedure of resampling used on a limited data sample to evaluate machine learning models.

The procedure has a single parameter, k, that refers to the number of groups into which a given data sample is divided.

We decide to resample with k = 10 in order to test our linear model and the neural network with 1 hidden neuron.

Linear model

We use the cv.glm() function with K = 10 and calculate the cross-validation error.
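A sketch of this computation with boot::cv.glm (cv.glm requires a glm object; with the default gaussian family it is equivalent to the linear model fitted above):

library(boot)

set.seed(123)
Insurance.glm <- glm(model1.formula, data = Insurance.train, family = gaussian)
cv.lm.error   <- cv.glm(Insurance.train, Insurance.glm, K = 10)$delta[1]
cv.lm.error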

## [1] 20264583

Neural network

On the other hand, the situation becomes more involved for 10-fold cross-validation of the neural network.

We choose 1 hidden neuron for practical reasons.

In the first step we create an empty error vector.

Then, we iterate a for loop that repeats the following steps for each fold (a sketch of the loop is shown after the list):

  • scale
  • train
  • predict
  • rescale
  • calculate the error
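A sketch of this loop (the fold assignment and the object names are illustrative; cv.nn.errors is the vector shown in the boxplot section below):

set.seed(123)
k     <- 10
folds <- sample(rep(1:k, length.out = nrow(Insurance.train)))
cv.nn.errors <- numeric(k)

for (i in 1:k) {
  cv.train <- Insurance.train[folds != i, ]
  cv.test  <- Insurance.train[folds == i, ]

  # scale with the fold-specific minima and maxima
  mins_i <- apply(cv.train, 2, min)
  maxs_i <- apply(cv.train, 2, max)
  cv.train.scaled <- as.data.frame(scale(cv.train, center = mins_i,
                                         scale = maxs_i - mins_i))
  cv.test.scaled  <- as.data.frame(scale(cv.test, center = mins_i,
                                         scale = maxs_i - mins_i))

  # train a 1-hidden-neuron network and predict on the held-out fold
  nn.i   <- nnet(model1.formula, data = cv.train.scaled,
                 size = 1, linout = TRUE, trace = FALSE)
  pred.i <- predict(nn.i, cv.test.scaled)

  # rescale the prediction and store the fold error
  pred.i <- pred.i * (maxs_i["expenses"] - mins_i["expenses"]) + mins_i["expenses"]
  cv.nn.errors[i] <- mean((cv.test$expenses - pred.i)^2)
}

mean(cv.nn.errors)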

Here is the mean of the MSE values:

## [1] 21206647

Cross validation Neural Network Boxplot

Then, we show the set of all cv.nn.errors and represent them in a boxplot.

##  [1] 47174377 23237290 22927357 26781945  6446507  9733600 33914802 18411697
##  [9] 21407620  2031271

In conclusion, we can compare the two CV prediction errors.

CV prediction error
cv.lm 20264583
cv.nn1 21206647

Thanks to cross-validation resampling, we can see that the linear model fits better than the neural network with 1 hidden neuron.

Conclusion

To conclude, we present a table containing all the Mean Squared Error and R-squared values.

From this table we can see that the smallest MSE (and the highest R-squared) is provided by the Gradient Boosting Machine model.

Therefore, we recommend this model for predicting the future medical expenses of individuals, allowing the company to make decisions on charging premiums.

Simulated Prediction

In order to try our expense prediction, we apply our model to 3 simulated new customers.
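A sketch of how these three profiles could be encoded and scored with the recommended GBM model (the dummy columns follow the training layout; the data frame name is illustrative):

new_customers <- data.frame(
  age             = c(24, 42, 76),
  bmi             = c(22.6, 20.5, 23.7),
  children        = c(0, 3, 2),
  sexmale         = c(1, 0, 1),
  smokeryes       = c(1, 0, 0),
  regionnorthwest = c(1, 0, 0),
  regionsoutheast = c(0, 0, 1),
  regionsouthwest = c(0, 1, 0),
  row.names       = c("Marcus", "Matilda", "Stefan")
)

predict(Insurance.gbm, newdata = new_customers, n.trees = 500)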

We can see that:

  • The first man, Marcus, despite being young, has high charges because he is a smoker.
  • The woman, Matilda, has quite small charges because she is not of an advanced age and she does not smoke.
  • The last man, Stefan, has charges slightly lower than Marcus's because, even though he is elderly, he is not a smoker.

Marcus

  • Age = 24
  • Body Mass Index = 22.6
  • No children
  • Male
  • Smoker
  • Region = North-West
## [1] "Health care charges for Marcus: 17997.98"

Matilda

  • Age = 42
  • Body Mass Index = 20.5
  • Number of children = 3
  • Female
  • No Smoker
  • Region = South-West
## [1] "Health care charges for Maltilda: 7605.43"

Stefan

  • Age = 76
  • Body Mass Index = 23.7
  • Number of children = 2
  • Male
  • No Smoker
  • Region = South-East
## [1] "Health care charges for Stefan: 16244.46"