The dataset for our regression machine learning project consists of various attributes of individuals, including the total medical expenses incurred in one year.
For this project we estimate the total medical charges of an individual using different models.
To do that, we use a set of individual characteristics contained in the Insurance dataset. It is taken from Kaggle and contains 1,338 observations (rows) and 7 features (columns):
Age = age of the individual
Bmi = Body Mass Index, Weight(kg)/Height(m)²
Children = number of children
Expenses = our target = total medical expenses charged to the plan for the calendar year
Sex = gender of the individual
Smoker = whether the insurance contractor is a smoker or not
Region = residential area in the US: Northeast, Southeast, Northwest, Southwest
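A minimal sketch of how the data might be loaded and inspected (the file name insurance.csv is an assumption):
library(dplyr)

# Read the Kaggle insurance data and take a first look
Insurance <- read.csv("insurance.csv")
glimpse(Insurance)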
## Rows: 1,338
## Columns: 7
## $ age <int> 19, 18, 28, 33, 32, 31, 46, 37, 37, 60, 25, 62, 23, 56, 27...
## $ sex <chr> "female", "male", "male", "male", "male", "female", "femal...
## $ bmi <dbl> 27.9, 33.8, 33.0, 22.7, 28.9, 25.7, 33.4, 27.7, 29.8, 25.8...
## $ children <int> 0, 1, 3, 0, 0, 0, 1, 3, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0...
## $ smoker <chr> "yes", "no", "no", "no", "no", "no", "no", "no", "no", "no...
## $ region <chr> "southwest", "southeast", "southeast", "northwest", "north...
## $ expenses <dbl> 16884.92, 1725.55, 4449.46, 21984.47, 3866.86, 3756.62, 82...
Insurance companies are extremely interested in predicting future costs.
Accurate estimation offers the organization a chance to reduce its financial losses. Payment mistakes made by insurance providers when settling claims are a significant cause of increased costs.
Thus, the purpose of this project is to create several models using different machine learning techniques in order to predict the future medical expenses of individuals, helping medical insurers decide how to charge premiums.
We examine how the machine learning algorithms we develop improve prediction accuracy in terms of Mean Squared Error, and how the importance of the subjects' characteristics changes across models.
At the end of the analysis, we predict the medical expenditure of 3 future customers.
After making sure there are no missing values, we explore our target variable to detect outliers. Using the interquartile range (IQR) rule, we exclude the values that are considered outliers.
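A sketch of these checks, assuming the data frame is named Insurance: the first printed value below is the number of missing values, and the table counts regular observations (0) versus outliers (1).
# Count missing values (0 in the output below)
sum(is.na(Insurance))

# Flag expenses outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] and drop them
q1 <- quantile(Insurance$expenses, 0.25)
q3 <- quantile(Insurance$expenses, 0.75)
iqr <- q3 - q1
outlier <- Insurance$expenses < (q1 - 1.5 * iqr) | Insurance$expenses > (q3 + 1.5 * iqr)
table(as.integer(outlier))        # 1199 regular observations, 139 outliers
Insurance <- Insurance[!outlier, ]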
## [1] 0
## 0 1
## 1199 139
As we can see from the structure shown above, there are 3 nominal features that can be converted into factors, with a numerical value assigned to each level.
The main continuous variable is Expenses, which represents the total medical expense.
This variable is shown below with histogram and density plots.
We create a barplot in order to show the distribution of age.
The most frequent values occur for ages under 20.
We show a histogram to illustrate the distribution of Body Mass Index.
The barplots below show the different categories inside the factor variables.
The following plot shows that the variable smoker is the one most correlated with expenses.
Medical expenses are also highly correlated with age.
In order to split the data into train and test sets, we use the createDataPartition() function.
We are going to assign 70% of the observations to the training set and the remaining 30% to the test set.
Before starting to create any model, we need to reshape our dataset.
In fact, we have to split the data into continuous variables and factor variables.
For the continuous-variables dataset, which we call Insurance2, we again build train and test sets.
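A sketch of this step with caret (the 70/30 proportion matches the sample sizes reported below; the seed and the exact call are assumptions):
library(caret)
library(dplyr)
library(tibble)

set.seed(123)
train.index <- createDataPartition(Insurance$expenses, p = 0.7, list = FALSE)

# Continuous variables only
Insurance2 <- as_tibble(Insurance) %>% select(age, bmi, children, expenses)
Insurance.train2 <- Insurance2[train.index, ]
Insurance.test2  <- Insurance2[-train.index, ]
str(Insurance.train2)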
## tibble [840 x 4] (S3: tbl_df/tbl/data.frame)
## $ age : int [1:840] 18 33 37 37 25 62 23 19 52 23 ...
## $ bmi : num [1:840] 33.8 22.7 27.7 29.8 26.2 26.3 34.4 24.6 30.8 23.8 ...
## $ children: int [1:840] 1 0 3 2 0 0 0 1 1 0 ...
## $ expenses: num [1:840] 1726 21984 7282 6406 2721 ...
On the other hand, we need to transform the categorical dataset into a numerical format (dummy variables).
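One possible way to build the dummy variables is model.matrix(); whether the report used this or caret's dummyVars() is not shown, so this is only a sketch:
# Dummy-encode the categorical variables (the reference level is dropped)
Insurance3 <- as.data.frame(model.matrix(~ sex + smoker + region, data = Insurance))[, -1]
Insurance.train3 <- Insurance3[train.index, ]
Insurance.test3  <- Insurance3[-train.index, ]
str(Insurance.train3)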
## 'data.frame': 840 obs. of 5 variables:
## $ sexmale : num 1 1 0 1 1 0 1 1 0 1 ...
## $ smokeryes : num 0 0 0 0 0 1 0 0 0 0 ...
## $ regionnorthwest: num 0 1 1 0 0 0 0 0 0 0 ...
## $ regionsoutheast: num 1 0 0 0 0 1 0 0 0 0 ...
## $ regionsouthwest: num 0 0 0 0 0 0 1 1 0 0 ...
Finally, we merge the continuous and discrete variables into a single tibble.
Insurance.train <- as_tibble(cbind(Insurance.train2, Insurance.train3))
Insurance.test <- as_tibble(cbind(Insurance.test2, Insurance.test3))
str(Insurance.train)
## tibble [840 x 9] (S3: tbl_df/tbl/data.frame)
## $ age : int [1:840] 18 33 37 37 25 62 23 19 52 23 ...
## $ bmi : num [1:840] 33.8 22.7 27.7 29.8 26.2 26.3 34.4 24.6 30.8 23.8 ...
## $ children : int [1:840] 1 0 3 2 0 0 0 1 1 0 ...
## $ expenses : num [1:840] 1726 21984 7282 6406 2721 ...
## $ sexmale : num [1:840] 1 1 0 1 1 0 1 1 0 1 ...
## $ smokeryes : num [1:840] 0 0 0 0 0 1 0 0 0 0 ...
## $ regionnorthwest: num [1:840] 0 1 1 0 0 0 0 0 0 0 ...
## $ regionsoutheast: num [1:840] 1 0 0 0 0 1 0 0 0 0 ...
## $ regionsouthwest: num [1:840] 0 0 0 0 0 0 1 1 0 0 ...
We decide to keep all variables in our formula in order to create the models for our prediction.
Moreover, we define the targets on the train and test data created before.
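A sketch of these definitions (model1.formula, train_data and train_target reappear later in the report; test_data and test_target are assumed analogues):
# Formula with all predictors
model1.formula <- expenses ~ .

# Feature matrices and targets
train_target <- Insurance.train$expenses
test_target  <- Insurance.test$expenses
train_data   <- Insurance.train %>% select(-expenses)
test_data    <- Insurance.test %>% select(-expenses)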
In data science activities, boosting algorithms are one of the most commonly used algorithms.
Boosting strengthens machine learning models in order to increase their prediction accuracy.
To form a strong rule, it combines weak learners (also called base learners): at each step, the base learning algorithm is applied to a different distribution of the data and finds a new weak rule, so every time the simple learning algorithm is used, a new weak prediction rule is produced.
This is an iterative process.
After many iterations, the boosting algorithm combines these weak rules into a single strong prediction rule.
The CatBoost name comes from two words: Category and Boosting.
The library works well with multiple categories of data, such as audio, text and images, including historical data.
Boost comes from the gradient boosting machine learning algorithm.
Gradient boosting is a powerful machine learning algorithm that is widely applied to multiple types of business challenges.
The main reasons we use CatBoost are that it is simple to use and efficient.
First of all, we build the pool object with the training dataset.
Then, we define the parameters used to create our model.
train_pool <- catboost.load_pool(data = train_data, label = train_target)
params <- list(iterations = 1000,
learning_rate = 0.075,
depth = 3,
loss_function = 'RMSE',
eval_metric = 'RMSE',
random_seed = 560,
od_type = 'Iter',
metric_period = 50,
od_wait = 10,
use_best_model = F)
Now we are able to use the catboost.train() function.
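A sketch of the training call (the model object name Insurance.cat is an assumption):
Insurance.cat <- catboost.train(learn_pool = train_pool, params = params)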
## 0: learn: 6967.8796647 total: 158ms remaining: 2m 38s
## 50: learn: 4192.6268491 total: 183ms remaining: 3.4s
## 100: learn: 4003.1411708 total: 204ms remaining: 1.82s
## 150: learn: 3850.1678379 total: 225ms remaining: 1.26s
## 200: learn: 3734.7629777 total: 251ms remaining: 996ms
## 250: learn: 3638.0830109 total: 272ms remaining: 810ms
## 300: learn: 3551.6747193 total: 293ms remaining: 681ms
## 350: learn: 3480.0781804 total: 316ms remaining: 584ms
## 400: learn: 3408.5103516 total: 338ms remaining: 505ms
## 450: learn: 3346.2493683 total: 359ms remaining: 437ms
## 500: learn: 3295.3611096 total: 384ms remaining: 383ms
## 550: learn: 3250.8091025 total: 405ms remaining: 330ms
## 600: learn: 3212.0174926 total: 426ms remaining: 283ms
## 650: learn: 3163.5060664 total: 450ms remaining: 241ms
## 700: learn: 3107.7856513 total: 472ms remaining: 201ms
## 750: learn: 3068.2684286 total: 493ms remaining: 164ms
## 800: learn: 3032.9748026 total: 515ms remaining: 128ms
## 850: learn: 2987.7922286 total: 538ms remaining: 94.3ms
## 900: learn: 2947.9912981 total: 559ms remaining: 61.4ms
## 950: learn: 2913.9574050 total: 581ms remaining: 30ms
## 999: learn: 2879.4208732 total: 604ms remaining: 0us
Finally, we are ready to compare the predictions with the actual values in the test dataset.
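A sketch of how the predictions and the metrics below might be computed (the object names are assumptions):
# Predict on the test pool and compute MSE / R-squared
test_pool <- catboost.load_pool(data = test_data, label = test_target)
pred.cat  <- catboost.predict(Insurance.cat, test_pool)

mse.cat <- mean((test_target - pred.cat)^2)
r2.cat  <- 1 - sum((test_target - pred.cat)^2) / sum((test_target - mean(test_target))^2)
paste("MSE is:", round(mse.cat, 2))
paste("R-Squared is:", round(r2.cat, 4))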
The predictions appear fairly linear with respect to the actual values, although some observations are over- or under-estimated by the model.
Here are the Mean Squared Error and R-squared:
## [1] "MSE is: 23278041.53"
## [1] "R-Squared is: 0.552"
Let’s now analyse the variable importance plot.
It is easy to see that the most important variable is smokeryes, i.e. whether the insurance contractor is a smoker.
Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.
We look for the optimal parameters for our model, so we create a tuning grid to be applied to our GBM model, as sketched below.
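A minimal sketch of the grid search with caret (the grid values are read off the resampling output below; the object names gbm.grid and Insurance.gbm.tune and the seed are assumptions):
library(caret)

set.seed(123)
gbm.grid <- expand.grid(n.trees = c(100, 500, 1000),
                        interaction.depth = c(3, 7),
                        shrinkage = c(0.01, 0.1),
                        n.minobsinnode = c(5, 7, 10, 15))

# Default resampling in caret is 25 bootstrap repetitions, as in the output below
Insurance.gbm.tune <- train(model1.formula,
                            data = Insurance.train,
                            method = "gbm",
                            tuneGrid = gbm.grid,
                            verbose = FALSE)
Insurance.gbm.tune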
As we can see from the output below, the best combination is n.trees = 500, interaction.depth = 3, shrinkage = 0.01 and n.minobsinnode = 5.
## Stochastic Gradient Boosting
##
## 840 samples
## 8 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 840, 840, 840, 840, 840, 840, ...
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.minobsinnode n.trees RMSE Rsquared
## 0.01 3 5 100 4959.746 0.6036794
## 0.01 3 5 500 4303.580 0.6383572
## 0.01 3 5 1000 4352.506 0.6292665
## 0.01 3 7 100 4965.604 0.6003298
## 0.01 3 7 500 4328.052 0.6339811
## 0.01 3 7 1000 4371.553 0.6262249
## 0.01 3 10 100 4969.368 0.5977144
## 0.01 3 10 500 4364.088 0.6277912
## 0.01 3 10 1000 4409.682 0.6202580
## 0.01 3 15 100 4970.217 0.5959084
## 0.01 3 15 500 4398.804 0.6219976
## 0.01 3 15 1000 4428.930 0.6175659
## 0.01 7 5 100 4801.145 0.6330853
## 0.01 7 5 500 4327.434 0.6329325
## 0.01 7 5 1000 4461.238 0.6134293
## 0.01 7 7 100 4811.889 0.6289383
## 0.01 7 7 500 4355.383 0.6284763
## 0.01 7 7 1000 4481.095 0.6100913
## 0.01 7 10 100 4832.496 0.6221768
## 0.01 7 10 500 4395.304 0.6221102
## 0.01 7 10 1000 4507.997 0.6059607
## 0.01 7 15 100 4853.676 0.6138957
## 0.01 7 15 500 4427.314 0.6171570
## 0.01 7 15 1000 4517.725 0.6044803
## 0.10 3 5 100 4394.436 0.6226106
## 0.10 3 5 500 4725.962 0.5759602
## 0.10 3 5 1000 4896.720 0.5532574
## 0.10 3 7 100 4413.035 0.6195957
## 0.10 3 7 500 4741.269 0.5725511
## 0.10 3 7 1000 4912.335 0.5495099
## 0.10 3 10 100 4429.947 0.6175066
## 0.10 3 10 500 4717.013 0.5773895
## 0.10 3 10 1000 4879.575 0.5557203
## 0.10 3 15 100 4468.327 0.6113256
## 0.10 3 15 500 4695.503 0.5793620
## 0.10 3 15 1000 4855.344 0.5574418
## 0.10 7 5 100 4536.568 0.6019040
## 0.10 7 5 500 4918.567 0.5505316
## 0.10 7 5 1000 5099.137 0.5274952
## 0.10 7 7 100 4555.910 0.5993252
## 0.10 7 7 500 4926.964 0.5489927
## 0.10 7 7 1000 5118.336 0.5239142
## 0.10 7 10 100 4547.307 0.6003118
## 0.10 7 10 500 4924.214 0.5489045
## 0.10 7 10 1000 5121.154 0.5241119
## 0.10 7 15 100 4544.352 0.6008735
## 0.10 7 15 500 4892.240 0.5534057
## 0.10 7 15 1000 5094.395 0.5279048
## MAE
## 3424.435
## 2498.249
## 2476.415
## 3419.830
## 2520.905
## 2497.755
## 3419.446
## 2540.164
## 2528.833
## 3412.284
## 2578.518
## 2568.821
## 3312.765
## 2442.944
## 2524.449
## 3311.658
## 2468.153
## 2556.214
## 3313.960
## 2503.867
## 2601.695
## 3311.391
## 2556.999
## 2653.283
## 2518.066
## 2823.696
## 3015.259
## 2538.535
## 2861.464
## 3058.019
## 2550.029
## 2876.314
## 3061.384
## 2613.351
## 2875.338
## 3057.908
## 2597.807
## 3019.066
## 3191.808
## 2642.162
## 3064.963
## 3250.070
## 2662.824
## 3086.041
## 3284.989
## 2710.620
## 3102.520
## 3305.208
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 500, interaction.depth =
## 3, shrinkage = 0.01 and n.minobsinnode = 5.
Now we apply the best parameters and fit the GBM model again.
set.seed(123)
Insurance.gbm <- gbm(model1.formula,
data = Insurance.train,
distribution = "gaussian",
n.trees = 500,
interaction.depth = 3,
shrinkage = 0.01,
n.minobsinnode = 5,
verbose = F)
Now we can see how well the predictions fit.
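A sketch of the prediction step (note that predict() for a gbm object needs n.trees; the object names are assumptions):
pred.gbm <- predict(Insurance.gbm, newdata = Insurance.test, n.trees = 500)
mse.gbm  <- mean((Insurance.test$expenses - pred.gbm)^2)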
Here are the Gradient Boosting Machine Mean Squared Error and R-squared:
## [1] "MSE is: 20285845.51"
## [1] "R-Squared is: 0.6096"
Let’s now analyse the variable importance plot.
It is easy to see that the most important variable is again smokeryes, whether the insurance contractor is a smoker.
In comparison, the variable age appears more important than in the CatBoost model.
Then we compare the Mean Squared Error between CatBoost and GBM.
| Model | Mean Squared Error |
|---|---|
| Catboost | 23278042 |
| GBM | 20285846 |
In terms of MSE, the GBM model performs better than the CatBoost one.
Before proceeding with neural networks, we decide to fit a classical linear model.
Our dependent variable remains the same: expenses, which stands for medical charges.
We calculate predictions with the linear model and plot them.
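A sketch of the linear model step (the object names are assumptions):
Insurance.lm <- lm(model1.formula, data = Insurance.train)
pred.lm <- predict(Insurance.lm, newdata = Insurance.test)
mse.lm  <- mean((Insurance.test$expenses - pred.lm)^2)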
Here are the linear model Mean Squared Error and R-squared:
## [1] "MSE is: 23155410.89"
## [1] "R-Squared is: 0.5544"
Let’s now analyse the variable importance plot.
The importance is spread more evenly across all variables:
smokeryes remains the most important, although its value decreases.
Neural networks are a set of algorithms, loosely modeled after the human brain, designed to recognize patterns.
The patterns they recognize are numerical, contained in vectors, to which all data in the real world must be translated.
A stacked neural network is a network composed of several layers.
The layers are made of nodes.
A node combines data input with a set of coefficients or weights that either amplifies or dampens that input, thereby assigning significance to inputs for the task that the algorithm is trying to learn.
The output of each layer is at the same time the input of the next layer, starting from the initial input layer that receives the data.
In order to scale all variables to a common range, we use the maximum and minimum value of each variable (min-max scaling).
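A sketch of the scaling step (object names are assumptions; scaling the test set with the training ranges is one common choice, and the report may have proceeded differently):
# Min-max scaling to [0, 1]; the test set is scaled with the training ranges
mins <- sapply(Insurance.train, min)
maxs <- sapply(Insurance.train, max)

Insurance.train.scaled <- as.data.frame(scale(Insurance.train, center = mins, scale = maxs - mins))
Insurance.test.scaled  <- as.data.frame(scale(Insurance.test,  center = mins, scale = maxs - mins))
str(Insurance.train.scaled)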
## 'data.frame': 840 obs. of 9 variables:
## $ age : num 0 0.326 0.413 0.413 0.152 ...
## $ bmi : num 0.48 0.181 0.315 0.372 0.275 ...
## $ children : num 0.2 0 0.6 0.4 0 0 0 0.2 0.2 0 ...
## $ expenses : num 0.0181 0.6255 0.1847 0.1585 0.048 ...
## $ sexmale : num 1 1 0 1 1 0 1 1 0 1 ...
## $ smokeryes : num 0 0 0 0 0 1 0 0 0 0 ...
## $ regionnorthwest: num 0 1 1 0 0 0 0 0 0 0 ...
## $ regionsoutheast: num 1 0 0 0 0 1 0 0 0 0 ...
## $ regionsouthwest: num 0 0 0 0 0 0 1 1 0 0 ...
Using the nnet() function and the same seed chosen before, we create a model with 1 hidden neuron.
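A sketch of the fit, assuming the scaled data and object names from the previous sketch (linout = TRUE gives a linear output unit for regression):
library(nnet)

set.seed(123)
Insurance.nn1 <- nnet(model1.formula,
                      data = Insurance.train.scaled,
                      size = 1,       # one hidden neuron
                      linout = TRUE)  # linear output for regression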
The parts of the plot below are classified as follows:
l = layer, h = hidden, b = bias, o = output. Green stands for positive values, red for negative values.
## # weights: 11
## initial value 49.490228
## iter 10 value 16.321008
## iter 20 value 14.841199
## iter 30 value 14.832719
## iter 40 value 14.827985
## iter 50 value 14.821719
## final value 14.821718
## converged
We unscale the predictions in order to compute the MSE and R-squared and plot them.
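A sketch of the unscaling, reusing the mins and maxs vectors assumed above:
pred.nn1 <- predict(Insurance.nn1, Insurance.test.scaled) *
  (maxs["expenses"] - mins["expenses"]) + mins["expenses"]
mse.nn1 <- mean((Insurance.test$expenses - pred.nn1)^2)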
Here are the neural network Mean Squared Error and R-squared:
## [1] "MSE is: 22706292.9348"
## [1] "R-Squared is: 0.563"
Then we analyse the variable importance plot.
The difference, compared to the previous models, is the further growth in importance of smokeryes and age.
Now let’s try to increase the number of hidden neurons to 2.
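The call is the same as before with size = 2 (a sketch; the object name is an assumption):
set.seed(123)
Insurance.nn2 <- nnet(model1.formula, data = Insurance.train.scaled,
                      size = 2, linout = TRUE)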
## # weights: 21
## initial value 71.132800
## iter 10 value 15.850134
## iter 20 value 14.450364
## iter 30 value 14.306039
## iter 40 value 14.064247
## iter 50 value 13.991688
## iter 60 value 13.837119
## iter 70 value 13.728204
## iter 80 value 13.693852
## iter 90 value 13.680662
## iter 100 value 13.674768
## final value 13.674768
## stopped after 100 iterations
As before, we unscale the predictions in order to compute the MSE and R-squared and plot them.
Here are the neural network Mean Squared Error and R-squared:
## [1] "MSE is: 21987377.75"
## [1] "R-Squared is: 0.5769"
This table summarizes the comparison between the linear model and the neural network models:
| Model | Mean Squared Error |
|---|---|
| lm | 23155411 |
| nn1 | 22706293 |
| nn2 | 21981012 |
The best model is the neural network with 2 hidden neurons.
Increasing the number of hidden neurons produces a decrease in the MSE.
Then we analyse the variable importance plot for the neural network model with 2 hidden neurons.
It is easy to see that the most important variable becomes Body Mass Index, followed by smokeryes.
Cross-validation is a resampling procedure used on a limited data sample to evaluate machine learning models.
The procedure has a single parameter called k that refers to the number of groups into which the data sample is divided.
We decide to resample with k = 10 in order to test our linear model and the neural network with 1 hidden neuron.
We use the cv.glm() function with K = 10 and calculate the cross-validation error.
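A sketch of this step with boot::cv.glm(), which requires a glm fit (object names and the use of the training set are assumptions):
library(boot)

set.seed(123)
Insurance.glm <- glm(model1.formula, data = Insurance.train)   # gaussian family = linear model
cv.lm.error <- cv.glm(Insurance.train, Insurance.glm, K = 10)$delta[1]
cv.lm.error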
## [1] 20264583
On the other hand, the situation becomes more complicated for 10-fold cross-validation of the neural network.
We choose 1 hidden neuron for practical reasons.
As a first step, we create an empty vector to collect the fold errors.
Then, we run a for loop that repeats the following steps for each fold k (a sketch is shown below):
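A sketch of the loop (fold construction and all object names other than cv.nn.errors are assumptions):
set.seed(123)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(Insurance.train.scaled)))
cv.nn.errors <- c()   # empty error vector

for (i in 1:k) {
  # Fit on all folds except the i-th, predict on the held-out fold
  nn.fit <- nnet(model1.formula, data = Insurance.train.scaled[folds != i, ],
                 size = 1, linout = TRUE, trace = FALSE)
  pred.scaled <- predict(nn.fit, Insurance.train.scaled[folds == i, ])
  pred   <- pred.scaled * (maxs["expenses"] - mins["expenses"]) + mins["expenses"]
  actual <- Insurance.train$expenses[folds == i]
  cv.nn.errors[i] <- mean((actual - pred)^2)
}
mean(cv.nn.errors)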
Here is the mean of the MSEs:
## [1] 21206647
Cross validation Neural Network Boxplot
Then, we show the set of all cv.nn.errors and represent them in a boxplot.
## [1] 47174377 23237290 22927357 26781945 6446507 9733600 33914802 18411697
## [9] 21407620 2031271
In conclusion, we can compare the two CV prediction errors.
| Model | CV prediction error |
|---|---|
| cv.lm | 20264583 |
| cv.nn1 | 21206647 |
Thanks to cross-validation resampling, we can see that the linear model fits better than the neural network with 1 hidden neuron.
In conclusion, we present a table containing all the Mean Squared Errors and R-squared values.
From this table we can see that the smallest MSE (and highest R-squared) is achieved by the Gradient Boosting Machine model.
Therefore, we recommend this model for predicting the future medical expenses of individuals, allowing the company to make decisions about charging premiums.
To try out our expense prediction, we apply our model to 3 simulated new customers.
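As a sketch of how a new customer could be scored with the selected GBM model, the profile below is purely illustrative (the actual characteristics of Marcus, Matilda and Stefan are not reproduced here):
# Hypothetical new customer, encoded like the training data
Marcus <- data.frame(age = 45, bmi = 31.2, children = 2,
                     sexmale = 1, smokeryes = 1,
                     regionnorthwest = 0, regionsoutheast = 1, regionsouthwest = 0)
pred.marcus <- predict(Insurance.gbm, newdata = Marcus, n.trees = 500)
paste("Health care charges for Marcus:", round(pred.marcus, 2))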
We can see that:
Marcus
## [1] "Health care charges for Marcus: 17997.98"
Matilda
## [1] "Health care charges for Maltilda: 7605.43"
Stefan
## [1] "Health care charges for Stefan: 16244.46"