Introduction

The medical costs borne by insurance companies are constantly changing, and the nation faces a medical crisis: treatment is relatively expensive and the service provided is often subpar. Medical providers charge insurance companies billions of dollars for treatment, yet two individuals with the same health condition are rarely charged the same amount, because there is no standard pricing system and charges are largely subjective. This report uses medical charge and patient data, together with several recent machine learning models, to predict medical charges from patient information.

Summary Statistics

This dataset describes the medical costs charged to an insurance company. There are 1338 observations of 7 variables: 1 dependent variable (charges) and 6 independent variables (age, BMI, sex, smoker, children, region). Charges is the amount the insurer is billed for medical services and is numeric. Age and children are integers, and BMI is a continuous numeric value. Sex and smoker are binary (male or female; yes or no). Region is a categorical location variable. These 6 variables may all affect the amount charged.

I expect age, BMI and smoker to have a positive effect on charges. I believe the data will support my assumption that as individuals get older they tend to get sicker and therefore require more medical care. BMI (body mass index) is the variable I expect to have the highest correlation with charges: a high BMI means an individual is outside the recommended health range, so the higher the BMI, the more medical care that individual will need and the more it will cost. Smoking harms health, and smokers are at higher risk of cancer and other diseases. I am unsure how sex will affect charges. Women tend to outlive men, which might mean that men experience more medical issues throughout their lives, but it could also mean that women are more likely to go to the doctor while men avoid it.

##       age             sex              bmi           children    
##  Min.   :18.00   Min.   :0.0000   Min.   :15.96   Min.   :0.000  
##  1st Qu.:27.00   1st Qu.:0.0000   1st Qu.:26.30   1st Qu.:0.000  
##  Median :39.00   Median :0.0000   Median :30.40   Median :1.000  
##  Mean   :39.21   Mean   :0.4948   Mean   :30.66   Mean   :1.095  
##  3rd Qu.:51.00   3rd Qu.:1.0000   3rd Qu.:34.69   3rd Qu.:2.000  
##  Max.   :64.00   Max.   :1.0000   Max.   :53.13   Max.   :5.000  
##      smoker          charges     
##  Min.   :0.0000   Min.   : 1122  
##  1st Qu.:0.0000   1st Qu.: 4740  
##  Median :0.0000   Median : 9382  
##  Mean   :0.2048   Mean   :13270  
##  3rd Qu.:0.0000   3rd Qu.:16640  
##  Max.   :1.0000   Max.   :63770

The table above summarizes the data: it lists the variables and shows the type and range of each. Sex and smoker are coded as 0/1.

This graph plots age versus charges and indicates a positive correlation between age and charges.

This graph plots BMI versus charges and indicates a positive correlation between BMI and charges.

This graph plots the number of children versus charges. There does not seem to be any clear relationship between these two variables.
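The strength of the relationships read off these scatterplots can also be checked numerically with Pearson correlations. The sketch below uses a small simulated data frame (`toy`, with made-up values and hypothetical effect sizes, not the insurance data) purely to show the calls; with the real data the same `cor()` call would be applied to the corresponding columns.

```r
# Illustrative only: simulated stand-in data, NOT the insurance dataset.
set.seed(1)
n <- 200
toy <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  bmi      = rnorm(n, mean = 30, sd = 6),
  children = sample(0:5, n, replace = TRUE)
)
# Hypothetical charges: driven by age and bmi, unrelated to children.
toy$charges <- 250 * toy$age + 300 * toy$bmi + rnorm(n, sd = 5000)

# Pearson correlation of each predictor with charges.
cors <- sapply(toy[c("age", "bmi", "children")],
               function(col) cor(col, toy$charges))
print(round(cors, 2))
```

A correlation near 0 (as expected here for `children`) matches a scatterplot with no visible trend, while values further from 0 indicate a stronger linear relationship.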

Linear Regression

After running several linear regressions with the same Y but different combinations of X, I compared them using adjusted R2, which measures how much of the variation in Y the model explains while adjusting for the number of predictors. The X's that best explain Y are age, bmi and the binary variable smoker. This regression accounts for nearly 75% of the variation in Y, which makes it a strong model. Age and BMI are highly relevant: as we age and get more out of shape, we can expect our medical bills to increase. Smoking does the most harm to our bodies, which is borne out by the data.

## 
## Call:
## lm(formula = charges ~ age, data = Copy_of_ML_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -8059  -6671  -5939   5440  47829 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3165.9      937.1   3.378 0.000751 ***
## age            257.7       22.5  11.453  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11560 on 1336 degrees of freedom
## Multiple R-squared:  0.08941,    Adjusted R-squared:  0.08872 
## F-statistic: 131.2 on 1 and 1336 DF,  p-value: < 2.2e-16

As the summary above shows, age alone explains only about 9% of the variation in charges (Y).
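The adjusted R-squared in the summary can be reproduced from the multiple R-squared with the standard formula, 1 - (1 - R2)(n - 1)/(n - p - 1). The sketch below is a small check using the values printed above (R2 = 0.08941, n = 1338 observations, p = 1 predictor); the helper name `adj_r2` is made up for illustration.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p.
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Values taken from the age-only summary above: R^2 = 0.08941, n = 1338, p = 1.
adj_age <- adj_r2(0.08941, n = 1338, p = 1)
print(round(adj_age, 4))  # 0.0887, in line with the reported 0.08872
```

Because the penalty grows with p, adding a predictor raises adjusted R-squared only if it improves the fit by more than would be expected by chance, which is why it is a reasonable metric for comparing models with different numbers of X's.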

## 
## Call:
## lm(formula = charges ~ age + bmi + sex, data = Copy_of_ML_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -14974  -7073  -5072   6953  47348 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5642.35    1779.00  -3.172  0.00155 ** 
## age           243.19      22.28  10.917  < 2e-16 ***
## bmi           327.54      51.37   6.377 2.49e-10 ***
## sex         -1344.46     622.66  -2.159  0.03101 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11370 on 1334 degrees of freedom
## Multiple R-squared:  0.1203, Adjusted R-squared:  0.1183 
## F-statistic: 60.78 on 3 and 1334 DF,  p-value: < 2.2e-16

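We can see that sex has the weakest effect of the three predictors on the charges billed by medical providers. Its coefficient is significant only at the 5% level, and the difference in charges between males and females is small compared with the other effects. This finding is very interesting.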

## 
## Call:
## lm(formula = charges ~ age + bmi + smoker, data = Copy_of_ML_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12415.4  -2970.9   -980.5   1480.0  28971.8 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -11676.83     937.57  -12.45   <2e-16 ***
## age            259.55      11.93   21.75   <2e-16 ***
## bmi            322.62      27.49   11.74   <2e-16 ***
## smoker       23823.68     412.87   57.70   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6092 on 1334 degrees of freedom
## Multiple R-squared:  0.7475, Adjusted R-squared:  0.7469 
## F-statistic:  1316 on 3 and 1334 DF,  p-value: < 2.2e-16
## [1] 37005396

The above linear regression model is the best one I was able to create; it explains nearly 75% of the variation in charges. Smoking status has the strongest effect, accounting for more than half of the explained variation. Holding the other variables fixed, each additional year of age increases the estimated charge by about $260, each additional BMI unit increases it by about $322, and being a smoker increases it by nearly $24,000. All coefficients, including the intercept, are significantly different from zero, and the F-statistic indicates that the model as a whole explains significantly more than an intercept-only model. MSE = 37005396

The following graph shows predicted Y versus actual Y (yhat vs. y). The points follow a positive trend, which indicates that the model fits the data well.
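To make the coefficient interpretation concrete, the sketch below plugs the reported estimates into the fitted equation for a hypothetical 40-year-old with a BMI of 30, once as a non-smoker and once as a smoker. The person and the helper `predict_charge` are made up for illustration; the coefficients are the rounded values from the summary above.

```r
# Coefficients as reported in the summary above (rounded to two decimals).
coefs <- c(intercept = -11676.83, age = 259.55, bmi = 322.62, smoker = 23823.68)

# Predicted charge for a hypothetical person; smoker is coded 0/1.
predict_charge <- function(age, bmi, smoker) {
  unname(coefs["intercept"] + coefs["age"] * age +
         coefs["bmi"] * bmi + coefs["smoker"] * smoker)
}

non_smoker <- predict_charge(age = 40, bmi = 30, smoker = 0)  # about 8383.77
smoker     <- predict_charge(age = 40, bmi = 30, smoker = 1)  # about 32207.45
print(c(non_smoker = non_smoker, smoker = smoker))

# Root mean squared error in dollars, from the reported MSE.
rmse <- sqrt(37005396)
print(round(rmse))  # 6083
```

The two predictions differ by exactly the smoker coefficient, and the RMSE of roughly $6,083 gives the typical size of a prediction error in the same units as charges.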

##                Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) -11676.8304  937.56870 -12.45437 9.208145e-34
## age            259.5475   11.93418  21.74825 5.241160e-90
## bmi            322.6151   27.48741  11.73683 2.418558e-30
## smoker       23823.6845  412.86668  57.70309 0.000000e+00

This table shows the estimates and standard errors for the model. All four estimates are highly significant; the smoker variable has by far the largest t value.

The following charts show the distribution of the residuals, that is, how much the predicted values differ from the actual data. As the histogram shows, most residuals are centered around 0, which is another indication that this is the best of the linear models.

Lasso and Ridge

## [1] 1078.513

## [1] 35277049
## (Intercept)         age         sex         bmi    children      smoker 
##  -3471.0868    188.3051      0.0000    164.3036      0.0000  21098.0114

Using cross-validation, I tuned the penalty parameter lambda, which controls the flexibility of the model. The lambda that minimizes the cross-validated MSE is 1078.513. Because the values of lambda span several orders of magnitude, I viewed them on a log scale to get a better understanding of what is going on.

We can see from both graphs that the best log(lambda) is between 8 and 8.5, because this minimizes the MSE and is also where the coefficients begin to change at the lambda chosen by CV. The ridge model has a lower MSE than the OLS model: 35277049 versus 37005396, which suggests that ridge is the better model for predicting medical costs.
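The way lambda shrinks coefficients can be sketched without `glmnet`, using the closed-form ridge estimate on standardized predictors. This is only an illustration on simulated data (not the insurance data), and it is not how `glmnet` computes its fit; `glmnet` also scales lambda differently, so the lambda values here are not comparable to the 1078.513 reported above.

```r
# Illustrative only: simulated data, not the insurance dataset.
set.seed(1)
n <- 100
X <- cbind(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
y <- 3 * X[, "x1"] - 2 * X[, "x2"] + rnorm(n)

# Standardize predictors and center the response so no intercept is penalized.
Xs <- scale(X)
yc <- y - mean(y)

# Closed-form ridge estimate: (X'X + lambda * I)^(-1) X'y.
ridge_coef <- function(lambda) {
  p <- ncol(Xs)
  drop(solve(crossprod(Xs) + lambda * diag(p), crossprod(Xs, yc)))
}

# Coefficients shrink toward zero as lambda grows.
lambdas <- c(0, 10, 100, 1000)
path <- sapply(lambdas, ridge_coef)
colnames(path) <- paste0("lambda=", lambdas)
print(round(path, 3))

# lambda = 0 reproduces ordinary least squares on the same standardized data.
ols <- coef(lm(yc ~ Xs - 1))
print(all.equal(unname(ridge_coef(0)), unname(ols)))
```

Cross-validation then amounts to choosing the lambda whose held-out MSE is smallest, trading a little bias for lower variance.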

Decision Tree

After building a regression tree and pruning it to a size of 5, the result was what I expected: smoking status is the main split. For smokers the next most important split is BMI, followed by age, whereas for non-smokers the main factor is age. This model performs much worse than both ridge/lasso and OLS: the MSE of the pruned tree is 274498810, compared with 35277049 for ridge and 37005396 for OLS. The main reason is that it is a single, small decision tree, which makes its predictions high-variance.
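How a regression tree picks its first split can be sketched by hand: for each variable, try every candidate cut point and keep the one that minimizes the residual sum of squares (RSS) of the two resulting groups. The code below is a minimal base-R illustration on simulated data with made-up effect sizes, not the `tree` package's implementation and not the insurance data; `split_rss` and `best_split` are hypothetical helper names.

```r
# Illustrative only: simulated data, not the insurance dataset.
set.seed(1)
n <- 300
toy <- data.frame(
  age    = sample(18:64, n, replace = TRUE),
  smoker = rbinom(n, 1, 0.2)
)
toy$charges <- 3000 + 250 * toy$age + 24000 * toy$smoker + rnorm(n, sd = 3000)

# Residual sum of squares after splitting on x < cut.
split_rss <- function(x, y, cut) {
  left <- y[x < cut]
  right <- y[x >= cut]
  sum((left - mean(left))^2) + sum((right - mean(right))^2)
}

# Best cut point for one variable: try midpoints between sorted unique values.
best_split <- function(x, y) {
  u <- sort(unique(x))
  cuts <- (head(u, -1) + tail(u, -1)) / 2
  rss <- sapply(cuts, function(cut) split_rss(x, y, cut))
  list(cut = cuts[which.min(rss)], rss = min(rss))
}

# Compare the best split of each variable; the lowest RSS becomes the root split.
splits <- lapply(toy[c("age", "smoker")], best_split, y = toy$charges)
first_split <- names(which.min(sapply(splits, function(s) s$rss)))
print(first_split)
```

When one variable shifts the response as strongly as smoking status does in this toy setup, splitting on it removes far more RSS than any age cut, which is consistent with smoker appearing at the root of the tree.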

Bagging

Bagging, which fits many decision trees to bootstrap samples and averages their predictions, is expected to have a lower MSE than the other models. As the bag.ML (out-of-bag) graph shows, the error falls quickly at first and levels off near 75 trees. Meanwhile, in the graph of bagging predictions versus actual values, the predictions regress toward the mean. The bagging MSE is 25971680. Because bagging aggregates many decision trees, it gives a more stable, lower-variance estimate than a single tree. The bagging MSE is roughly ten times smaller than that of the decision tree, which is a large improvement.

This importance matrix shows how strongly each variable contributes to the model. As expected, smoker is the most important, and the difference between age and bmi is minimal.
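The bootstrap-and-average idea can be sketched in base R with one-split regression stumps as the base learner. This is only an illustration on simulated one-dimensional data (a made-up sine signal), not the `randomForest` implementation used in this report; `fit_stump` and `predict_stump` are hypothetical helpers.

```r
# Illustrative only: simulated data, not the insurance dataset.
set.seed(1)
n <- 200
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, sd = 0.3)

# Fit a one-split regression stump; returns the cut and the two leaf means.
fit_stump <- function(x, y) {
  u <- sort(unique(x))
  cuts <- (head(u, -1) + tail(u, -1)) / 2
  rss <- sapply(cuts, function(cut) {
    l <- y[x < cut]
    r <- y[x >= cut]
    sum((l - mean(l))^2) + sum((r - mean(r))^2)
  })
  cut <- cuts[which.min(rss)]
  list(cut = cut, left = mean(y[x < cut]), right = mean(y[x >= cut]))
}
predict_stump <- function(s, newx) ifelse(newx < s$cut, s$left, s$right)

# Bagging: fit B stumps on bootstrap resamples and average their predictions.
B <- 50
grid <- seq(0, 10, length.out = 101)
preds <- sapply(seq_len(B), function(b) {
  idx <- sample(n, replace = TRUE)
  predict_stump(fit_stump(x[idx], y[idx]), grid)
})
bagged <- rowMeans(preds)

# Compare a single stump with the bagged ensemble against the true signal.
single <- predict_stump(fit_stump(x, y), grid)
mse <- function(pred) mean((pred - sin(grid))^2)
print(c(single = mse(single), bagged = mse(bagged)))
```

Each bootstrap stump places its cut slightly differently, so the average is smoother than any single stump; that averaging is what typically lowers the variance of the prediction.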

Random Forest

## Tree |       MSE  %Var(y) |
##   30 | 2.538e+07    17.67 |
##   40 | 2.574e+07    17.92 |
##   50 | 2.528e+07    17.60 |
##   60 | 2.532e+07    17.62 |

This shows that after roughly 58 trees the out-of-bag error is at its minimum.

The MSE for random forest is 27305889, slightly worse than bagging, though it still performs much better than ridge and OLS. We can see that the MSE-based importance of each variable has decreased compared with the bagging model, as the last importance matrix graph shows.

Boosting

##           var  rel.inf
## bmi       bmi 43.14103
## smoker smoker 30.24228
## age       age 26.61669

This boosting model changes the relative influence entirely compared with random forest and bagging: bmi now has the strongest relative influence in the model, instead of smoker.

The top three graphs respectively produce partial dependence plots for these variables. These plots illustrate the marginal effect of the selected variables on the response after integrating out the other variables.

The boosted predictions (yhat.boost) on the test set are relatively close to the actual values. The MSE of this model, 23452311, is lower than that of all the other models: it performs a bit better than bagging (25971680) and random forest (27305889), and much better than ridge (35277049) and OLS (37005396).
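For intuition about how boosting differs from bagging, the sketch below builds an ensemble sequentially: each one-split stump is fit to the residuals of the current ensemble and added with a small learning rate. It is a base-R illustration on simulated data with arbitrary settings (100 stumps, shrinkage 0.1), not the `gbm` implementation used in this report; `fit_stump` and `predict_stump` are hypothetical helpers.

```r
# Illustrative only: simulated data, not the insurance dataset.
set.seed(1)
n <- 200
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, sd = 0.3)

# Fit a one-split regression stump; returns the cut and the two leaf means.
fit_stump <- function(x, y) {
  u <- sort(unique(x))
  cuts <- (head(u, -1) + tail(u, -1)) / 2
  rss <- sapply(cuts, function(cut) {
    l <- y[x < cut]
    r <- y[x >= cut]
    sum((l - mean(l))^2) + sum((r - mean(r))^2)
  })
  cut <- cuts[which.min(rss)]
  list(cut = cut, left = mean(y[x < cut]), right = mean(y[x >= cut]))
}
predict_stump <- function(s, newx) ifelse(newx < s$cut, s$left, s$right)

# Boosting: each stump is fit to the residuals of the current ensemble,
# and its prediction is added with a small learning rate (shrinkage).
n_trees <- 100
shrinkage <- 0.1
fitted_vals <- rep(mean(y), n)
train_mse <- numeric(n_trees)
for (m in seq_len(n_trees)) {
  res <- y - fitted_vals
  s <- fit_stump(x, res)
  fitted_vals <- fitted_vals + shrinkage * predict_stump(s, x)
  train_mse[m] <- mean((y - fitted_vals)^2)
}
print(round(train_mse[c(1, 10, 100)], 4))
```

The training error can only go down from one round to the next, because every stump targets what the ensemble still gets wrong. In practice the number of trees and the shrinkage are tuned on held-out data, since the training error alone would keep favoring more trees.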

XGBoosting

With XGBoost, the only variable with notable relative importance is smoker. The reported MSE is 321887457, which is roughly nine times the MSE of ridge (35277049) and OLS (37005396). This model therefore performs worse than bagging (25971680) and random forest (27305889). The gap between boosting (23452311) and XGBoost (321887457) is very large, which makes me question how accurate the relative importance levels are.

Neural Networks

## [1] "Training entries: 5020, labels: 1004"
##      age sex  bmi children smoker
## [1,]  46   0 27.6        0      0
## [2,]  45   0 28.7        2      0

The neural network model did not run because of a technical problem on my computer: I ran into repeated issues with Python not being able to connect to TensorFlow.

Conclusion

Having explored how machine learning can be used to predict medical costs, I can attest that some models work better than others. One of the main reasons is the bias-variance tradeoff: some models penalize complexity more strongly than others. After testing the medical cost data against all the above models, boosting performed best, with the lowest MSE. Ranked by reported MSE from best to worst: Boosting > Bagging > Random Forest > Ridge > OLS > Decision Tree > XGBoost.

This report allowed me to understand the importance of testing data against multiple models in order to get the best result possible, and of continuing to improve our current models. Being able to utilize these models can allow us to better predict the costs that medical providers charge insurance companies. This can be used by insurers to understand policyholders' risk levels and to decide what premiums to charge based on personal data such as age, BMI, sex…

Overall, the main reason boosting worked best is likely that, rather than averaging trees fit to bootstrap samples, it fits trees sequentially, each one to the errors of the previous ones, which reduces bias. This gave it an advantage over OLS and ridge.
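The comparison can be laid out by sorting the MSE values reported in the sections above. The sketch below only rearranges those reported numbers and adds two derived columns (RMSE in dollars and the ratio to the best model); no new results are computed from the data.

```r
# MSE values as reported in the sections above.
mse <- c(
  OLS             = 37005396,
  Ridge           = 35277049,
  "Pruned tree"   = 274498810,
  Bagging         = 25971680,
  "Random forest" = 27305889,
  Boosting        = 23452311,
  XGBoost         = 321887457
)

ranking <- sort(mse)  # lowest (best) MSE first
comparison <- data.frame(
  model         = names(ranking),
  mse           = unname(ranking),
  rmse          = round(sqrt(unname(ranking))),          # error in dollars
  ratio_to_best = round(unname(ranking) / min(mse), 2)   # 1.00 for the best model
)
print(comparison)
```

Reading the RMSE column is often easier than comparing raw MSE values, because it is in the same units as the charges being predicted.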

Code

Code without comments

library(tree)
library(ISLR)
library(dplyr)
library(ggplot2)
library(MASS)
library(readxl)

# Copy_of_ML_data is the spreadsheet imported with read_excel(); the file path is not given here
ML_data = Copy_of_ML_data
names(ML_data)

summary(ML_data)

# Scatterplots of each numeric predictor against charges
plot(ML_data$age, ML_data$charges, xlab = "Age", ylab = "Charges", main = "Age Vs Charges")

plot(ML_data$bmi, ML_data$charges, xlab = "BMI", ylab = "Charges", main = "BMI Vs Charges")

plot(ML_data$children, ML_data$charges, xlab = "Children", ylab = "Charges", main = "Children Vs Charges")

# Part 2: regressions
lm.fit = lm(charges ~ age, data = ML_data)
summary(lm.fit)

lm.fit = lm(charges ~ age + bmi, data = ML_data)
summary(lm.fit)

lm.fit = lm(charges ~ age + bmi + smoker, data = ML_data)
summary(lm.fit)

lm.fit = lm(charges ~ age + bmi + smoker + children + region, data = ML_data)
summary(lm.fit)

# Best regression
lm.fit = lm(charges ~ age + bmi + smoker, data = ML_data)

# Plot Y vs Y hat
ggplot() +
  geom_point(aes(x = ML_data$charges, y = predict(lm.fit)), colour = 'blue') +
  geom_line(aes(x = ML_data$charges, y = ML_data$charges), colour = 'red') +
  ggtitle('Predicted charges vs actual charges') +
  xlab('actual y') +
  ylab('predicted y')

# Standard errors
lm.summary = summary(lm.fit)
lm.summary$coefficients

# Residuals vs fitted values
ggplot(ML_data, aes(x = residuals(lm.fit), y = predict(lm.fit))) +
  geom_point(color = 'blue', size = .1) +
  labs(y = "fitted values", x = "residuals")

# Histogram of residuals
ggplot(ML_data) +
  labs(y = "Frequency", x = "Residuals") +
  geom_histogram(aes(x = residuals(lm.fit)), binwidth = 10, colour = 'grey')

# Part 3: ridge and lasso
library(glmnet)
library(dplyr)

sum(is.na(ML_data))

x = model.matrix(charges ~ ., ML_data)[, -1] # trim off the first column (intercept)
y = ML_data$charges

grid = 10^seq(10, -2, length = 100)
ridge_mod = glmnet(x, y, alpha = 0, lambda = grid)

dim(coef(ridge_mod))

ridge_mod$lambda[2] # Display second lambda value used

coef(ridge_mod, s = 50)

library(ggplot2)

# 70/30 train/test split (the same split appears again in part 5)
set.seed(1)
train = ML_data %>% sample_frac(.7)
test = ML_data %>% setdiff(train)

# Both x_train and x_test should be in matrix format
x_train = model.matrix(charges ~ ., train)[, -1]
x_test = model.matrix(charges ~ ., test)[, -1]
y_train = train$charges
y_test = test$charges

ridge_mod = glmnet(x_train, y_train, alpha = 0, lambda = grid)
# Predict using the trained model and the x_test set
ridge_pred = predict(ridge_mod, s = 4, newx = x_test)
# Calculate MSE
mean((ridge_pred - y_test)^2)

set.seed(1)
cv.out = cv.glmnet(x_train, y_train, alpha = 0) # Fit ridge regression model on training data
bestlam = cv.out$lambda.min # Select lambda that minimizes cross-validated MSE
bestlam

plot(cv.out) # Draw plot of cross-validated MSE as a function of lambda

ridge_pred = predict(ridge_mod, s = bestlam, newx = x_test) # Use best lambda to predict test data
mean((ridge_pred - y_test)^2) # Calculate test MSE

out = glmnet(x, y, alpha = 1) # alpha = 1 fits the lasso on the FULL dataset (train and test)
predict(out, type = "coefficients", s = bestlam)[, 1] # Display coefficients using lambda chosen by CV

plot(out, xvar = "lambda")

# Part 2 of report 3: regression tree

library(tree)
library(ISLR)
library(dplyr)
library(ggplot2)
library(MASS)

tree_ML = tree(charges ~ age + bmi + children, train)
summary(tree_ML)

plot(tree_ML)
text(tree_ML, pretty = 0)

cv.ML = cv.tree(tree_ML)
prune_ML = prune.tree(tree_ML, best = 6)
plot(prune_ML)
text(prune_ML, pretty = 0)
title('Charges Regression Tree')

single_tree_estimate = predict(prune_ML, newdata = test)

ggplot() +
  geom_point(aes(x = test$charges, y = single_tree_estimate), color = 'blue') +
  geom_abline(color = 'red') +
  labs(x = 'actual charges (test subset)', y = 'predicted charges')

# Test MSE of the pruned tree
mean((single_tree_estimate - test$charges)^2)

# Linear model on the same predictors, for comparison
lm_ML = lm(charges ~ age + bmi + children, train)
lm_estimate = predict(lm_ML, newdata = test)
mean((lm_estimate - test$charges)^2)

# Part 4
library(MASS)
library(randomForest)
library(dplyr)
library(ggplot2)

# Bagging
ls(ML_data)
summary(ML_data)

set.seed(1)
# Bagging is a random forest that considers all predictors at every split (mtry = 3 here)
bag.ML = randomForest(charges ~ age + bmi + children, data = train, mtry = 3, importance = TRUE)
bag.ML

plot(bag.ML)

yhat.bag = predict(bag.ML, newdata = test)

ggplot() +
  geom_point(aes(x = test$charges, y = yhat.bag)) +
  geom_abline() +
  labs(x = "Actual charges", y = "Bagging estimate of charges")

round(mean((yhat.bag - test$charges)^2), 2)

# Random forest
set.seed(1)
rf.ML_data = randomForest(charges ~ age + bmi + children, data = train, mtry = 1, importance = TRUE, do.trace = 100) # do.trace gives you the OOB MSE for every 100 trees

yhat.rf = predict(rf.ML_data, newdata = test)

round(mean((yhat.rf - test$charges)^2), 2)

plot(rf.ML_data)

importance(rf.ML_data)

# Boosting

library(gbm)
library(MASS)
library(dplyr)
library(ggplot2)

boost.ML_data = gbm(charges ~ age + bmi + children, data = train, distribution = "gaussian", n.trees = 5000, interaction.depth = 4)

summary(boost.ML_data)

# XGBoost
library(xgboost)
Y_train <- as.matrix(train[, "charges"])
X_train <- as.matrix(train[!names(train) %in% c("charges")])
dtrain <- xgb.DMatrix(data = X_train, label = Y_train)

X_test <- as.matrix(test[!names(train) %in% c("charges")])

# Part 5: neural network
library(keras)
library(ISLR)
library(dplyr)

set.seed(1) # we are going to change the seed later to see what happens! It's '1' for now.
train = ML_data %>% sample_frac(.7)
test = ML_data %>% setdiff(train)

train_labels <- as.matrix(train[, "charges"]) # This is the Y variable
train_data <- as.matrix(train[!names(train) %in% c("charges")]) # These are the X variables
test_data <- as.matrix(test[!names(train) %in% c("charges")]) # These are the X variables
test_labels <- as.matrix(test[, "charges"]) # This is the Y variable

paste0("Training entries: ", length(train_data), ", labels: ", length(train_labels))

train_data[1:2, ]