Medical costs incurred by insurance companies are constantly changing. Our nation faces a medical cost crisis: treatment is relatively expensive and the service provided is subpar. Medical providers charge insurance companies billions of dollars for treatment, yet two individuals with the same health condition are rarely charged the same amount, because there is no standard pricing system in place and charges are largely subjective. This report takes a dataset of medical charges and patient data and applies several recent machine learning models to predict medical charges from patient information.
The dataset covers medical costs charged to an insurance company. There are 1338 observations of 7 variables: 1 dependent (charges) and 6 independent (age, BMI, sex, smoker, children, region). Charges is the amount the insurer is billed for medical services, a continuous numerical value. Age and children are integers, and BMI is a continuous numerical value. Sex and smoker are binary (male or female, yes or no). Region is a categorical variable for location. I expect all 6 independent variables to affect the amount charged.
I expect age, BMI and smoker status to have a positive effect on charges. I believe the data will support my assumption that as individuals get older they tend to get sick more often and therefore require more medical care. BMI is body mass index, and I expect it to have the highest correlation with charges: a high BMI means an individual's health is outside the recommended range, so the higher the BMI, the more medical care that individual will need and the more it will cost. Smoking damages health; smokers are at higher risk of cancer and other diseases. I am unsure how sex will affect charges. Women tend to outlive men, which might mean that men experience more medical issues throughout their lives, but it could also mean that women are more likely to go to the doctor while men avoid it.
##       age             sex              bmi           children    
##  Min.   :18.00   Min.   :0.0000   Min.   :15.96   Min.   :0.000  
##  1st Qu.:27.00   1st Qu.:0.0000   1st Qu.:26.30   1st Qu.:0.000  
##  Median :39.00   Median :0.0000   Median :30.40   Median :1.000  
##  Mean   :39.21   Mean   :0.4948   Mean   :30.66   Mean   :1.095  
##  3rd Qu.:51.00   3rd Qu.:1.0000   3rd Qu.:34.69   3rd Qu.:2.000  
##  Max.   :64.00   Max.   :1.0000   Max.   :53.13   Max.   :5.000  
##      smoker          charges     
##  Min.   :0.0000   Min.   : 1122  
##  1st Qu.:0.0000   1st Qu.: 4740  
##  Median :0.0000   Median : 9382  
##  Mean   :0.2048   Mean   :13270  
##  3rd Qu.:0.0000   3rd Qu.:16640  
##  Max.   :1.0000   Max.   :63770  
The table above summarizes the data: for each variable it lists the range, quartiles and mean, which shows what type of data we are working with. Sex and smoker are coded 0/1, so their means are the shares of observations coded 1 (about 49% and 20%).
This graph plots age against charges. It indicates a positive correlation between age and charges, although a modest one: in the regression below, age alone explains only about 9% of the variation in charges.
This graph plots BMI against charges. It indicates a positive correlation between BMI and charges, but not a strong one: in the regression below, age, BMI and sex together explain only about 12% of the variation in charges.
This graph plots the number of children against charges. There does not seem to be any correlation between these two variables.
I ran multiple linear regressions using the same Y but different combinations of X. I used adjusted R² as the comparative metric because it measures how much of the variation in Y the model explains while penalizing additional predictors. The X's that best explain Y are age, bmi and the binary variable smoker. This linear regression accounts for nearly 75% of the variation in Y, which makes it a strong regression model. Age and BMI (body mass index) are both relevant: as we age and get more out of shape, we can expect our medical bills to increase. Smoking does the most harm to our bodies, and the data bears this out.
##
## Call:
## lm(formula = charges ~ age, data = Copy_of_ML_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8059 -6671 -5939 5440 47829
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3165.9 937.1 3.378 0.000751 ***
## age 257.7 22.5 11.453 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11560 on 1336 degrees of freedom
## Multiple R-squared: 0.08941, Adjusted R-squared: 0.08872
## F-statistic: 131.2 on 1 and 1336 DF, p-value: < 2.2e-16
As the summary above shows, age alone explains only roughly 9% of the variation in charges (Y).
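Since adjusted R² is the comparison metric used throughout, the following minimal sketch shows how it is derived from the Multiple R-squared, the sample size and the number of predictors. The function name `adjusted_r2` is a hypothetical helper; the R-squared inputs are copied from the `lm()` summaries in this report (0.08941 for the age-only model and 0.7475 for the age + bmi + smoker model).

```r
# Adjusted R-squared from R-squared, sample size n and number of predictors p.
# adjusted_r2 is an illustrative helper, not part of the original analysis.
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 1338  # observations in the dataset

# Multiple R-squared values copied from the lm() summaries in this report.
adj_age            <- adjusted_r2(0.08941, n, p = 1)  # charges ~ age
adj_age_bmi_smoker <- adjusted_r2(0.7475,  n, p = 3)  # charges ~ age + bmi + smoker

print(round(c(age = adj_age, age_bmi_smoker = adj_age_bmi_smoker), 4))
```

Because the penalty grows with the number of predictors, adjusted R² only rises when a new variable explains enough extra variation to justify its inclusion, which is why it is a fairer way to compare models of different sizes than plain R².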
##
## Call:
## lm(formula = charges ~ age + bmi + sex, data = Copy_of_ML_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14974 -7073 -5072 6953 47348
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5642.35 1779.00 -3.172 0.00155 **
## age 243.19 22.28 10.917 < 2e-16 ***
## bmi 327.54 51.37 6.377 2.49e-10 ***
## sex -1344.46 622.66 -2.159 0.03101 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11370 on 1334 degrees of freedom
## Multiple R-squared: 0.1203, Adjusted R-squared: 0.1183
## F-statistic: 60.78 on 3 and 1334 DF, p-value: < 2.2e-16
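We can see that sex is the weakest of the three predictors: its coefficient is significant only at the 5% level (p = 0.031), and the group coded 1 is charged about $1,344 less on average once age and BMI are held fixed. Adding BMI and sex raises adjusted R² only from about 0.09 to about 0.12, so the difference in charges between males and females is small relative to the overall variation. This finding is very interesting.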
##
## Call:
## lm(formula = charges ~ age + bmi + smoker, data = Copy_of_ML_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12415.4 -2970.9 -980.5 1480.0 28971.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11676.83 937.57 -12.45 <2e-16 ***
## age 259.55 11.93 21.75 <2e-16 ***
## bmi 322.62 27.49 11.74 <2e-16 ***
## smoker 23823.68 412.87 57.70 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6092 on 1334 degrees of freedom
## Multiple R-squared: 0.7475, Adjusted R-squared: 0.7469
## F-statistic: 1316 on 3 and 1334 DF, p-value: < 2.2e-16
## [1] 37005396
The linear regression model above is the best one I was able to create; it explains nearly 75% of the variation in charges. Smoker status has by far the strongest effect: the two models without it explain only about 9% and 12% of the variation. The coefficients imply that each additional year of age increases charges by about $260, each additional unit of BMI increases charges by about $323, and being a smoker increases the estimated charge by nearly twenty-four thousand dollars. The intercept and all three coefficients are significantly different from zero (p < 2e-16), and the F-statistic (1316 on 3 and 1334 DF) shows that the model as a whole explains significantly more than an intercept-only model. MSE = 37005396.
The following graph plots predicted Y against actual Y (yhat vs. y). The points follow the upward-sloping diagonal, which shows that the model's predictions track the actual charges well.
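To make the coefficients concrete, here is a small worked example that plugs them into the fitted equation by hand. The coefficients are copied (rounded) from the summary above; the helper `predict_charges` and the example patient (age 40, BMI 30) are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the charges ~ age + bmi + smoker summary above.
coefs <- c(intercept = -11676.83, age = 259.55, bmi = 322.62, smoker = 23823.68)

# predict_charges is an illustrative helper: it evaluates the fitted line by hand.
predict_charges <- function(age, bmi, smoker) {
  unname(coefs["intercept"] + coefs["age"] * age +
           coefs["bmi"] * bmi + coefs["smoker"] * smoker)
}

# Hypothetical patient: 40 years old with a BMI of 30.
non_smoker_charge <- predict_charges(age = 40, bmi = 30, smoker = 0)  # about 8383.77
smoker_charge     <- predict_charges(age = 40, bmi = 30, smoker = 1)  # about 32207.45

print(c(non_smoker = non_smoker_charge, smoker = smoker_charge))
```

The two predictions differ by exactly the smoker coefficient, which is what "holding age and BMI fixed" means in this model.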
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11676.8304 937.56870 -12.45437 9.208145e-34
## age 259.5475 11.93418 21.74825 5.241160e-90
## bmi 322.6151 27.48741 11.73683 2.418558e-30
## smoker 23823.6845 412.86668 57.70309 0.000000e+00
This table shows the estimates and standard errors for the model. All four estimates are significant; smoker has the largest t value (57.7), and its p-value is small enough to print as 0.
The following charts show the frequency of the residuals, which indicates how much variance there is between the predicted and actual data. As the graph shows, most of the residuals are centered around 0, which is another indicator that this is the best model.
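The residual standard error (6092 on 1334 degrees of freedom) and the MSE (37005396) reported for the age + bmi + smoker model are two views of the same residuals. The sketch below, which assumes the MSE was computed as the mean squared residual over all 1338 observations, recovers one from the other and also converts the MSE to a root mean squared error in dollars. All variable names are illustrative.

```r
# Figures copied from the age + bmi + smoker output in this report.
rse          <- 6092      # residual standard error
df_resid     <- 1334      # residual degrees of freedom
n            <- 1338      # number of observations
reported_mse <- 37005396  # MSE printed after the model summary

# RSE^2 = RSS / df, while MSE = RSS / n (assumed definition), so:
rss          <- rse^2 * df_resid
mse_from_rse <- rss / n

# Relative gap between the reconstructed and reported MSE
# (small, because the RSE is printed rounded to the nearest dollar).
rel_diff <- abs(mse_from_rse - reported_mse) / reported_mse

# Root mean squared error: the typical size of a prediction error in dollars.
rmse <- sqrt(reported_mse)

print(c(mse_from_rse = mse_from_rse, rel_diff = rel_diff, rmse = rmse))
```

Reading the error as an RMSE of roughly six thousand dollars is often easier to interpret than the raw MSE, since it is on the same scale as the charges themselves.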
## [1] 1078.513
## [1] 35277049
## (Intercept)         age         sex         bmi    children      smoker 
##  -3471.0868    188.3051      0.0000    164.3036      0.0000  21098.0114 
Using cross-validation, I tuned the penalty parameter lambda, which controls the flexibility of the model. The lambda that minimizes the cross-validated MSE is 1078.513. Because the lambda values span a very large range, the plots use the log of lambda to make it easier to see what is going on. In the coefficients above, sex and children are exactly zero; that is the behavior of the lasso penalty (alpha = 1 in glmnet), since a ridge penalty shrinks coefficients toward zero without setting them to zero.
We can see from both graphs that the best log of lambda is between 8 and 8.5, because this minimizes the MSE and is also where the coefficients change under the lambda chosen by CV. The ridge model has a lower MSE than the OLS model: 35277049 for ridge versus 37005396 for OLS, which indicates that ridge is the better model for predicting medical costs.
After building a regression tree and pruning it to 5 splits, the result was what I expected: smoker status is the main split driving the cost difference. For smokers the next splits are on BMI and then age, while for non-smokers the main factor is age. This model performs far worse than both Ridge/Lasso and OLS: the MSE for the pruned tree is 274498810, compared with 35277049 for ridge and 37005396 for OLS. The main reason is that it is a single decision tree with only 5 splits, and a single tree has high variance.
Bagging, which fits many decision trees on bootstrap samples and averages their predictions, is expected to have a lower MSE than the other models. As the bag.ML graph (out of bag) shows, the error drops steeply at first and levels off at around 75 trees. Meanwhile, in the graph of bagging predictions against actual values, the predictions continue to regress toward the mean. Bagging decreases our MSE to 25971680: because it aggregates many decision trees, it lowers the variance of the prediction. Bagging's MSE is roughly ten times lower than the pruned tree's (25971680 versus 274498810), which is a substantial improvement.
This importance matrix shows how strongly each variable contributes to the model. As expected, smoker is the highest, and the difference between age and bmi is minimal.
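To illustrate what the penalty parameter does, here is a minimal sketch of ridge regression in closed form on a small made-up dataset (not the insurance data). The helper `ridge_coef` and the toy vectors are hypothetical, and glmnet scales its objective differently, so the lambda used here is not comparable to the 1078.513 reported above.

```r
# Toy data (hypothetical, not the insurance data): two correlated predictors.
x1 <- c(1, 2, 3, 4, 5, 6, 7, 8)
x2 <- c(2, 1, 4, 3, 6, 5, 8, 7)
y  <- c(3, 4, 8, 8, 12, 12, 17, 16)

X  <- scale(cbind(x1, x2))  # centre and scale the predictors
yc <- y - mean(y)           # centre the response so no intercept is needed

# Closed-form ridge estimate: solve (X'X + lambda * I) beta = X'y.
# With lambda = 0 this is ordinary least squares.
ridge_coef <- function(X, y, lambda) {
  p <- ncol(X)
  drop(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))
}

beta_ols   <- ridge_coef(X, yc, lambda = 0)
beta_ridge <- ridge_coef(X, yc, lambda = 10)

print(rbind(ols = beta_ols, ridge = beta_ridge))
```

The ridge coefficients are pulled toward zero relative to OLS but none of them becomes exactly zero; an exact zero, as for sex and children in the coefficient output above, is the signature of a lasso penalty instead.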
## Tree |       MSE  %Var(y) |
##   30 | 2.538e+07    17.67 |
##   40 | 2.574e+07    17.92 |
##   50 | 2.528e+07    17.60 |
##   60 | 2.532e+07    17.62 |
This shows that the out-of-bag error reaches its minimum after roughly 58 trees.
The MSE for Random Forest is 27305889, slightly worse than bagging, though still much better than ridge and OLS. The importance values for each variable are lower than in the bagging model, as the last importance plot shows.
## var rel.inf
## bmi bmi 43.14103
## smoker smoker 30.24228
## age age 26.61669
This boosting model changes the relative influence ranking entirely compared with Random Forest and Bagging: bmi now has the strongest relative influence in this model instead of smoker.
The top three graphs are partial dependence plots for these variables. These plots illustrate the marginal effect of each selected variable on the response after integrating out the other variables.
The boosted test predictions (yhat boost) are relatively close to the actual values. The MSE of this model, 23452311, is lower than that of all the other models: it performs a bit better than Bagging (25971680) and Random Forest (27305889), and there is a large gap between Boosting and both ridge (35277049) and OLS (37005396).
With XGBoost the only relatively important variable is smoker. The MSE is 321887457, which as reported is the highest of all the models, well above ridge (35277049) and OLS (37005396). This model performs worse than Bagging (25971680) and Random Forest (27305889). The difference between Boosting (23452311) and XGBoost (321887457) is so large that it makes me question how accurate the relative importance level is.
## [1] "Training entries: 5020, labels: 1004"
## age sex bmi children smoker
## [1,] 46 0 27.6 0 0
## [2,] 45 0 28.7 2 0
The Neural Network model did not run because of a technical problem on my computer: Python repeatedly failed to connect with TensorFlow.
Having explored how machine learning can be used to predict medical costs, I can attest that some models work better than others. One of the main reasons is the bias-variance tradeoff; some models also penalize complexity more strongly than others. After testing medical costs against all the models above, boosting performed best, with the lowest MSE. Ranked by reported MSE, from best to worst: Boosting, Bagging, Random Forest, Ridge, OLS, pruned regression tree, XGBoost.
This report helped me understand the importance of testing data against multiple models in order to get the best result possible, and of continuing to improve our current models. Being able to utilize these models can allow us to better predict the costs that medical providers charge insurance companies. This can be used to assess the risk level of policyholders and to decide what premium to charge based on personal data such as age, BMI and sex.
Overall, the main reason boosting worked best is that, unlike bagging and random forest, it does not rely on bootstrap samples: it builds trees sequentially, each one fitted to the errors left by the previous trees, which allows it to reduce bias relative to OLS and ridge.
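The comparison can be reproduced directly from the MSE figures quoted in this report. The sketch below only sorts those reported numbers and converts them to RMSE; it does not refit any model, and the XGBoost figure is used exactly as reported in the text.

```r
# MSE values copied from the text of this report (not recomputed).
mse <- c(
  "OLS"           = 37005396,
  "Ridge"         = 35277049,
  "Pruned tree"   = 274498810,
  "Bagging"       = 25971680,
  "Random forest" = 27305889,
  "Boosting"      = 23452311,
  "XGBoost"       = 321887457   # as reported in the text
)

ranked <- sort(mse)  # lowest (best) MSE first

comparison <- data.frame(
  model = names(ranked),
  mse   = unname(ranked),
  rmse  = round(sqrt(unname(ranked)))  # typical error in dollars
)
print(comparison)
```

Putting every model on one sorted table makes it harder for an inconsistent ordering to slip through, and the RMSE column expresses each error in dollars rather than squared dollars.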
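The sequential idea behind boosting can be sketched in a few lines of base R. This is a simplified illustration on made-up data, using one-split regression trees (stumps) and squared-error loss; it is not the gbm implementation used in this report, and `fit_stump`, `predict_stump` and the toy data are all hypothetical.

```r
# Toy data (hypothetical): a step-shaped relationship between x and y.
x <- 1:20
y <- ifelse(x <= 7, 5, ifelse(x <= 14, 12, 30))

# Fit the best single-split regression stump to residuals r by least squares.
fit_stump <- function(x, r) {
  xs   <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2  # candidate split points (midpoints)
  best <- NULL
  for (cut in cuts) {
    left  <- mean(r[x <= cut])
    right <- mean(r[x > cut])
    sse   <- sum((r - ifelse(x <= cut, left, right))^2)
    if (is.null(best) || sse < best$sse) {
      best <- list(cut = cut, left = left, right = right, sse = sse)
    }
  }
  best
}

predict_stump <- function(stump, x) {
  ifelse(x <= stump$cut, stump$left, stump$right)
}

n_trees   <- 50
shrinkage <- 0.1                       # learning rate
pred      <- rep(mean(y), length(y))   # start from the overall mean
mse_path  <- numeric(n_trees)

for (m in seq_len(n_trees)) {
  resid <- y - pred                    # errors left by the trees so far
  stump <- fit_stump(x, resid)         # next tree is fitted to those errors
  pred  <- pred + shrinkage * predict_stump(stump, x)
  mse_path[m] <- mean((y - pred)^2)    # training MSE after m trees
}

print(round(mse_path[c(1, 10, 50)], 3))
```

Each tree sees the full data rather than a bootstrap sample and only has to correct what the earlier trees got wrong, so the training error falls step by step; the shrinkage factor keeps each correction small so that no single tree dominates.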
library(readxl)
library(tree)
library(ISLR)
library(dplyr)
library(ggplot2)
library(MASS)

# Copy_of_ML_data is the data frame imported with read_excel(); the file path is not shown here.
ML_data = Copy_of_ML_data
names(ML_data)
summary(ML_data)

# Scatter plots of each numeric predictor against charges
plot(ML_data$age, ML_data$charges, xlab = "Age", ylab = "Charges", main = "Age Vs Charges")
plot(ML_data$bmi, ML_data$charges, xlab = "BMI", ylab = "Charges", main = "BMI Vs Charges")
plot(ML_data$children, ML_data$charges, xlab = "Children", ylab = "Charges", main = "Children Vs Charges")

# Part 2: regressions
lm.fit = lm(charges ~ age, data = ML_data)
summary(lm.fit)
lm.fit = lm(charges ~ age + bmi, data = ML_data)
summary(lm.fit)
lm.fit = lm(charges ~ age + bmi + smoker, data = ML_data)
summary(lm.fit)
lm.fit = lm(charges ~ age + bmi + smoker + children + region, data = ML_data)
summary(lm.fit)

# Best regression
lm.fit = lm(charges ~ age + bmi + smoker, data = ML_data)

# Plot y vs y hat: blue points are predictions, the red line is y = x
ggplot() +
  geom_point(aes(x = ML_data$charges, y = predict(lm.fit)), colour = "blue") +
  geom_line(aes(x = ML_data$charges, y = ML_data$charges), colour = "red") +
  ggtitle("Predicted charges vs actual charges") +
  xlab("actual y") + ylab("predicted y")

# Standard errors
lm.summary = summary(lm.fit)
lm.summary$coefficients

# Residuals against fitted values
ggplot(ML_data, aes(x = residuals(lm.fit), y = predict(lm.fit))) +
  geom_point(color = "blue", size = .1) +
  labs(y = "fitted values", x = "residuals")

# Histogram of residuals (binwidth in dollars; widened from 10 so the bars are visible)
ggplot(ML_data) +
  labs(y = "Frequency", x = "Residuals") +
  geom_histogram(aes(x = residuals(lm.fit)), binwidth = 1000, colour = "grey")

# Part 3: ridge and lasso
library(glmnet)
library(dplyr)
library(ggplot2)

sum(is.na(ML_data))  # check for missing values

x = model.matrix(charges ~ ., ML_data)[, -1]  # trim off the first column (intercept)
y = ML_data$charges

grid = 10^seq(10, -2, length = 100)
ridge_mod = glmnet(x, y, alpha = 0, lambda = grid)
dim(coef(ridge_mod))
ridge_mod$lambda[2]  # Display second lambda value used
coef(ridge_mod, s = 50)

# Train/test split (same split as used in part 5)
set.seed(1)
train = ML_data %>% sample_frac(.7)
test = ML_data %>% setdiff(train)

# Both x_train and x_test should be in matrix format
x_train = model.matrix(charges ~ ., train)[, -1]
x_test = model.matrix(charges ~ ., test)[, -1]
y_train = train$charges
y_test = test$charges

ridge_mod = glmnet(x_train, y_train, alpha = 0, lambda = grid)
ridge_pred = predict(ridge_mod, s = 4, newx = x_test)  # predict using the trained model and x_test
mean((ridge_pred - y_test)^2)  # test MSE

set.seed(1)
cv.out = cv.glmnet(x_train, y_train, alpha = 0)  # Cross-validate ridge regression on the training data
bestlam = cv.out$lambda.min  # Select lambda that minimizes cross-validated MSE
bestlam

plot(cv.out)  # Draw plot of cross-validated MSE as a function of log(lambda)

ridge_pred = predict(ridge_mod, s = bestlam, newx = x_test)  # Use best lambda to predict test data
mean((ridge_pred - y_test)^2)  # Calculate test MSE

out = glmnet(x, y, alpha = 1)  # alpha = 1 fits the lasso on the FULL dataset (alpha = 0 would be ridge)
predict(out, type = "coefficients", s = bestlam)  # Display coefficients using lambda chosen by CV

plot(out, xvar = "lambda")

# Part 2 of report 3: regression tree
library(tree)
library(ISLR)
library(dplyr)
library(ggplot2)
library(MASS)

tree_ML = tree(charges ~ age + bmi + children, train)
summary(tree_ML)

plot(tree_ML)
text(tree_ML, pretty = 0)

cv.ML = cv.tree(tree_ML)
prune_ML = prune.tree(tree_ML, best = 6)  # 6 terminal nodes, i.e. 5 splits
plot(prune_ML)
text(prune_ML, pretty = 0)
title("Charges Regression Tree")

single_tree_estimate = predict(prune_ML, newdata = test)

ggplot() +
  geom_point(aes(x = test$charges, y = single_tree_estimate), color = "blue") +
  geom_abline(color = "red") +
  labs(x = "actual charges (test subset)", y = "predicted charges")

mean((single_tree_estimate - test$charges)^2)  # test MSE of the pruned tree

# Linear model on the same predictors, for comparison
lm_ML = lm(charges ~ age + bmi + children, train)
lm_estimate = predict(lm_ML, newdata = test)
mean((lm_estimate - test$charges)^2)

# Part 4: bagging and random forest
library(MASS)
library(randomForest)
library(dplyr)
library(ggplot2)

# Bagging
ls(ML_data)
summary(ML_data)

set.seed(1)
# Bagging is a random forest that considers all predictors (3 here) at every split
bag.ML = randomForest(charges ~ age + bmi + children, data = train, mtry = 3, importance = TRUE)
bag.ML

plot(bag.ML)  # out-of-bag error against number of trees

yhat.bag = predict(bag.ML, newdata = test)

ggplot() +
  geom_point(aes(x = test$charges, y = yhat.bag)) +
  geom_abline() +
  labs(x = "Actual charges", y = "Bagging estimate of charges")

round(mean((yhat.bag - test$charges)^2), 2)  # test MSE

# Random forest
set.seed(1)
rf.ML_data = randomForest(charges ~ age + bmi + children, data = train, mtry = 1, importance = TRUE, do.trace = 100)  # do.trace gives you the OOB MSE for every 100 trees

yhat.rf = predict(rf.ML_data, newdata = test)

round(mean((yhat.rf - test$charges)^2), 2)  # test MSE

plot(rf.ML_data)

importance(rf.ML_data)

# Boosting
library(gbm)
library(MASS)
library(dplyr)
library(ggplot2)

boost.ML_data = gbm(charges ~ age + bmi + children, data = train, distribution = "gaussian", n.trees = 5000, interaction.depth = 4)

summary(boost.ML_data)  # relative influence of each variable

# XGBoost: build the training and test matrices
library(xgboost)
Y_train <- as.matrix(train[, "charges"])
X_train <- as.matrix(train[!names(train) %in% c("charges")])
dtrain <- xgb.DMatrix(data = X_train, label = Y_train)

X_test <- as.matrix(test[!names(test) %in% c("charges")])

# Part 5: neural network data preparation
library(keras)
library(ISLR)
library(dplyr)

set.seed(1)  # we are going to change the seed later to see what happens! It's '1' for now.
train = ML_data %>% sample_frac(.7)
test = ML_data %>% setdiff(train)  # the remaining rows of ML_data form the test set

train_labels <- as.matrix(train[, "charges"])                   # This is the Y variable
train_data <- as.matrix(train[!names(train) %in% c("charges")])  # These are the X variables
test_data <- as.matrix(test[!names(test) %in% c("charges")])     # These are the X variables
test_labels <- as.matrix(test[, "charges"])                     # This is the Y variable

paste0("Training entries: ", length(train_data), ", labels: ", length(train_labels))

train_data[1:2, ]