In this document, we explore insights from an analytic point of view and report the process of building predictive machine learning models that can be used to estimate a person’s medical cost (As far as the data set used is concerned). The data set used is named “Medical Cost Personal Datasets” and can be viewed in Kaggle using this link.
To begin with, we install the packages and load the respective libraries used in the subsequent analyses throughout this notebook.
Exploring the data thoroughly and having a firm grip on its context is vital for any kind of analysis. Since everything following is built on top of our understanding about this particular data, studying the data carefully is vital to moving further.
After running the code chunk we see that data consists of 7 columns — age, sex, bmi, children, smoker, region, and charges. The first 6 columns are the attributes of the individuals while the last column (charges) shows the medical costs billed by health insurance. After making sure there are no n/a values in the data set and checking if the column’s data types are correctly recorded to explore data further, some graphics can be very convenient. These univariate analyses inform us about how the variables are distributed, helping us make better sense of the matter at hand. For instance, we can easily notice that the majority of the insured have no children or that the majority of the charges are concentrated in the range from 0 to 20000.
head(insurance_data)
## age sex bmi children smoker region charges
## 1 19 female 27.900 0 yes southwest 16884.924
## 2 18 male 33.770 1 no southeast 1725.552
## 3 28 male 33.000 3 no southeast 4449.462
## 4 33 male 22.705 0 no northwest 21984.471
## 5 32 male 28.880 0 no northwest 3866.855
## 6 31 female 25.740 0 no southeast 3756.622
sum(is.na.data.frame(insurance_data))
## [1] 0
summary(insurance_data)
## age sex bmi children smoker
## Min. :18.00 female:662 Min. :15.96 Min. :0.000 no :1064
## 1st Qu.:27.00 male :676 1st Qu.:26.30 1st Qu.:0.000 yes: 274
## Median :39.00 Median :30.40 Median :1.000
## Mean :39.21 Mean :30.66 Mean :1.095
## 3rd Qu.:51.00 3rd Qu.:34.69 3rd Qu.:2.000
## Max. :64.00 Max. :53.13 Max. :5.000
## region charges
## northeast:324 Min. : 1122
## northwest:325 1st Qu.: 4740
## southeast:364 Median : 9382
## southwest:325 Mean :13270
## 3rd Qu.:16640
## Max. :63770
par(mfrow=c(2,2))
hist(insurance_data$age,
breaks = sqrt(nrow(insurance_data)), main = "Distribution of Age", xlab = "Age")
hist(insurance_data$bmi,
breaks = sqrt(nrow(insurance_data)), main = "Distribution of BMI", xlab = "BMI")
hist(insurance_data$children, main = "Number of children Covered", xlab = "Children")
hist(insurance_data$charges,
breaks = sqrt(nrow(insurance_data)), main = "Individual medical charges billed ", xlab = "Charges")
p1<-ggplot(insurance_data) + geom_bar(aes(x=sex)) + labs(title ="Gender Distribution") + theme_minimal()
p2<-ggplot(insurance_data) + geom_bar(aes(x=smoker)) + labs(title = "Smoker/ Non-Smoker Distribution")+ theme_minimal()
p3<-ggplot(insurance_data) + geom_bar(aes(x=region))+ labs(title = "Distribution by Region")+ theme_minimal()
(p1 +p2)/p3
Box-plots explain the variability present in the data also indicating the outliers that might pose an issue to our models especially to the linear ones.
p4<-ggplot(insurance_data) +
aes(x = "", y = age) +
geom_boxplot(fill = "#0c4c8a") +
theme_minimal()
p5<-ggplot(insurance_data) +
aes(x = "", y = bmi) +
geom_boxplot(fill = "#0c4c8a") +
theme_minimal()
p6<-ggplot(insurance_data) +
aes(x = "", y = children) +
geom_boxplot(fill = "#0c4c8a") +
theme_minimal()
p7<-ggplot(insurance_data) +
aes(x = "", y = charges) +
geom_boxplot(fill = "#0c4c8a") +
theme_minimal()
(p4 + p5)/(p6+p7)
With the help of a few graphs/charts, we can explore dimensions or interrelationships which might not be very visible at first sight but may hold important information.
For example from the graphs below, we can understand that charges increase by age or that smoking almost doubles the charges for an individual.
Also, we have to make sure that there is no significant correlation between independent variables that will be used as input into the models. Just like it might be seen from the last two entries, results show no significant correlation between variables in our example.
ggplot(insurance_data, aes(x=bmi, y= charges, color=age)) + geom_point()
ggplot(insurance_data, aes(x=age, y= charges, color=bmi)) + geom_point()
ggplot(insurance_data, aes(x=smoker, y= charges, color=sex)) + geom_boxplot()
numeric_variables_for_cor <- data.frame(age, bmi, children) %>% cor() %>%
corrplot(method = "pie", type="lower")
categoric_variables_for_cor <- data.frame(sex, smoker, region) %>%
GKtauDataframe()
print.table(categoric_variables_for_cor)
## sex smoker region
## sex 2.000 0.006 0.000
## smoker 0.006 2.000 0.002
## region 0.000 0.005 4.000
Let’s try to fit a multivariate linear regression model without making any changes to the data.
Using the default diagnostic tools available in R we conclude that the model does not satisfy normality assumptions and the plots are pretty disturbing in the means of proper model fitting.
This can also be confirmed using the package gvlma to automatically check if the assumptions are satisfied.
mrm <- lm(charges ~ age + sex + bmi + children + region, data = insurance_data)
summary(mrm)
##
## Call:
## lm(formula = charges ~ age + sex + bmi + children + region, data = insurance_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14768 -6966 -4918 6738 47284
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6525.80 1840.87 -3.545 0.000406 ***
## age 242.25 22.27 10.878 < 2e-16 ***
## sexmale 1311.61 621.52 2.110 0.035018 *
## bmi 314.60 53.53 5.877 5.28e-09 ***
## children 555.69 257.96 2.154 0.031406 *
## regionnorthwest -1025.98 891.33 -1.151 0.249916
## regionsoutheast 69.99 895.41 0.078 0.937712
## regionsouthwest -1603.32 894.46 -1.792 0.073280 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11350 on 1330 degrees of freedom
## Multiple R-squared: 0.1264, Adjusted R-squared: 0.1218
## F-statistic: 27.5 on 7 and 1330 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(mrm)
mean(mrm$residuals)
## [1] 4.652024e-13
install.packages("gvlma")
library(gvlma)
gvlma::gvlma(mrm)
##
## Call:
## lm(formula = charges ~ age + sex + bmi + children + region, data = insurance_data)
##
## Coefficients:
## (Intercept) age sexmale bmi
## -6525.80 242.25 1311.61 314.60
## children regionnorthwest regionsoutheast regionsouthwest
## 555.69 -1025.98 69.99 -1603.32
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma::gvlma(x = mrm)
##
## Value p-value Decision
## Global Stat 596.76086 0.0000 Assumptions NOT satisfied!
## Skewness 506.17542 0.0000 Assumptions NOT satisfied!
## Kurtosis 89.60758 0.0000 Assumptions NOT satisfied!
## Link Function 0.90452 0.3416 Assumptions acceptable.
## Heteroscedasticity 0.07334 0.7865 Assumptions acceptable.
By now it’s clear that we have to transform the data in order to deal with the skewed distributions present in the variables. There are many ways that an analyst can use to deal with problems of this nature but often the most reliable solution is taking the natural logarithm of the data at hand.
As a demonstration of this, we tried different ways of data transformation and confirmed that even in our case natural logarithm transformation is the most appropriate choice. Below we can see the variable Charges transformed to normally distributed one as we take its natural logarithm.
par(mfrow=c(2,2))
hist(charges, main = "Untransformed Charges", xlab ="Charges" )
hist(log(charges + 1), main = "Log transformation", xlab ="Log-Charges")
hist(sqrt(charges + 1), main ="SQRT transformation", xlab ="SQRT-Charges")
hist(scale(charges + 1), main ="Scaled Charges", xlab ="Scaled Charges")
Here we transform the data using log function available in R
#lets normalize data using logarithmic function
log_insurance_data <- insurance_data
log_insurance_data$age <- log(insurance_data$age + 1) #here we add 1 to avoid complications with values that are 0
log_insurance_data$bmi <- log(insurance_data$bmi + 1)
log_insurance_data$children <- log(insurance_data$children + 1)
log_insurance_data$charges <- log(insurance_data$charges + 1)
Now, we build another multivariate linear regression model, but this time using the data transformed with natural logarithm.
Judging only by the R-square value, if we compare the first model with the latter one, there is an astonishing improvement from 12% at the first model to 76.7% to this second. To better visualize the model’s performance we graphed a part of the predicted and actual values toward an X-index demonstrating the good performance of the model. But still there is an issue. The model does not meet the assumptions, therefore the results in other unseen data points might be highly unreliable.
log_model <- lm(charges ~ age+ sex+ bmi+ children+ smoker + region, data = log_insurance_data)
summary(log_model)
##
## Call:
## lm(formula = charges ~ age + sex + bmi + children + smoker +
## region, data = log_insurance_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.88874 -0.22720 -0.08146 0.08731 2.21245
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.61522 0.23800 10.988 < 2e-16 ***
## age 1.28035 0.03238 39.545 < 2e-16 ***
## sexmale -0.07495 0.02433 -3.080 0.002110 **
## bmi 0.45074 0.06492 6.943 5.99e-12 ***
## children 0.17895 0.02184 8.194 5.89e-16 ***
## smokeryes 1.55097 0.03020 51.361 < 2e-16 ***
## regionnorthwest -0.06743 0.03482 -1.937 0.052979 .
## regionsoutheast -0.16051 0.03494 -4.595 4.75e-06 ***
## regionsouthwest -0.13222 0.03495 -3.783 0.000162 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.443 on 1329 degrees of freedom
## Multiple R-squared: 0.7692, Adjusted R-squared: 0.7678
## F-statistic: 553.5 on 8 and 1329 DF, p-value: < 2.2e-16
library(caret)
gvlma::gvlma(log_model)
##
## Call:
## lm(formula = charges ~ age + sex + bmi + children + smoker +
## region, data = log_insurance_data)
##
## Coefficients:
## (Intercept) age sexmale bmi
## 2.61522 1.28035 -0.07495 0.45074
## children smokeryes regionnorthwest regionsoutheast
## 0.17895 1.55097 -0.06743 -0.16051
## regionsouthwest
## -0.13222
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma::gvlma(x = log_model)
##
## Value p-value Decision
## Global Stat 2134.6808 0.0000 Assumptions NOT satisfied!
## Skewness 759.0545 0.0000 Assumptions NOT satisfied!
## Kurtosis 1188.0604 0.0000 Assumptions NOT satisfied!
## Link Function 187.1535 0.0000 Assumptions NOT satisfied!
## Heteroscedasticity 0.4124 0.5208 Assumptions acceptable.
par(mfrow=c(2,2))
plot(log_model)
predict(log_model, data.frame(log_insurance_data[1:299,-7])) ->log_model_predicted_charges
actual_charges <- log_insurance_data[1:299,7]
log_model_actual_vs_predicted <- data.frame(actual_charges,log_model_predicted_charges)
par(mfrow=c(1,1))
plot( y=log_model_actual_vs_predicted$actual_charges, x=1:299, pch = 19 , col = "blue",ylab = "Actual<>Predicted", xlab = "Index", main = "Comparing Actual values(blue) with the Predicted (Red)")
points(log_model_actual_vs_predicted$log_model_predicted_charges, col= "red")
RMSE(actual_charges,log_model_predicted_charges)
## [1] 0.4232411
Even though the improvement made to the model was considerably good, further attempts can be taken to produce more reliable and serious models.
For this reason, we decided to remove outliers since it can be very helpful while attempting to create more compact models.
#identifying the rows with outliers
boxplot.stats(insurance_data$charges)$out -> outlier_charges
boxplot.stats(insurance_data$bmi)$out -> outlier_bmi
outlier_bmi_rows <- which(insurance_data$bmi %in% c(outlier_bmi))
outlier_charges_rows <- which(insurance_data$charges %in% c(outlier_charges))
insurance_cleaned <-insurance_data[c(-outlier_bmi_rows,-outlier_charges_rows),]
par(mfrow=c(2,2))
boxplot(charges ,main="Charges Before", col= "#177924")
boxplot(insurance_cleaned$charges,main="Charges After", col="#0c4c8a")
boxplot(bmi, main="Bmi Before", col="#177924")
boxplot(insurance_cleaned$bmi, main="Bmi After", col ="#0c4c8a")
Let’s create a new data frame with transformed and outlier-cleaned variables
transformed_cleaned_insurance_data <- insurance_cleaned
transformed_cleaned_insurance_data$age <- log(transformed_cleaned_insurance_data$age + 1)
transformed_cleaned_insurance_data$bmi <- log(transformed_cleaned_insurance_data$bmi + 1)
transformed_cleaned_insurance_data$children <- log(transformed_cleaned_insurance_data$children + 1)
transformed_cleaned_insurance_data$charges <- log(transformed_cleaned_insurance_data$charges + 1)
After making these important transformations we built another Linear Regression model to assess if there are any improvements. After fitting the model and checking for its diagnostics we can see that model has a lower \(R^2\) value(70%), indicating lower explainability, while the RMSE value is slightly lower meaning the model is doing better at this particular data. But still, the model does not meet the Linear regression assumptions therefore its not reliable.
new_log_model <- lm(charges ~ age+ sex+ bmi+ children+ smoker + region, data = transformed_cleaned_insurance_data)
summary(new_log_model)
##
## Call:
## lm(formula = charges ~ age + sex + bmi + children + smoker +
## region, data = transformed_cleaned_insurance_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.71376 -0.20025 -0.10003 0.04029 2.31836
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.03565 0.25440 11.933 < 2e-16 ***
## age 1.36682 0.03328 41.069 < 2e-16 ***
## sexmale -0.08498 0.02498 -3.402 0.000692 ***
## bmi 0.24050 0.06981 3.445 0.000590 ***
## children 0.18745 0.02241 8.366 < 2e-16 ***
## smokeryes 1.32723 0.04045 32.810 < 2e-16 ***
## regionnorthwest -0.06855 0.03522 -1.947 0.051820 .
## regionsoutheast -0.17887 0.03623 -4.937 9.07e-07 ***
## regionsouthwest -0.16557 0.03579 -4.626 4.14e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4309 on 1184 degrees of freedom
## Multiple R-squared: 0.7086, Adjusted R-squared: 0.7067
## F-statistic: 359.9 on 8 and 1184 DF, p-value: < 2.2e-16
predict(new_log_model, data.frame(transformed_cleaned_insurance_data[1:299,-7])) ->new_log_model_predicted_charges
actual_charges <- transformed_cleaned_insurance_data[1:299,7]
new_log_model_actual_vs_predicted <- data.frame(actual_charges,new_log_model_predicted_charges)
par(mfrow=c(1,1))
plot( y=new_log_model_actual_vs_predicted$actual_charges, x=1:299,ylab = "Actual<>Predicted", xlab = "Index", main = "Comparing Actual values(blue) with the Predicted (Red)", pch = 19 , col = "blue")
points(new_log_model_actual_vs_predicted$new_log_model_predicted_charges, col= "red")
par(mfrow=c(2,2))
plot(new_log_model)
RMSE(actual_charges,new_log_model_predicted_charges)
## [1] 0.4127277
Before we get into other ML techniques to try different models, we first create dummy variables for our factor variables in this way avoiding any potential complications.
#moving on trying other models
library(fastDummies)
set.seed(99)
# Create dummy variable
insurance_dummy <- dummy_cols(transformed_cleaned_insurance_data,
select_columns = c("sex", "smoker", "region"))
insurance_dummy <- select(insurance_dummy, c(-sex,-smoker, -region))
head(insurance_dummy)
## age bmi children charges sex_female sex_male smoker_no smoker_yes
## 1 2.995732 3.363842 0.0000000 9.734236 1 0 0 1
## 2 2.944439 3.548755 0.6931472 7.453882 0 1 1 0
## 3 3.367296 3.526361 1.3862944 8.400763 0 1 1 0
## 4 3.526361 3.165686 0.0000000 9.998137 0 1 1 0
## 5 3.496508 3.397189 0.0000000 8.260455 0 1 1 0
## 6 3.465736 3.286161 0.0000000 8.231541 1 0 1 0
## region_northeast region_northwest region_southeast region_southwest
## 1 0 0 0 1
## 2 0 0 1 0
## 3 0 0 1 0
## 4 0 1 0 0
## 5 0 1 0 0
## 6 0 0 1 0
#data partition
#stackoverflow help rsample
set.seed(99)
library(rsample)
data_split <- initial_split(insurance_dummy, prop = 0.75)
insurance_train <- training(data_split)
insurance_test <- testing(data_split)
After fitting a linear regression model using cross-validation we conclude that this worsens the model’s performance also yielding a large RMSE value indicating a poor performance in the unseen data. This raises the question of the trustworthiness of the \(R^2\) value, is \(R^2\) telling the truth about the model?
At this point, we try to fit different models besides linear ones because non-linear relationships might not be properly modeled by the linear models.
library(caret)
set.seed(99)
crossV_lm_model <- train(charges ~. , method="lm", trControl=trainControl(method = "cv", number = 10), data=insurance_train)
summary(crossV_lm_model$finalModel)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.66118 -0.19965 -0.09530 0.04223 2.30065
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.182893 0.284828 14.686 < 2e-16 ***
## age 1.351315 0.038190 35.384 < 2e-16 ***
## bmi 0.232893 0.079783 2.919 0.003599 **
## children 0.187479 0.026112 7.180 1.48e-12 ***
## sex_female 0.095369 0.028814 3.310 0.000971 ***
## sex_male NA NA NA NA
## smoker_no -1.324333 0.045706 -28.975 < 2e-16 ***
## smoker_yes NA NA NA NA
## region_northeast 0.158545 0.041792 3.794 0.000159 ***
## region_northwest 0.091303 0.040971 2.228 0.026101 *
## region_southeast -0.009745 0.040992 -0.238 0.812153
## region_southwest NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4305 on 885 degrees of freedom
## Multiple R-squared: 0.7125, Adjusted R-squared: 0.7099
## F-statistic: 274.1 on 8 and 885 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(crossV_lm_model$finalModel)
predict(crossV_lm_model, data.frame(insurance_train[1:299,-4])) ->crossV_lm_model_predicted_charges
actual_charges <- insurance_test[,4]
crossV_lm_model_actual_vs_predicted <- data.frame(actual_charges,crossV_lm_model_predicted_charges)
par(mfrow=c(1,1))
index=1:299
plot( y=crossV_lm_model_actual_vs_predicted$actual_charges, x=index, pch = 19 , ylab = "Actual<>Predicted", xlab = "Index", main = "Comparing Actual values(blue) with the Predicted (Red)" , col = "blue")
points(crossV_lm_model_actual_vs_predicted$crossV_lm_model_predicted_charges, col= "red")
RMSE(actual_charges,crossV_lm_model_predicted_charges)
## [1] 1.051001
After trying to fit different linear regression models we realized that these models are not a proper fit for this particular data. Frankly, Linear regression’s assumptions are tough to be met. That’s why we thought it would be proper to use a wider class of regression models named Generalized Linear Models (GLM). This class of models is more flexible therefore allows for situations like non-linearity or non-normal distribution of the residuals.
set.seed(99)
control <- trainControl(method = "cv", number = 10)
glm_model <- train(charges ~ ., method="glm", data=insurance_train)
glm_model
## Generalized Linear Model
##
## 894 samples
## 11 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 894, 894, 894, 894, 894, 894, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.4309079 0.709822 0.2681891
par(mfrow=c(2,2))
plot(glm_model$finalModel)
predict(glm_model,newdata = insurance_test[,-4]) ->glm_predicted_charges
glm_actual_vs_predicted <- data.frame(actual_charges,glm_predicted_charges)
par(mfrow=c(1,1))
plot( y=glm_actual_vs_predicted$actual_charges, x=1:299, pch = 19 , ylab = "Actual<>Predicted", xlab = "Index", main = "Comparing Actual values(blue) with the Predicted (Red)" , col = "blue")
points(glm_actual_vs_predicted$glm_predicted_charges, col= "red")
RMSE(predict(glm_model,newdata = insurance_test[,-4]),insurance_test[,4])
## [1] 0.4327908
With the help of the Caret package, we can change the machine learning algorithm easily without needing to twist the code much. Using this chunk code we use the k-nearest neighbors algorithm to predict the charges in the unseen data. The results are quite impressive.
set.seed(99)
knn_model <- train(charges ~ . , data=insurance_train,
method="knn", tuneGrid = data.frame(k = seq(3, 71, 2)),trace = FALSE, trControl=trainControl(method = "cv", number = 10))
knn_model$finalModel
## 9-nearest neighbor regression model
par(mfrow=c(2,2))
ggplot(knn_model, highlight = TRUE)
predict(knn_model,newdata = insurance_test[,-4]) -> knn_predicted_charges
actual_charges <- insurance_test[,4]
knn_actual_vs_predicted <- data.frame(actual_charges,knn_predicted_charges)
par(mfrow=c(1,1))
index=1:299
plot( y=knn_actual_vs_predicted$actual_charges, x=index, pch = 19 , ylab = "Actual<>Predicted", xlab = "Index", main = "Comparing Actual values(blue) with the Predicted (Red)" , col = "blue")
points(knn_actual_vs_predicted$knn_predicted_charges, col= "red")
RMSE(predict(knn_model,newdata = insurance_test[,-4]), insurance_test[,4])
## [1] 0.4259729
rss <- sum((knn_predicted_charges - actual_charges) ^ 2)
tss <- sum((actual_charges - mean(actual_charges)) ^ 2)
r_square = 1 - (rss/tss)
print(r_square)
## [1] 0.7032965
As a review of the models, we have created let’s see how these models perform in the test data ,and let’s compute their \(R^2\) value manually concerning the test data.
rss <- sum((log_model_predicted_charges - actual_charges) ^ 2)
tss <- sum((actual_charges - mean(actual_charges)) ^ 2)
r_square = 1 - (rss/tss)
print(c("Log model R-square", r_square))
## [1] "Log model R-square" "-1.18856343187918"
RMSE(actual_charges,log_model_predicted_charges)
## [1] 1.156912
rss <- sum((new_log_model_predicted_charges - actual_charges) ^ 2)
tss <- sum((actual_charges - mean(actual_charges)) ^ 2)
r_square = 1 - (rss/tss)
print(c("New-log model R-square", r_square))
## [1] "New-log model R-square" "-0.803973724043065"
RMSE(actual_charges,new_log_model_predicted_charges)
## [1] 1.050354
rss <- sum((crossV_lm_model_predicted_charges - actual_charges) ^ 2)
tss <- sum((actual_charges - mean(actual_charges)) ^ 2)
r_square = 1 - (rss/tss)
print(c("Cross validated LM R-square", r_square))
## [1] "Cross validated LM R-square" "-0.806196181189394"
RMSE(actual_charges,crossV_lm_model_predicted_charges)
## [1] 1.051001
rss <- sum((knn_predicted_charges - actual_charges) ^ 2)
tss <- sum((actual_charges - mean(actual_charges)) ^ 2)
r_square = 1 - (rss/tss)
print( c("Knn R-square", r_square))
## [1] "Knn R-square" "0.703296478023315"
RMSE(predict(knn_model,newdata = insurance_test[,-4]), insurance_test[,4])
## [1] 0.4259729
rss <- sum((glm_predicted_charges - actual_charges) ^ 2)
tss <- sum((actual_charges - mean(actual_charges)) ^ 2)
r_square = 1 - (rss/tss)
print(c("Cross validated Generalized-LM R-square", r_square))
## [1] "Cross validated Generalized-LM R-square"
## [2] "0.693722750903194"
RMSE(predict(glm_model,newdata = insurance_test[,-4]),insurance_test[,4])
## [1] 0.4327908
#importing the ploting function from Github
library(devtools)
source_url('https://gist.githubusercontent.com/fawda123/7471137/raw/466c1474d0a505ff044412703516c34f1a4684a5/nnet_plot_update.r')
library(reshape)
set.seed(99)
rows <- sample(nrow(insurance_train))
insurance_train <- insurance_train[rows, ]
set.seed(99)
grid<- expand.grid(size = seq(from = 1, to = 80, by = 10), decay = seq(from = 0.01, to = 0.4, by = 0.02))
ann_model <-train(charges ~ ., method="nnet", preProcess = c("center", "scale"), data=insurance_train, tuneGrid = grid,trace = FALSE, linout = TRUE)
ggplot(ann_model, highlight = TRUE)
predict(ann_model,newdata = insurance_test[,-4]) ->ann_predicted_charges
ann_actual_vs_predicted <- data.frame(actual_charges,ann_predicted_charges)
index<-1:299
plot( y=ann_actual_vs_predicted$actual_charges, x=index, pch = 19 , col = "blue", ylab = "Actual<>Predicted", xlab = "Index", main = "Comparing Actual values(blue) with the Predicted (Red)")
points(ann_actual_vs_predicted$ann_predicted_charges, col= "red")
RMSE(actual_charges,ann_predicted_charges)
## [1] 0.4057017
plot.nnet(ann_model)
#Now lets calculate its R-square value
rss <- sum((ann_predicted_charges - actual_charges) ^ 2)
tss <- sum((actual_charges - mean(actual_charges)) ^ 2)
r_square = 1 - (rss/tss)
print(c("Cross validated ANN R-square value", r_square))
## [1] "Cross validated ANN R-square value" "0.730863631680642"
Fitting the right model to the data often is described as a combined process of art and science. Here it is emphasized the fact that tuning the parameters and choosing the right algorithms does not always guarantee the most optimal solution (Also often defining “right” can be tough). As per our example, the linear regression model seemed to do well if judged only by the visuals we created for better interpretation. But, when checked for RMSE and \(R^2\) values we see the picture much clearer. Except for the Generalized linear model that performed relatively well the other Linear regressions that we tried to fit didn’t pay off. That might be because GLM’s are more flexible when it comes to the assumptions which were very hard to meet here. Also, we saw how misleading the model’s output can be if the assumptions aren’t met. While checking for the diagnostics after fitting the linear models we also noticed some non-linearity, this can be another important contributor to the linear regression model’s inability to deal with this data. Knn model was also a good performer but it is outpaced by the power of flexibility of the Artificial Neural Networks that had the best performance judged by \(R^2\) (73%) and RMSE value(0.4) altogether.
RPubs - Visual Tests for the Key Assumptions of Multiple Linear Regression. (2017). Retrieved 15 October 2021, from https://rpubs.com/ajdowny_student/300663
Medical Cost Personal Datasets. (2021). Retrieved 15 October 2021, from https://www.kaggle.com/mirichoi0218/insurance
1.3 - The Simple Linear Regression Model | STAT 501. (2021). Retrieved 15 October 2021, from https://online.stat.psu.edu/stat501/lesson/1/1.3
10 Assumptions of Linear Regression - Full List with Examples and Code. (2021). Retrieved 15 October 2021, from http://r-statistics.co/Assumptions-of-Linear-Regression.html
(2021). Retrieved 15 October 2021, from https://www.statisticssolutions.com/wp-content/uploads/wp-post-to-pdf-enhanced-cache/1/assumptions-of-multiple-linear-regression.pdf
Quick-R: Multiple Regression. (2021). Retrieved 15 October 2021, from https://www.statmethods.net/stats/regression.html
(2021). Retrieved 15 October 2021, from https://www.ics.uci.edu/~jutts/110/Lecture3.pdf
Outliers detection in R. (2021). Retrieved 15 October 2021, from https://statsandr.com/blog/outliers-detection-in-r/
Understanding Diagnostic Plots for Linear Regression Analysis | University of Virginia Library Research Data Services + Sciences. (2015). Retrieved 15 October 2021, from https://data.library.virginia.edu/diagnostic-plots/
Models, N. (2018). Nonlinear Regression Essentials in R: Polynomial and Spline Regression Models - Articles - STHDA. Retrieved 15 October 2021, from http://www.sthda.com/english/articles/40-regression-analysis/162-nonlinear-regression-essentials-in-r-polynomial-and-spline-regression-models/
Holtz, Y. (2021). Animated 3d chart with R. Retrieved 15 October 2021, from https://www.r-graph-gallery.com/3-r-animated-cube.html
10 Assumptions of Linear Regression - Full List with Examples and Code. (2021). Retrieved 15 October 2021, from http://r-statistics.co/Assumptions-of-Linear-Regression.html#Check%20Assumptions%20Automatically
Cross-Validation Tutorial | QuantDev Methodology. (2021). Retrieved 15 October 2021, from https://quantdev.ssri.psu.edu/tutorials/cross-validation-tutorial
6.1 - Introduction to Generalized Linear Models | STAT 504. (2021). Retrieved 15 October 2021, from https://online.stat.psu.edu/stat504/lesson/6/6.1
Kuhn, M. (2021). 6 Available Models | The caret Package. Retrieved 15 October 2021, from https://topepo.github.io/caret/available-models.html
Is R-squared Useless? | University of Virginia Library Research Data Services + Sciences. (2015). Retrieved 15 October 2021, from https://data.library.virginia.edu/is-r-squared-useless/
Implement Machine Learning With Caret In R. (2016). Retrieved 15 October 2021, from https://www.analyticsvidhya.com/blog/2016/12/practical-guide-to-implement-machine-learning-with-caret-package-in-r-with-practice-problem/
Caret Package - A Complete Guide to Build Machine Learning in R. (2021). Retrieved 15 October 2021, from https://www.machinelearningplus.com/machine-learning/caret-package/