In this document, we explore insights from an analytic point of view and report the process of building predictive machine learning models that can be used to estimate a person’s medical cost (As far as the data set used is concerned). The data set used is named “Medical Cost Personal Datasets” and can be viewed in Kaggle using this link.

To begin with, we install the packages and load the respective libraries used in the subsequent analyses throughout this notebook.

Getting familiar with the data

Exploring the data thoroughly and having a firm grip on its context is vital for any kind of analysis. Since everything following is built on top of our understanding about this particular data, studying the data carefully is vital to moving further.

After running the code chunk we see that data consists of 7 columns age, sex, bmi, children, smoker, region, and charges. The first 6 columns are the attributes of the individuals while the last column (charges) shows the medical costs billed by health insurance. After making sure there are no n/a values in the data set and checking if the column’s data types are correctly recorded to explore data further, some graphics can be very convenient. These univariate analyses inform us about how the variables are distributed, helping us make better sense of the matter at hand. For instance, we can easily notice that the majority of the insured have no children or that the majority of the charges are concentrated in the range from 0 to 20000.

head(insurance_data)
##   age    sex    bmi children smoker    region   charges
## 1  19 female 27.900        0    yes southwest 16884.924
## 2  18   male 33.770        1     no southeast  1725.552
## 3  28   male 33.000        3     no southeast  4449.462
## 4  33   male 22.705        0     no northwest 21984.471
## 5  32   male 28.880        0     no northwest  3866.855
## 6  31 female 25.740        0     no southeast  3756.622
sum(is.na.data.frame(insurance_data))
## [1] 0
summary(insurance_data)
##       age            sex           bmi           children     smoker    
##  Min.   :18.00   female:662   Min.   :15.96   Min.   :0.000   no :1064  
##  1st Qu.:27.00   male  :676   1st Qu.:26.30   1st Qu.:0.000   yes: 274  
##  Median :39.00                Median :30.40   Median :1.000             
##  Mean   :39.21                Mean   :30.66   Mean   :1.095             
##  3rd Qu.:51.00                3rd Qu.:34.69   3rd Qu.:2.000             
##  Max.   :64.00                Max.   :53.13   Max.   :5.000             
##        region       charges     
##  northeast:324   Min.   : 1122  
##  northwest:325   1st Qu.: 4740  
##  southeast:364   Median : 9382  
##  southwest:325   Mean   :13270  
##                  3rd Qu.:16640  
##                  Max.   :63770
par(mfrow=c(2,2))

hist(insurance_data$age,
     breaks = sqrt(nrow(insurance_data)), main = "Distribution of Age", xlab = "Age")
hist(insurance_data$bmi,
     breaks = sqrt(nrow(insurance_data)), main = "Distribution of BMI", xlab = "BMI")
hist(insurance_data$children, main = "Number of children Covered", xlab = "Children")
hist(insurance_data$charges,
     breaks = sqrt(nrow(insurance_data)), main = "Individual medical charges billed ", xlab = "Charges")

p1<-ggplot(insurance_data) + geom_bar(aes(x=sex)) + labs(title ="Gender Distribution") + theme_minimal()

p2<-ggplot(insurance_data) + geom_bar(aes(x=smoker)) + labs(title = "Smoker/ Non-Smoker Distribution")+ theme_minimal()

p3<-ggplot(insurance_data) + geom_bar(aes(x=region))+ labs(title = "Distribution by Region")+ theme_minimal()

(p1 +p2)/p3

Box-plots explain the variability present in the data also indicating the outliers that might pose an issue to our models especially to the linear ones.

p4<-ggplot(insurance_data) +
  aes(x = "", y = age) +
  geom_boxplot(fill = "#0c4c8a") +
  theme_minimal()
p5<-ggplot(insurance_data) +
  aes(x = "", y = bmi) +
  geom_boxplot(fill = "#0c4c8a") +
  theme_minimal()
p6<-ggplot(insurance_data) +
  aes(x = "", y = children) +
  geom_boxplot(fill = "#0c4c8a") +
  theme_minimal()
p7<-ggplot(insurance_data) +
  aes(x = "", y = charges) +
  geom_boxplot(fill = "#0c4c8a") +
  theme_minimal()

(p4 + p5)/(p6+p7)

With the help of a few graphs/charts, we can explore dimensions or interrelationships which might not be very visible at first sight but may hold important information.

For example from the graphs below, we can understand that charges increase by age or that smoking almost doubles the charges for an individual.

Also, we have to make sure that there is no significant correlation between independent variables that will be used as input into the models. Just like it might be seen from the last two entries, results show no significant correlation between variables in our example.

ggplot(insurance_data, aes(x=bmi, y= charges, color=age)) + geom_point()

ggplot(insurance_data, aes(x=age, y= charges, color=bmi)) + geom_point()

ggplot(insurance_data, aes(x=smoker, y= charges, color=sex)) + geom_boxplot()

numeric_variables_for_cor <- data.frame(age, bmi, children) %>%  cor() %>% 
  corrplot(method = "pie", type="lower")

categoric_variables_for_cor <- data.frame(sex, smoker, region) %>% 
  GKtauDataframe()
print.table(categoric_variables_for_cor)
##          sex smoker region
## sex    2.000  0.006  0.000
## smoker 0.006  2.000  0.002
## region 0.000  0.005  4.000

Getting started to fitting models

Let’s try to fit a multivariate linear regression model without making any changes to the data.

Using the default diagnostic tools available in R we conclude that the model does not satisfy normality assumptions and the plots are pretty disturbing in the means of proper model fitting.

This can also be confirmed using the package gvlma to automatically check if the assumptions are satisfied.

mrm <- lm(charges ~ age + sex + bmi + children + region, data = insurance_data)
summary(mrm)
## 
## Call:
## lm(formula = charges ~ age + sex + bmi + children + region, data = insurance_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -14768  -6966  -4918   6738  47284 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -6525.80    1840.87  -3.545 0.000406 ***
## age               242.25      22.27  10.878  < 2e-16 ***
## sexmale          1311.61     621.52   2.110 0.035018 *  
## bmi               314.60      53.53   5.877 5.28e-09 ***
## children          555.69     257.96   2.154 0.031406 *  
## regionnorthwest -1025.98     891.33  -1.151 0.249916    
## regionsoutheast    69.99     895.41   0.078 0.937712    
## regionsouthwest -1603.32     894.46  -1.792 0.073280 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11350 on 1330 degrees of freedom
## Multiple R-squared:  0.1264, Adjusted R-squared:  0.1218 
## F-statistic:  27.5 on 7 and 1330 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(mrm)

mean(mrm$residuals)
## [1] 4.652024e-13
install.packages("gvlma")
library(gvlma)
gvlma::gvlma(mrm)
## 
## Call:
## lm(formula = charges ~ age + sex + bmi + children + region, data = insurance_data)
## 
## Coefficients:
##     (Intercept)              age          sexmale              bmi  
##        -6525.80           242.25          1311.61           314.60  
##        children  regionnorthwest  regionsoutheast  regionsouthwest  
##          555.69         -1025.98            69.99         -1603.32  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma::gvlma(x = mrm) 
## 
##                        Value p-value                   Decision
## Global Stat        596.76086  0.0000 Assumptions NOT satisfied!
## Skewness           506.17542  0.0000 Assumptions NOT satisfied!
## Kurtosis            89.60758  0.0000 Assumptions NOT satisfied!
## Link Function        0.90452  0.3416    Assumptions acceptable.
## Heteroscedasticity   0.07334  0.7865    Assumptions acceptable.

Transforming the data

By now it’s clear that we have to transform the data in order to deal with the skewed distributions present in the variables. There are many ways that an analyst can use to deal with problems of this nature but often the most reliable solution is taking the natural logarithm of the data at hand.

As a demonstration of this, we tried different ways of data transformation and confirmed that even in our case natural logarithm transformation is the most appropriate choice. Below we can see the variable Charges transformed to normally distributed one as we take its natural logarithm.

par(mfrow=c(2,2))
hist(charges, main = "Untransformed Charges", xlab ="Charges" )
hist(log(charges + 1),  main = "Log transformation", xlab ="Log-Charges")
hist(sqrt(charges + 1),  main ="SQRT transformation", xlab ="SQRT-Charges")
hist(scale(charges + 1),  main ="Scaled Charges", xlab ="Scaled Charges")

Here we transform the data using log function available in R

#lets normalize data using logarithmic function


log_insurance_data <- insurance_data
log_insurance_data$age <- log(insurance_data$age + 1) #here we add 1 to avoid complications with values that are 0
log_insurance_data$bmi <- log(insurance_data$bmi + 1)
log_insurance_data$children <- log(insurance_data$children + 1)
log_insurance_data$charges <- log(insurance_data$charges + 1)

Now, we build another multivariate linear regression model, but this time using the data transformed with natural logarithm.

Judging only by the R-square value, if we compare the first model with the latter one, there is an astonishing improvement from 12% at the first model to 76.7% to this second. To better visualize the model’s performance we graphed a part of the predicted and actual values toward an X-index demonstrating the good performance of the model. But still there is an issue. The model does not meet the assumptions, therefore the results in other unseen data points might be highly unreliable.

log_model <- lm(charges ~ age+ sex+ bmi+ children+ smoker + region, data = log_insurance_data)
summary(log_model)
## 
## Call:
## lm(formula = charges ~ age + sex + bmi + children + smoker + 
##     region, data = log_insurance_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.88874 -0.22720 -0.08146  0.08731  2.21245 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2.61522    0.23800  10.988  < 2e-16 ***
## age              1.28035    0.03238  39.545  < 2e-16 ***
## sexmale         -0.07495    0.02433  -3.080 0.002110 ** 
## bmi              0.45074    0.06492   6.943 5.99e-12 ***
## children         0.17895    0.02184   8.194 5.89e-16 ***
## smokeryes        1.55097    0.03020  51.361  < 2e-16 ***
## regionnorthwest -0.06743    0.03482  -1.937 0.052979 .  
## regionsoutheast -0.16051    0.03494  -4.595 4.75e-06 ***
## regionsouthwest -0.13222    0.03495  -3.783 0.000162 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.443 on 1329 degrees of freedom
## Multiple R-squared:  0.7692, Adjusted R-squared:  0.7678 
## F-statistic: 553.5 on 8 and 1329 DF,  p-value: < 2.2e-16
library(caret)

gvlma::gvlma(log_model)
## 
## Call:
## lm(formula = charges ~ age + sex + bmi + children + smoker + 
##     region, data = log_insurance_data)
## 
## Coefficients:
##     (Intercept)              age          sexmale              bmi  
##         2.61522          1.28035         -0.07495          0.45074  
##        children        smokeryes  regionnorthwest  regionsoutheast  
##         0.17895          1.55097         -0.06743         -0.16051  
## regionsouthwest  
##        -0.13222  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma::gvlma(x = log_model) 
## 
##                        Value p-value                   Decision
## Global Stat        2134.6808  0.0000 Assumptions NOT satisfied!
## Skewness            759.0545  0.0000 Assumptions NOT satisfied!
## Kurtosis           1188.0604  0.0000 Assumptions NOT satisfied!
## Link Function       187.1535  0.0000 Assumptions NOT satisfied!
## Heteroscedasticity    0.4124  0.5208    Assumptions acceptable.
par(mfrow=c(2,2))
plot(log_model)

predict(log_model, data.frame(log_insurance_data[1:299,-7])) ->log_model_predicted_charges

actual_charges <- log_insurance_data[1:299,7]
log_model_actual_vs_predicted <- data.frame(actual_charges,log_model_predicted_charges)
par(mfrow=c(1,1))
plot( y=log_model_actual_vs_predicted$actual_charges, x=1:299,  pch = 19 , col = "blue",ylab = "Actual<>Predicted", xlab = "Index", main = "Comparing Actual values(blue) with the Predicted (Red)")
points(log_model_actual_vs_predicted$log_model_predicted_charges, col= "red")

RMSE(actual_charges,log_model_predicted_charges)
## [1] 0.4232411

Even though the improvement made to the model was considerably good, further attempts can be taken to produce more reliable and serious models.

For this reason, we decided to remove outliers since it can be very helpful while attempting to create more compact models.

#identifying the rows with outliers
boxplot.stats(insurance_data$charges)$out -> outlier_charges
boxplot.stats(insurance_data$bmi)$out -> outlier_bmi

outlier_bmi_rows <- which(insurance_data$bmi %in% c(outlier_bmi))

outlier_charges_rows <- which(insurance_data$charges %in% c(outlier_charges))


insurance_cleaned <-insurance_data[c(-outlier_bmi_rows,-outlier_charges_rows),]


par(mfrow=c(2,2))
boxplot(charges ,main="Charges Before", col= "#177924")
boxplot(insurance_cleaned$charges,main="Charges After", col="#0c4c8a")
boxplot(bmi, main="Bmi Before", col="#177924")
boxplot(insurance_cleaned$bmi, main="Bmi After", col ="#0c4c8a")

Let’s create a new data frame with transformed and outlier-cleaned variables

transformed_cleaned_insurance_data <- insurance_cleaned
transformed_cleaned_insurance_data$age <- log(transformed_cleaned_insurance_data$age + 1)
transformed_cleaned_insurance_data$bmi <- log(transformed_cleaned_insurance_data$bmi + 1)
transformed_cleaned_insurance_data$children <- log(transformed_cleaned_insurance_data$children + 1)
transformed_cleaned_insurance_data$charges <- log(transformed_cleaned_insurance_data$charges + 1)

After making these important transformations we built another Linear Regression model to assess if there are any improvements. After fitting the model and checking for its diagnostics we can see that model has a lower \(R^2\) value(70%), indicating lower explainability, while the RMSE value is slightly lower meaning the model is doing better at this particular data. But still, the model does not meet the Linear regression assumptions therefore its not reliable.

new_log_model <- lm(charges ~ age+ sex+ bmi+ children+ smoker + region, data = transformed_cleaned_insurance_data)
summary(new_log_model)
## 
## Call:
## lm(formula = charges ~ age + sex + bmi + children + smoker + 
##     region, data = transformed_cleaned_insurance_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.71376 -0.20025 -0.10003  0.04029  2.31836 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      3.03565    0.25440  11.933  < 2e-16 ***
## age              1.36682    0.03328  41.069  < 2e-16 ***
## sexmale         -0.08498    0.02498  -3.402 0.000692 ***
## bmi              0.24050    0.06981   3.445 0.000590 ***
## children         0.18745    0.02241   8.366  < 2e-16 ***
## smokeryes        1.32723    0.04045  32.810  < 2e-16 ***
## regionnorthwest -0.06855    0.03522  -1.947 0.051820 .  
## regionsoutheast -0.17887    0.03623  -4.937 9.07e-07 ***
## regionsouthwest -0.16557    0.03579  -4.626 4.14e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4309 on 1184 degrees of freedom
## Multiple R-squared:  0.7086, Adjusted R-squared:  0.7067 
## F-statistic: 359.9 on 8 and 1184 DF,  p-value: < 2.2e-16
predict(new_log_model, data.frame(transformed_cleaned_insurance_data[1:299,-7])) ->new_log_model_predicted_charges

actual_charges <- transformed_cleaned_insurance_data[1:299,7]
new_log_model_actual_vs_predicted <- data.frame(actual_charges,new_log_model_predicted_charges)
par(mfrow=c(1,1))
plot( y=new_log_model_actual_vs_predicted$actual_charges, x=1:299,ylab = "Actual<>Predicted", xlab = "Index", main = "Comparing Actual values(blue) with the Predicted (Red)",  pch = 19 , col = "blue")
points(new_log_model_actual_vs_predicted$new_log_model_predicted_charges, col= "red")

par(mfrow=c(2,2))
plot(new_log_model)

RMSE(actual_charges,new_log_model_predicted_charges)
## [1] 0.4127277

Before we get into other ML techniques to try different models, we first create dummy variables for our factor variables in this way avoiding any potential complications.

#moving on trying other models


library(fastDummies)
set.seed(99)
# Create dummy variable
insurance_dummy <- dummy_cols(transformed_cleaned_insurance_data,
                                    select_columns = c("sex", "smoker", "region"))
insurance_dummy <- select(insurance_dummy, c(-sex,-smoker, -region))
head(insurance_dummy)
##        age      bmi  children  charges sex_female sex_male smoker_no smoker_yes
## 1 2.995732 3.363842 0.0000000 9.734236          1        0         0          1
## 2 2.944439 3.548755 0.6931472 7.453882          0        1         1          0
## 3 3.367296 3.526361 1.3862944 8.400763          0        1         1          0
## 4 3.526361 3.165686 0.0000000 9.998137          0        1         1          0
## 5 3.496508 3.397189 0.0000000 8.260455          0        1         1          0
## 6 3.465736 3.286161 0.0000000 8.231541          1        0         1          0
##   region_northeast region_northwest region_southeast region_southwest
## 1                0                0                0                1
## 2                0                0                1                0
## 3                0                0                1                0
## 4                0                1                0                0
## 5                0                1                0                0
## 6                0                0                1                0
#data partition

#stackoverflow help rsample
set.seed(99)
library(rsample)
data_split <- initial_split(insurance_dummy, prop = 0.75)
insurance_train <- training(data_split)
insurance_test  <- testing(data_split)

After fitting a linear regression model using cross-validation we conclude that this worsens the model’s performance also yielding a large RMSE value indicating a poor performance in the unseen data. This raises the question of the trustworthiness of the \(R^2\) value, is \(R^2\) telling the truth about the model?

At this point, we try to fit different models besides linear ones because non-linear relationships might not be properly modeled by the linear models.

library(caret)

set.seed(99)

crossV_lm_model <- train(charges ~. ,  method="lm", trControl=trainControl(method = "cv", number = 10), data=insurance_train) 

summary(crossV_lm_model$finalModel)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.66118 -0.19965 -0.09530  0.04223  2.30065 
## 
## Coefficients: (3 not defined because of singularities)
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       4.182893   0.284828  14.686  < 2e-16 ***
## age               1.351315   0.038190  35.384  < 2e-16 ***
## bmi               0.232893   0.079783   2.919 0.003599 ** 
## children          0.187479   0.026112   7.180 1.48e-12 ***
## sex_female        0.095369   0.028814   3.310 0.000971 ***
## sex_male                NA         NA      NA       NA    
## smoker_no        -1.324333   0.045706 -28.975  < 2e-16 ***
## smoker_yes              NA         NA      NA       NA    
## region_northeast  0.158545   0.041792   3.794 0.000159 ***
## region_northwest  0.091303   0.040971   2.228 0.026101 *  
## region_southeast -0.009745   0.040992  -0.238 0.812153    
## region_southwest        NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4305 on 885 degrees of freedom
## Multiple R-squared:  0.7125, Adjusted R-squared:  0.7099 
## F-statistic: 274.1 on 8 and 885 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(crossV_lm_model$finalModel)

predict(crossV_lm_model, data.frame(insurance_train[1:299,-4])) ->crossV_lm_model_predicted_charges

actual_charges <- insurance_test[,4]
crossV_lm_model_actual_vs_predicted <- data.frame(actual_charges,crossV_lm_model_predicted_charges)
par(mfrow=c(1,1))
index=1:299
plot( y=crossV_lm_model_actual_vs_predicted$actual_charges, x=index,  pch = 19 , ylab = "Actual<>Predicted", xlab = "Index", main = "Comparing Actual values(blue) with the Predicted (Red)" , col = "blue")
points(crossV_lm_model_actual_vs_predicted$crossV_lm_model_predicted_charges, col= "red")

RMSE(actual_charges,crossV_lm_model_predicted_charges)
## [1] 1.051001

After trying to fit different linear regression models we realized that these models are not a proper fit for this particular data. Frankly, Linear regression’s assumptions are tough to be met. That’s why we thought it would be proper to use a wider class of regression models named Generalized Linear Models (GLM). This class of models is more flexible therefore allows for situations like non-linearity or non-normal distribution of the residuals.

set.seed(99)
control <- trainControl(method = "cv", number = 10)
glm_model <- train(charges ~ ., method="glm", data=insurance_train)


glm_model
## Generalized Linear Model 
## 
## 894 samples
##  11 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 894, 894, 894, 894, 894, 894, ... 
## Resampling results:
## 
##   RMSE       Rsquared  MAE      
##   0.4309079  0.709822  0.2681891
par(mfrow=c(2,2))
plot(glm_model$finalModel)

predict(glm_model,newdata = insurance_test[,-4]) ->glm_predicted_charges

glm_actual_vs_predicted <- data.frame(actual_charges,glm_predicted_charges)
par(mfrow=c(1,1))
plot( y=glm_actual_vs_predicted$actual_charges, x=1:299,  pch = 19 , ylab = "Actual<>Predicted", xlab = "Index", main = "Comparing Actual values(blue) with the Predicted (Red)"  , col = "blue")
points(glm_actual_vs_predicted$glm_predicted_charges, col= "red")

RMSE(predict(glm_model,newdata = insurance_test[,-4]),insurance_test[,4])
## [1] 0.4327908

With the help of the Caret package, we can change the machine learning algorithm easily without needing to twist the code much. Using this chunk code we use the k-nearest neighbors algorithm to predict the charges in the unseen data. The results are quite impressive.

set.seed(99)
knn_model <- train(charges ~ . , data=insurance_train, 
                   method="knn", tuneGrid = data.frame(k = seq(3, 71, 2)),trace = FALSE, trControl=trainControl(method = "cv", number = 10))
knn_model$finalModel
## 9-nearest neighbor regression model
par(mfrow=c(2,2))
ggplot(knn_model, highlight = TRUE)

predict(knn_model,newdata = insurance_test[,-4]) -> knn_predicted_charges
actual_charges <- insurance_test[,4]
knn_actual_vs_predicted <- data.frame(actual_charges,knn_predicted_charges)
par(mfrow=c(1,1))
index=1:299
plot( y=knn_actual_vs_predicted$actual_charges, x=index,  pch = 19 ,  ylab = "Actual<>Predicted", xlab = "Index", main = "Comparing Actual values(blue) with the Predicted (Red)" , col = "blue")
points(knn_actual_vs_predicted$knn_predicted_charges, col= "red")

RMSE(predict(knn_model,newdata = insurance_test[,-4]), insurance_test[,4])
## [1] 0.4259729
rss <- sum((knn_predicted_charges - actual_charges) ^ 2)
tss <- sum((actual_charges - mean(actual_charges)) ^ 2)

r_square = 1 - (rss/tss)
print(r_square)
## [1] 0.7032965

Comparing Models

As a review of the models, we have created let’s see how these models perform in the test data ,and let’s compute their \(R^2\) value manually concerning the test data.

rss <- sum((log_model_predicted_charges - actual_charges) ^ 2)
tss <- sum((actual_charges - mean(actual_charges)) ^ 2)
r_square = 1 - (rss/tss)
print(c("Log model R-square", r_square))
## [1] "Log model R-square" "-1.18856343187918"
RMSE(actual_charges,log_model_predicted_charges)
## [1] 1.156912
rss <- sum((new_log_model_predicted_charges - actual_charges) ^ 2)
tss <- sum((actual_charges - mean(actual_charges)) ^ 2)
r_square = 1 - (rss/tss)
print(c("New-log model R-square", r_square))
## [1] "New-log model R-square" "-0.803973724043065"
RMSE(actual_charges,new_log_model_predicted_charges)
## [1] 1.050354
rss <- sum((crossV_lm_model_predicted_charges - actual_charges) ^ 2)
tss <- sum((actual_charges - mean(actual_charges)) ^ 2)
r_square = 1 - (rss/tss)
print(c("Cross validated LM R-square", r_square))
## [1] "Cross validated LM R-square" "-0.806196181189394"
RMSE(actual_charges,crossV_lm_model_predicted_charges)
## [1] 1.051001
rss <- sum((knn_predicted_charges - actual_charges) ^ 2)
tss <- sum((actual_charges - mean(actual_charges)) ^ 2)
r_square = 1 - (rss/tss)
print( c("Knn R-square", r_square))
## [1] "Knn R-square"      "0.703296478023315"
RMSE(predict(knn_model,newdata = insurance_test[,-4]), insurance_test[,4])
## [1] 0.4259729
rss <- sum((glm_predicted_charges - actual_charges) ^ 2)
tss <- sum((actual_charges - mean(actual_charges)) ^ 2)
r_square = 1 - (rss/tss)
print(c("Cross validated Generalized-LM R-square", r_square))
## [1] "Cross validated Generalized-LM R-square"
## [2] "0.693722750903194"
RMSE(predict(glm_model,newdata = insurance_test[,-4]),insurance_test[,4])
## [1] 0.4327908
#importing the ploting function from Github

library(devtools)
source_url('https://gist.githubusercontent.com/fawda123/7471137/raw/466c1474d0a505ff044412703516c34f1a4684a5/nnet_plot_update.r')

Fitting a Neural Network

library(reshape)
set.seed(99)
rows <- sample(nrow(insurance_train)) 
insurance_train <- insurance_train[rows, ]


set.seed(99)

grid<- expand.grid(size = seq(from = 1, to = 80, by = 10), decay = seq(from = 0.01, to = 0.4, by = 0.02))

ann_model <-train(charges ~ ., method="nnet", preProcess = c("center", "scale"), data=insurance_train, tuneGrid =  grid,trace = FALSE, linout = TRUE)


ggplot(ann_model, highlight = TRUE)

predict(ann_model,newdata = insurance_test[,-4]) ->ann_predicted_charges

ann_actual_vs_predicted <- data.frame(actual_charges,ann_predicted_charges)
index<-1:299
plot( y=ann_actual_vs_predicted$actual_charges, x=index,  pch = 19 , col = "blue", ylab = "Actual<>Predicted", xlab = "Index", main = "Comparing Actual values(blue) with the Predicted (Red)")
points(ann_actual_vs_predicted$ann_predicted_charges, col= "red")

RMSE(actual_charges,ann_predicted_charges)
## [1] 0.4057017
plot.nnet(ann_model)

#Now lets calculate its R-square value

rss <- sum((ann_predicted_charges - actual_charges) ^ 2)
tss <- sum((actual_charges - mean(actual_charges)) ^ 2)
r_square = 1 - (rss/tss)
print(c("Cross validated ANN R-square value", r_square))
## [1] "Cross validated ANN R-square value" "0.730863631680642"

Conclusion

Fitting the right model to the data often is described as a combined process of art and science. Here it is emphasized the fact that tuning the parameters and choosing the right algorithms does not always guarantee the most optimal solution (Also often defining “right” can be tough). As per our example, the linear regression model seemed to do well if judged only by the visuals we created for better interpretation. But, when checked for RMSE and \(R^2\) values we see the picture much clearer. Except for the Generalized linear model that performed relatively well the other Linear regressions that we tried to fit didn’t pay off. That might be because GLM’s are more flexible when it comes to the assumptions which were very hard to meet here. Also, we saw how misleading the model’s output can be if the assumptions aren’t met. While checking for the diagnostics after fitting the linear models we also noticed some non-linearity, this can be another important contributor to the linear regression model’s inability to deal with this data. Knn model was also a good performer but it is outpaced by the power of flexibility of the Artificial Neural Networks that had the best performance judged by \(R^2\) (73%) and RMSE value(0.4) altogether.

References AND great sources:

RPubs - Visual Tests for the Key Assumptions of Multiple Linear Regression. (2017). Retrieved 15 October 2021, from https://rpubs.com/ajdowny_student/300663

Medical Cost Personal Datasets. (2021). Retrieved 15 October 2021, from https://www.kaggle.com/mirichoi0218/insurance

1.3 - The Simple Linear Regression Model | STAT 501. (2021). Retrieved 15 October 2021, from https://online.stat.psu.edu/stat501/lesson/1/1.3

10 Assumptions of Linear Regression - Full List with Examples and Code. (2021). Retrieved 15 October 2021, from http://r-statistics.co/Assumptions-of-Linear-Regression.html

(2021). Retrieved 15 October 2021, from https://www.statisticssolutions.com/wp-content/uploads/wp-post-to-pdf-enhanced-cache/1/assumptions-of-multiple-linear-regression.pdf

Quick-R: Multiple Regression. (2021). Retrieved 15 October 2021, from https://www.statmethods.net/stats/regression.html

(2021). Retrieved 15 October 2021, from https://www.ics.uci.edu/~jutts/110/Lecture3.pdf

Outliers detection in R. (2021). Retrieved 15 October 2021, from https://statsandr.com/blog/outliers-detection-in-r/

Understanding Diagnostic Plots for Linear Regression Analysis | University of Virginia Library Research Data Services + Sciences. (2015). Retrieved 15 October 2021, from https://data.library.virginia.edu/diagnostic-plots/

Models, N. (2018). Nonlinear Regression Essentials in R: Polynomial and Spline Regression Models - Articles - STHDA. Retrieved 15 October 2021, from http://www.sthda.com/english/articles/40-regression-analysis/162-nonlinear-regression-essentials-in-r-polynomial-and-spline-regression-models/

Holtz, Y. (2021). Animated 3d chart with R. Retrieved 15 October 2021, from https://www.r-graph-gallery.com/3-r-animated-cube.html

10 Assumptions of Linear Regression - Full List with Examples and Code. (2021). Retrieved 15 October 2021, from http://r-statistics.co/Assumptions-of-Linear-Regression.html#Check%20Assumptions%20Automatically

Cross-Validation Tutorial | QuantDev Methodology. (2021). Retrieved 15 October 2021, from https://quantdev.ssri.psu.edu/tutorials/cross-validation-tutorial

6.1 - Introduction to Generalized Linear Models | STAT 504. (2021). Retrieved 15 October 2021, from https://online.stat.psu.edu/stat504/lesson/6/6.1

Kuhn, M. (2021). 6 Available Models | The caret Package. Retrieved 15 October 2021, from https://topepo.github.io/caret/available-models.html

Is R-squared Useless? | University of Virginia Library Research Data Services + Sciences. (2015). Retrieved 15 October 2021, from https://data.library.virginia.edu/is-r-squared-useless/

Implement Machine Learning With Caret In R. (2016). Retrieved 15 October 2021, from https://www.analyticsvidhya.com/blog/2016/12/practical-guide-to-implement-machine-learning-with-caret-package-in-r-with-practice-problem/

Caret Package - A Complete Guide to Build Machine Learning in R. (2021). Retrieved 15 October 2021, from https://www.machinelearningplus.com/machine-learning/caret-package/