Medical Cost Analysis

The data used in this analysis was taken from kaggle and contains a series of data to predict the Indivial Medical Cost billed by health insurance.

The columns within the dataset are the following:

In this notebook we aim to perform an exploratory data analysis in order to describe the main characteristics of our data and then to train and fit a Linear Regression model to predict medical costs with the highest accuracy as possible.

First, we have to import the libraries needed and the database.

'data.frame':   1338 obs. of  7 variables:
 $ age     : int  19 18 28 33 32 31 46 37 37 60 ...
 $ sex     : chr  "female" "male" "male" "male" ...
 $ bmi     : num  27.9 33.8 33 22.7 28.9 ...
 $ children: int  0 1 3 0 0 0 1 3 2 0 ...
 $ smoker  : chr  "yes" "no" "no" "no" ...
 $ region  : chr  "southwest" "southeast" "southeast" "northwest" ...
 $ charges : num  16885 1726 4449 21984 3867 ...

As we observe, the dataframe contains 1338 observations and 7 variables. Now, let’s see if we have missing variables into our columns.

## function to count missing values
Na_Count <- function(feature) {
  sum(is.na(feature))
}

## applying to the data
apply(medical_cost, 2, Na_Count)
     age      sex      bmi children   smoker   region  charges 
       0        0        0        0        0        0        0 

There are no missing values into the data frame. The next step is to convert categorical columns (“sex”, “children”, “smoker”, “region”) into factor, so we can used then later for drawing some plots and to train the model.

## to factor
medical_cost$sex <- as.factor(medical_cost$sex)
medical_cost$children <- as.factor(medical_cost$children)
medical_cost$smoker <- as.factor(medical_cost$smoker)
medical_cost$region <- as.factor(medical_cost$region)

## 
head(medical_cost)
  age    sex    bmi children smoker    region   charges
1  19 female 27.900        0    yes southwest 16884.924
2  18   male 33.770        1     no southeast  1725.552
3  28   male 33.000        3     no southeast  4449.462
4  33   male 22.705        0     no northwest 21984.471
5  32   male 28.880        0     no northwest  3866.855
6  31 female 25.740        0     no southeast  3756.622

Now, let’s get some statistics about the data.

## summary
summary(medical_cost)
      age            sex           bmi        children smoker    
 Min.   :18.00   female:662   Min.   :15.96   0:574    no :1064  
 1st Qu.:27.00   male  :676   1st Qu.:26.30   1:324    yes: 274  
 Median :39.00                Median :30.40   2:240              
 Mean   :39.21                Mean   :30.66   3:157              
 3rd Qu.:51.00                3rd Qu.:34.69   4: 25              
 Max.   :64.00                Max.   :53.13   5: 18              
       region       charges     
 northeast:324   Min.   : 1122  
 northwest:325   1st Qu.: 4740  
 southeast:364   Median : 9382  
 southwest:325   Mean   :13270  
                 3rd Qu.:16640  
                 Max.   :63770  

Exploratory Data Analysis

First of all, I want to see the correlation between the continuous features (age and bmi), with the target variable (charges), since it is important to know this values to figure out how linear is that relationship.

We have to get the correlation matrix and then melt the data to make possible drawing a heatmap with ggplot2.

library(reshape)
cormat <- cor(medical_cost[c("age","bmi", "charges")])

## melting
melted_data <- melt(cormat)
head(melted_data)
       X1  X2     value
1     age age 1.0000000
2     bmi age 0.1092719
3 charges age 0.2990082
4     age bmi 0.1092719
5     bmi bmi 1.0000000
6 charges bmi 0.1983410
## plot
g <- ggplot(data = melted_data, aes(x=X1, y=X2, fill=value)) + geom_tile()
g <- g + labs(x="", y="") + labs(title="Correlation heatmap")
g <- g + geom_text(aes(X1,X2, label=round(value,2)), color="white", size=4)
g

Here we see that the linear correlation between our variables is low, so we can say that the real relationship is not linear. I will explore that later in the notebook.

For now, let’s build a series of plot that allow to understand the behavior of the charges relative to every variable in the dataset.

  • Age vs Charges

We already know that the correlation between the variables is not that linear, but it is a good idea to explore the distribution of charges for those ones whose age is above and below the mean value.

## adding the discretization
medical_age <- medical_cost
medical_age$aboveMean<- ifelse(medical_age$age > mean(medical_age$age), "above", "below")
medical_age$aboveMean <- as.factor(medical_age$aboveMean)

## plot
ghist <- ggplot(medical_age, aes(x=charges))+geom_histogram(aes(color=aboveMean, fill=aboveMean))
ghist <- ghist + facet_grid(aboveMean~.) + labs(y="", title="Charges Distribution")
ghist

People older than 39 years tend to spend more money in medical insurances. Both histogram show a multimodal behavior. In both cases we observe charges greater than USD 60000.

Now, lets plot a scatter plot of charges vs age vs sex.

## scatter
gscat <- ggplot(medical_cost, aes(age, charges, color=sex)) + geom_point()
gscat <- gscat + labs(title = "Age vs Charges plot")
gscat <- gscat + geom_smooth()
gscat

In general, charges increase with age. There is no difference between the amount of money paid by men and women.

## scatter
gscat <- ggplot(medical_cost, aes(age, charges)) + geom_point()
gscat <- gscat + labs(title = "Age vs Charges plot")
gscat <- gscat + geom_smooth() + facet_grid(.~region)
gscat

The tendency is the same when we plot the age vs charges by region. The older the person, the more money he/she spends on medical insurances.

  • BMI vs Charges

Again, let’s make a plot of BMI vs Charges, discretizing by sex.

## scatter
gscat2 <- ggplot(medical_cost, aes(bmi, charges, color=sex)) + geom_point()
gscat2 <- gscat2 + labs(x="BMI", y="Charges", title="BMI vs Charges")
gscat2

There is no a clear trend between the charges by BMI. BMI seems to be well distributed between both sex. Now, let’t compare the amount paid by people with a BMI greater and less than the mean.

## bmi
medical_age$bmiAbvMean <- ifelse(medical_age$bmi>mean(medical_age$bmi), "Above", "Below")
medical_age$bmiAbvMean <- as.factor(medical_age$bmiAbvMean)

## plot
ghist <- ggplot(medical_age, aes(x=charges)) + geom_histogram(aes(color=bmiAbvMean, fill=bmiAbvMean))
ghist <- ghist + facet_grid(bmiAbvMean~.) + labs(title="Charges distribution by BMI", y="")
ghist

The previous plot shows that the amount of money spent by people with a BMI above or below the mean doesn’t vary a lot. Now I want to analyze it in terms of the smoker variable.

## smoke
gscat3 <- ggplot(medical_cost, aes(bmi, charges, color=smoker)) + geom_point()
gscat3 <- gscat3 + labs(x="BMI", y="Charges", title="BMI vs Charges")
gscat3

Smokers spend more money in medical insurances regardless of their BMI. Now, let’s see the average amount of money people paid according to their number of children.

## children

gbar <- ggplot(medical_cost, aes(x=children, y=charges)) + stat_summary(fun.y = "mean", geom = "bar", color="lightblue", fill="lightblue")
gbar <- gbar + labs(title="Average Charges by # of Children")
gbar

Having more children doesn’t mean people will spend more money in medical insurances. In fact, people with 5 children pay less money than the others.

gbox <- ggplot(medical_cost, aes(x=children, y=charges)) + geom_boxplot(aes(fill=children))
gbox <- gbox + labs(title="Charges by # of children")
gbox

  • As we see, in every case there are outliers in the money people with a certain amount of children paid.

  • The median charges (the line inside the box) is almost doesn’t vary a lot in a scale of thousands.

  • The range of charges is wider in families with 2 and 3 children, which where found to spend more money in average (see previous plot).

  • Charges by Sex

In this part, I want to observe the charges distribution by sex by drawing a violin plot.

#charges by sex
gvio <- ggplot(medical_cost, aes(x=charges, y=sex)) + geom_violin(aes(fill=sex))
gvio

As we said earlier, there is not much difference between the average amount of money paid by each gender as is shown in the violin plot.

Linear Model

From the previous plot, we saw a weak linear correlation between the continuous features and the target variable. Having that into account, I want to fit a Linear model with the lm() function to see the performance of this model. With those results, I want to see whether it is necessary to transform the variables in order to get a better performance.

Now I’m going to train the linear model with all of the features.

## linear model
model_1 <- lm(charges~., data=medical_cost)
summary(model_1)

Call:
lm(formula = charges ~ ., data = medical_cost)

Residuals:
     Min       1Q   Median       3Q      Max 
-11689.4  -2902.6   -943.7   1492.2  30042.7 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -11927.17     993.66 -12.003  < 2e-16 ***
age                257.19      11.91  21.587  < 2e-16 ***
sexmale           -128.16     332.83  -0.385 0.700254    
bmi                336.91      28.61  11.775  < 2e-16 ***
children1          390.98     421.35   0.928 0.353619    
children2         1635.78     466.67   3.505 0.000471 ***
children3          964.34     548.10   1.759 0.078735 .  
children4         2947.37    1239.16   2.379 0.017524 *  
children5         1116.04    1456.02   0.767 0.443514    
smokeryes        23836.41     414.14  57.557  < 2e-16 ***
regionnorthwest   -380.04     476.56  -0.797 0.425318    
regionsoutheast  -1033.14     479.14  -2.156 0.031245 *  
regionsouthwest   -952.89     478.15  -1.993 0.046483 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6059 on 1325 degrees of freedom
Multiple R-squared:  0.7519,    Adjusted R-squared:  0.7497 
F-statistic: 334.7 on 12 and 1325 DF,  p-value: < 2.2e-16

From the obtained p-values, we can see that, in general, the region variable is not significant to the model.
This model also shows a R2 equals to 74.97%, which is low in terms of the performance of the model when predicting. I will train a model without the intercept to figure out if the model improves.

## linear model 2
model_2 <- lm(charges~.-1, data=medical_cost)
summary(model_2)

Call:
lm(formula = charges ~ . - 1, data = medical_cost)

Residuals:
     Min       1Q   Median       3Q      Max 
-11689.4  -2902.6   -943.7   1492.2  30042.7 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
age                257.19      11.91  21.587  < 2e-16 ***
sexfemale       -11927.17     993.66 -12.003  < 2e-16 ***
sexmale         -12055.34    1005.76 -11.986  < 2e-16 ***
bmi                336.91      28.61  11.775  < 2e-16 ***
children1          390.98     421.35   0.928 0.353619    
children2         1635.78     466.67   3.505 0.000471 ***
children3          964.34     548.10   1.759 0.078735 .  
children4         2947.37    1239.16   2.379 0.017524 *  
children5         1116.04    1456.02   0.767 0.443514    
smokeryes        23836.41     414.14  57.557  < 2e-16 ***
regionnorthwest   -380.04     476.56  -0.797 0.425318    
regionsoutheast  -1033.14     479.14  -2.156 0.031245 *  
regionsouthwest   -952.89     478.15  -1.993 0.046483 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6059 on 1325 degrees of freedom
Multiple R-squared:  0.8873,    Adjusted R-squared:  0.8862 
F-statistic: 802.7 on 13 and 1325 DF,  p-value: < 2.2e-16

By removing the intercept we can observe that the model improves, since our R2 now is 88.62%. We can use this model to make predictions using the testing later in this notebook.

For now, let’s get some of the plots R allows us to draw in order to understand better our linear model.

## plots
par(mfrow=c(1,2))

plot(model_2,1)
plot(model_2,2)

In the Residuals vs Fitted plot, we can observe that the point are not randomly distributed (the ideal behavior). In statistics this behavior is attributed to the fact that the variance of the error terms is not constant.

## plots
par(mfrow=c(1,2))

plot(model_2,3)
plot(model_2,4)

The cook distance is a way to estimate the influence of a data point when performing a least-squared linear regression analysis like the one I’m trying to perform. This Cook’s distance measures the effect of deleting a given point on the regression and in a deeper study it is good to examine those point with a high Cook’s distance.

Any point with a Cook’s distance higher than 0.5 is consider to have a high influence on the model, so we don’t have such a problem.

## plots
par(mfrow=c(1,2))

plot(model_2,5)
plot(model_2,6)

The residuals vs leverage plot shows the Standarized Residuals. Here, any point above 2 and below -2 is suspicious to be an outlier. Our model also take 13 predictor (taking into account the codification it takes for the categorical ones), so any point with a leverage higher than (\(2*13/1338 = 0.019\)) is consider to have a high leverage.

From the linear model, we can get the leverage, standarized residuals and Cook’s distance by using some functions. I’m gonna do that and add each of these calculations to the dataset.

## new variables
medical_cost$Leverage <- hatvalues(model_2)
medical_cost$Stand.resid <- rstandard(model_2)
medical_cost$Cooks.dist <- cooks.distance(model_2)

head(medical_cost)
  age    sex    bmi children smoker    region   charges    Leverage Stand.resid
1  19 female 27.900        0    yes southwest 16884.924 0.009669119 -1.38618109
2  18   male 33.770        1     no southeast  1725.552 0.007721241 -0.26242801
3  28   male 33.000        3     no southeast  4449.462 0.010522469 -0.28967340
4  33   male 22.705        0     no northwest 21984.471 0.006163663  3.02694311
5  32   male 28.880        0     no northwest  3866.855 0.005291041 -0.27435914
6  31 female 25.740        0     no southeast  3756.622 0.006041479  0.01190455
    Cooks.dist
1 1.443122e-03
2 4.122213e-05
3 6.864132e-05
4 4.371084e-03
5 3.079928e-05
6 6.626089e-08

We can identify the influential point by filtering the data set using the which() function.

  • High residual point
high_resid <- medical_cost[which(abs(medical_cost$Stand.resid)>2),]
head(high_resid)
    age    sex    bmi children smoker    region  charges    Leverage
4    33   male 22.705        0     no northwest 21984.47 0.006163663
10   60 female 25.840        0     no northwest 28923.14 0.007053819
35   28   male 36.400        1    yes southwest 51194.56 0.010559523
63   64   male 24.700        1     no northwest 30166.62 0.009530033
103  18 female 30.115        0     no northeast 21344.85 0.006660857
116  60   male 28.595        0     no northeast 30260.00 0.006867743
    Stand.resid  Cooks.dist
4      3.026943 0.004371084
10     2.831206 0.004380244
35     3.403272 0.009508336
63     2.890461 0.006183636
103    3.063069 0.004839526
116    2.856909 0.004341666
  • High leverage point
num_coef = length(model_2$coefficients)
num_row = length(model_2$residuals)

high_leverage <- medical_cost[which(medical_cost$Leverage>(2*num_coef/num_row)), ]
head(high_leverage)
    age    sex    bmi children smoker    region   charges   Leverage
33   19 female 28.600        5     no southwest  4687.797 0.05894601
62   25   male 33.660        4     no southeast  4504.662 0.04386263
72   31   male 28.500        5     no northeast  6799.458 0.05935430
84   48 female 41.230        4     no northwest 11033.662 0.04626809
166  47   male 28.215        4     no northeast 10407.086 0.04322164
167  20 female 37.000        5     no southwest  4830.630 0.06063505
    Stand.resid   Cooks.dist
33   0.32829573 5.193103e-04
62  -0.52738013 9.814735e-04
72   0.02788553 3.774335e-06
84  -0.98742019 3.638446e-03
166 -0.35078895 4.276007e-04
167 -0.17282044 1.482983e-04
  • High Cook’s Distance Points
medical_cost[which(medical_cost$Cooks.dist>0.5),]
 [1] age         sex         bmi         children    smoker      region     
 [7] charges     Leverage    Stand.resid Cooks.dist 
<0 rows> (or 0-length row.names)

As we saw earlier, there are no points with a high Cook’s distance.

We can organize the influential points, having high leverage and high residual, in order to study them closely. To do this I’ll use the union() function.

influentials <- union(which(medical_cost$Leverage>(2*num_coef/num_row)), which(abs(medical_cost$Stand.resid)>2))

influentials <- sort(influentials)

## dataset
influential_points <- medical_cost[influentials, ]
head(influential_points)
   age    sex    bmi children smoker    region   charges    Leverage
4   33   male 22.705        0     no northwest 21984.471 0.006163663
10  60 female 25.840        0     no northwest 28923.137 0.007053819
33  19 female 28.600        5     no southwest  4687.797 0.058946006
35  28   male 36.400        1    yes southwest 51194.559 0.010559523
62  25   male 33.660        4     no southeast  4504.662 0.043862635
63  64   male 24.700        1     no northwest 30166.618 0.009530033
   Stand.resid   Cooks.dist
4    3.0269431 0.0043710839
10   2.8312065 0.0043802444
33   0.3282957 0.0005193103
35   3.4032717 0.0095083360
62  -0.5273801 0.0009814735
63   2.8904608 0.0061836362

With this data and more information about the source of the data one could perform a deeper analysis. That work is out of the scope of this analysis.

From the previous analysis, we can say that the linear model may not be the adequate for predicting the Medical Insurances Cost. It could be a good idea to try fitting another model, but that’s not the idea here; I just wanted to discover and understand some of the features related to performing an analysis using the Linear Model.