The data used in this analysis was taken from kaggle and contains a series of data to predict the Indivial Medical Cost billed by health insurance.
The columns within the dataset are the following:
In this notebook we aim to perform an exploratory data analysis in order to describe the main characteristics of our data and then to train and fit a Linear Regression model to predict medical costs with the highest accuracy as possible.
First, we have to import the libraries needed and the database.
'data.frame': 1338 obs. of 7 variables:
$ age : int 19 18 28 33 32 31 46 37 37 60 ...
$ sex : chr "female" "male" "male" "male" ...
$ bmi : num 27.9 33.8 33 22.7 28.9 ...
$ children: int 0 1 3 0 0 0 1 3 2 0 ...
$ smoker : chr "yes" "no" "no" "no" ...
$ region : chr "southwest" "southeast" "southeast" "northwest" ...
$ charges : num 16885 1726 4449 21984 3867 ...
As we observe, the dataframe contains 1338 observations and 7 variables. Now, let’s see if we have missing variables into our columns.
## function to count missing values
Na_Count <- function(feature) {
sum(is.na(feature))
}
## applying to the data
apply(medical_cost, 2, Na_Count)
age sex bmi children smoker region charges
0 0 0 0 0 0 0
There are no missing values into the data frame. The next step is to convert categorical columns (“sex”, “children”, “smoker”, “region”) into factor, so we can used then later for drawing some plots and to train the model.
## to factor
medical_cost$sex <- as.factor(medical_cost$sex)
medical_cost$children <- as.factor(medical_cost$children)
medical_cost$smoker <- as.factor(medical_cost$smoker)
medical_cost$region <- as.factor(medical_cost$region)
##
head(medical_cost)
age sex bmi children smoker region charges
1 19 female 27.900 0 yes southwest 16884.924
2 18 male 33.770 1 no southeast 1725.552
3 28 male 33.000 3 no southeast 4449.462
4 33 male 22.705 0 no northwest 21984.471
5 32 male 28.880 0 no northwest 3866.855
6 31 female 25.740 0 no southeast 3756.622
Now, let’s get some statistics about the data.
## summary
summary(medical_cost)
age sex bmi children smoker
Min. :18.00 female:662 Min. :15.96 0:574 no :1064
1st Qu.:27.00 male :676 1st Qu.:26.30 1:324 yes: 274
Median :39.00 Median :30.40 2:240
Mean :39.21 Mean :30.66 3:157
3rd Qu.:51.00 3rd Qu.:34.69 4: 25
Max. :64.00 Max. :53.13 5: 18
region charges
northeast:324 Min. : 1122
northwest:325 1st Qu.: 4740
southeast:364 Median : 9382
southwest:325 Mean :13270
3rd Qu.:16640
Max. :63770
First of all, I want to see the correlation between the continuous features (age and bmi), with the target variable (charges), since it is important to know this values to figure out how linear is that relationship.
We have to get the correlation matrix and then melt the data to make possible drawing a heatmap with ggplot2.
library(reshape)
cormat <- cor(medical_cost[c("age","bmi", "charges")])
## melting
melted_data <- melt(cormat)
head(melted_data)
X1 X2 value
1 age age 1.0000000
2 bmi age 0.1092719
3 charges age 0.2990082
4 age bmi 0.1092719
5 bmi bmi 1.0000000
6 charges bmi 0.1983410
## plot
g <- ggplot(data = melted_data, aes(x=X1, y=X2, fill=value)) + geom_tile()
g <- g + labs(x="", y="") + labs(title="Correlation heatmap")
g <- g + geom_text(aes(X1,X2, label=round(value,2)), color="white", size=4)
g
Here we see that the linear correlation between our variables is low, so we can say that the real relationship is not linear. I will explore that later in the notebook.
For now, let’s build a series of plot that allow to understand the behavior of the charges relative to every variable in the dataset.
We already know that the correlation between the variables is not that linear, but it is a good idea to explore the distribution of charges for those ones whose age is above and below the mean value.
## adding the discretization
medical_age <- medical_cost
medical_age$aboveMean<- ifelse(medical_age$age > mean(medical_age$age), "above", "below")
medical_age$aboveMean <- as.factor(medical_age$aboveMean)
## plot
ghist <- ggplot(medical_age, aes(x=charges))+geom_histogram(aes(color=aboveMean, fill=aboveMean))
ghist <- ghist + facet_grid(aboveMean~.) + labs(y="", title="Charges Distribution")
ghist
People older than 39 years tend to spend more money in medical insurances. Both histogram show a multimodal behavior. In both cases we observe charges greater than USD 60000.
Now, lets plot a scatter plot of charges vs age vs sex.
## scatter
gscat <- ggplot(medical_cost, aes(age, charges, color=sex)) + geom_point()
gscat <- gscat + labs(title = "Age vs Charges plot")
gscat <- gscat + geom_smooth()
gscat
In general, charges increase with age. There is no difference between the amount of money paid by men and women.
## scatter
gscat <- ggplot(medical_cost, aes(age, charges)) + geom_point()
gscat <- gscat + labs(title = "Age vs Charges plot")
gscat <- gscat + geom_smooth() + facet_grid(.~region)
gscat
The tendency is the same when we plot the age vs charges by region. The older the person, the more money he/she spends on medical insurances.
Again, let’s make a plot of BMI vs Charges, discretizing by sex.
## scatter
gscat2 <- ggplot(medical_cost, aes(bmi, charges, color=sex)) + geom_point()
gscat2 <- gscat2 + labs(x="BMI", y="Charges", title="BMI vs Charges")
gscat2
There is no a clear trend between the charges by BMI. BMI seems to be well distributed between both sex. Now, let’t compare the amount paid by people with a BMI greater and less than the mean.
## bmi
medical_age$bmiAbvMean <- ifelse(medical_age$bmi>mean(medical_age$bmi), "Above", "Below")
medical_age$bmiAbvMean <- as.factor(medical_age$bmiAbvMean)
## plot
ghist <- ggplot(medical_age, aes(x=charges)) + geom_histogram(aes(color=bmiAbvMean, fill=bmiAbvMean))
ghist <- ghist + facet_grid(bmiAbvMean~.) + labs(title="Charges distribution by BMI", y="")
ghist
The previous plot shows that the amount of money spent by people with a BMI above or below the mean doesn’t vary a lot. Now I want to analyze it in terms of the smoker variable.
## smoke
gscat3 <- ggplot(medical_cost, aes(bmi, charges, color=smoker)) + geom_point()
gscat3 <- gscat3 + labs(x="BMI", y="Charges", title="BMI vs Charges")
gscat3
Smokers spend more money in medical insurances regardless of their BMI. Now, let’s see the average amount of money people paid according to their number of children.
## children
gbar <- ggplot(medical_cost, aes(x=children, y=charges)) + stat_summary(fun.y = "mean", geom = "bar", color="lightblue", fill="lightblue")
gbar <- gbar + labs(title="Average Charges by # of Children")
gbar
Having more children doesn’t mean people will spend more money in medical insurances. In fact, people with 5 children pay less money than the others.
gbox <- ggplot(medical_cost, aes(x=children, y=charges)) + geom_boxplot(aes(fill=children))
gbox <- gbox + labs(title="Charges by # of children")
gbox
As we see, in every case there are outliers in the money people with a certain amount of children paid.
The median charges (the line inside the box) is almost doesn’t vary a lot in a scale of thousands.
The range of charges is wider in families with 2 and 3 children, which where found to spend more money in average (see previous plot).
Charges by Sex
In this part, I want to observe the charges distribution by sex by drawing a violin plot.
#charges by sex
gvio <- ggplot(medical_cost, aes(x=charges, y=sex)) + geom_violin(aes(fill=sex))
gvio
As we said earlier, there is not much difference between the average amount of money paid by each gender as is shown in the violin plot.
From the previous plot, we saw a weak linear correlation between the continuous features and the target variable. Having that into account, I want to fit a Linear model with the lm() function to see the performance of this model. With those results, I want to see whether it is necessary to transform the variables in order to get a better performance.
Now I’m going to train the linear model with all of the features.
## linear model
model_1 <- lm(charges~., data=medical_cost)
summary(model_1)
Call:
lm(formula = charges ~ ., data = medical_cost)
Residuals:
Min 1Q Median 3Q Max
-11689.4 -2902.6 -943.7 1492.2 30042.7
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -11927.17 993.66 -12.003 < 2e-16 ***
age 257.19 11.91 21.587 < 2e-16 ***
sexmale -128.16 332.83 -0.385 0.700254
bmi 336.91 28.61 11.775 < 2e-16 ***
children1 390.98 421.35 0.928 0.353619
children2 1635.78 466.67 3.505 0.000471 ***
children3 964.34 548.10 1.759 0.078735 .
children4 2947.37 1239.16 2.379 0.017524 *
children5 1116.04 1456.02 0.767 0.443514
smokeryes 23836.41 414.14 57.557 < 2e-16 ***
regionnorthwest -380.04 476.56 -0.797 0.425318
regionsoutheast -1033.14 479.14 -2.156 0.031245 *
regionsouthwest -952.89 478.15 -1.993 0.046483 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6059 on 1325 degrees of freedom
Multiple R-squared: 0.7519, Adjusted R-squared: 0.7497
F-statistic: 334.7 on 12 and 1325 DF, p-value: < 2.2e-16
From the obtained p-values, we can see that, in general, the region variable is not significant to the model.
This model also shows a R2 equals to 74.97%, which is low in terms of the performance of the model when predicting. I will train a model without the intercept to figure out if the model improves.
## linear model 2
model_2 <- lm(charges~.-1, data=medical_cost)
summary(model_2)
Call:
lm(formula = charges ~ . - 1, data = medical_cost)
Residuals:
Min 1Q Median 3Q Max
-11689.4 -2902.6 -943.7 1492.2 30042.7
Coefficients:
Estimate Std. Error t value Pr(>|t|)
age 257.19 11.91 21.587 < 2e-16 ***
sexfemale -11927.17 993.66 -12.003 < 2e-16 ***
sexmale -12055.34 1005.76 -11.986 < 2e-16 ***
bmi 336.91 28.61 11.775 < 2e-16 ***
children1 390.98 421.35 0.928 0.353619
children2 1635.78 466.67 3.505 0.000471 ***
children3 964.34 548.10 1.759 0.078735 .
children4 2947.37 1239.16 2.379 0.017524 *
children5 1116.04 1456.02 0.767 0.443514
smokeryes 23836.41 414.14 57.557 < 2e-16 ***
regionnorthwest -380.04 476.56 -0.797 0.425318
regionsoutheast -1033.14 479.14 -2.156 0.031245 *
regionsouthwest -952.89 478.15 -1.993 0.046483 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6059 on 1325 degrees of freedom
Multiple R-squared: 0.8873, Adjusted R-squared: 0.8862
F-statistic: 802.7 on 13 and 1325 DF, p-value: < 2.2e-16
By removing the intercept we can observe that the model improves, since our R2 now is 88.62%. We can use this model to make predictions using the testing later in this notebook.
For now, let’s get some of the plots R allows us to draw in order to understand better our linear model.
## plots
par(mfrow=c(1,2))
plot(model_2,1)
plot(model_2,2)
In the Residuals vs Fitted plot, we can observe that the point are not randomly distributed (the ideal behavior). In statistics this behavior is attributed to the fact that the variance of the error terms is not constant.
## plots
par(mfrow=c(1,2))
plot(model_2,3)
plot(model_2,4)
The cook distance is a way to estimate the influence of a data point when performing a least-squared linear regression analysis like the one I’m trying to perform. This Cook’s distance measures the effect of deleting a given point on the regression and in a deeper study it is good to examine those point with a high Cook’s distance.
Any point with a Cook’s distance higher than 0.5 is consider to have a high influence on the model, so we don’t have such a problem.
## plots
par(mfrow=c(1,2))
plot(model_2,5)
plot(model_2,6)
The residuals vs leverage plot shows the Standarized Residuals. Here, any point above 2 and below -2 is suspicious to be an outlier. Our model also take 13 predictor (taking into account the codification it takes for the categorical ones), so any point with a leverage higher than (\(2*13/1338 = 0.019\)) is consider to have a high leverage.
From the linear model, we can get the leverage, standarized residuals and Cook’s distance by using some functions. I’m gonna do that and add each of these calculations to the dataset.
## new variables
medical_cost$Leverage <- hatvalues(model_2)
medical_cost$Stand.resid <- rstandard(model_2)
medical_cost$Cooks.dist <- cooks.distance(model_2)
head(medical_cost)
age sex bmi children smoker region charges Leverage Stand.resid
1 19 female 27.900 0 yes southwest 16884.924 0.009669119 -1.38618109
2 18 male 33.770 1 no southeast 1725.552 0.007721241 -0.26242801
3 28 male 33.000 3 no southeast 4449.462 0.010522469 -0.28967340
4 33 male 22.705 0 no northwest 21984.471 0.006163663 3.02694311
5 32 male 28.880 0 no northwest 3866.855 0.005291041 -0.27435914
6 31 female 25.740 0 no southeast 3756.622 0.006041479 0.01190455
Cooks.dist
1 1.443122e-03
2 4.122213e-05
3 6.864132e-05
4 4.371084e-03
5 3.079928e-05
6 6.626089e-08
We can identify the influential point by filtering the data set using the which() function.
high_resid <- medical_cost[which(abs(medical_cost$Stand.resid)>2),]
head(high_resid)
age sex bmi children smoker region charges Leverage
4 33 male 22.705 0 no northwest 21984.47 0.006163663
10 60 female 25.840 0 no northwest 28923.14 0.007053819
35 28 male 36.400 1 yes southwest 51194.56 0.010559523
63 64 male 24.700 1 no northwest 30166.62 0.009530033
103 18 female 30.115 0 no northeast 21344.85 0.006660857
116 60 male 28.595 0 no northeast 30260.00 0.006867743
Stand.resid Cooks.dist
4 3.026943 0.004371084
10 2.831206 0.004380244
35 3.403272 0.009508336
63 2.890461 0.006183636
103 3.063069 0.004839526
116 2.856909 0.004341666
num_coef = length(model_2$coefficients)
num_row = length(model_2$residuals)
high_leverage <- medical_cost[which(medical_cost$Leverage>(2*num_coef/num_row)), ]
head(high_leverage)
age sex bmi children smoker region charges Leverage
33 19 female 28.600 5 no southwest 4687.797 0.05894601
62 25 male 33.660 4 no southeast 4504.662 0.04386263
72 31 male 28.500 5 no northeast 6799.458 0.05935430
84 48 female 41.230 4 no northwest 11033.662 0.04626809
166 47 male 28.215 4 no northeast 10407.086 0.04322164
167 20 female 37.000 5 no southwest 4830.630 0.06063505
Stand.resid Cooks.dist
33 0.32829573 5.193103e-04
62 -0.52738013 9.814735e-04
72 0.02788553 3.774335e-06
84 -0.98742019 3.638446e-03
166 -0.35078895 4.276007e-04
167 -0.17282044 1.482983e-04
medical_cost[which(medical_cost$Cooks.dist>0.5),]
[1] age sex bmi children smoker region
[7] charges Leverage Stand.resid Cooks.dist
<0 rows> (or 0-length row.names)
As we saw earlier, there are no points with a high Cook’s distance.
We can organize the influential points, having high leverage and high residual, in order to study them closely. To do this I’ll use the union() function.
influentials <- union(which(medical_cost$Leverage>(2*num_coef/num_row)), which(abs(medical_cost$Stand.resid)>2))
influentials <- sort(influentials)
## dataset
influential_points <- medical_cost[influentials, ]
head(influential_points)
age sex bmi children smoker region charges Leverage
4 33 male 22.705 0 no northwest 21984.471 0.006163663
10 60 female 25.840 0 no northwest 28923.137 0.007053819
33 19 female 28.600 5 no southwest 4687.797 0.058946006
35 28 male 36.400 1 yes southwest 51194.559 0.010559523
62 25 male 33.660 4 no southeast 4504.662 0.043862635
63 64 male 24.700 1 no northwest 30166.618 0.009530033
Stand.resid Cooks.dist
4 3.0269431 0.0043710839
10 2.8312065 0.0043802444
33 0.3282957 0.0005193103
35 3.4032717 0.0095083360
62 -0.5273801 0.0009814735
63 2.8904608 0.0061836362
With this data and more information about the source of the data one could perform a deeper analysis. That work is out of the scope of this analysis.
From the previous analysis, we can say that the linear model may not be the adequate for predicting the Medical Insurances Cost. It could be a good idea to try fitting another model, but that’s not the idea here; I just wanted to discover and understand some of the features related to performing an analysis using the Linear Model.