Linear Regression Analysis of Insurance Data

Part 1: Collecting Data ——————-

Insurance.csv file is obtained from the Machine Learning course website (Spring 2017) from Professor Eric Suess at http://www.sci.csueastbay.edu/~esuess/stat6620/#week-6.

Step 2: Exploring and preparing the data —-

The insurance.csv dataset contains 1338 observations (rows) and 7 features (columns). The dataset contains 4 numerical features (age, bmi, children and expenses) and 3 nominal features (sex, smoker and region) that were converted into factors with numerical value desginated for each level.

The purposes of this exercise to look into different features to observe their relationship, and plot a multiple linear regression based on several features of individual such as age, physical/family condition and location against their existing medical expense to be used for predicting future medical expenses of individuals that help medical insurance to make decision on charging the premium.

insurance = read.csv("http://www.sci.csueastbay.edu/~esuess/classes/Statistics_6620/Presentations/ml10/insurance.csv", stringsAsFactors = T)
str(insurance)

## 'data.frame':    1338 obs. of  7 variables:
##  $ age     : int  19 18 28 33 32 31 46 37 37 60 ...
##  $ sex     : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
##  $ bmi     : num  27.9 33.8 33 22.7 28.9 25.7 33.4 27.7 29.8 25.8 ...
##  $ children: int  0 1 3 0 0 0 1 3 2 0 ...
##  $ smoker  : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
##  $ region  : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
##  $ expenses: num  16885 1726 4449 21984 3867 ...

summary(insurance)

##       age            sex           bmi           children     smoker    
##  Min.   :18.00   female:662   Min.   :16.00   Min.   :0.000   no :1064  
##  1st Qu.:27.00   male  :676   1st Qu.:26.30   1st Qu.:0.000   yes: 274  
##  Median :39.00                Median :30.40   Median :1.000             
##  Mean   :39.21                Mean   :30.67   Mean   :1.095             
##  3rd Qu.:51.00                3rd Qu.:34.70   3rd Qu.:2.000             
##  Max.   :64.00                Max.   :53.10   Max.   :5.000             
##        region       expenses    
##  northeast:324   Min.   : 1122  
##  northwest:325   1st Qu.: 4740  
##  southeast:364   Median : 9382  
##  southwest:325   Mean   :13270  
##                  3rd Qu.:16640  
##                  Max.   :63770

The estimation on the medical expenses of individual based on their personal/family traits is of interest in this exercise. It is always helpful to have some idea of what the target feature is. Thus, a histogram of the medical expenses is plotted, and has shown a right skewed distribution. Having this in mind is helpful for further analysis.

hist(insurance$expenses, breaks = 30)

It is also beneficial to know the proportion distribution of the nominal features such as sex (male/female), smoking condition (yes/no) and living regions in this example. The dataset has contains a pretty belance in proporiton of sex and region, but not in smoking condition. Only about one fifth of the population in this dataset are smokers.

table(insurance$sex)

## 
## female   male 
##    662    676

table(insurance$smoker)

## 
##   no  yes 
## 1064  274

table(insurance$region)

## 
## northeast northwest southeast southwest 
##       324       325       364       325

One way to look at the relationship of numerical features is the used of a correlation matrix. A correlation value is between -1 to 1 when a negative indicates reversed relationship and a positive indicates a parallel relationship; a value closer to the extremes indicates a higher correlation. The correlation matrix below depicted relationship of four features in this dataset, while the number at the intersection of each row and column pair is actually indicating a correlation between two features by that row and column.

In this correlation matrix, the diagonal values are always one because a feature is at most correlated with itself. The values are seperated into two sets off diagonal although they are identical. For example, the highest correlation pair is the age & expenses with about 0.3 (positive correlation), which indicates that an older individual spend more on the medical.

cor(insurance[c("age", "bmi", "children", "expenses")])

##                age        bmi   children   expenses
## age      1.0000000 0.10934101 0.04246900 0.29900819
## bmi      0.1093410 1.00000000 0.01264471 0.19857626
## children 0.0424690 0.01264471 1.00000000 0.06799823
## expenses 0.2990082 0.19857626 0.06799823 1.00000000

Another useful way of showing the correlation is a graphical visualization using a scatterplot matrices. Names on the diagonal indicates features on the x- & y-axis. Scatterplots off diagonal are seperated into two sets in a reversed order of x- & y-axises.

Here the most obvious correlation is between age vs. expenses that can be seen on the top right scatterplot, which conveys the same information we had mentioned above: the older the age, the more spending on medical. Also, realizing there’s three distincted regression clusters in the same plot, it is quite reasonable to asusme that the three different cluseters are also based upon the number of children one has. Although not shown directly, but pretty convincing given the similar trend in the scatterplot of age vs. children.

pairs(insurance[c("age", "bmi", "children", "expenses")])

Lastly, a super fancy way to show correlation is the use of pairs.panels() function from the psych package. The pairs.panels is splitting into three funcational groups that demonstrated three kind of information. The bottom left portion off diagonal shows similar scatterplot matrices before, with additional information such as the mean of two features value given in a red dot; a lowess curve that show a more flexible relationship between two features, and a correlation ellipse (circle) that shows a correlation strength (higher correlation indicated by a more strethy ellipse). The diagrams in diagonal show the histogram of each respective feature with a distribution curve. Numbers on the top right off diagonal are the correlation value between each two respective features.

library(psych)
pairs.panels(insurance[c("age", "bmi", "children", "expenses")])

Step 3: Training a model on the data —-

A multiple linear regression is plotted by using expenses as the dependent variable, and the rest of features as indipendent variables in the regression model. Therefore, a lm() function is used to make such regression by assigning expenses as y before tilda “~”, and using a “.” as a short hand notation to indicate using the rest of the features from the dataset as multiple independent variables regardless of numerical or categorical features. (age, children, bmi, sex, smoker and region).

ins_model <- lm(expenses ~ ., data = insurance)

Step 4: Evaluating model performance —-

Unlike a model for classification that can show its performance by using a confusion matrix for finding the accuracy. The model for numerical prediction is usually rated upon feature selection and model’s p-value. Using the summary() function to call the model object, the range of residuals (errors) is shown on top, the coefficient’s estimation, standard error, and p-values are shown in the middle, and finally R square, the correlation coefficient, and p-value of the model is shown at the bottom.

The estimation on coefficients column shows the estimate values of coefficients in the model for each respective feature to the left, nominal features are given dummy variables depend on their number of levels using n-1 method. So one unit increase of the feature to the left will affect the value of dependent variable, expense by the amount of the coefficient estimation. The p-value, usually compared with alpha = 0.05 levels shows significance of the respective coefficient term. (H0= beta-i = 0 vs. Ha = beta-i not equal 0). A number that is smaller than 0.05 in the p-value column indicates rejection in the null hypothesis and confirmed alternative hypothesis that beta-i term is significant at 0.05 level and that particular beta term shoule be kepted.

Therefore, in this example, a unit increase in age will increase the expense by 257 dollars, while a person living in the southwest might decrease the expense by 1036 dollars (statement confirmed as long as the p-value are small). Also, age, childern, bmi, smokeyes, region southeast and southwest are significant, while the rest are not which can be dropped for model improvement. R-square = 0.75 indicates that about 75% of the variation in expenses is explained by the model.

summary(ins_model)

## 
## Call:
## lm(formula = expenses ~ ., data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11302.7  -2850.9   -979.6   1383.9  29981.7 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11941.6      987.8 -12.089  < 2e-16 ***
## age                256.8       11.9  21.586  < 2e-16 ***
## sexmale           -131.3      332.9  -0.395 0.693255    
## bmi                339.3       28.6  11.864  < 2e-16 ***
## children           475.7      137.8   3.452 0.000574 ***
## smokeryes        23847.5      413.1  57.723  < 2e-16 ***
## regionnorthwest   -352.8      476.3  -0.741 0.458976    
## regionsoutheast  -1035.6      478.7  -2.163 0.030685 *  
## regionsouthwest   -959.3      477.9  -2.007 0.044921 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7494 
## F-statistic: 500.9 on 8 and 1329 DF,  p-value: < 2.2e-16

Step 5: Improving model performance —-

The model can be improved by looking deeper into several features. For example, assuming that the increase in age and the expenses is not in a linear fashion; the older one gets, an even larger amount of expenses for medical is needed when compared to a year prior, so a square term of the age, age^2 is added.

Also, thinking that the bmi (body-mass-index) is only affecting the medical expenses extraordinarily when one’s bmi passes certain threshold value as is considered being obese, and not affecting when the bmi is varied but below a threhold. Therefore, another new term, bmi30, is added by categorize the numberical value bmi feature into two portions; 0 for bmi below 30 and 1 for bmi above.

Finally, realizing that the increase in medical expenses is much higher for smoker than those with each unit incrase of bmi from the previous multiple regression model makes it reasonable to assume that the effect of smoking on medical expenses is a lot more; and an obese smoker is spending even more on medical than it were for individual with obesity or is a smoker alone. Because individual is more proned to getting various healthy related issues when s/he is obese and smoking together. So an interaction term of bmi30*smoker is added in the improved model.

insurance$age2 <- insurance$age^2
insurance$bmi30 <- ifelse(insurance$bmi >= 30, 1, 0)
ins_model2 <- lm(expenses ~ age + age2 + children + bmi + sex +
                   bmi30*smoker + region, data = insurance)

The improved model with additional terms is shown below. One can observed that the newly added terms, age2, bmi30 and bmi30:smokeyes are all significant by p-values, which means that these terms are reasonable to be kept in the model. The improved model also contains an R-square of about 0.87, which indicates that 87% of variation in the expenses is explained by the improved model, which is a lot better than the R-square value of the original model, taking into the account for additional independent variable terms when comparing the adjusted R-square.

summary(ins_model2)

## 
## Call:
## lm(formula = expenses ~ age + age2 + children + bmi + sex + bmi30 * 
##     smoker + region, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17297.1  -1656.0  -1262.7   -727.8  24161.6 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       139.0053  1363.1359   0.102 0.918792    
## age               -32.6181    59.8250  -0.545 0.585690    
## age2                3.7307     0.7463   4.999 6.54e-07 ***
## children          678.6017   105.8855   6.409 2.03e-10 ***
## bmi               119.7715    34.2796   3.494 0.000492 ***
## sexmale          -496.7690   244.3713  -2.033 0.042267 *  
## bmi30            -997.9355   422.9607  -2.359 0.018449 *  
## smokeryes       13404.5952   439.9591  30.468  < 2e-16 ***
## regionnorthwest  -279.1661   349.2826  -0.799 0.424285    
## regionsoutheast  -828.0345   351.6484  -2.355 0.018682 *  
## regionsouthwest -1222.1619   350.5314  -3.487 0.000505 ***
## bmi30:smokeryes 19810.1534   604.6769  32.762  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4445 on 1326 degrees of freedom
## Multiple R-squared:  0.8664, Adjusted R-squared:  0.8653 
## F-statistic: 781.7 on 11 and 1326 DF,  p-value: < 2.2e-16

Conclusion: A multiple linear regression is originally modeled using several numerical and nominal features such as age, sex, number of childern, bmi and regions. The model performance can be evaluate by looking at the model statistics such as residuals range, estimated on coefficients on the regresion, overall model and individual feature p-values, and R-square. An improved mutiple polynomial regression is modeled later using some subject-matter knowledge about how the independent features can be related to the outcome dependent feature in the insurance dataset. Specifically focused on the age, bmi and smoker features, a better model is made with higher correlation coefficient.