Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

Dataset: Insurance dataset from kaggle: https://www.kaggle.com/mirichoi0218/insurance

Description: This dataset looks at medical insurance costs charges for various people based on several factors like number of children, region of residency, age etc.

Step 1) Load the dataset

Step 2) Display first few rows of insurance dataset

insurance <- read.csv("https://raw.githubusercontent.com/PriyaShaji/Data605/master/week%2012/insurance.csv")
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.5.3
## -- Attaching packages ------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.0     v readr   1.3.1
## v tibble  2.1.3     v purrr   0.3.2
## v tidyr   0.8.3     v stringr 1.3.1
## v ggplot2 3.1.0     v forcats 0.3.0
## Warning: package 'ggplot2' was built under R version 3.5.3
## Warning: package 'tibble' was built under R version 3.5.3
## Warning: package 'tidyr' was built under R version 3.5.3
## Warning: package 'readr' was built under R version 3.5.2
## Warning: package 'purrr' was built under R version 3.5.3
## Warning: package 'forcats' was built under R version 3.5.2
## -- Conflicts ---------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
insurance <- mutate(insurance,smoker_group = case_when(str_detect(smoker,"yes") ~ 1,
                                                  TRUE ~ 0))

insurance <- mutate(insurance,age_group = case_when(age > 60 ~ 4,
                                       age > 50 ~ 3,
                                       age > 40 ~ 2,
                                       age > 30 ~ 1,
                                       age < 30 ~ 0,
                                                  TRUE ~ 0))

insurance <- mutate(insurance,bmi_group = case_when(bmi > 30 ~ 2,
                                       age > 25 ~ 1,
                                       age < 25 ~ 0,
                                       TRUE ~ 0))

head(insurance)

Step 3) Summarize the dataset and display the dimensions

summary(insurance)
##       age            sex           bmi           children     smoker    
##  Min.   :18.00   female:662   Min.   :15.96   Min.   :0.000   no :1064  
##  1st Qu.:27.00   male  :676   1st Qu.:26.30   1st Qu.:0.000   yes: 274  
##  Median :39.00                Median :30.40   Median :1.000             
##  Mean   :39.21                Mean   :30.66   Mean   :1.095             
##  3rd Qu.:51.00                3rd Qu.:34.69   3rd Qu.:2.000             
##  Max.   :64.00                Max.   :53.13   Max.   :5.000             
##        region       charges       smoker_group      age_group    
##  northeast:324   Min.   : 1122   Min.   :0.0000   Min.   :0.000  
##  northwest:325   1st Qu.: 4740   1st Qu.:0.0000   1st Qu.:0.000  
##  southeast:364   Median : 9382   Median :0.0000   Median :1.000  
##  southwest:325   Mean   :13270   Mean   :0.2048   Mean   :1.478  
##                  3rd Qu.:16640   3rd Qu.:0.0000   3rd Qu.:3.000  
##                  Max.   :63770   Max.   :1.0000   Max.   :4.000  
##    bmi_group    
##  Min.   :0.000  
##  1st Qu.:1.000  
##  Median :2.000  
##  Mean   :1.413  
##  3rd Qu.:2.000  
##  Max.   :2.000
dim(insurance)
## [1] 1338   10
str(insurance)
## 'data.frame':    1338 obs. of  10 variables:
##  $ age         : int  19 18 28 33 32 31 46 37 37 60 ...
##  $ sex         : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
##  $ bmi         : num  27.9 33.8 33 22.7 28.9 ...
##  $ children    : int  0 1 3 0 0 0 1 3 2 0 ...
##  $ smoker      : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
##  $ region      : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
##  $ charges     : num  16885 1726 4449 21984 3867 ...
##  $ smoker_group: num  1 0 0 0 0 0 0 0 0 0 ...
##  $ age_group   : num  0 0 0 1 1 1 2 1 1 3 ...
##  $ bmi_group   : num  0 2 2 1 1 1 2 1 1 1 ...

From the above steps we see that the dataset is tidy and clean.

Step 4) Now, let’s analyze it using graphs

par(mfrow=c(2,2))
hist(insurance$bmi_group, xlab = "BMI (Body Mass Index)",
     main = "Histogram of BMI")
hist(insurance$charges, xlab = "Medical Charges",
     main = "Histogram of Medical Charges") 

hist(insurance$age_group, xlab = "Age_group",
     main = "Histogram of Age group") 

hist(insurance$smoker_group, xlab = "smoker_cond",
     main = "Histogram of smoker_cond group")

Histogram of medical charges looks more right-skewed.

par(mfrow=c(1,3))
with(insurance, plot(charges ~ smoker + sex + region))

I have plotted above graphs using smoker, gender and region.

We see that BMI is nearly normally distributed, medical charges is right-skewed and there are many outliers for high medical charges against both genders and various regions.

We also see that the median is about the same for all regions, and genders.

Note that for smokers, medical charges are much higher than normal ones which we should expect.

Now, let’s fit a multiple regression model, let have the explanatory variables as

sex (categorical)

bmi (numerical, continous)

age (numerical, discrete)

smoker (categorical)

charges (numerical, continous)

Step 5) Let’s make a multiple regression model of the following equation:

\(charges = β0+β1∗Sex+β2∗bmi+β3∗age+β4∗smoker+β5(bmi∗sex)\)

lm_insurance <- lm(charges ~ age*bmi_group+age*smoker_group +age*age_group + age*smoker_group*bmi_group, data = insurance)
summary(lm_insurance)
## 
## Call:
## lm(formula = charges ~ age * bmi_group + age * smoker_group + 
##     age * age_group + age * smoker_group * bmi_group, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10148.7  -1981.3  -1398.6   -293.5  23776.7 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 -161.282   1088.565  -0.148 0.882239    
## age                          192.129     41.509   4.629 4.04e-06 ***
## bmi_group                    280.862    536.702   0.523 0.600845    
## smoker_group               24020.686   1806.664  13.296  < 2e-16 ***
## age_group                  -1159.028    703.193  -1.648 0.099541 .  
## age:bmi_group                 -7.385     15.693  -0.471 0.638018    
## age:smoker_group            -636.493     55.530 -11.462  < 2e-16 ***
## age:age_group                 36.434      9.489   3.840 0.000129 ***
## bmi_group:smoker_group      3109.117   1157.505   2.686 0.007320 ** 
## age:bmi_group:smoker_group   350.018     34.287  10.208  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4663 on 1328 degrees of freedom
## Multiple R-squared:  0.8527, Adjusted R-squared:  0.8517 
## F-statistic: 854.3 on 9 and 1328 DF,  p-value: < 2.2e-16
Conduct residual analysis

Step 6) Plot Histogram of Residuals

par(mfrow=c(2,2))
hist(lm_insurance$residuals, main = "Histogram of Residuals", xlab= "")
plot(lm_insurance$residuals, fitted(lm_insurance))
qqnorm(lm_insurance$residuals)
qqline(lm_insurance$residuals)

We see that the residuals histogram is somewhat normal but the residuals vs fitted values doesn’t show constant variance which is not good for a multiple regression model.

The equation of this multiple regression model is as follows:

\(charges = −11717.37+54.1∗sex+325.78∗bmi+259.47∗age+23836.07∗smoker−5.33(bmi∗sex)\)

Note the variables

sex = 1 for male and 0 for female

smoker = 1 for male and 0 for female

Interpret all coefficients

What does this tell us? Let’s look at the details of the summary in more detail

Coefficients:

Intercept: This tells us that leaving all other terms constant, on average the estimated medical charge is about $-11717.36 which logically won’t make sense and is good there are other terms in the model.

Sex: If a person is male and leaving all other terms constant, he can expect to pay about $54.1 in medical costs.

BMI: Leaving all other terms constant, a person can be expected to pay about $325.78 in medical charges per BMI value.

Age: Leaving all other terms constant, a person can be expected to pay about $259.47 in medical expenses multiplied by their age (A 31 year old will pay about $8043.57)

Smoker: A person who smokes and leaving all variables constant can expect to pay $23836.07

Sex*BMI: A male can expect to pay holding all other variables constant can expect to pay $-5.326 which doesn’t make sense logically.

P-values of coefficients:

The p-values of the intercept, bmi, age and male smokers are very low and we can reject the null hypothesis (H_0 = 0) and favor the alternative (H_A != 0) that is the true coefficients is not 0

For Males and Male*bmi, we fail to reject the null hypothesis and thus these coefficients are very close to 0 and can be excluded in our model.

Residual Standard Error: The residual standard error of 6097 is the standard deviation and is a bit far from the good fit of points.

R-squared/Adjusted R^2: values of 0.7475 and 0.7466 respectively, this means that about 75% of the data fall into the regression line.

F-statistic: value of 788.6 with a small p-value < 2.2e-16 means that the features selected are better than the intercept-only model which as described before makes sense as a intercept only model gives a negative medical cost which doesn’t apply or make sense.

Conclusions

The above model does not have much efficiency, as there are coefficients that can be removed or probably added for better accuracy and properly modeling and predicting medical costs. The residual standard error as well as the Q-Q plots show that the model is not a good fit for the data. One good thing I can say about the model is that the BMI and Age coefficients make sense as the more your BMI is and older, you are more likley to have more health problems and have more medical costs to pay.

Future work can be done to add more coefficients, transforms and possibly use non-linear regression to better predict medical costs.