Week 12: Regression Analysis in R: Multiple Linear Regression

Using R, build a multiple regression model for data that interests you.
Include in this model at least one quadratic term,bone dichotomous term, 
and one dichotomous vs. quantitative interaction term. 
Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

For this discussion, I will look at the Kaggle Dataset “Medical Cost Personal Datasets” link can be found here to download: https://www.kaggle.com/mirichoi0218/insurance
This dataset looks at medical insurance costs charges for various people based on several factors like number of children, region of residency, age etc.
I will make a multiple linear regression model and make a best-fit line for computing medical costs.
Start by loading the data and viewing attributes

insurance <- read.csv("insurance.csv")
dim(insurance)

## [1] 1338    7

summary(insurance)

##       age            sex           bmi           children     smoker    
##  Min.   :18.00   female:662   Min.   :15.96   Min.   :0.000   no :1064  
##  1st Qu.:27.00   male  :676   1st Qu.:26.30   1st Qu.:0.000   yes: 274  
##  Median :39.00                Median :30.40   Median :1.000             
##  Mean   :39.21                Mean   :30.66   Mean   :1.095             
##  3rd Qu.:51.00                3rd Qu.:34.69   3rd Qu.:2.000             
##  Max.   :64.00                Max.   :53.13   Max.   :5.000             
##        region       charges     
##  northeast:324   Min.   : 1122  
##  northwest:325   1st Qu.: 4740  
##  southeast:364   Median : 9382  
##  southwest:325   Mean   :13270  
##                  3rd Qu.:16640  
##                  Max.   :63770

str(insurance)

## 'data.frame':    1338 obs. of  7 variables:
##  $ age     : int  19 18 28 33 32 31 46 37 37 60 ...
##  $ sex     : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
##  $ bmi     : num  27.9 33.8 33 22.7 28.9 ...
##  $ children: int  0 1 3 0 0 0 1 3 2 0 ...
##  $ smoker  : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
##  $ region  : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
##  $ charges : num  16885 1726 4449 21984 3867 ...

head(insurance, n = 10)

##    age    sex    bmi children smoker    region   charges
## 1   19 female 27.900        0    yes southwest 16884.924
## 2   18   male 33.770        1     no southeast  1725.552
## 3   28   male 33.000        3     no southeast  4449.462
## 4   33   male 22.705        0     no northwest 21984.471
## 5   32   male 28.880        0     no northwest  3866.855
## 6   31 female 25.740        0     no southeast  3756.622
## 7   46 female 33.440        1     no southeast  8240.590
## 8   37 female 27.740        3     no northwest  7281.506
## 9   37   male 29.830        2     no northeast  6406.411
## 10  60 female 25.840        0     no northwest 28923.137

Dataset is well clean and tidy; nothing else to do to it. Let’s plot some graphs.

par(mfrow=c(1,2))
hist(insurance$bmi, xlab = "BMI (Body Mass Index)",
     main = "Histogram of BMI")
hist(insurance$charges, xlab = "Medical Charges",
     main = "Histogram of Medical Charges") # looks more right-skewed

# let's do some plotting of charges against other factors like sex and smoker and region
par(mfrow=c(1,3))
with(insurance, plot(charges ~ smoker + sex + region))

We see that BMI is nearly normally distributed, medical charges is right-skewed and there are many outliers for high medical charges against both genders and various regions.
We also see that the median is about the same for all regions, and genders.
Note that for smokers, medical charges are much higher than normal ones which we should expect.
Now, let’s fit a multiple regression model, let have the explanatory variables as
- sex (categorical)
- bmi (numerical, continous)
- age (numerical, discrete)
- smoker (categorical)
- charges (numerical, continous)
Let’s make a multiple regression model of the following equation:

\[ \begin{aligned} \widehat{charges} = \beta_0 + \beta_1 * Sex + \beta_2 * bmi + \beta_3 * age + \beta_4 * smoker + \beta_5 (bmi*sex) \end{aligned} \]

lm_insurance <- lm(charges ~ sex + bmi + age + smoker + bmi*sex, data = insurance)
summary(lm_insurance)

## 
## Call:
## lm(formula = charges ~ sex + bmi + age + smoker + bmi * sex, 
##     data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12396.7  -2983.0   -985.4   1478.3  29015.0 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -11717.369   1282.175  -9.139  < 2e-16 ***
## sexmale         54.096   1712.963   0.032    0.975    
## bmi            325.779     39.341   8.281 2.94e-16 ***
## age            259.469     11.947  21.718  < 2e-16 ***
## smokeryes    23836.067    414.958  57.442  < 2e-16 ***
## sexmale:bmi     -5.326     54.846  -0.097    0.923    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6097 on 1332 degrees of freedom
## Multiple R-squared:  0.7475, Adjusted R-squared:  0.7466 
## F-statistic: 788.6 on 5 and 1332 DF,  p-value: < 2.2e-16

par(mfrow=c(2,2))
hist(lm_insurance$residuals, main = "Histogram of Residuals", xlab= "")
plot(lm_insurance$residuals, fitted(lm_insurance))
qqnorm(lm_insurance$residuals)
qqline(lm_insurance$residuals)

We see that the residuals histogram is somewhat normal but the residuals vs fitted values doesn’t show constant variance which is not good for a multiple regression model.
The equation of this multiple regression model is as follows:

\[ \begin{aligned} \widehat{charges} = -11717.37 + 54.1*sex + 325.78*bmi + 259.47*age + 23836.07*smoker - 5.33(bmi*sex) \end{aligned} \]

Note the variables
- sex = 1 for male and 0 for female
- smoker = 1 for male and 0 for female
What does this tell us? Let’s look at the details of the summary in more detail
- Coefficients:
  - Intercept: This tells us that leaving all other terms constant, on average the estimated medical charge is about $-11717.36 which logically won’t make sense and is good there are other terms in the model.
  - Sex: If a person is male and leaving all other terms constant, he can expect to pay about $54.1 in medical costs.
  - BMI: Leaving all other terms constant, a person can be expected to pay about $325.78 in medical charges per BMI value.
  - Age: Leaving all other terms constant, a person can be expected to pay about $259.47 in medical expenses multiplied by their age (A 31 year old will pay about $8043.57)
  - Smoker: A person who smokes and leaving all variables constant can expect to pay $23836.07
  - Sex*BMI: A male can expect to pay holding all other variables constant can expect to pay $-5.326 which doesn’t make sense logically.
- P-values of coefficients:
  - The p-values of the intercept, bmi, age and male smokers are very low and we can reject the null hypothesis (H_0 = 0) and favor the alternative (H_A != 0) that is the true coefficients is not 0
  - For Males and Male*bmi, we fail to reject the null hypothesis and thus these coefficients are very close to 0 and can be excluded in our model.
- Residual Standard Error: The residual standard error of 6097 is the standard deviation and is a bit far from the good fit of points.
- R-squared/Adjusted R^2: values of 0.7475 and 0.7466 respectively, this means that about 75% of the data fall into the regression line.
- F-statistic: value of 788.6 with a small p-value < 2.2e-16 means that the features selected are better than the intercept-only model which as described before makes sense as a intercept only model gives a negative medical cost which doesn’t apply or make sense.
Let’s predict the medical cost of myself:
- I’m 31 years old: Age = 31
- I don’t smoke: smoke = 0
- I’m male: male = 1
- My BMI is: 31.3

my_predicted_medical_cost <- predict(lm_insurance,
                                     data.frame(age=31, smoker="no", 
                                                bmi=31.3, sex="male"))
my_predicted_medical_cost

##        1 
## 6410.444

So the multiple regression model that I made says my approximate medical costs is about $6410.44.
So far this year on a personal note, I’ve only spent about $350 in medical costs, which my predicted model is no where near close and overestimates my medical costs!
I would not be comfortable with using the model above as there are coefficients that can be removed or probably added for better accuracy and properly modeling and predicting medical costs. The residual standard error as well as the Q-Q plots show that the model is not a good fit for the data. One good thing I can say about the model is that the BMI and Age coefficients make sense as the more your BMI is and older, you are more likley to have more health problems and have more medical costs to pay.
Future work can be done to add more coefficients, transforms and possibly use non-linear regression to better predict medical costs.

Week 12: Regression Analysis in R: Multiple Linear Regression

Jonathan Hernandez

November 12, 2018