Using R, build a multiple regression model for data that interests you.Conduct residual analysis. Was the linear model appropriate? Why or why not?.

Include in this model at least one quadratic term,bone dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients.

For this discussion, I will look at the Kaggle Dataset “Medical Cost Personal Datasets” link can be found here to download: https://www.kaggle.com/mirichoi0218/insurance

This dataset looks at medical insurance costs charges for various people based on several factors like number of children, region of residency, age etc.

I will make a multiple linear regression model and make a best-fit line for computing medical costs.

Loading the data and viewing attributes

insurance <- read.csv("https://raw.githubusercontent.com/GabrielSantos33/DATA605_W11/main/insurance.csv")
dim(insurance)
## [1] 1338    7
summary(insurance)
##       age            sex                 bmi           children    
##  Min.   :18.00   Length:1338        Min.   :15.96   Min.   :0.000  
##  1st Qu.:27.00   Class :character   1st Qu.:26.30   1st Qu.:0.000  
##  Median :39.00   Mode  :character   Median :30.40   Median :1.000  
##  Mean   :39.21                      Mean   :30.66   Mean   :1.095  
##  3rd Qu.:51.00                      3rd Qu.:34.69   3rd Qu.:2.000  
##  Max.   :64.00                      Max.   :53.13   Max.   :5.000  
##     smoker             region             charges     
##  Length:1338        Length:1338        Min.   : 1122  
##  Class :character   Class :character   1st Qu.: 4740  
##  Mode  :character   Mode  :character   Median : 9382  
##                                        Mean   :13270  
##                                        3rd Qu.:16640  
##                                        Max.   :63770
str(insurance)
## 'data.frame':    1338 obs. of  7 variables:
##  $ age     : int  19 18 28 33 32 31 46 37 37 60 ...
##  $ sex     : chr  "female" "male" "male" "male" ...
##  $ bmi     : num  27.9 33.8 33 22.7 28.9 ...
##  $ children: int  0 1 3 0 0 0 1 3 2 0 ...
##  $ smoker  : chr  "yes" "no" "no" "no" ...
##  $ region  : chr  "southwest" "southeast" "southeast" "northwest" ...
##  $ charges : num  16885 1726 4449 21984 3867 ...
head(insurance, n = 10)
##    age    sex    bmi children smoker    region   charges
## 1   19 female 27.900        0    yes southwest 16884.924
## 2   18   male 33.770        1     no southeast  1725.552
## 3   28   male 33.000        3     no southeast  4449.462
## 4   33   male 22.705        0     no northwest 21984.471
## 5   32   male 28.880        0     no northwest  3866.855
## 6   31 female 25.740        0     no southeast  3756.622
## 7   46 female 33.440        1     no southeast  8240.590
## 8   37 female 27.740        3     no northwest  7281.506
## 9   37   male 29.830        2     no northeast  6406.411
## 10  60 female 25.840        0     no northwest 28923.137

Let’s plot some graphs.

par(mfrow=c(1,2))
hist(insurance$bmi, xlab = "BMI (Body Mass Index)",
     main = "Histogram of BMI", col= "lightblue")
hist(insurance$charges, xlab = "Medical Charges",
     main = "Histogram of Medical Charges", col= "lightblue") 

par(mfrow=c(1,3))
with(insurance, boxplot(charges ~ smoker, col="lightblue", main="Smoker"))
with(insurance, boxplot(charges ~ sex, col="lightblue", main="Sex"))
with(insurance, boxplot(charges ~ region, col="lightblue", main="Region"))

According to the graphs, I see that BMI is nearly normally distributed, medical charges is right-skewed.

I also see that the median is about the same for the both sex, medical charges are slightly higher for men than women. The median is the same for northeast region and southeast region.

In the case of smokers the medical charges are much higher than people who are not smokers.

Using a multiple regression model, let have the explanatory variables as

Let’s make a multiple regression model of the following equation:

\[ \begin{aligned} \widehat{charges} = \beta_0 + \beta_1 * Sex + \beta_2 * bmi + \beta_3 * age + \beta_4 * smoker + \beta_5 (bmi*sex) \end{aligned} \]

lm_insurance <- lm(charges ~ sex + bmi + age + smoker + bmi*sex, data = insurance)
summary(lm_insurance)
## 
## Call:
## lm(formula = charges ~ sex + bmi + age + smoker + bmi * sex, 
##     data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12396.7  -2983.0   -985.4   1478.3  29015.0 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -11717.369   1282.175  -9.139  < 2e-16 ***
## sexmale         54.096   1712.963   0.032    0.975    
## bmi            325.779     39.341   8.281 2.94e-16 ***
## age            259.469     11.947  21.718  < 2e-16 ***
## smokeryes    23836.067    414.958  57.442  < 2e-16 ***
## sexmale:bmi     -5.326     54.846  -0.097    0.923    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6097 on 1332 degrees of freedom
## Multiple R-squared:  0.7475, Adjusted R-squared:  0.7466 
## F-statistic: 788.6 on 5 and 1332 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
hist(lm_insurance$residuals, main = "Histogram of Residuals", xlab= "", col= "lightblue")
plot(lm_insurance$residuals, fitted(lm_insurance), col= "blue")
qqnorm(lm_insurance$residuals, col= "blue")
qqline(lm_insurance$residuals, col= "black")

I see that the residuals histogram is somewhat normal but the residuals vs fitted values doesn’t show constant variance which is not good for a multiple regression model.

The equation of this multiple regression model is as follows:

\[ \begin{aligned} \widehat{charges} = -11717.369 + 54.096 * Sex + 325.779 * bmi + 259.469 * age + 23836.067 * smoker + (-5.326) (bmi*sex) \end{aligned} \]

sex = 1 for male and 0 for female smoker = 1 for male and 0 for female

Coefficients:

Intercept: The estimated average medical cost is $-11,717,369, keeping all other variables constant. This value is not logical because it is a negative value.

Sex: Approximate medical costs for a male is $54,096, holding all other variables constant.

BMI: A person will pay approximately $325,779 in medical charges per BMI value.

Age: A person will pay proportionally to her age an approximate medical cost of $259,469 for each year of life, keeping the other variables constant.

Smoker: For a person who is a smoker, the approximate medical costs are $23,836,067, keeping other variables constant.

Sex * BMI: A man, in relation to his BMI, will pay approximately $ -5,326 for medical costs, keeping the other variables constant. This value is not logical because it is a negative value.

P-values of coefficients:

Residual standard error: 6097 on 1332 degrees of freedom Multiple R-squared: 0.7475, Adjusted R-squared: 0.7466 F-statistic: 788.6 on 5 and 1332 DF, p-value: < 2.2e-16

Residual Standard Error: the standard deviation is 6097.

R-squared/Adjusted R^2: values of 0.7475 and 0.7466 respectively, this means that about 75% of the data is outside off the regression line.

F-statistic: value of 788.6 with a small p-value < 2.2e-16. F-statistic tries to determine if among a group of independent variables, at least one has the capacity to explain a significant part of the variation of the dependent variable. For this exercise does not have the capability.

I am going to predict the medical cost for my wife:

She’s 43 years old Age = 43 I don’t smoke smoke = 0 She’s female female = 0 Her BMI is: 28.8

my_predicted_medical_cost <- predict(lm_insurance,data.frame(age=43, smoker="no", bmi=28.8, sex="female"))
my_predicted_medical_cost
##        1 
## 8822.231

According to the multiple regression model, the approximate medical costs for my wife are around $8,822,231.

According to reality, the medical costs for my wife are much lower than what the regression model is proposing. So this multiple regression model is not suitable for estimating people’s medical costs.

Conclusion:

The multiple regression model is not appropriate for generating estimates of medical costs.