Include in this model at least one quadratic term,bone dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients.
For this discussion, I will look at the Kaggle Dataset “Medical Cost Personal Datasets” link can be found here to download: https://www.kaggle.com/mirichoi0218/insurance
This dataset looks at medical insurance costs charges for various people based on several factors like number of children, region of residency, age etc.
I will make a multiple linear regression model and make a best-fit line for computing medical costs.
Loading the data and viewing attributes
insurance <- read.csv("https://raw.githubusercontent.com/GabrielSantos33/DATA605_W11/main/insurance.csv")
dim(insurance)
## [1] 1338 7
summary(insurance)
## age sex bmi children
## Min. :18.00 Length:1338 Min. :15.96 Min. :0.000
## 1st Qu.:27.00 Class :character 1st Qu.:26.30 1st Qu.:0.000
## Median :39.00 Mode :character Median :30.40 Median :1.000
## Mean :39.21 Mean :30.66 Mean :1.095
## 3rd Qu.:51.00 3rd Qu.:34.69 3rd Qu.:2.000
## Max. :64.00 Max. :53.13 Max. :5.000
## smoker region charges
## Length:1338 Length:1338 Min. : 1122
## Class :character Class :character 1st Qu.: 4740
## Mode :character Mode :character Median : 9382
## Mean :13270
## 3rd Qu.:16640
## Max. :63770
str(insurance)
## 'data.frame': 1338 obs. of 7 variables:
## $ age : int 19 18 28 33 32 31 46 37 37 60 ...
## $ sex : chr "female" "male" "male" "male" ...
## $ bmi : num 27.9 33.8 33 22.7 28.9 ...
## $ children: int 0 1 3 0 0 0 1 3 2 0 ...
## $ smoker : chr "yes" "no" "no" "no" ...
## $ region : chr "southwest" "southeast" "southeast" "northwest" ...
## $ charges : num 16885 1726 4449 21984 3867 ...
head(insurance, n = 10)
## age sex bmi children smoker region charges
## 1 19 female 27.900 0 yes southwest 16884.924
## 2 18 male 33.770 1 no southeast 1725.552
## 3 28 male 33.000 3 no southeast 4449.462
## 4 33 male 22.705 0 no northwest 21984.471
## 5 32 male 28.880 0 no northwest 3866.855
## 6 31 female 25.740 0 no southeast 3756.622
## 7 46 female 33.440 1 no southeast 8240.590
## 8 37 female 27.740 3 no northwest 7281.506
## 9 37 male 29.830 2 no northeast 6406.411
## 10 60 female 25.840 0 no northwest 28923.137
Let’s plot some graphs.
par(mfrow=c(1,2))
hist(insurance$bmi, xlab = "BMI (Body Mass Index)",
main = "Histogram of BMI", col= "lightblue")
hist(insurance$charges, xlab = "Medical Charges",
main = "Histogram of Medical Charges", col= "lightblue")
par(mfrow=c(1,3))
with(insurance, boxplot(charges ~ smoker, col="lightblue", main="Smoker"))
with(insurance, boxplot(charges ~ sex, col="lightblue", main="Sex"))
with(insurance, boxplot(charges ~ region, col="lightblue", main="Region"))
According to the graphs, I see that BMI is nearly normally distributed, medical charges is right-skewed.
I also see that the median is about the same for the both sex, medical charges are slightly higher for men than women. The median is the same for northeast region and southeast region.
In the case of smokers the medical charges are much higher than people who are not smokers.
Using a multiple regression model, let have the explanatory variables as
Let’s make a multiple regression model of the following equation:
\[ \begin{aligned} \widehat{charges} = \beta_0 + \beta_1 * Sex + \beta_2 * bmi + \beta_3 * age + \beta_4 * smoker + \beta_5 (bmi*sex) \end{aligned} \]
lm_insurance <- lm(charges ~ sex + bmi + age + smoker + bmi*sex, data = insurance)
summary(lm_insurance)
##
## Call:
## lm(formula = charges ~ sex + bmi + age + smoker + bmi * sex,
## data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12396.7 -2983.0 -985.4 1478.3 29015.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11717.369 1282.175 -9.139 < 2e-16 ***
## sexmale 54.096 1712.963 0.032 0.975
## bmi 325.779 39.341 8.281 2.94e-16 ***
## age 259.469 11.947 21.718 < 2e-16 ***
## smokeryes 23836.067 414.958 57.442 < 2e-16 ***
## sexmale:bmi -5.326 54.846 -0.097 0.923
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6097 on 1332 degrees of freedom
## Multiple R-squared: 0.7475, Adjusted R-squared: 0.7466
## F-statistic: 788.6 on 5 and 1332 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
hist(lm_insurance$residuals, main = "Histogram of Residuals", xlab= "", col= "lightblue")
plot(lm_insurance$residuals, fitted(lm_insurance), col= "blue")
qqnorm(lm_insurance$residuals, col= "blue")
qqline(lm_insurance$residuals, col= "black")
I see that the residuals histogram is somewhat normal but the residuals vs fitted values doesn’t show constant variance which is not good for a multiple regression model.
The equation of this multiple regression model is as follows:
\[ \begin{aligned} \widehat{charges} = -11717.369 + 54.096 * Sex + 325.779 * bmi + 259.469 * age + 23836.067 * smoker + (-5.326) (bmi*sex) \end{aligned} \]
sex = 1 for male and 0 for female smoker = 1 for male and 0 for female
Coefficients:
Intercept: The estimated average medical cost is $-11,717,369, keeping all other variables constant. This value is not logical because it is a negative value.
Sex: Approximate medical costs for a male is $54,096, holding all other variables constant.
BMI: A person will pay approximately $325,779 in medical charges per BMI value.
Age: A person will pay proportionally to her age an approximate medical cost of $259,469 for each year of life, keeping the other variables constant.
Smoker: For a person who is a smoker, the approximate medical costs are $23,836,067, keeping other variables constant.
Sex * BMI: A man, in relation to his BMI, will pay approximately $ -5,326 for medical costs, keeping the other variables constant. This value is not logical because it is a negative value.
P-values of coefficients:
Residual standard error: 6097 on 1332 degrees of freedom Multiple R-squared: 0.7475, Adjusted R-squared: 0.7466 F-statistic: 788.6 on 5 and 1332 DF, p-value: < 2.2e-16
Residual Standard Error: the standard deviation is 6097.
R-squared/Adjusted R^2: values of 0.7475 and 0.7466 respectively, this means that about 75% of the data is outside off the regression line.
F-statistic: value of 788.6 with a small p-value < 2.2e-16. F-statistic tries to determine if among a group of independent variables, at least one has the capacity to explain a significant part of the variation of the dependent variable. For this exercise does not have the capability.
I am going to predict the medical cost for my wife:
She’s 43 years old Age = 43 I don’t smoke smoke = 0 She’s female female = 0 Her BMI is: 28.8
my_predicted_medical_cost <- predict(lm_insurance,data.frame(age=43, smoker="no", bmi=28.8, sex="female"))
my_predicted_medical_cost
## 1
## 8822.231
According to the multiple regression model, the approximate medical costs for my wife are around $8,822,231.
According to reality, the medical costs for my wife are much lower than what the regression model is proposing. So this multiple regression model is not suitable for estimating people’s medical costs.
Conclusion:
The multiple regression model is not appropriate for generating estimates of medical costs.