DATA605_Discussion Wk12_Multiple Linear Regression
Question
Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
Dataset
The data come from Kaggle's public datasets and can be found online here: https://www.kaggle.com/mirichoi0218/insurance
library(tidyverse)
Load and read the data
url <- "https://raw.githubusercontent.com/omocharly/DATA605/main/insurance.csv"
insurance <- read_csv(url)
## Rows: 1338 Columns: 7
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (3): sex, smoker, region
## dbl (4): age, bmi, children, charges
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
Take a glimpse at the data
glimpse(insurance)
## Rows: 1,338
## Columns: 7
## $ age <dbl> 19, 18, 28, 33, 32, 31, 46, 37, 37, 60, 25, 62, 23, 56, 27, 1~
## $ sex <chr> "female", "male", "male", "male", "male", "female", "female",~
## $ bmi <dbl> 27.900, 33.770, 33.000, 22.705, 28.880, 25.740, 33.440, 27.74~
## $ children <dbl> 0, 1, 3, 0, 0, 0, 1, 3, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0~
## $ smoker <chr> "yes", "no", "no", "no", "no", "no", "no", "no", "no", "no", ~
## $ region <chr> "southwest", "southeast", "southeast", "northwest", "northwes~
## $ charges <dbl> 16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622,~
Model for insurance data
Model for insurance charges using age, bmi and smoker
insurance_model <- lm(charges ~ age + bmi + smoker, data = insurance)
summary(insurance_model)
##
## Call:
## lm(formula = charges ~ age + bmi + smoker, data = insurance)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -12415.4  -2970.9   -980.5   1480.0  28971.8
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11676.83     937.57  -12.45   <2e-16 ***
## age            259.55      11.93   21.75   <2e-16 ***
## bmi            322.62      27.49   11.74   <2e-16 ***
## smokeryes    23823.68     412.87   57.70   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6092 on 1334 degrees of freedom
## Multiple R-squared: 0.7475, Adjusted R-squared: 0.7469
## F-statistic: 1316 on 3 and 1334 DF, p-value: < 2.2e-16
The linear model for predicting the insurance charges based on age, bmi and smoking status is given by:
charges = -11676.83 + 259.55(age) + 322.62(bmi) + 23823.68(smokeryes)
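As a quick arithmetic check of the equation above, the sketch below evaluates it by hand for one example profile. The coefficients are copied from the summary output; the helper name predict_charges and the profile (age 40, bmi 30) are made up for illustration and are not part of the original analysis.

```r
# Coefficients copied from the summary() output above
coefs <- c(intercept = -11676.83, age = 259.55, bmi = 322.62, smokeryes = 23823.68)

# Hypothetical helper (an assumption, not from the original post):
# evaluates the fitted equation term by term
predict_charges <- function(age, bmi, smoker) {
  unname(coefs["intercept"] + coefs["age"] * age + coefs["bmi"] * bmi +
    coefs["smokeryes"] * as.numeric(smoker == "yes"))
}

# Example profile chosen only for illustration
nonsmoker <- predict_charges(age = 40, bmi = 30, smoker = "no")
smoker <- predict_charges(age = 40, bmi = 30, smoker = "yes")
print(c(nonsmoker = nonsmoker, smoker = smoker))
```

For this profile the equation gives about 8383.77 for a non-smoker and 32207.45 for a smoker; the gap between the two is exactly the smokeryes coefficient, 23823.68.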
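The multiple R-squared is 0.7475, so age, bmi, and smoking status together explain about 74.75% of the variation in charges.
Holding the other predictors fixed, each additional year of age is associated with charges about 259.55 higher, each additional unit of bmi with charges about 322.62 higher, and smokers are predicted to have charges about 23823.68 higher than non-smokers. The intercept (-11676.83) is the predicted charge for a non-smoker with age 0 and bmi 0, which has no practical meaning here. All four coefficients are statistically significant (p < 2e-16).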
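The model above contains a dichotomous term (smoker) but no quadratic or interaction term, both of which the question asks for. A minimal sketch of how such a formula might be written follows. To keep it runnable without a download, it uses simulated data whose column names mirror the insurance data; the values, the seed, and the name extended_model are assumptions for illustration only, and the numbers it produces say nothing about the real dataset.

```r
# Simulated stand-in data so the sketch runs on its own.
# ASSUMPTION: column names mirror the insurance data; all values are made up.
set.seed(605)
n <- 500
sim <- data.frame(
  age = sample(18:64, n, replace = TRUE),
  bmi = round(rnorm(n, mean = 30, sd = 6), 1),
  smoker = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2)))
)
sim$charges <- 2000 + 3 * sim$age^2 + 50 * sim$bmi +
  ifelse(sim$smoker == "yes", 1400 * sim$bmi - 20000, 0) +
  rnorm(n, sd = 4000)

# I(age^2) adds the quadratic term;
# bmi * smoker expands to bmi + smoker + bmi:smoker (the interaction term)
extended_model <- lm(charges ~ age + I(age^2) + bmi * smoker, data = sim)
summary(extended_model)
```

To try this on the real data, replace sim with insurance in the lm() call. In a model of this form, the bmi coefficient is the bmi slope for non-smokers, and the bmi:smokeryes coefficient is the additional bmi slope for smokers.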
Residual Analysis
hist(resid(insurance_model))
par(mfrow=c(2,2))
plot(insurance_model)
par(mfrow=c(1,3))
plot(jitter(insurance$age), resid(insurance_model))
abline(h=0, col="violet")
plot(jitter(insurance$bmi), resid(insurance_model))
abline(h=0, col="violet")
plot(jitter(insurance$charges), resid(insurance_model))
abline(h=0, col="violet")
Linearity: In the plots of the residuals against the quantitative variables age, bmi, and charges, the residuals appear randomly dispersed, with no obvious shapes or patterns.
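Nearly normal residuals: The histogram of the residuals looks approximately normal, and in the Q-Q plot the residuals lie mostly along the normal line, although the residual summary above (median -980.5, maximum 28971.8 against a minimum of -12415.4) suggests some right skew. The nearly normal residuals condition is somewhat met.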
Constant variability: The majority of the standardized residuals fall between -1 and 1. The constant variability condition appears to be met.
Based on the three observations above, the linear model appears to be appropriate for these data.
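The visual checks above can be complemented with a few numbers. The sketch below is one possible way to do that using base R only. The helper name residual_checks and the choice of statistics are assumptions for illustration, not part of the original analysis, and the demo runs on simulated data so that it is self-contained.

```r
# Hypothetical helper (not from the original post):
# numeric companions to the diagnostic plots
residual_checks <- function(model) {
  res <- resid(model)
  std_res <- rstandard(model)  # residuals scaled to roughly unit variance
  list(
    # Shapiro-Wilk test: a small p-value suggests non-normal residuals
    shapiro_p = shapiro.test(res)$p.value,
    # Correlation of |residual| with fitted values: far from 0 hints at
    # non-constant variance
    spread_trend = cor(fitted(model), abs(res)),
    # Share of standardized residuals within +/- 1 (about 0.68 under normality)
    share_within_1 = mean(abs(std_res) <= 1)
  )
}

# Simulated data for a runnable demo (values are made up)
set.seed(12)
demo <- data.frame(x = runif(200, 0, 10))
demo$y <- 5 + 2 * demo$x + rnorm(200, sd = 1)
checks <- residual_checks(lm(y ~ x, data = demo))
str(checks)
```

Calling residual_checks(insurance_model) would apply the same checks to the model fitted earlier. Plotting the residuals against fitted(insurance_model) rather than against the response is another option to consider, since residuals from a least-squares fit are correlated with the response by construction.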