DATA605_Discussion Wk12_Multiple Linear Regression

Question

Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

Dataset

The data come from Kaggle's public datasets and are available here: https://www.kaggle.com/mirichoi0218/insurance

library(tidyverse)

Load and read the data

url <- "https://raw.githubusercontent.com/omocharly/DATA605/main/insurance.csv"
insurance <- read_csv(url)

## Rows: 1338 Columns: 7
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (3): sex, smoker, region
## dbl (4): age, bmi, children, charges
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

Take a quick look at the data

glimpse(insurance)

## Rows: 1,338
## Columns: 7
## $ age      <dbl> 19, 18, 28, 33, 32, 31, 46, 37, 37, 60, 25, 62, 23, 56, 27, 1~
## $ sex      <chr> "female", "male", "male", "male", "male", "female", "female",~
## $ bmi      <dbl> 27.900, 33.770, 33.000, 22.705, 28.880, 25.740, 33.440, 27.74~
## $ children <dbl> 0, 1, 3, 0, 0, 0, 1, 3, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0~
## $ smoker   <chr> "yes", "no", "no", "no", "no", "no", "no", "no", "no", "no", ~
## $ region   <chr> "southwest", "southeast", "southeast", "northwest", "northwes~
## $ charges  <dbl> 16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622,~

Model for insurance data

Model for insurance charges using age, bmi and smoker

insurance_model <- lm(data=insurance, charges ~ age + bmi + smoker)
summary(insurance_model)

## 
## Call:
## lm(formula = charges ~ age + bmi + smoker, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12415.4  -2970.9   -980.5   1480.0  28971.8 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -11676.83     937.57  -12.45   <2e-16 ***
## age            259.55      11.93   21.75   <2e-16 ***
## bmi            322.62      27.49   11.74   <2e-16 ***
## smokeryes    23823.68     412.87   57.70   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6092 on 1334 degrees of freedom
## Multiple R-squared:  0.7475, Adjusted R-squared:  0.7469 
## F-statistic:  1316 on 3 and 1334 DF,  p-value: < 2.2e-16

The linear model for predicting the insurance charges based on age, bmi and smoking status is given by:

charges = -11676.83 + 259.55(age) + 322.62(bmi) + 23823.68(smokeryes)

The multiple R-squared is 0.7475, so the model explains about 74.75% of the variance in charges.
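To make the coefficients concrete: holding the other predictors fixed, each additional year of age is associated with about 259.55 more in charges, each additional unit of bmi with about 322.62 more, and smokers are charged about 23823.68 more than non-smokers. The intercept (-11676.83) is the fitted value for a non-smoker with age 0 and bmi 0, which has no practical meaning. The sketch below plugs a hypothetical person (age 40, bmi 30; values chosen only for illustration) into the fitted equation, using the rounded coefficients from the summary. The helper `predict_charges()` is not part of the original analysis.

```r
# Rounded coefficients copied from the summary() output above
coefs <- c(intercept = -11676.83, age = 259.55, bmi = 322.62, smokeryes = 23823.68)

# Hypothetical helper (an illustration, not part of the original analysis)
predict_charges <- function(age, bmi, smoker) {
  unname(coefs["intercept"] + coefs["age"] * age + coefs["bmi"] * bmi +
    coefs["smokeryes"] * as.numeric(smoker == "yes"))
}

predict_charges(40, 30, "yes")  # 32207.45
predict_charges(40, 30, "no")   # 8383.77
```

With the fitted object, predict(insurance_model, newdata = data.frame(age = 40, bmi = 30, smoker = "yes")) should give nearly the same value, differing only by the rounding of the coefficients.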

Residual Analysis

hist(resid(insurance_model))

par(mfrow=c(2,2))
plot(insurance_model)

par(mfrow=c(1,3))
plot(jitter(insurance$age), resid(insurance_model))
abline(h=0, col="violet")
plot(jitter(insurance$bmi), resid(insurance_model))
abline(h=0, col="violet")
plot(jitter(insurance$charges), resid(insurance_model))
abline(h=0, col="violet")

Linearity: For the quantitative variables age, bmi, and charges, the residuals appear randomly dispersed around zero, with no obvious shapes or patterns.
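One caution about the third panel: charges is the response, and in any least-squares fit with an intercept the residuals are positively correlated with the response by construction, with correlation sqrt(1 - R^2) (about 0.50 for an R-squared of 0.7475). Plotting residuals against the fitted values avoids this. A minimal sketch on simulated data follows; the variables `x` and `y` and the seed are assumptions for illustration, not the insurance data.

```r
set.seed(605)
n <- 500
x <- runif(n, 18, 64)
y <- 1000 + 250 * x + rnorm(n, sd = 6000)
fit <- lm(y ~ x)

r2 <- summary(fit)$r.squared
cor_resid_response <- cor(resid(fit), y)            # equals sqrt(1 - R^2)
cor_resid_fitted   <- cor(resid(fit), fitted(fit))  # zero up to rounding error

c(cor_resid_response, sqrt(1 - r2), cor_resid_fitted)

# Residuals against fitted values, the usual linearity diagnostic
plot(fitted(fit), resid(fit))
abline(h = 0, col = "violet")
```

For the model here, the equivalent call would be plot(fitted(insurance_model), resid(insurance_model)), which is also the first panel produced by plot(insurance_model).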

Nearly normal residuals: The histogram of the residuals looks roughly normal, and in the Q-Q plot the residuals mostly lie along the normal line. The nearly normal residuals condition is somewhat met.
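A numeric check can complement the visual one. The residual summary above (median -980.5, minimum -12415.4, maximum 28971.8) hints at a longer right tail, so it may be worth computing the skewness and running a Shapiro-Wilk test. The sketch below uses a hypothetical helper, `check_normality()`, demonstrated on simulated residuals rather than the insurance data.

```r
# Hypothetical helper: sample skewness plus a Shapiro-Wilk test p-value
check_normality <- function(res) {
  z <- (res - mean(res)) / sd(res)
  list(skewness = mean(z^3),
       shapiro_p = shapiro.test(res)$p.value)
}

set.seed(605)
skewed    <- rexp(500) - 1  # simulated residuals with a long right tail
symmetric <- rnorm(500)     # simulated residuals from a normal distribution

check_normality(skewed)
check_normality(symmetric)
```

Applied to this analysis, the call would be check_normality(resid(insurance_model)); a skewness far from zero or a small p-value would suggest the normality condition is only loosely met.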

Constant variability: Most of the standardized residuals fall between -1 and 1 (the raw residuals are on the scale of charges, with a residual standard error of 6092). The constant variability condition appears to be met.
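A formal check of constant variance is the Breusch-Pagan test. The sketch below implements the studentized (Koenker) version in base R: regress the squared residuals on the model's predictors and compare n times that regression's R-squared to a chi-squared distribution. The helper name `bp_test()` and the simulated data are assumptions for illustration; the lmtest package's `bptest()` offers a maintained implementation.

```r
# Hypothetical helper: studentized Breusch-Pagan test for a fitted lm
bp_test <- function(fit) {
  X <- model.matrix(fit)[, -1, drop = FALSE]   # predictors without the intercept column
  e2 <- resid(fit)^2
  aux <- lm(e2 ~ X)                            # auxiliary regression of squared residuals
  stat <- length(e2) * summary(aux)$r.squared  # LM statistic: n * R^2
  df <- ncol(X)
  c(statistic = stat, df = df,
    p.value = pchisq(stat, df, lower.tail = FALSE))
}

set.seed(605)
x <- runif(500, 1, 10)
y_het <- 2 + 3 * x + rnorm(500, sd = x)  # error spread grows with x
y_hom <- 2 + 3 * x + rnorm(500, sd = 3)  # constant error spread

bp_test(lm(y_het ~ x))
bp_test(lm(y_hom ~ x))
```

For this analysis the call would be bp_test(insurance_model); a small p-value would indicate that the residual spread is not constant.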

Based on the three observations above, the linear model appears to be reliable.
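The question also asks for a quadratic term and a dichotomous vs. quantitative interaction term, which the model above does not include. One way such a model could be specified is sketched below. The choice of age for the quadratic term and bmi:smoker for the interaction is an assumption, not the author's, and the data are simulated with the same column names so the block runs on its own; no results for the real data are implied.

```r
# Hypothetical helper: quadratic age term, smoker dummy, bmi-by-smoker interaction
fit_extended <- function(df) {
  lm(charges ~ age + I(age^2) + bmi + smoker + bmi:smoker, data = df)
}

# Simulated stand-in data with the same column names as `insurance`
set.seed(605)
n <- 300
sim <- data.frame(
  age = sample(18:64, n, replace = TRUE),
  bmi = rnorm(n, mean = 30, sd = 6),
  smoker = sample(c("yes", "no"), n, replace = TRUE, prob = c(0.2, 0.8))
)
sim$charges <- 2000 + 3 * sim$age^2 + 50 * sim$bmi +
  (sim$smoker == "yes") * (1400 * sim$bmi - 20000) + rnorm(n, sd = 3000)

ext_fit <- fit_extended(sim)
names(coef(ext_fit))
```

On the real data this would be summary(fit_extended(insurance)). In such a model the bmi:smokeryes coefficient is the difference in the bmi slope between smokers and non-smokers, and the I(age^2) coefficient captures curvature in the age effect.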