Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.4 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
url <- "https://raw.githubusercontent.com/jconno/R-data/master/insurance.csv"
df <- read_csv(url)
## Rows: 1338 Columns: 7
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (3): sex, smoker, region
## dbl (4): age, bmi, children, charges
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(df)
## # A tibble: 6 x 7
## age sex bmi children smoker region charges
## <dbl> <chr> <dbl> <dbl> <chr> <chr> <dbl>
## 1 19 female 27.9 0 yes southwest 16885.
## 2 18 male 33.8 1 no southeast 1726.
## 3 28 male 33 3 no southeast 4449.
## 4 33 male 22.7 0 no northwest 21984.
## 5 32 male 28.9 0 no northwest 3867.
## 6 31 female 25.7 0 no southeast 3757.
Subsetting data to make predictions for residents in the Southwest region of the United States, replacing “yes/no” with 1 and 0 respectively for smoking column. Removing “region” and “sex”. Removing “smoking” column for second model (dichotomous term)
sw <- df %>% filter(region == "southwest")
sw$smoker <- ifelse(sw$smoker=="yes",1,0)
sw <- sw %>% select(-sex)
sw <- sw %>% select(-region)
head(sw,3)
## # A tibble: 3 x 5
## age bmi children smoker charges
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 19 27.9 0 1 16885.
## 2 23 34.4 0 0 1827.
## 3 19 24.6 1 0 1837.
DF for model 1
sw_nosmoke <- sw %>% select(-smoker)
Predicting charges as a function of age, BMI, number of children, and if they smoke
model1 <- lm(charges ~., data = sw_nosmoke)
summary(model1)
##
## Call:
## lm(formula = charges ~ ., data = sw_nosmoke)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11943 -6410 -4551 -1154 38913
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5930.52 3532.83 -1.679 0.09419 .
## age 183.31 44.95 4.078 5.74e-05 ***
## bmi 354.69 110.49 3.210 0.00146 **
## children 168.55 481.67 0.350 0.72662
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11040 on 321 degrees of freedom
## Multiple R-squared: 0.09667, Adjusted R-squared: 0.08823
## F-statistic: 11.45 on 3 and 321 DF, p-value: 3.75e-07
par(mfrow = c(2, 2))
plot(model1)
# Model 2 - Quadratic term: children - Dichotomous term: smoking (adding on) - Interactive x Dichotomous term: smoking * BMI
sw2 <- sw %>% mutate(chlidren = children^2,
smi_smoker = smoker*bmi)
model2 <- lm(charges ~., data = sw2)
summary(model2)
##
## Call:
## lm(formula = charges ~ ., data = sw2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13047.0 -1417.9 -833.3 -211.2 23639.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3172.384 1454.703 -2.181 0.0299 *
## age 282.392 17.442 16.190 < 2e-16 ***
## bmi -9.493 46.315 -0.205 0.8377
## children -161.777 484.232 -0.334 0.7385
## smoker -24713.569 3571.731 -6.919 2.51e-11 ***
## chlidren 109.193 124.781 0.875 0.3822
## smi_smoker 1616.315 113.911 14.189 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4224 on 318 degrees of freedom
## Multiple R-squared: 0.8689, Adjusted R-squared: 0.8664
## F-statistic: 351.2 on 6 and 318 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(model2)
model3 <- lm(charges ~., data = sw)
summary(model3)
##
## Call:
## lm(formula = charges ~ ., data = sw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13944.8 -2142.8 -704.7 1008.9 24096.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -10616.90 1730.14 -6.136 2.49e-09 ***
## age 269.49 22.10 12.193 < 2e-16 ***
## bmi 255.00 54.01 4.721 3.51e-06 ***
## children 24.73 235.09 0.105 0.916
## smoker 25220.53 786.59 32.063 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5385 on 320 degrees of freedom
## Multiple R-squared: 0.7856, Adjusted R-squared: 0.7829
## F-statistic: 293.1 on 4 and 320 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(model3)
hist(model2$residuals)