Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(dplyr)
url <- "https://raw.githubusercontent.com/jconno/R-data/master/insurance.csv"
df <- read_csv(url)
## Rows: 1338 Columns: 7
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (3): sex, smoker, region
## dbl (4): age, bmi, children, charges
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(df)
## # A tibble: 6 x 7
##     age sex      bmi children smoker region    charges
##   <dbl> <chr>  <dbl>    <dbl> <chr>  <chr>       <dbl>
## 1    19 female  27.9        0 yes    southwest  16885.
## 2    18 male    33.8        1 no     southeast   1726.
## 3    28 male    33          3 no     southeast   4449.
## 4    33 male    22.7        0 no     northwest  21984.
## 5    32 male    28.9        0 no     northwest   3867.
## 6    31 female  25.7        0 no     southeast   3757.

Subsetting data to make predictions for residents in the Southwest region of the United States, replacing “yes/no” with 1 and 0 respectively for smoking column. Removing “region” and “sex”. Removing “smoking” column for second model (dichotomous term)

sw <- df %>% filter(region == "southwest")
sw$smoker <- ifelse(sw$smoker=="yes",1,0)
sw <- sw %>% select(-sex)
sw <- sw %>% select(-region)
head(sw,3)
## # A tibble: 3 x 5
##     age   bmi children smoker charges
##   <dbl> <dbl>    <dbl>  <dbl>   <dbl>
## 1    19  27.9        0      1  16885.
## 2    23  34.4        0      0   1827.
## 3    19  24.6        1      0   1837.

DF for model 1

sw_nosmoke <- sw %>% select(-smoker)

Model 1

Predicting charges as a function of age, BMI, number of children, and if they smoke

model1 <- lm(charges ~., data = sw_nosmoke)
summary(model1)
## 
## Call:
## lm(formula = charges ~ ., data = sw_nosmoke)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11943  -6410  -4551  -1154  38913 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5930.52    3532.83  -1.679  0.09419 .  
## age           183.31      44.95   4.078 5.74e-05 ***
## bmi           354.69     110.49   3.210  0.00146 ** 
## children      168.55     481.67   0.350  0.72662    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11040 on 321 degrees of freedom
## Multiple R-squared:  0.09667,    Adjusted R-squared:  0.08823 
## F-statistic: 11.45 on 3 and 321 DF,  p-value: 3.75e-07
par(mfrow = c(2, 2))
plot(model1)

# Model 2 - Quadratic term: children - Dichotomous term: smoking (adding on) - Interactive x Dichotomous term: smoking * BMI

sw2 <- sw %>% mutate(chlidren = children^2,
                     smi_smoker = smoker*bmi)

model2 <- lm(charges ~., data = sw2)
summary(model2)
## 
## Call:
## lm(formula = charges ~ ., data = sw2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13047.0  -1417.9   -833.3   -211.2  23639.0 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -3172.384   1454.703  -2.181   0.0299 *  
## age            282.392     17.442  16.190  < 2e-16 ***
## bmi             -9.493     46.315  -0.205   0.8377    
## children      -161.777    484.232  -0.334   0.7385    
## smoker      -24713.569   3571.731  -6.919 2.51e-11 ***
## chlidren       109.193    124.781   0.875   0.3822    
## smi_smoker    1616.315    113.911  14.189  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4224 on 318 degrees of freedom
## Multiple R-squared:  0.8689, Adjusted R-squared:  0.8664 
## F-statistic: 351.2 on 6 and 318 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(model2)

Model 3

model3 <- lm(charges ~., data = sw)
summary(model3)
## 
## Call:
## lm(formula = charges ~ ., data = sw)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13944.8  -2142.8   -704.7   1008.9  24096.6 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -10616.90    1730.14  -6.136 2.49e-09 ***
## age            269.49      22.10  12.193  < 2e-16 ***
## bmi            255.00      54.01   4.721 3.51e-06 ***
## children        24.73     235.09   0.105    0.916    
## smoker       25220.53     786.59  32.063  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5385 on 320 degrees of freedom
## Multiple R-squared:  0.7856, Adjusted R-squared:  0.7829 
## F-statistic: 293.1 on 4 and 320 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(model3)

Conclusion

hist(model2$residuals)