Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(dplyr)

url <- "https://raw.githubusercontent.com/jconno/R-data/master/insurance.csv"
df <- read_csv(url)

## Rows: 1338 Columns: 7

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (3): sex, smoker, region
## dbl (4): age, bmi, children, charges

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(df)

## # A tibble: 6 x 7
##     age sex      bmi children smoker region    charges
##   <dbl> <chr>  <dbl>    <dbl> <chr>  <chr>       <dbl>
## 1    19 female  27.9        0 yes    southwest  16885.
## 2    18 male    33.8        1 no     southeast   1726.
## 3    28 male    33          3 no     southeast   4449.
## 4    33 male    22.7        0 no     northwest  21984.
## 5    32 male    28.9        0 no     northwest   3867.
## 6    31 female  25.7        0 no     southeast   3757.

Subsetting data to make predictions for residents in the Southwest region of the United States, replacing “yes/no” with 1 and 0 respectively for smoking column. Removing “region” and “sex”. Removing “smoking” column for second model (dichotomous term)

sw <- df %>% filter(region == "southwest")
sw$smoker <- ifelse(sw$smoker=="yes",1,0)
sw <- sw %>% select(-sex)
sw <- sw %>% select(-region)
head(sw,3)

## # A tibble: 3 x 5
##     age   bmi children smoker charges
##   <dbl> <dbl>    <dbl>  <dbl>   <dbl>
## 1    19  27.9        0      1  16885.
## 2    23  34.4        0      0   1827.
## 3    19  24.6        1      0   1837.

DF for model 1

sw_nosmoke <- sw %>% select(-smoker)

Model 1

Predicting charges as a function of age, BMI, number of children, and if they smoke

model1 <- lm(charges ~., data = sw_nosmoke)
summary(model1)

## 
## Call:
## lm(formula = charges ~ ., data = sw_nosmoke)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11943  -6410  -4551  -1154  38913 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5930.52    3532.83  -1.679  0.09419 .  
## age           183.31      44.95   4.078 5.74e-05 ***
## bmi           354.69     110.49   3.210  0.00146 ** 
## children      168.55     481.67   0.350  0.72662    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11040 on 321 degrees of freedom
## Multiple R-squared:  0.09667,    Adjusted R-squared:  0.08823 
## F-statistic: 11.45 on 3 and 321 DF,  p-value: 3.75e-07

According to the summary statistics above, both the multiple and adjusted \(R^2\) values, this model predicts approximately between 8% and 9% of health charges. The small magnitude of the p-value allows us to reject the null hypothesis.
Looking at the plots below,

par(mfrow = c(2, 2))
plot(model1)

# Model 2 - Quadratic term: children - Dichotomous term: smoking (adding on) - Interactive x Dichotomous term: smoking * BMI

sw2 <- sw %>% mutate(chlidren = children^2,
                     smi_smoker = smoker*bmi)

model2 <- lm(charges ~., data = sw2)
summary(model2)

## 
## Call:
## lm(formula = charges ~ ., data = sw2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13047.0  -1417.9   -833.3   -211.2  23639.0 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -3172.384   1454.703  -2.181   0.0299 *  
## age            282.392     17.442  16.190  < 2e-16 ***
## bmi             -9.493     46.315  -0.205   0.8377    
## children      -161.777    484.232  -0.334   0.7385    
## smoker      -24713.569   3571.731  -6.919 2.51e-11 ***
## chlidren       109.193    124.781   0.875   0.3822    
## smi_smoker    1616.315    113.911  14.189  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4224 on 318 degrees of freedom
## Multiple R-squared:  0.8689, Adjusted R-squared:  0.8664 
## F-statistic: 351.2 on 6 and 318 DF,  p-value: < 2.2e-16

par(mfrow = c(2, 2))
plot(model2)

Model 3

All variables as they are–no quadratic terms, dichotomous terms, nor interactive * dichotomous term
Just for comparison

model3 <- lm(charges ~., data = sw)
summary(model3)

## 
## Call:
## lm(formula = charges ~ ., data = sw)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13944.8  -2142.8   -704.7   1008.9  24096.6 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -10616.90    1730.14  -6.136 2.49e-09 ***
## age            269.49      22.10  12.193  < 2e-16 ***
## bmi            255.00      54.01   4.721 3.51e-06 ***
## children        24.73     235.09   0.105    0.916    
## smoker       25220.53     786.59  32.063  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5385 on 320 degrees of freedom
## Multiple R-squared:  0.7856, Adjusted R-squared:  0.7829 
## F-statistic: 293.1 on 4 and 320 DF,  p-value: < 2.2e-16

As observed, both the adjusted and multiple \(R^2\) values indicate this model predicts about 78% of the charges.

par(mfrow = c(2, 2))
plot(model3)

Conclusion

The model prescribed for this discussion, model 2, is appropriate for predicting the cost of health insurance. Looking at the \(R^2\) values, we can easily see this is the most accurate model out of all 3 that were made, as model 2 accounts for approximately 86% of the charges predicted.
Looking at the residual QQ plot, we can also see that the data does not follow an ideal normal distribution. There is skewness on both ends of the plot. To verify this claim, a histogram will be drawn:

hist(model2$residuals)

Looking at the above histogram, we can indeed verify that this data does not follow an ideal normal distribution. However, model 2 is appropriate for predicting chrages

Predicting costs of healthcare of Southwestern US Residents

Joe Connolly

11/14/2021

Model 1

Model 3

Conclusion