library(tidyverse)
library(psych)
library(caTools)
library(car)
library(olsrr)
library(gvlma)
library(caret)
insurance <- read_csv("insurance.csv")

Question

Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

About the Data

Health Inusurance from Kaggle

sex: insurance contractor gender, female, male

bmi: Body mass index, providing an understanding of body, weights that are relatively high or low relative to height, objective index of body weight (kg / m ^ 2) using the ratio of height to weight, ideally 18.5 to 24.9

children: Number of children covered by health insurance / Number of dependents

smoker: 1/0

region: the beneficiary’s residential area in the US, northeast, southeast, southwest, northwest.

charges: Individual medical costs billed by health insurance

Exploratory Analysis

head(insurance)

## # A tibble: 6 x 7
##     age sex      bmi children smoker region    charges
##   <int> <chr>  <dbl>    <int> <chr>  <chr>       <dbl>
## 1    19 female  27.9        0 yes    southwest  16885.
## 2    18 male    33.8        1 no     southeast   1726.
## 3    28 male    33          3 no     southeast   4449.
## 4    33 male    22.7        0 no     northwest  21984.
## 5    32 male    28.9        0 no     northwest   3867.
## 6    31 female  25.7        0 no     southeast   3757.

describe(insurance)

##          vars    n     mean       sd  median  trimmed     mad     min
## age         1 1338    39.21    14.05   39.00    39.01   17.79   18.00
## sex*        2 1338      NaN       NA      NA      NaN      NA     Inf
## bmi         3 1338    30.66     6.10   30.40    30.50    6.20   15.96
## children    4 1338     1.09     1.21    1.00     0.94    1.48    0.00
## smoker*     5 1338      NaN       NA      NA      NaN      NA     Inf
## region*     6 1338      NaN       NA      NA      NaN      NA     Inf
## charges     7 1338 13270.42 12110.01 9382.03 11076.02 7440.81 1121.87
##               max    range skew kurtosis     se
## age         64.00    46.00 0.06    -1.25   0.38
## sex*         -Inf     -Inf   NA       NA     NA
## bmi         53.13    37.17 0.28    -0.06   0.17
## children     5.00     5.00 0.94     0.19   0.03
## smoker*      -Inf     -Inf   NA       NA     NA
## region*      -Inf     -Inf   NA       NA     NA
## charges  63770.43 62648.55 1.51     1.59 331.07

insurance %>%
  keep(is.numeric) %>%                     
  gather() %>%                             
  ggplot(aes(value)) +                     
    facet_wrap(~ key, scales = "free") +  
    geom_histogram(bins = 10)

Data Prep

The sex, children, smoker, and region variables are all catagorical.

http://www.sthda.com/english/articles/40-regression-analysis/163-regression-with-categorical-variables-dummy-coding-essentials-in-r/

Is a good resource on how to make dummy variables in R. However it seems that R automatically takes these into consideration.

i <- 1
for (x in insurance$sex){
  if (x == 'male'){
    insurance$sex[i] <- 1
  }else if (x == 'female'){
    insurance$sex[i] <- 0
  }
  i <- i + 1
}

insurance$sex <- as.integer(insurance$sex)

i <- 1
for (x in insurance$smoker){
  if (x == 'yes'){
    insurance$smoker[i] <- 1
  }else if (x == 'no'){
    insurance$smoker[i] <- 0
  }
  i <- i + 1
}
insurance$smoker <- as.integer(insurance$smoker)

Model 1: All Variables

All of the variables will be tested to determine the base model they provided. This will allow us to see which variables are significant in our dataset, and allow us to make other models based on that.

model1 <- lm(charges ~., insurance)

summary(model1)

## 
## Call:
## lm(formula = charges ~ ., data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11304.9  -2848.1   -982.1   1393.9  29992.8 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11938.5      987.8 -12.086  < 2e-16 ***
## age                256.9       11.9  21.587  < 2e-16 ***
## sex               -131.3      332.9  -0.394 0.693348    
## bmi                339.2       28.6  11.860  < 2e-16 ***
## children           475.5      137.8   3.451 0.000577 ***
## smoker           23848.5      413.1  57.723  < 2e-16 ***
## regionnorthwest   -353.0      476.3  -0.741 0.458769    
## regionsoutheast  -1035.0      478.7  -2.162 0.030782 *  
## regionsouthwest   -960.0      477.9  -2.009 0.044765 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7494 
## F-statistic: 500.8 on 8 and 1329 DF,  p-value: < 2.2e-16

This model has an \(adj R^2\) of 0.7494 and a p-value of \(2.2e-16\)

In this model, 2 of the variables are above our 0.05 signif. level. The model will be recreated without the highest p-value of those variables, as to see if the adj. \(R^2\) and p-value changes.

model1 <- update(model1, .~. - sex)
summary(model1)

## 
## Call:
## lm(formula = charges ~ age + bmi + children + smoker + region, 
##     data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11367.2  -2835.4   -979.7   1361.9  29935.5 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11990.27     978.76 -12.250  < 2e-16 ***
## age                256.97      11.89  21.610  < 2e-16 ***
## bmi                338.66      28.56  11.858  < 2e-16 ***
## children           474.57     137.74   3.445 0.000588 ***
## smoker           23836.30     411.86  57.875  < 2e-16 ***
## regionnorthwest   -352.18     476.12  -0.740 0.459618    
## regionsoutheast  -1034.36     478.54  -2.162 0.030834 *  
## regionsouthwest   -959.37     477.78  -2.008 0.044846 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6060 on 1330 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7496 
## F-statistic: 572.7 on 7 and 1330 DF,  p-value: < 2.2e-16

This model only has a very slight change in \(adj R^2\).

Residual analysis

ols_plot_resid_qq(model1)

ols_plot_resid_hist(model1)

The residuals do not show normality.

Make Predictions on model

new.df <- data.frame(age=25, sex=1, bmi=23.3, children =0, smoker = 1, region = 'southwest')
new.df2 <- data.frame(age=35, sex=1, bmi=43.3, children =1, smoker = 0, region = 'southwest')

predict(model1, new.df, type="response")

##        1 
## 25201.88

predict(model1, new.df2, type="response")

##        1 
## 11183.18

CUNY MSDS DATA 605 WEEK 12

Nicholas Schettini

November 14, 2018