library(tidyverse)
library(psych)
library(caTools)
library(car)
library(olsrr)
library(gvlma)
library(caret)
insurance <- read_csv("insurance.csv")
Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
Health Inusurance from Kaggle
sex: insurance contractor gender, female, male
bmi: Body mass index, providing an understanding of body, weights that are relatively high or low relative to height, objective index of body weight (kg / m ^ 2) using the ratio of height to weight, ideally 18.5 to 24.9
children: Number of children covered by health insurance / Number of dependents
smoker: 1/0
region: the beneficiary’s residential area in the US, northeast, southeast, southwest, northwest.
charges: Individual medical costs billed by health insurance
head(insurance)
## # A tibble: 6 x 7
## age sex bmi children smoker region charges
## <int> <chr> <dbl> <int> <chr> <chr> <dbl>
## 1 19 female 27.9 0 yes southwest 16885.
## 2 18 male 33.8 1 no southeast 1726.
## 3 28 male 33 3 no southeast 4449.
## 4 33 male 22.7 0 no northwest 21984.
## 5 32 male 28.9 0 no northwest 3867.
## 6 31 female 25.7 0 no southeast 3757.
describe(insurance)
## vars n mean sd median trimmed mad min
## age 1 1338 39.21 14.05 39.00 39.01 17.79 18.00
## sex* 2 1338 NaN NA NA NaN NA Inf
## bmi 3 1338 30.66 6.10 30.40 30.50 6.20 15.96
## children 4 1338 1.09 1.21 1.00 0.94 1.48 0.00
## smoker* 5 1338 NaN NA NA NaN NA Inf
## region* 6 1338 NaN NA NA NaN NA Inf
## charges 7 1338 13270.42 12110.01 9382.03 11076.02 7440.81 1121.87
## max range skew kurtosis se
## age 64.00 46.00 0.06 -1.25 0.38
## sex* -Inf -Inf NA NA NA
## bmi 53.13 37.17 0.28 -0.06 0.17
## children 5.00 5.00 0.94 0.19 0.03
## smoker* -Inf -Inf NA NA NA
## region* -Inf -Inf NA NA NA
## charges 63770.43 62648.55 1.51 1.59 331.07
insurance %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram(bins = 10)
The sex, children, smoker, and region variables are all catagorical.
Is a good resource on how to make dummy variables in R. However it seems that R automatically takes these into consideration.
i <- 1
for (x in insurance$sex){
if (x == 'male'){
insurance$sex[i] <- 1
}else if (x == 'female'){
insurance$sex[i] <- 0
}
i <- i + 1
}
insurance$sex <- as.integer(insurance$sex)
i <- 1
for (x in insurance$smoker){
if (x == 'yes'){
insurance$smoker[i] <- 1
}else if (x == 'no'){
insurance$smoker[i] <- 0
}
i <- i + 1
}
insurance$smoker <- as.integer(insurance$smoker)
All of the variables will be tested to determine the base model they provided. This will allow us to see which variables are significant in our dataset, and allow us to make other models based on that.
model1 <- lm(charges ~., insurance)
summary(model1)
##
## Call:
## lm(formula = charges ~ ., data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11304.9 -2848.1 -982.1 1393.9 29992.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11938.5 987.8 -12.086 < 2e-16 ***
## age 256.9 11.9 21.587 < 2e-16 ***
## sex -131.3 332.9 -0.394 0.693348
## bmi 339.2 28.6 11.860 < 2e-16 ***
## children 475.5 137.8 3.451 0.000577 ***
## smoker 23848.5 413.1 57.723 < 2e-16 ***
## regionnorthwest -353.0 476.3 -0.741 0.458769
## regionsoutheast -1035.0 478.7 -2.162 0.030782 *
## regionsouthwest -960.0 477.9 -2.009 0.044765 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7494
## F-statistic: 500.8 on 8 and 1329 DF, p-value: < 2.2e-16
This model has an \(adj R^2\) of 0.7494 and a p-value of \(2.2e-16\)
In this model, 2 of the variables are above our 0.05 signif. level. The model will be recreated without the highest p-value of those variables, as to see if the adj. \(R^2\) and p-value changes.
model1 <- update(model1, .~. - sex)
summary(model1)
##
## Call:
## lm(formula = charges ~ age + bmi + children + smoker + region,
## data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11367.2 -2835.4 -979.7 1361.9 29935.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11990.27 978.76 -12.250 < 2e-16 ***
## age 256.97 11.89 21.610 < 2e-16 ***
## bmi 338.66 28.56 11.858 < 2e-16 ***
## children 474.57 137.74 3.445 0.000588 ***
## smoker 23836.30 411.86 57.875 < 2e-16 ***
## regionnorthwest -352.18 476.12 -0.740 0.459618
## regionsoutheast -1034.36 478.54 -2.162 0.030834 *
## regionsouthwest -959.37 477.78 -2.008 0.044846 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6060 on 1330 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7496
## F-statistic: 572.7 on 7 and 1330 DF, p-value: < 2.2e-16
This model only has a very slight change in \(adj R^2\).
ols_plot_resid_qq(model1)
ols_plot_resid_hist(model1)
The residuals do not show normality.
new.df <- data.frame(age=25, sex=1, bmi=23.3, children =0, smoker = 1, region = 'southwest')
new.df2 <- data.frame(age=35, sex=1, bmi=43.3, children =1, smoker = 0, region = 'southwest')
predict(model1, new.df, type="response")
## 1
## 25201.88
predict(model1, new.df2, type="response")
## 1
## 11183.18