Part 2: Data Analysis

The dataset was collected from the Machine Learning with R by Brett Lantz and is available on Kaggle. It contains data on Medical Insurance costs.

Here is the data provided for each listing:

Instructions on reading the data

To read the data, save the file in your working directory and read the data using the function shown below.

# Read in the data
insurance = read.csv("insurance.csv", head = TRUE, sep = ",")

# Show the first few rows of data
head(insurance, 3)
##   age    sex   bmi children smoker    region   charges
## 1  19 female 27.90        0    yes southwest 16884.924
## 2  18   male 33.77        1     no southeast  1725.552
## 3  28   male 33.00        3     no southeast  4449.462

Question 1

5 pts. Create scatterplots of the response, charges, against three quantitative predictors age, bmi, and children. Describe the general trend (direction and form) of each plot.

par(mfrow = c(2,2))
plot(insurance$age, insurance$charges)
abline(lm(charges ~ age, data = insurance), col = "red")

plot(insurance$bmi, insurance$charges)
abline(lm(charges ~ age, data = insurance), col = "red")

plot(insurance$children, insurance$charges)
abline(lm(charges ~ age, data = insurance), col = "red")

Response to Question 1: Age has an upward trend, a weak positive linear relationship with the DV charges. As age increases, charges also increase. Insurance appears to have a weak upward trend, positive relationship with charges.
children appears to have a downward trend, A very weak positive linear relationship. There appears to a lot of variance in the each plot.

Question 2

5 pts. What is the value of the correlation coefficient for each of the above pair of response and predictor variables? What does it tell you about your comments in Question 1?

cor(insurance[-c(2,5,6)])
##                age       bmi   children    charges
## age      1.0000000 0.1092719 0.04246900 0.29900819
## bmi      0.1092719 1.0000000 0.01275890 0.19834097
## children 0.0424690 0.0127589 1.00000000 0.06799823
## charges  0.2990082 0.1983410 0.06799823 1.00000000
cor(insurance$age, insurance$charges)
## [1] 0.2990082
cor(insurance$bmi, insurance$charges)
## [1] 0.198341
cor(insurance$children, insurance$charges)
## [1] 0.06799823

Response to Question 2: The predictors have very weak correlations with the DV. Correlation coefficients: Age = 0.299, BMI = 0.198, and children with the weakest correlation at Children = 0.068. The Coefficients are all positive and weak like the scatter plots suggests.

Question 3

5 pts. Use age as the predictor to build a simple linear regression model (called model1) for predicting charges. What is the estimated coefficient of age in this model?

model1 <- lm(charges ~ age, data = insurance)
summary(model1)
## 
## Call:
## lm(formula = charges ~ age, data = insurance)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -8059  -6671  -5939   5440  47829 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3165.9      937.1   3.378 0.000751 ***
## age            257.7       22.5  11.453  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11560 on 1336 degrees of freedom
## Multiple R-squared:  0.08941,    Adjusted R-squared:  0.08872 
## F-statistic: 131.2 on 1 and 1336 DF,  p-value: < 2.2e-16
coef(model1)["age"]
##      age 
## 257.7226

Response to Question 3: The estimated coefficient of age is 257.7.

Question 4

6 pts. Assess whether all of the model assumptions hold, comment on whether there are any apparent departures from the assumptions of the linear regression model. Make sure you state each of the the model assumptions and corresponding assessments. Each graph may be used to assess one or more model assumptions.

#Assumptions
par(mfrow = c(2,2))
#Linearity - IV vs residuals  
plot(insurance$age, model1$residuals)
abline(h=0, col = "red")
#Variance - fitted vs residuals
plot(model1$fitted.values, model1$residuals)
#Normality - residuals 
hist(model1$residuals)
qqnorm(model1$residuals)
qqline(model1$residuals, col = "red")

Response to Question 4: The assumptions do not hold. For the normality assumption, the distribution is not normal distributed. The QQ plot does not follow the regression line closely. For linearity, as shown in the previous scatter plots DV vs IV and in the plot of residuals vs fitted, it does not appear to have a linear relationship. The data is scattered throughout. It holds up for the constant variance assumption, since the plot of the residuals vs fitted shows constant variance.

Question 5

5 pts. To improve fit, we can use a Box-Cox transformation. Find the optimal lambda value rounded to the nearest half integer. Report this best lambda and the corresponding transformation.

library(MASS)
bc <- boxcox(model1)

bc
## $x
##   [1] -2.00000000 -1.95959596 -1.91919192 -1.87878788 -1.83838384 -1.79797980
##   [7] -1.75757576 -1.71717172 -1.67676768 -1.63636364 -1.59595960 -1.55555556
##  [13] -1.51515152 -1.47474747 -1.43434343 -1.39393939 -1.35353535 -1.31313131
##  [19] -1.27272727 -1.23232323 -1.19191919 -1.15151515 -1.11111111 -1.07070707
##  [25] -1.03030303 -0.98989899 -0.94949495 -0.90909091 -0.86868687 -0.82828283
##  [31] -0.78787879 -0.74747475 -0.70707071 -0.66666667 -0.62626263 -0.58585859
##  [37] -0.54545455 -0.50505051 -0.46464646 -0.42424242 -0.38383838 -0.34343434
##  [43] -0.30303030 -0.26262626 -0.22222222 -0.18181818 -0.14141414 -0.10101010
##  [49] -0.06060606 -0.02020202  0.02020202  0.06060606  0.10101010  0.14141414
##  [55]  0.18181818  0.22222222  0.26262626  0.30303030  0.34343434  0.38383838
##  [61]  0.42424242  0.46464646  0.50505051  0.54545455  0.58585859  0.62626263
##  [67]  0.66666667  0.70707071  0.74747475  0.78787879  0.82828283  0.86868687
##  [73]  0.90909091  0.94949495  0.98989899  1.03030303  1.07070707  1.11111111
##  [79]  1.15151515  1.19191919  1.23232323  1.27272727  1.31313131  1.35353535
##  [85]  1.39393939  1.43434343  1.47474747  1.51515152  1.55555556  1.59595960
##  [91]  1.63636364  1.67676768  1.71717172  1.75757576  1.79797980  1.83838384
##  [97]  1.87878788  1.91919192  1.95959596  2.00000000
## 
## $y
##   [1] -6640.193 -6564.071 -6488.640 -6413.929 -6339.965 -6266.775 -6194.386
##   [8] -6122.830 -6052.140 -5982.349 -5913.493 -5845.612 -5778.743 -5712.930
##  [15] -5648.217 -5584.649 -5522.275 -5461.146 -5401.313 -5342.831 -5285.757
##  [22] -5230.148 -5176.063 -5123.564 -5072.712 -5023.570 -4976.200 -4930.664
##  [29] -4887.024 -4845.340 -4805.672 -4768.075 -4732.603 -4699.305 -4668.228
##  [36] -4639.411 -4612.891 -4588.697 -4566.855 -4547.380 -4530.284 -4515.572
##  [43] -4503.241 -4493.285 -4485.686 -4480.424 -4477.474 -4476.803 -4478.377
##  [50] -4482.153 -4488.090 -4496.140 -4506.254 -4518.381 -4532.469 -4548.465
##  [57] -4566.314 -4585.964 -4607.361 -4630.451 -4655.182 -4681.504 -4709.367
##  [64] -4738.720 -4769.518 -4801.715 -4835.266 -4870.128 -4906.260 -4943.623
##  [71] -4982.178 -5021.889 -5062.720 -5104.638 -5147.610 -5191.605 -5236.593
##  [78] -5282.545 -5329.434 -5377.234 -5425.918 -5475.464 -5525.846 -5577.043
##  [85] -5629.033 -5681.796 -5735.311 -5789.559 -5844.522 -5900.182 -5956.523
##  [92] -6013.526 -6071.178 -6129.462 -6188.365 -6247.871 -6307.967 -6368.641
##  [99] -6429.879 -6491.669
optimal <- round(bc$x[which.max(bc$y)], 2)

Question 6

6 pts. Use this optimal lambda value to transform the response variable. Build a new simple linear regression model, named model2, with the transformed response and the predictor age. Similar to Question 4, check all of the model assumptions. Is model2 a better fit?

model2 <- lm(log(charges) ~ age, data= insurance)
summary(model2)
## 
## Call:
## lm(formula = log(charges) ~ age, data = insurance)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3433 -0.4166 -0.3094  0.5000  2.1999 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.744247   0.063336  122.27   <2e-16 ***
## age         0.034545   0.001521   22.71   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7813 on 1336 degrees of freedom
## Multiple R-squared:  0.2786, Adjusted R-squared:  0.2781 
## F-statistic:   516 on 1 and 1336 DF,  p-value: < 2.2e-16
#Assumptions:
par(mfrow= c(2,2))

#IV vs residuals - Linearity & Variance
plot(insurance$age, model2$residuals)
abline(h=0, col = "red")

#Fitted vs residuals - Independence 
plot(model2$fitted.values, model2$residuals)

#QQ & Histogram - Normality
hist(model2$residuals)
qqnorm(model2$residuals)
qqline(model2$residuals, col = "red")

Response to Question 6: The residual vs age plot still shows a dense cluster of residuals below zero, so the mean assumption zero assumption /linearity does not hold. In the fitted vs residuals, the residuals are less clustered but the variance or spread seems to decrease, so constant variance assumption still does not hold. For normality, Using the qq-plot and histogram, it still shows skewdness, its much better than before transformation, however it still does not hold the normality assumption.

Question 7

6 pts. Fit a multiple linear regression model (named model3), using price as the response variable and all the predicting variables: age, sex, bmi, children, smoker, and region. Which estimated coefficients (including intercept) are statistically significant at the 99% confidence level?

model3 <- lm(charges ~ ., data = insurance)
summary(model3)
## 
## Call:
## lm(formula = charges ~ ., data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11304.9  -2848.1   -982.1   1393.9  29992.8 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11938.5      987.8 -12.086  < 2e-16 ***
## age                256.9       11.9  21.587  < 2e-16 ***
## sexmale           -131.3      332.9  -0.394 0.693348    
## bmi                339.2       28.6  11.860  < 2e-16 ***
## children           475.5      137.8   3.451 0.000577 ***
## smokeryes        23848.5      413.1  57.723  < 2e-16 ***
## regionnorthwest   -353.0      476.3  -0.741 0.458769    
## regionsoutheast  -1035.0      478.7  -2.162 0.030782 *  
## regionsouthwest   -960.0      477.9  -2.009 0.044765 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7494 
## F-statistic: 500.8 on 8 and 1329 DF,  p-value: < 2.2e-16

Response to Question 7: Intercept, age, bmi, children, and smoker are significant predictors at significance 99%. Their p-values are less than the alpha 0.01, so we can reject the null and confirm these predictors are statistically significant at significance level 99%.

Question 8

6 pts. What is the estimated coefficient for region = “southeast” in this MLR model? What is the interpretation for the estimated coefficient for region = “southeast”?

#Find baseline
insurance$region <- as.factor(insurance$region)
levels(insurance$region)
## [1] "northeast" "northwest" "southeast" "southwest"
#extract coefficient
summary(model3)$coef[8]
## [1] -1035.022
cat("coeff:",summary(model3)$coef[8])
## coeff: -1035.022

Response to Question 8: The estimated coefficient for southeast region is -1035.0. This means insurance charges Southeast region -1035 lower than northeast region, holding all predictors constant.

Question 9

6 pts. Detect and count the number of outliers using Cook’s distance and a threshold of 1.

library(car) 
## Loading required package: carData
cd <- cooks.distance(model3)
#Identify outliers
plot(cd, type = "h")
abline(h=1, col ="red")

outliers <- as.numeric(names(cd)[cd > 1])
length(outliers)
## [1] 0

Response to Question 9: There are no outliers in the data set using cooks distance and a threshold of 1.

Question 10

6 pts. Fit a multiple linear regression model (named reduced_model) with charges as the response and all other predictors except for sex. Compare the coefficient of determinations for this reduced_model and the full model from Question 7 (model3). Use the partial F-test to compare model3 and reduced_model using alpha = 0.05.

reduced_model <- lm(charges ~ .-sex, data = insurance)
summary(reduced_model)
## 
## Call:
## lm(formula = charges ~ . - sex, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11367.2  -2835.4   -979.7   1361.9  29935.5 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11990.27     978.76 -12.250  < 2e-16 ***
## age                256.97      11.89  21.610  < 2e-16 ***
## bmi                338.66      28.56  11.858  < 2e-16 ***
## children           474.57     137.74   3.445 0.000588 ***
## smokeryes        23836.30     411.86  57.875  < 2e-16 ***
## regionnorthwest   -352.18     476.12  -0.740 0.459618    
## regionsoutheast  -1034.36     478.54  -2.162 0.030834 *  
## regionsouthwest   -959.37     477.78  -2.008 0.044846 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6060 on 1330 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7496 
## F-statistic: 572.7 on 7 and 1330 DF,  p-value: < 2.2e-16
#Partial F-test: Perform ANOVA to compare the full and reduced models
anovamodel <- anova(reduced_model, model3)
anovamodel
## Analysis of Variance Table
## 
## Model 1: charges ~ (age + sex + bmi + children + smoker + region) - sex
## Model 2: charges ~ age + sex + bmi + children + smoker + region
##   Res.Df        RSS Df Sum of Sq      F Pr(>F)
## 1   1330 4.8845e+10                           
## 2   1329 4.8840e+10  1   5716429 0.1556 0.6933
#r^2 
summary(reduced_model)$r.squared
## [1] 0.7508839
summary(model3)$r.squared
## [1] 0.750913
#Compare adj r2
cat("reduced model adj R^2:", summary(reduced_model)$adj.r.squared)
## reduced model adj R^2: 0.7495727
cat("model3 adj R^2:", summary(model3)$adj.r.squared)
## model3 adj R^2: 0.7494136
#F-stat
anovamodel$F[2]
## [1] 0.155553

Response to Question 10: The F-statistic = 0.1556 which is very small, and the p-value for the F-statistic is 0.6933 which is greater than out alpha level. We can accept the null, the additional predictors do not significantly improve the explanatory power for charges. Both of the models had very close adjusted r squared values, but the reduced model had a slightly higher adjusted r squared value of 0.7495727. It explains more of the variability in charges.

Question 11

4 pts. Using model3, make a prediction for charges with the following factors: age = 20, sex = “female”, bmi = 30, children = 2, smoker = “no”, and region = “southeast”.

What is your predicted charges and the corresponding 95% prediction interval?

insurance$sex <- as.factor(insurance$sex)
insurance$smoker <- as.factor(insurance$smoker)
insurance$regions <- as.factor(insurance$region)
levels(insurance$sex)
## [1] "female" "male"
levels(insurance$smoker)
## [1] "no"  "yes"
levels(insurance$region)
## [1] "northeast" "northwest" "southeast" "southwest"
new_data <- data.frame(age = 20,
                       sex = factor("female", levels = c("female", "male")),
                       bmi = 30,
                       children = 2, 
                       smoker = factor("no", levels = c("no", "yes")),
                       region = factor("southeast", levels = c("northeast","northwest","southeast","southwest"))
)
head(new_data, 3)
##   age    sex bmi children smoker    region
## 1  20 female  30        2     no southeast
prediction <- predict(model3, new_data, interval = "prediction", level = 0.95)

#prediction 
prediction
##        fit       lwr      upr
## 1 3290.371 -8636.926 15217.67
#interval 
cat("95% interval: [", prediction[2], ", ", prediction[3],"]")
## 95% interval: [ -8636.926 ,  15217.67 ]

Response to Question 11:

The End.