Please review the student expectations for peer review grading and peer review comments. Overall, we ask that you score with accuracy. When grading your peers, you will not only learn how to improve your future homework submissions but you will also gain deeper understanding of the concepts in the assignments. When assigning scores, consider the responses to the questions given your understanding of the problem and using the solutions as a guide. Moreover, please give partial credit for a concerted effort, but also be thorough. Add comments to your review, particularly when deducting points, to explain why the student missed the points. Ensure your comments are specific to questions and the student responses in the assignment.
You have been contracted as a healthcare consulting company to understand the factors on which the pricing of health insurance depends.
The data consists of a data frame with 1338 observations on the following 7 variables:
To read the data in R, save the file in your working
directory (make sure you have changed the directory if different from
the R working directory) and read the data using the R
function read.csv()
insurance = read.csv("insurance.csv", head = TRUE)
age = insurance$age
sex = insurance$sex
bmi = insurance$bmi
children = insurance$children
smoker = insurance$smoker
region = insurance$region
price = insurance$price
plot(age,price,ylab="Price",xlab="Age",
main="Price vs Age")
abline(lm(price ~ age, data = insurance), col = "blue")
plot(bmi,price,ylab="Price",xlab="BMI",
main="Price vs BMI")
abline(lm(price ~ bmi, data = insurance), col = "blue")
plot(children,price,ylab="Price",xlab="Children",
main="Price vs Children")
abline(lm(price ~ children, data = insurance), col = "blue")
Answer: The General Trend for all of these scatter plots is positive and linear relationship, meaning a increase in children, BMI, and Age all correspond to a increase in Price. I would argue that Price vs Age has the most distinct linear trend. BMI and Children seem to have a lot of noise in the data. Within the context of Price vs Age there has to be a interaction between age and another variable that would cause the linear groupings in the Price vs Age plot.
# Age Correlation Coefficient
cor(age, price)
## [1] 0.2990082
# BMI Correlation Coefficient
cor(bmi, price)
## [1] 0.198341
# Children Correlation Coefficient
cor(children, price)
## [1] 0.06799823
Answer: The correlation coefficient between age and price is 0.2990082, which is a lightly moderate linearly related term to Price, the second strongest linearly related term to price is BMI with a correlation coefficient of 0.198341, and the least linearly related term to price is children, with a correlation coefficient of 0.06799823 essentially showing that there is no relationship between Children and Price. With that being said this data reinforces our observations made about the scatter plot in part a, w.r.t. age and bmi however it sheds more light on the fact that children may have almost no relationship between price and children. None the less it aligns with the positive relationship established above.
Hint: Use the given code to convert the qualitative predictors to factors.
#make categorical variables into factors
sex = insurance$sex<-as.factor(insurance$sex) #makes female the baseline level
smoker = insurance$smoker<-as.factor(insurance$smoker) #makes no the baseline level
region = insurance$region<-as.factor(insurance$region) #makes northeast the baseline level
boxplot(price~sex, main= "Price vs Sex", xlab = "Sex", ylab = "Price")
boxplot(price~smoker, main= "Price vs Smoker", xlab = "Smoker", ylab = "Price")
boxplot(price~region, main= "Price vs Region", xlab = "Region", ylab = "Price")
Answer: The Box plots elude to the fact that Price vs Sex and Price vs Region or that the Sex and Region qualitative predictors have little to no affect on price in any significant way. However the smoker qualitative variable suggest that there is a significant difference in price between smokers and non smokers.So we would expect to see a relationship between Smoker and Price but not w.r.t. Sex and Region. We would need to perform more analysis to confirm this observation however.
Answer: Yes, based on the analysis above I would suggest running a multiple linear regression with all the predictors to start with because there appears to be some relationships that are more obvious than others but we would need to perform further analysis like fitting a multiple linear regression to conclude for certain that some predictors don’t have significant influence. Then we could refit the model and exclude those parameters that we may deem insignificant.
Build a multiple linear regression model, named model1, using the response, price, and all 6 predictors, and then answer the questions that follow:
model1 = lm(price ~ age + bmi + children + region + sex + smoker , data = insurance)
cat("R^2:",summary(model1)$r.squared)
## R^2: 0.750913
Answer: The coefficient of determination for this model is 0.7509. The R squared value summarizes how well the predictors of this model can be used to predict the response “Price”. The value of ~0.75 is a moderately good fit for the model, and that ~75% of the variation in the response variable price is explained by all 6 predictors.
summary(model1)
##
## Call:
## lm(formula = price ~ age + bmi + children + region + sex + smoker,
## data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11304.9 -2848.1 -982.1 1393.9 29992.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11938.5 987.8 -12.086 < 2e-16 ***
## age 256.9 11.9 21.587 < 2e-16 ***
## bmi 339.2 28.6 11.860 < 2e-16 ***
## children 475.5 137.8 3.451 0.000577 ***
## regionnorthwest -353.0 476.3 -0.741 0.458769
## regionsoutheast -1035.0 478.7 -2.162 0.030782 *
## regionsouthwest -960.0 477.9 -2.009 0.044765 *
## sexmale -131.3 332.9 -0.394 0.693348
## smokeryes 23848.5 413.1 57.723 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7494
## F-statistic: 500.8 on 8 and 1329 DF, p-value: < 2.2e-16
Answer:
\(F\)-statistic = 500.8
\(p\)-value = 2.2e-16
Ha = At least one of the slope coefficients is nonzero. Since alpha of 0.05 exceeds the significance level (p = 2.2e-16), we would reject the null hypothesis. The overall model seems to be statistically useful in predicting the response variable price.
regionanova = aov(price ~ region)
summary(regionanova)
## Df Sum Sq Mean Sq F value Pr(>F)
## region 3 1.301e+09 433586560 2.97 0.0309 *
## Residuals 1334 1.948e+11 146007093
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model.tables(regionanova, type="means")
## Tables of means
## Grand mean
##
## 13270.42
##
## region
## northeast northwest southeast southwest
## 13406 12418 14735 12347
## rep 324 325 364 325
Answer: The p-value of the f test is 0.309 which is less than the alpha level of 0.05, therefore we would reject the null hypothesis that all the means of the insurance policies of different regions are equal. Which means that at least one of the means of the region are not equal to at least one of the other regions at the level of 0.05.
model2 = lm(price ~ age + bmi + children + sex + smoker , data = insurance)
#summary(model2)
anova(model2,model1)
## Analysis of Variance Table
##
## Model 1: price ~ age + bmi + children + sex + smoker
## Model 2: price ~ age + bmi + children + region + sex + smoker
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1332 4.9073e+10
## 2 1329 4.8840e+10 3 233431209 2.1173 0.09622 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Answer: The partial F-test p-value is 0.09622 which is greater than the alpha value of 0.05 so we cannot reject the null hypotheses that the regression coefficients for regions are 0 given all other predictors in model 1 at the level of 0.05
Answer: Part a and part b provide different results. we concluded in part a based on only model1 that the means of the insurance policy for a given region are different. In part B we concluded that the regression coefficients for region are quite possibly zero given this analysis at the level of 0.05.
The two types of analysis determine two things, part a concludes the association of region to the response variable price without considering other factors, where as b we tested whether or not region adds explanatory power in adition to the other variables age, bmi, sex, children and smoker.
Note: Please use model1 for all of the following questions.
model1$coefficients[8]
## sexmale
## -131.3144
Answer: The estimated coefficient of sexmale is -131.3144. The interpretation of this value means that the price of male insurance policies are $131.31 cheaper than policies for females when all other predictors are held constant.
model1$coefficients[3]
## bmi
## 339.1935
Answer: The estimated coefficient of bmi is 339.1935. The interpretation of this value means that a 0.01 increase in BMI would increase the price of a insurance policy by 0.01 x 339.1935 = $3.391935 when all other predictors are held constant.
confint(model1,level = 0.95)
## 2.5 % 97.5 %
## (Intercept) -13876.3934 -10000.68373
## age 233.5138 280.19893
## bmi 283.0884 395.29848
## children 205.1633 745.83780
## regionnorthwest -1287.2982 581.37040
## regionsoutheast -1974.0968 -95.94733
## regionsouthwest -1897.6364 -22.46560
## sexmale -784.4703 521.84155
## smokeryes 23038.0307 24659.03838
summary(model1)$coefficients['age','t value']
## [1] 21.58666
confint(model1,level = 0.90)
## 5 % 95 %
## (Intercept) -13564.4899 -10312.5872
## age 237.2708 276.4419
## bmi 292.1187 386.2682
## children 248.6749 702.3262
## regionnorthwest -1136.9143 430.9865
## regionsoutheast -1822.9499 -247.0942
## regionsouthwest -1746.7292 -173.3728
## sexmale -679.3429 416.7142
## smokeryes 23168.4837 24528.5854
summary(model1)$coefficients['age','t value']
## [1] 21.58666
Answer: The 95% CI for age for model1 is (233.5138, 280.19893)
The 90% CI for age for model1 is (237.2708, 276.4419)
The level of confidence has a direct relation to the width of the range of the CI, this being said the 95% confidence interval is wider than that of the 90% interval. Hence to have a higher confidence level the range has to expand slightly to ensure that the value will lie within that range at that confidence.
Neither of the intervals contain zero which means that age is statistically significant at both levels.
insurance <- insurance[1,1:6]
predict(model1, insurance, interval = "confidence")
## fit lwr upr
## 1 25293.71 24143.98 26443.44
Answer: The estimated average price of all insurance policies with the same characteristics as the first data point in the sample is $25293.71. The CI with a confidence level of 95% is the following, the lower bound is 24143.98 and the upper bound is 26443.44. We are 95% confident that the mean price of the insurance policies will fall between this interval above.
insurance[1] <- 50
predict(model1, insurance, interval = "prediction")
## fit lwr upr
## 1 33256.26 21313.29 45199.23
Answer: The predicted price of the insurance policy would be 33,256.26 dollars. The 95% prediction interval is the following (21313.29, 45199.23) with 21313.29 dollars as the lower bound and 45199.23 dollars as the upper bound. We are 95% confident that the price of a insurance policy with this particular set of characteristics is between the Prediction interval.