Peer Grader Guidance

Please review the student expectations for peer review grading and peer review comments. Overall, we ask that you score with accuracy. When grading your peers, you will not only learn how to improve your future homework submissions but you will also gain deeper understanding of the concepts in the assignments. When assigning scores, consider the responses to the questions given your understanding of the problem and using the solutions as a guide. Moreover, please give partial credit for a concerted effort, but also be thorough. Add comments to your review, particularly when deducting points, to explain why the student missed the points. Ensure your comments are specific to questions and the student responses in the assignment.

Background

You have been contracted as a healthcare consulting company to understand the factors on which the pricing of health insurance depends.

Data Description

The data consists of a data frame with 1338 observations on the following 7 variables:

price: Response variable ($)
age: Quantitative variable
sex: Qualitative variable
bmi: Quantitative variable
children: Quantitative variable
smoker: Qualitative variable
region: Qualitative variable

Instructions on reading the data

To read the data in R, save the file in your working directory (make sure you have changed the directory if different from the R working directory) and read the data using the R function read.csv()

insurance = read.csv("insurance.csv", head = TRUE)

age = insurance$age
sex = insurance$sex
bmi = insurance$bmi
children = insurance$children
smoker = insurance$smoker
region = insurance$region
price = insurance$price

Question 1: Exploratory Data Analysis [15 points]

4 pts Create scatterplots of the response, price, against three quantitative predictors age, bmi, and children. Describe the general trend (direction and form) of each plot. It should be 3 separate scatter plots.

plot(age,price,ylab="Price",xlab="Age",
     main="Price vs Age")
abline(lm(price ~ age, data = insurance), col = "blue")

plot(bmi,price,ylab="Price",xlab="BMI",
     main="Price vs BMI")
abline(lm(price ~ bmi, data = insurance), col = "blue")

plot(children,price,ylab="Price",xlab="Children",
     main="Price vs Children")
abline(lm(price ~ children, data = insurance), col = "blue")

Answer: The General Trend for all of these scatter plots is positive and linear relationship, meaning a increase in children, BMI, and Age all correspond to a increase in Price. I would argue that Price vs Age has the most distinct linear trend. BMI and Children seem to have a lot of noise in the data. Within the context of Price vs Age there has to be a interaction between age and another variable that would cause the linear groupings in the Price vs Age plot.

4 pts What is the value of the correlation coefficient for each of the above pair of response and predictor variables? What does it tell you about your comments in part (a)?

# Age Correlation Coefficient
cor(age, price)

## [1] 0.2990082

# BMI Correlation Coefficient
cor(bmi, price)

## [1] 0.198341

# Children Correlation Coefficient
cor(children, price)

## [1] 0.06799823

Answer: The correlation coefficient between age and price is 0.2990082, which is a lightly moderate linearly related term to Price, the second strongest linearly related term to price is BMI with a correlation coefficient of 0.198341, and the least linearly related term to price is children, with a correlation coefficient of 0.06799823 essentially showing that there is no relationship between Children and Price. With that being said this data reinforces our observations made about the scatter plot in part a, w.r.t. age and bmi however it sheds more light on the fact that children may have almost no relationship between price and children. None the less it aligns with the positive relationship established above.

4 pts Create box plots of the response, price, and the three qualitative predictors sex, smoker, and region. Based on these box plots, does there appear to be a relationship between these qualitative predictors and the response?

Hint: Use the given code to convert the qualitative predictors to factors.

#make categorical variables into factors
sex = insurance$sex<-as.factor(insurance$sex) #makes female the baseline level
smoker = insurance$smoker<-as.factor(insurance$smoker) #makes no the baseline level
region = insurance$region<-as.factor(insurance$region) #makes northeast the baseline level

boxplot(price~sex, main= "Price vs Sex", xlab = "Sex", ylab = "Price")

boxplot(price~smoker, main= "Price vs Smoker", xlab = "Smoker", ylab = "Price")

boxplot(price~region, main= "Price vs Region", xlab = "Region", ylab = "Price")

Answer: The Box plots elude to the fact that Price vs Sex and Price vs Region or that the Sex and Region qualitative predictors have little to no affect on price in any significant way. However the smoker qualitative variable suggest that there is a significant difference in price between smokers and non smokers.So we would expect to see a relationship between Smoker and Price but not w.r.t. Sex and Region. We would need to perform more analysis to confirm this observation however.

3 pts Based on the analysis above, does it make sense to run a multiple linear regression with all of the predictors?

Answer: Yes, based on the analysis above I would suggest running a multiple linear regression with all the predictors to start with because there appears to be some relationships that are more obvious than others but we would need to perform further analysis like fitting a multiple linear regression to conclude for certain that some predictors don’t have significant influence. Then we could refit the model and exclude those parameters that we may deem insignificant.

Question 2: Fitting the Multiple Linear Regression Model [12 points]

Build a multiple linear regression model, named model1, using the response, price, and all 6 predictors, and then answer the questions that follow:

model1 = lm(price ~ age + bmi + children + region + sex + smoker , data = insurance)

6 pts Report the coefficient of determination (R-squared) for the model and give a concise interpretation of this value.

cat("R^2:",summary(model1)$r.squared)

## R^2: 0.750913

Answer: The coefficient of determination for this model is 0.7509. The R squared value summarizes how well the predictors of this model can be used to predict the response “Price”. The value of ~0.75 is a moderately good fit for the model, and that ~75% of the variation in the response variable price is explained by all 6 predictors.

6 pts Is the model of any use in predicting price? Using $\alpha = 0.05$, provide the following elements of the test of overall regression of the model: null hypothesis $H_0$, alternative hypothesis $H_a$, $F$-statistic or $p$-value, and conclusion.

summary(model1)

## 
## Call:
## lm(formula = price ~ age + bmi + children + region + sex + smoker, 
##     data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11304.9  -2848.1   -982.1   1393.9  29992.8 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11938.5      987.8 -12.086  < 2e-16 ***
## age                256.9       11.9  21.587  < 2e-16 ***
## bmi                339.2       28.6  11.860  < 2e-16 ***
## children           475.5      137.8   3.451 0.000577 ***
## regionnorthwest   -353.0      476.3  -0.741 0.458769    
## regionsoutheast  -1035.0      478.7  -2.162 0.030782 *  
## regionsouthwest   -960.0      477.9  -2.009 0.044765 *  
## sexmale           -131.3      332.9  -0.394 0.693348    
## smokeryes        23848.5      413.1  57.723  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7494 
## F-statistic: 500.8 on 8 and 1329 DF,  p-value: < 2.2e-16

Answer:

$F$-statistic = 500.8

$p$-value = 2.2e-16

Ha = At least one of the slope coefficients is nonzero. Since alpha of 0.05 exceeds the significance level (p = 2.2e-16), we would reject the null hypothesis. The overall model seems to be statistically useful in predicting the response variable price.

Question 3: Model Comparison [14 points]

5 pts Assuming a marginal relationship between region and price, perform an ANOVA $F$-test on the mean insurance prices among the different regions. Using an $\alpha$-level of 0.05, can we reject the null hypothesis that the means of the regions are equal? Please interpret.

regionanova = aov(price ~ region)
summary(regionanova)

##               Df    Sum Sq   Mean Sq F value Pr(>F)  
## region         3 1.301e+09 433586560    2.97 0.0309 *
## Residuals   1334 1.948e+11 146007093                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

model.tables(regionanova, type="means")

## Tables of means
## Grand mean
##          
## 13270.42 
## 
##  region 
##     northeast northwest southeast southwest
##         13406     12418     14735     12347
## rep       324       325       364       325

Answer: The p-value of the f test is 0.309 which is less than the alpha level of 0.05, therefore we would reject the null hypothesis that all the means of the insurance policies of different regions are equal. Which means that at least one of the means of the region are not equal to at least one of the other regions at the level of 0.05.

5 pts Now, build a second multiple linear regression model, called model2, using price as the response variable, and all variables except region as the predictors. Conduct a partial $F$-test comparing model2 with model1. What is the partial-F test p-value? Can we reject the null hypothesis that the regression coefficients for region variables are zero at an $\alpha$-level of 0.05?

model2 = lm(price ~ age + bmi + children + sex + smoker , data = insurance)
#summary(model2)
anova(model2,model1)

## Analysis of Variance Table
## 
## Model 1: price ~ age + bmi + children + sex + smoker
## Model 2: price ~ age + bmi + children + region + sex + smoker
##   Res.Df        RSS Df Sum of Sq      F  Pr(>F)  
## 1   1332 4.9073e+10                              
## 2   1329 4.8840e+10  3 233431209 2.1173 0.09622 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Answer: The partial F-test p-value is 0.09622 which is greater than the alpha value of 0.05 so we cannot reject the null hypotheses that the regression coefficients for regions are 0 given all other predictors in model 1 at the level of 0.05

4 pts What can you conclude from 3a and 3b? Do they provide the exact same results?

Answer: Part a and part b provide different results. we concluded in part a based on only model1 that the means of the insurance policy for a given region are different. In part B we concluded that the regression coefficients for region are quite possibly zero given this analysis at the level of 0.05.

The two types of analysis determine two things, part a concludes the association of region to the response variable price without considering other factors, where as b we tested whether or not region adds explanatory power in adition to the other variables age, bmi, sex, children and smoker.

Note: Please use model1 for all of the following questions.

Question 4: Coefficient Interpretation [7 points]

3 pts Interpret the estimated coefficient of sexmale in the context of the problem. Make sure female is the baseline level for sex. Mention any assumptions you make about other predictors clearly when stating the interpretation.

model1$coefficients[8]

##   sexmale 
## -131.3144

Answer: The estimated coefficient of sexmale is -131.3144. The interpretation of this value means that the price of male insurance policies are $131.31 cheaper than policies for females when all other predictors are held constant.

4 pts If the value of the bmi in model1 is increased by 0.01 and the other predictors are kept constant, what change in the response would be expected?

model1$coefficients[3]

##      bmi 
## 339.1935

Answer: The estimated coefficient of bmi is 339.1935. The interpretation of this value means that a 0.01 increase in BMI would increase the price of a insurance policy by 0.01 x 339.1935 = $3.391935 when all other predictors are held constant.

Question 5: Confidence and Prediction Intervals [12 points]

6 pts Compute 90% and 95% confidence intervals (CIs) for the parameter associated with age for model1. What observations can you make about the width of these intervals?

confint(model1,level = 0.95)

##                       2.5 %       97.5 %
## (Intercept)     -13876.3934 -10000.68373
## age                233.5138    280.19893
## bmi                283.0884    395.29848
## children           205.1633    745.83780
## regionnorthwest  -1287.2982    581.37040
## regionsoutheast  -1974.0968    -95.94733
## regionsouthwest  -1897.6364    -22.46560
## sexmale           -784.4703    521.84155
## smokeryes        23038.0307  24659.03838

summary(model1)$coefficients['age','t value']

## [1] 21.58666

confint(model1,level = 0.90)

##                         5 %        95 %
## (Intercept)     -13564.4899 -10312.5872
## age                237.2708    276.4419
## bmi                292.1187    386.2682
## children           248.6749    702.3262
## regionnorthwest  -1136.9143    430.9865
## regionsoutheast  -1822.9499   -247.0942
## regionsouthwest  -1746.7292   -173.3728
## sexmale           -679.3429    416.7142
## smokeryes        23168.4837  24528.5854

summary(model1)$coefficients['age','t value']

## [1] 21.58666

Answer: The 95% CI for age for model1 is (233.5138, 280.19893)

The 90% CI for age for model1 is (237.2708, 276.4419)

The level of confidence has a direct relation to the width of the range of the CI, this being said the 95% confidence interval is wider than that of the 90% interval. Hence to have a higher confidence level the range has to expand slightly to ensure that the value will lie within that range at that confidence.

Neither of the intervals contain zero which means that age is statistically significant at both levels.

3 pts Using model1, estimate the average price for all insurance policies with the same characteristics as the first data point in the sample. What is the 95% confidence interval? Provide an interpretation of your results.

insurance <- insurance[1,1:6]
predict(model1, insurance, interval = "confidence")

##        fit      lwr      upr
## 1 25293.71 24143.98 26443.44

Answer: The estimated average price of all insurance policies with the same characteristics as the first data point in the sample is $25293.71. The CI with a confidence level of 95% is the following, the lower bound is 24143.98 and the upper bound is 26443.44. We are 95% confident that the mean price of the insurance policies will fall between this interval above.

3 pts Suppose that the age value for the first data point is increased to 50, while all other values are kept fixed. Using model1, predict the price of an insurance policy with these characteristics. What is the 95% prediction interval? Provide an interpretation of your results.

insurance[1] <- 50
predict(model1, insurance, interval = "prediction")

##        fit      lwr      upr
## 1 33256.26 21313.29 45199.23

Answer: The predicted price of the insurance policy would be 33,256.26 dollars. The 95% prediction interval is the following (21313.29, 45199.23) with 21313.29 dollars as the lower bound and 45199.23 dollars as the upper bound. We are 95% confident that the price of a insurance policy with this particular set of characteristics is between the Prediction interval.