The structure of the data

## 'data.frame':    1338 obs. of  7 variables:
##  $ age     : int  19 18 28 33 32 31 46 37 37 60 ...
##  $ sex     : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
##  $ bmi     : num  27.9 33.8 33 22.7 28.9 ...
##  $ children: int  0 1 3 0 0 0 1 3 2 0 ...
##  $ smoker  : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
##  $ region  : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
##  $ charges : num  16885 1726 4449 21984 3867 ...

Question 1

Conduct an exploratory analysis of the data (you shouldn’t build a model here). You should bear in mind the research question of interest and make it accessible to a non-statistician.

Exploratory Data Analysis

The First 10 observations

##    age    sex    bmi children smoker    region   charges
## 1   19 female 27.900        0    yes southwest 16884.924
## 2   18   male 33.770        1     no southeast  1725.552
## 3   28   male 33.000        3     no southeast  4449.462
## 4   33   male 22.705        0     no northwest 21984.471
## 5   32   male 28.880        0     no northwest  3866.855
## 6   31 female 25.740        0     no southeast  3756.622
## 7   46 female 33.440        1     no southeast  8240.590
## 8   37 female 27.740        3     no northwest  7281.506
## 9   37   male 29.830        2     no northeast  6406.411
## 10  60 female 25.840        0     no northwest 28923.137

Check for missing values

## [1]  0 NA  0  0 NA NA  0

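The check returns 0 for every numeric variable (age, bmi, children and charges), so none of them has missing values. It returns NA for the categorical variables (sex, smoker and region); this NA comes from applying a numeric summary to a factor and does not by itself indicate missing data.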
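A minimal sketch of a missing-value count that behaves the same way for numeric and factor columns. The data frame `toy` is a hypothetical stand-in for the insurance data, with two missing entries added on purpose so that the counts are visible; it is not the check used to produce the output above.

```r
# Hypothetical toy data frame standing in for the insurance data
toy <- data.frame(
  age     = c(19L, 18L, NA, 33L),
  sex     = factor(c("female", "male", "male", NA)),
  charges = c(16884.924, 1725.552, 4449.462, 21984.471)
)

# is.na() returns a logical matrix for numeric and factor columns alike,
# so colSums() gives the number of missing entries in each column
na_counts <- colSums(is.na(toy))
print(na_counts)
```

Applied to the real data, a vector of zeros would confirm that no variable, numeric or categorical, has missing values.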

Statistics for numerical variables

##          vars    n     mean       sd  median  trimmed     mad     min      max
## age         1 1338    39.21    14.05   39.00    39.01   17.79   18.00    64.00
## bmi         2 1338    30.66     6.10   30.40    30.50    6.20   15.96    53.13
## children    3 1338     1.09     1.21    1.00     0.94    1.48    0.00     5.00
## charges     4 1338 13270.42 12110.01 9382.03 11076.02 7440.81 1121.87 63770.43
##             range skew kurtosis     se
## age         46.00 0.06    -1.25   0.38
## bmi         37.17 0.28    -0.06   0.17
## children     5.00 0.94     0.19   0.03
## charges  62648.55 1.51     1.59 331.07

Every numeric variable has n = 1338, which matches the number of observations, so the output confirms that there are no missing values in these variables.

Exploratory Plots

Check correlation in numeric data

## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

In the correlation plot, every numeric variable (age, bmi and children) shows a significant relationship with charges.

From the plot, males and females appear to have similar median charges. Charges for males appear to be more variable than for females. There are no obvious outliers in either group.

All regions appear to have similar central values. The southeast region appears to have larger variability than the other regions.

The centre of the charges for smokers is higher than for non-smokers. Charges for smokers also appear to be more variable than for non-smokers.

There are more policyholders from the southeast than from any other region.

There are more non-smokers than smokers among the policy holders.

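Charges between 10000 and 15000 are the most frequent. Charges appear to be right-skewed (positively skewed), which agrees with the summary statistics: the skew is 1.51 and the mean (13270.42) is well above the median (9382.03).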

Fit a linear regression model with charges as the outcome, and age , bmi , children as numeric covariates and sex , smoker and region as categorical covariates. The reference categories for sex, smoker and region will be female , non-smoker and regionnortheast, respectively, as long as you don’t change the data prior to building the model. Do not include interactions.

Question 2

Explain what the following regression coefficients in Model 1 tell us about how particular covariates are related to charges (i.e. you should interpret the estimates in context), and state the units that they are measured in.

## 
## Call:
## lm(formula = charges ~ age + bmi + children + relevel(sex, ref = "female") + 
##     relevel(smoker, ref = "no") + relevel(region, ref = "northeast"), 
##     data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11304.9  -2848.1   -982.1   1393.9  29992.8 
## 
## Coefficients:
##                                             Estimate Std. Error t value
## (Intercept)                                 -11938.5      987.8 -12.086
## age                                            256.9       11.9  21.587
## bmi                                            339.2       28.6  11.860
## children                                       475.5      137.8   3.451
## relevel(sex, ref = "female")male              -131.3      332.9  -0.394
## relevel(smoker, ref = "no")yes               23848.5      413.1  57.723
## relevel(region, ref = "northeast")northwest   -353.0      476.3  -0.741
## relevel(region, ref = "northeast")southeast  -1035.0      478.7  -2.162
## relevel(region, ref = "northeast")southwest   -960.0      477.9  -2.009
##                                             Pr(>|t|)    
## (Intercept)                                  < 2e-16 ***
## age                                          < 2e-16 ***
## bmi                                          < 2e-16 ***
## children                                    0.000577 ***
## relevel(sex, ref = "female")male            0.693348    
## relevel(smoker, ref = "no")yes               < 2e-16 ***
## relevel(region, ref = "northeast")northwest 0.458769    
## relevel(region, ref = "northeast")southeast 0.030782 *  
## relevel(region, ref = "northeast")southwest 0.044765 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7494 
## F-statistic: 500.8 on 8 and 1329 DF,  p-value: < 2.2e-16

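Intercept: -11938.5 USD is the expected charge for a female non-smoker in the northeast when age, bmi and children are all 0. Since an age and bmi of 0 are not realistic, this value has no practical meaning on its own.
children: each additional child is associated with an increase of 475.5 USD in expected charges when all other covariates are held fixed.
regionnorthwest: expected charges for a policyholder in the northwest are 353.0 USD lower than for an otherwise identical policyholder in the northeast (the reference region).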
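The interpretations above can be checked with a little arithmetic. The sketch below copies the Model 1 estimates from the summary output and evaluates them for a hypothetical policyholder; the profile (age 40, bmi 30, two children) is an illustrative assumption, not a record from the data.

```r
# Model 1 estimates copied from the summary output above
b <- c(intercept = -11938.5, age = 256.9, bmi = 339.2, children = 475.5,
       male = -131.3, smoker = 23848.5,
       northwest = -353.0, southeast = -1035.0, southwest = -960.0)

# Hypothetical policyholder: 40-year-old female non-smoker, bmi 30,
# two children, living in the northeast (all dummy variables are 0)
base <- b[["intercept"]] + b[["age"]] * 40 + b[["bmi"]] * 30 +
  b[["children"]] * 2

one_more_child <- base + b[["children"]]   # same person with three children
in_northwest   <- base + b[["northwest"]]  # same person living in the northwest

print(c(base = base, one_more_child = one_more_child,
        in_northwest = in_northwest))
```

The difference between `one_more_child` and `base` is exactly the children coefficient, and the difference between `in_northwest` and `base` is exactly the regionnorthwest coefficient.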

Question 3

The category regionnorthwest in Model 1 has a large p-value associated with it. Your colleague advises you to remove this particular covariate from the model (without removing the data for this region from the model). Write a short paragraph explaining to your colleague what the implications of doing this would be in terms of the interpretation of the intercept and remaining dummy variables for region (you should also give the new interpretation in context), but do not remove this category.

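Removing the regionnorthwest dummy while keeping the northwest observations merges northwest with northeast into a single reference category. The intercept is then the expected charge for a female non-smoker living in either the northeast or the northwest when all numeric covariates are 0, and the southeast and southwest coefficients measure the difference in expected charges relative to this combined northern group rather than relative to the northeast alone.

From the output, there is no notable difference in the performance of the model compared with the previous model. The model explains 75.09% of the variability in the data.

The p-values of the coefficient estimates for the other regions have increased, which indicates that they are less significant in the model.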
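A minimal sketch of what dropping one region dummy amounts to in R, using a hypothetical toy factor in place of `insurance$region`. The merged level name `north` is my own label, chosen only for illustration.

```r
# Hypothetical toy factor standing in for insurance$region
region <- factor(c("northeast", "northwest", "southeast",
                   "southwest", "northwest"))

# Giving two levels the same name merges them into one level,
# which is what removing the northwest dummy does implicitly
region_merged <- region
levels(region_merged)[levels(region_merged) %in%
                        c("northeast", "northwest")] <- "north"
region_merged <- relevel(region_merged, ref = "north")

# Only southeast and southwest dummies remain; "north" is the reference
dummy_cols <- colnames(model.matrix(~ region_merged))
print(dummy_cols)
```

With this coding, the intercept absorbs both northern regions, so every remaining region coefficient is a comparison against the combined group.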

Question 4

Construct a linear model that addresses this suggestion, retaining all other covariates in the model. This model will be referred to as Model 2. Briefly justify your approach, explaining why it addresses your colleague’s suggestion. Interpret, in context, any additional covariates that are now in your model but weren’t before.

## 
## Call:
## lm(formula = charges ~ age + bmi + children + relevel(sex, ref = "female") + 
##     relevel(smoker, ref = "no") + relevel(region, ref = "northeast") + 
##     bmi * relevel(smoker, ref = "no"), data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14580.7  -1857.2  -1360.8   -475.7  30552.4 
## 
## Coefficients:
##                                               Estimate Std. Error t value
## (Intercept)                                  -2223.454    865.611  -2.569
## age                                            263.620      9.516  27.703
## bmi                                             23.533     25.601   0.919
## children                                       516.403    110.179   4.687
## relevel(sex, ref = "female")male              -500.146    266.518  -1.877
## relevel(smoker, ref = "no")yes              -20415.611   1648.277 -12.386
## relevel(region, ref = "northeast")northwest   -585.478    380.859  -1.537
## relevel(region, ref = "northeast")southeast  -1210.131    382.750  -3.162
## relevel(region, ref = "northeast")southwest  -1231.108    382.218  -3.221
## bmi:relevel(smoker, ref = "no")yes            1443.096     52.647  27.411
##                                             Pr(>|t|)    
## (Intercept)                                  0.01032 *  
## age                                          < 2e-16 ***
## bmi                                          0.35814    
## children                                    3.06e-06 ***
## relevel(sex, ref = "female")male             0.06079 .  
## relevel(smoker, ref = "no")yes               < 2e-16 ***
## relevel(region, ref = "northeast")northwest  0.12447    
## relevel(region, ref = "northeast")southeast  0.00160 ** 
## relevel(region, ref = "northeast")southwest  0.00131 ** 
## bmi:relevel(smoker, ref = "no")yes           < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4846 on 1328 degrees of freedom
## Multiple R-squared:  0.8409, Adjusted R-squared:  0.8398 
## F-statistic:   780 on 9 and 1328 DF,  p-value: < 2.2e-16

An interaction between bmi and smoker is added to the model, with non-smokers as the reference category.

The interaction term **bmi:smokeryes** has a very small p-value (< 2e-16). This shows that the interaction is statistically significant and should not be ignored.

The interaction coefficient is the extra change in expected charges per unit of bmi for smokers compared with non-smokers. For non-smokers, a one-unit increase in bmi is associated with an increase of 23.533 USD in expected charges; for smokers the increase is 23.533 + 1443.096 = 1466.629 USD, when all other covariates are held fixed.
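The effect of the interaction can be worked out directly from the Model 2 estimates. The sketch below copies three coefficients from the output above; the bmi value of 30 used in the last step is an illustrative assumption.

```r
# Model 2 estimates copied from the summary output above
b_bmi         <- 23.533      # bmi slope for non-smokers (reference group)
b_smoker      <- -20415.611  # smoker effect at bmi = 0
b_interaction <- 1443.096    # extra bmi slope for smokers

slope_nonsmoker <- b_bmi
slope_smoker    <- b_bmi + b_interaction

# Difference in expected charges between a smoker and a non-smoker
# with the same bmi and the same values of all other covariates
smoker_gap <- function(bmi) b_smoker + b_interaction * bmi

print(c(slope_nonsmoker = slope_nonsmoker, slope_smoker = slope_smoker))
print(smoker_gap(30))
```

This also explains the negative smoker coefficient in Model 2: it is the smoker effect at a bmi of 0, which lies far outside the observed range (15.96 to 53.13).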

Question 5

Compare the output from Model 2 with the output from Model 1. In particular comment on the following.

Compare the regression coefficient of bmi in Model 2 and Model 1, and explain any differences.

The regression coefficient of bmi in Model 2 (23.533) is much lower than in Model 1 (339.2), and its p-value is much higher (0.358 compared with < 2e-16), so it is no longer statistically significant. The difference arises because Model 2 contains the interaction between bmi and smoker: the bmi coefficient in Model 2 is the effect of bmi for non-smokers only, whereas in Model 1 it is a single effect of bmi shared by smokers and non-smokers.

Would you remove the variable bmi from either model? Why?

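In Model 1, bmi cannot be removed because it is statistically significant. In Model 2, the bmi main effect is not statistically significant, but it should still be kept because bmi is part of the significant bmi and smoker interaction; removing the main effect while keeping the interaction would change what the interaction term means.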

What happens to the variable sex in Model 2, in comparison to Model 1, and whether this is a concern.

The regression coefficient of sex is more negative in Model 2 (-500.146) than in Model 1 (-131.3). However, the p-values in both models (0.0608 and 0.693) are above the 0.05 significance level, so sex is not statistically significant in either model. This predictor can be ignored in both models.

Question 6

Create suitable plots of the response, charges , against each of the covariates in Model 2 in order to assess whether the assumption of linearity is reasonable.

Charges and Age

Along the fitted line, the spread of the residuals appears to remain roughly constant. Therefore the linearity assumption is reasonable here.

The red line indicates a positive association between charges and age: charges tend to increase as age increases.

Charges and bmi

As bmi increases, the residuals appear to grow and the points move further from the fitted line. This goes against the linearity assumption.

The red line indicates a positive association between charges and body mass index: charges tend to increase as bmi increases.

Charges and children

As the number of children increases, the residuals appear to shrink and the points lie closer to the fitted line. The linearity assumption is reasonable here.

There is a small positive correlation between the number of children and the amount of charges. The red line shows that an increase in the number of children is associated with a small increase in charges.

Using the plots you generated, comment on whether you prefer Model 1 or Model 2, justifying your answer.

From the scatter plots and fitted lines above, the residuals appear to stay constant or shrink along the fitted line, except for body mass index. This suggests that the linearity assumption is reasonable for age and children but questionable for bmi.

The plots show a relationship between the response variable and each of the predictor variables. Therefore Model 2 is preferable.

Using the output for Model 1 and Model 2, comment on whether you prefer Model 1 or Model 2, justifying your answer. [No need to conduct any statistical procedures here.]

The R-squared for Model 2 is 84.09%, while the R-squared for Model 1 is 75.09%. The adjusted R-squared, which accounts for the extra parameter, is also higher for Model 2 (0.8398 compared with 0.7494), so Model 2 is preferred to Model 1.

The residual standard error for Model 2 is 4846, while that of Model 1 is 6062. This also favours Model 2, as its residual standard error is lower.

State, with a reason, whether Model 1 or Model 2 is your overall preferred model for answering the insurance company’s question.[No need to conduct any procedures here; I’m looking for some sensible comments based on the plots and the output only.]

Model 2 is the preferred model. The plots show a relationship between the response variable and each covariate, and the p-values in the output support this. Model 2 also includes an interaction effect that is significant for the response, charges. Its residual standard error and R-squared are much better than those of Model 1.

Question 7

The variable children is currently included in Model 2 as a numeric variable, though it could be included as a categorical variable instead. Build another version of Model 2 but this time with children as a categorical covariate instead of numeric. This will be known as Model 3. You might find it helpful to use the as.factor function in R to do this.

Fit model 3

## 
## Call:
## lm(formula = charges ~ age + bmi + as.factor(children) + relevel(sex, 
##     ref = "female") + relevel(smoker, ref = "no") + relevel(region, 
##     ref = "northeast") + bmi * relevel(smoker, ref = "no"), data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14340.8  -1918.4  -1259.7   -405.8  30534.7 
## 
## Coefficients:
##                                               Estimate Std. Error t value
## (Intercept)                                  -2186.977    869.649  -2.515
## age                                            264.266      9.522  27.754
## bmi                                             21.715     25.585   0.849
## as.factor(children)1                           323.962    336.619   0.962
## as.factor(children)2                          1529.192    372.836   4.102
## as.factor(children)3                          1009.551    437.870   2.306
## as.factor(children)4                          3432.837    990.107   3.467
## as.factor(children)5                          1845.093   1163.492   1.586
## relevel(sex, ref = "female")male              -499.733    266.241  -1.877
## relevel(smoker, ref = "no")yes              -20398.813   1646.569 -12.389
## relevel(region, ref = "northeast")northwest   -598.643    380.799  -1.572
## relevel(region, ref = "northeast")southeast  -1206.353    382.829  -3.151
## relevel(region, ref = "northeast")southwest  -1228.919    382.122  -3.216
## bmi:relevel(smoker, ref = "no")yes            1442.648     52.605  27.424
##                                             Pr(>|t|)    
## (Intercept)                                 0.012028 *  
## age                                          < 2e-16 ***
## bmi                                         0.396170    
## as.factor(children)1                        0.336025    
## as.factor(children)2                        4.35e-05 ***
## as.factor(children)3                        0.021287 *  
## as.factor(children)4                        0.000543 ***
## as.factor(children)5                        0.113018    
## relevel(sex, ref = "female")male            0.060739 .  
## relevel(smoker, ref = "no")yes               < 2e-16 ***
## relevel(region, ref = "northeast")northwest 0.116173    
## relevel(region, ref = "northeast")southeast 0.001663 ** 
## relevel(region, ref = "northeast")southwest 0.001331 ** 
## bmi:relevel(smoker, ref = "no")yes           < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4840 on 1324 degrees of freedom
## Multiple R-squared:  0.8418, Adjusted R-squared:  0.8402 
## F-statistic: 541.9 on 13 and 1324 DF,  p-value: < 2.2e-16

Comment on how the output for Model 3 compares to that for Model 2.

The coefficients of the covariates shared by both models are very similar. The residual standard error of Model 3 is 4840, slightly lower than 4846 in Model 2. The R-squared of Model 3 is 84.18%, slightly higher than 84.09% for Model 2, and the adjusted R-squared barely changes (0.8402 compared with 0.8398).

How does the interpretation of the variable children change between Models 2 and 3?

In Model 2, each additional child is associated with an increase of 516.403 USD in expected charges, whatever the current number of children. In Model 3, the reference category is 0 children, and each other category has its own coefficient giving the difference in expected charges relative to having no children, with all other covariates held fixed:

1 child: expected charges are 323.962 USD higher than for 0 children.
2 children: expected charges are 1529.192 USD higher than for 0 children.
3 children: expected charges are 1009.551 USD higher than for 0 children.
4 children: expected charges are 3432.837 USD higher than for 0 children.
5 children: expected charges are 1845.093 USD higher than for 0 children.
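To make the change in interpretation concrete, the sketch below sets the effect implied by the single Model 2 slope next to the category-specific Model 3 estimates. All numbers are copied from the two outputs above; the column names of `comparison` are my own.

```r
# Model 2: one slope applied to the number of children
numeric_slope <- 516.403

# Model 3: one estimate per category, relative to 0 children
factor_effects <- c(0, 323.962, 1529.192, 1009.551, 3432.837, 1845.093)

comparison <- data.frame(
  children = 0:5,
  model2   = numeric_slope * 0:5,  # effect implied by the linear term
  model3   = factor_effects        # effect estimated for each category
)
comparison$difference <- comparison$model3 - comparison$model2

print(comparison)
```

Model 2 forces the effect to grow by the same amount for every extra child, while the Model 3 estimates are free to rise and fall between categories.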

Name one advantage and one disadvantage of including a variable like children as a categorical covariate rather than numeric variable (you should think about this in general, not necessarily for this particular example).

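Advantage: it increases the number of parameters, giving each category its own coefficient instead of forcing a single linear effect. This can reduce bias in the coefficient estimates when the true relationship is not linear.

Disadvantage: it may lead to wrong interpretation of the coefficients, because the ordering of the categories is ignored and each coefficient is estimated only from the observations in its own category, which makes some estimates imprecise (for example, the standard errors for 4 and 5 children in Model 3 are 990.107 and 1163.492).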

Would you consider including age and/ or bmi as categorical variables? Explain your answer. Note that even if you state ‘yes’ here, you don’t need to implement this model.

Yes. Including age and bmi as categorical variables (for example, as age bands or bmi classes) makes it possible to estimate the effect of each group on the response variable, rather than a single overall effect of age or bmi.

Question 8

For Models 1, 2 and 3, predict the charges for each of the observations in your model (you might want to look up R’s fitted function that does this for you). This will give you three sets of predictions for charges, one set for each model. For each model, compare the following quantities explaining any similarity or difference between them, using basic mathematics if necessary.

The total (i.e. sum of) observed charges over all observations, \(\sum_i y_i\), and
The total of the predicted charges, \(\sum_i \hat{y}_i\) (i.e. where \(\hat{y}_i\) is the predicted charge for the \(i\)th observation using the model under consideration).

Total observed charges

## [1] 17755825

Model 1

## [1] 17755825

Sum of observed charges is equal to sum of predicted charges from model 1.

Model 2

## [1] 17755825

Sum of observed charges is equal to sum of predicted charges from model 2.

Model 3

## [1] 17755825

Sum of observed charges is equal to sum of predicted charges from model 3.
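The equality holds for all three models because each one is fitted by least squares with an intercept. Each observed value is its fitted value plus a residual, and least squares with an intercept forces the residuals to sum to zero, so the sum of the observed values equals the sum of the fitted values. The sketch below illustrates this on simulated data; the variables `x`, `g` and `y` and their coefficients are hypothetical and are not the insurance data.

```r
set.seed(1)

# Simulated data (hypothetical): one numeric and one categorical covariate
n <- 50
x <- runif(n, 18, 64)
g <- factor(sample(c("no", "yes"), n, replace = TRUE))
y <- 1000 + 250 * x + 20000 * (g == "yes") + rnorm(n, sd = 5000)

# Least-squares fit with an intercept (the default in lm)
fit <- lm(y ~ x + g)

total_observed <- sum(y)
total_fitted   <- sum(fitted(fit))
residual_sum   <- sum(residuals(fit))  # zero up to rounding error

print(c(observed = total_observed, fitted = total_fitted))
print(residual_sum)
```

The same identity would not be guaranteed for a model fitted without an intercept.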
