We run the model comparison:
apgar = read.csv("apgar.csv", header = TRUE)
apgar
## CASE APGAR GENDER SMOKES WGTGAIN GESTAT PRENAT ANNINC
## 1 43 6 0 0 22 37 0 32
## 2 51 5 0 0 50 35 0 17
## 3 56 4 0 0 60 36 0 21
## 4 53 4 1 0 60 37 0 32
## 5 40 6 0 0 35 41 1 23
## 6 59 2 0 0 50 38 1 12
## 7 17 8 1 0 18 42 1 37
## 8 27 7 1 0 27 36 1 90
## 9 47 5 1 0 24 40 1 97
## 10 4 10 0 0 30 42 2 150
## 11 10 9 0 0 25 42 2 12
## 12 11 9 0 0 17 39 2 87
## 13 12 9 0 0 35 38 2 41
## 14 20 8 0 0 19 36 2 30
## 15 22 8 0 0 28 40 2 21
## 16 30 7 0 0 32 41 2 34
## 17 31 7 0 0 27 42 2 45
## 18 41 6 0 0 32 35 2 11
## 19 2 10 1 0 20 39 2 100
## 20 8 9 1 0 21 40 2 39
## 21 15 8 1 0 35 40 2 46
## 22 16 8 1 0 21 41 2 67
## 23 18 8 1 0 17 38 2 10
## 24 25 7 1 0 40 39 2 34
## 25 37 6 1 0 19 38 2 19
## 26 46 5 1 0 28 41 2 21
## 27 57 3 1 0 75 39 2 56
## 28 5 10 0 0 22 38 3 29
## 29 13 9 0 0 23 38 3 160
## 30 21 8 0 0 25 37 3 34
## 31 23 8 0 0 30 41 3 76
## 32 32 7 0 0 19 38 3 130
## 33 33 7 0 0 28 39 3 29
## 34 42 6 0 0 28 38 3 34
## 35 50 5 0 0 45 38 3 160
## 36 1 10 1 0 25 40 3 50
## 37 3 10 1 0 18 41 3 90
## 38 7 9 1 0 23 36 3 180
## 39 9 9 1 0 32 42 3 90
## 40 14 8 1 0 22 39 3 99
## 41 26 7 1 0 24 37 3 54
## 42 28 7 1 0 19 35 3 65
## 43 36 6 1 0 34 40 3 65
## 44 38 6 1 0 24 39 3 45
## 45 34 6 1 1 18 35 0 23
## 46 44 5 1 1 37 38 0 21
## 47 19 8 0 1 25 37 1 15
## 48 29 7 0 1 32 37 1 42
## 49 49 5 0 1 15 28 1 32
## 50 55 4 0 1 25 39 1 31
## 51 58 3 0 1 13 22 1 98
## 52 6 9 1 1 18 37 1 60
## 53 35 6 1 1 26 37 1 99
## 54 48 5 0 1 10 20 2 34
## 55 24 7 1 1 14 42 2 21
## 56 39 6 0 1 19 40 3 65
## 57 54 4 0 1 12 25 3 97
## 58 45 5 1 1 35 42 3 34
## 59 52 4 1 1 8 20 3 56
## 60 60 1 1 1 10 20 3 45
summary(apgar)
## CASE APGAR GENDER SMOKES
## Min. : 1.00 Min. : 1.000 Min. :0.0 Min. :0.0000
## 1st Qu.:15.75 1st Qu.: 5.000 1st Qu.:0.0 1st Qu.:0.0000
## Median :30.50 Median : 7.000 Median :0.5 Median :0.0000
## Mean :30.50 Mean : 6.683 Mean :0.5 Mean :0.2667
## 3rd Qu.:45.25 3rd Qu.: 8.000 3rd Qu.:1.0 3rd Qu.:1.0000
## Max. :60.00 Max. :10.000 Max. :1.0 Max. :1.0000
## WGTGAIN GESTAT PRENAT ANNINC
## Min. : 8.00 Min. :20.00 Min. :0.000 Min. : 10.00
## 1st Qu.:19.00 1st Qu.:37.00 1st Qu.:1.000 1st Qu.: 29.00
## Median :25.00 Median :38.00 Median :2.000 Median : 41.50
## Mean :27.08 Mean :37.12 Mean :1.967 Mean : 55.78
## 3rd Qu.:32.00 3rd Qu.:40.00 3rd Qu.:3.000 3rd Qu.: 78.75
## Max. :75.00 Max. :42.00 Max. :3.000 Max. :180.00
GESTAT_c = apgar$GESTAT - mean(apgar$GESTAT)
model.c1 = lm(apgar$APGAR ~ 1)
mcSummary(model.c1)
## lm(formula = apgar$APGAR ~ 1)
##
## Omnibus ANOVA
## SS df MS EtaSq F p
## Model 0.000 0 Inf 0
## Error 260.983 59 4.423
## Corr Total 260.983 59 4.423
##
## RMSE AdjEtaSq
## 2.103 0
##
## Coefficients
## Est StErr t SSR(3) EtaSq tol CI_2.5 CI_97.5 p
## (Intercept) 6.683 0.272 24.614 2680.017 0.911 NA 6.14 7.227 0
model.a1 = lm(apgar$APGAR ~ 1 + GESTAT_c)
mcSummary(model.a1)
## lm(formula = apgar$APGAR ~ 1 + GESTAT_c)
##
## Omnibus ANOVA
## SS df MS EtaSq F p
## Model 72.044 1 72.044 0.276 22.116 0
## Error 188.939 58 3.258
## Corr Total 260.983 59 4.423
##
## RMSE AdjEtaSq
## 1.805 0.264
##
## Coefficients
## Est StErr t SSR(3) EtaSq tol CI_2.5 CI_97.5 p
## (Intercept) 6.683 0.233 28.683 2680.017 0.934 NA 6.217 7.150 0
## GESTAT_c 0.205 0.044 4.703 72.044 0.276 NA 0.118 0.292 0
From the output of \(mcSummary\) above we can define two confidence intervals for the mean apgar score, one derived from the unconditional model and the other from the model conditional on GESTAT scores: \[ Unconditional: M_{apgar} \in [6.14, 7.227]\\ Conditional: M_{apgar} \in [6.217, 7.150] \] These two confidences intervals differ because the mean-squared error in the conditional model has been reduced compared to that of the unconditional model. The formula for the confidence interval of the intercept makes this clear: \[ CI_{.95} = b_{0} \pm \sqrt{\frac{F*MSE}{n}} \] That is, when the MSE of a given model is reduced, the confidence interval shrinks accordingly.
We run the model comparison below:
GESTAT_40 = apgar$GESTAT - 40 #center gestat scores on 40 weeks
apgar$int8 = 8
model.c2 = lm(apgar$APGAR ~ 0 + GESTAT_40, offset = int8, data = apgar)
model.a2 = lm(apgar$APGAR ~ 1 + GESTAT_40)
modelCompare(ModelC = model.c2, ModelA = model.a2)
## SSE (Compact) = 213.3758
## SSE (Augmented) = 188.939
## Delta R-Squared = -0.6626379
## Partial Eta-Squared (PRE) = 0.114525
## F(1,58) = 7.501568, p = 0.008175292
From the \(modelCompare\) output we observe that with an \(F(1, 58) = 7.501\) and corresponding \(p = .008\), we can reject the null hypothesis that \(\beta_0 = 8\). That is, the best predicted apgar score for a baby experiencing 40 weeks of gestation is very unlikely to be 8.
A brief “news summary” of the results and conclusions for this test is given immediately below: I set out to determine whether or not the expected apgar score for a baby experiencing a full gestation period of 40 weeks would be best predicted as 8 points. I found that a model freely estimating the scores for such babies significantly improved model fit, suggesting that the corresponding apgar score is likely to differ from 8. That is, I can rule out the possibility that 8 points should be considered the expected apgar score for babies experiencing the full 40 weeks of gestation.
Last week I used the insurance dataset, which includes annual medical insurance costs of \(1338\) Americans alongside a number of personal characteristics such as age, sex, smoking status, and so on, to determine if one’s BMI signficantly influenced insurance costs. Unsurprisingly, the results affirmed that BMI significantly increased medical insurance costs amongst these individuals. Here I modify that analysis to determine if an additional unit of BMI plausibly increases insurance costs by $100. The compacted and augmented models as well as the null hypothesis for this comparison are \[ Compact: charges_i = \beta_0 + 100*BMI_i + \epsilon_i \\ Augmented: charges_i = \beta_0 + \beta_1*BMI_i + \epsilon_i \\ Null: \beta_1 = 100 \] Below I employ these models to test the null:
insurance = read.csv("insurance.csv", header=TRUE)
summary(insurance)
## age sex bmi children smoker
## Min. :18.00 female:662 Min. :15.96 Min. :0.000 no :1064
## 1st Qu.:27.00 male :676 1st Qu.:26.30 1st Qu.:0.000 yes: 274
## Median :39.00 Median :30.40 Median :1.000
## Mean :39.21 Mean :30.66 Mean :1.095
## 3rd Qu.:51.00 3rd Qu.:34.69 3rd Qu.:2.000
## Max. :64.00 Max. :53.13 Max. :5.000
## region charges
## northeast:324 Min. : 1122
## northwest:325 1st Qu.: 4740
## southeast:364 Median : 9382
## southwest:325 Mean :13270
## 3rd Qu.:16640
## Max. :63770
model.c3 = lm(insurance$charges ~ 1 + offset(100*insurance$bmi))
model.a3 = lm(insurance$charges ~ 1 + insurance$bmi)
mcSummary(model.a3)
## lm(formula = insurance$charges ~ 1 + insurance$bmi)
##
## Omnibus ANOVA
## SS df MS EtaSq F p
## Model 7713391237 1 7713391237 0.039 54.709 0
## Error 188360830332 1336 140988645
## Corr Total 196074221568 1337 146652372
##
## RMSE AdjEtaSq
## 11873.86 0.039
##
## Coefficients
## Est StErr t SSR(3) EtaSq tol CI_2.5
## (Intercept) 1192.937 1664.802 0.717 72392559 0.000 NA -2072.974
## insurance$bmi 393.873 53.251 7.397 7713391237 0.039 NA 289.409
## CI_97.5 p
## (Intercept) 4458.849 0.474
## insurance$bmi 498.337 0.000
modelCompare(ModelC = model.c3, ModelA = model.a3)
## SSE (Compact) = 192654734369
## SSE (Augmented) = 188360830332
## Delta R-Squared = 0.03933914
## Partial Eta-Squared (PRE) = 0.02228808
## F(1,1336) = 30.45567, p = 4.098075e-08
That is, in an augmented model controlling for no other factors, BMI significantly predicted annual medical insurance costs, \(B_1 = 393.87, F(1, 1336) = 30.45, p << .001\), such that each additional unit of BMI increased medical insurance costs by ~$394; this improved model fit above a compact model predicting that \(\beta_1 = 100\), or that a one unit increase in BMI increase annual medical insurance costs by $100.
In this test I determine whether or not an individual at the BMI threshold for obesity (defined as BMI = 25) is at all likely to have annual medical insurance costs of $25000. The augmented and compact models as well as the null hypothesis for this test are \[ Compact: charges_i = 25000 + \beta_1*BMIthresh_i + \epsilon_i \\ Augmented: charges_i = \beta_0 + \beta_1*BMIthresh_i + \epsilon_i\\ Null hypothesis: \beta_0 = 25000 \] I run the test below:
BMIthresh = insurance$bmi -25
insurance$int25000 = 25000
model.c4 = lm(insurance$charges ~ 0 + BMIthresh, offset = int25000, data = insurance)
model.a4 = lm(insurance$charges ~ 1 + BMIthresh)
modelSummary(model.a4)
## lm(formula = insurance$charges ~ 1 + BMIthresh)
## Observations: 1338
##
## Linear model fit by least squares
##
## Coefficients:
## Estimate SE t Pr(>|t|)
## (Intercept) 11039.76 443.08 24.916 < 2e-16 ***
## BMIthresh 393.87 53.25 7.397 2.46e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Sum of squared errors (SSE): 188360830331.8, Error df: 1336
## R-squared: 0.0393
modelCompare(ModelC = model.c4, ModelA = model.a4)
## SSE (Compact) = 328318938855
## SSE (Augmented) = 188360830332
## Delta R-Squared = -0.6087557
## Partial Eta-Squared (PRE) = 0.426287
## F(1,1336) = 992.6906, p = 2.146358e-163
In a model controlling for BMI at the threshold of obesity, the prediction of annual medical insurance costs at this threshold significantly improved model fit, \(B_0 = 11039, F(1,1336) = 992.69, p << .001\), above a model predicting annual medical insurance costs of $25000. That is, at BMI = 25, the accepted threshold for obesity, annual medical insurance costs are highly unlikely to be $25000; they are likely much lower at around $11000.