Lab 7 - PSYC 5741

Question 1

We run the model comparison:

apgar = read.csv("apgar.csv", header = TRUE)
apgar

##    CASE APGAR GENDER SMOKES WGTGAIN GESTAT PRENAT ANNINC
## 1    43     6      0      0      22     37      0     32
## 2    51     5      0      0      50     35      0     17
## 3    56     4      0      0      60     36      0     21
## 4    53     4      1      0      60     37      0     32
## 5    40     6      0      0      35     41      1     23
## 6    59     2      0      0      50     38      1     12
## 7    17     8      1      0      18     42      1     37
## 8    27     7      1      0      27     36      1     90
## 9    47     5      1      0      24     40      1     97
## 10    4    10      0      0      30     42      2    150
## 11   10     9      0      0      25     42      2     12
## 12   11     9      0      0      17     39      2     87
## 13   12     9      0      0      35     38      2     41
## 14   20     8      0      0      19     36      2     30
## 15   22     8      0      0      28     40      2     21
## 16   30     7      0      0      32     41      2     34
## 17   31     7      0      0      27     42      2     45
## 18   41     6      0      0      32     35      2     11
## 19    2    10      1      0      20     39      2    100
## 20    8     9      1      0      21     40      2     39
## 21   15     8      1      0      35     40      2     46
## 22   16     8      1      0      21     41      2     67
## 23   18     8      1      0      17     38      2     10
## 24   25     7      1      0      40     39      2     34
## 25   37     6      1      0      19     38      2     19
## 26   46     5      1      0      28     41      2     21
## 27   57     3      1      0      75     39      2     56
## 28    5    10      0      0      22     38      3     29
## 29   13     9      0      0      23     38      3    160
## 30   21     8      0      0      25     37      3     34
## 31   23     8      0      0      30     41      3     76
## 32   32     7      0      0      19     38      3    130
## 33   33     7      0      0      28     39      3     29
## 34   42     6      0      0      28     38      3     34
## 35   50     5      0      0      45     38      3    160
## 36    1    10      1      0      25     40      3     50
## 37    3    10      1      0      18     41      3     90
## 38    7     9      1      0      23     36      3    180
## 39    9     9      1      0      32     42      3     90
## 40   14     8      1      0      22     39      3     99
## 41   26     7      1      0      24     37      3     54
## 42   28     7      1      0      19     35      3     65
## 43   36     6      1      0      34     40      3     65
## 44   38     6      1      0      24     39      3     45
## 45   34     6      1      1      18     35      0     23
## 46   44     5      1      1      37     38      0     21
## 47   19     8      0      1      25     37      1     15
## 48   29     7      0      1      32     37      1     42
## 49   49     5      0      1      15     28      1     32
## 50   55     4      0      1      25     39      1     31
## 51   58     3      0      1      13     22      1     98
## 52    6     9      1      1      18     37      1     60
## 53   35     6      1      1      26     37      1     99
## 54   48     5      0      1      10     20      2     34
## 55   24     7      1      1      14     42      2     21
## 56   39     6      0      1      19     40      3     65
## 57   54     4      0      1      12     25      3     97
## 58   45     5      1      1      35     42      3     34
## 59   52     4      1      1       8     20      3     56
## 60   60     1      1      1      10     20      3     45

summary(apgar)

##       CASE           APGAR            GENDER        SMOKES      
##  Min.   : 1.00   Min.   : 1.000   Min.   :0.0   Min.   :0.0000  
##  1st Qu.:15.75   1st Qu.: 5.000   1st Qu.:0.0   1st Qu.:0.0000  
##  Median :30.50   Median : 7.000   Median :0.5   Median :0.0000  
##  Mean   :30.50   Mean   : 6.683   Mean   :0.5   Mean   :0.2667  
##  3rd Qu.:45.25   3rd Qu.: 8.000   3rd Qu.:1.0   3rd Qu.:1.0000  
##  Max.   :60.00   Max.   :10.000   Max.   :1.0   Max.   :1.0000  
##     WGTGAIN          GESTAT          PRENAT          ANNINC      
##  Min.   : 8.00   Min.   :20.00   Min.   :0.000   Min.   : 10.00  
##  1st Qu.:19.00   1st Qu.:37.00   1st Qu.:1.000   1st Qu.: 29.00  
##  Median :25.00   Median :38.00   Median :2.000   Median : 41.50  
##  Mean   :27.08   Mean   :37.12   Mean   :1.967   Mean   : 55.78  
##  3rd Qu.:32.00   3rd Qu.:40.00   3rd Qu.:3.000   3rd Qu.: 78.75  
##  Max.   :75.00   Max.   :42.00   Max.   :3.000   Max.   :180.00

GESTAT_c = apgar$GESTAT - mean(apgar$GESTAT)

model.c1 = lm(apgar$APGAR ~ 1)
mcSummary(model.c1)

## lm(formula = apgar$APGAR ~ 1)
## 
## Omnibus ANOVA
##                 SS df    MS EtaSq F p
## Model        0.000  0   Inf     0    
## Error      260.983 59 4.423          
## Corr Total 260.983 59 4.423          
## 
##   RMSE AdjEtaSq
##  2.103        0
## 
## Coefficients
##               Est StErr      t   SSR(3) EtaSq tol CI_2.5 CI_97.5 p
## (Intercept) 6.683 0.272 24.614 2680.017 0.911  NA   6.14   7.227 0

model.a1 = lm(apgar$APGAR ~ 1 + GESTAT_c)
mcSummary(model.a1)

## lm(formula = apgar$APGAR ~ 1 + GESTAT_c)
## 
## Omnibus ANOVA
##                 SS df     MS EtaSq      F p
## Model       72.044  1 72.044 0.276 22.116 0
## Error      188.939 58  3.258               
## Corr Total 260.983 59  4.423               
## 
##   RMSE AdjEtaSq
##  1.805    0.264
## 
## Coefficients
##               Est StErr      t   SSR(3) EtaSq tol CI_2.5 CI_97.5 p
## (Intercept) 6.683 0.233 28.683 2680.017 0.934  NA  6.217   7.150 0
## GESTAT_c    0.205 0.044  4.703   72.044 0.276  NA  0.118   0.292 0

From the output of $mcSummary$ above we can define two confidence intervals for the mean apgar score, one derived from the unconditional model and the other from the model conditional on GESTAT scores: \[ Unconditional: M_{apgar} \in [6.14, 7.227]\\ Conditional: M_{apgar} \in [6.217, 7.150] \] These two confidences intervals differ because the mean-squared error in the conditional model has been reduced compared to that of the unconditional model. The formula for the confidence interval of the intercept makes this clear: \[ CI_{.95} = b_{0} \pm \sqrt{\frac{F*MSE}{n}} \] That is, when the MSE of a given model is reduced, the confidence interval shrinks accordingly.

Question 2

We run the model comparison below:

GESTAT_40 = apgar$GESTAT - 40 #center gestat scores on 40 weeks
apgar$int8 = 8 

model.c2 = lm(apgar$APGAR ~ 0 + GESTAT_40, offset = int8, data = apgar) 
model.a2 = lm(apgar$APGAR ~ 1 + GESTAT_40)

modelCompare(ModelC = model.c2, ModelA = model.a2)

## SSE (Compact) =  213.3758 
## SSE (Augmented) =  188.939 
## Delta R-Squared =  -0.6626379 
## Partial Eta-Squared (PRE) =  0.114525 
## F(1,58) = 7.501568, p = 0.008175292

From the $modelCompare$ output we observe that with an $F(1, 58) = 7.501$ and corresponding $p = .008$, we can reject the null hypothesis that $\beta_0 = 8$. That is, the best predicted apgar score for a baby experiencing 40 weeks of gestation is very unlikely to be 8.

A brief “news summary” of the results and conclusions for this test is given immediately below: I set out to determine whether or not the expected apgar score for a baby experiencing a full gestation period of 40 weeks would be best predicted as 8 points. I found that a model freely estimating the scores for such babies significantly improved model fit, suggesting that the corresponding apgar score is likely to differ from 8. That is, I can rule out the possibility that 8 points should be considered the expected apgar score for babies experiencing the full 40 weeks of gestation.

Question 3

Last week I used the insurance dataset, which includes annual medical insurance costs of $1338$ Americans alongside a number of personal characteristics such as age, sex, smoking status, and so on, to determine if one’s BMI signficantly influenced insurance costs. Unsurprisingly, the results affirmed that BMI significantly increased medical insurance costs amongst these individuals. Here I modify that analysis to determine if an additional unit of BMI plausibly increases insurance costs by $100. The compacted and augmented models as well as the null hypothesis for this comparison are \[ Compact: charges_i = \beta_0 + 100*BMI_i + \epsilon_i \\ Augmented: charges_i = \beta_0 + \beta_1*BMI_i + \epsilon_i \\ Null: \beta_1 = 100 \] Below I employ these models to test the null:

insurance = read.csv("insurance.csv", header=TRUE)
summary(insurance)

##       age            sex           bmi           children     smoker    
##  Min.   :18.00   female:662   Min.   :15.96   Min.   :0.000   no :1064  
##  1st Qu.:27.00   male  :676   1st Qu.:26.30   1st Qu.:0.000   yes: 274  
##  Median :39.00                Median :30.40   Median :1.000             
##  Mean   :39.21                Mean   :30.66   Mean   :1.095             
##  3rd Qu.:51.00                3rd Qu.:34.69   3rd Qu.:2.000             
##  Max.   :64.00                Max.   :53.13   Max.   :5.000             
##        region       charges     
##  northeast:324   Min.   : 1122  
##  northwest:325   1st Qu.: 4740  
##  southeast:364   Median : 9382  
##  southwest:325   Mean   :13270  
##                  3rd Qu.:16640  
##                  Max.   :63770

model.c3 = lm(insurance$charges ~ 1 + offset(100*insurance$bmi))
model.a3 = lm(insurance$charges ~ 1 + insurance$bmi)
mcSummary(model.a3)

## lm(formula = insurance$charges ~ 1 + insurance$bmi)
## 
## Omnibus ANOVA
##                      SS   df         MS EtaSq      F p
## Model        7713391237    1 7713391237 0.039 54.709 0
## Error      188360830332 1336  140988645               
## Corr Total 196074221568 1337  146652372               
## 
##      RMSE AdjEtaSq
##  11873.86    0.039
## 
## Coefficients
##                    Est    StErr     t     SSR(3) EtaSq tol    CI_2.5
## (Intercept)   1192.937 1664.802 0.717   72392559 0.000  NA -2072.974
## insurance$bmi  393.873   53.251 7.397 7713391237 0.039  NA   289.409
##                CI_97.5     p
## (Intercept)   4458.849 0.474
## insurance$bmi  498.337 0.000

modelCompare(ModelC = model.c3, ModelA = model.a3)

## SSE (Compact) =  192654734369 
## SSE (Augmented) =  188360830332 
## Delta R-Squared =  0.03933914 
## Partial Eta-Squared (PRE) =  0.02228808 
## F(1,1336) = 30.45567, p = 4.098075e-08

That is, in an augmented model controlling for no other factors, BMI significantly predicted annual medical insurance costs, $B_1 = 393.87, F(1, 1336) = 30.45, p << .001$, such that each additional unit of BMI increased medical insurance costs by ~$394; this improved model fit above a compact model predicting that $\beta_1 = 100$, or that a one unit increase in BMI increase annual medical insurance costs by $100.

Question 4

In this test I determine whether or not an individual at the BMI threshold for obesity (defined as BMI = 25) is at all likely to have annual medical insurance costs of $25000. The augmented and compact models as well as the null hypothesis for this test are \[ Compact: charges_i = 25000 + \beta_1*BMIthresh_i + \epsilon_i \\ Augmented: charges_i = \beta_0 + \beta_1*BMIthresh_i + \epsilon_i\\ Null hypothesis: \beta_0 = 25000 \] I run the test below:

BMIthresh = insurance$bmi -25
insurance$int25000 = 25000
model.c4 = lm(insurance$charges ~ 0 + BMIthresh, offset = int25000, data = insurance)
model.a4 = lm(insurance$charges ~ 1 + BMIthresh)
modelSummary(model.a4)

## lm(formula = insurance$charges ~ 1 + BMIthresh)
## Observations: 1338
## 
## Linear model fit by least squares
## 
## Coefficients:
##             Estimate       SE      t Pr(>|t|)    
## (Intercept) 11039.76   443.08 24.916  < 2e-16 ***
## BMIthresh     393.87    53.25  7.397 2.46e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Sum of squared errors (SSE): 188360830331.8, Error df: 1336
## R-squared:  0.0393

modelCompare(ModelC = model.c4, ModelA = model.a4)

## SSE (Compact) =  328318938855 
## SSE (Augmented) =  188360830332 
## Delta R-Squared =  -0.6087557 
## Partial Eta-Squared (PRE) =  0.426287 
## F(1,1336) = 992.6906, p = 2.146358e-163

In a model controlling for BMI at the threshold of obesity, the prediction of annual medical insurance costs at this threshold significantly improved model fit, $B_0 = 11039, F(1,1336) = 992.69, p << .001$, above a model predicting annual medical insurance costs of $25000. That is, at BMI = 25, the accepted threshold for obesity, annual medical insurance costs are highly unlikely to be $25000; they are likely much lower at around $11000.

Lab 7 - PSYC 5741

Spencer Moore

10/8/2020

Question 1

Question 2

Question 3

Question 4