HW07

Sarah Coffin

October 6, 2021

Question 1

In lecture we are discussing a small example relating height and weight. A much larger set of observations is available in the dataset ‘htwt’. The available variables are HEIGHT, AGE, and WEIGHT. First get the descriptive statistics (e.g., mean and standard deviation) on these variables. Then use lm (in R) to examine the following conditional model:

#Height
mean(htwt$HEIGHT)
## [1] 61.36456
sd(htwt$HEIGHT)
## [1] 3.945402
#Weight
mean(htwt$WEIGHT)
## [1] 101.308
sd(htwt$WEIGHT)
## [1] 19.4407
#Age
mean(htwt$AGE)
## [1] 16.44304
sd(htwt$AGE)
## [1] 1.842577
  • Use AGE to predict WEIGHT. Said another way, regress WEIGHT on AGE. Our question of interest is whether there is a relationship between age and weight.
  1. Write out the model A/model C comparison to test this question.

Model A: \(WEIGHT_i = \beta_0 + \beta_1AGE_i + \epsilon_i\) Model C: \(WEIGHT_i = \beta_0 + \epsilon_i\) Null hypothesis: \(\beta_1 = 0\)

#Model A
htwt.a.1 <- lm(WEIGHT~AGE, data = htwt)
mcSummary(htwt.a.1)
## lm(formula = WEIGHT ~ AGE, data = htwt)
## 
## Omnibus ANOVA
##                  SS  df        MS EtaSq       F p
## Model      35924.08   1 35924.084 0.403 158.479 0
## Error      53269.93 235   226.681                
## Corr Total 89194.01 236   377.941                
## 
##    RMSE AdjEtaSq
##  15.056      0.4
## 
## Coefficients
##                Est StErr      t    SSR(3) EtaSq tol  CI_2.5 CI_97.5     p
## (Intercept) -8.794 8.800 -0.999   226.322 0.004  NA -26.131   8.544 0.319
## AGE          6.696 0.532 12.589 35924.084 0.403  NA   5.648   7.744 0.000
# Model C
htwt.c.1 <- lm(WEIGHT ~ 1, data = htwt)
mcSummary(htwt.c.1)
## lm(formula = WEIGHT ~ 1, data = htwt)
## 
## Omnibus ANOVA
##                  SS  df      MS EtaSq F p
## Model          0.00   0     Inf     0    
## Error      89194.01 236 377.941          
## Corr Total 89194.01 236 377.941          
## 
##    RMSE AdjEtaSq
##  19.441        0
## 
## Coefficients
##                 Est StErr      t  SSR(3) EtaSq tol CI_2.5 CI_97.5 p
## (Intercept) 101.308 1.263 80.224 2432405 0.965  NA  98.82 103.796 0
  1. Generate a plot of the model’s predictions and provide interpretations of both the slope and the intercept.
plot(htwt$WEIGHT ~ htwt$AGE,
     xlab = "Age (years)", ylab = "Weight (lbs)",
     main = "Weight (lbs) and Age (years)")
abline(htwt.c.1, col = "magenta", lwd = 2) 
abline(htwt.a.1, col = "green", lwd = 2)

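Interpretation (Slope): For each additional year of age, predicted weight increases by 6.70 pounds. Interpretation (Intercept): A person zero years of age is predicted to weigh -8.79 pounds. This is an extrapolation far below the ages in the sample (mean 16.44, SD 1.84), so it is not meaningful on its own.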

  1. Write out the prediction equation represented by this line.

\(\widehat{WEIGHT}_i = -8.79 + 6.70AGE_i\)

  1. What is the sum of squared errors for model A?

The sum of squared errors for Model A is 53269.93.

  1. What is the sum of squared errors for model C?

The sum of squared errors for Model C is 89194.01.

  1. Calculate PRE and F* that compares the pair of models and write 1) a statistical conclusion and 2) a short substantive conclusion (i.e., in English).
#PRE = (SSE(C) - SSE(A))/(SSE(C))
PRE1 <- (89194.01-53269.93)/(89194.01)
PRE1
## [1] 0.4027634
#F* = (PRE/(PA - PC)) / ((1 - PRE)/(n - PA))
Fstar_1 <-((0.4027634)/(2-1))/((1-0.4027634)/(237-2))
Fstar_1
## [1] 158.4789

Statistical Conclusion: There is a significant association between age and weight (b=6.70), F(1,235) = 158.48, PRE = 0.40, p < .001.
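The p-value in the statistical conclusion can be recovered from the two SSE values alone. The sketch below wraps the PRE and F* formulas in a small helper; `pre_f` is a hypothetical name introduced here for illustration (it is not an lmSupport function), and the p-value comes from base R's `pf()`.

```r
# Hypothetical helper (not part of lmSupport): PRE, F*, and p from two SSEs.
# sse_c / sse_a: sum of squared errors for Model C / Model A
# n: number of observations; pa / pc: number of parameters in Model A / Model C
pre_f <- function(sse_c, sse_a, n, pa, pc) {
  pre    <- (sse_c - sse_a) / sse_c
  f_star <- (pre / (pa - pc)) / ((1 - pre) / (n - pa))
  # Upper-tail probability of F* under the null hypothesis
  p      <- pf(f_star, df1 = pa - pc, df2 = n - pa, lower.tail = FALSE)
  list(PRE = pre, F = f_star, df1 = pa - pc, df2 = n - pa, p = p)
}

# SSE values reported above for the WEIGHT ~ AGE comparison
q1 <- pre_f(sse_c = 89194.01, sse_a = 53269.93, n = 237, pa = 2, pc = 1)
print(q1)
```

The same helper could be reused for the later comparisons by swapping in their SSE values and sample sizes.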

Substantive Conclusion: We wanted to test whether age is associated with weight, so we regressed weight on age. We found a significant relationship between age and weight (b = 6.70), indicating that predicted weight increases by approximately 6.7 pounds for each additional year of age. The intercept (-8.79) is the predicted weight at age zero, which is outside the range of the data. At the average age (16.44 years), the model predicts the average weight of 101.31 pounds.

Question 2

Immediately after birth, babies are assigned an APGAR score based on such factors as breathing, reflexes, skin color, etc. Scores range from 0 to 10, with 10 being a normal, healthy baby. Low scores indicate the need for immediate critical care. The data in the dataset ‘apgar’ represent hypothetical data from a study of disadvantaged mothers who were provided with various opportunities to obtain pre-natal care. Presumably, the longer the gestation period, the healthier the baby.

  1. Estimate a model in which you predict APGAR scores from gestation period (GESTAT). Interpret the resulting intercept and slope.
#Model A
apgar.a.1 <- lm(APGAR~GESTAT, data = apgar)
mcSummary(apgar.a.1)
## lm(formula = APGAR ~ GESTAT, data = apgar)
## 
## Omnibus ANOVA
##                 SS df     MS EtaSq      F p
## Model       72.044  1 72.044 0.276 22.116 0
## Error      188.939 58  3.258               
## Corr Total 260.983 59  4.423               
## 
##   RMSE AdjEtaSq
##  1.805    0.264
## 
## Coefficients
##                Est StErr      t SSR(3) EtaSq tol CI_2.5 CI_97.5     p
## (Intercept) -0.930 1.636 -0.569  1.054 0.006  NA -4.204   2.344 0.572
## GESTAT       0.205 0.044  4.703 72.044 0.276  NA  0.118   0.292 0.000
# Model C
apgar.c.1 <- lm(APGAR ~ 1, data = apgar)
mcSummary(apgar.c.1)
## lm(formula = APGAR ~ 1, data = apgar)
## 
## Omnibus ANOVA
##                 SS df    MS EtaSq F p
## Model        0.000  0   Inf     0    
## Error      260.983 59 4.423          
## Corr Total 260.983 59 4.423          
## 
##   RMSE AdjEtaSq
##  2.103        0
## 
## Coefficients
##               Est StErr      t   SSR(3) EtaSq tol CI_2.5 CI_97.5 p
## (Intercept) 6.683 0.272 24.614 2680.017 0.911  NA   6.14   7.227 0
#Means
mean(apgar$APGAR)
## [1] 6.683333
mean(apgar$GESTAT)
## [1] 37.11667
#Standard Deviations
sd(apgar$APGAR)
## [1] 2.103199
sd(apgar$GESTAT)
## [1] 5.387027

Interpretation (Slope): For each additional week of gestation, the predicted APGAR score increases by 0.21 points. Interpretation (Intercept): An infant at zero weeks of gestation is predicted to have an APGAR score of -0.93, which falls outside the 0 to 10 range of the scale and is not meaningful on its own.

  1. Now provide a test of whether longer gestation periods are associated with higher APGAR scores. Write model A, model C, SSE(A), SSE(C), PRE, F*, and a short, substantive conclusion. Also compute the correlation between the two variables.

Model A: \(APGAR_i = \beta_0 + \beta_1GESTAT_i + \epsilon_i\) Model C: \(APGAR_i = \beta_0 + \epsilon_i\) Null hypothesis: \(\beta_1 = 0\)

SSE(A) = 188.94 SSE(C) = 260.98

#PRE = (SSE(C) - SSE(A))/(SSE(C))
PRE2 <- (260.983 - 188.939)/260.983
PRE2
## [1] 0.2760486
#F* = (PRE/(PA - PC)) / ((1 - PRE)/(n - PA))
Fstar_2 <- ((0.2760486)/(2-1))/((1-0.2760486)/(60-2))
Fstar_2
## [1] 22.11588
cor(apgar$GESTAT, apgar$APGAR, method = c("pearson"))
## [1] 0.5254043
plot(apgar$APGAR ~ apgar$GESTAT,
     xlab = "Gestation period (weeks)", ylab = "APGAR Score",
     main = "APGAR Score and Gestation Period (Weeks) ")
abline(apgar.c.1, col = "purple", lwd = 2) 
abline(apgar.a.1, col = "orange", lwd = 2)

Statistical Conclusion: Gestation period and APGAR score are moderately and positively correlated (r = .53), and gestation period is a significant predictor of APGAR score (b = 0.21), F(1,58) = 22.12, PRE = 0.28, p < .001.
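As a check on the numbers above: in a regression with a single predictor, the squared Pearson correlation equals PRE, and the squared t statistic for the slope equals F*. Using the reported values (small discrepancies are due to rounding):

\[ r^2 = 0.5254^2 \approx 0.276 = PRE, \qquad t^2 = 4.703^2 \approx 22.12 \approx F^* \]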

Substantive Conclusion: We wanted to test whether gestation period (weeks) is associated with infant APGAR scores, so we regressed APGAR scores on gestation period. We found a significant relationship between gestation period and APGAR scores (b = 0.21), indicating that predicted APGAR scores increase by approximately 0.21 points for each additional week of gestation. The intercept (-0.93) is the predicted APGAR score at zero weeks of gestation, which is outside the range of the data. At the average gestation period (37.12 weeks), the model predicts the average APGAR score of 6.68.

  1. Now redo this estimation, this time centering (mean-deviating) GESTAT (call this new predictor GESTAT_c) before running the analysis. Write the new models A and C, then give the parameter estimates for the new model A and provide interpretations of them, specifically comparing them to the first model A that you estimated.

Model A: \(APGAR_i = \beta_0 + \beta_1GESTAT\_c_i + \epsilon_i\) Model C: \(APGAR_i = \beta_0 + \epsilon_i\)

mean(apgar$GESTAT)
## [1] 37.11667
apgar$GESTAT_c <- apgar$GESTAT - mean(apgar$GESTAT)
G_C_M <- lm(APGAR ~ GESTAT_c, data = apgar)

mcSummary(G_C_M)
## lm(formula = APGAR ~ GESTAT_c, data = apgar)
## 
## Omnibus ANOVA
##                 SS df     MS EtaSq      F p
## Model       72.044  1 72.044 0.276 22.116 0
## Error      188.939 58  3.258               
## Corr Total 260.983 59  4.423               
## 
##   RMSE AdjEtaSq
##  1.805    0.264
## 
## Coefficients
##               Est StErr      t   SSR(3) EtaSq tol CI_2.5 CI_97.5 p
## (Intercept) 6.683 0.233 28.683 2680.017 0.934  NA  6.217   7.150 0
## GESTAT_c    0.205 0.044  4.703   72.044 0.276  NA  0.118   0.292 0

The parameter estimates for this model with mean-centered gestation are \(b_0\) = 6.68 and \(b_1\) = 0.21.

Interpretation (Slope): For each additional week of gestation, the predicted APGAR score increases by 0.21 points. Compared to the first Model A, which was not centered at the mean, the slope did not change.

Interpretation (Intercept): An infant at the mean gestation period (37.12 weeks, where GESTAT_c = 0) is predicted to have an APGAR score of 6.68, which is also the mean APGAR score. Compared to the first Model A, which was not centered at the mean, the intercept changed from -0.93 to 6.68.
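The change in the intercept follows from rewriting the prediction equation around the mean of GESTAT. This is a worked check using the rounded estimates reported above, so the last digit may differ slightly:

\[ b_0 + b_1GESTAT_i = (b_0 + b_1\overline{GESTAT}) + b_1(GESTAT_i - \overline{GESTAT}) \]

\[ b_0 + b_1\overline{GESTAT} = -0.930 + 0.205(37.117) \approx 6.68 \]

The term in parentheses is the intercept of the centered model, and the slope \(b_1\) is untouched.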

  1. Finally, redo the analysis one last time, but this time changing GESTAT not by subtracting a constant from it (i.e., its mean) but rather changing its units into days rather than weeks. In other words, compute a new predictor variable from GESTAT that is GESTAT multiplied by 7 (the number of days in a week). Regress APGAR on this new transformed version of GESTAT and again provide interpretations of both the intercept and slope.
mean(apgar$GESTAT)
## [1] 37.11667
apgar$GESTAT_days <- apgar$GESTAT * 7
GESTAT_days <-lm(APGAR ~ GESTAT_days, data = apgar)

mcSummary(GESTAT_days)
## lm(formula = APGAR ~ GESTAT_days, data = apgar)
## 
## Omnibus ANOVA
##                 SS df     MS EtaSq      F p
## Model       72.044  1 72.044 0.276 22.116 0
## Error      188.939 58  3.258               
## Corr Total 260.983 59  4.423               
## 
##   RMSE AdjEtaSq
##  1.805    0.264
## 
## Coefficients
##                Est StErr      t SSR(3) EtaSq tol CI_2.5 CI_97.5     p
## (Intercept) -0.930 1.636 -0.569  1.054 0.006  NA -4.204   2.344 0.572
## GESTAT_days  0.029 0.006  4.703 72.044 0.276  NA  0.017   0.042 0.000

Interpretation (Slope): For every day of gestation, we expect that the APGAR score will change by 0.03.

Interpretation (Intercept): An infant at zero days of gestation is expected to have an APGAR score of -0.93.

  1. You now have three prediction functions, one with GESTAT as the predictor and the other two with different versions of GESTAT as the predictor. From each, calculate the predicted APGAR score for a baby whose gestation period is 39 weeks.

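#GESTAT \(\widehat{APGAR} = -0.930 + 0.205(39) = 7.065\)

#GESTAT_c \(\widehat{APGAR} = 6.683 + 0.205(39 - 37.117) = 7.069\)

#GESTAT_days \(\widehat{APGAR} = -0.930 + 0.029(273) = 6.987\)

The three predictions agree at about 7.07. The small differences come from rounding the coefficients: the days slope is 0.205/7 ≈ 0.0293 before it is rounded to 0.029.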
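The equivalence of the three prediction functions can also be checked with `predict()`, which avoids hand-rounding. The sketch below uses simulated stand-in data (the data frame `toy` and its values are assumptions, not the apgar dataset) and compares predictions and SSE across the raw, centered, and days versions of the predictor.

```r
# Simulated stand-in data: NOT the apgar dataset from the assignment.
set.seed(1)
toy <- data.frame(GESTAT = round(runif(60, min = 25, max = 45)))
toy$APGAR <- -1 + 0.2 * toy$GESTAT + rnorm(60, sd = 1.8)

# Two transformed versions of the same predictor
toy$GESTAT_c    <- toy$GESTAT - mean(toy$GESTAT)  # centered at the mean
toy$GESTAT_days <- toy$GESTAT * 7                 # weeks converted to days

m_raw  <- lm(APGAR ~ GESTAT, data = toy)
m_c    <- lm(APGAR ~ GESTAT_c, data = toy)
m_days <- lm(APGAR ~ GESTAT_days, data = toy)

# A 39-week gestation, expressed on each predictor's own scale
preds <- c(
  raw      = unname(predict(m_raw,  data.frame(GESTAT = 39))),
  centered = unname(predict(m_c,    data.frame(GESTAT_c = 39 - mean(toy$GESTAT)))),
  days     = unname(predict(m_days, data.frame(GESTAT_days = 39 * 7)))
)

# Sum of squared errors for each model
sse <- c(
  raw      = sum(resid(m_raw)^2),
  centered = sum(resid(m_c)^2),
  days     = sum(resid(m_days)^2)
)

print(preds)
print(sse)
```

The key point is that the new observation must be transformed the same way as the predictor: 39 weeks becomes 39 minus the mean for the centered model and 273 for the days model.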

  1. Write a brief summary paragraph about what you have learned from this exercise in transforming a predictor variable. In other words, how do these transformations affect the estimated intercept and slope, the predicted values, and the overall goodness of fit of the underlying model?

This exercise taught me that centering a predictor on its mean changes the intercept but not the slope: the intercept moved from -0.93 (the predicted APGAR score at zero weeks) to 6.68 (the predicted APGAR score at the mean gestation period), while \(b_1\) stayed at 0.21. Changing the units of the predictor does the opposite: converting weeks to days left the intercept at -0.93 and divided the slope by 7 (from 0.205 to 0.029), because one day is one-seventh of a week. Neither transformation changed the predicted values (about 7.07 for a 39-week gestation in all three models) or the goodness of fit: SSE (188.94), PRE (0.28), and F (22.12) were identical across the three models. The main benefit of centering is interpretive, since it gives the intercept a meaningful value within the range of the data.

Question 3 - Project application

  1. Choose a continuous predictor variable from your dataset that you are interested in using to predict your outcome variable of interest. Write model A, model C, SSE(A), SSE(C), PRE, F*, and a short, substantive conclusion.

Model A: \(CO2HouseholdEmit_i = \beta_0 + \beta_1foodwaste_i + \epsilon_i\) Model C: \(CO2HouseholdEmit_i = \beta_0 + \epsilon_i\)

#Model A
co2.a.1 <- lm(EmissionsTons~WastePounds, data = co2emissions)
mcSummary(co2.a.1)
## lm(formula = EmissionsTons ~ WastePounds, data = co2emissions)
## 
## Omnibus ANOVA
##                 SS  df    MS EtaSq     F     p
## Model        0.017   1 0.017     0 0.039 0.844
## Error      106.803 248 0.431                  
## Corr Total 106.820 249 0.429                  
## 
##   RMSE AdjEtaSq
##  0.656   -0.004
## 
## Coefficients
##               Est StErr      t  SSR(3) EtaSq tol CI_2.5 CI_97.5     p
## (Intercept) 7.477 0.187 40.074 691.596 0.866  NA  7.109   7.844 0.000
## WastePounds 0.000 0.000  0.196   0.017 0.000  NA  0.000   0.000 0.844
# Model C
co2.c.1 <- lm(EmissionsTons ~ 1, data = co2emissions)
mcSummary(co2.c.1)
## lm(formula = EmissionsTons ~ 1, data = co2emissions)
## 
## Omnibus ANOVA
##                SS  df    MS EtaSq F p
## Model        0.00   0   Inf     0    
## Error      106.82 249 0.429          
## Corr Total 106.82 249 0.429          
## 
##   RMSE AdjEtaSq
##  0.655        0
## 
## Coefficients
##               Est StErr       t   SSR(3) EtaSq tol CI_2.5 CI_97.5 p
## (Intercept) 7.513 0.041 181.355 14109.49 0.992  NA  7.431   7.594 0

SSE(A) = 106.80 SSE(C) = 106.82

#PRE = (SSE(C) - SSE(A))/(SSE(C))
PRE3 <- (106.820-106.8033)/(106.820)
PRE3
## [1] 0.0001563378
#F* = (PRE/(PA - PC)) / ((1 - PRE)/(n - PA))
Fstar_3 <-((0.0001563378)/(2-1))/((1-0.0001563378)/(250-2))
Fstar_3
## [1] 0.03877784

Substantive Conclusion: In this study, I wanted to test whether food waste (pounds) is associated with household carbon dioxide (CO2) emissions (tons), so I regressed household CO2 emissions on food waste. I found no significant relationship between household food waste and CO2 emissions (b < .001), F(1,248) = 0.04, PRE < .001, p = .844. The estimated slope is less than 0.001 tons of CO2 per pound of food waste and cannot be distinguished from zero. The intercept (7.48) is the predicted CO2 emissions of a household with zero food waste. At the average level of food waste, the model predicts the average emissions of 7.51 tons per year.

  1. Next, take that same analysis but test the slope against an a priori value other than 0. Try to come up with a somewhat meaningful test, but your data might not lend itself well to that. If so, just choose a number. As we discussed in class, just looking at the confidence interval of the slope in your initial model will tell you whether or not your a priori slope value is significantly different from your estimated slope. However, the CI doesn’t give us an exact PRE. So, write out Model A and Model C for this analysis (hint: model A will be the same as it was last time; Model C will be different), conduct the relevant model comparison and report the results in a short (2-3 sentences) writeup that would be suitable for a journal article. Don’t forget an interpretation sentence “written in English” at the end.

Model A: \(CO2HouseholdEmit_i = \beta_0 + \beta_1foodwaste_i + \epsilon_i\) Model C: \(CO2HouseholdEmit_i = 7 + \beta_1foodwaste_i + \epsilon_i\) Null hypothesis: \(\beta_0 = 7\)

#Model A
co2.a.1 <- lm(EmissionsTons~WastePounds, data = co2emissions)
mcSummary(co2.a.1)
## lm(formula = EmissionsTons ~ WastePounds, data = co2emissions)
## 
## Omnibus ANOVA
##                 SS  df    MS EtaSq     F     p
## Model        0.017   1 0.017     0 0.039 0.844
## Error      106.803 248 0.431                  
## Corr Total 106.820 249 0.429                  
## 
##   RMSE AdjEtaSq
##  0.656   -0.004
## 
## Coefficients
##               Est StErr      t  SSR(3) EtaSq tol CI_2.5 CI_97.5     p
## (Intercept) 7.477 0.187 40.074 691.596 0.866  NA  7.109   7.844 0.000
## WastePounds 0.000 0.000  0.196   0.017 0.000  NA  0.000   0.000 0.844
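# A priori intercept value of interest
int7 <- 7
# Note: `0 + WastePounds` removes the intercept, which fixes it at 0 rather than at int7.
# int7 is not used in this fit, so the comparison below tests b0 = 0, not b0 = 7.
co2.c.7 <- lm(EmissionsTons ~ 0 + WastePounds, data = co2emissions)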
mcSummary(co2.c.7)
## lm(formula = EmissionsTons ~ 0 + WastePounds, data = co2emissions)
## 
## Omnibus ANOVA
##                   SS  df        MS EtaSq F p
## Model      13417.907   1 13417.907          
## Error        798.399 249     3.206          
## Corr Total 14216.306 250    56.865          
## 
##   RMSE AdjEtaSq
##  1.791       NA
## 
## Coefficients
##               Est StErr      t   SSR(3) EtaSq tol CI_2.5 CI_97.5 p
## WastePounds 0.004     0 64.689 13417.91 0.944  NA  0.004   0.004 0
modelCompare(co2.c.7, co2.a.1)
## SSE (Compact) =  798.3993 
## SSE (Augmented) =  106.803 
## Delta R-Squared =  -0.9436836 
## Partial Eta-Squared (PRE) =  0.8662286 
## F(1,248) = 1605.909, p = 2.536442e-110

#OFFICE HOURS: Is this output telling us which is the better model of the two (mean vs. a priori estimate), or is it telling us whether or not household food waste (pounds) is a statistically significant predictor of household CO2 emissions? Or both?

Journal Article Interpretation: The current study examined the relationship between household food waste (pounds) and household carbon dioxide (CO2) emissions (tons) by regressing CO2 emissions on food waste. Food waste was not a significant predictor of CO2 emissions, F(1,248) = 0.04, PRE < .001, p = .844. To test the intercept against an a priori value of 7 tons, this model was compared with a compact model in which the intercept was fixed at 7. The estimated intercept (b = 7.48, 95% CI [7.11, 7.84]) differed significantly from 7, F(1,248) = 6.53, PRE = 0.026, p = .011. In other words, a household with no food waste is predicted to emit more than 7 tons of CO2 per year, and food waste alone does not predict annual household CO2 emissions. Other factors should be considered in understanding contributors to annual CO2 emissions per household in the US.
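A compact model with the intercept fixed at an a priori value can be fit in base R by subtracting that value from the outcome and dropping the intercept. The sketch below shows one way to do this on simulated stand-in data (the data frame `toy`, its values, and the name `b0_prior` are assumptions for illustration, not the co2emissions data), and computes PRE and F* from the residuals directly rather than through lmSupport.

```r
# Simulated stand-in data: NOT the co2emissions dataset from the assignment.
set.seed(2)
toy <- data.frame(WastePounds = runif(250, min = 100, max = 1000))
toy$EmissionsTons <- 7.5 + rnorm(250, sd = 0.65)

b0_prior <- 7  # a priori intercept value (hypothetical name)

# Model A: intercept and slope both estimated
model_a <- lm(EmissionsTons ~ WastePounds, data = toy)

# Model C: intercept fixed at b0_prior, slope still estimated.
# Subtracting b0_prior from the outcome and removing the intercept
# leaves residuals equal to Y - (b0_prior + b1 * X).
model_c <- lm(I(EmissionsTons - b0_prior) ~ 0 + WastePounds, data = toy)

sse_a <- sum(resid(model_a)^2)
sse_c <- sum(resid(model_c)^2)
n     <- nrow(toy)

pre    <- (sse_c - sse_a) / sse_c             # PA - PC = 2 - 1 = 1
f_star <- (pre / 1) / ((1 - pre) / (n - 2))
p      <- pf(f_star, df1 = 1, df2 = n - 2, lower.tail = FALSE)

# Cross-check: F* equals the squared t statistic for (b0 - b0_prior)
est    <- coef(summary(model_a))
t_stat <- (est["(Intercept)", "Estimate"] - b0_prior) / est["(Intercept)", "Std. Error"]

print(c(PRE = pre, F = f_star, p = p, t_squared = t_stat^2))
```

The same pattern would apply to an a priori slope value `B1`: the compact model becomes `lm(I(EmissionsTons - B1 * WastePounds) ~ 1, data = toy)`, which estimates only the intercept.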