Graduate Statistics Homework 9

setwd("~/Desktop/Grad Stats/Week 9")
getwd()

## [1] "/Users/sarahcoffin/Desktop/Grad Stats/Week 9"

library(prettydoc)
library(car)

## Loading required package: carData

library(psych)

## 
## Attaching package: 'psych'

## The following object is masked from 'package:car':
## 
##     logit

library(lmSupport)
library(pwr)
source('http://psych.colorado.edu/~jclab/R/mcSummaryLm.R')


heart <-read.csv("heart.csv", na.strings=".")
usnews <-read.csv("usnews.csv", na.strings=".")
co2emissions <-read.csv("co2emissions.csv", na.strings=".")

Homework 09

We’re reverting to the format of older assignments for a quick tutorial. I want to cover a few things that are helpful for multiple regression. This tutorial will use the example dataset, heart.csv.

As before, open R, read in the dataset. Open the packages psych, car, and lmSupport and activate mcSummary. Then we’ll load in the data and take a look at rows 71-80.

library(psych)
library(car)
library(lmSupport)
source('http://psych.colorado.edu/~jclab/R/mcSummaryLm.R')

heart <-read.csv("heart.csv", header=T, na.strings=".")
describe(heart)

##          vars   n   mean    sd median trimmed   mad min max range  skew
## Height      1 110 171.58 16.08  172.5  173.03 11.12  68 195   127 -3.41
## Weight      2 110  66.33 15.16   63.0   65.19 11.86  27 110    83  0.69
## Age         3 110  20.56  3.92   20.0   19.78  1.48  18  45    27  3.98
## Gender      4 110   1.46  0.50    1.0    1.45  0.00   1   2     1  0.14
## Smokes      5 110   1.90  0.30    2.0    2.00  0.00   1   2     1 -2.63
## Alcohol     6 110   1.38  0.49    1.0    1.35  0.00   1   2     1  0.48
## Exercise    7 110   2.21  0.65    2.0    2.26  0.00   1   3     2 -0.23
## Ran         8 110   1.58  0.50    2.0    1.60  0.00   1   2     1 -0.33
## Pulse1      9 109  75.69 13.30   76.0   75.02 10.38  47 145    98  1.47
## Pulse2     10 109  96.80 31.57   84.0   93.83 23.72  56 176   120  0.77
## Year       11 110  95.63  1.75   96.0   95.66  1.48  93  98     5 -0.30
##          kurtosis   se
## Height      18.34 1.53
## Weight       0.49 1.45
## Age         19.02 0.37
## Gender      -2.00 0.05
## Smokes       4.96 0.03
## Alcohol     -1.79 0.05
## Exercise    -0.75 0.06
## Ran         -1.91 0.05
## Pulse1       6.18 1.27
## Pulse2      -0.67 3.02
## Year        -1.18 0.17

heart[71:80,]

##    Height Weight Age Gender Smokes Alcohol Exercise Ran Pulse1 Pulse2 Year
## 71    186   87.0  23      1      2       1        1   1     49     83   97
## 72    190   82.0  19      1      2       2        2   2     76     73   97
## 73    179   80.0  20      1      2       1        2   1    145    155   97
## 74    165   48.0  19      2      2       2        3   2     83     84   97
## 75    172   53.0  20      1      2       2        3   1     72    136   97
## 76    173   64.0  20      2      2       1        2   2     NA     NA   97
## 77    170   53.5  20      2      2       2        3   2     60     62   97
## 78    170   58.5  20      1      2       2        3   2     80     82   97
## 79    163   51.0  18      2      2       1        3   1     70    120   97
## 80    191   78.0  19      1      2       1        1   1     68    136   97

Many datasets have missing cells. Look at the heart data (all 110 cases). If you check, you’ll see that observation 76 is missing data for the variables Pulse1 and Pulse2. Often this is no big deal. If we were to run a simple regression predicting Height as a function of Pulse1, obs 76 would naturally be left out. In our consideration of multiple regression, however, it will be nice to have a set of data with no missing cases. The reason is that we will look at models with different but partially overlapping sets of predictors. For example:

lm(Y ~ A) lm(Y ~ B) lm(Y ~ A + B)

We want to focus our attention on the differences between those models. But if case 23 is missing data on predictor A, and case 37 is missing data on predictor B, our results will change across the three models in part because the analyses use different subsets of data. That will distract us. By excluding any case with missing data (called “listwise deletion”), we can standardize the dataset for all models. We can do this with the following code.

heart <- na.omit(heart)

You wouldn’t generally use listwise deletion on your own data, but we’re doing it here for didactic purposes. It is, however, very important to know how to handle missing values in your own datasets. It’s a common issue.

Here we’ll use na.omit, then look at the dataset again. Notice that obs 76 is gone.

heart <- na.omit(heart)
heart[71:80,]

##    Height Weight Age Gender Smokes Alcohol Exercise Ran Pulse1 Pulse2 Year
## 71    186   87.0  23      1      2       1        1   1     49     83   97
## 72    190   82.0  19      1      2       2        2   2     76     73   97
## 73    179   80.0  20      1      2       1        2   1    145    155   97
## 74    165   48.0  19      2      2       2        3   2     83     84   97
## 75    172   53.0  20      1      2       2        3   1     72    136   97
## 77    170   53.5  20      2      2       2        3   2     60     62   97
## 78    170   58.5  20      1      2       2        3   2     80     82   97
## 79    163   51.0  18      2      2       1        3   1     70    120   97
## 80    191   78.0  19      1      2       1        1   1     68    136   97
## 81    172   59.0  18      2      2       1        2   1     78    129   97

Now, we’re going to begin running some multiple regressions. We’ll start by predicting Pulse2 from two simultaneous predictors, Age and Pulse1. The structure is a simple generalization of what you did with simple regression in Chapter 5.

model.1 <- lm(heart$Pulse2 ~ Age + Pulse1, data=heart)

So, we’ll run that model and then we’ll look at the results with mcSummary. That function was designed to give you a lot of the information you will need for multiple regression and ANOVA, and to put it in a succinct-but-ugly form. Below, you can see an overall, multi-df or “omnibus” test of the set of predictors (the ANOVA section). We also see estimates of each of three parameters in the model (intercept, age, and pulse1), and each test is associated with a single df.

Can you specify the 4 model comparisons we’re looking at, here? (Please test yourself. You will need to be able to.)

model.1 <- lm(heart$Pulse2 ~ Age + Pulse1, data=heart)
mcSummary(model.1)

## lm(formula = heart$Pulse2 ~ Age + Pulse1, data = heart)
## 
## Omnibus ANOVA
##                   SS  df       MS EtaSq     F p
## Model       14991.09   2 7495.545 0.139 8.577 0
## Error       92634.47 106  873.910              
## Corr Total 107625.56 108  996.533              
## 
##    RMSE AdjEtaSq
##  29.562    0.123
## 
## Coefficients
##                Est  StErr      t    SSR(3) EtaSq   tol CI_2.5 CI_97.5     p
## (Intercept) 49.775 23.934  2.080  3779.511 0.039    NA  2.322  97.227 0.040
## Age         -0.752  0.730 -1.030   927.860 0.010 0.979 -2.199   0.695 0.305
## Pulse1       0.826  0.216  3.818 12740.732 0.121 0.979  0.397   1.254 0.000

Four Model Comparisons: 1) $Pulse_2i = 49.76 + -0.75X_1age + + 0.86X_2pulse1 + Error $ 2) $Pulse_2i = 49.76 + -0.75X_1age + Error $ 3) $Pulse_2i = 49.76 + -0.86X_1pulse1 + Error $ 4) $Pulse_2i = 49.76 + Error $ 5) $Pulse_2i = B_0 + Error $

Existing R packages (even lmSupport) do provide this information in a more piecemeal fashion. Compare the output of the real packages you can download from cran. You need to be able to find useful information using this kind of “official” output. Make sure you can interpret this stuff.

modelSummary(model.1) # slopes

## lm(formula = heart$Pulse2 ~ Age + Pulse1, data = heart)
## Observations: 109
## 
## Linear model fit by least squares
## 
## Coefficients:
##             Estimate      SE      t Pr(>|t|)    
## (Intercept)  49.7746 23.9345  2.080 0.039973 *  
## Age          -0.7519  0.7298 -1.030 0.305165    
## Pulse1        0.8256  0.2162  3.818 0.000227 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Sum of squared errors (SSE): 92634.5, Error df: 106
## R-squared:  0.1393

modelEffectSizes(model.1) # PRE & SSR

## lm(formula = heart$Pulse2 ~ Age + Pulse1, data = heart)
## 
## Coefficients
##                    SSR df pEta-sqr dR-sqr
## (Intercept)  3779.5113  1   0.0392     NA
## Age           927.8601  1   0.0099 0.0086
## Pulse1      12740.7316  1   0.1209 0.1184
## 
## Sum of squared errors (SSE): 92634.5
## Sum of squared total  (SST): 107625.6

confint(model.1, level=0.95) # confidence intervals

##                  2.5 %     97.5 %
## (Intercept)  2.3222401 97.2270500
## Age         -2.1987386  0.6948625
## Pulse1       0.3969251  1.2543255

Important but esoteric note: mcSummary and modelEffectSizes provide what is called “Type III sums of squares,” which in these models are equivalent to Type II SS. If you use the simple anova() function, it provides Type I SS. These different kinds of SS ask different questions. The important point for us is that Type III SS for a given predictor partials out any variability that can be accounted for by any other predictor. That is often what we want. We will discuss this further in class. For now, to preserve your sanity, use mcSummary or modelEffectSizes to get the sum of squares for the homeworks.

Questions

From Glass & Hopkins we have the following variables. READ = reading score at end of first-grade RRDEV = reading readiness test from previous/kindergarten year [entered in mean deviation form, mean = 49.0] IQDEV = IQ score on standardized test from previous year [entered in mean deviation form, mean = 102.8]

We estimate a multiple regression model and obtain the following estimates.

READ^ = 26.0 + .362RRDEV + .180IQDEV

Interpret all 3 coefficients carefully in the style we used for the height, age, and weight example discussed in class (see also pp. 110-111 of the textbook).

Intercept: When RRDEV and IQDEV are average, reading score at the end of first grade (READ) is 26. $\beta_1$: Controlling for RRDEV, a one unit increase in IQ corresponds with a .180 increase in READ. $\beta_1$Controlling for IQDEV, a one unit increase in RRDEV corresponds with a .362 increase in READ.

Find the predicted reading score for a pupil with a RR score (not in mean deviation form) of 55 and an IQ of 120.

READ^ = 26.0 + ((.36255) - 49) + .180120.

The predicted reading score for a pupil with a RR score (not in mean deviation form) of 55 and an IQ of 120 is 18.51.

The dataset usnews.csv contains information on most of the colleges and universities in the United States. Parents, students, alumni, and state legislators are increasingly interested in a school’s graduation rate, the percentage of entering students who graduate from that school within 6 years. In this homework assignment you will explore some of the variables that might be related to a school’s graduation rate. The following variables (among many others) are available:

Variable Type Label

accptRate num Acceptance Rate: proportion 0-1
satTotal int Combined-SAT (Sum)
gradRate int Graduation Rate: percent 0-100 inTuition int In-state Tuition
numFulltime int Number of Fulltime Students sfRatio num Student/Faculty Ratio
top10HS int Percent in Top 10% of HS Class

Filter the data using na.omit. Then estimate the following regression models predicting gradRate. Make sure you obtain Type II/III sums of squares (mcSummary ensures Type III SS). In each of the 3 regression models, provide a detailed interpretation for the accptRate slope (e.g., for each additional inch of height, we predict a 3.9 pound increase in weight).

usnews <- na.omit(usnews)

Using accptRate as the only predictor.

model.2 <- lm(usnews$gradRate ~ accptRate, data=usnews)
mcSummary(model.2)

## lm(formula = usnews$gradRate ~ accptRate, data = usnews)
## 
## Omnibus ANOVA
##                   SS  df       MS EtaSq      F p
## Model       4252.344   1 4252.344 0.088 14.582 0
## Error      44326.649 152  291.623               
## Corr Total 48578.994 153  317.510               
## 
##    RMSE AdjEtaSq
##  17.077    0.082
## 
## Coefficients
##                 Est StErr      t    SSR(3) EtaSq tol  CI_2.5 CI_97.5 p
## (Intercept)  91.175 6.550 13.919 56502.330 0.560  NA  78.234 104.116 0
## accptRate   -33.360 8.736 -3.819  4252.344 0.088  NA -50.621 -16.100 0

modelSummary(model.2) # slopes

## lm(formula = usnews$gradRate ~ accptRate, data = usnews)
## Observations: 154
## 
## Linear model fit by least squares
## 
## Coefficients:
##             Estimate      SE      t Pr(>|t|)    
## (Intercept)   91.175   6.550 13.919  < 2e-16 ***
## accptRate    -33.360   8.736 -3.819 0.000195 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Sum of squared errors (SSE): 44326.6, Error df: 152
## R-squared:  0.0875

modelEffectSizes(model.2) # PRE & SSR

## lm(formula = usnews$gradRate ~ accptRate, data = usnews)
## 
## Coefficients
##                   SSR df pEta-sqr dR-sqr
## (Intercept) 56502.330  1   0.5604     NA
## accptRate    4252.344  1   0.0875 0.0875
## 
## Sum of squared errors (SSE): 44326.6
## Sum of squared total  (SST): 48579.0

confint(model.2, level=0.95) # confidence intervals

##                 2.5 %    97.5 %
## (Intercept)  78.23392 104.11624
## accptRate   -50.62075 -16.10017

Intercept: When accptRate is zero, gradRate is 91.18. $\beta_1$: A one unit increase in accptRate corresponds with a -33.36 decrease in gradRate.

Using accptRate, satTotal, and top10HS as predictors.

model.3 <- lm(usnews$gradRate ~ accptRate + satTotal + top10HS, data=usnews)
mcSummary(model.3)

## lm(formula = usnews$gradRate ~ accptRate + satTotal + top10HS, 
##     data = usnews)
## 
## Omnibus ANOVA
##                  SS  df       MS EtaSq      F p
## Model      12897.23   3 4299.077 0.265 18.073 0
## Error      35681.76 150  237.878               
## Corr Total 48578.99 153  317.510               
## 
##    RMSE AdjEtaSq
##  15.423    0.251
## 
## Coefficients
##                Est  StErr      t   SSR(3) EtaSq   tol  CI_2.5 CI_97.5     p
## (Intercept)  0.810 21.591  0.038    0.335 0.000    NA -41.852  43.472 0.970
## accptRate   -4.530  9.276 -0.488   56.744 0.002 0.724 -22.859  13.798 0.626
## satTotal     0.064  0.021  3.002 2143.535 0.057 0.299   0.022   0.105 0.003
## top10HS      0.119  0.122  0.973  225.187 0.006 0.297  -0.123   0.361 0.332

modelSummary(model.3) # slopes

## lm(formula = usnews$gradRate ~ accptRate + satTotal + top10HS, 
##     data = usnews)
## Observations: 154
## 
## Linear model fit by least squares
## 
## Coefficients:
##             Estimate       SE      t Pr(>|t|)   
## (Intercept)  0.81009 21.59125  0.038  0.97012   
## accptRate   -4.53040  9.27583 -0.488  0.62597   
## satTotal     0.06358  0.02118  3.002  0.00314 **
## top10HS      0.11915  0.12246  0.973  0.33214   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Sum of squared errors (SSE): 35681.8, Error df: 150
## R-squared:  0.2655

modelEffectSizes(model.3) # PRE & SSR

## lm(formula = usnews$gradRate ~ accptRate + satTotal + top10HS, 
##     data = usnews)
## 
## Coefficients
##                   SSR df pEta-sqr dR-sqr
## (Intercept)    0.3349  1   0.0000     NA
## accptRate     56.7444  1   0.0016 0.0012
## satTotal    2143.5352  1   0.0567 0.0441
## top10HS      225.1870  1   0.0063 0.0046
## 
## Sum of squared errors (SSE): 35681.8
## Sum of squared total  (SST): 48579.0

confint(model.3, level=0.95) # confidence intervals

##                    2.5 %     97.5 %
## (Intercept) -41.85217538 43.4723494
## accptRate   -22.85856338 13.7977606
## satTotal      0.02173026  0.1054332
## top10HS      -0.12282002  0.3611152

Intercept: When accptRate, satTotal, and top10HS are zero, gradRate is 0.810. $\beta_1$: When satTotal and top10HS are held constant, a one unit increase in accptRate corresponds with a -4.53 decrease in gradRate. $\beta_2$: When accptRate and top10HS are held constant, a one unit increase in satTotal corresponds with a 0.06 increase in gradRate. $\beta_3$: When accptRate and satTotal are held constant, a one unit increase in top10HS corresponds with a 0.12 increase in gradRate.

Using accptRate, numFulltime, and sfRatio as predictors.

model.4 <- lm(usnews$gradRate ~ accptRate + numFulltime + sfRatio, data=usnews)
mcSummary(model.4)

## lm(formula = usnews$gradRate ~ accptRate + numFulltime + sfRatio, 
##     data = usnews)
## 
## Omnibus ANOVA
##                  SS  df       MS EtaSq      F p
## Model       9869.86   3 3289.953 0.203 12.749 0
## Error      38709.13 150  258.061               
## Corr Total 48578.99 153  317.510               
## 
##    RMSE AdjEtaSq
##  16.064    0.187
## 
## Coefficients
##                 Est StErr      t    SSR(3) EtaSq   tol  CI_2.5 CI_97.5     p
## (Intercept) 112.003 7.611 14.717 55890.821 0.591    NA  96.965 127.041 0.000
## accptRate   -31.981 8.363 -3.824  3773.952 0.089 0.966 -48.506 -15.457 0.000
## numFulltime   0.000 0.000 -0.528    71.972 0.002 0.865  -0.001   0.000 0.598
## sfRatio      -1.533 0.367 -4.178  4505.398 0.104 0.881  -2.257  -0.808 0.000

modelSummary(model.4) # slopes

## lm(formula = usnews$gradRate ~ accptRate + numFulltime + sfRatio, 
##     data = usnews)
## Observations: 154
## 
## Linear model fit by least squares
## 
## Coefficients:
##               Estimate         SE      t Pr(>|t|)    
## (Intercept) 112.003109   7.610636 14.717  < 2e-16 ***
## accptRate   -31.981274   8.362934 -3.824 0.000192 ***
## numFulltime  -0.000131   0.000248 -0.528 0.598208    
## sfRatio      -1.532601   0.366795 -4.178 4.97e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Sum of squared errors (SSE): 38709.1, Error df: 150
## R-squared:  0.2032

modelEffectSizes(model.4) # PRE & SSR

## lm(formula = usnews$gradRate ~ accptRate + numFulltime + sfRatio, 
##     data = usnews)
## 
## Coefficients
##                    SSR df pEta-sqr dR-sqr
## (Intercept) 55890.8211  1   0.5908     NA
## accptRate    3773.9524  1   0.0888 0.0777
## numFulltime    71.9715  1   0.0019 0.0015
## sfRatio      4505.3984  1   0.1043 0.0927
## 
## Sum of squared errors (SSE): 38709.1
## Sum of squared total  (SST): 48579.0

confint(model.4, level=0.95) # confidence intervals

##                     2.5 %        97.5 %
## (Intercept)  9.696521e+01 127.041004210
## accptRate   -4.850564e+01 -15.456909179
## numFulltime -6.211088e-04   0.000359121
## sfRatio     -2.257354e+00  -0.807848755

Intercept: When accptRate, numFulltime, and sfRatio are zero, gradRate is 112. $\beta_1$: When numFulltime and sfRatio are held constant, a one unit increase in accptRate corresponds with a -31.98 decrease in gradRate. $\beta_2$: When accptRate and sfRatio are held constant, a one unit increase in numFulltime corresponds with a >-0.001 decrease in gradRate. $\beta_3$: When accptRate and numFulltime are held constant, a one unit increase in sfRatio corresponds with a -1.53 decrease in gradRate.

Based on the three different interpretations you have given for the three slopes for accptRate:

Summarize what you think the role of that variable is in affecting gradRate overall.

Based on the interpretations above for accptRate, acceptRate corresponds with a decrease in gradRate. Though the decrease in gradRate corresponding with accptRate is varied for each of the models, all models broadly indicate that when acceptance rates for universities go up, graduation rates go down.

Provide a complete statistical test of the effect of accptRate controlling for numFulltime and sfRatio. Specify Model A, Model C, n-PA, PA-PC, and the null hypothesis. Provide PRE, F, and p. Provide a statistical conclusion and a news summary.

Model A:$ gradRate_i = _0 + _1acceptRate + _2numFulltime + _3sfRatio$ Model C:$ gradRate_i = _0 + _2numFulltime + _3sfRatio$

n-PA: 154 - 4 = 150 PA-PC: 4-3 = 1 $H_0:\beta_1 = 0$

model.5 <- lm(gradRate ~ accptRate + numFulltime + sfRatio, data=usnews)
mcSummary(model.5)

## lm(formula = gradRate ~ accptRate + numFulltime + sfRatio, data = usnews)
## 
## Omnibus ANOVA
##                  SS  df       MS EtaSq      F p
## Model       9869.86   3 3289.953 0.203 12.749 0
## Error      38709.13 150  258.061               
## Corr Total 48578.99 153  317.510               
## 
##    RMSE AdjEtaSq
##  16.064    0.187
## 
## Coefficients
##                 Est StErr      t    SSR(3) EtaSq   tol  CI_2.5 CI_97.5     p
## (Intercept) 112.003 7.611 14.717 55890.821 0.591    NA  96.965 127.041 0.000
## accptRate   -31.981 8.363 -3.824  3773.952 0.089 0.966 -48.506 -15.457 0.000
## numFulltime   0.000 0.000 -0.528    71.972 0.002 0.865  -0.001   0.000 0.598
## sfRatio      -1.533 0.367 -4.178  4505.398 0.104 0.881  -2.257  -0.808 0.000

We wanted to test the effect of university acceptance rate (accptRate) on graduation rate (gradRate) in a sample of public and private universities across the US. We found that when we accounted for the student to faculty ratio (sfRatio) and percent students in the top 10% of their HS class (top10HS), acceptance rate was had a significant effect on university graduation rates; t(150) = -3.82 , PRE = 0.089, p < 0.001; reject the null hypothesis.

Based on question #2, one might surmise that, as a predictor, accptRate is redundant with the other predictor variables. Estimate the following models and discuss the extent to which accptRate is redundant with the other predictors in the above models 2.b and 2.c. You can examine this question by looking at the PRE (or eta-squared) that you get when you use the two other variables (i.e., satTotal & top10HS as a set; and numFulltime & sfRatio as a set) to make predictions for accptRate (in MODEL A) instead of simply predicting the mean (MODEL C).

Predict accptRate with satTotal and top10HS.

model.6 <- lm(accptRate ~ satTotal + top10HS, data=usnews)
mcSummary(model.6)

## lm(formula = accptRate ~ satTotal + top10HS, data = usnews)
## 
## Omnibus ANOVA
##               SS  df    MS EtaSq      F p
## Model      1.056   2 0.528 0.276 28.843 0
## Error      2.765 151 0.018               
## Corr Total 3.821 153 0.025               
## 
##   RMSE AdjEtaSq
##  0.135    0.267
## 
## Coefficients
##                Est StErr      t SSR(3) EtaSq   tol CI_2.5 CI_97.5     p
## (Intercept)  1.204 0.162  7.427  1.010 0.268    NA  0.884   1.524 0.000
## satTotal     0.000 0.000 -2.087  0.080 0.028 0.307 -0.001   0.000 0.039
## top10HS     -0.002 0.001 -2.311  0.098 0.034 0.307 -0.005   0.000 0.022

Predict accptRate with numFulltime and sfRatio.

model.7 <- lm(accptRate ~ numFulltime + sfRatio, data=usnews)
mcSummary(model.7)

## lm(formula = accptRate ~ numFulltime + sfRatio, data = usnews)
## 
## Omnibus ANOVA
##               SS  df    MS EtaSq     F     p
## Model      0.131   2 0.066 0.034 2.682 0.072
## Error      3.690 151 0.024                  
## Corr Total 3.821 153 0.025                  
## 
##   RMSE AdjEtaSq
##  0.156    0.022
## 
## Coefficients
##               Est StErr      t SSR(3) EtaSq   tol CI_2.5 CI_97.5     p
## (Intercept) 0.686 0.049 14.073  4.840 0.567    NA  0.589   0.782 0.000
## numFulltime 0.000 0.000 -2.205  0.119 0.031 0.892  0.000   0.000 0.029
## sfRatio     0.005 0.004  1.392  0.047 0.013 0.892 -0.002   0.012 0.166

Compare these results (PRE as defined above) to the tolerances for accptRate in models 2b and 2c. Can you explain what a tolerance is? (We will discuss in class.)

In model 2a, satTotal accounted for 2.8% of variance, while top10HS accounted for 3.4%, for a total of 6.2% variance accounted for when both variables are included in the model. In model 2b, numFulltime accounted for 3.1% variance, while sfRatio accounted for 1.3% variance, for a total of 4.4% variance accounted for when both variables are included in the model. Tolerance, or the variation not explained by inclusion of our predictor variables (1-PRE), is 0.97 for satTotal, 0.97 for top10HS, 0.969 for numFulltime, and 0.99 for sfRatio. Given that we want our predictors to explain as much variance as possible in our model, model 2a with satTotal and top10HS accounts for more variance, thus displaying a higher proportional reduction in error and lower tolerance than model 2b.

Using your own dataset, think about a 1-df multiple-regression type question in which you examine the effect of one focal predictor while controlling for at least one other variable.

CLEARLY specify the question in language that makes sense to a smart person who may not know your particular field (like your TAs).

Greenhouse gas emissions have a number of contributors, one being individual household CO2 emissions per year. The emission of greenhouse gasses contribute to today’s environmental crisis by exacerbating climate change, which has been found to have a number of negative consequences such as depletion of the Earth’s ozone layer, extreme weather, disruptions in food and water supplies, rising sea levels, and ecological instability, to name a few. When it comes to annual household carbon dioxide emissions in the US, a number of known contributors have been identified, such as how much food is wasted in the household, as well as overall generated waste that often ends up in landfills. For this dataset I wanted to know, does the amount of waste a household generates per year in pounds (WastePounds) have an effect on annual household CO2 emissions in tons (EmissionsTons), controlling for food waste per year in pounds (FoodPounds)?

Specify Models A & C and the H0.

Model A: $EmissionsTons_i = \beta_0 + \beta_1FoodPounds + \beta_2WastePounds + \epsilon_i$ Model C: $EmissionsTons_i = \beta_0 + \beta_1FoodPounds+ \epsilon_i$ Null hypothesis: $\beta_2 = 0$

Estimate Model A (provide the code and output)

model.8 <- lm(EmissionsTons ~ FoodPounds + WastePounds, data=co2emissions)
mcSummary(model.8)

## lm(formula = EmissionsTons ~ FoodPounds + WastePounds, data = co2emissions)
## 
## Omnibus ANOVA
##                 SS  df    MS EtaSq     F     p
## Model        0.253   2 0.126 0.002 0.293 0.746
## Error      106.567 247 0.431                  
## Corr Total 106.820 249 0.429                  
## 
##   RMSE AdjEtaSq
##  0.657   -0.006
## 
## Coefficients
##               Est StErr     t SSR(3) EtaSq   tol CI_2.5 CI_97.5     p
## (Intercept) 5.511 2.663 2.069  1.848 0.017    NA  0.266  10.756 0.040
## FoodPounds  0.001 0.001 0.740  0.236 0.002 0.999 -0.002   0.004 0.460
## WastePounds 0.000 0.000 0.169  0.012 0.000 0.999  0.000   0.000 0.866

Carefully interpret the meaning of the coefficient in which you are interested (don’t worry about significance, just explain what the slope means.)

$\beta_2$: When food in pounds (FoodPounds) is held constant, a one unit increase in household waste in pounds (WastePounds) corresponds with a 0.001 increase in household CO2 emissions in tons.

Graduate Statistics Homework 9

Sarah Coffin

October 26, 2021

Variable Type Label