setwd("~/Desktop/Grad Stats/Week 9")
getwd()## [1] "/Users/sarahcoffin/Desktop/Grad Stats/Week 9"
library(prettydoc)
library(car)## Loading required package: carData
library(psych)##
## Attaching package: 'psych'
## The following object is masked from 'package:car':
##
## logit
library(lmSupport)
library(pwr)
source('http://psych.colorado.edu/~jclab/R/mcSummaryLm.R')
heart <-read.csv("heart.csv", na.strings=".")
usnews <-read.csv("usnews.csv", na.strings=".")
co2emissions <-read.csv("co2emissions.csv", na.strings=".")Homework 09
We’re reverting to the format of older assignments for a quick tutorial. I want to cover a few things that are helpful for multiple regression. This tutorial will use the example dataset, heart.csv.
As before, open R, read in the dataset. Open the packages psych, car, and lmSupport and activate mcSummary. Then we’ll load in the data and take a look at rows 71-80.
library(psych)
library(car)
library(lmSupport)
source('http://psych.colorado.edu/~jclab/R/mcSummaryLm.R')
heart <-read.csv("heart.csv", header=T, na.strings=".")
describe(heart)## vars n mean sd median trimmed mad min max range skew
## Height 1 110 171.58 16.08 172.5 173.03 11.12 68 195 127 -3.41
## Weight 2 110 66.33 15.16 63.0 65.19 11.86 27 110 83 0.69
## Age 3 110 20.56 3.92 20.0 19.78 1.48 18 45 27 3.98
## Gender 4 110 1.46 0.50 1.0 1.45 0.00 1 2 1 0.14
## Smokes 5 110 1.90 0.30 2.0 2.00 0.00 1 2 1 -2.63
## Alcohol 6 110 1.38 0.49 1.0 1.35 0.00 1 2 1 0.48
## Exercise 7 110 2.21 0.65 2.0 2.26 0.00 1 3 2 -0.23
## Ran 8 110 1.58 0.50 2.0 1.60 0.00 1 2 1 -0.33
## Pulse1 9 109 75.69 13.30 76.0 75.02 10.38 47 145 98 1.47
## Pulse2 10 109 96.80 31.57 84.0 93.83 23.72 56 176 120 0.77
## Year 11 110 95.63 1.75 96.0 95.66 1.48 93 98 5 -0.30
## kurtosis se
## Height 18.34 1.53
## Weight 0.49 1.45
## Age 19.02 0.37
## Gender -2.00 0.05
## Smokes 4.96 0.03
## Alcohol -1.79 0.05
## Exercise -0.75 0.06
## Ran -1.91 0.05
## Pulse1 6.18 1.27
## Pulse2 -0.67 3.02
## Year -1.18 0.17
heart[71:80,]## Height Weight Age Gender Smokes Alcohol Exercise Ran Pulse1 Pulse2 Year
## 71 186 87.0 23 1 2 1 1 1 49 83 97
## 72 190 82.0 19 1 2 2 2 2 76 73 97
## 73 179 80.0 20 1 2 1 2 1 145 155 97
## 74 165 48.0 19 2 2 2 3 2 83 84 97
## 75 172 53.0 20 1 2 2 3 1 72 136 97
## 76 173 64.0 20 2 2 1 2 2 NA NA 97
## 77 170 53.5 20 2 2 2 3 2 60 62 97
## 78 170 58.5 20 1 2 2 3 2 80 82 97
## 79 163 51.0 18 2 2 1 3 1 70 120 97
## 80 191 78.0 19 1 2 1 1 1 68 136 97
Many datasets have missing cells. Look at the heart data (all 110 cases). If you check, you’ll see that observation 76 is missing data for the variables Pulse1 and Pulse2. Often this is no big deal. If we were to run a simple regression predicting Height as a function of Pulse1, obs 76 would naturally be left out. In our consideration of multiple regression, however, it will be nice to have a set of data with no missing cases. The reason is that we will look at models with different but partially overlapping sets of predictors. For example:
lm(Y ~ A) lm(Y ~ B) lm(Y ~ A + B)
We want to focus our attention on the differences between those models. But if case 23 is missing data on predictor A, and case 37 is missing data on predictor B, our results will change across the three models in part because the analyses use different subsets of data. That will distract us. By excluding any case with missing data (called “listwise deletion”), we can standardize the dataset for all models. We can do this with the following code.
heart <- na.omit(heart)You wouldn’t generally use listwise deletion on your own data, but we’re doing it here for didactic purposes. It is, however, very important to know how to handle missing values in your own datasets. It’s a common issue.
Here we’ll use na.omit, then look at the dataset again. Notice that obs 76 is gone.
heart <- na.omit(heart)
heart[71:80,]## Height Weight Age Gender Smokes Alcohol Exercise Ran Pulse1 Pulse2 Year
## 71 186 87.0 23 1 2 1 1 1 49 83 97
## 72 190 82.0 19 1 2 2 2 2 76 73 97
## 73 179 80.0 20 1 2 1 2 1 145 155 97
## 74 165 48.0 19 2 2 2 3 2 83 84 97
## 75 172 53.0 20 1 2 2 3 1 72 136 97
## 77 170 53.5 20 2 2 2 3 2 60 62 97
## 78 170 58.5 20 1 2 2 3 2 80 82 97
## 79 163 51.0 18 2 2 1 3 1 70 120 97
## 80 191 78.0 19 1 2 1 1 1 68 136 97
## 81 172 59.0 18 2 2 1 2 1 78 129 97
Now, we’re going to begin running some multiple regressions. We’ll start by predicting Pulse2 from two simultaneous predictors, Age and Pulse1. The structure is a simple generalization of what you did with simple regression in Chapter 5.
model.1 <- lm(heart$Pulse2 ~ Age + Pulse1, data=heart)So, we’ll run that model and then we’ll look at the results with mcSummary. That function was designed to give you a lot of the information you will need for multiple regression and ANOVA, and to put it in a succinct-but-ugly form. Below, you can see an overall, multi-df or “omnibus” test of the set of predictors (the ANOVA section). We also see estimates of each of three parameters in the model (intercept, age, and pulse1), and each test is associated with a single df.
Can you specify the 4 model comparisons we’re looking at, here? (Please test yourself. You will need to be able to.)
model.1 <- lm(heart$Pulse2 ~ Age + Pulse1, data=heart)
mcSummary(model.1)## lm(formula = heart$Pulse2 ~ Age + Pulse1, data = heart)
##
## Omnibus ANOVA
## SS df MS EtaSq F p
## Model 14991.09 2 7495.545 0.139 8.577 0
## Error 92634.47 106 873.910
## Corr Total 107625.56 108 996.533
##
## RMSE AdjEtaSq
## 29.562 0.123
##
## Coefficients
## Est StErr t SSR(3) EtaSq tol CI_2.5 CI_97.5 p
## (Intercept) 49.775 23.934 2.080 3779.511 0.039 NA 2.322 97.227 0.040
## Age -0.752 0.730 -1.030 927.860 0.010 0.979 -2.199 0.695 0.305
## Pulse1 0.826 0.216 3.818 12740.732 0.121 0.979 0.397 1.254 0.000
Four Model Comparisons: 1) $Pulse_2i = 49.76 + -0.75X_1age + + 0.86X_2pulse1 + Error $ 2) $Pulse_2i = 49.76 + -0.75X_1age + Error $ 3) $Pulse_2i = 49.76 + -0.86X_1pulse1 + Error $ 4) $Pulse_2i = 49.76 + Error $ 5) $Pulse_2i = B_0 + Error $
Existing R packages (even lmSupport) do provide this information in a more piecemeal fashion. Compare the output of the real packages you can download from cran. You need to be able to find useful information using this kind of “official” output. Make sure you can interpret this stuff.
modelSummary(model.1) # slopes## lm(formula = heart$Pulse2 ~ Age + Pulse1, data = heart)
## Observations: 109
##
## Linear model fit by least squares
##
## Coefficients:
## Estimate SE t Pr(>|t|)
## (Intercept) 49.7746 23.9345 2.080 0.039973 *
## Age -0.7519 0.7298 -1.030 0.305165
## Pulse1 0.8256 0.2162 3.818 0.000227 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Sum of squared errors (SSE): 92634.5, Error df: 106
## R-squared: 0.1393
modelEffectSizes(model.1) # PRE & SSR## lm(formula = heart$Pulse2 ~ Age + Pulse1, data = heart)
##
## Coefficients
## SSR df pEta-sqr dR-sqr
## (Intercept) 3779.5113 1 0.0392 NA
## Age 927.8601 1 0.0099 0.0086
## Pulse1 12740.7316 1 0.1209 0.1184
##
## Sum of squared errors (SSE): 92634.5
## Sum of squared total (SST): 107625.6
confint(model.1, level=0.95) # confidence intervals## 2.5 % 97.5 %
## (Intercept) 2.3222401 97.2270500
## Age -2.1987386 0.6948625
## Pulse1 0.3969251 1.2543255
Important but esoteric note: mcSummary and modelEffectSizes provide what is called “Type III sums of squares,” which in these models are equivalent to Type II SS. If you use the simple anova() function, it provides Type I SS. These different kinds of SS ask different questions. The important point for us is that Type III SS for a given predictor partials out any variability that can be accounted for by any other predictor. That is often what we want. We will discuss this further in class. For now, to preserve your sanity, use mcSummary or modelEffectSizes to get the sum of squares for the homeworks.
Questions
- From Glass & Hopkins we have the following variables. READ = reading score at end of first-grade RRDEV = reading readiness test from previous/kindergarten year [entered in mean deviation form, mean = 49.0] IQDEV = IQ score on standardized test from previous year [entered in mean deviation form, mean = 102.8]
We estimate a multiple regression model and obtain the following estimates.
READ^ = 26.0 + .362RRDEV + .180IQDEV
- Interpret all 3 coefficients carefully in the style we used for the height, age, and weight example discussed in class (see also pp. 110-111 of the textbook).
Intercept: When RRDEV and IQDEV are average, reading score at the end of first grade (READ) is 26. \(\beta_1\): Controlling for RRDEV, a one unit increase in IQ corresponds with a .180 increase in READ. \(\beta_1\)Controlling for IQDEV, a one unit increase in RRDEV corresponds with a .362 increase in READ.
- Find the predicted reading score for a pupil with a RR score (not in mean deviation form) of 55 and an IQ of 120.
READ^ = 26.0 + ((.36255) - 49) + .180120.
The predicted reading score for a pupil with a RR score (not in mean deviation form) of 55 and an IQ of 120 is 18.51.
The dataset usnews.csv contains information on most of the colleges and universities in the United States. Parents, students, alumni, and state legislators are increasingly interested in a school’s graduation rate, the percentage of entering students who graduate from that school within 6 years. In this homework assignment you will explore some of the variables that might be related to a school’s graduation rate. The following variables (among many others) are available:
Variable Type Label
accptRate num Acceptance Rate: proportion 0-1
satTotal int Combined-SAT (Sum)
gradRate int Graduation Rate: percent 0-100 inTuition int In-state Tuition
numFulltime int Number of Fulltime Students sfRatio num Student/Faculty Ratio
top10HS int Percent in Top 10% of HS Class
Filter the data using na.omit. Then estimate the following regression models predicting gradRate. Make sure you obtain Type II/III sums of squares (mcSummary ensures Type III SS). In each of the 3 regression models, provide a detailed interpretation for the accptRate slope (e.g., for each additional inch of height, we predict a 3.9 pound increase in weight).
usnews <- na.omit(usnews)- Using accptRate as the only predictor.
model.2 <- lm(usnews$gradRate ~ accptRate, data=usnews)
mcSummary(model.2)## lm(formula = usnews$gradRate ~ accptRate, data = usnews)
##
## Omnibus ANOVA
## SS df MS EtaSq F p
## Model 4252.344 1 4252.344 0.088 14.582 0
## Error 44326.649 152 291.623
## Corr Total 48578.994 153 317.510
##
## RMSE AdjEtaSq
## 17.077 0.082
##
## Coefficients
## Est StErr t SSR(3) EtaSq tol CI_2.5 CI_97.5 p
## (Intercept) 91.175 6.550 13.919 56502.330 0.560 NA 78.234 104.116 0
## accptRate -33.360 8.736 -3.819 4252.344 0.088 NA -50.621 -16.100 0
modelSummary(model.2) # slopes## lm(formula = usnews$gradRate ~ accptRate, data = usnews)
## Observations: 154
##
## Linear model fit by least squares
##
## Coefficients:
## Estimate SE t Pr(>|t|)
## (Intercept) 91.175 6.550 13.919 < 2e-16 ***
## accptRate -33.360 8.736 -3.819 0.000195 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Sum of squared errors (SSE): 44326.6, Error df: 152
## R-squared: 0.0875
modelEffectSizes(model.2) # PRE & SSR## lm(formula = usnews$gradRate ~ accptRate, data = usnews)
##
## Coefficients
## SSR df pEta-sqr dR-sqr
## (Intercept) 56502.330 1 0.5604 NA
## accptRate 4252.344 1 0.0875 0.0875
##
## Sum of squared errors (SSE): 44326.6
## Sum of squared total (SST): 48579.0
confint(model.2, level=0.95) # confidence intervals## 2.5 % 97.5 %
## (Intercept) 78.23392 104.11624
## accptRate -50.62075 -16.10017
Intercept: When accptRate is zero, gradRate is 91.18. \(\beta_1\): A one unit increase in accptRate corresponds with a -33.36 decrease in gradRate.
- Using accptRate, satTotal, and top10HS as predictors.
model.3 <- lm(usnews$gradRate ~ accptRate + satTotal + top10HS, data=usnews)
mcSummary(model.3)## lm(formula = usnews$gradRate ~ accptRate + satTotal + top10HS,
## data = usnews)
##
## Omnibus ANOVA
## SS df MS EtaSq F p
## Model 12897.23 3 4299.077 0.265 18.073 0
## Error 35681.76 150 237.878
## Corr Total 48578.99 153 317.510
##
## RMSE AdjEtaSq
## 15.423 0.251
##
## Coefficients
## Est StErr t SSR(3) EtaSq tol CI_2.5 CI_97.5 p
## (Intercept) 0.810 21.591 0.038 0.335 0.000 NA -41.852 43.472 0.970
## accptRate -4.530 9.276 -0.488 56.744 0.002 0.724 -22.859 13.798 0.626
## satTotal 0.064 0.021 3.002 2143.535 0.057 0.299 0.022 0.105 0.003
## top10HS 0.119 0.122 0.973 225.187 0.006 0.297 -0.123 0.361 0.332
modelSummary(model.3) # slopes## lm(formula = usnews$gradRate ~ accptRate + satTotal + top10HS,
## data = usnews)
## Observations: 154
##
## Linear model fit by least squares
##
## Coefficients:
## Estimate SE t Pr(>|t|)
## (Intercept) 0.81009 21.59125 0.038 0.97012
## accptRate -4.53040 9.27583 -0.488 0.62597
## satTotal 0.06358 0.02118 3.002 0.00314 **
## top10HS 0.11915 0.12246 0.973 0.33214
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Sum of squared errors (SSE): 35681.8, Error df: 150
## R-squared: 0.2655
modelEffectSizes(model.3) # PRE & SSR## lm(formula = usnews$gradRate ~ accptRate + satTotal + top10HS,
## data = usnews)
##
## Coefficients
## SSR df pEta-sqr dR-sqr
## (Intercept) 0.3349 1 0.0000 NA
## accptRate 56.7444 1 0.0016 0.0012
## satTotal 2143.5352 1 0.0567 0.0441
## top10HS 225.1870 1 0.0063 0.0046
##
## Sum of squared errors (SSE): 35681.8
## Sum of squared total (SST): 48579.0
confint(model.3, level=0.95) # confidence intervals## 2.5 % 97.5 %
## (Intercept) -41.85217538 43.4723494
## accptRate -22.85856338 13.7977606
## satTotal 0.02173026 0.1054332
## top10HS -0.12282002 0.3611152
Intercept: When accptRate, satTotal, and top10HS are zero, gradRate is 0.810. \(\beta_1\): When satTotal and top10HS are held constant, a one unit increase in accptRate corresponds with a -4.53 decrease in gradRate. \(\beta_2\): When accptRate and top10HS are held constant, a one unit increase in satTotal corresponds with a 0.06 increase in gradRate. \(\beta_3\): When accptRate and satTotal are held constant, a one unit increase in top10HS corresponds with a 0.12 increase in gradRate.
- Using accptRate, numFulltime, and sfRatio as predictors.
model.4 <- lm(usnews$gradRate ~ accptRate + numFulltime + sfRatio, data=usnews)
mcSummary(model.4)## lm(formula = usnews$gradRate ~ accptRate + numFulltime + sfRatio,
## data = usnews)
##
## Omnibus ANOVA
## SS df MS EtaSq F p
## Model 9869.86 3 3289.953 0.203 12.749 0
## Error 38709.13 150 258.061
## Corr Total 48578.99 153 317.510
##
## RMSE AdjEtaSq
## 16.064 0.187
##
## Coefficients
## Est StErr t SSR(3) EtaSq tol CI_2.5 CI_97.5 p
## (Intercept) 112.003 7.611 14.717 55890.821 0.591 NA 96.965 127.041 0.000
## accptRate -31.981 8.363 -3.824 3773.952 0.089 0.966 -48.506 -15.457 0.000
## numFulltime 0.000 0.000 -0.528 71.972 0.002 0.865 -0.001 0.000 0.598
## sfRatio -1.533 0.367 -4.178 4505.398 0.104 0.881 -2.257 -0.808 0.000
modelSummary(model.4) # slopes## lm(formula = usnews$gradRate ~ accptRate + numFulltime + sfRatio,
## data = usnews)
## Observations: 154
##
## Linear model fit by least squares
##
## Coefficients:
## Estimate SE t Pr(>|t|)
## (Intercept) 112.003109 7.610636 14.717 < 2e-16 ***
## accptRate -31.981274 8.362934 -3.824 0.000192 ***
## numFulltime -0.000131 0.000248 -0.528 0.598208
## sfRatio -1.532601 0.366795 -4.178 4.97e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Sum of squared errors (SSE): 38709.1, Error df: 150
## R-squared: 0.2032
modelEffectSizes(model.4) # PRE & SSR## lm(formula = usnews$gradRate ~ accptRate + numFulltime + sfRatio,
## data = usnews)
##
## Coefficients
## SSR df pEta-sqr dR-sqr
## (Intercept) 55890.8211 1 0.5908 NA
## accptRate 3773.9524 1 0.0888 0.0777
## numFulltime 71.9715 1 0.0019 0.0015
## sfRatio 4505.3984 1 0.1043 0.0927
##
## Sum of squared errors (SSE): 38709.1
## Sum of squared total (SST): 48579.0
confint(model.4, level=0.95) # confidence intervals## 2.5 % 97.5 %
## (Intercept) 9.696521e+01 127.041004210
## accptRate -4.850564e+01 -15.456909179
## numFulltime -6.211088e-04 0.000359121
## sfRatio -2.257354e+00 -0.807848755
Intercept: When accptRate, numFulltime, and sfRatio are zero, gradRate is 112. \(\beta_1\): When numFulltime and sfRatio are held constant, a one unit increase in accptRate corresponds with a -31.98 decrease in gradRate. \(\beta_2\): When accptRate and sfRatio are held constant, a one unit increase in numFulltime corresponds with a >-0.001 decrease in gradRate. \(\beta_3\): When accptRate and numFulltime are held constant, a one unit increase in sfRatio corresponds with a -1.53 decrease in gradRate.
- Based on the three different interpretations you have given for the three slopes for accptRate:
- Summarize what you think the role of that variable is in affecting gradRate overall.
Based on the interpretations above for accptRate, acceptRate corresponds with a decrease in gradRate. Though the decrease in gradRate corresponding with accptRate is varied for each of the models, all models broadly indicate that when acceptance rates for universities go up, graduation rates go down.
- Provide a complete statistical test of the effect of accptRate controlling for numFulltime and sfRatio. Specify Model A, Model C, n-PA, PA-PC, and the null hypothesis. Provide PRE, F, and p. Provide a statistical conclusion and a news summary.
Model A:$ gradRate_i = _0 + _1acceptRate + _2numFulltime + _3sfRatio$ Model C:$ gradRate_i = _0 + _2numFulltime + _3sfRatio$
n-PA: 154 - 4 = 150 PA-PC: 4-3 = 1 \(H_0:\beta_1 = 0\)
model.5 <- lm(gradRate ~ accptRate + numFulltime + sfRatio, data=usnews)
mcSummary(model.5)## lm(formula = gradRate ~ accptRate + numFulltime + sfRatio, data = usnews)
##
## Omnibus ANOVA
## SS df MS EtaSq F p
## Model 9869.86 3 3289.953 0.203 12.749 0
## Error 38709.13 150 258.061
## Corr Total 48578.99 153 317.510
##
## RMSE AdjEtaSq
## 16.064 0.187
##
## Coefficients
## Est StErr t SSR(3) EtaSq tol CI_2.5 CI_97.5 p
## (Intercept) 112.003 7.611 14.717 55890.821 0.591 NA 96.965 127.041 0.000
## accptRate -31.981 8.363 -3.824 3773.952 0.089 0.966 -48.506 -15.457 0.000
## numFulltime 0.000 0.000 -0.528 71.972 0.002 0.865 -0.001 0.000 0.598
## sfRatio -1.533 0.367 -4.178 4505.398 0.104 0.881 -2.257 -0.808 0.000
We wanted to test the effect of university acceptance rate (accptRate) on graduation rate (gradRate) in a sample of public and private universities across the US. We found that when we accounted for the student to faculty ratio (sfRatio) and percent students in the top 10% of their HS class (top10HS), acceptance rate was had a significant effect on university graduation rates; t(150) = -3.82 , PRE = 0.089, p < 0.001; reject the null hypothesis.
- Based on question #2, one might surmise that, as a predictor, accptRate is redundant with the other predictor variables. Estimate the following models and discuss the extent to which accptRate is redundant with the other predictors in the above models 2.b and 2.c. You can examine this question by looking at the PRE (or eta-squared) that you get when you use the two other variables (i.e., satTotal & top10HS as a set; and numFulltime & sfRatio as a set) to make predictions for accptRate (in MODEL A) instead of simply predicting the mean (MODEL C).
- Predict accptRate with satTotal and top10HS.
model.6 <- lm(accptRate ~ satTotal + top10HS, data=usnews)
mcSummary(model.6)## lm(formula = accptRate ~ satTotal + top10HS, data = usnews)
##
## Omnibus ANOVA
## SS df MS EtaSq F p
## Model 1.056 2 0.528 0.276 28.843 0
## Error 2.765 151 0.018
## Corr Total 3.821 153 0.025
##
## RMSE AdjEtaSq
## 0.135 0.267
##
## Coefficients
## Est StErr t SSR(3) EtaSq tol CI_2.5 CI_97.5 p
## (Intercept) 1.204 0.162 7.427 1.010 0.268 NA 0.884 1.524 0.000
## satTotal 0.000 0.000 -2.087 0.080 0.028 0.307 -0.001 0.000 0.039
## top10HS -0.002 0.001 -2.311 0.098 0.034 0.307 -0.005 0.000 0.022
- Predict accptRate with numFulltime and sfRatio.
model.7 <- lm(accptRate ~ numFulltime + sfRatio, data=usnews)
mcSummary(model.7)## lm(formula = accptRate ~ numFulltime + sfRatio, data = usnews)
##
## Omnibus ANOVA
## SS df MS EtaSq F p
## Model 0.131 2 0.066 0.034 2.682 0.072
## Error 3.690 151 0.024
## Corr Total 3.821 153 0.025
##
## RMSE AdjEtaSq
## 0.156 0.022
##
## Coefficients
## Est StErr t SSR(3) EtaSq tol CI_2.5 CI_97.5 p
## (Intercept) 0.686 0.049 14.073 4.840 0.567 NA 0.589 0.782 0.000
## numFulltime 0.000 0.000 -2.205 0.119 0.031 0.892 0.000 0.000 0.029
## sfRatio 0.005 0.004 1.392 0.047 0.013 0.892 -0.002 0.012 0.166
- Compare these results (PRE as defined above) to the tolerances for accptRate in models 2b and 2c. Can you explain what a tolerance is? (We will discuss in class.)
In model 2a, satTotal accounted for 2.8% of variance, while top10HS accounted for 3.4%, for a total of 6.2% variance accounted for when both variables are included in the model. In model 2b, numFulltime accounted for 3.1% variance, while sfRatio accounted for 1.3% variance, for a total of 4.4% variance accounted for when both variables are included in the model. Tolerance, or the variation not explained by inclusion of our predictor variables (1-PRE), is 0.97 for satTotal, 0.97 for top10HS, 0.969 for numFulltime, and 0.99 for sfRatio. Given that we want our predictors to explain as much variance as possible in our model, model 2a with satTotal and top10HS accounts for more variance, thus displaying a higher proportional reduction in error and lower tolerance than model 2b.
- Using your own dataset, think about a 1-df multiple-regression type question in which you examine the effect of one focal predictor while controlling for at least one other variable.
- CLEARLY specify the question in language that makes sense to a smart person who may not know your particular field (like your TAs).
Greenhouse gas emissions have a number of contributors, one being individual household CO2 emissions per year. The emission of greenhouse gasses contribute to today’s environmental crisis by exacerbating climate change, which has been found to have a number of negative consequences such as depletion of the Earth’s ozone layer, extreme weather, disruptions in food and water supplies, rising sea levels, and ecological instability, to name a few. When it comes to annual household carbon dioxide emissions in the US, a number of known contributors have been identified, such as how much food is wasted in the household, as well as overall generated waste that often ends up in landfills. For this dataset I wanted to know, does the amount of waste a household generates per year in pounds (WastePounds) have an effect on annual household CO2 emissions in tons (EmissionsTons), controlling for food waste per year in pounds (FoodPounds)?
- Specify Models A & C and the H0.
Model A: \(EmissionsTons_i = \beta_0 + \beta_1FoodPounds + \beta_2WastePounds + \epsilon_i\) Model C: \(EmissionsTons_i = \beta_0 + \beta_1FoodPounds+ \epsilon_i\) Null hypothesis: \(\beta_2 = 0\)
- Estimate Model A (provide the code and output)
model.8 <- lm(EmissionsTons ~ FoodPounds + WastePounds, data=co2emissions)
mcSummary(model.8)## lm(formula = EmissionsTons ~ FoodPounds + WastePounds, data = co2emissions)
##
## Omnibus ANOVA
## SS df MS EtaSq F p
## Model 0.253 2 0.126 0.002 0.293 0.746
## Error 106.567 247 0.431
## Corr Total 106.820 249 0.429
##
## RMSE AdjEtaSq
## 0.657 -0.006
##
## Coefficients
## Est StErr t SSR(3) EtaSq tol CI_2.5 CI_97.5 p
## (Intercept) 5.511 2.663 2.069 1.848 0.017 NA 0.266 10.756 0.040
## FoodPounds 0.001 0.001 0.740 0.236 0.002 0.999 -0.002 0.004 0.460
## WastePounds 0.000 0.000 0.169 0.012 0.000 0.999 0.000 0.000 0.866
- Carefully interpret the meaning of the coefficient in which you are interested (don’t worry about significance, just explain what the slope means.)
\(\beta_2\): When food in pounds (FoodPounds) is held constant, a one unit increase in household waste in pounds (WastePounds) corresponds with a 0.001 increase in household CO2 emissions in tons.