In this problem we will consider several studies investigating the link between blood pressure, age (in years), and weight (in kilograms).
The prediction is a blood pressure of 121.51.
\(0.25 \pm 2.06 \times 0.11 = (0.0234, 0.4766)\). This is a 95% confidence interval for the age coefficient.
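The interval arithmetic can be checked with a few lines of R. This is only a sketch that reuses the three numbers quoted above (estimate 0.25, standard error 0.11, critical value 2.06); the variable names are illustrative and not part of the original problem.

```r
# Numbers taken from the answer above; names are hypothetical.
estimate  <- 0.25   # estimated age coefficient
std_error <- 0.11   # its standard error
t_star    <- 2.06   # critical value used for 95% confidence

margin <- t_star * std_error
ci <- c(lower = estimate - margin, upper = estimate + margin)
print(round(ci, 4))
```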
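\(H_0: \beta_j = 0\) vs. \(H_a: \beta_j \ne 0\)
The test statistic is \(t = 0.25 / 0.11 = 2.27\), and the two-sided p-value \(2P(T \ge 2.27)\) is between 0.02 and 0.04 (taking roughly 25 degrees of freedom, the value implied by the critical value 2.06 used above). Since the p-value is larger than our significance level of 0.01, we fail to reject the null hypothesis that there is no linear relationship between age and blood pressure.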
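A short R sketch of the same test follows. The degrees of freedom are not stated in the problem, so `df_assumed <- 25` is an assumption chosen because it matches a 95% critical value of about 2.06; with a different sample size the p-value would shift slightly.

```r
# Estimate and standard error from the answer above.
est <- 0.25
se  <- 0.11

# ASSUMPTION: 25 degrees of freedom (consistent with t* = 2.06).
df_assumed <- 25

t_stat  <- est / se
p_value <- 2 * pt(abs(t_stat), df = df_assumed, lower.tail = FALSE)

print(round(t_stat, 2))
print(p_value)
```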
Variable | Estimate | Std. Error | t-stat | p-value |
---|---|---|---|---|
Intercept | 97 | 4.1 | 23.6585 | < 0.0001 |
Age | 0.2702 | 0.14 | 1.93 | 0.0574 |
Weight | 0.9 | 1.74 | 0.517 | 0.303 |
In a hidden R chunk, we have imported the GPA data that is investigated in chapter 11.2 of your book. Here, for example, is the regression fit on page 629:
fullmodel <- lm(GPA ~ SATM + SATCR + SATW + HSM + HSS + HSE, data = gpa)
print(fullmodel)
##
## Call:
## lm(formula = GPA ~ SATM + SATCR + SATW + HSM + HSS + HSE, data = gpa)
##
## Coefficients:
## (Intercept) SATM SATCR SATW HSM
## -1.186783 0.001989 0.000157 0.000474 0.091477
## HSS HSE
## 0.130097 0.056791
For each of the following models, report the fitted model coefficients, the mean squared error (MSE), the percent of explained variation, and the \(p\)-values for tests that \(\beta_j = 0\) for all coefficients.
model1 = lm(GPA~SATM + HSS, data = gpa)
sm = summary(model1)
sm
##
## Call:
## lm(formula = GPA ~ SATM + HSS, data = gpa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9606 -0.4118 0.1637 0.5311 1.3831
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.8469136 0.5584649 -1.517 0.131540
## SATM 0.0026897 0.0007973 3.374 0.000948 ***
## HSS 0.2286058 0.0427660 5.346 3.37e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7113 on 147 degrees of freedom
## Multiple R-squared: 0.2538, Adjusted R-squared: 0.2436
## F-statistic: 25 on 2 and 147 DF, p-value: 4.517e-10
MSE = mean(sm$residuals^2)
MSE
## [1] 0.495849
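Note that `mean(sm$residuals^2)` divides the residual sum of squares by the number of observations, while many textbooks define MSE as the residual sum of squares divided by the residual degrees of freedom (the square of the residual standard error). A rough sketch of the conversion, using only numbers printed above (n = 150 follows from 147 residual degrees of freedom plus 3 coefficients):

```r
# Values copied from the model1 output above.
mse_n <- 0.495849   # mean(sm$residuals^2), i.e. RSS / n
n     <- 150        # 147 residual df + 3 estimated coefficients
df    <- 147        # residual degrees of freedom

# Degrees-of-freedom-adjusted MSE, i.e. RSS / df.
mse_df <- mse_n * n / df

# Its square root should match the reported residual standard error.
rse <- sqrt(mse_df)
print(round(c(mse_df = mse_df, rse = rse), 4))
```

On a fitted model the same quantity should be available directly as `sm$sigma^2`.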
model2 = lm(GPA~SATM + HSS+ HSM, data = gpa)
sm2 = summary(model2)
sm2
##
## Call:
## lm(formula = GPA ~ SATM + HSS + HSM, data = gpa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9958 -0.3900 0.1793 0.5199 1.2232
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.8871465 0.5564866 -1.594 0.11306
## SATM 0.0023746 0.0008195 2.898 0.00434 **
## HSS 0.1725225 0.0560094 3.080 0.00247 **
## HSM 0.0850497 0.0552022 1.541 0.12556
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.708 on 146 degrees of freedom
## Multiple R-squared: 0.2657, Adjusted R-squared: 0.2507
## F-statistic: 17.61 on 3 and 146 DF, p-value: 8.195e-10
MSE2 = mean(sm2$residuals^2)
MSE2
## [1] 0.4879162
model3 = lm(GPA~SATM + HSS+ HSM + HSE, data = gpa)
sm3 = summary(model3)
sm3
##
## Call:
## lm(formula = GPA ~ SATM + HSS + HSM + HSE, data = gpa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9937 -0.3645 0.1617 0.5143 1.3931
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.1064079 0.5973727 -1.852 0.06604 .
## SATM 0.0024008 0.0008199 2.928 0.00396 **
## HSS 0.1332323 0.0682112 1.953 0.05272 .
## HSM 0.0827049 0.0552476 1.497 0.13657
## HSE 0.0643942 0.0638155 1.009 0.31462
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.708 on 145 degrees of freedom
## Multiple R-squared: 0.2709, Adjusted R-squared: 0.2507
## F-statistic: 13.47 on 4 and 145 DF, p-value: 2.336e-09
MSE3 = mean(sm3$residuals^2)
MSE3
## [1] 0.4845139
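The percent of explained variation for each model is 100 times its multiple R-squared. The sketch below collects those values from the three summaries above and recomputes the adjusted R-squared from the usual formula; `adj_r2` is a hypothetical helper written for this illustration, and small differences in the last digit come from using the rounded R-squared values.

```r
# Hypothetical helper: adjusted R^2 for n observations and p predictors.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n  <- 150                                   # from the residual df above
r2 <- c(model1 = 0.2538, model2 = 0.2657, model3 = 0.2709)
p  <- c(2, 3, 4)                            # number of predictors per model

percent_explained <- 100 * r2
adjusted <- adj_r2(r2, n, p)

print(percent_explained)
print(round(adjusted, 3))
```

Adding HSM and then HSE raises the explained variation only from about 25.4% to 27.1%, and the adjusted R-squared barely changes between the last two models.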
Consider the hypothesis that none of the SAT related variables have a linear relationship with GPA. \[H_0: \beta_{SATM} = \beta_{SATCR} = \beta_{SATW} = 0 \quad \text{vs.} \quad H_A: \beta_{SATM} \ne 0 \text{ or } \beta_{SATCR} \ne 0 \text{ or } \beta_{SATW} \ne 0\] We can test this hypothesis by comparing the full model to the model that does not include the SAT related variables using R’s anova function. Update the following code to produce the smaller model (and set eval = TRUE):
smallmodel <- lm(GPA~ HSM + HSS + HSE, data = gpa)
anova(smallmodel, fullmodel)
## Analysis of Variance Table
##
## Model 1: GPA ~ HSM + HSS + HSE
## Model 2: GPA ~ SATM + SATCR + SATW + HSM + HSS + HSE
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 146 76.975
## 2 143 72.465 3 4.5104 2.9669 0.0341 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpret these results. At the \(\alpha = 0.05\) level would you reject the null hypothesis? What do you conclude about the contribution of the SAT variables?
Our p-value is 0.0341. At the 0.05 significance level, we reject the null hypothesis and have sufficient evidence to conclude that at least one of the SAT related variables has a linear relationship with GPA, even after accounting for the high school variables.
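The F statistic in the ANOVA table can be reproduced by hand from the two residual sums of squares. The following sketch uses only the values printed in the table; the variable names are illustrative.

```r
# Values copied from the anova() table above.
rss_small <- 76.975; df_small <- 146   # GPA ~ HSM + HSS + HSE
rss_full  <- 72.465; df_full  <- 143   # full model with the SAT variables

q <- df_small - df_full                # number of coefficients tested (3)

# Partial F statistic: drop in RSS per tested coefficient,
# relative to the full model's residual mean square.
f_stat  <- ((rss_small - rss_full) / q) / (rss_full / df_full)
p_value <- pf(f_stat, q, df_full, lower.tail = FALSE)

print(round(c(F = f_stat, p = p_value), 4))
```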
Again using the GPA data from problem 2 and the fullmodel and smallmodel, consider a new observation with values:
Variable | Value |
---|---|
HSM | 9 |
HSS | 9 |
HSE | 9 |
SATM | 630 |
SATCR | 560 |
SATW | 560 |
95% prediction interval using fullmodel:
newdata = data.frame(HSM = 9, HSS = 9, HSE = 9, SATM = 630, SATCR = 560, SATW = 560)
predict(fullmodel, newdata, interval = "prediction", level = .95)
## fit lwr upr
## 1 2.924757 1.512202 4.337312
95% prediction interval using smallmodel:
newdata1 = data.frame(HSM = 9, HSS = 9, HSE = 9)
predict(smallmodel, newdata1, interval = "prediction", level = .95)
## fit lwr upr
## 1 2.930049 1.489834 4.370264
The two models give very similar 95% prediction intervals: about (1.51, 4.34) for the full model and (1.49, 4.37) for the small model. The point predictions are nearly identical (2.92 vs. 2.93), and the small model's interval is only slightly wider.
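Because `interval = "prediction"` was used, these are prediction intervals for a single new student, not confidence intervals for the mean GPA. The sketch below illustrates the difference on simulated data (the data, the seed, and all names are made up for illustration and are unrelated to the gpa data frame).

```r
# Hypothetical simulated data, not the gpa data from the text.
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50, sd = 1)
fit <- lm(y ~ x)

new_obs <- data.frame(x = 5)

# Interval for the mean response at x = 5.
conf_int <- predict(fit, new_obs, interval = "confidence", level = 0.95)
# Interval for one new observation at x = 5.
pred_int <- predict(fit, new_obs, interval = "prediction", level = 0.95)

conf_width <- conf_int[1, "upr"] - conf_int[1, "lwr"]
pred_width <- pred_int[1, "upr"] - pred_int[1, "lwr"]
print(c(confidence = conf_width, prediction = pred_width))
```

The prediction interval is always the wider of the two, since it also accounts for the variability of an individual observation around the mean.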
While linear regression is linear in the coefficients, we can fit a polynomial curve of degree \(q\) using: \[y = \beta_0 + \beta_1 x + \beta_2 x^2 + \ldots + \beta_q x^q\] (Equivalently we can think of a set of new variables \(z_1 = x, z_2 = x^2, \ldots, z_q = x^q\).)
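As a minimal sketch of this idea for \(q = 2\), the code below fits the same quadratic two ways: with `I(x^2)` inside the formula, and with explicitly constructed variables \(z_1\) and \(z_2\). The data are simulated and every name here is hypothetical.

```r
# Hypothetical simulated data with a known quadratic trend.
set.seed(42)
x <- seq(-3, 3, length.out = 60)
y <- 1 + 2 * x - 0.5 * x^2 + rnorm(60, sd = 0.5)

# Option 1: build the squared term inside the formula.
fit_formula <- lm(y ~ x + I(x^2))

# Option 2: create the new variables z1 = x and z2 = x^2 first.
z1 <- x
z2 <- x^2
fit_newvars <- lm(y ~ z1 + z2)

print(coef(fit_formula))
print(coef(fit_newvars))
```

Both fits use the same design matrix, so the estimated coefficients agree; only the coefficient labels differ.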
Here is a plot of Body Mass Index (BMI, the ratio of weight to squared height) vs. physical activity (PA, in thousands of steps):
plot(BMI ~ PA, data = pabmi)
This plot is vaguely suggestive of a quadratic relationship between BMI and PA (i.e., \(q = 2\)).
meanPA = mean(pabmi$PA)        # mean of the raw PA values
pabmi$PA = pabmi$PA - meanPA   # center PA (vectorized, so no row count is hard-coded)
newMeanPA = mean(pabmi$PA)     # should now be numerically zero
SQUAREDPA = pabmi$PA^2         # square of the centered PA
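One common reason for centering before squaring is that a raw positive variable and its square are usually highly correlated, while the centered versions are much less so. A small sketch on simulated values (the `pa` vector is hypothetical and only loosely resembles step counts in thousands):

```r
# Hypothetical stand-in for PA; not the pabmi data.
set.seed(7)
pa <- runif(100, 2, 14)

pa_centered <- pa - mean(pa)

cor_raw      <- cor(pa, pa^2)                    # raw x vs. x^2
cor_centered <- cor(pa_centered, pa_centered^2)  # centered x vs. its square

print(c(raw = cor_raw, centered = cor_centered))
```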
simpleModel <- lm(BMI ~ PA, data = pabmi)
summary(simpleModel)
##
## Call:
## lm(formula = BMI ~ PA, data = pabmi)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.3819 -2.5636 0.2062 1.9820 8.5078
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.9390 0.3655 65.499 < 2e-16 ***
## PA -0.6547 0.1583 -4.135 7.5e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.655 on 98 degrees of freedom
## Multiple R-squared: 0.1485, Adjusted R-squared: 0.1399
## F-statistic: 17.1 on 1 and 98 DF, p-value: 7.503e-05
simpleModel1<- lm(BMI ~ SQUAREDPA, data = pabmi)
summary(simpleModel1)
##
## Call:
## lm(formula = BMI ~ SQUAREDPA, data = pabmi)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.6565 -2.5813 0.5981 2.5205 9.4950
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.51663 0.50677 46.405 <2e-16 ***
## SQUAREDPA 0.07927 0.06013 1.318 0.19
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.926 on 98 degrees of freedom
## Multiple R-squared: 0.01742, Adjusted R-squared: 0.007397
## F-statistic: 1.738 on 1 and 98 DF, p-value: 0.1905
simpleModel <- lm(BMI ~ PA, data = pabmi)
summary(simpleModel)$r.squared
## [1] 0.1485401
What is the \(R^2\) for the quadratic model? Interpret these values.
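The fit above, which contains only the squared term, has \(R^2 = 0.017\): squared PA alone explains under 2% of the variation in BMI, compared with about 14.9% for the linear model. Note that this is not the full quadratic model. A fit of BMI on both PA and SQUAREDPA contains the linear model as a special case, so its \(R^2\) can be no smaller than 0.1485.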
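The relationship between these \(R^2\) values can be illustrated on simulated data. In the sketch below (made-up data and names, not the pabmi data), the model with both the linear and squared terms always has an \(R^2\) at least as large as either single-term model.

```r
# Hypothetical simulated data with a mild quadratic trend.
set.seed(3)
x <- runif(100, -5, 5)
y <- 24 - 0.6 * x + 0.1 * x^2 + rnorm(100, sd = 3)

r2_linear      <- summary(lm(y ~ x))$r.squared
r2_square_only <- summary(lm(y ~ I(x^2)))$r.squared
r2_quadratic   <- summary(lm(y ~ x + I(x^2)))$r.squared

print(round(c(linear = r2_linear,
              square_only = r2_square_only,
              quadratic = r2_quadratic), 4))
```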
resid = residuals(simpleModel1)
The model does not fit the data well. The residuals are also widely spread, ranging from about -10.7 to 9.5.
In a hidden R chunk, we have loaded data on commercial architecture firms (in the billing data frame). We will build a model to predict total billing for a firm.
Summaries of total billing (TotalBill02), number of architects (N_Arch), engineers (N_Eng), and staff (N_Staff):
summaryA = summary(billing$N_Arch)
summaryB = summary(billing$N_Eng)
summaryC = summary(billing$N_Staff)
summaryD = summary(billing$TotalBill02)
summaryA
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 5.00 7.00 10.57 16.00 39.00
plot(billing$N_Arch, ylab = "Number of Architects")
summaryB
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 2.00 6.81 7.00 36.00
plot(billing$N_Eng, xlab = "Index", ylab = "Number of Engineers")
summaryC
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 16.0 58.0 59.9 70.0 240.0
plot(billing$N_Staff, ylab = "Number of Staff")
summaryD
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.600 3.300 6.700 8.252 10.600 29.500
Correlations and pairwise scatterplots of the variables (using the cor function and the plot.data.frame function):
cor(billing)
## TotalBill02 ArchBill02 ArchBill01 N_Arch N_Eng
## TotalBill02 1.0000000 0.84529126 0.82638347 0.78411784 0.81944925
## ArchBill02 0.8452913 1.00000000 0.98344484 0.96132106 0.49922217
## ArchBill01 0.8263835 0.98344484 1.00000000 0.95861422 0.46182119
## N_Arch 0.7841178 0.96132106 0.95861422 1.00000000 0.45688647
## N_Eng 0.8194492 0.49922217 0.46182119 0.45688647 1.00000000
## N_Staff 0.9586908 0.79323092 0.77645095 0.75791152 0.90177474
## Yr_Estab -0.1228019 0.07979176 0.04273231 0.05217474 -0.09506632
## N_Staff Yr_Estab
## TotalBill02 0.9586908 -0.12280194
## ArchBill02 0.7932309 0.07979176
## ArchBill01 0.7764510 0.04273231
## N_Arch 0.7579115 0.05217474
## N_Eng 0.9017747 -0.09506632
## N_Staff 1.0000000 -0.11219607
## Yr_Estab -0.1121961 1.00000000
plot(billing)
model1 = lm(TotalBill02 ~ ArchBill01 + ArchBill02 , data = billing)
summary(model1)
##
## Call:
## lm(formula = TotalBill02 ~ ArchBill01 + ArchBill02, data = billing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.238 -2.570 -1.600 1.111 10.621
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.4284 1.2061 2.013 0.0593 .
## ArchBill01 -0.1898 0.8803 -0.216 0.8317
## ArchBill02 1.2282 0.8589 1.430 0.1699
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.793 on 18 degrees of freedom
## Multiple R-squared: 0.7153, Adjusted R-squared: 0.6836
## F-statistic: 22.61 on 2 and 18 DF, p-value: 1.231e-05
2.967
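Based on the output, one possible concern is collinearity: ArchBill01 and ArchBill02 have a correlation of 0.983, and although the overall F-test is highly significant (p = 1.231e-05), neither individual coefficient is (p = 0.83 and p = 0.17).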
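One way to check the two predictors in the model above for collinearity is the variance inflation factor (VIF). With only two predictors, the VIF of each equals \(1 / (1 - r^2)\), where \(r\) is their correlation. The sketch below plugs in the ArchBill01 and ArchBill02 correlation from the `cor(billing)` output; the rule of thumb that values above roughly 10 deserve attention is a convention, not something stated in the assignment.

```r
# Correlation between ArchBill01 and ArchBill02, copied from cor(billing).
r <- 0.98344484

# With exactly two predictors, both share the same VIF.
vif <- 1 / (1 - r^2)
print(round(vif, 1))
```

A VIF of this size means the coefficient variances are inflated roughly thirtyfold relative to uncorrelated predictors, which would be consistent with the large standard errors in the summary.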