Problem 1

In this problem we will consider several studies investigating the link between blood pressure, age (in years), and weight (in kilograms).

  1. For a simple random sample of 26 subjects, the researchers fitted the following model: \[\hat y = 101 + 0.25 \cdot \text{age} + 0.12 \cdot \text{weight}\] What is the prediction for a new subject who is 35 years old and weighs 98 kg?

The predicted blood pressure is \(101 + 0.25 \cdot 35 + 0.12 \cdot 98 = 121.51\).
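A one-line check of the arithmetic in R:

101 + 0.25 * 35 + 0.12 * 98  # 121.51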

  1. Using the model from the previous problem, the researchers found a standard error for the age parameter of 0.11. Construct a 95% confidence interval for this parameter.

\(0.25 \pm 2.069 \times 0.11 = (0.022, 0.478)\), using the \(t\) critical value with \(n - 3 = 23\) degrees of freedom. This is a 95% confidence interval for the age parameter.
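The same interval computed in R, using the \(t\) quantile with 23 degrees of freedom:

0.25 + c(-1, 1) * qt(0.975, df = 23) * 0.11  # about (0.022, 0.478)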

  1. State and test (at the \(\alpha = 0.01\) level) the hypothesis of no linear relationship between age and blood pressure.

\(H_0: \beta_{\text{age}} = 0\) vs. \(H_A: \beta_{\text{age}} \ne 0\)

The test statistic is \(t = 0.25 / 0.11 \approx 2.27\), with \(p = 2P(T_{23} \ge 2.27) \approx 0.03\). Since the \(p\)-value is larger than the significance level of 0.01, we fail to reject the null hypothesis that there is no linear relationship between age and blood pressure.
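The corresponding computation in R:

tstat = 0.25 / 0.11           # about 2.27
2 * pt(-abs(tstat), df = 23)  # two-sided p-value, about 0.03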

  1. In a new sample of 108 subjects, complete the following table, where “t-stat” is the test statistic for the hypothesis that the true coefficient is zero and “p” is the \(p\)-value for that hypothesis (this is the default behavior in R).
Variable    Estimate   Std. Error   t-stat    p
Intercept   97         4.1          23.6585   < 2e-16
Age         0.2702     0.14         1.93      0.0574
Weight      0.9        1.74         0.517     0.606
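A sketch of how the table entries can be computed in R (here the degrees of freedom are \(108 - 3 = 105\)):

est   = c(97, 0.2702, 0.9)      # Intercept, Age, Weight estimates
se    = c(4.1, 0.14, 1.74)      # standard errors
tstat = est / se                # t statistics
2 * pt(-abs(tstat), df = 105)   # two-sided p-values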

Problem 2

Part 1

In a hidden R chunk, we have imported the GPA data that is investigated in chapter 11.2 of your book. Here, for example, is the regression fit from page 629:

fullmodel <- lm(GPA ~ SATM + SATCR + SATW + HSM + HSS + HSE, data = gpa)
print(fullmodel)
## 
## Call:
## lm(formula = GPA ~ SATM + SATCR + SATW + HSM + HSS + HSE, data = gpa)
## 
## Coefficients:
## (Intercept)         SATM        SATCR         SATW          HSM  
##   -1.186783     0.001989     0.000157     0.000474     0.091477  
##         HSS          HSE  
##    0.130097     0.056791

For each of the following models, report:

- the fitted model coefficients,
- the mean squared error (MSE),
- the percent of explained variation, and
- the \(p\)-values for the tests that \(\beta_j = 0\) for each coefficient.
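One convenient way to pull all four quantities from a fit is a small helper function (a sketch; the name `report` is our own):

report <- function(model) {
  sm <- summary(model)
  list(coefficients  = coef(model),
       MSE           = mean(sm$residuals^2),
       pct_explained = 100 * sm$r.squared,
       p_values      = coef(sm)[, "Pr(>|t|)"])
}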

  1. SATM and HSS
model1 = lm(GPA~SATM + HSS, data = gpa)
sm = summary(model1)
sm
## 
## Call:
## lm(formula = GPA ~ SATM + HSS, data = gpa)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9606 -0.4118  0.1637  0.5311  1.3831 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.8469136  0.5584649  -1.517 0.131540    
## SATM         0.0026897  0.0007973   3.374 0.000948 ***
## HSS          0.2286058  0.0427660   5.346 3.37e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7113 on 147 degrees of freedom
## Multiple R-squared:  0.2538, Adjusted R-squared:  0.2436 
## F-statistic:    25 on 2 and 147 DF,  p-value: 4.517e-10
MSE = mean(sm$residuals^2)
MSE
## [1] 0.495849

Model 1 explains about 25.4% of the variation in GPA (MSE ≈ 0.496); the coefficients on SATM (p ≈ 0.0009) and HSS (p ≈ 3.4e-07) are both significantly different from zero.
  1. SATM, HSM, and HSS
model2 = lm(GPA~SATM + HSS+ HSM, data = gpa)
sm2 = summary(model2)
sm2
## 
## Call:
## lm(formula = GPA ~ SATM + HSS + HSM, data = gpa)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9958 -0.3900  0.1793  0.5199  1.2232 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -0.8871465  0.5564866  -1.594  0.11306   
## SATM         0.0023746  0.0008195   2.898  0.00434 **
## HSS          0.1725225  0.0560094   3.080  0.00247 **
## HSM          0.0850497  0.0552022   1.541  0.12556   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.708 on 146 degrees of freedom
## Multiple R-squared:  0.2657, Adjusted R-squared:  0.2507 
## F-statistic: 17.61 on 3 and 146 DF,  p-value: 8.195e-10
MSE2 = mean(sm2$residuals^2)
MSE2
## [1] 0.4879162

Model 2 explains about 26.6% of the variation in GPA (MSE ≈ 0.488); SATM (p ≈ 0.004) and HSS (p ≈ 0.002) are significant, while HSM (p ≈ 0.126) is not.
  1. SATM, HSM, HSS, and HSE
model3 = lm(GPA~SATM + HSS+ HSM + HSE, data = gpa)
sm3 = summary(model3)
sm3
## 
## Call:
## lm(formula = GPA ~ SATM + HSS + HSM + HSE, data = gpa)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9937 -0.3645  0.1617  0.5143  1.3931 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -1.1064079  0.5973727  -1.852  0.06604 . 
## SATM         0.0024008  0.0008199   2.928  0.00396 **
## HSS          0.1332323  0.0682112   1.953  0.05272 . 
## HSM          0.0827049  0.0552476   1.497  0.13657   
## HSE          0.0643942  0.0638155   1.009  0.31462   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.708 on 145 degrees of freedom
## Multiple R-squared:  0.2709, Adjusted R-squared:  0.2507 
## F-statistic: 13.47 on 4 and 145 DF,  p-value: 2.336e-09
MSE3 = mean(sm3$residuals^2)
MSE3
## [1] 0.4845139

Model 3 explains about 27.1% of the variation in GPA (MSE ≈ 0.485); only SATM (p ≈ 0.004) is clearly significant, with HSS borderline (p ≈ 0.053) and HSM and HSE not significant.

Part 2

Consider the hypothesis that none of the SAT-related variables has a linear relationship with GPA: \[H_0: \beta_{SATM} = \beta_{SATCR} = \beta_{SATW} = 0 \quad \text{vs.} \quad H_A: \text{at least one of } \beta_{SATM}, \beta_{SATCR}, \beta_{SATW} \ne 0\] We can test this hypothesis by comparing the full model to the model that does not include the SAT-related variables using R’s anova function. Update the following code to produce the smaller model (and set eval = TRUE):

smallmodel <- lm(GPA~ HSM + HSS + HSE, data = gpa)
anova(smallmodel, fullmodel)
## Analysis of Variance Table
## 
## Model 1: GPA ~ HSM + HSS + HSE
## Model 2: GPA ~ SATM + SATCR + SATW + HSM + HSS + HSE
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)  
## 1    146 76.975                             
## 2    143 72.465  3    4.5104 2.9669 0.0341 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
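As a check, the \(F\) statistic can be reproduced from the RSS column of the table above:

Fstat = ((76.975 - 72.465) / 3) / (72.465 / 143)
Fstat                                              # about 2.967
pf(Fstat, df1 = 3, df2 = 143, lower.tail = FALSE)  # about 0.034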

Interpret these results. At the \(\alpha = 0.05\) level, would you reject the null hypothesis? What do you conclude about the contribution of the SAT variables?

Our \(p\)-value is 0.0341. At the \(\alpha = 0.05\) significance level, we reject the null hypothesis: there is sufficient evidence to conclude that at least one of the SAT-related variables has a linear relationship with GPA.

Problem 3

Again using the GPA data from problem 2, along with fullmodel and smallmodel, consider a new observation with the following values:

Variable   Value
HSM          9
HSS          9
HSE          9
SATM       630
SATCR      560
SATW       560
  1. Create a 95% prediction interval using fullmodel.
newdata = data.frame(HSM = 9, HSS = 9, HSE = 9, SATM = 630, SATCR = 560, SATW = 560)
predict(fullmodel, newdata, interval = "prediction", level = .95)
##        fit      lwr      upr
## 1 2.924757 1.512202 4.337312
  1. Create a 95% prediction interval using smallmodel.
newdata1 = data.frame(HSM = 9, HSS = 9, HSE = 9)
predict(smallmodel, newdata1, interval = "prediction", level = .95)
##        fit      lwr      upr
## 1 2.930049 1.489834 4.370264
  1. Compare the two intervals. Do the models give similar predictions?

The models give extremely similar predictions: the fitted values and the lower and upper bounds of the two prediction intervals are nearly identical.
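Comparing the interval widths directly (values copied from the output above):

4.337312 - 1.512202  # fullmodel interval width, about 2.83
4.370264 - 1.489834  # smallmodel interval width, about 2.88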

Problem 4

While linear regression is linear in the parameters, it need not be linear in the predictors: we can fit a polynomial curve of degree \(q\) using \[y = \beta_0 + \beta_1 x + \beta_2 x^2 + \ldots + \beta_q x^q\] (Equivalently, we can think of a set of new variables \(z_1 = x, z_2 = x^2, \ldots, z_q = x^q\).)
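In R, such a fit can be written directly in the model formula; a minimal sketch with placeholder names y, x, and dat:

quadfit <- lm(y ~ x + I(x^2), data = dat)  # I() protects the squaring inside the formula
# equivalently: lm(y ~ poly(x, 2, raw = TRUE), data = dat)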

Here is a plot of Body Mass Index (BMI, the ratio of weight to squared height) vs. physical activity (PA, in thousands of steps):

plot(BMI ~ PA, data = pabmi)

This plot is vaguely suggestive of a quadratic relationship between BMI and PA (i.e., \(q = 2\)).

  1. It is often best to “center” a variable before squaring and regressing. To center the variable, subtract off the sample mean. Create two new variables: the centered version of PA and the square of the centered PA.
meanPA = mean(pabmi$PA)
# Center PA by subtracting off the sample mean (this overwrites the original column)
pabmi$PA = pabmi$PA - meanPA
mean(pabmi$PA)  # approximately zero after centering
# Square of the centered variable
SQUAREDPA = pabmi$PA^2
  1. Regress BMI on the two new variables.
simpleModel <- lm(BMI ~ PA, data = pabmi)
summary(simpleModel)
## 
## Call:
## lm(formula = BMI ~ PA, data = pabmi)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.3819 -2.5636  0.2062  1.9820  8.5078 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  23.9390     0.3655  65.499  < 2e-16 ***
## PA           -0.6547     0.1583  -4.135  7.5e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.655 on 98 degrees of freedom
## Multiple R-squared:  0.1485, Adjusted R-squared:  0.1399 
## F-statistic:  17.1 on 1 and 98 DF,  p-value: 7.503e-05
simpleModel1<- lm(BMI ~ SQUAREDPA, data = pabmi)
summary(simpleModel1)
## 
## Call:
## lm(formula = BMI ~ SQUAREDPA, data = pabmi)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.6565  -2.5813   0.5981   2.5205   9.4950 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 23.51663    0.50677  46.405   <2e-16 ***
## SQUAREDPA    0.07927    0.06013   1.318     0.19    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.926 on 98 degrees of freedom
## Multiple R-squared:  0.01742,    Adjusted R-squared:  0.007397 
## F-statistic: 1.738 on 1 and 98 DF,  p-value: 0.1905
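Note: the quadratic model in the problem statement includes both the linear and squared terms in a single regression, whereas the fits above use each term separately. A sketch of the combined fit (using the centered PA and SQUAREDPA created earlier; output not shown here, since the chunk was not rerun):

quadModel <- lm(BMI ~ PA + SQUAREDPA, data = pabmi)
summary(quadModel)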
  1. The \(R^2\) value for the simple regression of BMI on PA is:
simpleModel <- lm(BMI ~ PA, data = pabmi)
summary(simpleModel)$r.squared
## [1] 0.1485401

What is the \(R^2\) for the quadratic model? Interpret these values.

The \(R^2\) for the squared-term model is about 0.017: the squared term alone explains only about 1.7% of the variation in BMI, far less than the roughly 14.9% explained by the centered linear term.

  1. Review residual plots for the quadratic model. Are there any lingering concerns about linearity, variance, or Normality?
resid = residuals(simpleModel1)
plot(fitted(simpleModel1), resid, xlab = "Fitted values", ylab = "Residuals")

The quadratic term alone does not fit the data well, and the residuals show a wide spread, so there is some concern about the amount of unexplained variance.

Problem 5

In a hidden R chunk, we have loaded data on commercial architecture firms (in the billing data frame). We will build a model to predict total billing for a firm.

  1. Using numerical and graphical summaries, describe the total billing (TotalBill02), number of architects (N_Arch), engineers (N_Eng), and staff (N_Staff).
summaryA = summary(billing$N_Arch)
summaryB = summary(billing$N_Eng)
summaryC = summary(billing$N_Staff)
summaryD = summary(billing$TotalBill02)

summaryA
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    5.00    7.00   10.57   16.00   39.00
plot(billing$N_Arch, ylab = "Number of Architects")

summaryB
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    2.00    6.81    7.00   36.00
plot(billing$N_Eng, xlab = "Index", ylab = "Number of Engineers")

summaryC
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0    16.0    58.0    59.9    70.0   240.0
plot(billing$N_Staff, ylab = "Number of Staff")

summaryD
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.600   3.300   6.700   8.252  10.600  29.500
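Histograms would complement these index plots as graphical summaries of the distributions; a minimal sketch:

hist(billing$N_Arch,      main = "", xlab = "Number of architects")
hist(billing$N_Eng,       main = "", xlab = "Number of engineers")
hist(billing$N_Staff,     main = "", xlab = "Number of staff")
hist(billing$TotalBill02, main = "", xlab = "Total billing, 2002")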
  1. For each of the pairs of these variables, provide numerical and graphical summaries to describe their relationships (hint: see the cor function and plot.data.frame function).
cor(billing)
##             TotalBill02 ArchBill02 ArchBill01     N_Arch       N_Eng
## TotalBill02   1.0000000 0.84529126 0.82638347 0.78411784  0.81944925
## ArchBill02    0.8452913 1.00000000 0.98344484 0.96132106  0.49922217
## ArchBill01    0.8263835 0.98344484 1.00000000 0.95861422  0.46182119
## N_Arch        0.7841178 0.96132106 0.95861422 1.00000000  0.45688647
## N_Eng         0.8194492 0.49922217 0.46182119 0.45688647  1.00000000
## N_Staff       0.9586908 0.79323092 0.77645095 0.75791152  0.90177474
## Yr_Estab     -0.1228019 0.07979176 0.04273231 0.05217474 -0.09506632
##                N_Staff    Yr_Estab
## TotalBill02  0.9586908 -0.12280194
## ArchBill02   0.7932309  0.07979176
## ArchBill01   0.7764510  0.04273231
## N_Arch       0.7579115  0.05217474
## N_Eng        0.9017747 -0.09506632
## N_Staff      1.0000000 -0.11219607
## Yr_Estab    -0.1121961  1.00000000
plot(billing)

  1. Carry out multiple linear regression. What is the estimated value of \(\sigma^2\)?
model1 = lm(TotalBill02 ~ ArchBill01 + ArchBill02, data = billing)
summary(model1)
## 
## Call:
## lm(formula = TotalBill02 ~ ArchBill01 + ArchBill02, data = billing)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.238 -2.570 -1.600  1.111 10.621 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   2.4284     1.2061   2.013   0.0593 .
## ArchBill01   -0.1898     0.8803  -0.216   0.8317  
## ArchBill02    1.2282     0.8589   1.430   0.1699  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.793 on 18 degrees of freedom
## Multiple R-squared:  0.7153, Adjusted R-squared:  0.6836 
## F-statistic: 22.61 on 2 and 18 DF,  p-value: 1.231e-05
The estimated \(\sigma^2\) is the square of the residual standard error reported above: \(3.793^2 \approx 14.39\).
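The same value can be computed directly from the fit:

summary(model1)$sigma^2  # 3.793^2, about 14.39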

  1. Investigate the residuals. Are there any concerns?
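A minimal sketch of the residual diagnostics for model1:

res = residuals(model1)
plot(fitted(model1), res, xlab = "Fitted values", ylab = "Residuals")  # look for curvature or fanning
qqnorm(res); qqline(res)                                               # check Normality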

Based on these diagnostics, I do not think there are any major concerns.