A = read.table('CH01PR22.txt',header=F)
names(A) = c("y","x")

Problem 5.26: Solution

a1)

X = cbind(Intercept = 1, x = A$x)
xtx.inv = solve(t(X) %*% X); xtx.inv   # (X'X)^{-1}
##           Intercept           x
## Intercept  0.675000 -0.02187500
## x         -0.021875  0.00078125
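
As a quick sanity check (not part of the original problem), the inverse times X'X should reproduce the 2 x 2 identity matrix up to rounding:

round(xtx.inv %*% (t(X) %*% X), 10)   # should print the 2 x 2 identity matrix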

a2)

b = solve(t(X) %*% X) %*% t(X) %*% A$y; b
##                 [,1]
## Intercept 168.600000
## x           2.034375
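
As a cross-check (not required), lm() should reproduce the same coefficient estimates:

coef(lm(y ~ x, data = A))   # should match b above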

a3)

yhat = X %*% b; yhat
##          [,1]
##  [1,] 201.150
##  [2,] 201.150
##  [3,] 201.150
##  [4,] 201.150
##  [5,] 217.425
##  [6,] 217.425
##  [7,] 217.425
##  [8,] 217.425
##  [9,] 233.700
## [10,] 233.700
## [11,] 233.700
## [12,] 233.700
## [13,] 249.975
## [14,] 249.975
## [15,] 249.975
## [16,] 249.975
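
A related sketch (not part of the printed solution): the fitted values can equivalently be obtained through the hat matrix H = X (X'X)^(-1) X', since Yhat = H Y:

H = X %*% xtx.inv %*% t(X)   # hat matrix
H %*% A$y                    # reproduces the fitted values above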

a5)

e = A$y - yhat
SSE = t(e) %*% e; SSE
##         [,1]
## [1,] 146.425
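
Equivalently (an optional check), SSE can be obtained as a plain sum of squared residuals, or from lm():

sum(e^2)                        # same SSE as the quadratic form above
deviance(lm(y ~ x, data = A))   # residual sum of squares from lm(), for comparison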

a6)

MSE = SSE / (length(e) - 2)
MSE = as.numeric(MSE)
s2b = MSE * solve(t(X) %*% X); s2b
##            Intercept            x
## Intercept  7.0597768 -0.228789063
## x         -0.2287891  0.008171038
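
A quick check (not in the original solution): vcov() applied to the corresponding lm() fit should return the same estimated variance-covariance matrix of the coefficients:

vcov(lm(y ~ x, data = A))   # should match s2b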

a7)

xh = c(1, 30)   # Xh = 30, with a leading 1 for the intercept term
xh.transpose = t(xh)
s2pred = MSE* (1 + xh.transpose %*% solve(t(X) %*% X) %*% xh); s2pred
##         [,1]
## [1,] 11.1453
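
Not asked for here, but two related quantities fall out of the same computation: the estimated variance of the mean response at the same Xh (the same formula without the "+ 1" term), and the standard error of prediction itself:

s2.yhat = MSE * (xh.transpose %*% solve(t(X) %*% X) %*% xh)   # estimated variance of the mean response at Xh
sqrt(s2pred)                                                  # s{pred}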

b)

From the estimated variance-covariance matrix in part (a6):
var(b0) = s2(b0) = 7.0597768
s(b1) = sqrt(0.008171038) = 0.0904
s(b0,b1) = -0.228789063
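
These values can be read directly off the s2b matrix computed in part (a6); a small sketch:

s2b[1, 1]         # s^2(b0)
sqrt(s2b[2, 2])   # s(b1)
s2b[1, 2]         # s(b0, b1)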

Problem 6.15: Solution

a)

Stem-and-leaf plots:

B = read.table('CH06PR15.txt',header=T)
stem(B$X1)
## 
##   The decimal point is 1 digit(s) to the right of the |
## 
##   2 | 23
##   2 | 58899999
##   3 | 012233344
##   3 | 66678
##   4 | 0012233344
##   4 | 557779
##   5 | 0233
##   5 | 55
stem(B$X2)
## 
##   The decimal point is at the |
## 
##   40 | 0
##   42 | 00
##   44 | 0
##   46 | 00000
##   48 | 0000000000
##   50 | 000000000000
##   52 | 000000
##   54 | 0000
##   56 | 00
##   58 | 0
##   60 | 0
##   62 | 0
stem(B$X3)
## 
##   The decimal point is 1 digit(s) to the left of the |
## 
##   18 | 000000
##   20 | 00000000
##   22 | 0000000000000
##   24 | 00000000000
##   26 | 0000
##   28 | 0000


Some noteworthy features are the relatively high frequencies at the highest and lowest anxiety levels, and a large cluster of 28- and 29-year-old patients, which may indicate some bias in how the sample was collected. However, these features are not likely to affect the regression by much.

b)

plot(B)   # scatter plot matrix of Y, X1, X2, X3

cor(B)
##             Y         X1         X2         X3
## Y   1.0000000 -0.7867555 -0.6029417 -0.6445910
## X1 -0.7867555  1.0000000  0.5679505  0.5696775
## X2 -0.6029417  0.5679505  1.0000000  0.6705287
## X3 -0.6445910  0.5696775  0.6705287  1.0000000

The scatter plot matrix shows the pairwise relationships among all of the variables. The plots of the response Y against each predictor show a fairly strong linear relationship. In contrast, the plots of one predictor against another show weaker, though not negligible, linear relationships.

The correlation matrix gives a more precise measure of how strongly each pair of variables is related. The correlations between the response Y and each predictor all exceed 0.6 in absolute value, which is fairly strong. The correlations among the predictors are all below 0.6, except for cor(X2, X3), which is slightly above 0.6; the relationship between X2 and X3 is worth examining further before settling on a final model, as sketched below.
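
One simple way to follow up on the X2-X3 correlation (not required by the problem) is to compute variance inflation factors from the predictor correlation matrix; values well below 10 would suggest multicollinearity is not a serious concern:

# VIFs are the diagonal of the inverse of the predictor correlation matrix
diag(solve(cor(B[, c("X1", "X2", "X3")])))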

c)

fit = lm(Y ~ X1 + X2 + X3, data = B)
fit$coefficients
## (Intercept)          X1          X2          X3 
## 158.4912517  -1.1416118  -0.4420043 -13.4701632

The estimated regression function is: Yhat = 158.4912517 - 1.1416118 * X1 - 0.4420043 * X2 - 13.4701632 * X3.

b2 = -0.4420043 is the estimated change in patient satisfaction per unit increase in X2 (severity of illness), holding X1 (age) and X3 (anxiety level) constant. In other words, estimated mean satisfaction decreases by about 0.442 for each one-unit increase in X2.
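
A quick numerical illustration (the predictor values here are arbitrary, chosen only for the example): predicting at two X2 values one unit apart, with X1 and X3 held fixed, reproduces the 0.442 decrease:

# hypothetical values for X1 and X3; only the one-unit change in X2 matters here
predict(fit, newdata = data.frame(X1 = 40, X2 = 50, X3 = 2.3)) -
  predict(fit, newdata = data.frame(X1 = 40, X2 = 51, X3 = 2.3))   # equals |b2| = 0.442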

d)

res = fit$residuals
boxplot(res, main = 'Residuals')

As shown, there are no outliers in the distribution of residuals.
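
The no-outliers claim can also be checked numerically (an optional sketch) with the same 1.5 x IQR rule that boxplot() uses:

boxplot.stats(res)$out   # residuals flagged as outliers; expected to be empty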

e)

plot(fit$fitted.values, fit$res, xlab = 'Fitted values', ylab = 'Residuals'); abline(h = 0)

plot(B$X1, fit$res, xlab = 'X1', ylab = 'Residuals'); abline(h = 0)

plot(B$X2, fit$res, xlab = 'X2', ylab = 'Residuals'); abline(h = 0)

plot(B$X3, fit$res, xlab = 'X3', ylab = 'Residuals'); abline(h = 0)

As you can see, there is no apparent pattern in any of these plots, which indicates that the linearity assumption is not violated. There is also no sign of heteroskedasticity, since the spread of the residuals is roughly constant across the range of fitted values and predictors. The residuals are fairly evenly spread around 0, and there are no apparent outliers.

plot(B$X1 * B$X2, fit$res, xlab = 'X1*X2', ylab = 'Residuals'); abline(h = 0)

plot(B$X1 * B$X3, fit$res, xlab = 'X1*X3', ylab = 'Residuals'); abline(h = 0)

plot(B$X2 * B$X3, fit$res, xlab = 'X2*X3', ylab = 'Residuals'); abline(h = 0)

There is no apparent pattern in any of the plots of the residuals against the two-factor interaction terms, so there is no indication that any of these interaction terms needs to be added to the model.

qqnorm(fit$res); qqline(fit$res)

What we want to see here is the points falling on the theoretical line, and most of them do; the ones that do not fall exactly on the line are still very close to it. Overall, the normal Q-Q plot supports the assumption that the error terms are approximately normally distributed.
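
As an optional, more formal check (not requested in the problem), the correlation between the ordered residuals and their expected normal scores should be close to 1 if the normality assumption holds:

cor(sort(fit$residuals), qnorm(ppoints(length(fit$residuals))))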

Problem 6.16: Solution

a)

B = read.table('CH06PR15.txt',header=T)
fit = lm(Y ~ X1 + X2 + X3, data = B)
anova(fit)
## Analysis of Variance Table
## 
## Response: Y
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## X1         1 8275.4  8275.4 81.8026 2.059e-11 ***
## X2         1  480.9   480.9  4.7539   0.03489 *  
## X3         1  364.2   364.2  3.5997   0.06468 .  
## Residuals 42 4248.8   101.2                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
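
The ANOVA table above can be turned into the overall F test for a regression relation; a sketch, taking alpha = 0.10 to match the level used elsewhere in this assignment:

tab = anova(fit)
ssr = sum(tab[1:3, "Sum Sq"])            # SSR = SSR(X1) + SSR(X2|X1) + SSR(X3|X1,X2)
sse = tab["Residuals", "Sum Sq"]
Fstat = (ssr / 3) / (sse / 42); Fstat    # overall F statistic (also reported by summary(fit))
qf(0.90, 3, 42)                          # reject H0: beta1 = beta2 = beta3 = 0 if Fstat exceeds this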

b)

coef = fit$coef                  # estimated regression coefficients
s.coef = summary(fit)$coef[, 2]  # their estimated standard errors

n = nrow(B)
p = 4                            # number of regression parameters
g = 3                            # number of joint intervals (for b1, b2, b3)
alpha = 0.1
Bonf = qt(1 - alpha / (2 * g), n - p)   # Bonferroni multiplier B = t(1 - alpha/(2g); n - p)

lbd = coef - Bonf * s.coef       # lower Bonferroni bounds
ubd = coef + Bonf * s.coef       # upper Bonferroni bounds
rbind(lbd, ubd)
##     (Intercept)         X1         X2         X3
## lbd    118.6076 -1.6142482 -1.5245098 -29.092028
## ubd    198.3749 -0.6689755  0.6405013   2.151701
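
Note that the joint Bonferroni family here consists of the three intervals for X1, X2, and X3; the intercept column is printed only because coef includes it. Equivalently (a cross-check, not required), confint() with the adjusted confidence level reproduces the same three intervals:

confint(fit, parm = c("X1", "X2", "X3"), level = 1 - 0.1 / 3)   # same Bonferroni bounds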

c)

summary(fit)$r.sq
## [1] 0.6821943
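
As an optional cross-check, the same R^2 can be built by hand from the ANOVA decomposition, R^2 = SSR / SSTO:

ss = anova(fit)[, "Sum Sq"]
sum(ss[1:3]) / sum(ss)   # SSR / SSTO, should match summary(fit)$r.sq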

Problem 6.17: Solution

a)

xnew = data.frame(X1 = 35, X2 = 45, X3 = 2.2); xnew; fit
##   X1 X2  X3
## 1 35 45 2.2
## 
## Call:
## lm(formula = Y ~ X1 + X2 + X3, data = B)
## 
## Coefficients:
## (Intercept)           X1           X2           X3  
##     158.491       -1.142       -0.442      -13.470
predict(fit, newdata = xnew, interval = 'confidence', se.fit = F, level = 0.90 )
##        fit      lwr      upr
## 1 69.01029 64.52854 73.49204

We are 90% confident that the mean response when X1 = 35, X2 = 45, and X3 = 2.2 is between 64.52854 and 73.49204.

b)

predict(fit, newdata = xnew, interval = 'prediction', se.fit = F, level = 0.90 )
##        fit      lwr      upr
## 1 69.01029 51.50965 86.51092

We are 90% confident that a new observation of the response taken at X1 = 35, X2 = 45, and X3 = 2.2 will be between 51.50965 and 86.51092.
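
For reference (an optional sketch), the same 90% prediction limits can be assembled by hand from predict()'s standard errors, using s^2{pred} = s^2{Yhat_h} + MSE:

pr = predict(fit, newdata = xnew, se.fit = TRUE)
s.pred = sqrt(pr$se.fit^2 + pr$residual.scale^2)   # s{pred}
pr$fit + c(-1, 1) * qt(0.95, pr$df) * s.pred       # should match the limits above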