A = read.table('CH01PR22.txt',header=F)
names(A) = c("y","x")
X = cbind(Intercept = 1, x = A$x)   # design matrix with an intercept column
xtx = solve(t(X) %*% X); xtx        # (X'X)^{-1}
## Intercept x
## Intercept 0.675000 -0.02187500
## x -0.021875 0.00078125
b = solve(t(X) %*% X) %*% t(X) %*% A$y; b   # least squares estimates b = (X'X)^{-1} X'y
## [,1]
## Intercept 168.600000
## x 2.034375
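As a quick check, the same estimates can be obtained from lm(); a minimal sketch using the A data frame loaded above:
fit.check = lm(y ~ x, data = A)   # should reproduce b above
coef(fit.check)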
yhat = X %*% b; yhat
## [,1]
## [1,] 201.150
## [2,] 201.150
## [3,] 201.150
## [4,] 201.150
## [5,] 217.425
## [6,] 217.425
## [7,] 217.425
## [8,] 217.425
## [9,] 233.700
## [10,] 233.700
## [11,] 233.700
## [12,] 233.700
## [13,] 249.975
## [14,] 249.975
## [15,] 249.975
## [16,] 249.975
e = A$y - yhat          # residuals
SSE = t(e) %*% e; SSE   # error sum of squares
## [,1]
## [1,] 146.425
MSE = as.numeric(SSE / (length(e) - 2))   # estimate of sigma^2 on n - 2 degrees of freedom
s2b = MSE * solve(t(X) %*% X); s2b   # estimated variance-covariance matrix of b
## Intercept x
## Intercept 7.0597768 -0.228789063
## x -0.2287891 0.008171038
xh = c(1, 30)   # predict at x_h = 30
xh.transpose = t(xh)
s2pred = MSE * (1 + xh.transpose %*% solve(t(X) %*% X) %*% xh); s2pred   # estimated variance of the prediction error at x_h
## [,1]
## [1,] 11.1453
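With s2pred in hand, a prediction interval for a new observation at x_h follows directly; a sketch assuming a 95% level (adjust as the problem requires):
yhat.h = as.numeric(xh.transpose %*% b)   # point prediction at x_h = 30
t.crit = qt(0.975, df = length(e) - 2)    # t critical value on n - 2 df
c(lower = yhat.h - t.crit * sqrt(as.numeric(s2pred)),
  upper = yhat.h + t.crit * sqrt(as.numeric(s2pred)))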
From the results in part a6 (the variance-covariance matrix s2b above):
s^2(b0) = 7.0597768
s^2(b1) = 0.008171038, so s(b1) = sqrt(0.008171038) ≈ 0.0904
s(b0, b1) = -0.228789063
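The standard errors themselves are the square roots of the diagonal of s2b; a one-line sketch:
sqrt(diag(s2b))   # s(b0) and s(b1)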
Stem-and-leaf plots:
B = read.table('CH06PR15.txt',header=T)
stem(B$X1)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 2 | 23
## 2 | 58899999
## 3 | 012233344
## 3 | 66678
## 4 | 0012233344
## 4 | 557779
## 5 | 0233
## 5 | 55
stem(B$X2)
##
## The decimal point is at the |
##
## 40 | 0
## 42 | 00
## 44 | 0
## 46 | 00000
## 48 | 0000000000
## 50 | 000000000000
## 52 | 000000
## 54 | 0000
## 56 | 00
## 58 | 0
## 60 | 0
## 62 | 0
stem(B$X3)
##
## The decimal point is 1 digit(s) to the left of the |
##
## 18 | 000000
## 20 | 00000000
## 22 | 0000000000000
## 24 | 00000000000
## 26 | 0000
## 28 | 0000
Some noteworthy features are the relatively high frequencies at both the highest and lowest anxiety levels, and the large cluster of 28- and 29-year-old patients, which may indicate some sampling bias. However, these features are unlikely to affect the regression much.
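A numeric summary complements the stem-and-leaf plots; a minimal sketch (output omitted):
summary(B)        # five-number summaries and means for Y and each predictor
apply(B, 2, sd)   # standard deviations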
plot(B)   # scatterplot matrix of Y and the predictors
cor(B)    # correlation matrix
## Y X1 X2 X3
## Y 1.0000000 -0.7867555 -0.6029417 -0.6445910
## X1 -0.7867555 1.0000000 0.5679505 0.5696775
## X2 -0.6029417 0.5679505 1.0000000 0.6705287
## X3 -0.6445910 0.5696775 0.6705287 1.0000000
These plots show the pairwise relationships among all the variables. The plots of Y against each predictor show a fairly strong linear relationship. In contrast, the plots of one predictor against another show weaker linear relationships.
The correlation matrix gives a more precise measure of how strongly each pair of variables is related. The correlations between Y and each predictor all exceed 0.6 in absolute value, which is fairly strong. The correlations among the predictors are all below 0.6, except for cor(X2, X3), which is slightly above 0.6. I would suggest further evaluating the relationship between X2 and X3 and considering a model that accounts for their correlation; one quick diagnostic is sketched below.
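One way to follow up on the X2-X3 correlation is to compute variance inflation factors. A sketch in base R (vif() from the car package would give the same numbers):
# VIF for a predictor: 1 / (1 - R^2) from regressing it on the other two
vif = function(v, others, data) {
  r2 = summary(lm(reformulate(others, response = v), data = data))$r.squared
  1 / (1 - r2)
}
c(X1 = vif('X1', c('X2', 'X3'), B),
  X2 = vif('X2', c('X1', 'X3'), B),
  X3 = vif('X3', c('X1', 'X2'), B))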
fit = lm(Y ~ X1 + X2 + X3, data = B)
fit$coefficients
## (Intercept) X1 X2 X3
## 158.4912517 -1.1416118 -0.4420043 -13.4701632
The estimated regression function is: Yhat = 158.4912517 - 1.1416118 * X1 - 0.4420043 * X2 - 13.4701632 * X3.
b2 = -0.4420043 is the change in a patient's expected satisfaction per unit increase in X2, holding age (X1) and anxiety level (X3) constant. So the coefficient tells us that mean satisfaction decreases by about 0.442 for each unit increase in X2.
res = fit$residuals
boxplot(res, main = 'Residuals')
As the boxplot shows, there are no outliers in the distribution of residuals.
plot(fit$fitted.values, fit$res, xlab = 'Fitted values', ylab = 'Residuals'); abline(h = 0)
plot(B$X1, fit$res, xlab = 'X1', ylab = 'Residuals'); abline(h = 0)
plot(B$X2, fit$res, xlab = 'X2', ylab = 'Residuals'); abline(h = 0)
plot(B$X3, fit$res, xlab = 'X3', ylab = 'Residuals'); abline(h = 0)
As you can see, there is no apparent pattern in these plots, which indicates that the linearity assumption is not violated. There is also no sign of heteroskedasticity, since the variance appears roughly constant across the range of values. The spread around 0 is fairly consistent, and there are no apparent outliers.
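A formal test of constant error variance could back up the visual check; a sketch assuming the lmtest package is installed:
library(lmtest)
bptest(fit)   # Breusch-Pagan test: a small p-value would indicate non-constant variance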
plot(B$X1 * B$X2, fit$res, xlab = 'X1*X2', ylab = 'Residuals'); abline(h = 0)
plot(B$X1 * B$X3, fit$res, xlab = 'X1*X3', ylab = 'Residuals'); abline(h = 0)
plot(B$X2 * B$X3, fit$res, xlab = 'X2*X3', ylab = 'Residuals'); abline(h = 0)
There is no apparent pattern in any of the plots of the residuals against the interaction terms, so there is no indication that any of these interaction terms should be included in the model.
qqnorm(fit$res); qqline(fit$res)
What we want to see here is the points falling on the theoretical line, and most of them do; those that do not fall exactly on the line are very close to it. Overall, the plot suggests that the residuals are approximately normally distributed.
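As a numeric complement to the Q-Q plot, a normality test on the residuals is one option; a minimal sketch:
shapiro.test(fit$residuals)   # a large p-value is consistent with the Q-Q plot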
B = read.table('CH06PR15.txt',header=T)
fit = lm(Y ~ X1 + X2 + X3, data = B)
anova(fit)
## Analysis of Variance Table
##
## Response: Y
## Df Sum Sq Mean Sq F value Pr(>F)
## X1 1 8275.4 8275.4 81.8026 2.059e-11 ***
## X2 1 480.9 480.9 4.7539 0.03489 *
## X3 1 364.2 364.2 3.5997 0.06468 .
## Residuals 42 4248.8 101.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
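The sequential sums of squares above combine into the overall F test of H0: beta1 = beta2 = beta3 = 0; a quick sketch:
summary(fit)$fstatistic        # overall F statistic and its degrees of freedom
SSR = 8275.4 + 480.9 + 364.2   # regression sum of squares from the table
(SSR / 3) / 101.2              # F = MSR / MSE on 3 and 42 df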
coef = fit$coef
s.coef = summary(fit)$coef[, 2]         # standard errors of the coefficients
n = nrow(B)
p = 4                                   # number of regression parameters
g = 3                                   # number of simultaneous intervals
alpha = 0.1
Bonf = qt(1 - alpha / (2 * g), n - p)   # Bonferroni critical value
lbd = coef - Bonf * s.coef
ubd = coef + Bonf * s.coef
rbind(lbd, ubd)
## (Intercept) X1 X2 X3
## lbd 118.6076 -1.6142482 -1.5245098 -29.092028
## ubd 198.3749 -0.6689755 0.6405013 2.151701
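As a check, confint() at the Bonferroni-adjusted level reproduces the bounds for the three slope coefficients; a quick sketch:
confint(fit, c('X1', 'X2', 'X3'), level = 1 - alpha / g)   # each interval at level 1 - alpha/g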
summary(fit)$r.sq
## [1] 0.6821943
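For comparing models with different numbers of predictors, the adjusted R-squared is also available; a one-line sketch:
summary(fit)$adj.r.squared   # penalizes R^2 for the number of predictors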
xnew = data.frame(X1 = 35, X2 = 45, X3 = 2.2); xnew; fit
## X1 X2 X3
## 1 35 45 2.2
##
## Call:
## lm(formula = Y ~ X1 + X2 + X3, data = B)
##
## Coefficients:
## (Intercept) X1 X2 X3
## 158.491 -1.142 -0.442 -13.470
predict(fit, newdata = xnew, interval = 'confidence', se.fit = F, level = 0.90 )
## fit lwr upr
## 1 69.01029 64.52854 73.49204
We are 90% confident that the mean response when X1 = 35, X2 = 45, and X3 = 2.2 is between 64.52854 and 73.49204.
predict(fit, newdata = xnew, interval = 'prediction', se.fit = F, level = 0.90 )
## fit lwr upr
## 1 69.01029 51.50965 86.51092
We are 90% confident that a new observation of the response when X1 = 35, X2 = 45, and X3 = 2.2 will be between 51.50965 and 86.51092.
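The same interval can be reproduced with the matrix formula used at the start of this write-up; a sketch, with model.matrix() supplying the design matrix of the multiple regression:
Xm = model.matrix(fit)                  # n x p design matrix
MSEm = sum(fit$residuals^2) / (n - p)   # n and p as defined above
xh = c(1, 35, 45, 2.2)
s2pred = MSEm * (1 + t(xh) %*% solve(t(Xm) %*% Xm) %*% xh)
yhat.h = sum(xh * coef(fit))
yhat.h + c(-1, 1) * qt(0.95, n - p) * sqrt(as.numeric(s2pred))   # 90% prediction interval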