2. The systolic blood pressure data set contains 20 randomly selected patients in the age group of 45-60 years recruited from a hospital. All variables have been summarized in Table 1.
Fit a multiple linear regression model using all predictor variables. Briefly discuss your findings.
Call:
lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6, data = systolic_bp)
Residuals:
Min 1Q Median 3Q Max
-0.84629 -0.24890 0.08592 0.28963 0.53149
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -12.894244 2.790338 -4.621 0.000479 ***
x1 0.692539 0.053931 12.841 9.23e-09 ***
x2 0.955487 0.068101 14.030 3.14e-09 ***
x3 4.488179 1.685262 2.663 0.019516 *
x4 0.063758 0.052794 1.208 0.248680
x5 -0.077672 0.056046 -1.386 0.189102
x6 0.005355 0.003715 1.441 0.173158
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4438 on 13 degrees of freedom
Multiple R-squared: 0.9954, Adjusted R-squared: 0.9933
F-statistic: 471.8 on 6 and 13 DF, p-value: 1.952e-14
Find the 95% confidence interval of the regression coefficient for age. Interpret your result.
confint(systolic_bp_model, level =0.95)[2,]
2.5 % 97.5 %
0.5760283 0.8090499
State the coefficient of determination, \(R^2\) and adjusted coefficient of determination, \(R_{adj}^2\) for the full model. Is it sufficient to only refer to the coefficient of determination to assess the model fit? Justify your answer.
summary(systolic_bp_model)$r.squared
[1] 0.9954283
summary(systolic_bp_model)$adj.r
[1] 0.9933183
No, because \(R^2\) does not penalize for overfitting, since it increases as more predictors are added. \(R_{adj}^2\) should be referred instead because it adjusts \(R^2\) by penalizing it for additional predictors that does not sufficienctly improve model performance.
Use forward stepwise regression to select the best model which is the “best” fit to the data, ased on Akaike Information Criterion (AIC).
i. Comment on the findings and state the final model.
library(MASS)step(lm(y ~1, data = systolic_bp), direction ="forward", scope =~ x1 + x2 + x3 + x4 + x5 + x6)
ii. Are there any substantial differences in the models obtained in part (a) and (d), in terms of significant predictor variables d changes in \(R^2\) and \(R_{adj}^2\) ? Justify your answer.
Call:
lm(formula = y ~ x2 + x1 + x3, data = systolic_bp)
Residuals:
Min 1Q Median 3Q Max
-0.77395 -0.38986 0.07945 0.31585 0.66234
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -13.68384 2.78007 -4.922 0.000153 ***
x2 0.89743 0.05105 17.580 6.92e-12 ***
x1 0.69324 0.04621 15.001 7.63e-11 ***
x3 5.21494 1.57035 3.321 0.004325 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4586 on 16 degrees of freedom
Multiple R-squared: 0.994, Adjusted R-squared: 0.9929
F-statistic: 882.3 on 3 and 16 DF, p-value: < 2.2e-16
summary(systolic_bp_model)
Call:
lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6, data = systolic_bp)
Residuals:
Min 1Q Median 3Q Max
-0.84629 -0.24890 0.08592 0.28963 0.53149
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -12.894244 2.790338 -4.621 0.000479 ***
x1 0.692539 0.053931 12.841 9.23e-09 ***
x2 0.955487 0.068101 14.030 3.14e-09 ***
x3 4.488179 1.685262 2.663 0.019516 *
x4 0.063758 0.052794 1.208 0.248680
x5 -0.077672 0.056046 -1.386 0.189102
x6 0.005355 0.003715 1.441 0.173158
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4438 on 13 degrees of freedom
Multiple R-squared: 0.9954, Adjusted R-squared: 0.9933
F-statistic: 471.8 on 6 and 13 DF, p-value: 1.952e-14
There is no substantial difference in the values of \(R^2\) and \(R_{adj}^2\) between the two models. This is because the variables \(x_4, x_5, x_6\) does not improve model performance.
3. a) Based on 2(a),
Plot the quantile-quantile (qq) graph for the residuals and perform the Shapiro-Wilks normality test at 5% significance level. Is there any problem with the normality assumption? Interpret your results.
## QQ plotresid <-rstudent(systolic_bp_model)qqnorm(resid, main ="QQ Normality Plot")qqline(resid)
Shapiro-Wilk normality test
data: resid(systolic_bp_model)
W = 0.92231, p-value = 0.1098
The p-value associated with the residuals is more than 0.05, meaning it’s not statistically significant. The normality assumption is not violated.
Plot the residuals versus the fitted values and test for constant variance at 5% significance level using the Breusch-Pagan studentized test. Give comments about the constant variance assumption of the model.
## Residuals vs Fitted value plotplot(fitted(systolic_bp_model), resid, xlab ="predicted", main ="Residual vs. Fitted plot")abline(h =0, col =2, lwd =2)## Breusch-Pagan Testlibrary(lmtest)
Loading required package: zoo
Attaching package: 'zoo'
The following objects are masked from 'package:base':
as.Date, as.Date.numeric
bptest(systolic_bp_model)
studentized Breusch-Pagan test
data: systolic_bp_model
BP = 2.9208, df = 6, p-value = 0.8187
The p-value associated with the model is more than 0.05, meaning it’s not statistically significant. The residuals share a common variance (homoskedastic)
3. b) Based on 2(d),
Give suggestions how to deal with influential points,
## cook's distance for the modelcd <-cooks.distance(aic_model)plot(cd, ylim =c(0, 0.25))n <-20abline(h =4/n, col =2, lwd =1)
which(cd >4/n)
named integer(0)
new_systolic_bp <- systolic_bp[-c(7,19),]
To deal with influential points, simply remove the associated observations from the data.
Perform the model adequacy checks for the data. Identify the influential points if any.
par(mfrow =c(2,2))plot(aic_model)
shapiro.test(resid(aic_model))
Shapiro-Wilk normality test
data: resid(aic_model)
W = 0.94994, p-value = 0.3662
bptest(aic_model)
studentized Breusch-Pagan test
data: aic_model
BP = 1.7264, df = 3, p-value = 0.6311
par(mfrow =c(1,1))plot(aic_model, which =5)
The p-values associated with the Wilks-Shapiro and Breusch-Pagan tests are not statistically significant, so the residuals of the model fulfills the normality and homoscedasticity assumptions. Also, the model has no influential points.
4. Using the systolic blood pressure dataset in question 2, we are interested in investigating high blood pressure among patients.
Change the response variable, blood pressure/bp (mmHg) into a categorical variable according to Table 2 using R. Perform logistic regression analysis for this data using all predictor variables.
## changing bp into categorical variablesy_0 <-replace(y, (y <120), 0); y_0