SIT2009 Finals, Semester 2, 2022/2023

2. The systolic blood pressure data set contains data on 20 randomly selected patients, aged 45-60 years, recruited from a hospital. All variables are summarized in Table 1.

  1. Fit a multiple linear regression model using all predictor variables. Briefly discuss your findings.

    library(readxl)
    systolic_bp <- read_excel("bloodpressure.xlsx")
    systolic_bp
    # A tibble: 20 × 7
          bp   age weight   bsa duration pulse stress
       <dbl> <dbl>  <dbl> <dbl>    <dbl> <dbl>  <dbl>
     1   105    47   85.4  1.75      5.1    63     33
     2   115    49   94.2  2.1       3.8    70     14
     3   116    49   95.3  1.98      8.2    72     10
     4   117    50   94.7  2.01      5.8    73     99
     5   112    51   89.4  1.89      7      72     95
     6   121    48   99.5  2.25      9.3    71     10
     7   121    49   99.8  2.25      2.5    69     42
     8   110    47   90.9  1.9       6.2    66      8
     9   110    49   89.9  1.83      7.1    69     62
    10   114    48   92.7  2.07      5.6    64     35
    11   114    47   94.4  2.07      5.3    74     90
    12   115    49   94.1  1.98      5.6    71     21
    13   114    50   91.6  2.05     10.2    68     47
    14   106    45   87.1  1.92      5.6    67     80
    15   125    52  101.   2.19     10      76     98
    16   114    46   94.5  1.98      7.4    69     95
    17   106    46   87    1.87      3.6    62     18
    18   113    46   94.5  1.9       4.3    70     12
    19   110    48   90.5  1.88      9      71     99
    20   122    56   95.7  2.09      7      75     99
    str(systolic_bp)
    tibble [20 × 7] (S3: tbl_df/tbl/data.frame)
     $ bp      : num [1:20] 105 115 116 117 112 121 121 110 110 114 ...
     $ age     : num [1:20] 47 49 49 50 51 48 49 47 49 48 ...
     $ weight  : num [1:20] 85.4 94.2 95.3 94.7 89.4 99.5 99.8 90.9 89.9 92.7 ...
     $ bsa     : num [1:20] 1.75 2.1 1.98 2.01 1.89 2.25 2.25 1.9 1.83 2.07 ...
     $ duration: num [1:20] 5.1 3.8 8.2 5.8 7 9.3 2.5 6.2 7.1 5.6 ...
     $ pulse   : num [1:20] 63 70 72 73 72 71 69 66 69 64 ...
     $ stress  : num [1:20] 33 14 10 99 95 10 42 8 62 35 ...
    ## response and predictor aliases used throughout the answers
    y  <- systolic_bp$bp
    x1 <- systolic_bp$age
    x2 <- systolic_bp$weight
    x3 <- systolic_bp$bsa
    x4 <- systolic_bp$duration
    x5 <- systolic_bp$pulse
    x6 <- systolic_bp$stress
    
    systolic_bp_model <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6, data = systolic_bp)
    summary(systolic_bp_model)
    
    Call:
    lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6, data = systolic_bp)
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -0.84629 -0.24890  0.08592  0.28963  0.53149 
    
    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
    (Intercept) -12.894244   2.790338  -4.621 0.000479 ***
    x1            0.692539   0.053931  12.841 9.23e-09 ***
    x2            0.955487   0.068101  14.030 3.14e-09 ***
    x3            4.488179   1.685262   2.663 0.019516 *  
    x4            0.063758   0.052794   1.208 0.248680    
    x5           -0.077672   0.056046  -1.386 0.189102    
    x6            0.005355   0.003715   1.441 0.173158    
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 0.4438 on 13 degrees of freedom
    Multiple R-squared:  0.9954,    Adjusted R-squared:  0.9933 
    F-statistic: 471.8 on 6 and 13 DF,  p-value: 1.952e-14

    The fitted full model is \(\hat{y} = -12.8942 + 0.6925x_1 + 0.9555x_2 + 4.4882x_3 + 0.0638x_4 - 0.0777x_5 + 0.0054x_6\). Age (\(x_1\)), weight (\(x_2\)) and body surface area (\(x_3\)) are statistically significant at the 5% level, while duration (\(x_4\)), pulse (\(x_5\)) and stress (\(x_6\)) are not. The model fits the data very well (\(R^2 = 0.9954\)) and the overall F-test is highly significant (p-value = 1.952e-14).
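
    As an aside, the x1 to x6 aliases are not required: the same model can be fitted directly from the column names shown by str(systolic_bp) (a sketch, equivalent to the call above):

    ## equivalent fit using the data frame's column names
    lm(bp ~ age + weight + bsa + duration + pulse + stress, data = systolic_bp)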
  2. Find the 95% confidence interval of the regression coefficient for age. Interpret your result.

    ## 95% CI for the age coefficient (row 2 = x1)
    confint(systolic_bp_model, level = 0.95)[2,]
        2.5 %    97.5 % 
    0.5760283 0.8090499 

    We are 95% confident that the interval (0.5760, 0.8090) contains the true regression coefficient for age: holding the other predictors fixed, each additional year of age increases mean systolic blood pressure by between 0.576 and 0.809 mmHg. Since the interval excludes zero, age is a significant predictor at the 5% level.
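
    As a quick check (a sketch, assuming the usual t-based interval that confint() uses for lm fits), the same bounds follow from the estimate \(\pm\, t_{0.025,\,13}\) times its standard error:

    ## t-based 95% CI for the age coefficient, reproducing confint()
    est <- coef(summary(systolic_bp_model))["x1", "Estimate"]
    se  <- coef(summary(systolic_bp_model))["x1", "Std. Error"]
    est + c(-1, 1) * qt(0.975, df = df.residual(systolic_bp_model)) * se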
  3. State the coefficient of determination, \(R^2\), and the adjusted coefficient of determination, \(R_{adj}^2\), for the full model. Is it sufficient to refer only to the coefficient of determination to assess the model fit? Justify your answer.

    summary(systolic_bp_model)$r.squared
    [1] 0.9954283
    summary(systolic_bp_model)$adj.r.squared
    [1] 0.9933183

    No, because \(R^2\) does not penalize overfitting: it can only increase as more predictors are added. \(R_{adj}^2\) should be consulted instead, because it adjusts \(R^2\) by penalizing additional predictors that do not sufficiently improve model performance.
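
    The adjustment can be verified from the definition \(R_{adj}^2 = 1 - (1 - R^2)\frac{n-1}{n-p-1}\); a minimal sketch:

    ## adjusted R^2 from its definition
    n <- nrow(systolic_bp)   # 20 observations
    p <- 6                   # predictors in the full model
    1 - (1 - summary(systolic_bp_model)$r.squared) * (n - 1) / (n - p - 1)
    ## agrees with summary()'s 0.9933183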

  4. Use forward stepwise regression to select the model that best fits the data, based on the Akaike Information Criterion (AIC).

    i. Comment on the findings and state the final model.

    library(MASS)   # step() itself is in base stats; MASS offers the similar stepAIC()
    step(lm(y ~ 1, data = systolic_bp), direction = "forward", 
         scope = ~ x1 + x2 + x3 + x4 + x5 + x6)
    Start:  AIC=68.64
    y ~ 1
    
           Df Sum of Sq    RSS    AIC
    + x2    1    505.96  54.04 23.880
    + x3    1    419.86 140.14 42.938
    + x5    1    291.44 268.56 55.946
    + x1    1    243.27 316.73 59.247
    <none>              560.00 68.644
    + x4    1     48.02 511.98 68.851
    + x6    1     15.04 544.96 70.099
    
    Step:  AIC=23.88
    y ~ x2
    
           Df Sum of Sq    RSS     AIC
    + x1    1    48.357  5.684 -19.161
    + x6    1     9.345 44.695  22.083
    + x5    1     8.358 45.683  22.520
    + x4    1     5.634 48.407  23.678
    <none>              54.041  23.880
    + x3    1     3.352 50.689  24.600
    
    Step:  AIC=-19.16
    y ~ x2 + x1
    
           Df Sum of Sq    RSS     AIC
    + x3    1   2.31923 3.3648 -27.647
    + x5    1   0.54092 5.1431 -19.162
    <none>              5.6840 -19.161
    + x6    1   0.15299 5.5310 -17.707
    + x4    1   0.13436 5.5497 -17.640
    
    Step:  AIC=-27.65
    y ~ x2 + x1 + x3
    
           Df Sum of Sq    RSS     AIC
    <none>              3.3648 -27.647
    + x4    1  0.301849 3.0629 -27.527
    + x6    1  0.216683 3.1481 -26.979
    + x5    1  0.028289 3.3365 -25.816
    
    Call:
    lm(formula = y ~ x2 + x1 + x3, data = systolic_bp)
    
    Coefficients:
    (Intercept)           x2           x1           x3  
       -13.6838       0.8974       0.6932       5.2149  

    The final model is given by

    \[ \hat{y} = -13.6838 + 0.8974x_2 + 0.6932x_1 + 5.2149x_3 \]

    where \(x_1\) is age, \(x_2\) is weight and \(x_3\) is body surface area (bsa).
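
    Note that step() reports AIC on the extractAIC() scale, \(n\log(\mathrm{RSS}/n) + 2k\), which differs from AIC() by an additive constant. The final step's value can be reproduced as a quick sketch:

    ## reproduce the final step's AIC of -27.647; returns (edf, AIC)
    extractAIC(lm(y ~ x2 + x1 + x3, data = systolic_bp))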

    ii. Are there any substantial differences between the models obtained in parts (a) and (d), in terms of significant predictor variables and changes in \(R^2\) and \(R_{adj}^2\)? Justify your answer.

    ## refit the model selected by forward stepwise AIC
    aic_model <- lm(y ~ x2 + x1 + x3, data = systolic_bp)
    
    summary(aic_model)
    
    Call:
    lm(formula = y ~ x2 + x1 + x3, data = systolic_bp)
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -0.77395 -0.38986  0.07945  0.31585  0.66234 
    
    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
    (Intercept) -13.68384    2.78007  -4.922 0.000153 ***
    x2            0.89743    0.05105  17.580 6.92e-12 ***
    x1            0.69324    0.04621  15.001 7.63e-11 ***
    x3            5.21494    1.57035   3.321 0.004325 ** 
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 0.4586 on 16 degrees of freedom
    Multiple R-squared:  0.994, Adjusted R-squared:  0.9929 
    F-statistic: 882.3 on 3 and 16 DF,  p-value: < 2.2e-16
    summary(systolic_bp_model)
    
    Call:
    lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6, data = systolic_bp)
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -0.84629 -0.24890  0.08592  0.28963  0.53149 
    
    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
    (Intercept) -12.894244   2.790338  -4.621 0.000479 ***
    x1            0.692539   0.053931  12.841 9.23e-09 ***
    x2            0.955487   0.068101  14.030 3.14e-09 ***
    x3            4.488179   1.685262   2.663 0.019516 *  
    x4            0.063758   0.052794   1.208 0.248680    
    x5           -0.077672   0.056046  -1.386 0.189102    
    x6            0.005355   0.003715   1.441 0.173158    
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 0.4438 on 13 degrees of freedom
    Multiple R-squared:  0.9954,    Adjusted R-squared:  0.9933 
    F-statistic: 471.8 on 6 and 13 DF,  p-value: 1.952e-14

    There is no substantial difference between the two models: the significant predictors of the full model (\(x_1\), \(x_2\), \(x_3\)) are exactly those retained by the stepwise selection, \(R^2\) only drops from 0.9954 to 0.9940, and \(R_{adj}^2\) from 0.9933 to 0.9929. This is because \(x_4\), \(x_5\) and \(x_6\) do not meaningfully improve model performance.
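
    This can also be confirmed formally (a sketch, beyond what the question requires) with a partial F-test of the three dropped predictors; given the summaries above, a non-significant p-value is expected:

    ## partial F-test: do x4, x5, x6 jointly improve the reduced model?
    anova(aic_model, systolic_bp_model)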

3. a) Based on 2(a),

  1. Plot the quantile-quantile (QQ) graph for the residuals and perform the Shapiro-Wilk normality test at the 5% significance level. Is there any problem with the normality assumption? Interpret your results.

    ## QQ plot of the studentized residuals
    ## (named rstud to avoid masking the base function resid())
    rstud <- rstudent(systolic_bp_model)
    qqnorm(rstud, main = "QQ Normality Plot")
    qqline(rstud)

    ## Shapiro-Wilk test on the raw residuals
    shapiro.test(resid(systolic_bp_model))
    
        Shapiro-Wilk normality test
    
    data:  resid(systolic_bp_model)
    W = 0.92231, p-value = 0.1098

    The Shapiro-Wilk p-value (0.1098) is greater than 0.05, so we fail to reject the null hypothesis that the residuals are normally distributed. The normality assumption is not violated.

  2. Plot the residuals versus the fitted values and test for constant variance at 5% significance level using the Breusch-Pagan studentized test. Give comments about the constant variance assumption of the model.

    ## Residuals vs fitted values plot
    plot(fitted(systolic_bp_model), rstud, xlab = "fitted values",
         ylab = "studentized residuals", main = "Residual vs. Fitted plot")
    abline(h = 0, col = 2, lwd = 2)
    
    ## Breusch-Pagan Test
    library(lmtest)
    bptest(systolic_bp_model)
    
        studentized Breusch-Pagan test
    
    data:  systolic_bp_model
    BP = 2.9208, df = 6, p-value = 0.8187

    The Breusch-Pagan p-value (0.8187) is greater than 0.05, so we fail to reject the null hypothesis of constant variance. The residuals share a common variance (homoscedastic), so the constant variance assumption holds.

3. b) Based on 2(d),

  1. Give suggestions on how to deal with influential points.

    ## Cook's distance for the selected model
    cd <- cooks.distance(aic_model)
    plot(cd, ylim = c(0, 0.25))
    n <- nrow(systolic_bp)              # 20 observations
    abline(h = 4/n, col = 2, lwd = 1)   # conventional 4/n cutoff

    which(cd > 4/n)
    named integer(0)
    ## no observation exceeds the cutoff; the removal below is illustrative only
    new_systolic_bp <- systolic_bp[-c(7,19),]

    To deal with influential points, first check whether the flagged observations are data-entry errors; if they are genuine but distort the fit, remove them and refit the model on the reduced data, as sketched below.
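
    A minimal refit sketch, assuming the hypothetical influential points 7 and 19 dropped above (none were actually flagged here); the data-frame columns are used directly because x1 to x6 were built from the full data, and refit_model is a name introduced for illustration:

    ## hypothetical refit after dropping observations 7 and 19
    refit_model <- lm(bp ~ weight + age + bsa, data = new_systolic_bp)
    summary(refit_model)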

  2. Perform the model adequacy checks for the data. Identify the influential points if any.

    ## standard diagnostic plots: residuals vs fitted, QQ, scale-location, leverage
    par(mfrow = c(2,2))
    plot(aic_model)

    shapiro.test(resid(aic_model))
    
        Shapiro-Wilk normality test
    
    data:  resid(aic_model)
    W = 0.94994, p-value = 0.3662
    bptest(aic_model)
    
        studentized Breusch-Pagan test
    
    data:  aic_model
    BP = 1.7264, df = 3, p-value = 0.6311
    par(mfrow = c(1,1))
    plot(aic_model, which = 5)

    The p-values of the Shapiro-Wilk and Breusch-Pagan tests are not statistically significant, so the residuals of the model satisfy the normality and homoscedasticity assumptions. The Cook's distance and residuals-vs-leverage plots likewise flag no influential points.
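
    Beyond Cook's distance, R's built-in influence diagnostics can be consulted in one call (a sketch): influence.measures() flags observations by DFBETAS, DFFITS, covariance ratio, Cook's distance and hat values.

    ## all standard influence diagnostics; flagged observations are starred
    summary(influence.measures(aic_model))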

4. Using the systolic blood pressure dataset in question 2, we are interested in investigating high blood pressure among patients.

  1. Change the response variable, blood pressure/bp (mmHg) into a categorical variable according to Table 2 using R. Perform logistic regression analysis for this data using all predictor variables.

    ## recoding bp into a binary variable (high blood pressure: bp >= 120 mmHg)
    y_0 <- replace(y, (y <  120), 0); y_0
     [1]   0   0   0   0   0 121 121   0   0   0   0   0   0   0 125   0   0   0   0
    [20] 122
    y_new <- replace(y_0, (y_0 >= 120), 1); y_new
     [1] 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 1
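    ## equivalent one-step recoding (a sketch): y_new <- as.integer(y >= 120)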
    ## logistic regression analysis
    ## NOTE: glm() defaults to family = gaussian, so this call actually fits an
    ## ordinary linear model to the 0/1 response (see the "gaussian family"
    ## dispersion line in the output below); family = binomial is required for a
    ## true logistic regression (see the corrected sketch after the output).
    logistic.model <- glm(y_new ~ x1 + x2 + x3 + x4 + x5 + x6, data = systolic_bp)
    summary(logistic.model)
    
    Call:
    glm(formula = y_new ~ x1 + x2 + x3 + x4 + x5 + x6, data = systolic_bp)
    
    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)   
    (Intercept) -6.813209   1.765758  -3.859  0.00198 **
    x1           0.065355   0.034128   1.915  0.07775 . 
    x2           0.057149   0.043095   1.326  0.20763   
    x3           0.798302   1.066453   0.749  0.46745   
    x4           0.002136   0.033408   0.064  0.95000   
    x5          -0.045820   0.035466  -1.292  0.21888   
    x6           0.001785   0.002351   0.759  0.46120   
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    (Dispersion parameter for gaussian family taken to be 0.07886281)
    
        Null deviance: 3.2000  on 19  degrees of freedom
    Residual deviance: 1.0252  on 13  degrees of freedom
    AIC: 13.341
    
    Number of Fisher Scoring iterations: 2
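
    A corrected sketch of the logistic fit follows (logit_model is a name introduced here; its output is not shown and would differ from the table above):

    ## logistic regression proper: binomial family with the default logit link
    logit_model <- glm(y_new ~ x1 + x2 + x3 + x4 + x5 + x6,
                       family = binomial, data = systolic_bp)
    summary(logit_model)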
  2. Estimate the odds ratio and find the 95% confidence intervals of regression coefficients for age (years) and weight (kg). Justify your answers.

    ## odds ratios as exponentiated coefficients
    ## (the odds-ratio reading assumes the logit link; see the note above)
    exp(coef(logistic.model))
    (Intercept)          x1          x2          x3          x4          x5 
     0.00109916  1.06753829  1.05881360  2.22176486  1.00213818  0.95521415 
             x6 
     1.00178693 
    ## confidence intervals for age and weight
    confint(logistic.model, level = 0.95)[2:3,]
    Waiting for profiling to be done...
             2.5 %    97.5 %
    x1 -0.00153457 0.1322452
    x2 -0.02731570 0.1416138

    The estimated odds ratios are 1.0675 for age and 1.0588 for weight: each additional year of age multiplies the odds of high blood pressure by about 1.07, and each additional kg of weight by about 1.06, holding the other predictors fixed. However, both 95% confidence intervals for the coefficients, (-0.0015, 0.1322) for age and (-0.0273, 0.1416) for weight, contain zero, so neither predictor is statistically significant at the 5% level.
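
    Confidence intervals for the odds ratios themselves follow by exponentiating the coefficient intervals (a sketch):

    ## 95% CIs for the odds ratios of age and weight
    exp(confint(logistic.model, level = 0.95)[2:3, ])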