SIT2009 Finals, Semester 2, 2022/2023

2. The systolic blood pressure data set contains data on 20 randomly selected patients, aged 45-60 years, recruited from a hospital. All variables are summarized in Table 1.

  1. Fit a multiple linear regression model using all predictor variables. Briefly discuss your findings.

    library(readxl)
    systolic_bp <- read_excel("bloodpressure.xlsx")
    systolic_bp
    # A tibble: 20 × 7
          bp   age weight   bsa duration pulse stress
       <dbl> <dbl>  <dbl> <dbl>    <dbl> <dbl>  <dbl>
     1   105    47   85.4  1.75      5.1    63     33
     2   115    49   94.2  2.1       3.8    70     14
     3   116    49   95.3  1.98      8.2    72     10
     4   117    50   94.7  2.01      5.8    73     99
     5   112    51   89.4  1.89      7      72     95
     6   121    48   99.5  2.25      9.3    71     10
     7   121    49   99.8  2.25      2.5    69     42
     8   110    47   90.9  1.9       6.2    66      8
     9   110    49   89.9  1.83      7.1    69     62
    10   114    48   92.7  2.07      5.6    64     35
    11   114    47   94.4  2.07      5.3    74     90
    12   115    49   94.1  1.98      5.6    71     21
    13   114    50   91.6  2.05     10.2    68     47
    14   106    45   87.1  1.92      5.6    67     80
    15   125    52  101.   2.19     10      76     98
    16   114    46   94.5  1.98      7.4    69     95
    17   106    46   87    1.87      3.6    62     18
    18   113    46   94.5  1.9       4.3    70     12
    19   110    48   90.5  1.88      9      71     99
    20   122    56   95.7  2.09      7      75     99
    str(systolic_bp)
    tibble [20 × 7] (S3: tbl_df/tbl/data.frame)
     $ bp      : num [1:20] 105 115 116 117 112 121 121 110 110 114 ...
     $ age     : num [1:20] 47 49 49 50 51 48 49 47 49 48 ...
     $ weight  : num [1:20] 85.4 94.2 95.3 94.7 89.4 99.5 99.8 90.9 89.9 92.7 ...
     $ bsa     : num [1:20] 1.75 2.1 1.98 2.01 1.89 2.25 2.25 1.9 1.83 2.07 ...
     $ duration: num [1:20] 5.1 3.8 8.2 5.8 7 9.3 2.5 6.2 7.1 5.6 ...
     $ pulse   : num [1:20] 63 70 72 73 72 71 69 66 69 64 ...
     $ stress  : num [1:20] 33 14 10 99 95 10 42 8 62 35 ...
    ## response and predictor aliases used throughout the answers
    y  <- systolic_bp$bp
    x1 <- systolic_bp$age
    x2 <- systolic_bp$weight
    x3 <- systolic_bp$bsa
    x4 <- systolic_bp$duration
    x5 <- systolic_bp$pulse
    x6 <- systolic_bp$stress
    
    systolic_bp_model <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6, data = systolic_bp)
    summary(systolic_bp_model)
    
    Call:
    lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6, data = systolic_bp)
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -0.84629 -0.24890  0.08592  0.28963  0.53149 
    
    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
    (Intercept) -12.894244   2.790338  -4.621 0.000479 ***
    x1            0.692539   0.053931  12.841 9.23e-09 ***
    x2            0.955487   0.068101  14.030 3.14e-09 ***
    x3            4.488179   1.685262   2.663 0.019516 *  
    x4            0.063758   0.052794   1.208 0.248680    
    x5           -0.077672   0.056046  -1.386 0.189102    
    x6            0.005355   0.003715   1.441 0.173158    
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 0.4438 on 13 degrees of freedom
    Multiple R-squared:  0.9954,    Adjusted R-squared:  0.9933 
    F-statistic: 471.8 on 6 and 13 DF,  p-value: 1.952e-14

    The fitted full model is \(\hat{y} = -12.8942 + 0.6925x_1 + 0.9555x_2 + 4.4882x_3 + 0.0638x_4 - 0.0777x_5 + 0.0054x_6\). Age (\(x_1\)), weight (\(x_2\)) and body surface area (\(x_3\)) are statistically significant at the 5% level, while duration (\(x_4\)), pulse (\(x_5\)) and stress (\(x_6\)) are not. The model fits the data very well (\(R^2 = 0.9954\)) and the overall F-test is highly significant (p-value = 1.952e-14).
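
    As an aside, the x1 to x6 aliases are not required: the same model can be fitted directly from the column names shown by str(systolic_bp) (a sketch, equivalent to the call above):

    ## equivalent fit using the data frame's column names
    lm(bp ~ age + weight + bsa + duration + pulse + stress, data = systolic_bp)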
  2. Find the 95% confidence interval of the regression coefficient for age. Interpret your result.

    ## 95% CI for the age coefficient (row 2 = x1)
    confint(systolic_bp_model, level = 0.95)[2,]
        2.5 %    97.5 % 
    0.5760283 0.8090499 

    We are 95% confident that the interval (0.5760, 0.8090) contains the true regression coefficient for age: holding the other predictors fixed, each additional year of age increases mean systolic blood pressure by between 0.576 and 0.809 mmHg. Since the interval excludes zero, age is a significant predictor at the 5% level.
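
    As a quick check (a sketch, assuming the usual t-based interval that confint() uses for lm fits), the same bounds follow from the estimate \(\pm\, t_{0.025,\,13}\) times its standard error:

    ## t-based 95% CI for the age coefficient, reproducing confint()
    est <- coef(summary(systolic_bp_model))["x1", "Estimate"]
    se  <- coef(summary(systolic_bp_model))["x1", "Std. Error"]
    est + c(-1, 1) * qt(0.975, df = df.residual(systolic_bp_model)) * se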
  3. State the coefficient of determination, \(R^2\), and the adjusted coefficient of determination, \(R_{adj}^2\), for the full model. Is it sufficient to refer only to the coefficient of determination to assess the model fit? Justify your answer.

    summary(systolic_bp_model)$r.squared
    [1] 0.9954283
    summary(systolic_bp_model)$adj.r.squared
    [1] 0.9933183

    No, because \(R^2\) does not penalize overfitting: it can only increase as more predictors are added. \(R_{adj}^2\) should be consulted instead, because it adjusts \(R^2\) by penalizing additional predictors that do not sufficiently improve model performance.
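
    The adjustment can be verified from the definition \(R_{adj}^2 = 1 - (1 - R^2)\frac{n-1}{n-p-1}\); a minimal sketch:

    ## adjusted R^2 from its definition
    n <- nrow(systolic_bp)   # 20 observations
    p <- 6                   # predictors in the full model
    1 - (1 - summary(systolic_bp_model)$r.squared) * (n - 1) / (n - p - 1)
    ## agrees with summary()'s 0.9933183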

  4. Use forward stepwise regression to select the model that best fits the data, based on the Akaike Information Criterion (AIC).

    i. Comment on the findings and state the final model.

    library(MASS)   # step() itself is in base stats; MASS offers the similar stepAIC()
    step(lm(y ~ 1, data = systolic_bp), direction = "forward", 
         scope = ~ x1 + x2 + x3 + x4 + x5 + x6)
    Start:  AIC=68.64
    y ~ 1
    
           Df Sum of Sq    RSS    AIC
    + x2    1    505.96  54.04 23.880
    + x3    1    419.86 140.14 42.938
    + x5    1    291.44 268.56 55.946
    + x1    1    243.27 316.73 59.247
    <none>              560.00 68.644
    + x4    1     48.02 511.98 68.851
    + x6    1     15.04 544.96 70.099
    
    Step:  AIC=23.88
    y ~ x2
    
           Df Sum of Sq    RSS     AIC
    + x1    1    48.357  5.684 -19.161
    + x6    1     9.345 44.695  22.083
    + x5    1     8.358 45.683  22.520
    + x4    1     5.634 48.407  23.678
    <none>              54.041  23.880
    + x3    1     3.352 50.689  24.600
    
    Step:  AIC=-19.16
    y ~ x2 + x1
    
           Df Sum of Sq    RSS     AIC
    + x3    1   2.31923 3.3648 -27.647
    + x5    1   0.54092 5.1431 -19.162
    <none>              5.6840 -19.161
    + x6    1   0.15299 5.5310 -17.707
    + x4    1   0.13436 5.5497 -17.640
    
    Step:  AIC=-27.65
    y ~ x2 + x1 + x3
    
           Df Sum of Sq    RSS     AIC
    <none>              3.3648 -27.647
    + x4    1  0.301849 3.0629 -27.527
    + x6    1  0.216683 3.1481 -26.979
    + x5    1  0.028289 3.3365 -25.816
    
    Call:
    lm(formula = y ~ x2 + x1 + x3, data = systolic_bp)
    
    Coefficients:
    (Intercept)           x2           x1           x3  
       -13.6838       0.8974       0.6932       5.2149  

    The final model is given by

    \[ \hat{y} = -13.6838 + 0.8974x_2 + 0.6932x_1 + 5.2149x_3 \]

    where \(x_1\) is age, \(x_2\) is weight and \(x_3\) is body surface area (bsa).
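
    Note that step() reports AIC on the extractAIC() scale, \(n\log(\mathrm{RSS}/n) + 2k\), which differs from AIC() by an additive constant. The final step's value can be reproduced as a quick sketch:

    ## reproduce the final step's AIC of -27.647; returns (edf, AIC)
    extractAIC(lm(y ~ x2 + x1 + x3, data = systolic_bp))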

    ii. Are there any substantial differences between the models obtained in parts (a) and (d), in terms of significant predictor variables and changes in \(R^2\) and \(R_{adj}^2\)? Justify your answer.

    ## refit the model selected by forward stepwise AIC
    aic_model <- lm(y ~ x2 + x1 + x3, data = systolic_bp)
    
    summary(aic_model)
    
    Call:
    lm(formula = y ~ x2 + x1 + x3, data = systolic_bp)
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -0.77395 -0.38986  0.07945  0.31585  0.66234 
    
    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
    (Intercept) -13.68384    2.78007  -4.922 0.000153 ***
    x2            0.89743    0.05105  17.580 6.92e-12 ***
    x1            0.69324    0.04621  15.001 7.63e-11 ***
    x3            5.21494    1.57035   3.321 0.004325 ** 
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 0.4586 on 16 degrees of freedom
    Multiple R-squared:  0.994, Adjusted R-squared:  0.9929 
    F-statistic: 882.3 on 3 and 16 DF,  p-value: < 2.2e-16
    summary(systolic_bp_model)
    
    Call:
    lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6, data = systolic_bp)
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -0.84629 -0.24890  0.08592  0.28963  0.53149 
    
    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
    (Intercept) -12.894244   2.790338  -4.621 0.000479 ***
    x1            0.692539   0.053931  12.841 9.23e-09 ***
    x2            0.955487   0.068101  14.030 3.14e-09 ***
    x3            4.488179   1.685262   2.663 0.019516 *  
    x4            0.063758   0.052794   1.208 0.248680    
    x5           -0.077672   0.056046  -1.386 0.189102    
    x6            0.005355   0.003715   1.441 0.173158    
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 0.4438 on 13 degrees of freedom
    Multiple R-squared:  0.9954,    Adjusted R-squared:  0.9933 
    F-statistic: 471.8 on 6 and 13 DF,  p-value: 1.952e-14

    There is no substantial difference between the two models: the significant predictors of the full model (\(x_1\), \(x_2\), \(x_3\)) are exactly those retained by the stepwise selection, \(R^2\) only drops from 0.9954 to 0.9940, and \(R_{adj}^2\) from 0.9933 to 0.9929. This is because \(x_4\), \(x_5\) and \(x_6\) do not meaningfully improve model performance.
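
    This can also be confirmed formally (a sketch, beyond what the question requires) with a partial F-test of the three dropped predictors; given the summaries above, a non-significant p-value is expected:

    ## partial F-test: do x4, x5, x6 jointly improve the reduced model?
    anova(aic_model, systolic_bp_model)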

3. a) Based on 2(a),

  1. Plot the quantile-quantile (QQ) graph for the residuals and perform the Shapiro-Wilk normality test at the 5% significance level. Is there any problem with the normality assumption? Interpret your results.

    ## QQ plot of the studentized residuals
    ## (named rstud to avoid masking the base function resid())
    rstud <- rstudent(systolic_bp_model)
    qqnorm(rstud, main = "QQ Normality Plot")
    qqline(rstud)

    ## Shapiro-Wilk test on the raw residuals
    shapiro.test(resid(systolic_bp_model))
    
        Shapiro-Wilk normality test
    
    data:  resid(systolic_bp_model)
    W = 0.92231, p-value = 0.1098

    The Shapiro-Wilk p-value (0.1098) is greater than 0.05, so we fail to reject the null hypothesis that the residuals are normally distributed. The normality assumption is not violated.

  2. Plot the residuals versus the fitted values and test for constant variance at 5% significance level using the Breusch-Pagan studentized test. Give comments about the constant variance assumption of the model.

    ## Residuals vs fitted values plot
    plot(fitted(systolic_bp_model), rstud, xlab = "fitted values",
         ylab = "studentized residuals", main = "Residual vs. Fitted plot")
    abline(h = 0, col = 2, lwd = 2)
    
    ## Breusch-Pagan Test
    library(lmtest)
    bptest(systolic_bp_model)
    
        studentized Breusch-Pagan test
    
    data:  systolic_bp_model
    BP = 2.9208, df = 6, p-value = 0.8187

    The Breusch-Pagan p-value (0.8187) is greater than 0.05, so we fail to reject the null hypothesis of constant variance. The residuals share a common variance (homoscedastic), so the constant variance assumption holds.

3. b) Based on 2(d),

  1. Give suggestions on how to deal with influential points.

    ## Cook's distance for the selected model
    cd <- cooks.distance(aic_model)
    plot(cd, ylim = c(0, 0.25))
    n <- nrow(systolic_bp)              # 20 observations
    abline(h = 4/n, col = 2, lwd = 1)   # conventional 4/n cutoff

    which(cd > 4/n)
    named integer(0)
    ## no observation exceeds the cutoff; the removal below is illustrative only
    new_systolic_bp <- systolic_bp[-c(7,19),]

    To deal with influential points, first check whether the flagged observations are data-entry errors; if they are genuine but distort the fit, remove them and refit the model on the reduced data, as sketched below.
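
    A minimal refit sketch, assuming the hypothetical influential points 7 and 19 dropped above (none were actually flagged here); the data-frame columns are used directly because x1 to x6 were built from the full data, and refit_model is a name introduced for illustration:

    ## hypothetical refit after dropping observations 7 and 19
    refit_model <- lm(bp ~ weight + age + bsa, data = new_systolic_bp)
    summary(refit_model)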

  2. Perform the model adequacy checks for the data. Identify the influential points if any.

    ## standard diagnostic plots: residuals vs fitted, QQ, scale-location, leverage
    par(mfrow = c(2,2))
    plot(aic_model)

    shapiro.test(resid(aic_model))
    
        Shapiro-Wilk normality test
    
    data:  resid(aic_model)
    W = 0.94994, p-value = 0.3662
    bptest(aic_model)
    
        studentized Breusch-Pagan test
    
    data:  aic_model
    BP = 1.7264, df = 3, p-value = 0.6311
    par(mfrow = c(1,1))
    plot(aic_model, which = 5)

    The p-values of the Shapiro-Wilk and Breusch-Pagan tests are not statistically significant, so the residuals of the model satisfy the normality and homoscedasticity assumptions. The Cook's distance and residuals-vs-leverage plots likewise flag no influential points.
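
    Beyond Cook's distance, R's built-in influence diagnostics can be consulted in one call (a sketch): influence.measures() flags observations by DFBETAS, DFFITS, covariance ratio, Cook's distance and hat values.

    ## all standard influence diagnostics; flagged observations are starred
    summary(influence.measures(aic_model))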

4. Using the systolic blood pressure dataset in question 2, we are interested in investigating high blood pressure among patients.

  1. Change the response variable, blood pressure/bp (mmHg) into a categorical variable according to Table 2 using R. Perform logistic regression analysis for this data using all predictor variables.

    ## recoding bp into a binary variable (high blood pressure: bp >= 120 mmHg)
    y_0 <- replace(y, (y <  120), 0); y_0
     [1]   0   0   0   0   0 121 121   0   0   0   0   0   0   0 125   0   0   0   0
    [20] 122
    y_new <- replace(y_0, (y_0 >= 120), 1); y_new
     [1] 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 1
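    ## equivalent one-step recoding (a sketch): y_new <- as.integer(y >= 120)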
    ## logistic regression analysis
    ## NOTE: glm() defaults to family = gaussian, so this call actually fits an
    ## ordinary linear model to the 0/1 response (see the "gaussian family"
    ## dispersion line in the output below); family = binomial is required for a
    ## true logistic regression (see the corrected sketch after the output).
    logistic.model <- glm(y_new ~ x1 + x2 + x3 + x4 + x5 + x6, data = systolic_bp)
    summary(logistic.model)
    
    Call:
    glm(formula = y_new ~ x1 + x2 + x3 + x4 + x5 + x6, data = systolic_bp)
    
    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)   
    (Intercept) -6.813209   1.765758  -3.859  0.00198 **
    x1           0.065355   0.034128   1.915  0.07775 . 
    x2           0.057149   0.043095   1.326  0.20763   
    x3           0.798302   1.066453   0.749  0.46745   
    x4           0.002136   0.033408   0.064  0.95000   
    x5          -0.045820   0.035466  -1.292  0.21888   
    x6           0.001785   0.002351   0.759  0.46120   
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    (Dispersion parameter for gaussian family taken to be 0.07886281)
    
        Null deviance: 3.2000  on 19  degrees of freedom
    Residual deviance: 1.0252  on 13  degrees of freedom
    AIC: 13.341
    
    Number of Fisher Scoring iterations: 2
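
    A corrected sketch of the logistic fit follows (logit_model is a name introduced here; its output is not shown and would differ from the table above):

    ## logistic regression proper: binomial family with the default logit link
    logit_model <- glm(y_new ~ x1 + x2 + x3 + x4 + x5 + x6,
                       family = binomial, data = systolic_bp)
    summary(logit_model)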
  2. Estimate the odds ratio and find the 95% confidence intervals of regression coefficients for age (years) and weight (kg). Justify your answers.

    ## odds ratios as exponentiated coefficients
    ## (the odds-ratio reading assumes the logit link; see the note above)
    exp(coef(logistic.model))
    (Intercept)          x1          x2          x3          x4          x5 
     0.00109916  1.06753829  1.05881360  2.22176486  1.00213818  0.95521415 
             x6 
     1.00178693 
    ## confidence intervals for age and weight
    confint(logistic.model, level = 0.95)[2:3,]
    Waiting for profiling to be done...
             2.5 %    97.5 %
    x1 -0.00153457 0.1322452
    x2 -0.02731570 0.1416138

    The estimated odds ratios are 1.0675 for age and 1.0588 for weight: each additional year of age multiplies the odds of high blood pressure by about 1.07, and each additional kg of weight by about 1.06, holding the other predictors fixed. However, both 95% confidence intervals for the coefficients, (-0.0015, 0.1322) for age and (-0.0273, 0.1416) for weight, contain zero, so neither predictor is statistically significant at the 5% level.
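
    Confidence intervals for the odds ratios themselves follow by exponentiating the coefficient intervals (a sketch):

    ## 95% CIs for the odds ratios of age and weight
    exp(confint(logistic.model, level = 0.95)[2:3, ])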