knitr::opts_chunk$set(
    echo = TRUE,
    message = FALSE,
    warning = FALSE
)

Previous sections:

  1. Clustering practice: https://rpubs.com/minhtri/923711
  2. Table Descriptive Analysis: https://rpubs.com/minhtri/929114
  3. Imputation - Dealing with missing data: https://rpubs.com/minhtri/968586
  4. The R environment - Data Wrangling: https://rpubs.com/minhtri/968607
  5. Correlation analysis: https://rpubs.com/minhtri/968611
  6. Parametric or Nonparametric data: https://rpubs.com/minhtri/968616
  7. Dealing with outliers or non-normal distribution: https://rpubs.com/minhtri/968622

 

Note: This analysis is for personal study purposes. In this section, I summarize some basic information about simple linear regression, based on several sources.

 

R Packages:

# More packages will be shown during the analysis process
# Load the required library
library(tidyverse)    # Data Wrangling
library(conflicted)   # Dealing with conflict package
library(readxl)       # Read Excel files

 

Dealing with Conflicts
There are a lot of packages here, and sometimes individual functions are in conflict. R’s default conflict resolution system gives precedence to the most recently loaded package. This can make conflicts hard to detect, particularly when they are introduced by an update to an existing package.

  • Using the code below helps the entire section run properly. You may or may not need to look into the conflicted package for your work.
conflict_prefer("filter", "dplyr")
conflict_prefer("select", "dplyr")
conflict_prefer("Predict", "rms")
conflict_prefer("impute_median", "simputation")
conflict_prefer("summarize", "dplyr")

 

1 Simulated fakestroke data

Data used in this notes
fakestroke.csv
(Source: https://github.com/THOMASELOVE/432-data/blob/master/data/fakestroke.csv)

The fakestroke.csv file contains the following 18 variables for 500 patients:

# To make a table in R markdown: 1st row: header, 2nd row: alignment; the remaining rows: content
## |Variable |  Description |
## |:------- | :----------  |
## |studyid  | Study ID  # (z001 through z500) | 

|Variable  | Description |
|:---------|:------------|
|studyid   | Study ID # (z001 through z500) |
|trt       | Treatment group (Intervention or Control) |
|age       | Age in years |
|sex       | Male or Female |
|nihss     | NIH Stroke Scale Score (can range from 0-42; higher scores indicate more severe neurological deficits) |
|location  | Stroke Location - Left or Right Hemisphere |
|hx.isch   | History of Ischemic Stroke (Yes/No) |
|afib      | Atrial Fibrillation (1 = Yes, 0 = No) |
|dm        | Diabetes Mellitus (1 = Yes, 0 = No) |
|mrankin   | Pre-stroke modified Rankin scale score (0, 1, 2 or > 2) indicating functional disability - complete range is 0 (no symptoms) to 6 (death) |
|sbp       | Systolic blood pressure, in mm Hg |
|iv.altep  | Treatment with IV alteplase (Yes/No) |
|time.iv   | Time from stroke onset to start of IV alteplase (minutes), if iv.altep = Yes |
|aspects   | Alberta Stroke Program Early Computed Tomography score, which measures extent of stroke from 0-10; higher scores indicate fewer early ischemic changes |
|ia.occlus | Intracranial arterial occlusion, based on vessel imaging - five categories |
|extra.ica | Extracranial ICA occlusion (1 = Yes, 0 = No) |
|time.rand | Time from stroke onset to study randomization, in minutes |
|time.punc | Time from stroke onset to groin puncture, in minutes (Intervention group only) |

 

2 Data preprocessing step

2.1 Step 1: Data wrangling

Please refer to this section: The R environment - Data Wrangling: https://rpubs.com/minhtri/933332

  1. Load the data into R and examine some basic information about the variables. When loading the data, I also define the type of each variable (numeric or text):
df <- read_excel("D:/Statistics/R/R data/fakestroke.xlsx", 
                 col_types = c("text", "text", "numeric",
                               "text", "numeric", "text", "text", 
                               "numeric", "numeric", "text", "numeric", 
                               "text", "numeric", "numeric", "text", 
                               "numeric", "numeric", "numeric")) 
names(df)
##  [1] "studyid"   "trt"       "age"       "sex"       "nihss"     "location" 
##  [7] "hx.isch"   "afib"      "dm"        "mrankin"   "sbp"       "iv.altep" 
## [13] "time.iv"   "aspects"   "ia.occlus" "extra.ica" "time.rand" "time.punc"
str(df)
## tibble [500 x 18] (S3: tbl_df/tbl/data.frame)
##  $ studyid  : chr [1:500] "z001" "z002" "z003" "z004" ...
##  $ trt      : chr [1:500] "Control" "Intervention" "Control" "Control" ...
##  $ age      : num [1:500] 53 51 68 28 91 34 75 89 75 26 ...
##  $ sex      : chr [1:500] "Male" "Male" "Female" "Male" ...
##  $ nihss    : num [1:500] 21 23 11 22 24 18 25 18 25 27 ...
##  $ location : chr [1:500] "Right" "Left" "Right" "Left" ...
##  $ hx.isch  : chr [1:500] "No" "No" "No" "No" ...
##  $ afib     : num [1:500] 0 1 0 0 0 0 0 0 1 0 ...
##  $ dm       : num [1:500] 0 0 0 0 0 0 0 0 0 0 ...
##  $ mrankin  : chr [1:500] "2" "0" "0" "0" ...
##  $ sbp      : num [1:500] 127 137 138 122 162 166 140 157 129 143 ...
##  $ iv.altep : chr [1:500] "Yes" "Yes" "No" "Yes" ...
##  $ time.iv  : num [1:500] 63 68 NA 78 121 78 97 NA 49 99 ...
##  $ aspects  : num [1:500] 10 10 10 10 8 5 10 9 6 10 ...
##  $ ia.occlus: chr [1:500] "M1" "M1" "ICA with M1" "ICA with M1" ...
##  $ extra.ica: num [1:500] 0 1 1 1 0 0 0 0 0 0 ...
##  $ time.rand: num [1:500] 139 118 178 160 214 154 122 147 271 141 ...
##  $ time.punc: num [1:500] NA 281 NA NA NA NA 268 NA NA 259 ...

 

  2. Convert all character variables to factors:
df <- df %>% mutate_if(is.character, as.factor)
str(df) 
## tibble [500 x 18] (S3: tbl_df/tbl/data.frame)
##  $ studyid  : Factor w/ 500 levels "z001","z002",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ trt      : Factor w/ 2 levels "Control","Intervention": 1 2 1 1 1 1 2 1 1 2 ...
##  $ age      : num [1:500] 53 51 68 28 91 34 75 89 75 26 ...
##  $ sex      : Factor w/ 2 levels "Female","Male": 2 2 1 2 2 1 2 1 2 1 ...
##  $ nihss    : num [1:500] 21 23 11 22 24 18 25 18 25 27 ...
##  $ location : Factor w/ 2 levels "Left","Right": 2 1 2 1 2 1 2 2 1 2 ...
##  $ hx.isch  : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ afib     : num [1:500] 0 1 0 0 0 0 0 0 1 0 ...
##  $ dm       : num [1:500] 0 0 0 0 0 0 0 0 0 0 ...
##  $ mrankin  : Factor w/ 4 levels "> 2","0","1",..: 4 2 2 2 2 4 2 2 4 2 ...
##  $ sbp      : num [1:500] 127 137 138 122 162 166 140 157 129 143 ...
##  $ iv.altep : Factor w/ 2 levels "No","Yes": 2 2 1 2 2 2 2 1 2 2 ...
##  $ time.iv  : num [1:500] 63 68 NA 78 121 78 97 NA 49 99 ...
##  $ aspects  : num [1:500] 10 10 10 10 8 5 10 9 6 10 ...
##  $ ia.occlus: Factor w/ 6 levels "A1 or A2","ICA with M1",..: 4 4 2 2 4 1 4 4 4 4 ...
##  $ extra.ica: num [1:500] 0 1 1 1 0 0 0 0 0 0 ...
##  $ time.rand: num [1:500] 139 118 178 160 214 154 122 147 271 141 ...
##  $ time.punc: num [1:500] NA 281 NA NA NA NA 268 NA NA 259 ...

 

  3. afib, dm and extra.ica are categorical variables, but in the Excel file they are coded as “1” and “0”. Therefore, I convert them back to Yes/No factors:
# Changing coding variable to categorical:
df$afib       <- factor(df$afib , levels=0:1, labels=c("No", "Yes"))
df$dm         <- factor(df$dm , levels=0:1, labels=c("No", "Yes"))
df$extra.ica  <- factor(df$extra.ica, levels=0:1, labels=c("No", "Yes"))
# Checking after converting
str(df)
## tibble [500 x 18] (S3: tbl_df/tbl/data.frame)
##  $ studyid  : Factor w/ 500 levels "z001","z002",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ trt      : Factor w/ 2 levels "Control","Intervention": 1 2 1 1 1 1 2 1 1 2 ...
##  $ age      : num [1:500] 53 51 68 28 91 34 75 89 75 26 ...
##  $ sex      : Factor w/ 2 levels "Female","Male": 2 2 1 2 2 1 2 1 2 1 ...
##  $ nihss    : num [1:500] 21 23 11 22 24 18 25 18 25 27 ...
##  $ location : Factor w/ 2 levels "Left","Right": 2 1 2 1 2 1 2 2 1 2 ...
##  $ hx.isch  : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ afib     : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 2 1 ...
##  $ dm       : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mrankin  : Factor w/ 4 levels "> 2","0","1",..: 4 2 2 2 2 4 2 2 4 2 ...
##  $ sbp      : num [1:500] 127 137 138 122 162 166 140 157 129 143 ...
##  $ iv.altep : Factor w/ 2 levels "No","Yes": 2 2 1 2 2 2 2 1 2 2 ...
##  $ time.iv  : num [1:500] 63 68 NA 78 121 78 97 NA 49 99 ...
##  $ aspects  : num [1:500] 10 10 10 10 8 5 10 9 6 10 ...
##  $ ia.occlus: Factor w/ 6 levels "A1 or A2","ICA with M1",..: 4 4 2 2 4 1 4 4 4 4 ...
##  $ extra.ica: Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 1 1 1 ...
##  $ time.rand: num [1:500] 139 118 178 160 214 154 122 147 271 141 ...
##  $ time.punc: num [1:500] NA 281 NA NA NA NA 268 NA NA 259 ...

 

  4. Reorder the levels of some factor variables. After this, we have tidy data that are nearly ready for missingness checking:
  • trt: Intervention (baseline), Control
  • mrankin: 0, 1, 2, > 2.
  • ia.occlus: rearrange the ia.occlus levels to the order presented in Berkhemer et al. (2015).
# Relevel factors:
df$trt = factor(df$trt, levels=c("Intervention", "Control"))
df$mrankin = factor(df$mrankin, levels=c("0", "1", "2", "> 2"))
df$ia.occlus = factor(df$ia.occlus, levels=c("Intracranial ICA", "ICA with M1", 
                                             "M1", "M2", "A1 or A2"))

# Check order of factor after relevel: 
str(df)
## tibble [500 x 18] (S3: tbl_df/tbl/data.frame)
##  $ studyid  : Factor w/ 500 levels "z001","z002",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ trt      : Factor w/ 2 levels "Intervention",..: 2 1 2 2 2 2 1 2 2 1 ...
##  $ age      : num [1:500] 53 51 68 28 91 34 75 89 75 26 ...
##  $ sex      : Factor w/ 2 levels "Female","Male": 2 2 1 2 2 1 2 1 2 1 ...
##  $ nihss    : num [1:500] 21 23 11 22 24 18 25 18 25 27 ...
##  $ location : Factor w/ 2 levels "Left","Right": 2 1 2 1 2 1 2 2 1 2 ...
##  $ hx.isch  : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ afib     : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 2 1 ...
##  $ dm       : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mrankin  : Factor w/ 4 levels "0","1","2","> 2": 3 1 1 1 1 3 1 1 3 1 ...
##  $ sbp      : num [1:500] 127 137 138 122 162 166 140 157 129 143 ...
##  $ iv.altep : Factor w/ 2 levels "No","Yes": 2 2 1 2 2 2 2 1 2 2 ...
##  $ time.iv  : num [1:500] 63 68 NA 78 121 78 97 NA 49 99 ...
##  $ aspects  : num [1:500] 10 10 10 10 8 5 10 9 6 10 ...
##  $ ia.occlus: Factor w/ 5 levels "Intracranial ICA",..: 3 3 2 2 3 5 3 3 3 3 ...
##  $ extra.ica: Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 1 1 1 ...
##  $ time.rand: num [1:500] 139 118 178 160 214 154 122 147 271 141 ...
##  $ time.punc: num [1:500] NA 281 NA NA NA NA 268 NA NA 259 ...

 

2.2 Step 2: Missingness checking and imputation when necessary

Package: VIM

Command: aggr(x, plot = T, prop = T, numbers = F, combined = F, sortVars = T, cex.axis = .7, gap = 3)

  • plot: whether the results should be plotted.
  • numbers: whether the proportion or frequencies of the different combinations should be represented by numbers.
  • prop: whether the proportion of missing/imputed values should be used rather than the total amount.
  • combined: a logical indicating whether the two plots should be combined.
  • sortVars: whether the variables should be sorted by the number of missing/imputed values.

For more information, please refer to the following link:
Dealing with missingness - Imputation: https://rpubs.com/minhtri/932007

 

Checking the percentage of missingness:

library(VIM)
aggr(df, plot = F, numbers = T, prop = T)
## 
##  Missings in variables:
##   Variable Count
##        sbp     1
##    time.iv    55
##    aspects     4
##  ia.occlus     1
##  extra.ica     1
##  time.rand     2
##  time.punc   267
aggr(df, plot = T, numbers = T, prop = T, combined= F, sortVars = T, cex.axis=.7, gap=3)

## 
##  Variables sorted by number of missings: 
##   Variable Count
##  time.punc 0.534
##    time.iv 0.110
##    aspects 0.008
##  time.rand 0.004
##        sbp 0.002
##  ia.occlus 0.002
##  extra.ica 0.002
##    studyid 0.000
##        trt 0.000
##        age 0.000
##        sex 0.000
##      nihss 0.000
##   location 0.000
##    hx.isch 0.000
##       afib 0.000
##         dm 0.000
##    mrankin 0.000
##   iv.altep 0.000
aggr(df, plot = T, numbers = T, prop = F, combined= F, sortVars = T, cex.axis=.7, gap=3)

## 
##  Variables sorted by number of missings: 
##   Variable Count
##  time.punc   267
##    time.iv    55
##    aspects     4
##  time.rand     2
##        sbp     1
##  ia.occlus     1
##  extra.ica     1
##    studyid     0
##        trt     0
##        age     0
##        sex     0
##      nihss     0
##   location     0
##    hx.isch     0
##       afib     0
##         dm     0
##    mrankin     0
##   iv.altep     0

 

Discuss

  • There are 7 variables with missingness.
  • However, the missingness in time.punc and time.iv is due to the experimental design (time.punc applies only to the Intervention group of trt; time.iv applies only to patients with iv.altep = Yes). Therefore, I will exclude them from the missingness analysis.
  • Only 5 variables are left, each with a proportion of missingness < 5%.
    => Complete case analysis can be applied.
    => We can ignore the missingness and proceed to the next step - step 3.
df = df %>% select(studyid, nihss, age, sbp, time.rand, sex, ia.occlus) %>% na.omit()

 

3 Simple Regression - Continuous predictor

Continuous outcome - 1 Continuous predictor

The outcomes of interest in this dataset are nihss and mrankin.

For the simple linear regression model, we start with a question: “Can we use sbp to predict nihss?” or “Over what range of sbp can we make a reasonable prediction of nihss?”

 

3.1 Step 3: Drawing a plot

Let’s draw a scatter plot to illustrate the relationship between sbp and nihss:

ggplot(df, aes(x = sbp, y = nihss)) + geom_point() + geom_smooth(method = "lm", colour = "Red")

 

3.2 Step 4: A simple linear regression model

Package: stats (loaded with base R)
Command: newModel <- lm(outcome ~ predictor(s), data = dataFrame, na.action = an.action)

with na.action:

  • na.action = na.fail: the model will fail to compute if there are any missing values.
  • na.action = na.omit or na.exclude: estimates the model but excludes any case that has missing data on any variable in the model (na.omit is R’s default).
model_A <- lm(nihss ~ sbp, data = df)
summary(model_A)
## 
## Call:
## lm(formula = nihss ~ sbp, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.1208 -3.9554 -0.3104  3.8197 10.1171 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 19.094731   1.240217  15.396   <2e-16 ***
## sbp         -0.007434   0.008393  -0.886    0.376    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.682 on 494 degrees of freedom
## Multiple R-squared:  0.001586,   Adjusted R-squared:  -0.0004354 
## F-statistic: 0.7846 on 1 and 494 DF,  p-value: 0.3762
confint(model_A, level = 0.95)
##                   2.5 %       97.5 %
## (Intercept) 16.65798015 21.531481749
## sbp         -0.02392526  0.009056467

 

3.3 Explaining the result

3.3.1 Overall fit

First, we should look at the final part of the output and discuss the Multiple R-squared.

For these data, R2 has a value of 0.0016. Because there is only one predictor, this value represents the square of the simple correlation between nihss and sbp - we can take the square root of R2:

sqrt(0.0016)
## [1] 0.04
  • This value matches the magnitude of the Pearson correlation coefficient (-0.0387; https://rpubs.com/minhtri/934689), although the square root cannot recover the sign (positive or negative).
  • R2 = 0.0016 also tells us that sbp can account for only 0.16% of the variation in nihss.

=> 99.84% of the variation in nihss cannot be explained by sbp alone. Therefore, there must be other variables that have an influence.
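A quick cross-check (a minimal one-liner, using the df built in the preprocessing step):

cor(df$nihss, df$sbp)   # about -0.04; squaring it recovers the Multiple R-squared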

 

Discuss the Adjusted R-squared
From thomaselove.github.io/432-notes:

  • The Adjusted R-squared value “adjusts” for the size of our model in terms of the number of coefficients included in the model.

  • The adjusted R-squared will always be smaller than the Multiple R-squared.

  • In particular, we hope to find models where the adjusted R2 isn’t substantially less than the Multiple R-squared.

  • The adjusted R-squared is usually a better estimate of likely performance of our model in new data than is the Multiple R-squared.

  • The adjusted R-squared result is no longer interpretable as a proportion of anything - in fact, it can fall below 0.

  • We can obtain the adjusted R2 from the raw R2, the number of observations N and the number of predictors p included in the model: Adjusted R2 = 1 - (1 - R2)(N - 1)/(N - p - 1). A worked sketch follows below.
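As a sanity check, here is a minimal sketch computing the adjusted R2 of model_A by hand (N = 496 and p = 1 are read off the model output above):

r2 <- 0.001586   # Multiple R-squared from summary(model_A)
N  <- 496        # observations used in the fit (494 residual df + 2)
p  <- 1          # number of predictors
1 - (1 - r2) * (N - 1) / (N - p - 1)   # about -0.00044, matching the output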

     

Discuss the F statistic and its p-value
p-value = 0.3762. This result tells us there is a 37.6% chance - far more than the 5% threshold - that an F-ratio this large would occur if the null hypothesis were true. Therefore, our regression model does not predict nihss significantly better than simply using the mean value of nihss. In short, the model does not predict nihss significantly well => bad model.

 

3.3.2 Coefficient parameter

Discuss the equation
The equation is nihss = 19.09 - 0.007 × sbp.

  • Intercept (b0) = 19.09: when a patient’s systolic blood pressure is 0 (sbp = 0), the model predicts that the NIH Stroke Scale Score will be 19.09.

=> The intercept is meaningless here; it just adjusts the height of the line.

  • Slope or gradient (b1) = -0.007: the change in the outcome associated with a unit change in the predictor. If the predictor increases by one unit (sbp increases by 1 mm Hg), the model predicts that the NIH score decreases by 0.007 units (see the sketch below).
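To illustrate, a small sketch using predict() with model_A (sbp = 127 is the first patient’s value; the arithmetic uses the unrounded coefficients from the output above):

predict(model_A, newdata = data.frame(sbp = 127))
# 19.0947 - 0.007434 * 127 = about 18.15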

     

Discuss the p-values of the coefficients
In general, the regression coefficient b represents the change in the outcome resulting from a unit change in the predictor. If a predictor has a significant impact on our ability to predict the outcome, its b should be different from 0 (and large relative to its standard error). The t-test tells us whether the b-value differs from 0: R provides the exact probability that the observed value of t would occur if the value of b in the population were 0. If this probability is less than 0.05, scientists generally agree that the result reflects a genuine effect.

For sbp, the probability is 0.376 => not significant => the probability of this t value occurring if the population b were 0 is greater than 0.05.
=> The b is not different from 0.
=> sbp makes no significant contribution (p > .05) to predicting nihss, OR we don’t have enough evidence to say that there is a relationship between nihss and sbp.
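Where the t value comes from (a minimal sketch, reading the table returned by summary()):

summary(model_A)$coefficients
# t value = Estimate / Std. Error; for sbp: -0.007434 / 0.008393 = -0.886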

 

Discuss the Confidence Interval

confint(model_A, level = 0.95)
##                   2.5 %       97.5 %
## (Intercept) 16.65798015 21.531481749
## sbp         -0.02392526  0.009056467

Explaining

  • The confidence intervals can be interpreted as: if we’d collected 100 samples, and calculated the confidence intervals for b, we are saying that 95% of these confidence intervals would contain the true value of b. Therefore, we can be fairly confident that the confidence interval we have constructed for this sample will contain the true value of b in the population.

  • A good model will have a small confidence interval, indicating that the value of b in this sample is close to the true value of b in the population.

  • The sign (positive or negative) of the b-values tells us about the direction of the relationship between the predictor and the outcome.

=> Therefore, we would expect a very bad model to have confidence intervals that cross zero, indicating that in some samples the predictor has a negative relationship to the outcome whereas in others it has a positive relationship.

Discuss:
In this model, the predictor (sbp) has a very tight confidence interval, but the interval crosses zero.
=> This is a bad model: there is no relationship between nihss and sbp, OR we don’t have enough evidence to say that there is a relationship between nihss and sbp.

 

3.3.3 Residuals

Explain
Let’s check our first observation, z001, who has nihss = 21 and sbp = 127:

  • Observed value: nihss = 21.

  • Fitted value: nihss = 19.09 - 0.007 × 127 = 18.201 (using rounded coefficients).

  • Residual = Observed - Fitted = 21 - 18.201 = 2.799 (verified in the sketch after this list).

  • Points above the regression line have positive residuals, and points below the regression line have negative residuals. Points on the line have zero residuals.
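We can verify this directly (a minimal sketch; the small difference from the hand calculation is rounding of the coefficients):

fitted(model_A)[1]   # fitted nihss for z001, about 18.15
resid(model_A)[1]    # observed minus fitted, about 2.85 (as in the table below)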

     

Discuss
According to thomaselove.github.io/432-notes:

  • The mean residual will always be zero in an ordinary least squares model, but a five number summary of the residuals is provided by the summary, as is an estimated standard deviation of the residuals (called here the Residual standard error).

  • In the fakestroke data, the minimum residual was -8.12, so for one subject the observed value was 8.12 points smaller than the predicted (fitted) value. This means the prediction was 8.12 points too large for that subject.

  • Similarly, the maximum residual was 10.12 points, so for one subject the prediction was 10.12 points too small.

     

3.3.4 Plot: Residuals vs. Fitted Values

Create a variable called residuals in a data frame called df2 (df2$residuals) that contains the residuals from model_A (resid(model_A)):

df2 = df %>% select(studyid, nihss, sbp)
df2$residuals = resid(model_A)

df2$fitted = fitted(model_A)
df2$standardized.residuals = rstandard(model_A)
df2$studentized.residuals = rstudent(model_A)
df2$cooks.distance = cooks.distance(model_A)
df2$dfbeta = dfbeta(model_A)
df2$dffit = dffits(model_A)
df2$leverage = hatvalues(model_A)
df2$covariance.ratios = covratio(model_A)
df2
## # A tibble: 496 x 12
##    studyid nihss   sbp residu~1 fitted stand~2 stude~3 cooks~4 dfbeta~5    dffit
##    <fct>   <dbl> <dbl>    <dbl>  <dbl>   <dbl>   <dbl>   <dbl>    <dbl>    <dbl>
##  1 z001       21   127   2.85     18.2  0.610   0.609  5.83e-4  3.07e-2  3.41e-2
##  2 z002       23   137   4.92     18.1  1.05    1.05   1.25e-3  2.99e-2  5.01e-2
##  3 z003       11   138  -7.07     18.1 -1.51   -1.51   2.52e-3 -3.96e-2 -7.11e-2
##  4 z004       22   122   3.81     18.2  0.816   0.815  1.27e-3  5.00e-2  5.04e-2
##  5 z005       24   162   6.11     17.9  1.31    1.31   2.46e-3 -3.46e-2  7.02e-2
##  6 z006       18   166   0.139    17.9  0.0298  0.0298 1.49e-6 -1.05e-3  1.73e-3
##  7 z007       25   140   6.95     18.1  1.49    1.49   2.34e-3  3.24e-2  6.85e-2
##  8 z008       18   157   0.0725   17.9  0.0155  0.0155 2.93e-7 -2.40e-4  7.64e-4
##  9 z009       25   129   6.86     18.1  1.47    1.47   3.14e-3  6.74e-2  7.93e-2
## 10 z010       27   143   8.97     18.0  1.92    1.92   3.75e-3  2.92e-2  8.69e-2
## # ... with 486 more rows, 3 more variables: dfbeta[2] <dbl>, leverage <dbl>,
## #   covariance.ratios <dbl>, and abbreviated variable names 1: residuals,
## #   2: standardized.residuals, 3: studentized.residuals, 4: cooks.distance,
## #   5: dfbeta[,"(Intercept)"]

 

Checking the distribution: Plot a histogram and Q-Q plot of the studentized residuals.

# Histogram and density plot:
histogram <-ggplot(df2, aes(studentized.residuals)) + 
  theme(legend.position = "none") + 
  geom_histogram(aes(y = ..density..), colour = "black", fill = "white") + 
  labs(x = "Studentized Residual", y = "Density")

histogram + stat_function(fun = dnorm, args = list(mean = mean(df2$studentized.residuals, na.rm = TRUE), sd = sd(df2$studentized.residuals, na.rm = TRUE)), colour = "red", size = 1)

# Create a Q-Q plot: 3 ways to draw
## 1st way
qqplot.resid <- qplot(sample = df2$studentized.residuals, stat="qq") + labs(x = "Theoretical Values", y = "Observed Values")
qqplot.resid

## 2nd way
### library(ggpubr)
### ggqqplot(df2$studentized.residuals, ylab = "Observed Values")

## 3rd way
ggplot(data = df2, aes(sample = studentized.residuals)) + 
      geom_qq() +
      geom_qq_line() +
      labs( x = "Theoretical Values", y = "Observed Values")

 

Checking the Homoscedasticity: Plot a scatterplot of studentized residuals against predicted values

ggplot(df2, aes(fitted, studentized.residuals)) +
  geom_point() + 
  geom_smooth(method = "lm", colour = "Blue")+ labs(x = "Fitted Values  
  lm(nihss ~ sbp)", y = "Studentized Residual")

Or we can simply use the following 3 commands (shown in Field’s book with a model named albumSales.3; here applied to model_A):

  • hist(rstudent(model_A))
  • hist(rstandard(model_A))
  • plot(model_A)
hist(rstudent(model_A))

hist(rstandard(model_A)) 

par(mfrow=c(2,2))
plot(model_A)

 

 

4 Simple Regression - Categorical predictor

Continuous outcome - 1 Categorical predictor

4.1 Step 3: Linear regression model

Package: stats (loaded with base R)
Command: newModel <- lm(outcome ~ predictor(s), data = dataFrame, na.action = an.action)

with na.action:

  • na.action = na.fail: the model will fail to compute if there are any missing values.
  • na.action = na.omit or na.exclude: estimates the model but excludes any case that has missing data on any variable in the model (na.omit is R’s default).
library(car)
model_C <- lm(nihss ~ ia.occlus, data = df)
summary(model_C)
## 
## Call:
## lm(formula = nihss ~ ia.occlus, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.2152 -4.2152 -0.2313  3.7848 10.7687 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            14.750      2.325   6.345 5.05e-10 ***
## ia.occlusICA with M1    2.481      2.359   1.052   0.2934    
## ia.occlusM1             3.465      2.339   1.481   0.1392    
## ia.occlusM2             4.301      2.441   1.762   0.0787 .  
## ia.occlusA1 or A2       7.583      3.551   2.136   0.0332 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.649 on 491 degrees of freedom
## Multiple R-squared:  0.0217, Adjusted R-squared:  0.01373 
## F-statistic: 2.723 on 4 and 491 DF,  p-value: 0.02897
confint(model_C, level = 0.95)
##                           2.5 %    97.5 %
## (Intercept)          10.1827135 19.317286
## ia.occlusICA with M1 -2.1536104  7.116297
## ia.occlusM1          -1.1309126  8.061292
## ia.occlusM2          -0.4945081  9.097072
## ia.occlusA1 or A2     0.6066880 14.559979

 

Discuss
We can see that:

  • The output shows that the 4 dummy variables for ia.occlus explain 2.17% of the variance in nihss.
  • The F statistic = 2.72, p-value < 0.05: the model is significantly better at predicting nihss than having no model.
round(tapply(df$nihss, df$ia.occlus, mean, na.rm=TRUE), 3)
## Intracranial ICA      ICA with M1               M1               M2 
##           14.750           17.231           18.215           19.051 
##         A1 or A2 
##           22.333

 

Let’s look at our 4 predictors. The first is ia.occlusICA with M1. This dummy variable shows the difference in mean nihss between the ICA with M1 group and the baseline Intracranial ICA group: mean of ICA with M1 - mean of Intracranial ICA = 17.231 - 14.750 = 2.48. That is, mean nihss is higher for ICA with M1 than for Intracranial ICA. So the beta values tell us the relative difference between each group and the group chosen as the baseline category (the dummy coding is shown in the sketch after these bullets).

  • For the ICA with M1 dummy, the t-test is non-significant (p > 0.05) => mean nihss does not differ significantly between Intracranial ICA and ICA with M1.

  • For the A1 or A2 dummy, the coefficient is positive and significant (p < 0.05) => mean nihss is significantly higher for A1 or A2 than for Intracranial ICA.
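To see the dummy coding explicitly, a minimal sketch (model.matrix() returns the design matrix R builds for model_C):

head(model.matrix(model_C), 3)
# Each dummy column is 1 for that ia.occlus group and 0 otherwise;
# each slope equals that group's mean nihss minus the baseline (Intracranial ICA) mean:
17.231 - 14.750   # = 2.481, matching the ia.occlusICA with M1 estimate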

     

5 Multiple regression

Continuous outcome - Continuous predictor(s)

The outcomes of interest in this dataset are nihss and mrankin.

We start with the question: “Can we use age, sbp and time.rand to predict nihss?”

 

5.1 Step 3: Linear regression model

Package: stats (loaded with base R)
Command: newModel <- lm(outcome ~ predictor(s), data = dataFrame, na.action = an.action)

with na.action:

  • na.action = na.fail: the model will fail to compute if there are any missing values.
  • na.action = na.omit or na.exclude: estimates the model but excludes any case that has missing data on any variable in the model (na.omit is R’s default).
library(car)
model_B <- lm(nihss ~ age + sbp + time.rand, data = df)
summary(model_B)
## 
## Call:
## lm(formula = nihss ~ age + sbp + time.rand, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.5771 -3.9720 -0.4679  3.6668 10.5137 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 20.843881   1.602149  13.010   <2e-16 ***
## age         -0.010033   0.012355  -0.812   0.4172    
## sbp         -0.006869   0.008385  -0.819   0.4131    
## time.rand   -0.005679   0.003253  -1.746   0.0815 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.675 on 492 degrees of freedom
## Multiple R-squared:  0.008917,   Adjusted R-squared:  0.002873 
## F-statistic: 1.475 on 3 and 492 DF,  p-value: 0.2204
confint(model_B, level = 0.95)
##                   2.5 %       97.5 %
## (Intercept) 17.69598277 2.399178e+01
## age         -0.03430919 1.424283e-02
## sbp         -0.02334311 9.605859e-03
## time.rand   -0.01206948 7.124692e-04

 

5.2 Explaining the result

5.2.1 Multiple R-squared

First, we should look at the final part of the result: The Multiple R-squared

For these data, R2 has a value of 0.009. Compared with sbp alone (0.2%), the two new predictors add only another 0.7 percentage points (0.9% - 0.2%) of explained variation in nihss.

=> 99.1% of the variation in nihss cannot be explained by age, sbp and time.rand. Therefore, there must be other variables that also have an influence.

 

5.2.2 Adjusted R-squared

From thomaselove.github.io/432-notes:

  • The Adjusted R-squared value “adjusts” for the size of our model in terms of the number of coefficients included in the model.

  • The adjusted R-squared will always be smaller than the Multiple R-squared.

  • In particular, we hope to find models where the adjusted R2 isn’t substantially less than the Multiple R-squared.

  • The adjusted R-squared is usually a better estimate of likely performance of our model in new data than is the Multiple R-squared.

  • The adjusted R-squared result is no longer interpretable as a proportion of anything - in fact, it can fall below 0.

  • We can obtain the adjusted R2 from the raw R2, the number of observations N and the number of predictors p included in the model: Adjusted R2 = 1 - (1 - R2)(N - 1)/(N - p - 1).

     

5.2.3 F statistic and its p-value

p-value = 0.2204. This result tells us there is a 22% chance - far more than the 5% threshold - that an F-ratio this large would occur if the null hypothesis were true. Therefore, our regression model does not predict nihss significantly better than simply using the mean value of nihss. In short, the model does not predict nihss significantly well.

=> Bad model or there is no sufficient evidence to say that there is a significant relationship between nihss and age + sbp + time.rand.

 

5.2.4 Establish the equation

The equation is nihss = 20.84 - 0.01 × age - 0.0069 × sbp - 0.0057 × time.rand:

  • A positive coefficient indicates a positive relationship between the predictor and the outcome, whereas a negative coefficient represents a negative relationship - all three predictors here have negative b-values.
  • Intercept (b0) = 20.84: when age = 0, sbp = 0 and time.rand = 0, the model predicts that the NIH Stroke Scale Score will be 20.84 => the intercept is meaningless here; it just adjusts the height of the line.

5.2.5 t value and its significance

In general, the regression coefficient b represents the change in the outcome resulting from a unit change in the predictor. If a predictor has a significant impact on our ability to predict the outcome, its b should be different from 0 (and large relative to its standard error). The t-test tells us whether the b-value differs from 0: R provides the exact probability that the observed value of t would occur if the value of b in the population were 0. If this probability is less than 0.05, scientists generally agree that the result reflects a genuine effect:

  • In simple regression, a significant value of t indicates that the slope of the regression line is significantly different from horizontal.
  • In multiple regression, the smaller the value of Pr(> |t|) (and the larger the value of t), the greater the contribution of that predictor.

• age: t(492) = -0.812, p > .05
• sbp: t(492) = -0.819, p > .05
• time.rand: t(492) = -1.746, p > .05
=> p > 0.05 => none of them is a significant predictor of nihss.
=> Because the absolute t value of time.rand is larger than those of age and sbp, age and sbp make similar (small) contributions, while time.rand contributes somewhat more to predicting nihss.

 

5.2.6 Standardized beta values

library(QuantPsyc)
lm.beta(model_B)
##         age         sbp   time.rand 
## -0.03646526 -0.03679016 -0.07840996

This value tells us the number of standard deviations by which the outcome will change as a result of a one standard deviation change in the predictor. Because standardized beta values are all measured in standard deviation units, they are directly comparable and tell us the relative importance of each predictor (bigger absolute value = more important).
=> time.rand > age ≈ sbp.
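Equivalently, a minimal sketch (an alternative to QuantPsyc, not from the original notes): refitting the model on z-scored variables yields the same standardized slopes:

model_B_z <- lm(scale(nihss) ~ scale(age) + scale(sbp) + scale(time.rand), data = df)
coef(model_B_z)[-1]   # matches lm.beta(model_B)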

 

5.2.7 Confidence intervals

confint(model_B, level = 0.95)
##                   2.5 %       97.5 %
## (Intercept) 17.69598277 2.399178e+01
## age         -0.03430919 1.424283e-02
## sbp         -0.02334311 9.605859e-03
## time.rand   -0.01206948 7.124692e-04

 

Explaining

  • The confidence intervals can be interpreted as: if we’d collected 100 samples, and calculated the confidence intervals for b, we are saying that 95% of these confidence intervals would contain the true value of b. Therefore, we can be fairly confident that the confidence interval we have constructed for this sample will contain the true value of b in the population.

  • A good model will have a small confidence interval, indicating that the value of b in this sample is close to the true value of b in the population.

  • The sign (positive or negative) of the b-values tells us about the direction of the relationship between the predictor and the outcome.

=> Therefore, we would expect a very bad model to have confidence intervals that cross zero, indicating that in some samples the predictor has a negative relationship to the outcome whereas in others it has a positive relationship.

Discuss: In this model, all predictors confidence intervals cross zero.
=> This is a bad model and there is no relationship between nihss and age, sbp and time.rand. OR We don’t have enough evidence to say that There is a relationship between nihss and age, sbp and time.rand.

 

5.2.8 Comparing models (The F test)

We have the following hypotheses:

  • H0 (null hypothesis): the full model is not significantly better than the reduced model.

  • Ha (alternative hypothesis): the full model is significantly better.

     

Using anova() function: model 2 must have more variables than model 1
Command: anova(model.1, model.2, … , model.n)

anova(model_A, model_B)
## Analysis of Variance Table
## 
## Model 1: nihss ~ sbp
## Model 2: nihss ~ age + sbp + time.rand
##   Res.Df   RSS Df Sum of Sq      F Pr(>F)
## 1    494 10831                           
## 2    492 10751  2    79.526 1.8196 0.1632

 

Discuss
F(2, 492) = 1.82, p-value > 0.05
=> Model 2 does not significantly improve the fit to the data compared with model 1.
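The F statistic can be reproduced by hand from the printed anova table (a minimal sketch; values are taken as displayed):

(79.526 / 2) / (10751 / 492)   # (Sum of Sq / Df) / (RSS / Res.Df) = about 1.82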

 

5.2.9 Assess the importance of variable

Package: relaimpo
Command: calc.relimp(model, type = “lmg”)

library(relaimpo)
calc.relimp(model_B, type = "lmg") 
## Response variable: nihss 
## Total response variance: 21.915 
## Analysis based on 496 observations 
## 
## 3 Regressors: 
## age sbp time.rand 
## Proportion of variance explained by model: 0.89%
## Metrics are not normalized (rela=FALSE). 
## 
## Relative importance metrics: 
## 
##                   lmg
## age       0.001289178
## sbp       0.001468978
## time.rand 0.006158490
## 
## Average coefficients for different model sizes: 
## 
##                     1X          2Xs          3Xs
## age       -0.009725877 -0.009876461 -0.010033176
## sbp       -0.007434398 -0.007157730 -0.006868625
## time.rand -0.005691923 -0.005685081 -0.005678503

 

Discuss
In this model, the proportion of variance explained by the model is 0.89%:

  • age: 0.13%
  • sbp: 0.15%
  • time.rand: 0.62%

To calculate percentile bootstrap confidence intervals and a ranking for the relative contributions:

boot = boot.relimp(model_B, b = 1000, type = c("lmg"), fixed = F)
booteval.relimp(boot, typesel = "lmg", level = 0.9, bty ="perc", nodiff = T  )
## Response variable: nihss 
## Total response variance: 21.915 
## Analysis based on 496 observations 
## 
## 3 Regressors: 
## age sbp time.rand 
## Proportion of variance explained by model: 0.89%
## Metrics are not normalized (rela=FALSE). 
## 
## Relative importance metrics: 
## 
##                   lmg
## age       0.001289178
## sbp       0.001468978
## time.rand 0.006158490
## 
## Average coefficients for different model sizes: 
## 
##                     1X          2Xs          3Xs
## age       -0.009725877 -0.009876461 -0.010033176
## sbp       -0.007434398 -0.007157730 -0.006868625
## time.rand -0.005691923 -0.005685081 -0.005678503
## 
##  
##  Confidence interval information ( 1000 bootstrap replicates, bty= perc ): 
## Relative Contributions with confidence intervals: 
##  
##                              Lower  Upper
##               percentage 0.9 0.9    0.9   
## age.lmg       0.0013     ABC 0.0000 0.0136
## sbp.lmg       0.0015     ABC 0.0000 0.0119
## time.rand.lmg 0.0062     ABC 0.0001 0.0233
## 
## Letters indicate the ranks covered by bootstrap CIs. 
## (Rank bootstrap confidence intervals always obtained by percentile method) 
## CAUTION: Bootstrap confidence intervals can be somewhat liberal.

 

5.3 Step 4: Checking the 4 Key Regression Assumptions

The assumptions and conditions for a multiple regression model are nearly the same as those for simple regression. The key assumptions that must be checked in a multiple regression model are:

  • Linearity Assumption

  • Independence Assumption

  • Equal Variance (Constant Variance / Homoscedasticity) Assumption

  • Normality Assumption

     

Create a variable called residuals in a data frame called df2 (df2$residuals) that contains the residuals from model_B (resid(model_B)):

df2 = df %>% select(studyid, nihss, age, sbp, time.rand) %>% na.omit
df2$residuals = resid(model_B)

df2$fitted = fitted(model_B)
df2$standardized.residuals = rstandard(model_B)
df2$studentized.residuals = rstudent(model_B)
df2$cooks.distance = cooks.distance(model_B)
df2$dfbeta = dfbeta(model_B)
df2$dffit = dffits(model_B)
df2$leverage = hatvalues(model_B)
df2$covariance.ratios = covratio(model_B)
df2
## # A tibble: 496 x 14
##    studyid nihss   age   sbp time.rand residu~1 fitted standa~2 studen~3 cooks~4
##    <fct>   <dbl> <dbl> <dbl>     <dbl>    <dbl>  <dbl>    <dbl>    <dbl>   <dbl>
##  1 z001       21    53   127       139   2.35     18.7  0.504    0.504   4.05e-4
##  2 z002       23    51   137       118   4.28     18.7  0.919    0.919   1.60e-3
##  3 z003       11    68   138       178  -7.20     18.2 -1.54    -1.55    1.62e-3
##  4 z004       22    28   122       160   3.18     18.8  0.686    0.686   1.69e-3
##  5 z005       24    91   162       214   6.40     17.6  1.37     1.38    3.66e-3
##  6 z006       18    34   166       154  -0.488    18.5 -0.105   -0.105   3.26e-5
##  7 z007       25    75   140       122   6.56     18.4  1.41     1.41    3.19e-3
##  8 z008       18    89   157       147  -0.0378   18.0 -0.00812 -0.00811 1.38e-7
##  9 z009       25    75   129       271   7.33     17.7  1.57     1.58    3.57e-3
## 10 z010       27    26   143       141   8.20     18.8  1.77     1.77    1.18e-2
## # ... with 486 more rows, 4 more variables: dfbeta <dbl[,4]>, dffit <dbl>,
## #   leverage <dbl>, covariance.ratios <dbl>, and abbreviated variable names
## #   1: residuals, 2: standardized.residuals, 3: studentized.residuals,
## #   4: cooks.distance

 

Or we can use the augment() function:

library(broom)
aug_model_B = augment(model_B, data = df)

 

5.3.1 Linearity Assumption

According to thomaselove.github.io/431-notes, We are fitting a linear model:

  • By linear, we mean that each predictor value, x, appears simply multiplied by its coefficient and added to the model.

  • No x appears in an exponent or some other more complicated function.

  • If the regression model is true, then the outcome y is linearly related to each of the x’s.

     

5.3.1.1 Initial Scatterplots for the “Straight Enough” Condition

  • Scatterplots of y against each of the predictors are reasonably straight.

  • The scatterplots need not show a strong (or any!) slope; we just check that there isn’t a bend or other nonlinearity.

  • Any substantial curve is indicative of a potential problem.

  • Modest bends are not usually worthy of serious attention.

     

For example, here are the relevant scatterplots for two quantitative predictors (age, sbp), as well as a boxplot with violin for a categorical variable (sex).

p1 <- ggplot(df, aes(x = age, y = nihss)) +
    geom_point() +
    geom_smooth(method = "loess", se = FALSE, span = 2, formula = y ~ x) +
    geom_smooth(method = "lm", se = FALSE, col = "tomato", formula = y ~ x)

p2 <- ggplot(df, aes(x = sbp, y = nihss)) +
    geom_point() +
    geom_smooth(method = "loess", se = FALSE, span = 2, formula = y ~ x) +
    geom_smooth(method = "lm", se = FALSE, col = "tomato", formula = y ~ x)

p3 <- ggplot(df, aes(x = sex, y = nihss)) +
    geom_violin(aes(fill = sex)) +
    geom_boxplot(width = 0.4) +
    coord_flip() + guides(fill = "none")

# Combine plot
library(ggpubr)
plot = ggarrange(
  ggarrange(p1, p2, ncol = 2, labels = c("A", "B")), # 1st row with 2 plots
  p3, # 2nd row with 1 plot
  nrow = 2, 
  labels = c("", "C")       # Label of the 3rd plot
  )   

annotate_figure(plot, top = text_grob("nihss vs Predictors", 
               color = "red", face = "bold", size = 12))

Discuss
I’ve added a straight line fit [in tomato red] and a loess smooth [in blue] to each plot to guide your assessment of the “straight enough” condition. Note the use here of span = 2 in the loess smooths to produce lines that are less wiggly than they would be using the default span = 0.75. We can conclude that:

  • Each of these is “straight enough” for our purposes, in initially fitting the data.

  • If one of these was not, we might consider a transformation of the relevant predictor, or, if all were problematic, we might transform the outcome Y.

     

5.3.1.2 Residuals vs. Predicted Values to Check for Non-Linearity

The residuals should appear to have no pattern (no curve, for instance) with respect to the predicted (fitted) values. It is a very good idea to plot the residuals against the fitted values to check for patterns, especially bends or other indications of non-linearity. For a multiple regression, the fitted values are a combination of the x’s given by the regression equation, so they combine the effects of the x’s in a way that makes sense for our particular regression model. That makes them a good choice to plot against. We’ll check for other things in this plot, as well.

When you ask R to plot the result of a linear model, it will produce up to five separate plots: the first of which is a plot of residuals vs. fitted values.

 

5.3.1.2.1 Using plot(model, which = 1)

To obtain this plot for the model (model_B) that uses age, sbp and time.rand to predict nihss, we have two options. The simpler approach uses the following code, selecting plot 1 via the which argument of the plot function.

plot(model_B, which=1)

Discuss
A smooth is again added (in red) to help you identify serious non-linearity. In this case, I would conclude that there were no serious problems with linearity in these data.

The plot also, by default, labels the 3 values with the largest (in absolute value) residuals. Here, these are rows 45, 97 and 255 of the model data, each of which has a positive residual (i.e., they represent under-predictions by the model). To identify these points in the data, use the slice function. They turn out to be studyid z045, z097 and z256 (the row numbers no longer match the IDs exactly because incomplete rows were dropped earlier).

slice(df2, c(45, 97, 255))
## # A tibble: 3 x 14
##   studyid nihss   age   sbp time.rand residuals fitted standar~1 stude~2 cooks~3
##   <fct>   <dbl> <dbl> <dbl>     <dbl>     <dbl>  <dbl>     <dbl>   <dbl>   <dbl>
## 1 z045       28    93   162       231      10.5   17.5      2.26    2.27 0.0112 
## 2 z097       28    72   163       252      10.4   17.6      2.24    2.24 0.00532
## 3 z256       28    62   149       293      10.5   17.5      2.24    2.25 0.00699
## # ... with 4 more variables: dfbeta <dbl[,4]>, dffit <dbl>, leverage <dbl>,
## #   covariance.ratios <dbl>, and abbreviated variable names
## #   1: standardized.residuals, 2: studentized.residuals, 3: cooks.distance

 

5.3.1.2.2 Using ggplot2 to plot Residuals vs. Predicted Values
library(ggrepel)

ggplot(aug_model_B, aes(x = .fitted, y = .resid)) +
  geom_point() + 
  geom_abline(intercept = 0, slope = 0, lty = "dashed") +
  geom_point(data = aug_model_B |> 
               slice_max(abs(.resid), n = 5),
             col = "red", size = 2) +
  geom_text_repel(data = aug_model_B |> 
               slice_max(abs(.resid), n = 5),
               aes(label = studyid), col = "red") +  
  geom_smooth(method = "loess", formula = y ~ x, se = F) +
  labs(title = "Residuals vs. Fitted Values from data",
       caption = "5 largest |residuals| highlighted in red.",
       x = "Fitted nihss", y = "Residual") +
  theme(aspect.ratio = 1)

 

5.3.1.3 Residuals vs. Predictors To Further Check for Non-Linearity

If we do see evidence of non-linearity in the plot of residuals against fitted values, I usually then proceed to look at residual plots against each of the individual predictors, in turn, to try to identify the specific predictor (or predictors) where we may need to use a transformation.

The appeal of such plots (as compared to the initial scatterplots we looked at of the outcome against each predictor) is that they eliminate the distraction of the linear fit, and let us look more closely at the non-linear part of the relationship between the outcome and each quantitative predictor. We might also look at a boxplot of the relationship between the outcome and a categorical predictor, although this is primarily for the purpose of assessing the potential for non-constant variance, rather than non-linearity.

p1 <- ggplot(aug_model_B, aes(x = age, y = nihss)) +
    geom_point() + 
    geom_smooth(method = "loess", formula = y ~ x, span = 2, se = FALSE) +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE, col = "tomato")
p2 <- ggplot(aug_model_B, aes(x = sbp, y = nihss)) +
    geom_point() + 
    geom_smooth(method = "loess", formula = y ~ x, span = 2, se = FALSE) +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE, col = "tomato")
p3 <- ggplot(aug_model_B, aes(x = sex, y = nihss)) +
    geom_violin(aes(fill = sex)) + geom_boxplot(width = 0.3) + 
    coord_flip() + guides(fill = "none")

# Combine plot
library(ggpubr)
plot = ggarrange(
  ggarrange(p1, p2, ncol = 2, labels = c("A", "B")), # 1st row with 2 plots
  p3, # 2nd row with 1 plot
  nrow = 2, 
  labels = c("", "C")       # Label of the 3rd plot
  )   

annotate_figure(plot, top = text_grob("Residuals vs. Predictors in aug_model_B model", 
               color = "red", face = "bold", size = 12))

Again, no particularly problematic issues with the assumption of linearity in either of the scatterplots here.

 

5.3.2 Assumption of independence

Independent errors: For any two observations the residual terms should be uncorrelated (or independent). This assumption can be tested with the Durbin–Watson test, which tests for serial correlations between errors. Specifically, it tests whether adjacent residuals are correlated. The test statistic d can vary between 0 and 4:

  • A value of 2: the residuals are uncorrelated.
  • A value greater than 2: a negative correlation between adjacent residuals
  • A value less than 2 indicates a positive correlation.

Values less than 1 or greater than 3 are definitely cause for concern; however, values closer to 2 may still be problematic depending on your sample and model.

 

If you reject the null hypothesis and conclude that autocorrelation is present in the residuals, then you have a few different options to correct this problem if you deem it to be serious enough:

  • For positive serial correlation, consider adding lags of the dependent and/or independent variable to the model.

  • For negative serial correlation, check to make sure that none of your variables are overdifferenced.

  • For seasonal correlation, consider adding seasonal dummy variables to the model.

     

Note: Be very careful with the Durbin–Watson test, though, as it depends on the order of the data: if you reorder your data, you’ll get a different value.

library(car)
durbinWatsonTest(model_B)
##  lag Autocorrelation D-W Statistic p-value
##    1     -0.06230072      2.119325   0.168
##  Alternative hypothesis: rho != 0

Discuss
The D-W statistic is 2.12 with p = 0.17 > 0.05 => there is no significant autocorrelation among the residuals - they can be treated as independent.

 

5.3.3 Assumption of no multicollinearity

We must avoid collinearity in multiple regression. One way of identifying multicollinearity is to scan a correlation matrix of all of the predictor variables and see if any correlate very highly (> 0.80 or 0.90). Please refer to this link for Correlation analysis: https://rpubs.com/minhtri/933782
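A minimal sketch of that scan for the quantitative predictors used in model_B:

round(cor(df[, c("age", "sbp", "time.rand")]), 2)
# Flag any off-diagonal |r| above about 0.8-0.9 as a collinearity warning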

 

The VIF and tolerance statistics (with tolerance being 1 divided by the VIF) are useful for assessing collinearity. We can obtain the VIF using the vif() function from car, compute Tolerance = 1/VIF, and then calculate the mean VIF.

vif(model_B)
##       age       sbp time.rand 
##  1.001050  1.001306  1.001447
1/vif(model_B)
##       age       sbp time.rand 
## 0.9989510 0.9986955 0.9985552
mean(vif(model_B))
## [1] 1.001268

 

Discuss
Guidelines:

  • If the largest VIF is greater than 10 => cause for concern.
  • If the average VIF is substantially greater than 1 => the regression may be biased.
  • Tolerance below 0.1 => a serious problem.
  • Tolerance below 0.2 => a potential problem.

In this model: VIF values are all well below 10 + the tolerance statistics all well above 0.2 + the average VIF is very close to 1. => No collinearity within our data.

 

5.3.4 The Constant Variance Assumption

5.3.4.1 The Scale-Location Plot to Check for Non-Constant Variance

R does provide an additional plot to help assess this issue, as linear model diagnostic plot 3. It plots the square root of the absolute standardized residual against the fitted values. You want the loess smooth in this plot to be flat, rather than sloped.

plot(model_B, which=3)

 

5.3.4.2 Using ggplot2: Square Root of |Standardized Residual| vs Fitted

library(ggrepel)
ggplot(aug_model_B, aes(x = .fitted, y = sqrt(abs(.std.resid)))) +
  geom_point() + 
  geom_point(data = aug_model_B |> 
               slice_max(sqrt(abs(.std.resid)), n = 3),
             col = "red", size = 1) +
  geom_text_repel(data = aug_model_B |> 
               slice_max(sqrt(abs(.std.resid)), n = 3),
               aes(label = studyid), col = "red") +
  geom_smooth(method = "loess", formula = y ~ x, span = 2, se = F) +
  labs(title = "Scale-Location Plot for aug_model_B model",
       caption = "3 largest |Standardized Residual| in red.",
       x = "Fitted Value of nihss", 
       y = "Square Root of |Standardized Residual|") +
  theme(aspect.ratio = 1)

Discuss
Trouble with the constant variance assumption would show up as a funnel shape (heteroscedasticity, non-constant variance) or an upward-sloping loess line in this plot.

 

More info can be found at:
https://thomaselove.github.io/431-notes/30-regr_diagnostics.html#the-scale-location-plot-to-check-for-non-constant-variance

 

5.3.4.3 Using ggplot2: Studentized Residuals vs Fitted

The first useful graph is a plot of fitted values against residuals. This should look like a random array of dots evenly dispersed around zero. If this graph funnels out, then the chances are that there is heteroscedasticity in the data => a violation of the assumption of homogeneity of variance. If there is any sort of curve in this graph then the chances are that the data have violated the assumption of linearity. If the dots seem to have a pattern and are more spread out at some points on the plot than others => violations of both homogeneity of variance and linearity. Any of these scenarios puts the validity of your model into question. Repeat the above for all partial plots too.

# Checking the Homoscedasticity: Plot a scatterplot of studentized residuals against predicted values  
ggplot(df2, aes(fitted, studentized.residuals)) + 
  geom_point() + 
  geom_smooth(method = "lm", colour = "Blue") + 
  labs(x = "Fitted Values", y = "Studentized Residual")

 

5.3.5 The Normality Assumption

If the plot is straight enough, the data are independent, and the plots don’t thicken, you can now move on to the final assumption, that of Normality.

We assume that the errors around the idealized regression model at any specified values of the x-variables follow a Normal model. We need this assumption so that we can use a Student’s t-model for inference. As with other times when we’ve used Student’s t, we’ll settle for the residuals satisfying the Nearly Normal condition. To assess this, we simply look at a histogram or Normal probability plot of the residuals. Note that the Normality Assumption also becomes less important as the sample size grows.

 

The plot that is produced by the plot() function is a Q-Q plot, which shows up deviations from normality. The straight line in this plot represents a normal distribution, and the points represent the observed residuals. Therefore, in a perfectly normally distributed data set, all points will lie on the line. If the histogram looks non-normal, then things are less good. Be warned, though: distributions can look very non-normal in small samples even when they are normal!

# Histogram and density plot:
histogram <-ggplot(df2, aes(studentized.residuals)) + 
  theme(legend.position = "none") + 
  geom_histogram(aes(y = ..density..), colour = "black", fill = "white") + 
  labs(x = "Studentized Residual", y = "Density")

histogram + stat_function(fun = dnorm, args = list(mean = mean(df2$studentized.residuals, na.rm = TRUE), sd = sd(df2$studentized.residuals, na.rm = TRUE)), colour = "red", size = 1)

# Create a Q-Q plot: 3 ways to draw
## 1st way
### qplot(sample = df2$studentized.residuals, stat="qq") + labs(x = "Theoretical Values", y = "Observed Values")

## 2nd way
### library(ggpubr)
### ggqqplot(df2$studentized.residuals, ylab = "Observed Values")

## 3rd way
ggplot(data = df2, aes(sample = studentized.residuals)) + 
      geom_qq() +
      geom_qq_line() +
      labs(x = "Theoretical Values", y = "Observed Values")

## 4th way  
plot(model_B, which=2)

 

5.4 Step 5: Testing the accuracy of the Model

If we find a good model in step 3, we continue with the later steps. If we don’t, we repeat step 3 with different predictors until the p-value < 0.05. I will carry out this step for the model above just for illustration.

 

Outliers: Residuals can be obtained with the resid() function, standardized residuals with the rstandard() function and studentized residuals with the rstudent() function.

Influential cases: Cook’s distances can be obtained with the cooks.distance() function, DFBeta with the dfbeta() function, DFFit with the dffits() function, hat values (leverage) with the hatvalues() function, and the covariance ratio with the covratio() function.

 

5.4.1 Standardized residuals - Outliers and influential cases

In an ordinary sample we would expect 95% of cases to have standardized residuals within about ±2 and approximately 99.74% of values between -3 and +3.

A natural check, therefore, will be to identify the most extreme/unusual standardized residual in our data set, and see whether its value is within the general bounds implied by the Empirical Rule associated with the Normal distribution. If, for instance, we see a standardized residual below -3 or above +3, this is likely to be a very poorly fit point (we would expect to see a point like this about 2.6 times for every 1,000 observations.)

A very poorly fitting point, especially if it also has high influence on the model (a concept we’ll discuss shortly), may be sufficiently different from the rest of the points in the data set to merit its removal before modeling. If we did remove such a point, we’d have to have a good reason for this exclusion, and include that explanation in our report.

5.4.1.1 Standardized residuals approach

We have a sample of 496, so it is reasonable to expect about 25 cases (5% of 496) to have standardized residuals outside these limits.

df2$large.residual <- df2$standardized.residuals > 2 | df2$standardized.residuals < -2
sum(df2$large.residual)
## [1] 14

 

Discuss
Only 14 cases had a large residual (which we defined as bigger than 2 or smaller than −2). It is useful to know not just how many such cases there are, but also which cases they are. We can do this by selecting only those cases for which the variable large.residual is TRUE, using the dataFrame[rows, columns] syntax:

  • df2$large.residual: keep only the rows whose standardized residual is > 2 or < -2
  • c("studyid", "nihss", "age", "sbp", "time.rand", "standardized.residuals"): keep only these columns.
df2[df2$large.residual, c("studyid", "nihss", "age", "sbp", "time.rand", "standardized.residuals")]
## # A tibble: 14 x 6
##    studyid nihss   age   sbp time.rand standardized.residuals
##    <fct>   <dbl> <dbl> <dbl>     <dbl>                  <dbl>
##  1 z022       27  67     148       291                   2.04
##  2 z045       28  93     162       231                   2.26
##  3 z079       28  54     154       112                   2.02
##  4 z097       28  72     163       252                   2.24
##  5 z129       28  43     109       281                   2.14
##  6 z184       28  38     131       216                   2.08
##  7 z243       28  54.5   130       187                   2.07
##  8 z256       28  62     149       293                   2.24
##  9 z281       28  74     159       189                   2.16
## 10 z290       28  56     157       237                   2.17
## 11 z375       28  42     158       124                   2.01
## 12 z388       28  44     158       248                   2.17
## 13 z391       28  66     128       159                   2.06
## 14 z435       28  56     146       133                   2.03

 

Discuss
In a normally distributed sample,

  • 95% of z-scores should lie between −1.96 and +1.96 (roughly ±2)
  • 99% should lie between −2.58 and +2.58
  • 99.9% (i.e., nearly all of them) should lie between −3.29 and +3.29. If any observation falls outside these limits, we should pay closer attention to it. The quick check below shows where these cut-offs come from.
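
These cut-offs are simply Normal quantiles, so nothing beyond base R is needed to confirm them:

qnorm(c(0.975, 0.995, 0.9995))   # ~1.96, ~2.58, ~3.29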

=> In our model, no case falls outside these limits (none beyond ±2.58, let alone ±3) => a fairly accurate model.

 

5.4.1.2 Assessing Standardized Residuals with an Outlier Test

plot(model_B, which=2)

The Q-Q plot of standardized residuals shows no point that is far away from our expectations under the Normal model. If one point did stand out (say, point 45, i.e., row 45 of the data), we could inspect it with this command:

library(kableExtra)
aug_model_B |> slice(45) |> select(studyid:.std.resid) |> 
  kbl(digits = 2) |> kable_classic_2()
| studyid | nihss | age | sbp | time.rand | sex | ia.occlus | .fitted | .resid | .hat | .sigma | .cooksd | .std.resid |
|:------- | ----: | --: | --: | --------: |:--- |:--------- | ------: | -----: | ---: | -----: | ------: | ---------: |
| z045 | 28 | 93 | 162 | 231 | Female | M1 | 17.49 | 10.51 | 0.01 | 4.66 | 0.01 | 2.26 |

Discuss: The .std.resid for z045 is not especially high, so we can ignore this point. However, if a standardized residual exceeded 4, or we simply wanted a formal check, we could assess the most extreme standardized residual with an outlier test:

library(car)
outlierTest(model_B)
## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
##    rstudent unadjusted p-value Bonferroni p
## 45 2.268453           0.023735           NA

Discuss
Here the Bonferroni p is reported as NA, which outlierTest() does whenever the adjusted p-value would exceed 1; no observation is flagged. If instead we saw a Bonferroni p < 0.05, we would conclude there is evidence of a serious problem with Normality, in that a studentized residual that extreme (say > 3) is out of the range we might reasonably expect to see given a Normal distribution of errors.

Which observation has the largest studentized residual? It is the subject in the flagged row (here, row 45):

aug_model_B |> slice(45) |>
    select(studyid:.std.resid) |> kbl(digits = 2) |> kable_classic_2()
| studyid | nihss | age | sbp | time.rand | sex | ia.occlus | .fitted | .resid | .hat | .sigma | .cooksd | .std.resid |
|:------- | ----: | --: | --: | --------: |:--- |:--------- | ------: | -----: | ---: | -----: | ------: | ---------: |
| z045 | 28 | 93 | 162 | 231 | Female | M1 | 17.49 | 10.51 | 0.01 | 4.66 | 0.01 | 2.26 |

 

5.4.2 The leverage - Outliers and influential cases

Hat values (sometimes called leverage) gauge the influence of a case’s observed outcome value over the predicted values. Leverage can lie between 0 (the case has no influence whatsoever) and 1 (the case has complete influence over the prediction). The average leverage value is (k + 1)/n, and a common rule is to investigate cases with leverage greater than twice the average, 2(k + 1)/n, or three times the average, 3(k + 1)/n, as cut-offs for identifying cases exerting undue influence, where:

  • k is the number of predictors in the model.
  • n is the number of participants.

 

In this model nihss ~ age + sbp + time.rand: average leverage = (3+1)/496 ≈ 0.008
=> 2 times the average ≈ 0.016
=> 3 times the average ≈ 0.024
We check below which cases, if any, exceed these cut-offs.
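
A quick sanity check of this arithmetic, assuming model_B is the model above (for a model with an intercept, the mean hat value equals (k + 1)/n exactly):

avg_lev <- mean(hatvalues(model_B))   # = (k + 1) / n = 4 / 496
c(average = avg_lev, twice = 2 * avg_lev, thrice = 3 * avg_lev)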

 

5.4.2.1 Searching the leverage value

C1: We can list the cases that have a large residual (> 2 in absolute value), together with their Cook’s distance, leverage, and covariance ratio:

df2[df2$large.residual, c("studyid", "cooks.distance", "leverage", "covariance.ratios")]
## # A tibble: 14 x 4
##    studyid cooks.distance leverage covariance.ratios
##    <fct>            <dbl>    <dbl>             <dbl>
##  1 z022           0.00561  0.00538             0.980
##  2 z045           0.0112   0.00871             0.975
##  3 z079           0.00789  0.00770             0.983
##  4 z097           0.00532  0.00424             0.972
##  5 z129           0.0139   0.0120              0.983
##  6 z184           0.00821  0.00757             0.981
##  7 z243           0.00397  0.00369             0.977
##  8 z256           0.00699  0.00552             0.973
##  9 z281           0.00394  0.00337             0.974
## 10 z290           0.00392  0.00331             0.973
## 11 z375           0.0101   0.00983             0.985
## 12 z388           0.00727  0.00616             0.976
## 13 z391           0.00441  0.00415             0.978
## 14 z435           0.00555  0.00535             0.980

 

C2: Or we can search only for the cases that have high leverage:

aug_model_B |> select(studyid:ia.occlus, .hat) |> 
    filter(.hat > 0.024) |>
    arrange(desc(.hat))
## # A tibble: 3 x 8
##   studyid nihss   age   sbp time.rand sex    ia.occlus   .hat
##   <fct>   <dbl> <dbl> <dbl>     <dbl> <fct>  <fct>      <dbl>
## 1 z148       23    75   231       120 Male   M1        0.0303
## 2 z115       11    89    78       296 Female M1        0.0256
## 3 z372       28    27    94       117 Male   M1        0.0241

3 of our 496 subjects meet the standard for large leverage values (.hat exceeding three times the average .hat value), but none of them is the outlier (studyid = z045) we identified as having an especially poor fit (large standardized residual).

aug_model_B |>
    filter(studyid == "z045") |>
    select(studyid, .std.resid, .hat)
## # A tibble: 1 x 3
##   studyid .std.resid    .hat
##   <fct>        <dbl>   <dbl>
## 1 z045          2.26 0.00871

So, let’s look at a plot of two of the quantitative predictors in our data to see where these 3 high-leverage points sit. The points with high leverage (large values of .hat) are simply those with an unusual combination of predictor values.

library(ggrepel)   # for geom_text_repel()

ggplot(aug_model_B, aes(x = age, y = sbp)) +
    geom_point(col = "gray50") + 
    geom_point(data = aug_model_B |> 
                   slice_max(.hat, n = 3),
               col = "red", size = 2) + 
    geom_text_repel(data = aug_model_B |> 
                   slice_max(.hat, n = 3),
               aes(label = studyid), col = "red") +
    labs(x = "age", y = "sbp",
         title = "Where are the highly leveraged points in our model?",
         subtitle = "model nihss, 3 highest leverage points in red")

 

5.4.2.2 Residuals vs. leverage graph

Another way to understand the impact of leverage is to look at a plot of residuals vs. leverage, which is diagnostic plot 5 for a linear model. Here’s the base R version for our model_B model.

plot(model_B, which=5)

Here is the ggplot2 version.

ggplot(aug_model_B, aes(x = .hat, y = .std.resid)) +
  geom_point() + 
  geom_point(data = aug_model_B |> filter(.cooksd >= 0.5),
             col = "red", size = 2) +
  geom_text_repel(data = aug_model_B |> filter(.cooksd >= 0.5),
               aes(label = studyid), col = "red") +
  geom_smooth(method = "loess", formula = y ~ x, span = 2, se = FALSE) +
  geom_vline(aes(xintercept = 3 * mean(.hat)), lty = "dashed") +
  labs(title = "Residuals vs. Leverage",
       x = "Leverage", y = "Standardized Residual") 

We see that the most highly leveraged points appear furthest to the right in this plot (in fact, I’ve inserted a dashed vertical line at the cut-off of 3*mean(.hat)).

 

5.4.3 Cook’s distance - Outliers and influential cases

5.4.3.1 Using command

Cook’s distance is a measure of the overall influence of a case on the model; values greater than 1 may be cause for concern, and even points with Cook’s d > 0.5 may merit further investigation.
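
Before examining individual cases, a quick screen of the whole model (assuming model_B as before):

max(cooks.distance(model_B))        # largest Cook's distance in the model
which.max(cooks.distance(model_B))  # row number of that most influential case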

df2[df2$large.residual, c("studyid", "cooks.distance", "leverage", "covariance.ratios")]
## # A tibble: 14 x 4
##    studyid cooks.distance leverage covariance.ratios
##    <fct>            <dbl>    <dbl>             <dbl>
##  1 z022           0.00561  0.00538             0.980
##  2 z045           0.0112   0.00871             0.975
##  3 z079           0.00789  0.00770             0.983
##  4 z097           0.00532  0.00424             0.972
##  5 z129           0.0139   0.0120              0.983
##  6 z184           0.00821  0.00757             0.981
##  7 z243           0.00397  0.00369             0.977
##  8 z256           0.00699  0.00552             0.973
##  9 z281           0.00394  0.00337             0.974
## 10 z290           0.00392  0.00331             0.973
## 11 z375           0.0101   0.00983             0.985
## 12 z388           0.00727  0.00616             0.976
## 13 z391           0.00441  0.00415             0.978
## 14 z435           0.00555  0.00535             0.980

 

Discuss: None of them has a Cook’s distance greater than 1 => none of the cases exerts undue influence on the model.

 

5.4.3.2 Using graph

plot(model_B, which=5)

No points in this plot indicate substantial influence on our model. To do so, they would have to fall outside the Cook’s distance contour shown at the top right of this plot (there is a similar arc at the bottom right, but it is too far away from the data to show up in the plot).
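
If you want different contour levels, plot.lm() lets you set them through its cook.levels argument (shown here with its defaults made explicit):

plot(model_B, which = 5, cook.levels = c(0.5, 1.0))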

 

Or we can draw an index plot of Cook’s distance:

plot(model_B, which=4)

aug_model_B_ex <- aug_model_B |> 
  mutate(obsnum = row_number())   # row index for the x-axis

ggplot(aug_model_B_ex, aes(x = obsnum, y = .cooksd)) + 
    geom_point() + 
    geom_segment(aes(x = obsnum, xend = obsnum, y = 0, yend = .cooksd)) +
    geom_text_repel(data = aug_model_B_ex |> 
               slice_max(.cooksd, n = 3),
               aes(label = studyid)) +
    labs(x = "Row Number",
         y = "Cook's Distance",
         title = "Cook's distance Index plot for model_B",
         subtitle = "Subjects with the 3 largest Cook's d values are identified.")

Remember that we’d need a Cook’s distance value of at least 0.50 before we worry about it in a serious way. Here, the largest we see (0.022, for studyid z372) is still far away from that standard.

Let’s look at the six largest Cook’s distance values.

aug_model_B |> select(studyid, .std.resid, .hat, .cooksd) |>
    arrange(desc(.cooksd)) |> head() |> kbl(digits = 3) |> kable_classic()
| studyid | .std.resid | .hat | .cooksd |
|:------- | ---------: | ---: | ------: |
| z372 | 1.892 | 0.024 | 0.022 |
| z357 | 1.751 | 0.022 | 0.017 |
| z115 | -1.459 | 0.026 | 0.014 |
| z129 | 2.138 | 0.012 | 0.014 |
| z330 | -1.849 | 0.016 | 0.014 |
| z010 | 1.767 | 0.015 | 0.012 |

If we want to search for a particular id:

aug_model_B |> filter(studyid == "z372") |>
    select(studyid:.fitted, .std.resid, .hat, .cooksd) |> 
    kbl(digits = 2) |> kable_classic()
| studyid | nihss | age | sbp | time.rand | sex | ia.occlus | .fitted | .std.resid | .hat | .cooksd |
|:------- | ----: | --: | --: | --------: |:--- |:--------- | ------: | ---------: | ---: | ------: |
| z372 | 28 | 27 | 94 | 117 | Male | M1 | 19.26 | 1.89 | 0.02 | 0.02 |

 

5.4.4 Covariance Ratio - Outliers and influential cases

We consider the following criteria:

  • CVR > 1 + [3(k+1)/n] = 1 + [3(3+1)/496] = 1.024
  • CVR < 1 - [3(k+1)/n] = 1 - [3(3+1)/496] = 0.976

Most of our 14 potential outliers have CVR values within or just outside these boundaries. Moreover, given the small Cook’s distances for all cases, there is probably no cause for alarm.
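
To count how many cases actually fall outside these bounds (a minimal sketch, assuming the covariance.ratios column built earlier in this section):

k <- 3; n <- 496
cvr_hi <- 1 + 3 * (k + 1) / n   # ~1.024
cvr_lo <- 1 - 3 * (k + 1) / n   # ~0.976
sum(df2$covariance.ratios > cvr_hi | df2$covariance.ratios < cvr_lo)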

 

6 Violated Assumptions

For detail on each type of violation, please see 30.14 Violated Assumptions: Problems with Linearity:
https://thomaselove.github.io/431-notes/30-regr_diagnostics.html#the-scale-location-plot-to-check-for-non-constant-variance

Overall, depending on the type of violation, we either remove an outlying observation or transform the outcome and/or the predictors. The Box-Cox approach used below is one common tool:
https://www.r-bloggers.com/2022/10/box-cox-transformation-in-r/
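
As a reminder of what Box-Cox actually estimates (the standard definition, not specific to these notes): for a strictly positive outcome $y$, the family of power transformations indexed by $\lambda$ is

$$
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\
\log y, & \lambda = 0
\end{cases}
$$

so an estimated $\lambda$ near 0.5 points to a square root, near 0 to a log, and near −1 to an inverse transformation.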

 

6.1 Violated Assumptions: Problems with Linearity

So what do serious assumption violations look like, and what can we do about them?

Here is a simulated example that shows a clear problem with non-linearity.

# Simulate two predictors and Normal errors
set.seed(4311); x1 <- rnorm(n = 100, mean = 15, sd = 5)
set.seed(4312); x2 <- rnorm(n = 100, mean = 10, sd = 5)
set.seed(4313); e1 <- rnorm(n = 100, mean = 0, sd = 15)
# The true relationship is quadratic in x2, so a straight-line fit is misspecified
y <- 15 + x1 + x2^2 + e1
viol1 <- data.frame(outcome = y, pred1 = x1, pred2 = x2) |> as_tibble()
model_1 <- lm(outcome ~ pred1 + pred2, data = viol1)
plot(model_1, which = 1)   # residuals vs. fitted: the curve reveals the non-linearity

In light of this, I would look for a potential transformation of the outcome. Does the Box-Cox plot make any useful suggestion?

boxCox(model_1); powerTransform(model_1)   # both functions come from the car package

## Estimated transformation parameter 
##        Y1 
## 0.4414775

The estimated transformation parameter is about 0.44, which is close to 0.5; this suggests fitting a new model using the square root of the outcome. (Note that if any outcome values were negative, we would first have to add a constant to every outcome to make them all positive before Box-Cox could run.)

model_2 <- lm(sqrt(outcome) ~ pred1 + pred2, data = viol1)
plot(model_2, which = 1)

This is meaningfully better in terms of curvature, but it now looks a bit fan-shaped, indicating a potential problem with heteroscedasticity (non-constant variance). Let’s look at the scale-location plot for this model.

plot(model_2, which = 3)

There definitely looks to be a downward trend in this plot. So the square root transformation, by itself, probably hasn’t resolved the assumptions sufficiently well, and we will have to be very careful about our interpretation of the model.
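
If we want a formal test to accompany the plot (an extra step, not part of the original notes), the car package offers a score test for non-constant error variance:

library(car)     # for ncvTest()
ncvTest(model_2) # a small p-value indicates evidence of non-constant variance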

 

6.2 Problems with Non-Normality: Skew

# Simulate two predictors and a skewed error term (a squared Normal is right-skewed)
set.seed(4314); x1 <- rnorm(n = 100, mean = 15, sd = 5)
set.seed(4315); x2 <- rnorm(n = 100, mean = 10, sd = 5)
set.seed(4316); e2 <- rnorm(n = 100, mean = 3, sd = 5)
y2 <- 50 + x1 + x2 + e2^2
viol2 <- data.frame(outcome = y2, pred1 = x1, pred2 = x2) |> as_tibble()
model.3 <- lm(outcome ~ pred1 + pred2, data = viol2)
plot(model.3, which = 2)   # Q-Q plot: right skew bends the points upward at the right

Skewed residuals often show up in strange patterns in the plot of residuals vs. fitted values, too, as in this case.

plot(model.3, which = 1)

Clearly, we have some larger residuals on the positive side, but not on the negative side. Would an outcome transformation be suggested by Box-Cox?

boxCox(model.3); powerTransform(model.3)

## Estimated transformation parameter 
##        Y1 
## -1.438747

The suggested transformation (λ ≈ −1.44, close to −1) looks like the inverse of our outcome.

model.4 <- lm(1/outcome ~ pred1 + pred2, data = viol2)   # inverse-transformed outcome
plot(model.4, which = 2)

OK. That’s something of an improvement. How about the other residual plots with this transformation?

plot(model.4, which = 1)

The impact of the skew is reduced, at least. I might well be satisfied enough with this in practice.
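
As one last optional check (an extra step, not in the original notes), we could run a Shapiro-Wilk test on the residuals of the transformed model; a p-value above 0.05 would be consistent with approximately Normal residuals.

shapiro.test(resid(model.4))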