knitr::opts_chunk$set(
echo = TRUE,
message = FALSE,
warning = FALSE
)
Previous sections:
Note: This analysis is for personal study purposes. In this section, I summarize some basic information about simple linear regression, drawing on several sources.
R Packages:
# More packages will be shown during the analysis process
# Load the required library
library(tidyverse) # Data Wrangling
library(conflicted) # Dealing with conflict package
library(readxl) # Read Excel files
Dealing with Conflicts
There are a lot of packages here, and sometimes individual functions conflict. R's default conflict resolution gives precedence to the most recently loaded package. This can make it hard to detect conflicts, particularly when one is introduced by an update to an existing package.
conflict_prefer("filter", "dplyr")
conflict_prefer("select", "dplyr")
conflict_prefer("Predict", "rms")
conflict_prefer("impute_median", "simputation")
conflict_prefer("summarize", "dplyr")
Data used in these notes
fakestroke.csv
(Source: https://github.com/THOMASELOVE/432-data/blob/master/data/fakestroke.csv)
The fakestroke.csv file contains the following 18 variables for 500 patients:
# To make a table in R Markdown: 1st row: header; 2nd row: alignment; remaining rows: content
## |Variable | Description |
## |:------- | :---------- |
## |studyid | Study ID # (z001 through z500) |
| Variable | Description |
|---|---|
| studyid | Study ID # (z001 through z500) |
| trt | Treatment group (Intervention or Control) |
| age | Age in years |
| sex | Male or Female |
| nihss | NIH Stroke Scale Score (can range from 0-42; higher scores indicate more severe neurological deficits) |
| location | Stroke Location - Left or Right Hemisphere |
| hx.isch | History of Ischemic Stroke (Yes/No) |
| afib | Atrial Fibrillation (1 = Yes, 0 = No) |
| dm | Diabetes Mellitus (1 = Yes, 0 = No) |
| mrankin | Pre-stroke modified Rankin scale score (0, 1, 2 or > 2) indicating functional disability - complete range is 0 (no symptoms) to 6 (death) |
| sbp | Systolic blood pressure, in mm Hg |
| iv.altep | Treatment with IV alteplase (Yes/No) |
| time.iv | Time from stroke onset to start of IV alteplase (minutes) if iv.altep=Yes |
| aspects | Alberta Stroke Program Early Computed Tomography score, which measures extent of stroke from 0 - 10; higher scores indicate fewer early ischemic changes |
| ia.occlus | Intracranial arterial occlusion, based on vessel imaging (five categories) |
| extra.ica | Extracranial ICA occlusion (1 = Yes, 0 = No) |
| time.rand | Time from stroke onset to study randomization, in minutes |
| time.punc | Time from stroke onset to groin puncture, in minutes (only if Intervention) |
Please refer to this section: The R environment - Data Wrangling: https://rpubs.com/minhtri/933332
df <- read_excel("D:/Statistics/R/R data/fakestroke.xlsx",
col_types = c("text", "text", "numeric",
"text", "numeric", "text", "text",
"numeric", "numeric", "text", "numeric",
"text", "numeric", "numeric", "text",
"numeric", "numeric", "numeric"))
names(df)
## [1] "studyid" "trt" "age" "sex" "nihss" "location"
## [7] "hx.isch" "afib" "dm" "mrankin" "sbp" "iv.altep"
## [13] "time.iv" "aspects" "ia.occlus" "extra.ica" "time.rand" "time.punc"
str(df)
## tibble [500 x 18] (S3: tbl_df/tbl/data.frame)
## $ studyid : chr [1:500] "z001" "z002" "z003" "z004" ...
## $ trt : chr [1:500] "Control" "Intervention" "Control" "Control" ...
## $ age : num [1:500] 53 51 68 28 91 34 75 89 75 26 ...
## $ sex : chr [1:500] "Male" "Male" "Female" "Male" ...
## $ nihss : num [1:500] 21 23 11 22 24 18 25 18 25 27 ...
## $ location : chr [1:500] "Right" "Left" "Right" "Left" ...
## $ hx.isch : chr [1:500] "No" "No" "No" "No" ...
## $ afib : num [1:500] 0 1 0 0 0 0 0 0 1 0 ...
## $ dm : num [1:500] 0 0 0 0 0 0 0 0 0 0 ...
## $ mrankin : chr [1:500] "2" "0" "0" "0" ...
## $ sbp : num [1:500] 127 137 138 122 162 166 140 157 129 143 ...
## $ iv.altep : chr [1:500] "Yes" "Yes" "No" "Yes" ...
## $ time.iv : num [1:500] 63 68 NA 78 121 78 97 NA 49 99 ...
## $ aspects : num [1:500] 10 10 10 10 8 5 10 9 6 10 ...
## $ ia.occlus: chr [1:500] "M1" "M1" "ICA with M1" "ICA with M1" ...
## $ extra.ica: num [1:500] 0 1 1 1 0 0 0 0 0 0 ...
## $ time.rand: num [1:500] 139 118 178 160 214 154 122 147 271 141 ...
## $ time.punc: num [1:500] NA 281 NA NA NA NA 268 NA NA 259 ...
df <- df %>% mutate_if(is.character, as.factor)
str(df)
## tibble [500 x 18] (S3: tbl_df/tbl/data.frame)
## $ studyid : Factor w/ 500 levels "z001","z002",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ trt : Factor w/ 2 levels "Control","Intervention": 1 2 1 1 1 1 2 1 1 2 ...
## $ age : num [1:500] 53 51 68 28 91 34 75 89 75 26 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 1 2 2 1 2 1 2 1 ...
## $ nihss : num [1:500] 21 23 11 22 24 18 25 18 25 27 ...
## $ location : Factor w/ 2 levels "Left","Right": 2 1 2 1 2 1 2 2 1 2 ...
## $ hx.isch : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ afib : num [1:500] 0 1 0 0 0 0 0 0 1 0 ...
## $ dm : num [1:500] 0 0 0 0 0 0 0 0 0 0 ...
## $ mrankin : Factor w/ 4 levels "> 2","0","1",..: 4 2 2 2 2 4 2 2 4 2 ...
## $ sbp : num [1:500] 127 137 138 122 162 166 140 157 129 143 ...
## $ iv.altep : Factor w/ 2 levels "No","Yes": 2 2 1 2 2 2 2 1 2 2 ...
## $ time.iv : num [1:500] 63 68 NA 78 121 78 97 NA 49 99 ...
## $ aspects : num [1:500] 10 10 10 10 8 5 10 9 6 10 ...
## $ ia.occlus: Factor w/ 6 levels "A1 or A2","ICA with M1",..: 4 4 2 2 4 1 4 4 4 4 ...
## $ extra.ica: num [1:500] 0 1 1 1 0 0 0 0 0 0 ...
## $ time.rand: num [1:500] 139 118 178 160 214 154 122 147 271 141 ...
## $ time.punc: num [1:500] NA 281 NA NA NA NA 268 NA NA 259 ...
# Convert the 0/1 coded variables to labeled factors:
df$afib <- factor(df$afib, levels = 0:1, labels = c("No", "Yes"))
df$dm <- factor(df$dm, levels = 0:1, labels = c("No", "Yes"))
df$extra.ica <- factor(df$extra.ica, levels = 0:1, labels = c("No", "Yes"))
# Checking after converting
str(df)
## tibble [500 x 18] (S3: tbl_df/tbl/data.frame)
## $ studyid : Factor w/ 500 levels "z001","z002",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ trt : Factor w/ 2 levels "Control","Intervention": 1 2 1 1 1 1 2 1 1 2 ...
## $ age : num [1:500] 53 51 68 28 91 34 75 89 75 26 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 1 2 2 1 2 1 2 1 ...
## $ nihss : num [1:500] 21 23 11 22 24 18 25 18 25 27 ...
## $ location : Factor w/ 2 levels "Left","Right": 2 1 2 1 2 1 2 2 1 2 ...
## $ hx.isch : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ afib : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 2 1 ...
## $ dm : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ mrankin : Factor w/ 4 levels "> 2","0","1",..: 4 2 2 2 2 4 2 2 4 2 ...
## $ sbp : num [1:500] 127 137 138 122 162 166 140 157 129 143 ...
## $ iv.altep : Factor w/ 2 levels "No","Yes": 2 2 1 2 2 2 2 1 2 2 ...
## $ time.iv : num [1:500] 63 68 NA 78 121 78 97 NA 49 99 ...
## $ aspects : num [1:500] 10 10 10 10 8 5 10 9 6 10 ...
## $ ia.occlus: Factor w/ 6 levels "A1 or A2","ICA with M1",..: 4 4 2 2 4 1 4 4 4 4 ...
## $ extra.ica: Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 1 1 1 ...
## $ time.rand: num [1:500] 139 118 178 160 214 154 122 147 271 141 ...
## $ time.punc: num [1:500] NA 281 NA NA NA NA 268 NA NA 259 ...
# Relevel factors:
df$trt = factor(df$trt, levels=c("Intervention", "Control"))
df$mrankin = factor(df$mrankin, levels=c("0", "1", "2", "> 2"))
df$ia.occlus = factor(df$ia.occlus, levels=c("Intracranial ICA", "ICA with M1",
"M1", "M2", "A1 or A2"))
# Check order of factor after relevel:
str(df)
## tibble [500 x 18] (S3: tbl_df/tbl/data.frame)
## $ studyid : Factor w/ 500 levels "z001","z002",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ trt : Factor w/ 2 levels "Intervention",..: 2 1 2 2 2 2 1 2 2 1 ...
## $ age : num [1:500] 53 51 68 28 91 34 75 89 75 26 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 1 2 2 1 2 1 2 1 ...
## $ nihss : num [1:500] 21 23 11 22 24 18 25 18 25 27 ...
## $ location : Factor w/ 2 levels "Left","Right": 2 1 2 1 2 1 2 2 1 2 ...
## $ hx.isch : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ afib : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 2 1 ...
## $ dm : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ mrankin : Factor w/ 4 levels "0","1","2","> 2": 3 1 1 1 1 3 1 1 3 1 ...
## $ sbp : num [1:500] 127 137 138 122 162 166 140 157 129 143 ...
## $ iv.altep : Factor w/ 2 levels "No","Yes": 2 2 1 2 2 2 2 1 2 2 ...
## $ time.iv : num [1:500] 63 68 NA 78 121 78 97 NA 49 99 ...
## $ aspects : num [1:500] 10 10 10 10 8 5 10 9 6 10 ...
## $ ia.occlus: Factor w/ 5 levels "Intracranial ICA",..: 3 3 2 2 3 5 3 3 3 3 ...
## $ extra.ica: Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 1 1 1 ...
## $ time.rand: num [1:500] 139 118 178 160 214 154 122 147 271 141 ...
## $ time.punc: num [1:500] NA 281 NA NA NA NA 268 NA NA 259 ...
Package: VIM
Command: aggr(x, plot = T, prop = T, numbers = F, combined = F, sortVars = T, cex.axis = .7, gap = 3)
For more information, please refer to the following link:
Dealing with missingness - Imputation: https://rpubs.com/minhtri/932007
Checking the percentage of missingness:
library(VIM)
aggr(df, plot = F, numbers = T, prop = T)
##
## Missings in variables:
## Variable Count
## sbp 1
## time.iv 55
## aspects 4
## ia.occlus 1
## extra.ica 1
## time.rand 2
## time.punc 267
aggr(df, plot = T, numbers = T, prop = T, combined= F, sortVars = T, cex.axis=.7, gap=3)
##
## Variables sorted by number of missings:
## Variable Count
## time.punc 0.534
## time.iv 0.110
## aspects 0.008
## time.rand 0.004
## sbp 0.002
## ia.occlus 0.002
## extra.ica 0.002
## studyid 0.000
## trt 0.000
## age 0.000
## sex 0.000
## nihss 0.000
## location 0.000
## hx.isch 0.000
## afib 0.000
## dm 0.000
## mrankin 0.000
## iv.altep 0.000
aggr(df, plot = T, numbers = T, prop = F, combined= F, sortVars = T, cex.axis=.7, gap=3)
##
## Variables sorted by number of missings:
## Variable Count
## time.punc 267
## time.iv 55
## aspects 4
## time.rand 2
## sbp 1
## ia.occlus 1
## extra.ica 1
## studyid 0
## trt 0
## age 0
## sex 0
## nihss 0
## location 0
## hx.isch 0
## afib 0
## dm 0
## mrankin 0
## iv.altep 0
Discuss
Note that time.iv and time.punc are missing largely by design: time.iv applies only when iv.altep = Yes, and time.punc only for the Intervention group. For the regression examples below, we keep just the variables we need and restrict the data to complete cases:
df = df %>% select(studyid, nihss, age, sbp, time.rand, sex, ia.occlus) %>% na.omit()
Continuous outcome - 1 Continuous predictor
The outcomes of interest in this dataset are nihss and mrankin.
For the simple linear regression, we start with a question: "Can we use sbp to predict nihss?" or "Over what range of sbp can we make a reasonable prediction of nihss?"
Let's draw a scatter plot to illustrate the relationship between nihss and sbp (outcome on the y-axis, predictor on the x-axis):
ggplot(df, aes(x = sbp, y = nihss)) + geom_point() + geom_smooth(method = "lm", colour = "red")
Package: stats (base R)
Command: newModel <- lm(outcome ~ predictor(s), data = dataFrame, na.action = an.action)
model_A <- lm(nihss ~ sbp, data = df)
summary(model_A)
##
## Call:
## lm(formula = nihss ~ sbp, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.1208 -3.9554 -0.3104 3.8197 10.1171
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.094731 1.240217 15.396 <2e-16 ***
## sbp -0.007434 0.008393 -0.886 0.376
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.682 on 494 degrees of freedom
## Multiple R-squared: 0.001586, Adjusted R-squared: -0.0004354
## F-statistic: 0.7846 on 1 and 494 DF, p-value: 0.3762
confint(model_A, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 16.65798015 21.531481749
## sbp -0.02392526 0.009056467
First, we should look at the final part of the output and discuss the Multiple R-squared.
For these data, R2 has a value of 0.0016. Because there is only one predictor, this value is the square of the simple correlation between nihss and sbp - we can recover that correlation by taking the square root of R2:
sqrt(0.0016)
## [1] 0.04
=> 99.84% of the variation in nihss cannot be explained by sbp alone. Therefore, there must be other variables that have an influence.
Discuss the Adjusted R-squared
From thomaselove.github.io/432-notes:
The Adjusted R-squared value “adjusts” for the size of our model
in terms of the number of coefficients included in the model.
The adjusted R-squared will always be smaller than the Multiple
R-squared.
In particular, we hope to find models where the adjusted R2 isn’t
substantially less than the Multiple R-squared.
The adjusted R-squared is usually a better estimate of likely
performance of our model in new data than is the Multiple
R-squared.
The adjusted R-squared result is no longer interpretable as a
proportion of anything - in fact, it can fall below 0.
We can obtain the adjusted R2 from the raw R2, the number of observations N and the number of predictors p included in the model: Adjusted R2 = 1 - (1 - R2)(N - 1)/(N - p - 1).
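As a quick check, we can reproduce the adjusted R2 that summary() reports from this formula (a minimal sketch using the model_A fit above):
r2 <- summary(model_A)$r.squared  # Multiple R-squared: 0.001586
n <- nobs(model_A)                # number of observations: 496
p <- length(coef(model_A)) - 1    # number of predictors: 1
1 - (1 - r2) * (n - 1) / (n - p - 1)  # -0.0004354, matching summary()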
Discuss the F statistic and its p-value
p-value = 0.3762. This tells us that, if the null hypothesis were true, an F-ratio this large would occur more than 5% of the time (here, about 38% of the time). Therefore, we conclude that our regression model does not predict nihss significantly better than simply using the mean value of nihss. In short, the regression model overall does not predict nihss significantly well
=> bad model.
Discuss the equation
The equation is nihss = 19.09 - 0.007 sbp.
=> The intercept is not meaningful on its own; it simply sets the height of the line.
Slope or gradient (b1) = -0.007: the change in the outcome associated with a unit change in the predictor. If the predictor increases by one unit (if sbp increases by 1 mm Hg), the model predicts that nihss decreases by about 0.007 points.
Discuss the p-value of coefficients
In general, values of the regression coefficient b represent the change
in the outcome resulting from a unit change in the predictor and that if
a predictor is having a significant impact on our ability to predict the
outcome then this b should be different from 0 (and big relative to its
standard error). We also saw that the t-test tells us whether the
b-value is different from 0. R provides the exact probability that the
observed value of t would occur if the value of b in the population were
0. If this observed significance is less than 0.05, then scientists
agree that the result reflects a genuine effect.
For sbp, the probability is 0.376 => not significant => the probability of observing a t-value this large, if the value of b in the population were 0, is greater than 0.05.
=> b is not significantly different from 0.
=> sbp makes no significant contribution (p > .05) to predicting nihss; equivalently, we don't have enough evidence to say that there is a relationship between nihss and sbp.
Discuss the Confidence Interval
confint(model_A, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 16.65798015 21.531481749
## sbp -0.02392526 0.009056467
Explaining
The confidence intervals can be interpreted as: if we’d collected 100 samples, and calculated the confidence intervals for b, we are saying that 95% of these confidence intervals would contain the true value of b. Therefore, we can be fairly confident that the confidence interval we have constructed for this sample will contain the true value of b in the population.
A good model will have a small confidence interval, indicating that the value of b in this sample is close to the true value of b in the population.
The sign (positive or negative) of the b-values tells us about the direction of the relationship between the predictor and the outcome.
=> Therefore, we would expect a very bad model to have confidence intervals that cross zero, indicating that in some samples the predictor has a negative relationship to the outcome whereas in others it has a positive relationship.
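A quick way to see which coefficients have intervals that cross zero (a small sketch):
ci <- confint(model_A, level = 0.95)
ci[, 1] < 0 & ci[, 2] > 0  # TRUE where the interval crosses zero (here: sbp)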
Discuss:
In this model, the predictor (sbp) has a very tight confidence interval, but the interval crosses zero.
=> This is a bad model: we don't have enough evidence to say that there is a relationship between nihss and sbp.
Explain
Let's check our 1st observation, z001, who has nihss = 21 and sbp = 127:
Observed value: nihss = 21 (at sbp = 127).
Fitted value: nihss = 19.09 - 0.007 * 127 ≈ 18.20 (using the rounded coefficients; with the unrounded coefficients the fitted value is about 18.15).
Residual = Observed - Fitted = 21 - 18.20 ≈ 2.80 (2.85 with the unrounded coefficients, as shown in the table below).
Points above the regression line have positive residuals, and points below the regression line have negative residuals. Points on the line have zero residuals.
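We can run the same check with the unrounded coefficients using R's accessor functions (a minimal sketch):
fitted(model_A)[1]  # fitted nihss for z001: about 18.15
resid(model_A)[1]   # residual: 21 - 18.15 = 2.85
predict(model_A, newdata = data.frame(sbp = 127))  # the same fitted value, via predict()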
Discuss
According to thomaselove.github.io/432-notes:
The mean residual will always be zero in an ordinary least
squares model, but a five number summary of the residuals is provided by
the summary, as is an estimated standard deviation of the residuals
(called here the Residual standard error).
In the fakestroke data, the minimum residual was -8.12, so for one subject, the observed value was 8.12 points smaller than the predicted (fitted) value; that is, the prediction was 8.12 points too large for that subject.
Similarly, the maximum residual was 10.12 points, so for one subject the prediction was 10.12 points too small.
Create a variable called residuals in the data frame df2 (df2$residuals) that contains the residuals from model_A (via resid(model_A)), along with the other case-wise diagnostics:
df2 = df %>% select(studyid, nihss, sbp)
df2$residuals = resid(model_A)
df2$fitted = fitted(model_A)
df2$standardized.residuals = rstandard(model_A)
df2$studentized.residuals = rstudent(model_A)
df2$cooks.distance = cooks.distance(model_A)
df2$dfbeta = dfbeta(model_A)
df2$dffit = dffits(model_A)
df2$leverage = hatvalues(model_A)
df2$covariance.ratios = covratio(model_A)
df2
## # A tibble: 496 x 12
## studyid nihss sbp residu~1 fitted stand~2 stude~3 cooks~4 dfbeta~5 dffit
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 z001 21 127 2.85 18.2 0.610 0.609 5.83e-4 3.07e-2 3.41e-2
## 2 z002 23 137 4.92 18.1 1.05 1.05 1.25e-3 2.99e-2 5.01e-2
## 3 z003 11 138 -7.07 18.1 -1.51 -1.51 2.52e-3 -3.96e-2 -7.11e-2
## 4 z004 22 122 3.81 18.2 0.816 0.815 1.27e-3 5.00e-2 5.04e-2
## 5 z005 24 162 6.11 17.9 1.31 1.31 2.46e-3 -3.46e-2 7.02e-2
## 6 z006 18 166 0.139 17.9 0.0298 0.0298 1.49e-6 -1.05e-3 1.73e-3
## 7 z007 25 140 6.95 18.1 1.49 1.49 2.34e-3 3.24e-2 6.85e-2
## 8 z008 18 157 0.0725 17.9 0.0155 0.0155 2.93e-7 -2.40e-4 7.64e-4
## 9 z009 25 129 6.86 18.1 1.47 1.47 3.14e-3 6.74e-2 7.93e-2
## 10 z010 27 143 8.97 18.0 1.92 1.92 3.75e-3 2.92e-2 8.69e-2
## # ... with 486 more rows, 3 more variables: dfbeta[2] <dbl>, leverage <dbl>,
## # covariance.ratios <dbl>, and abbreviated variable names 1: residuals,
## # 2: standardized.residuals, 3: studentized.residuals, 4: cooks.distance,
## # 5: dfbeta[,"(Intercept)"]
Checking the distribution: Plot a histogram and Q-Q plot of the studentized residuals.
# Histogram and density plot:
histogram <-ggplot(df2, aes(studentized.residuals)) +
theme(legend.position = "none") +
geom_histogram(aes(y = ..density..), colour = "black", fill = "white") +
labs(x = "Studentized Residual", y = "Density")
histogram + stat_function(fun = dnorm, args = list(mean = mean(df2$studentized.residuals, na.rm = TRUE), sd = sd(df2$studentized.residuals, na.rm = TRUE)), colour = "red", size = 1)
# Create a Q-Q plot: 3 ways to draw
## 1st way (note: the old stat = "qq" argument is defunct; the sample aesthetic alone gives a Q-Q plot)
qqplot.resid <- qplot(sample = df2$studentized.residuals) + labs(x = "Theoretical Values", y = "Observed Values")
qqplot.resid
## 2nd way
### library(ggpubr)
### ggqqplot(df2$studentized.residuals, ylab = "Observed Values")
## 3rd way
ggplot(data = df2, aes(sample = studentized.residuals)) +
geom_qq() +
geom_qq_line() +
labs( x = "Theoretical Values", y = "Observed Values")
Checking the Homoscedasticity: Plot a scatterplot of studentized residuals against predicted values
ggplot(df2, aes(fitted, studentized.residuals)) +
geom_point() +
geom_smooth(method = "lm", colour = "Blue")+ labs(x = "Fitted Values
lm(nihss ~ sbp)", y = "Studentized Residual")
Or we can simply use the following base R commands:
hist(rstudent(model_A))
hist(rstandard(model_A))
par(mfrow=c(2,2))
plot(model_A)
Continuous outcome - 1 Categorical predictor
Package: stats (base R)
Command: newModel <- lm(outcome ~ predictor(s), data = dataFrame, na.action = an.action)
model_C <- lm(nihss ~ ia.occlus, data = df)
summary(model_C)
##
## Call:
## lm(formula = nihss ~ ia.occlus, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.2152 -4.2152 -0.2313 3.7848 10.7687
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.750 2.325 6.345 5.05e-10 ***
## ia.occlusICA with M1 2.481 2.359 1.052 0.2934
## ia.occlusM1 3.465 2.339 1.481 0.1392
## ia.occlusM2 4.301 2.441 1.762 0.0787 .
## ia.occlusA1 or A2 7.583 3.551 2.136 0.0332 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.649 on 491 degrees of freedom
## Multiple R-squared: 0.0217, Adjusted R-squared: 0.01373
## F-statistic: 2.723 on 4 and 491 DF, p-value: 0.02897
confint(model_C, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 10.1827135 19.317286
## ia.occlusICA with M1 -2.1536104 7.116297
## ia.occlusM1 -1.1309126 8.061292
## ia.occlusM2 -0.4945081 9.097072
## ia.occlusA1 or A2 0.6066880 14.559979
Discuss
We can see that:
round(tapply(df$nihss, df$ia.occlus, mean, na.rm=TRUE), 3)
## Intracranial ICA ICA with M1 M1 M2
## 14.750 17.231 18.215 19.051
## A1 or A2
## 22.333
Let's look at our 4 dummy variables here. The 1st one is ia.occlusICA with M1. This first dummy variable shows the difference in mean nihss score between the ICA with M1 group and the baseline Intracranial ICA group: mean of ICA with M1 - mean of Intracranial ICA = 17.231 - 14.750 = 2.481. So the beta values tell us the relative difference between each group and the group that we chose as the baseline category.
For the ICA with M1 dummy, the t-test is non-significant (p > 0.05) => we do not have evidence that mean nihss differs between Intracranial ICA and ICA with M1.
For the A1 or A2 dummy, the coefficient is positive and significant (p < 0.05) => mean nihss is significantly higher for A1 or A2 than for Intracranial ICA.
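To see exactly how R encoded the five-level factor as four dummy variables (with Intracranial ICA as the baseline), we can inspect the contrasts or the design matrix (a small sketch):
contrasts(df$ia.occlus)      # the 0/1 dummy coding; baseline row = Intracranial ICA
head(model.matrix(model_C))  # the design matrix rows actually used in the fit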
Continuous outcome - Continuous predictor(s)
The outcomes of interest in this dataset are nihss and mrankin.
We start with a question: "Can we use age, sbp and time.rand to predict nihss?"
Package: stats (base R)
Command: newModel <- lm(outcome ~ predictor(s), data = dataFrame, na.action = an.action)
model_B <- lm(nihss ~ age + sbp + time.rand, data = df)
summary(model_B)
##
## Call:
## lm(formula = nihss ~ age + sbp + time.rand, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.5771 -3.9720 -0.4679 3.6668 10.5137
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.843881 1.602149 13.010 <2e-16 ***
## age -0.010033 0.012355 -0.812 0.4172
## sbp -0.006869 0.008385 -0.819 0.4131
## time.rand -0.005679 0.003253 -1.746 0.0815 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.675 on 492 degrees of freedom
## Multiple R-squared: 0.008917, Adjusted R-squared: 0.002873
## F-statistic: 1.475 on 3 and 492 DF, p-value: 0.2204
confint(model_B, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 17.69598277 2.399178e+01
## age -0.03430919 1.424283e-02
## sbp -0.02334311 9.605859e-03
## time.rand -0.01206948 7.124692e-04
First, we should look at the final part of the output: the Multiple R-squared
For these data, R2 has a value of 0.009. The two added predictors raise R2 from about 0.2% (sbp alone) to 0.9% - an extra 0.7 percentage points, still a very small share of the variation in nihss.
=> 99.1% of the variation in nihss cannot be explained by age, sbp and time.rand. Therefore, there must be other variables that have an influence.
(The Adjusted R-squared and its formula were discussed for the simple model above; the same formula applies here with p = 3 predictors, giving the 0.0029 reported in the output.)
p-value = 0.2204. This tells us that, if the null hypothesis were true, an F-ratio this large would occur more than 5% of the time (here, about 22% of the time). Therefore, the regression model does not predict nihss significantly better than simply using the mean value of nihss. In short, the regression model overall does not predict nihss significantly well
=> Bad model, or: there is not sufficient evidence of a significant relationship between nihss and age + sbp + time.rand.
The equation is nihss = 20.84 - 0.01 age - 0.0069 sbp - 0.0057 time.rand:
In general, values of the regression coefficient b represent the change in the outcome resulting from a unit change in the predictor and that if a predictor is having a significant impact on our ability to predict the outcome then this b should be different from 0 (and big relative to its standard error). We also saw that the t-test tells us whether the b-value is different from 0. R provides the exact probability that the observed value of t would occur if the value of b in the population were 0. If this observed significance is less than 0.05, then scientists agree that the result reflects a genuine effect:
• age, t(492) = -0.812 , p > .05,
• sbp, t(492) = -0.819 , p > .05,
• time.rand, t(492) = -1.746 , p > .05,
=> p > 0.05 => none of them is a significant predictor of nihss.
=> Because the absolute t value of time.rand is larger than those of age and sbp, age and sbp have a similar (small) effect, while time.rand has a somewhat larger effect on nihss.
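Although none of the predictors reaches significance, the fitted equation can still be used for prediction via predict() rather than plugging rounded coefficients in by hand; the covariate values below are made up purely for illustration:
predict(model_B,
        newdata = data.frame(age = 70, sbp = 140, time.rand = 150),
        interval = "prediction")  # point prediction plus a 95% prediction interval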
library(QuantPsyc)
lm.beta(model_B)
## age sbp time.rand
## -0.03646526 -0.03679016 -0.07840996
These values tell us the number of standard deviations by which the outcome changes for a one standard deviation change in the predictor. Because the standardized beta values are all measured in standard deviation units, they are directly comparable and tell us the relative importance of each predictor (bigger absolute value = more important).
=> time.rand > sbp ≈ age.
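We can verify one of these by hand, since a standardized beta equals b × SD(x)/SD(y) (a small check; df here holds the same complete cases used to fit model_B):
coef(model_B)["time.rand"] * sd(df$time.rand) / sd(df$nihss)  # about -0.0784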
confint(model_B, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 17.69598277 2.399178e+01
## age -0.03430919 1.424283e-02
## sbp -0.02334311 9.605859e-03
## time.rand -0.01206948 7.124692e-04
Explaining: the interpretation of these confidence intervals is the same as described for the simple model above.
Discuss: In this model, all of the predictors' confidence intervals cross zero.
=> This is a bad model: we don't have enough evidence to say that there is a relationship between nihss and age, sbp and time.rand.
We have the following hypotheses:
H0 (null hypothesis): the full model is not significantly better than the reduced model.
Ha (alternative hypothesis): the full model is significantly better.
Using the anova() function: model 2 must contain all of model 1's variables, plus more.
Command: anova(model.1, model.2, ..., model.n)
anova(model_A, model_B)
## Analysis of Variance Table
##
## Model 1: nihss ~ sbp
## Model 2: nihss ~ age + sbp + time.rand
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 494 10831
## 2 492 10751 2 79.526 1.8196 0.1632
Discuss
F(2, 492) = 1.82, p-value = 0.16 > 0.05
=> Model 2 does not significantly improve the fit to the data compared with model 1.
Package: relaimpo
Command: calc.relimp(model, type = "lmg")
library(relaimpo)
calc.relimp(model_B, type = "lmg")
## Response variable: nihss
## Total response variance: 21.915
## Analysis based on 496 observations
##
## 3 Regressors:
## age sbp time.rand
## Proportion of variance explained by model: 0.89%
## Metrics are not normalized (rela=FALSE).
##
## Relative importance metrics:
##
## lmg
## age 0.001289178
## sbp 0.001468978
## time.rand 0.006158490
##
## Average coefficients for different model sizes:
##
## 1X 2Xs 3Xs
## age -0.009725877 -0.009876461 -0.010033176
## sbp -0.007434398 -0.007157730 -0.006868625
## time.rand -0.005691923 -0.005685081 -0.005678503
Discuss
In this model, the proportion of variance explained is only 0.9%.
To obtain bootstrap confidence intervals and rankings for the relative importance percentages:
boot = boot.relimp(model_B, b = 1000, type = c("lmg"), fixed = F)
booteval.relimp(boot, typesel = "lmg", level = 0.9, bty ="perc", nodiff = T )
## Response variable: nihss
## Total response variance: 21.915
## Analysis based on 496 observations
##
## 3 Regressors:
## age sbp time.rand
## Proportion of variance explained by model: 0.89%
## Metrics are not normalized (rela=FALSE).
##
## Relative importance metrics:
##
## lmg
## age 0.001289178
## sbp 0.001468978
## time.rand 0.006158490
##
## Average coefficients for different model sizes:
##
## 1X 2Xs 3Xs
## age -0.009725877 -0.009876461 -0.010033176
## sbp -0.007434398 -0.007157730 -0.006868625
## time.rand -0.005691923 -0.005685081 -0.005678503
##
##
## Confidence interval information ( 1000 bootstrap replicates, bty= perc ):
## Relative Contributions with confidence intervals:
##
## Lower Upper
## percentage 0.9 0.9 0.9
## age.lmg 0.0013 ABC 0.0000 0.0136
## sbp.lmg 0.0015 ABC 0.0000 0.0119
## time.rand.lmg 0.0062 ABC 0.0001 0.0233
##
## Letters indicate the ranks covered by bootstrap CIs.
## (Rank bootstrap confidence intervals always obtained by percentile method)
## CAUTION: Bootstrap confidence intervals can be somewhat liberal.
The assumptions and conditions for a multiple regression model are nearly the same as those for simple regression. The key assumptions that must be checked in a multiple regression model are:
Linearity Assumption
Independence Assumption
Equal Variance (Constant Variance / Homoscedasticity) Assumption
Normality Assumption
Create a variable called residuals in the data frame df2 (df2$residuals) that contains the residuals from model_B (via resid(model_B)), along with the other case-wise diagnostics:
df2 = df %>% select(studyid, nihss, age, sbp, time.rand) %>% na.omit
df2$residuals = resid(model_B)
df2$fitted = fitted(model_B)
df2$standardized.residuals = rstandard(model_B)
df2$studentized.residuals = rstudent(model_B)
df2$cooks.distance = cooks.distance(model_B)
df2$dfbeta = dfbeta(model_B)
df2$dffit = dffits(model_B)
df2$leverage = hatvalues(model_B)
df2$covariance.ratios = covratio(model_B)
df2
## # A tibble: 496 x 14
## studyid nihss age sbp time.rand residu~1 fitted standa~2 studen~3 cooks~4
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 z001 21 53 127 139 2.35 18.7 0.504 0.504 4.05e-4
## 2 z002 23 51 137 118 4.28 18.7 0.919 0.919 1.60e-3
## 3 z003 11 68 138 178 -7.20 18.2 -1.54 -1.55 1.62e-3
## 4 z004 22 28 122 160 3.18 18.8 0.686 0.686 1.69e-3
## 5 z005 24 91 162 214 6.40 17.6 1.37 1.38 3.66e-3
## 6 z006 18 34 166 154 -0.488 18.5 -0.105 -0.105 3.26e-5
## 7 z007 25 75 140 122 6.56 18.4 1.41 1.41 3.19e-3
## 8 z008 18 89 157 147 -0.0378 18.0 -0.00812 -0.00811 1.38e-7
## 9 z009 25 75 129 271 7.33 17.7 1.57 1.58 3.57e-3
## 10 z010 27 26 143 141 8.20 18.8 1.77 1.77 1.18e-2
## # ... with 486 more rows, 4 more variables: dfbeta <dbl[,4]>, dffit <dbl>,
## # leverage <dbl>, covariance.ratios <dbl>, and abbreviated variable names
## # 1: residuals, 2: standardized.residuals, 3: studentized.residuals,
## # 4: cooks.distance
Or we can use the augment() function from the broom package:
library(broom)
aug_model_B = augment(model_B, data = df)
According to thomaselove.github.io/431-notes, We are fitting a linear model:
By linear, we mean that each predictor value, x, appears simply
multiplied by its coefficient and added to the model.
No x appears in an exponent or some other more complicated
function.
If the regression model is true, then the outcome y is linearly related to each of the x’s.
Scatterplots of y against each of the predictors are reasonably
straight.
The scatterplots need not show a strong (or any!) slope; we just
check that there isn’t a bend or other nonlinearity.
Any substantial curve is indicative of a potential problem.
Modest bends are not usually worthy of serious attention.
For example, in the example data, here are the relevant scatterplots for the quantitative predictors (age, sbp) as well as a boxplot with violin for the categorical predictor (sex).
p1 <- ggplot(df, aes(x = age, y = nihss)) +
geom_point() +
geom_smooth(method = "loess", se = FALSE, span = 2, formula = y ~ x) +
geom_smooth(method = "lm", se = FALSE, col = "tomato", formula = y ~ x)
p2 <- ggplot(df, aes(x = sbp, y = nihss)) +
geom_point() +
geom_smooth(method = "loess", se = FALSE, span = 2, formula = y ~ x) +
geom_smooth(method = "lm", se = FALSE, col = "tomato", formula = y ~ x)
p3 <- ggplot(df, aes(x = sex, y = nihss)) +
geom_violin(aes(fill = sex)) +
geom_boxplot(width = 0.4) +
coord_flip() + guides(fill = "none")
# Combine plot
library(ggpubr)
plot = ggarrange(
ggarrange(p1, p2, ncol = 2, labels = c("A", "B")), # 1st row with 2 plots
p3, # 2nd row with 1 plot
nrow = 2,
labels = c("", "C") # Label of the 3rd plot
)
annotate_figure(plot, top = text_grob("nihss vs Predictors",
color = "red", face = "bold", size = 12))
Discuss
I’ve added a straight line fit [in tomato red] and a
loess smooth [in blue] to each plot to guide your
assessment of the “straight enough” condition. Note the
use here of span = 2 in the loess smooths to produce lines that are less
wiggly than they would be using the default span = 0.75. We can conclude
that:
Each of these is “straight enough” for our purposes, in initially
fitting the data.
If one of these was not, we might consider a transformation of the relevant predictor, or, if all were problematic, we might transform the outcome Y.
The residuals should appear to have no pattern (no curve, for instance) with respect to the predicted (fitted) values. It is a very good idea to plot the residuals against the fitted values to check for patterns, especially bends or other indications of non-linearity. For a multiple regression, the fitted values are a combination of the x’s given by the regression equation, so they combine the effects of the x’s in a way that makes sense for our particular regression model. That makes them a good choice to plot against. We’ll check for other things in this plot, as well.
When you ask R to plot the result of a linear model, it can produce several separate diagnostic plots, the first of which is a plot of residuals vs. fitted values.
To obtain this plot for the model (model_B) using age, sbp and time.rand to predict nihss, we have two options. The simpler approach uses the following code, selecting plot 1 with the which argument of the plot() function.
plot(model_B, which=1)
Discuss
A smooth is again added (in red) to help you identify serious
non-linearity. In this case, I would conclude that there were no serious
problems with linearity in these data.
The plot also, by default, identifies the 3 values with the largest (in absolute value) residuals. Here, these are rows 45, 97 and 255, each of which has a positive residual (i.e. they represent under-predictions by the model). To identify these points in the data, use the slice function. These turn out to be the subjects with studyid = z045, z097 and z256; the row numbers no longer match the ids exactly because rows with missing values were dropped.
slice(df2, c(45, 97, 255))
## # A tibble: 3 x 14
## studyid nihss age sbp time.rand residuals fitted standar~1 stude~2 cooks~3
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 z045 28 93 162 231 10.5 17.5 2.26 2.27 0.0112
## 2 z097 28 72 163 252 10.4 17.6 2.24 2.24 0.00532
## 3 z256 28 62 149 293 10.5 17.5 2.24 2.25 0.00699
## # ... with 4 more variables: dfbeta <dbl[,4]>, dffit <dbl>, leverage <dbl>,
## # covariance.ratios <dbl>, and abbreviated variable names
## # 1: standardized.residuals, 2: studentized.residuals, 3: cooks.distance
library(ggrepel)
ggplot(aug_model_B, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_abline(intercept = 0, slope = 0, lty = "dashed") +
geom_point(data = aug_model_B |>
slice_max(abs(.resid), n = 5),
col = "red", size = 2) +
geom_text_repel(data = aug_model_B |>
slice_max(abs(.resid), n = 5),
aes(label = studyid), col = "red") +
geom_smooth(method = "loess", formula = y ~ x, se = F) +
labs(title = "Residuals vs. Fitted Values from data",
caption = "5 largest |residuals| highlighted in red.",
x = "Fitted nihss", y = "Residual") +
theme(aspect.ratio = 1)
If we do see evidence of non-linearity in the plot of residuals against fitted values, I usually then proceed to look at residual plots against each of the individual predictors, in turn, to try to identify the specific predictor (or predictors) where we may need to use a transformation.
The appeal of such plots (as compared to the initial scatterplots we looked at of the outcome against each predictor) is that they eliminate the distraction of the linear fit, and let us look more closely at the non-linear part of the relationship between the outcome and each quantitative predictor. We might also look at a boxplot of the relationship between the outcome and a categorical predictor, although this is primarily for the purpose of assessing the potential for non-constant variance, rather than non-linearity.
p1 <- ggplot(aug_model_B, aes(x = age, y = nihss)) +
geom_point() +
geom_smooth(method = "loess", formula = y ~ x, span = 2, se = FALSE) +
geom_smooth(method = "lm", formula = y ~ x, se = FALSE, col = "tomato")
p2 <- ggplot(aug_model_B, aes(x = sbp, y = nihss)) +
geom_point() +
geom_smooth(method = "loess", formula = y ~ x, span = 2, se = FALSE) +
geom_smooth(method = "lm", formula = y ~ x, se = FALSE, col = "tomato")
p3 <- ggplot(aug_model_B, aes(x = sex, y = nihss)) +
geom_violin(aes(fill = sex)) + geom_boxplot(width = 0.3) +
coord_flip() + guides(fill = "none")
# Combine plot
library(ggpubr)
plot = ggarrange(
ggarrange(p1, p2, ncol = 2, labels = c("A", "B")), # 1st row with 2 plots
p3, # 2nd row with 1 plot
nrow = 2,
labels = c("", "C") # Label of the 3rd plot
)
annotate_figure(plot, top = text_grob("Residuals vs. Predictors in aug_model_B model",
color = "red", face = "bold", size = 12))
Again, no particularly problematic issues with the assumption of
linearity in either of the scatterplots here.
Independent errors: For any two observations the residual terms should be uncorrelated (or independent). This assumption can be tested with the Durbin–Watson test, which tests for serial correlations between errors. Specifically, it tests whether adjacent residuals are correlated. The test statistic d can vary between 0 and 4:
Values less than 1 or greater than 3 are definitely cause for concern; however, values closer to 2 may still be problematic depending on your sample and model.
If you reject the null hypothesis and conclude that autocorrelation is present in the residuals, then you have a few different options to correct this problem if you deem it to be serious enough:
For positive serial correlation, consider adding lags of the
dependent and/or independent variable to the model.
For negative serial correlation, check to make sure that none of
your variables are overdifferenced.
For seasonal correlation, consider adding seasonal dummy variables to the model.
Note: Be very careful with the Durbin–Watson test, though, as it depends on the order of the data: if you reorder your data, you’ll get a different value.
library(car)
durbinWatsonTest(model_B)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.06230072 2.119325 0.168
## Alternative hypothesis: rho != 0
Discuss
The statistic is 2.12 with p = 0.17 > 0.05 => there is no evidence of correlation among the residuals - they can be treated as independent.
We must avoid collinearity in multiple regression. One way of identifying multicollinearity is to scan a correlation matrix of all of the predictor variables and see if any correlate very highly (> 0.80 or 0.90). Please refer to this link for Correlation analysis: https://rpubs.com/minhtri/933782
The VIF and tolerance statistics (with tolerance being 1 divided by the VIF) are useful statistics to assess collinearity. We can obtain the VIF using the vif() function. Tolerance = 1/VIF. Then we calculate mean VIF.
vif(model_B)
## age sbp time.rand
## 1.001050 1.001306 1.001447
1/vif(model_B)
## age sbp time.rand
## 0.9989510 0.9986955 0.9985552
mean(vif(model_B))
## [1] 1.001268
Discuss
Guidelines (Field et al., 2012): if the largest VIF is greater than 10, there is cause for concern; if the average VIF is substantially greater than 1, the regression may be biased; tolerance below 0.2 indicates a potential problem, and below 0.1 a serious problem.
In this model, the VIF values are all well below 10, the tolerance statistics are all well above 0.2, and the average VIF is very close to 1 => no collinearity within our data.
R does provide an additional plot to help assess this issue as linear model diagnostic plot 3. This one looks at the square root of the standardized residual plotted against the fitted values. You want the loess smooth in this plot to be flat, rather than sloped.
plot(model_B, which=3)
library(ggrepel)
ggplot(aug_model_B, aes(x = .fitted, y = sqrt(abs(.std.resid)))) +
geom_point() +
geom_point(data = aug_model_B |>
slice_max(sqrt(abs(.std.resid)), n = 3),
col = "red", size = 1) +
geom_text_repel(data = aug_model_B |>
slice_max(sqrt(abs(.std.resid)), n = 3),
aes(label = studyid), col = "red") +
geom_smooth(method = "loess", formula = y ~ x, span = 2, se = F) +
labs(title = "Scale-Location Plot for aug_model_B model",
caption = "3 largest |Standardized Residual| in red.",
x = "Fitted Value of nihss",
y = "Square Root of |Standardized Residual|") +
theme(aspect.ratio = 1)
Discuss
For examples of trouble with the Constant Variance Assumption, see the plots in the notes linked below: note the funnel shape of the two heteroscedastic (non-constant variance) plots, and the upward-sloping loess line in the last one.
More info can be found at:
https://thomaselove.github.io/431-notes/30-regr_diagnostics.html#the-scale-location-plot-to-check-for-non-constant-variance
The first useful graph is a plot of fitted values against residuals. This should look like a random array of dots evenly dispersed around zero. If this graph funnels out, then the chances are that there is heteroscedasticity in the data => a violation of the assumption of homogeneity of variance. If there is any sort of curve in this graph then the chances are that the data have violated the assumption of linearity. If the dots seem to have a pattern and are more spread out at some points on the plot than others => violations of both homogeneity of variance and linearity. Any of these scenarios puts the validity of your model into question. Repeat the above for all partial plots too.
# Checking the Homoscedasticity: Plot a scatterplot of studentized residuals against predicted values
ggplot(df2, aes(fitted, studentized.residuals)) +
geom_point() +
geom_smooth(method = "lm", colour = "Blue") +
labs(x = "Fitted Values", y = "Studentized Residual")
If the plot is straight enough, the data are independent, and the plots don’t thicken, you can now move on to the final assumption, that of Normality.
We assume that the errors around the idealized regression model at any specified values of the x-variables follow a Normal model. We need this assumption so that we can use a Student’s t-model for inference. As with other times when we’ve used Student’s t, we’ll settle for the residuals satisfying the Nearly Normal condition. To assess this, we simply look at a histogram or Normal probability plot of the residuals. Note that the Normality Assumption also becomes less important as the sample size grows.
The plot that is produced by the plot() function is a Q-Q plot, which shows up deviations from normality. The straight line in this plot represents a normal distribution, and the points represent the observed residuals. Therefore, in a perfectly normally distributed data set, all points will lie on the line. If the histogram looks non-normal, then things are less good. Be warned, though: distributions can look very non-normal in small samples even when they are normal!
# Histogram and density plot:
histogram <-ggplot(df2, aes(studentized.residuals)) +
theme(legend.position = "none") +
geom_histogram(aes(y = ..density..), colour = "black", fill = "white") +
labs(x = "Studentized Residual", y = "Density")
histogram + stat_function(fun = dnorm, args = list(mean = mean(df2$studentized.residuals, na.rm = TRUE), sd = sd(df2$studentized.residuals, na.rm = TRUE)), colour = "red", size = 1)
# Create a Q-Q plot: 3 ways to draw
## 1st way
### qplot(sample = df2$studentized.residuals, stat="qq") + labs(x = "Theoretical Values", y = "Observed Values")
## 2nd way
### library(ggpubr)
### ggqqplot(df2$studentized.residuals, ylab = "Observed Values")
## 3rd way
ggplot(data = df2, aes(sample = studentized.residuals)) +
geom_qq() +
geom_qq_line() +
labs(x = "Theoretical Values", y = "Observed Values")
## 4th way
plot(model_B, which=2)
If we find a good model in the previous step, we continue with step 4. If we don't, we repeat step 3 with different predictors until the p-value < 0.05. I will carry out step 4 for the above model just for illustration.
Outliers: Residuals can be obtained with the resid() function, standardized residuals with the rstandard() function and studentized residuals with the rstudent() function.
Influential cases: Cook’s distances can be obtained with the cooks.distance() function, DFBeta with the dfbeta() function, DFFit with the dffits() function, hat values (leverage) with the hatvalues() function, and the covariance ratio with the covratio() function.
In an ordinary sample we would expect 95% of cases to have standardized residuals within about ±2 and approximately 99.74% of values between -3 and +3.
A natural check, therefore, will be to identify the most extreme/unusual standardized residual in our data set, and see whether its value is within the general bounds implied by the Empirical Rule associated with the Normal distribution. If, for instance, we see a standardized residual below -3 or above +3, this is likely to be a very poorly fit point (we would expect to see a point like this about 2.6 times for every 1,000 observations.)
A very poorly fitting point, especially if it also has high influence on the model (a concept we’ll discuss shortly), may be sufficiently different from the rest of the points in the data set to merit its removal before modeling. If we did remove such a point, we’d have to have a good reason for this exclusion, and include that explanation in our report.
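If we ever did decide to exclude such a point, refitting is straightforward. The sketch below drops subject z045 purely for illustration; we have no substantive reason to remove it here:
model_B_no45 <- lm(nihss ~ age + sbp + time.rand,
                   data = df |> filter(studyid != "z045"))  # hypothetical removal
summary(model_B_no45)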
We have a sample of 496, so it is reasonable to expect about 25 cases (5%) to have standardized residuals outside these limits.
df2$large.residual <- df2$standardized.residuals > 2 | df2$standardized.residuals < -2
sum(df2$large.residual)
## [1] 14
Discuss
Only 14 cases have a large residual (which we defined as one bigger than 2 or smaller than -2). It is useful to know not just how many such cases there are, but also which cases they are. We can do this by selecting only those cases for which the variable large.residual is TRUE, using dataFrame[rows, columns]:
df2[df2$large.residual,c("studyid","nihss", "age", "sbp", "time.rand", "standardized.residuals")]
## # A tibble: 14 x 6
## studyid nihss age sbp time.rand standardized.residuals
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 z022 27 67 148 291 2.04
## 2 z045 28 93 162 231 2.26
## 3 z079 28 54 154 112 2.02
## 4 z097 28 72 163 252 2.24
## 5 z129 28 43 109 281 2.14
## 6 z184 28 38 131 216 2.08
## 7 z243 28 54.5 130 187 2.07
## 8 z256 28 62 149 293 2.24
## 9 z281 28 74 159 189 2.16
## 10 z290 28 56 157 237 2.17
## 11 z375 28 42 158 124 2.01
## 12 z388 28 44 158 248 2.17
## 13 z391 28 66 128 159 2.06
## 14 z435 28 56 146 133 2.03
Discuss
In a normally distributed sample, roughly 1% of standardized residuals fall outside ±2.58 and almost none outside ±3.
=> In our model, no case falls outside those limits (values > 2.58, and especially > 3, would merit investigation) => a fairly accurate model.
plot(model_B, which=2)
The Q-Q plot of standardized residuals shows no point that is far away from our expectations under the Normal model. If one point stood out (say, point 45), we could look at it with:
library(kableExtra)
aug_model_B |> slice(45) |> select(studyid:.std.resid)|>
kbl(digits = 2) |> kable_classic_2()
| studyid | nihss | age | sbp | time.rand | sex | ia.occlus | .fitted | .resid | .hat | .sigma | .cooksd | .std.resid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| z045 | 28 | 93 | 162 | 231 | Female | M1 | 17.49 | 10.51 | 0.01 | 4.66 | 0.01 | 2.26 |
Discuss: The .std.resid is not especially high, so we can ignore this point. However, if a standardized residual exceeded 3 or 4 and we wanted to assess it formally, we could use an outlier test:
library(car)
outlierTest(model_B)
## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferroni p
## 45 2.268453 0.023735 NA
Discuss
Here the Bonferroni p-value is NA, i.e. not significant, so there is no evidence of a problem. If instead p were < 0.05, our conclusion from the Bonferroni p-value would be that we could have a serious problem with Normality, in that a studentized residual > 3 is out of the range we might reasonably expect to see given a Normal distribution of errors and observations.
Which value is that? It's the subject in the flagged row. (For example, if row 45:)
aug_model_B |> slice(45) |>
select(studyid:.std.resid) |> kbl(digits = 2) |> kable_classic_2()
| studyid | nihss | age | sbp | time.rand | sex | ia.occlus | .fitted | .resid | .hat | .sigma | .cooksd | .std.resid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| z045 | 28 | 93 | 162 | 231 | Female | M1 | 17.49 | 10.51 | 0.01 | 4.66 | 0.01 | 2.26 |
Hat values (sometimes called leverage) gauge the influence of each observed outcome value over the fitted values. Leverage values lie between 0 (the case has no influence whatsoever) and 1 (the case has complete influence over prediction). The average leverage value is (k + 1)/n, where k is the number of predictors. We should investigate cases with values greater than twice the average (2(k + 1)/n) or three times the average (3(k + 1)/n) as cut-offs for identifying cases with undue influence.
In this model, nihss ~ age + sbp + time.rand: average leverage = (3 + 1)/496 = 0.008
=> 2 times the average = 0.016
=> 3 times the average = 0.024
As shown below, three cases exceed three times the average.
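These cut-offs can be computed directly from the fitted model (a minimal sketch):
k <- length(coef(model_B)) - 1  # 3 predictors
n <- nobs(model_B)              # 496 observations
avg_lev <- (k + 1) / n          # average leverage: about 0.008
c(average = avg_lev, twice = 2 * avg_lev, thrice = 3 * avg_lev)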
Option 1: We can look up the cases with large residuals (> 2) and check their cooks.distance, leverage and covariance.ratios:
df2[df2$large.residual , c("studyid", "cooks.distance", "leverage", "covariance.ratios")]
## # A tibble: 14 x 4
## studyid cooks.distance leverage covariance.ratios
## <fct> <dbl> <dbl> <dbl>
## 1 z022 0.00561 0.00538 0.980
## 2 z045 0.0112 0.00871 0.975
## 3 z079 0.00789 0.00770 0.983
## 4 z097 0.00532 0.00424 0.972
## 5 z129 0.0139 0.0120 0.983
## 6 z184 0.00821 0.00757 0.981
## 7 z243 0.00397 0.00369 0.977
## 8 z256 0.00699 0.00552 0.973
## 9 z281 0.00394 0.00337 0.974
## 10 z290 0.00392 0.00331 0.973
## 11 z375 0.0101 0.00983 0.985
## 12 z388 0.00727 0.00616 0.976
## 13 z391 0.00441 0.00415 0.978
## 14 z435 0.00555 0.00535 0.980
Option 2: Or we can search for the cases with high leverage only:
aug_model_B |> select(studyid:ia.occlus, .hat) |>
filter(.hat > 0.024) |>
arrange(desc(.hat))
## # A tibble: 3 x 8
## studyid nihss age sbp time.rand sex ia.occlus .hat
## <fct> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl>
## 1 z148 23 75 231 120 Male M1 0.0303
## 2 z115 11 89 78 296 Female M1 0.0256
## 3 z372 28 27 94 117 Male M1 0.0241
3 of our 496 subjects meet the standard for large leverage values (.hat values exceeding three times the average .hat value) but none of these are the outlier (id = z045) we identified as having an especially poor fit (large standardized residual.)
aug_model_B |>
filter(studyid == "z045") |>
select(studyid, .std.resid, .hat)
## # A tibble: 1 x 3
## studyid .std.resid .hat
## <fct> <dbl> <dbl>
## 1 z045 2.26 0.00871
So, let’s look at a plot of the two quantitative predictors in our data to see which 3 points show up as those with “high” leverage. The points with high leverage (large values of .hat) are simply those with an unusual combination of the two quantitative predictors.
ggplot(aug_model_B, aes(x = age, y = sbp)) +
geom_point(col = "gray50") +
geom_point(data = aug_model_B |>
slice_max(.hat, n = 3),
col = "red", size = 2) +
geom_text_repel(data = aug_model_B |>
slice_max(.hat, n = 3),
aes(label = studyid), col = "red") +
labs(x = "age", y = "sbp",
title = "Where are the highly leveraged points in our model?",
subtitle = "model nihss, 3 highest leverage points in red")
Another way to understand the impact of leverage is to look at a plot of residuals vs. leverage, which is diagnostic plot 5 for a linear model. Here’s the base R version for our model_B model.
plot(model_B, which=5)
Here is the ggplot2 version.
ggplot(aug_model_B, aes(x = .hat, y = .std.resid)) +
geom_point() +
geom_point(data = aug_model_B |> filter(.cooksd >= 0.5),
col = "red", size = 2) +
geom_text_repel(data = aug_model_B |> filter(.cooksd >= 0.5),
aes(label = studyid), col = "red") +
geom_smooth(method = "loess", formula = y ~ x, span = 2, se = F) +
geom_vline(aes(xintercept = 3*mean(.hat)), lty = "dashed") +
labs(title = "Residuals vs. Leverage",
x = "Leverage", y = "Standardized Residual")
We see that the most highly leveraged points are shown furthest to the right in this plot (in fact, I’ve inserted a dotted vertical line at the cutpoint of 3*mean(.hat)).
Cook’s distance is a measure of the overall influence of a case on the model, values greater than 1 may be cause for concern. Even points with Cook’s d > 0.5 may merit further investigation.
df2[df2$large.residual , c("studyid", "cooks.distance", "leverage", "covariance.ratios")]
## # A tibble: 14 x 4
## studyid cooks.distance leverage covariance.ratios
## <fct> <dbl> <dbl> <dbl>
## 1 z022 0.00561 0.00538 0.980
## 2 z045 0.0112 0.00871 0.975
## 3 z079 0.00789 0.00770 0.983
## 4 z097 0.00532 0.00424 0.972
## 5 z129 0.0139 0.0120 0.983
## 6 z184 0.00821 0.00757 0.981
## 7 z243 0.00397 0.00369 0.977
## 8 z256 0.00699 0.00552 0.973
## 9 z281 0.00394 0.00337 0.974
## 10 z290 0.00392 0.00331 0.973
## 11 z375 0.0101 0.00983 0.985
## 12 z388 0.00727 0.00616 0.976
## 13 z391 0.00441 0.00415 0.978
## 14 z435 0.00555 0.00535 0.980
Discuss: None of them has a Cook's distance greater than 1 => none of the cases exerts undue influence.
plot(model_B, which=5)
No points in this plot indicate substantial influence on our model. To do so, they’d have to be outside the Cook’s distance contour shown at the top right of this plot (there is a similar arc at the bottom right, but it’s too far away from the data to show up in the plot.)
Or we can draw an index plot of Cook's distance:
plot(model_B, which=4)
aug_model_B_ex <- aug_model_B |>
mutate(obsnum = 1:nrow(aug_model_B |> select(.cooksd)))
ggplot(aug_model_B_ex, aes(x = obsnum, y = .cooksd)) +
geom_point() +
geom_segment(aes(x = obsnum, xend = obsnum, y = 0, yend = .cooksd)) +
geom_text_repel(data = aug_model_B_ex |>
slice_max(.cooksd, n = 3),
aes(label = studyid)) +
labs(x = "Row Number",
y = "Cook's Distance",
title = "Cook's distance Index plot for model_B",
subtitle = "Subjects with the 3 largest Cook's d values are identified.")
Remember that we'd need a Cook's distance value of at least 0.50 before worrying in a serious way. Here, the largest we see (about 0.02, for studyid = z372) is still far from that standard.
Let’s look at the six largest Cook’s distance values.
aug_model_B |> select(studyid, .std.resid, .hat, .cooksd) |>
arrange(desc(.cooksd)) |> head() |> kbl(digits = 3) |> kable_classic()
| studyid | .std.resid | .hat | .cooksd |
|---|---|---|---|
| z372 | 1.892 | 0.024 | 0.022 |
| z357 | 1.751 | 0.022 | 0.017 |
| z115 | -1.459 | 0.026 | 0.014 |
| z129 | 2.138 | 0.012 | 0.014 |
| z330 | -1.849 | 0.016 | 0.014 |
| z010 | 1.767 | 0.015 | 0.012 |
If we want to search for a particular id:
aug_model_B |> filter(studyid == "z372") |>
select(studyid:.fitted, .std.resid, .hat, .cooksd) |>
kbl(digits = 2) |> kable_classic()
| studyid | nihss | age | sbp | time.rand | sex | ia.occlus | .fitted | .std.resid | .hat | .cooksd |
|---|---|---|---|---|---|---|---|---|---|---|
| z372 | 28 | 27 | 94 | 117 | Male | M1 | 19.26 | 1.89 | 0.02 | 0.02 |
We consider the following criterion (from Field et al., 2012): a case whose covariance ratio (CVR) is greater than 1 + 3(k + 1)/n or less than 1 - 3(k + 1)/n deserves a closer look. Here those boundaries are 1 ± 12/496, i.e. about 1.024 and 0.976.
Most of our 14 potential outliers have CVR values within or just outside these boundaries. Besides, given the Cook's distances for all cases, there is probably no cause for alarm.
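A small sketch computing those boundaries and flagging the cases that fall outside them:
k <- 3; n <- 496
bounds <- c(lower = 1 - 3 * (k + 1) / n, upper = 1 + 3 * (k + 1) / n)
bounds  # about 0.976 and 1.024
df2[df2$covariance.ratios < bounds["lower"] | df2$covariance.ratios > bounds["upper"],
    c("studyid", "covariance.ratios")]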
For any violation, please read this page: 30.14 Violated Assumptions: Problems with Linearity
https://thomaselove.github.io/431-notes/30-regr_diagnostics.html#the-scale-location-plot-to-check-for-non-constant-variance
Overall, we may need to eliminate an outlying observation or transform the outcome/predictors, depending on the type of violation.
https://www.r-bloggers.com/2022/10/box-cox-transformation-in-r/
So what do serious assumption violations look like, and what can we do about them?
Here is a simulated example that shows a clear problem with non-linearity.
set.seed(4311); x1 <- rnorm(n = 100, mean = 15, sd = 5)
set.seed(4312); x2 <- rnorm(n = 100, mean = 10, sd = 5)
set.seed(4313); e1 <- rnorm(n = 100, mean = 0, sd = 15)
y <- 15 + x1 + x2^2 + e1
viol1 <- data.frame(outcome = y, pred1 = x1, pred2 = x2) |> as_tibble()
model_1 <- lm(outcome ~ pred1 + pred2, data = viol1)
plot(model_1, which = 1)
In light of this, I would look for a potential transformation of the outcome. Does the Box-Cox plot make any useful suggestions?
boxCox(model_1); powerTransform(model_1)
## Estimated transformation parameter
## Y1
## 0.4414775
Note that if any outcome value were negative or zero, we would have to add a constant to every outcome to make them all positive before Box-Cox could run. The estimated power of about 0.44 (close to 0.5) suggests fitting a new model using the square root of the outcome.
model_2 <- lm(sqrt(outcome) ~ pred1 + pred2, data = viol1)
plot(model_2, which = 1)
This is meaningfully better in terms of curve, but now looks a bit fan-shaped, indicating a potential problem with heteroscedasticity. Let’s look at the scale-location plot for this model.
plot(model_2, which = 3)
This definitely looks like there’s a trend down in this plot. So the square root transformation, by itself, probably hasn’t resolved assumptions sufficiently well. We’ll have to be very careful about our interpretation of the model.
set.seed(4314); x1 <- rnorm(n = 100, mean = 15, sd = 5)
set.seed(4315); x2 <- rnorm(n = 100, mean = 10, sd = 5)
set.seed(4316); e2 <- rnorm(n = 100, mean = 3, sd = 5)
y2 <- 50 + x1 + x2 + e2^2
viol2 <- data.frame(outcome = y2, pred1 = x1, pred2 = x2) |> as_tibble()
model.3 <- lm(outcome ~ pred1 + pred2, data = viol2)
plot(model.3, which = 2)
Skewed residuals often show up in strange patterns in the plot of residuals vs. fitted values, too, as in this case.
plot(model.3, which = 1)
Clearly, we have some larger residuals on the positive side, but not on the negative side. Would an outcome transformation be suggested by Box-Cox?
boxCox(model.3); powerTransform(model.3)
## Estimated transformation parameter
## Y1
## -1.438747
The suggested transformation looks like the inverse of our outcome.
model.4 <- lm(1/outcome ~ pred1 + pred2, data = viol2)
plot(model.4, which = 2)
OK. That’s something of an improvement. How about the other residual plots with this transformation?
plot(model.4, which = 1)
The impact of the skew is reduced, at least. I might well be satisfied enough with this, in practice.
References:
https://thomaselove.github.io/432-notes/summarizing-the-smart_cle1-data.html#model-summary-for-a-simple-one-predictor-regression
https://thomaselove.github.io/431-notes/30-regr_diagnostics.html
https://www.statology.org/durbin-watson-test-r/
Field, A., Miles, J. and Field, Z., 2012. Discovering Statistics Using R. Sage Publications.