knitr::opts_chunk$set(
echo = TRUE,
message = FALSE,
warning = FALSE
)
Previous sections:
Clustering practice: https://rpubs.com/minhtri/923711
Table Descriptive Analysis: https://rpubs.com/minhtri/929114
Imputation - Dealing with missing data: https://rpubs.com/minhtri/968586
The R environment - Data Wrangling: https://rpubs.com/minhtri/968607
Correlation analysis: https://rpubs.com/minhtri/968611
Parametric or Nonparametric data: https://rpubs.com/minhtri/968616
Dealing with outliers or non-normal distribution: https://rpubs.com/minhtri/968622
Linear regression - Frequentist approach: https://rpubs.com/minhtri/968628
Linear regression - Bayesian framework: https://rpubs.com/minhtri/968636
Linear regression - Choosing optimal Model: https://rpubs.com/minhtri/968651
Note: This analysis is used for personal study purpose. In this section, I will summerize some basic infomation about linear regression - model selection based on several sources.
R Packages:
# More packages will be shown during the analysis process
# Load the required library
library(tidyverse) # Data Wrangling
library(conflicted) # Dealing with conflict package
library(readxl) # Read csv file
Dealing with Conflicts
There is a lot of packages here, and sometimes individual functions are
in conflict. R’s default conflict resolution system gives precedence to
the most recently loaded package. This can make it hard to detect
conflicts, particularly when introduced by an update to an existing
package.
conflict_prefer("filter", "dplyr")
conflict_prefer("select", "dplyr")
conflict_prefer("Predict", "rms")
conflict_prefer("impute_median", "simputation")
conflict_prefer("summarize", "dplyr")
Data used in this notes
fakestroke.csv
(Source: https://github.com/THOMASELOVE/432-data/blob/master/data/fakestroke.csv)
The fakestroke.csv file contains the following 18 variables for 500 patients:
# To make a table in R markdown: 1st row: header, 2nd row: Alignment; the remaining row: for content
## |Variable | Description |
## |:------- | :---------- |
## |studyid | Study ID # (z001 through z500) |
| Variable | Description |
|---|---|
| studyid | Study ID # (z001 through z500) |
| trt | Treatment group (Intervention or Control) |
| age | Age in years |
| sex | Male or Female |
| nihss | NIH Stroke Scale Score (can range from 0-42; higher scores indicate more severe neurological deficits) |
| location | Stroke Location - Left or Right Hemisphere |
| hx.isch | History of Ischemic Stroke (Yes/No) |
| afib | Atrial Fibrillation (1 = Yes, 0 = No) |
| dm | Diabetes Mellitus (1 = Yes, 0 = No) |
| mrankin | Pre-stroke modified Rankin scale score (0, 1, 2 or > 2) indicating functional disability - complete range is 0 (no symptoms) to 6 (death) |
| sbp | Systolic blood pressure, in mm Hg |
| iv.altep | Treatment with IV alteplase (Yes/No) |
| time.iv | Time from stroke onset to start of IV alteplase (minutes) if iv.altep=Yes |
| aspects | Alberta Stroke Program Early Computed Tomography score, which measures extent of stroke from 0 - 10; higher scores indicate fewer early ischemic changes |
| ia.occlus | Intracranial arterial occlusion, based on vessel imaging - five categories3 |
| extra.ica | Extracranial ICA occlusion (1 = Yes, 0 = No) |
| time.rand | Time from stroke onset to study randomization, in minutes |
| time.punc | Time from stroke onset to groin puncture, in minutes (only if Intervention) |
Please refer to this section: The R environment - Data Wrangling: https://rpubs.com/minhtri/933332
df <- read_excel("D:/Statistics/R/R data/fakestroke.xlsx",
col_types = c("text", "text", "numeric",
"text", "numeric", "text", "text",
"numeric", "numeric", "text", "numeric",
"text", "numeric", "numeric", "text",
"numeric", "numeric", "numeric"))
names(df)
## [1] "studyid" "trt" "age" "sex" "nihss" "location"
## [7] "hx.isch" "afib" "dm" "mrankin" "sbp" "iv.altep"
## [13] "time.iv" "aspects" "ia.occlus" "extra.ica" "time.rand" "time.punc"
str(df)
## tibble [500 x 18] (S3: tbl_df/tbl/data.frame)
## $ studyid : chr [1:500] "z001" "z002" "z003" "z004" ...
## $ trt : chr [1:500] "Control" "Intervention" "Control" "Control" ...
## $ age : num [1:500] 53 51 68 28 91 34 75 89 75 26 ...
## $ sex : chr [1:500] "Male" "Male" "Female" "Male" ...
## $ nihss : num [1:500] 21 23 11 22 24 18 25 18 25 27 ...
## $ location : chr [1:500] "Right" "Left" "Right" "Left" ...
## $ hx.isch : chr [1:500] "No" "No" "No" "No" ...
## $ afib : num [1:500] 0 1 0 0 0 0 0 0 1 0 ...
## $ dm : num [1:500] 0 0 0 0 0 0 0 0 0 0 ...
## $ mrankin : chr [1:500] "2" "0" "0" "0" ...
## $ sbp : num [1:500] 127 137 138 122 162 166 140 157 129 143 ...
## $ iv.altep : chr [1:500] "Yes" "Yes" "No" "Yes" ...
## $ time.iv : num [1:500] 63 68 NA 78 121 78 97 NA 49 99 ...
## $ aspects : num [1:500] 10 10 10 10 8 5 10 9 6 10 ...
## $ ia.occlus: chr [1:500] "M1" "M1" "ICA with M1" "ICA with M1" ...
## $ extra.ica: num [1:500] 0 1 1 1 0 0 0 0 0 0 ...
## $ time.rand: num [1:500] 139 118 178 160 214 154 122 147 271 141 ...
## $ time.punc: num [1:500] NA 281 NA NA NA NA 268 NA NA 259 ...
df <- df %>% mutate_if(is.character, as.factor)
str(df)
## tibble [500 x 18] (S3: tbl_df/tbl/data.frame)
## $ studyid : Factor w/ 500 levels "z001","z002",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ trt : Factor w/ 2 levels "Control","Intervention": 1 2 1 1 1 1 2 1 1 2 ...
## $ age : num [1:500] 53 51 68 28 91 34 75 89 75 26 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 1 2 2 1 2 1 2 1 ...
## $ nihss : num [1:500] 21 23 11 22 24 18 25 18 25 27 ...
## $ location : Factor w/ 2 levels "Left","Right": 2 1 2 1 2 1 2 2 1 2 ...
## $ hx.isch : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ afib : num [1:500] 0 1 0 0 0 0 0 0 1 0 ...
## $ dm : num [1:500] 0 0 0 0 0 0 0 0 0 0 ...
## $ mrankin : Factor w/ 4 levels "> 2","0","1",..: 4 2 2 2 2 4 2 2 4 2 ...
## $ sbp : num [1:500] 127 137 138 122 162 166 140 157 129 143 ...
## $ iv.altep : Factor w/ 2 levels "No","Yes": 2 2 1 2 2 2 2 1 2 2 ...
## $ time.iv : num [1:500] 63 68 NA 78 121 78 97 NA 49 99 ...
## $ aspects : num [1:500] 10 10 10 10 8 5 10 9 6 10 ...
## $ ia.occlus: Factor w/ 6 levels "A1 or A2","ICA with M1",..: 4 4 2 2 4 1 4 4 4 4 ...
## $ extra.ica: num [1:500] 0 1 1 1 0 0 0 0 0 0 ...
## $ time.rand: num [1:500] 139 118 178 160 214 154 122 147 271 141 ...
## $ time.punc: num [1:500] NA 281 NA NA NA NA 268 NA NA 259 ...
# Changing coding variable to categorical:
df$afib <- factor(df$afib , levels=0:1, labels=c("No", "Yes"))
df$dm <- factor(df$dm , levels=0:1, labels=c("No", "Yes"))
df$extra.ica <- factor(df$extra.ica, levels=0:1, labels=c("No", "Yes"))
# Relevel factors:
df$trt = factor(df$trt, levels=c("Control", "Intervention"))
df$mrankin = factor(df$mrankin, levels=c("0", "1", "2", "> 2"))
df$ia.occlus = factor(df$ia.occlus, levels=c("Intracranial ICA", "ICA with M1",
"M1", "M2", "A1 or A2"))
# Check order of factor after relevel:
str(df)
## tibble [500 x 18] (S3: tbl_df/tbl/data.frame)
## $ studyid : Factor w/ 500 levels "z001","z002",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ trt : Factor w/ 2 levels "Control","Intervention": 1 2 1 1 1 1 2 1 1 2 ...
## $ age : num [1:500] 53 51 68 28 91 34 75 89 75 26 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 1 2 2 1 2 1 2 1 ...
## $ nihss : num [1:500] 21 23 11 22 24 18 25 18 25 27 ...
## $ location : Factor w/ 2 levels "Left","Right": 2 1 2 1 2 1 2 2 1 2 ...
## $ hx.isch : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ afib : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 2 1 ...
## $ dm : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ mrankin : Factor w/ 4 levels "0","1","2","> 2": 3 1 1 1 1 3 1 1 3 1 ...
## $ sbp : num [1:500] 127 137 138 122 162 166 140 157 129 143 ...
## $ iv.altep : Factor w/ 2 levels "No","Yes": 2 2 1 2 2 2 2 1 2 2 ...
## $ time.iv : num [1:500] 63 68 NA 78 121 78 97 NA 49 99 ...
## $ aspects : num [1:500] 10 10 10 10 8 5 10 9 6 10 ...
## $ ia.occlus: Factor w/ 5 levels "Intracranial ICA",..: 3 3 2 2 3 5 3 3 3 3 ...
## $ extra.ica: Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 1 1 1 ...
## $ time.rand: num [1:500] 139 118 178 160 214 154 122 147 271 141 ...
## $ time.punc: num [1:500] NA 281 NA NA NA NA 268 NA NA 259 ...
Package: VIM
Command: aggr (x, plot = T, prop = T, numbers = F, combined = F, sortVars = T, cex.axis=.7, gap=3)
For more infomation, please refer to the following link:
Dealing with missingness - Imputation: https://rpubs.com/minhtri/932007
Checking the percentage of missingness:
library(VIM)
aggr(df, plot = F, numbers = T, prop = T)
##
## Missings in variables:
## Variable Count
## sbp 1
## time.iv 55
## aspects 4
## ia.occlus 1
## extra.ica 1
## time.rand 2
## time.punc 267
aggr(df, plot = T, numbers = T, prop = T, combined= F, sortVars = T, cex.axis=.7, gap=3)
##
## Variables sorted by number of missings:
## Variable Count
## time.punc 0.534
## time.iv 0.110
## aspects 0.008
## time.rand 0.004
## sbp 0.002
## ia.occlus 0.002
## extra.ica 0.002
## studyid 0.000
## trt 0.000
## age 0.000
## sex 0.000
## nihss 0.000
## location 0.000
## hx.isch 0.000
## afib 0.000
## dm 0.000
## mrankin 0.000
## iv.altep 0.000
aggr(df, plot = T, numbers = T, prop = F, combined= F, sortVars = T, cex.axis=.7, gap=3)
##
## Variables sorted by number of missings:
## Variable Count
## time.punc 267
## time.iv 55
## aspects 4
## time.rand 2
## sbp 1
## ia.occlus 1
## extra.ica 1
## studyid 0
## trt 0
## age 0
## sex 0
## nihss 0
## location 0
## hx.isch 0
## afib 0
## dm 0
## mrankin 0
## iv.altep 0
Discuss
df1 = df %>% select (-time.punc, - time.iv, -studyid) %>% na.omit()
Supervised learning tasks can be of two types: classification and regression. The ground idea stays the same: prediction of an outcome, the only difference is the type of outcome. In classification problems, a class is predicted: the outcome can only take a value from a fixed set of distinct possibilities. In a regression problem, the outcome is a continuous numerical variable and can take any value.
We’ll start with the train/test split. We want to split our dataset into two parts:
Note: It is important that the training and testing sets do not overlap: as the model learns from the training data, if the same data were used for testing, it would lead to a biased performance result (the model is tested on the same data that were used for it to learn).
Each time we will use a supervised machine learning algorithm in this course, the following steps will be observed:
Split the data into training and testing set.
Build the model using the training set and its true labels.
Make prediction on the testing set.
Compare predictions to true labels of the testing set and get performance.
Sometimes, when a model has to be specifically adapted to the task, the steps might get a bit more complex (the separation of the training and testing datasets might differ for example), but the base idea stays the same.
Before we do anything, let’s set a random seed. Train/test split is a random process, and seed ensures the randomization works the same on yours and my computer set.seed():
set.seed(123)
Let’s perform the split now. 70% of the data is used for training, and the remaining 30% is used for testing.
library(caret) ; library(rsample)
index = createDataPartition (df1$nihss, p = 0.7, list = FALSE)
training = df1[index, ]
testing = df1[-index, ]
dim(training)
## [1] 346 15
dim(testing)
## [1] 146 15
After the split, you will see we have 492 rows in total, of which 346 were allocated for model training, and the remaining 146 are used to test the model.
We can now proceed with the model training. Based on the result of model selection in previous part (https://rpubs.com/minhtri/968651), I will choose only 2 predictors dm + ia.occlus. R uses the following syntax for linear regression models:
fit = lm(nihss ~ dm + ia.occlus, data = training)
summary(fit)
##
## Call:
## lm(formula = nihss ~ dm + ia.occlus, data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.5494 -3.5494 -0.5494 3.4506 9.4506
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.3333 2.6473 5.792 1.58e-08 ***
## dmYes -2.3694 0.7416 -3.195 0.00153 **
## ia.occlusICA with M1 2.1704 2.6931 0.806 0.42087
## ia.occlusM1 3.2161 2.6666 1.206 0.22863
## ia.occlusM2 3.0462 2.8094 1.084 0.27901
## ia.occlusA1 or A2 4.1667 4.1858 0.995 0.32023
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.585 on 340 degrees of freedom
## Multiple R-squared: 0.0401, Adjusted R-squared: 0.02598
## F-statistic: 2.841 on 5 and 340 DF, p-value: 0.01576
Then, we can finally make predictions!
testing$predicted = predict(fit, testing)
testing$error = testing$predicted - testing$nihss
# Find RMSE of testing model
sqrt(mean (testing$error^2))
## [1] 4.69602
# Find the coefficient of determination
cor( testing$predicted, testing$nihss)^2
## [1] 0.03540727
Discuss: RMSE should be low and R-squared is high for a good model. In reality, we use K-fold approach. However, everything is not quite right for this fake data.
library(caret)
fit = train(form = nihss ~ dm + ia.occlus, data = training, method = "lm", trControl = trainControl(method = "cv", number = 10))
fit
## Linear Regression
##
## 346 samples
## 2 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 311, 311, 310, 312, 312, 312, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 4.604462 0.04018193 3.927088
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
testing$predicted = predict(fit, testing)
testing$error = testing$predicted - testing$nihss
# Find RMSE of testing model
sqrt(mean (testing$error^2))
## [1] 4.69602
# Find the coefficient of determination
cor( testing$predicted, testing$nihss)^2
## [1] 0.03540727
library(caret)
fit = train(form = nihss ~ dm + ia.occlus, data = training, method = "lm", trControl = trainControl(method = "boot", number = 100))
fit
## Linear Regression
##
## 346 samples
## 2 predictor
##
## No pre-processing
## Resampling: Bootstrapped (100 reps)
## Summary of sample sizes: 346, 346, 346, 346, 346, 346, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 4.616364 0.02373209 3.920081
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
testing$predicted = predict(fit, testing)
testing$error = testing$predicted - testing$nihss
# Find RMSE of testing model
sqrt(mean (testing$error^2))
## [1] 4.69602
# Find the coefficient of determination
cor( testing$predicted, testing$nihss)^2
## [1] 0.03540727