knitr::opts_chunk$set(
    echo = TRUE,
    message = FALSE,
    warning = FALSE
)

Previous sections:

  1. Clustering practice: https://rpubs.com/minhtri/923711

  2. Table Descriptive Analysis: https://rpubs.com/minhtri/929114

  3. Imputation - Dealing with missing data: https://rpubs.com/minhtri/968586

  4. The R environment - Data Wrangling: https://rpubs.com/minhtri/968607

  5. Correlation analysis: https://rpubs.com/minhtri/968611

  6. Parametric or Nonparametric data: https://rpubs.com/minhtri/968616

  7. Dealing with outliers or non-normal distribution: https://rpubs.com/minhtri/968622

  8. Linear regression - Frequentist approach: https://rpubs.com/minhtri/968628

  9. Linear regression - Bayesian framework: https://rpubs.com/minhtri/968636

  10. Linear regression - Choosing optimal Model: https://rpubs.com/minhtri/968651

 

Note: This analysis is for personal study purposes. In this section, I will summarize some basic information about linear regression, focusing on model training and evaluation, based on several sources.

 

R Packages:

# More packages will be shown during the analysis process
# Load the required library
library(tidyverse)    # Data Wrangling
library(conflicted)   # Dealing with conflict package
library(readxl)       # Read Excel files

 

Dealing with Conflicts
There are a lot of packages here, and sometimes individual functions are in conflict. R’s default conflict resolution system gives precedence to the most recently loaded package. This can make conflicts hard to detect, particularly when they are introduced by an update to an existing package.

  • Using the code below helps the entire section run properly. You may or may not need to look into the conflicted package for your work.
conflict_prefer("filter", "dplyr")
conflict_prefer("select", "dplyr")
conflict_prefer("Predict", "rms")
conflict_prefer("impute_median", "simputation")
conflict_prefer("summarize", "dplyr")

 

1 Simulated fakestroke data

Data used in these notes
fakestroke.csv
(Source: https://github.com/THOMASELOVE/432-data/blob/master/data/fakestroke.csv)

The fakestroke.csv file contains the following 18 variables for 500 patients:

# To make a table in R Markdown: 1st row: header; 2nd row: alignment; the remaining rows: content
## |Variable |  Description |
## |:------- | :----------  |
## |studyid  | Study ID  # (z001 through z500) | 
| Variable  | Description |
|:--------- |:----------- |
| studyid   | Study ID # (z001 through z500) |
| trt       | Treatment group (Intervention or Control) |
| age       | Age in years |
| sex       | Male or Female |
| nihss     | NIH Stroke Scale Score (can range from 0-42; higher scores indicate more severe neurological deficits) |
| location  | Stroke Location - Left or Right Hemisphere |
| hx.isch   | History of Ischemic Stroke (Yes/No) |
| afib      | Atrial Fibrillation (1 = Yes, 0 = No) |
| dm        | Diabetes Mellitus (1 = Yes, 0 = No) |
| mrankin   | Pre-stroke modified Rankin scale score (0, 1, 2, or > 2) indicating functional disability; complete range is 0 (no symptoms) to 6 (death) |
| sbp       | Systolic blood pressure, in mm Hg |
| iv.altep  | Treatment with IV alteplase (Yes/No) |
| time.iv   | Time from stroke onset to start of IV alteplase (minutes), if iv.altep = Yes |
| aspects   | Alberta Stroke Program Early Computed Tomography score, which measures extent of stroke from 0-10; higher scores indicate fewer early ischemic changes |
| ia.occlus | Intracranial arterial occlusion, based on vessel imaging (five categories) |
| extra.ica | Extracranial ICA occlusion (1 = Yes, 0 = No) |
| time.rand | Time from stroke onset to study randomization, in minutes |
| time.punc | Time from stroke onset to groin puncture, in minutes (Intervention group only) |

 

2 Data preprocessing step

2.1 Step 1: Data wrangling

Please refer to this section: The R environment - Data Wrangling: https://rpubs.com/minhtri/933332

  1. Load the data into R and examine some basic information about the variables. When loading the data, I also define the type of each variable (numeric or text):
df <- read_excel("D:/Statistics/R/R data/fakestroke.xlsx", 
                 col_types = c("text", "text", "numeric",
                               "text", "numeric", "text", "text", 
                               "numeric", "numeric", "text", "numeric", 
                               "text", "numeric", "numeric", "text", 
                               "numeric", "numeric", "numeric")) 
names(df)
##  [1] "studyid"   "trt"       "age"       "sex"       "nihss"     "location" 
##  [7] "hx.isch"   "afib"      "dm"        "mrankin"   "sbp"       "iv.altep" 
## [13] "time.iv"   "aspects"   "ia.occlus" "extra.ica" "time.rand" "time.punc"
str(df)
## tibble [500 x 18] (S3: tbl_df/tbl/data.frame)
##  $ studyid  : chr [1:500] "z001" "z002" "z003" "z004" ...
##  $ trt      : chr [1:500] "Control" "Intervention" "Control" "Control" ...
##  $ age      : num [1:500] 53 51 68 28 91 34 75 89 75 26 ...
##  $ sex      : chr [1:500] "Male" "Male" "Female" "Male" ...
##  $ nihss    : num [1:500] 21 23 11 22 24 18 25 18 25 27 ...
##  $ location : chr [1:500] "Right" "Left" "Right" "Left" ...
##  $ hx.isch  : chr [1:500] "No" "No" "No" "No" ...
##  $ afib     : num [1:500] 0 1 0 0 0 0 0 0 1 0 ...
##  $ dm       : num [1:500] 0 0 0 0 0 0 0 0 0 0 ...
##  $ mrankin  : chr [1:500] "2" "0" "0" "0" ...
##  $ sbp      : num [1:500] 127 137 138 122 162 166 140 157 129 143 ...
##  $ iv.altep : chr [1:500] "Yes" "Yes" "No" "Yes" ...
##  $ time.iv  : num [1:500] 63 68 NA 78 121 78 97 NA 49 99 ...
##  $ aspects  : num [1:500] 10 10 10 10 8 5 10 9 6 10 ...
##  $ ia.occlus: chr [1:500] "M1" "M1" "ICA with M1" "ICA with M1" ...
##  $ extra.ica: num [1:500] 0 1 1 1 0 0 0 0 0 0 ...
##  $ time.rand: num [1:500] 139 118 178 160 214 154 122 147 271 141 ...
##  $ time.punc: num [1:500] NA 281 NA NA NA NA 268 NA NA 259 ...
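
If you would rather read the csv file directly from the source repository instead of a local Excel copy, a minimal sketch (the raw-file URL is my assumption, derived from the GitHub link above):

# Read the csv straight from GitHub; read_csv() guesses the column types,
# so check the result with str() as above
library(readr)
df <- read_csv("https://raw.githubusercontent.com/THOMASELOVE/432-data/master/data/fakestroke.csv")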

 

  2. Convert all character variables to factors:
df <- df %>% mutate_if(is.character, as.factor)
str(df) 
## tibble [500 x 18] (S3: tbl_df/tbl/data.frame)
##  $ studyid  : Factor w/ 500 levels "z001","z002",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ trt      : Factor w/ 2 levels "Control","Intervention": 1 2 1 1 1 1 2 1 1 2 ...
##  $ age      : num [1:500] 53 51 68 28 91 34 75 89 75 26 ...
##  $ sex      : Factor w/ 2 levels "Female","Male": 2 2 1 2 2 1 2 1 2 1 ...
##  $ nihss    : num [1:500] 21 23 11 22 24 18 25 18 25 27 ...
##  $ location : Factor w/ 2 levels "Left","Right": 2 1 2 1 2 1 2 2 1 2 ...
##  $ hx.isch  : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ afib     : num [1:500] 0 1 0 0 0 0 0 0 1 0 ...
##  $ dm       : num [1:500] 0 0 0 0 0 0 0 0 0 0 ...
##  $ mrankin  : Factor w/ 4 levels "> 2","0","1",..: 4 2 2 2 2 4 2 2 4 2 ...
##  $ sbp      : num [1:500] 127 137 138 122 162 166 140 157 129 143 ...
##  $ iv.altep : Factor w/ 2 levels "No","Yes": 2 2 1 2 2 2 2 1 2 2 ...
##  $ time.iv  : num [1:500] 63 68 NA 78 121 78 97 NA 49 99 ...
##  $ aspects  : num [1:500] 10 10 10 10 8 5 10 9 6 10 ...
##  $ ia.occlus: Factor w/ 6 levels "A1 or A2","ICA with M1",..: 4 4 2 2 4 1 4 4 4 4 ...
##  $ extra.ica: num [1:500] 0 1 1 1 0 0 0 0 0 0 ...
##  $ time.rand: num [1:500] 139 118 178 160 214 154 122 147 271 141 ...
##  $ time.punc: num [1:500] NA 281 NA NA NA NA 268 NA NA 259 ...
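
A side note: mutate_if() still works, but recent dplyr releases supersede it with across(); an equivalent call would be:

# Same conversion written with the newer across() idiom
df <- df %>% mutate(across(where(is.character), as.factor))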

 

  3. afib, dm, and extra.ica are categorical variables, but in the Excel file they are coded as 1 and 0. Therefore, I convert them to Yes/No factors:
# Changing coding variable to categorical:
df$afib       <- factor(df$afib , levels=0:1, labels=c("No", "Yes"))
df$dm         <- factor(df$dm , levels=0:1, labels=c("No", "Yes"))
df$extra.ica  <- factor(df$extra.ica, levels=0:1, labels=c("No", "Yes"))
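
The three assignments above can also be collapsed into a single across() call; a sketch that performs the same recoding:

# Recode all three 0/1 variables to No/Yes factors at once
df <- df %>%
  mutate(across(c(afib, dm, extra.ica),
                ~ factor(.x, levels = 0:1, labels = c("No", "Yes"))))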

 

  4. Reorder the levels of some factor variables. After this, we basically have tidy data that are nearly ready for missingness checking:
  • trt: Control (baseline), Intervention
  • mrankin: 0, 1, 2, > 2.
  • ia.occlus: rearrange the ia.occlus levels into the order presented in Berkhemer et al. (2015).
# Relevel factors:
df$trt = factor(df$trt, levels=c("Control", "Intervention"))
df$mrankin = factor(df$mrankin, levels=c("0", "1", "2", "> 2"))
df$ia.occlus = factor(df$ia.occlus, levels=c("Intracranial ICA", "ICA with M1", 
                                             "M1", "M2", "A1 or A2"))

# Check order of factor after relevel: 
str(df)
## tibble [500 x 18] (S3: tbl_df/tbl/data.frame)
##  $ studyid  : Factor w/ 500 levels "z001","z002",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ trt      : Factor w/ 2 levels "Control","Intervention": 1 2 1 1 1 1 2 1 1 2 ...
##  $ age      : num [1:500] 53 51 68 28 91 34 75 89 75 26 ...
##  $ sex      : Factor w/ 2 levels "Female","Male": 2 2 1 2 2 1 2 1 2 1 ...
##  $ nihss    : num [1:500] 21 23 11 22 24 18 25 18 25 27 ...
##  $ location : Factor w/ 2 levels "Left","Right": 2 1 2 1 2 1 2 2 1 2 ...
##  $ hx.isch  : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ afib     : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 2 1 ...
##  $ dm       : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mrankin  : Factor w/ 4 levels "0","1","2","> 2": 3 1 1 1 1 3 1 1 3 1 ...
##  $ sbp      : num [1:500] 127 137 138 122 162 166 140 157 129 143 ...
##  $ iv.altep : Factor w/ 2 levels "No","Yes": 2 2 1 2 2 2 2 1 2 2 ...
##  $ time.iv  : num [1:500] 63 68 NA 78 121 78 97 NA 49 99 ...
##  $ aspects  : num [1:500] 10 10 10 10 8 5 10 9 6 10 ...
##  $ ia.occlus: Factor w/ 5 levels "Intracranial ICA",..: 3 3 2 2 3 5 3 3 3 3 ...
##  $ extra.ica: Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 1 1 1 ...
##  $ time.rand: num [1:500] 139 118 178 160 214 154 122 147 271 141 ...
##  $ time.punc: num [1:500] NA 281 NA NA NA NA 268 NA NA 259 ...
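
The same releveling can also be written with forcats (loaded as part of the tidyverse); a minimal sketch for two of the variables:

# fct_relevel() moves the named levels to the front, in the order given
df <- df %>%
  mutate(trt     = fct_relevel(trt, "Control", "Intervention"),
         mrankin = fct_relevel(mrankin, "0", "1", "2", "> 2"))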

 

2.2 Step 2: Missingness checking and imputation when necessary

Package: VIM

Command: aggr(x, plot = T, prop = T, numbers = F, combined = F, sortVars = T, cex.axis = .7, gap = 3)

  • plot: whether the results should be plotted.
  • numbers: whether the proportion or frequencies of the different combinations should be represented by numbers.
  • prop: whether the proportion of missing/imputed values should be used rather than the total amount.
  • combined: a logical indicating whether the two plots should be combined.
  • sortVars: whether the variables should be sorted by the number of missing/imputed values.

For more information, please refer to the following link:
Dealing with missingness - Imputation: https://rpubs.com/minhtri/932007

 

Checking the percentage of missingness:

library(VIM)
aggr(df, plot = F, numbers = T, prop = T)
## 
##  Missings in variables:
##   Variable Count
##        sbp     1
##    time.iv    55
##    aspects     4
##  ia.occlus     1
##  extra.ica     1
##  time.rand     2
##  time.punc   267
aggr(df, plot = T, numbers = T, prop = T, combined= F, sortVars = T, cex.axis=.7, gap=3)

## 
##  Variables sorted by number of missings: 
##   Variable Count
##  time.punc 0.534
##    time.iv 0.110
##    aspects 0.008
##  time.rand 0.004
##        sbp 0.002
##  ia.occlus 0.002
##  extra.ica 0.002
##    studyid 0.000
##        trt 0.000
##        age 0.000
##        sex 0.000
##      nihss 0.000
##   location 0.000
##    hx.isch 0.000
##       afib 0.000
##         dm 0.000
##    mrankin 0.000
##   iv.altep 0.000
aggr(df, plot = T, numbers = T, prop = F, combined= F, sortVars = T, cex.axis=.7, gap=3)

## 
##  Variables sorted by number of missings: 
##   Variable Count
##  time.punc   267
##    time.iv    55
##    aspects     4
##  time.rand     2
##        sbp     1
##  ia.occlus     1
##  extra.ica     1
##    studyid     0
##        trt     0
##        age     0
##        sex     0
##      nihss     0
##   location     0
##    hx.isch     0
##       afib     0
##         dm     0
##    mrankin     0
##   iv.altep     0

 

Discuss

  • There are 7 variables with missingness.
  • However, the missingness in time.punc and time.iv is due to the experimental design (time.punc applies only to the Intervention group, and time.iv applies only to patients with iv.altep = Yes). Therefore, I will exclude them from the missingness analysis.
  • Only 5 variables are left, and the proportion of missingness in each is well below 5% (at most 0.8%).
    => Complete case analysis can be applied.
    => We can ignore the missingness and proceed to the next step - multiple regression.
df1 = df %>% select(-time.punc, -time.iv, -studyid) %>% na.omit()
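
As a quick sanity check, df1 should now contain 492 complete cases (500 rows minus 8 incomplete ones) and 15 variables (18 minus the 3 dropped), matching the training/testing dimensions reported in the next section:

# Confirm the dimensions of the complete-case data
dim(df1)   # expect 492 rows and 15 columns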

 

3 Model training and evaluation

3.1 Introduction

Supervised learning tasks come in two types: classification and regression. The core idea stays the same: predicting an outcome; the only difference is the type of outcome. In classification problems, a class is predicted: the outcome can only take a value from a fixed set of distinct possibilities. In a regression problem, the outcome is a continuous numerical variable and can take any value. Here the outcome, nihss, is continuous, so we are in the regression setting.

 

We’ll start with the train/test split. We want to split our dataset into two parts:

  • the training set: this part of the data will be used to train the model.
  • the testing set: this set is left out during the training of the model, and is used for testing the performance: the model predicts the outcome of the data in the testing set, and this outcome is compared with the real outcome that we know from the data.

Note: It is important that the training and testing sets do not overlap. The model learns from the training data; if the same data were used for testing, the performance estimate would be biased, because the model would be evaluated on the very data it learned from. (We will verify the split below.)

 

Each time we use a supervised machine learning algorithm in these notes, the following steps will be observed:

  1. Split the data into training and testing set.

  2. Build the model using the training set and its true labels.

  3. Make prediction on the testing set.

  4. Compare predictions to true labels of the testing set and get performance.

Sometimes, when a model has to be specifically adapted to the task, the steps might get a bit more complex (the separation of the training and testing datasets might differ for example), but the base idea stays the same.

 

3.2 Practice

Before we do anything, let’s set a random seed with set.seed(). The train/test split is a random process, and fixing the seed ensures the randomization works the same way on your computer and mine:

set.seed(123)

Let’s perform the split now. 70% of the data is used for training, and the remaining 30% is used for testing.

library(caret) ; library(rsample) 
index = createDataPartition(df1$nihss, p = 0.7, list = FALSE)
training = df1[index, ]
testing = df1[-index, ]
dim(training)
## [1] 346  15
dim(testing)
## [1] 146  15

After the split, we have 492 rows in total, of which 346 were allocated to model training and the remaining 146 are used to test the model.
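
As noted above, the two sets must not overlap; a quick sanity check that the split partitions df1 (TRUE by construction of index):

# training and testing together should account for every row of df1
nrow(training) + nrow(testing) == nrow(df1)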

 

We can now proceed with model training. Based on the model selection results in the previous part (https://rpubs.com/minhtri/968651), I will use only 2 predictors: dm and ia.occlus. R uses the following syntax for linear regression models:

fit = lm(nihss ~ dm + ia.occlus, data = training)
summary(fit)
## 
## Call:
## lm(formula = nihss ~ dm + ia.occlus, data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.5494 -3.5494 -0.5494  3.4506  9.4506 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           15.3333     2.6473   5.792 1.58e-08 ***
## dmYes                 -2.3694     0.7416  -3.195  0.00153 ** 
## ia.occlusICA with M1   2.1704     2.6931   0.806  0.42087    
## ia.occlusM1            3.2161     2.6666   1.206  0.22863    
## ia.occlusM2            3.0462     2.8094   1.084  0.27901    
## ia.occlusA1 or A2      4.1667     4.1858   0.995  0.32023    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.585 on 340 degrees of freedom
## Multiple R-squared:  0.0401, Adjusted R-squared:  0.02598 
## F-statistic: 2.841 on 5 and 340 DF,  p-value: 0.01576
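
Before looking at predictions, it is worth a quick glance at the standard residual diagnostics for the fitted model; a minimal sketch using base R:

# Four standard diagnostic plots: residuals vs fitted, normal Q-Q,
# scale-location, and residuals vs leverage
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))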

Then, we can finally make predictions!

testing$predicted = predict(fit, testing)
testing$error = testing$predicted - testing$nihss

# Find RMSE on the testing set
sqrt(mean(testing$error^2))
## [1] 4.69602
# Find the coefficient of determination
cor(testing$predicted, testing$nihss)^2
## [1] 0.03540727
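
caret bundles these metrics as well: postResample() returns RMSE, R-squared, and MAE in one call, and its values should match the manual calculations above:

# One-call alternative to the manual RMSE / R-squared computation
postResample(pred = testing$predicted, obs = testing$nihss)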

 

Discuss: For a good model, RMSE should be low and R-squared should be high. In practice, we usually rely on a K-fold cross-validation approach, shown next. As expected for this simulated dataset, neither metric looks good here.

 

3.3 ML K-fold cross-validation approach

library(caret)
fit = train(form = nihss ~ dm + ia.occlus, data = training, method = "lm",
            trControl = trainControl(method = "cv", number = 10))
fit
## Linear Regression 
## 
## 346 samples
##   2 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 311, 311, 310, 312, 312, 312, ... 
## Resampling results:
## 
##   RMSE      Rsquared    MAE     
##   4.604462  0.04018193  3.927088
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
testing$predicted = predict(fit, testing)
testing$error = testing$predicted - testing$nihss

# Find RMSE on the testing set
sqrt(mean(testing$error^2))
## [1] 4.69602
# Find the coefficient of determination
cor(testing$predicted, testing$nihss)^2
## [1] 0.03540727
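
Note that the testing-set RMSE and R-squared are identical to those from the simple train/test split above. This is expected: a plain linear model has no tuning parameters, so cross-validation here only changes how the resampled performance (the table above) is estimated; caret still refits the final model on the entire training set, producing the same predictions on the testing set.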

 

3.4 ML Bootstrap validation approach (preferable?)

library(caret)
fit = train(form = nihss ~ dm + ia.occlus, data = training, method = "lm",
            trControl = trainControl(method = "boot", number = 100))
fit
## Linear Regression 
## 
## 346 samples
##   2 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (100 reps) 
## Summary of sample sizes: 346, 346, 346, 346, 346, 346, ... 
## Resampling results:
## 
##   RMSE      Rsquared    MAE     
##   4.616364  0.02373209  3.920081
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
testing$predicted = predict(fit, testing)
testing$error = testing$predicted - testing$nihss

# Find RMSE on the testing set
sqrt(mean(testing$error^2))
## [1] 4.69602
# Find the coefficient of determination
cor(testing$predicted, testing$nihss)^2
## [1] 0.03540727
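
The same logic applies here: the bootstrap only changes the resampled performance estimate, not the final model, so the testing-set metrics are again unchanged. To put the two resampling estimates side by side, one option (a sketch; fit_cv and fit_boot are my hypothetical names, since the chunks above both reuse fit) is:

# Refit under distinct names, then compare the resampled estimates
fit_cv   <- train(nihss ~ dm + ia.occlus, data = training, method = "lm",
                  trControl = trainControl(method = "cv", number = 10))
fit_boot <- train(nihss ~ dm + ia.occlus, data = training, method = "lm",
                  trControl = trainControl(method = "boot", number = 100))
rbind(CV        = fit_cv$results[, c("RMSE", "Rsquared", "MAE")],
      Bootstrap = fit_boot$results[, c("RMSE", "Rsquared", "MAE")])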