knitr::opts_chunk$set(
    echo = TRUE,
    message = FALSE,
    warning = FALSE
)

Previous sections:

  1. Clustering practice: https://rpubs.com/minhtri/923711
  2. Table Descriptive Analysis: https://rpubs.com/minhtri/929114
  3. Imputation - Dealing with missing data: https://rpubs.com/minhtri/968586
  4. The R environment - Data Wrangling: https://rpubs.com/minhtri/968607

 

Note: This analysis is for my own study purposes. In this section, I will summarize some basic commands for correlation analysis in the R environment, based on several online sources.

 

R Packages:

# More packages will be shown during the analysis process
# Load the required library
library(tidyverse)    # Data Wrangling
library(conflicted)   # Dealing with conflict package
library(readxl)       # Read Excel files

 

Dealing with Conflicts
There are a lot of packages here, and sometimes individual functions are in conflict. R’s default conflict resolution system gives precedence to the most recently loaded package. This can make it hard to detect conflicts, particularly when they are introduced by an update to an existing package.

  • Using the code below helps the entire section run properly. You may or may not need to look into the conflicted package for your work.
conflict_prefer("filter", "dplyr")
conflict_prefer("select", "dplyr")
conflict_prefer("Predict", "rms")
conflict_prefer("impute_median", "simputation")
conflict_prefer("summarize", "dplyr")

 

1 Introduction

First, I will add a table about parametric and non-parametric tests for independent groups. This table gives an overview of which approach is suitable for which kinds of variables. In this practice, I will focus on correlation analysis:

 

1.1 About correlation coefficient r

Possible values of the coefficient:

  • A coefficient of +1: a perfect positive relationship.

  • A coefficient of -1: a perfect negative relationship.

  • A coefficient of 0: no linear relationship at all.

     

The correlation coefficient is a commonly used measure of the size of an effect:

  • ± 0.1: a small effect.

  • ± 0.3: a medium effect.

  • ± 0.5: a large effect.

  • However, if you can, try to interpret the size of correlation within the context of the research you’ve done rather than blindly following these benchmarks.

     

Correlation coefficients give no indication of the direction of causality, for two reasons:

  • The third-variable problem: causality between two variables cannot be assumed because there may be other measured or unmeasured variables affecting the result.

  • Direction of causality: Even if we could ignore the third-variable problem, the correlation coefficient doesn’t indicate in which direction causality operates.

     

2 types of correlation:

  • A bivariate correlation: a correlation between two variables (as described at the beginning of this chapter): Pearson’s product-moment correlation coefficient, Spearman’s rho and Kendall’s tau.

  • A partial correlation: the relationship between two variables while ‘controlling’ the effect of one or more additional variables.

     

1.2 R2 - Coefficient of determination for interpretation

Extremely important:

  • We cannot make direct conclusions about causality from a correlation.

  • Coefficient of determination, R2: a measure of the amount of variability in one variable that is shared by the other.

  • R2 cannot be used to infer causal relationships.

     

Example: The correlation coefficient (r) of X and Y is -0.4410, so the value of R2 will be (-0.4410)^2 = 0.194. This value tells us how much of the variability in Y is shared by X.
=> X shares 19.4% of the variability in Y.
=> Or: X accounts for only 19.4% of the variation in Y.
=> This leaves 80.6% of the variability still to be accounted for by other variables.

R2 cannot be used to infer causal relationships. X might well share 19.4% of the variation in Y, but it does not necessarily cause this variation.
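A quick check of this arithmetic in R:

r <- -0.4410   # correlation coefficient from the example above
r^2            # 0.194 -> about 19.4% shared variability
1 - r^2        # 0.806 -> about 80.6% left unexplained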

 

1.3 Calculating the effect size

Correlation coefficients are themselves effect sizes, so r requires no further calculation.

  1. Pearson’s r: we can square this value to get the proportion of shared variance, R2.

  2. For Spearman’s rs we can do this too, because it uses the same equation as Pearson’s r. However, the resulting Rs2 is the proportion of variance in the ranks that the two variables share. Having said this, Rs2 is usually a good approximation of R2 (especially in conditions of near-normal distributions).

  3. Kendall’s τ, however, is not numerically similar to either r or rs and so τ2 does not tell us about the proportion of variance shared by two variables or the ranks of those two variables. Kendall’s τ is 66–75% smaller than both Spearman’s rs and Pearson’s r, but r and rs are generally similar sizes.
    => If τ is used as an effect size it should be borne in mind that it is not comparable to r and rs and should not be squared.

  4. Point-biserial and biserial correlations differ in size too (the biserial correlation was bigger than the point-biserial). => Be careful to decide whether your dichotomous variable has an underlying continuum, or whether it is a truly discrete variable.

 

1.4 R commands

There are 4 main functions that you can apply to calculate correlation coefficient:

  • cor() and cor.test() are part of the base system in R.
  • corr.test() is part of the psych package.
  • rcorr() is part of the Hmisc package.
| Function    | Pearson | Spearman | Kendall | p-value | CI | Multiple correlations? | Sample size |
|:------------|:-------:|:--------:|:-------:|:-------:|:--:|:-----------------------:|:-----------:|
| cor()       | x       | x        | x       |         |    | x                        |             |
| cor.test()  | x       | x        | x       | x       | x  |                          |             |
| corr.test() | x       | x        | x       | x       | x  | x                        | x           |
| rcorr()     | x       | x        |         | x       |    | x                        | x           |

 

1.4.1 cor()

Command: cor(x, y = NULL, use = “everything”, method = c(“pearson”, “kendall”, “spearman”))

  • x: a numeric variable or dataframe.

  • y: another numeric variable.

  • use: set equal to a character string that specifies how missing values are handled. The strings can be:

  1. “everything”: R will output an NA instead of a correlation coefficient for any correlations involving variables containing missing values;

  2. “all.obs”: use all observations and, therefore, returns an error message if there are any missing values in the data;

  3. “complete.obs”: correlations are computed from only cases that are complete for all variables – sometimes known as excluding cases listwise (if a case has a missing value for any variable, then they are excluded from the whole analysis);

  4. “pairwise.complete.obs”: in which correlations between pairs of variables are computed for cases that are complete for those two variables – sometimes known as excluding cases pairwise (if a participant has a score missing for a particular variable, then their data are excluded only from calculations involving the variable for which they have no score).

  • method: a character string indicating which correlation coefficient (or covariance) is to be computed. One of “pearson” (default), “kendall”, or “spearman”; the name can be abbreviated. Note that cor() computes only one type of coefficient per call, so to get, for example, both Pearson and Spearman coefficients you need two separate calls.
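A minimal illustration with R’s built-in mtcars data (values rounded; this is separate from the fakestroke analysis below):

cor(mtcars$mpg, mtcars$wt, method = "pearson")    # approx. -0.87
cor(mtcars$mpg, mtcars$wt, method = "spearman")   # approx. -0.89
cor(mtcars$mpg, mtcars$wt, method = "kendall")    # approx. -0.73 (tau is smaller, as noted in section 1.3)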

 

1.4.2 cor.test()

Test for association between paired samples, using one of Pearson’s product moment correlation coefficient, Kendall’s tau or Spearman’s rho.

Command: cor.test(x, y, alternative = c(“two.sided”, “less”, “greater”), method = c(“pearson”, “kendall”, “spearman”), conf.level = 0.95, …)

  • x: a numeric variable.

  • y: another numeric variable.

  • alternative: whether you want to do a two-tailed test (alternative = “two.sided” - default), or whether you predict that the correlation will be less than zero (i.e., negative) or more than zero (i.e., positive), in which case you can use alternative = “less” and alternative = “greater”, respectively.

  • method: the same as for cor().

  • conf.level: specify the width of the confidence interval computed for the correlation. The default is 0.95 (conf.level = 0.95).
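A minimal sketch with the same built-in mtcars data, one pair of variables at a time:

# Two-sided Pearson test with a 95% confidence interval:
cor.test(mtcars$mpg, mtcars$wt)
# One-sided test if we predict a negative correlation:
cor.test(mtcars$mpg, mtcars$wt, alternative = "less")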

 

1.4.3 corr.test()

Although the cor() function finds the correlations for a matrix, it does not report probability values. cor.test() does, but for only one pair of variables at a time. corr.test() uses cor() to find the correlations for either complete or pairwise data and reports the sample sizes and probability values as well. For symmetric matrices, raw probabilities are reported below the diagonal and correlations adjusted for multiple comparisons above the diagonal. In the case of different x and y, the default is to adjust the probabilities for multiple tests.

Package: psych
Command: corr.test( x, y = NULL, use = “pairwise”, method=“pearson”, adjust=“holm”, alpha=.05, ci=TRUE, minlength=5, normal=TRUE)

  • x: a matrix or dataframe.

  • y: a second matrix or dataframe with the same number of rows as x.

  • use=“pairwise”: the default value and will do pairwise deletion of cases. use=“complete” will select just complete cases.

  • method: method=“pearson” is the default value. The alternatives to be passed to cor are “spearman” and “kendall”. These last two are much slower, particularly for big data sets.

  • adjust: What adjustment for multiple tests should be used? (“holm”, “hochberg”, “hommel”, “bonferroni”, “BH”, “BY”, “fdr”, “none”). See p.adjust for details about why to use “holm” rather than “bonferroni”).

  • alpha: alpha level of confidence intervals.
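A toy sketch of the Holm adjustment, again with mtcars (the demonstration on the real data follows in section 3.3.4.3):

library(psych)
ct <- corr.test(mtcars[, c("mpg", "wt", "hp")], adjust = "holm")
ct$p   # raw p-values below the diagonal, Holm-adjusted above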

 

1.4.4 rcorr()

rcorr() computes a matrix of Pearson’s r or Spearman’s rho rank correlation coefficients for all possible pairs of columns of a matrix. Missing values are deleted in pairs rather than deleting all rows of x having any missing variables. Ranks are computed using efficient algorithms, using midranks for ties.

Package: Hmisc

Command: rcorr(x,y, type = “correlation type”)

  • x is a numeric variable or matrix.
  • y is another numeric variable. (does not need to be specified if x above is a matrix).
  • type enables you to specify whether you want “pearson” or “spearman” correlations; rcorr() computes one type per call.

Note:

  • This function does not work on dataframes, so you have to convert your dataframe to a matrix first: newMatrix <- as.matrix(dataframe).
  • This function excludes cases pairwise.

 

1.4.5 Drawing a visual graph

The R package corrplot provides a visual exploratory tool for correlation matrices that supports automatic variable reordering to help detect hidden patterns among variables. It applies to the output of the cor() command.

Package: corrplot

Command: corrplot(df)

  • df = cor(dataset)

     

Leave blank on non-significant coefficients and add all correlation coefficients:

Command:
corrplot(df, p.mat = testRes$p, method = 'circle', type = 'lower',
         insig = 'blank', order = 'AOE', diag = FALSE)$corrPos -> p1
text(p1$x, p1$y, round(p1$corr, 2))

 

2 Simulated fakestroke data

Main data used in this notes: fakestroke.csv
(Source: https://github.com/THOMASELOVE/432-data/blob/master/data/fakestroke.csv)

The fakestroke.csv file contains the following 18 variables for 500 patients:

# To make a table in R markdown: 1st row: header, 2nd row: Alignment; the remaining row: for content
## |Variable |  Description |
## |:------- | :----------  |
## |studyid  | Study ID  # (z001 through z500) | 
| Variable  | Type     | Description |
|:----------|:---------|:------------|
| studyid   | Category | Study ID # (z001 through z500) |
| trt       | Category | Treatment group (Intervention or Control) |
| age       | Numeric  | Age in years |
| sex       | Category | Male or Female |
| nihss     | Numeric  | NIH Stroke Scale Score (can range from 0-42; higher scores indicate more severe neurological deficits) |
| location  | Category | Stroke Location - Left or Right Hemisphere |
| hx.isch   | Category | History of Ischemic Stroke (Yes/No) |
| afib      | Category | Atrial Fibrillation (1 = Yes, 0 = No) |
| dm        | Category | Diabetes Mellitus (1 = Yes, 0 = No) |
| mrankin   | Category | Pre-stroke modified Rankin scale score (0, 1, 2 or > 2) indicating functional disability - complete range is 0 (no symptoms) to 6 (death) |
| sbp       | Numeric  | Systolic blood pressure, in mm Hg |
| iv.altep  | Category | Treatment with IV alteplase (Yes/No) |
| time.iv   | Numeric  | Time from stroke onset to start of IV alteplase (minutes) if iv.altep = Yes |
| aspects   | Numeric  | Alberta Stroke Program Early Computed Tomography score, which measures extent of stroke from 0 - 10; higher scores indicate fewer early ischemic changes |
| ia.occlus | Category | Intracranial arterial occlusion, based on vessel imaging - five categories |
| extra.ica | Category | Extracranial ICA occlusion (1 = Yes, 0 = No) |
| time.rand | Numeric  | Time from stroke onset to study randomization, in minutes |
| time.punc | Numeric  | Time from stroke onset to groin puncture, in minutes (only if Intervention) |

 

3 Pearson’s correlation coefficient (r) – Parametric test

3.1 Assumptions of Pearson’s r

  • Requires only that data are interval (continuous variables).
  • If you want to establish whether the correlation coefficient is significant, there is one more assumption: the sampling distribution has to be normally distributed.
    => If your data are non-normal, use a different kind of correlation coefficient or use bootstrapping.

3.2 Conclusion about Pearson’s r

Pearson correlation only indicates the degree of association between the two continuous variables but not the causation.

  • Using Pearson correlation, if the correlation is positive and significant, one should not conclude that the variable X positively causes or affects Y;
  • If the correlation is negative and significant, one should not conclude that the variable X negatively causes or affects Y.
  • One can only say that, for a positive and significant correlation, X and Y are positively associated, and for a negative and significant correlation, X and Y are negatively associated, but one should never draw conclusions about causation. It is only about the trend. This is a common mistake found in many research articles and should be avoided.

3.3 Practice with R

3.3.1 Step 1: Data Wrangling

  1. Load the data into R and examine some basic information about the variables. When loading the data, I also define the type of each variable (numeric or text):
data <- read_excel("D:/Statistics/R/R data/fakestroke.xlsx", 
                 col_types = c("text", "text", "numeric",
                               "text", "numeric", "text", "text", 
                               "numeric", "numeric", "text", "numeric", 
                               "text", "numeric", "numeric", "text", 
                               "numeric", "numeric", "numeric")) 
names(data)
##  [1] "studyid"   "trt"       "age"       "sex"       "nihss"     "location" 
##  [7] "hx.isch"   "afib"      "dm"        "mrankin"   "sbp"       "iv.altep" 
## [13] "time.iv"   "aspects"   "ia.occlus" "extra.ica" "time.rand" "time.punc"
str(data)
## tibble [500 x 18] (S3: tbl_df/tbl/data.frame)
##  $ studyid  : chr [1:500] "z001" "z002" "z003" "z004" ...
##  $ trt      : chr [1:500] "Control" "Intervention" "Control" "Control" ...
##  $ age      : num [1:500] 53 51 68 28 91 34 75 89 75 26 ...
##  $ sex      : chr [1:500] "Male" "Male" "Female" "Male" ...
##  $ nihss    : num [1:500] 21 23 11 22 24 18 25 18 25 27 ...
##  $ location : chr [1:500] "Right" "Left" "Right" "Left" ...
##  $ hx.isch  : chr [1:500] "No" "No" "No" "No" ...
##  $ afib     : num [1:500] 0 1 0 0 0 0 0 0 1 0 ...
##  $ dm       : num [1:500] 0 0 0 0 0 0 0 0 0 0 ...
##  $ mrankin  : chr [1:500] "2" "0" "0" "0" ...
##  $ sbp      : num [1:500] 127 137 138 122 162 166 140 157 129 143 ...
##  $ iv.altep : chr [1:500] "Yes" "Yes" "No" "Yes" ...
##  $ time.iv  : num [1:500] 63 68 NA 78 121 78 97 NA 49 99 ...
##  $ aspects  : num [1:500] 10 10 10 10 8 5 10 9 6 10 ...
##  $ ia.occlus: chr [1:500] "M1" "M1" "ICA with M1" "ICA with M1" ...
##  $ extra.ica: num [1:500] 0 1 1 1 0 0 0 0 0 0 ...
##  $ time.rand: num [1:500] 139 118 178 160 214 154 122 147 271 141 ...
##  $ time.punc: num [1:500] NA 281 NA NA NA NA 268 NA NA 259 ...

 

  2. Convert all character variables to factors:
data <- data%>%mutate_if(is.character, as.factor)
str(data) 
## tibble [500 x 18] (S3: tbl_df/tbl/data.frame)
##  $ studyid  : Factor w/ 500 levels "z001","z002",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ trt      : Factor w/ 2 levels "Control","Intervention": 1 2 1 1 1 1 2 1 1 2 ...
##  $ age      : num [1:500] 53 51 68 28 91 34 75 89 75 26 ...
##  $ sex      : Factor w/ 2 levels "Female","Male": 2 2 1 2 2 1 2 1 2 1 ...
##  $ nihss    : num [1:500] 21 23 11 22 24 18 25 18 25 27 ...
##  $ location : Factor w/ 2 levels "Left","Right": 2 1 2 1 2 1 2 2 1 2 ...
##  $ hx.isch  : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ afib     : num [1:500] 0 1 0 0 0 0 0 0 1 0 ...
##  $ dm       : num [1:500] 0 0 0 0 0 0 0 0 0 0 ...
##  $ mrankin  : Factor w/ 4 levels "> 2","0","1",..: 4 2 2 2 2 4 2 2 4 2 ...
##  $ sbp      : num [1:500] 127 137 138 122 162 166 140 157 129 143 ...
##  $ iv.altep : Factor w/ 2 levels "No","Yes": 2 2 1 2 2 2 2 1 2 2 ...
##  $ time.iv  : num [1:500] 63 68 NA 78 121 78 97 NA 49 99 ...
##  $ aspects  : num [1:500] 10 10 10 10 8 5 10 9 6 10 ...
##  $ ia.occlus: Factor w/ 6 levels "A1 or A2","ICA with M1",..: 4 4 2 2 4 1 4 4 4 4 ...
##  $ extra.ica: num [1:500] 0 1 1 1 0 0 0 0 0 0 ...
##  $ time.rand: num [1:500] 139 118 178 160 214 154 122 147 271 141 ...
##  $ time.punc: num [1:500] NA 281 NA NA NA NA 268 NA NA 259 ...

 

  3. afib, dm and extra.ica are categorical variables, but in the Excel file they are coded as “1” and “0”. Therefore, I must convert them back to “Yes”/“No” factor variables:
# Changing coding variable to categorical:
data$afib       <- factor(data$afib , levels=0:1, labels=c("No", "Yes"))
data$dm         <- factor(data$dm , levels=0:1, labels=c("No", "Yes"))
data$extra.ica  <- factor(data$extra.ica, levels=0:1, labels=c("No", "Yes"))
# Checking after converting
str(data)
## tibble [500 x 18] (S3: tbl_df/tbl/data.frame)
##  $ studyid  : Factor w/ 500 levels "z001","z002",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ trt      : Factor w/ 2 levels "Control","Intervention": 1 2 1 1 1 1 2 1 1 2 ...
##  $ age      : num [1:500] 53 51 68 28 91 34 75 89 75 26 ...
##  $ sex      : Factor w/ 2 levels "Female","Male": 2 2 1 2 2 1 2 1 2 1 ...
##  $ nihss    : num [1:500] 21 23 11 22 24 18 25 18 25 27 ...
##  $ location : Factor w/ 2 levels "Left","Right": 2 1 2 1 2 1 2 2 1 2 ...
##  $ hx.isch  : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ afib     : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 2 1 ...
##  $ dm       : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mrankin  : Factor w/ 4 levels "> 2","0","1",..: 4 2 2 2 2 4 2 2 4 2 ...
##  $ sbp      : num [1:500] 127 137 138 122 162 166 140 157 129 143 ...
##  $ iv.altep : Factor w/ 2 levels "No","Yes": 2 2 1 2 2 2 2 1 2 2 ...
##  $ time.iv  : num [1:500] 63 68 NA 78 121 78 97 NA 49 99 ...
##  $ aspects  : num [1:500] 10 10 10 10 8 5 10 9 6 10 ...
##  $ ia.occlus: Factor w/ 6 levels "A1 or A2","ICA with M1",..: 4 4 2 2 4 1 4 4 4 4 ...
##  $ extra.ica: Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 1 1 1 ...
##  $ time.rand: num [1:500] 139 118 178 160 214 154 122 147 271 141 ...
##  $ time.punc: num [1:500] NA 281 NA NA NA NA 268 NA NA 259 ...

 

  4. Reorder the levels of some factor variables:
  • trt: Intervention (baseline) , Control
  • mrankin: 0, 1, 2, > 2.
  • ia.occlus: rearrange the ia.occlus variable to the order presented in Berkhemer et al. (2015).
# Relevel factors:
data$trt = factor(data$trt, levels=c("Intervention", "Control"))
data$mrankin = factor(data$mrankin, levels=c("0", "1", "2", "> 2"))
data$ia.occlus = factor(data$ia.occlus, levels=c("Intracranial ICA", "ICA with M1", 
                                             "M1", "M2", "A1 or A2"))

# Check order of factor after relevel: 
str(data)
## tibble [500 x 18] (S3: tbl_df/tbl/data.frame)
##  $ studyid  : Factor w/ 500 levels "z001","z002",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ trt      : Factor w/ 2 levels "Intervention",..: 2 1 2 2 2 2 1 2 2 1 ...
##  $ age      : num [1:500] 53 51 68 28 91 34 75 89 75 26 ...
##  $ sex      : Factor w/ 2 levels "Female","Male": 2 2 1 2 2 1 2 1 2 1 ...
##  $ nihss    : num [1:500] 21 23 11 22 24 18 25 18 25 27 ...
##  $ location : Factor w/ 2 levels "Left","Right": 2 1 2 1 2 1 2 2 1 2 ...
##  $ hx.isch  : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ afib     : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 2 1 ...
##  $ dm       : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mrankin  : Factor w/ 4 levels "0","1","2","> 2": 3 1 1 1 1 3 1 1 3 1 ...
##  $ sbp      : num [1:500] 127 137 138 122 162 166 140 157 129 143 ...
##  $ iv.altep : Factor w/ 2 levels "No","Yes": 2 2 1 2 2 2 2 1 2 2 ...
##  $ time.iv  : num [1:500] 63 68 NA 78 121 78 97 NA 49 99 ...
##  $ aspects  : num [1:500] 10 10 10 10 8 5 10 9 6 10 ...
##  $ ia.occlus: Factor w/ 5 levels "Intracranial ICA",..: 3 3 2 2 3 5 3 3 3 3 ...
##  $ extra.ica: Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 1 1 1 ...
##  $ time.rand: num [1:500] 139 118 178 160 214 154 122 147 271 141 ...
##  $ time.punc: num [1:500] NA 281 NA NA NA NA 268 NA NA 259 ...

 

  5. After these four steps, we basically have tidy data that are nearly ready for analysis. But since the Pearson correlation applies only to numeric variables, I have to exclude all factor variables:
# Select only numeric variables:
df1 = data %>% dplyr::select_if(is.numeric)
str(df1)
## tibble [500 x 7] (S3: tbl_df/tbl/data.frame)
##  $ age      : num [1:500] 53 51 68 28 91 34 75 89 75 26 ...
##  $ nihss    : num [1:500] 21 23 11 22 24 18 25 18 25 27 ...
##  $ sbp      : num [1:500] 127 137 138 122 162 166 140 157 129 143 ...
##  $ time.iv  : num [1:500] 63 68 NA 78 121 78 97 NA 49 99 ...
##  $ aspects  : num [1:500] 10 10 10 10 8 5 10 9 6 10 ...
##  $ time.rand: num [1:500] 139 118 178 160 214 154 122 147 271 141 ...
##  $ time.punc: num [1:500] NA 281 NA NA NA NA 268 NA NA 259 ...

 

3.3.2 Step 2: Missingness checking

Package: VIM

Command: aggr(x, plot = T, prop = T, numbers = F, combined = F, sortVars = T, cex.axis = .7, gap = 3)

  • plot: whether the results should be plotted.
  • numbers: whether the proportion or frequencies of the different combinations should be represented by numbers.
  • prop: whether the proportion of missing/imputed values should be used rather than the total amount.
  • combined: a logical indicating whether the two plots should be combined.
  • sortVars: whether the variables should be sorted by the number of missing/imputed values.

For more information, please refer to the following link:
Dealing with missingness - Imputation: https://rpubs.com/minhtri/932007

 

Checking the percentage of missingness:

library(VIM)
aggr(df1, plot = F, numbers = T, prop = T)
## 
##  Missings in variables:
##   Variable Count
##        sbp     1
##    time.iv    55
##    aspects     4
##  time.rand     2
##  time.punc   267
aggr(df1, plot = T, numbers = T, prop = T, combined= F, sortVars = T, cex.axis=.7, gap=3)

## 
##  Variables sorted by number of missings: 
##   Variable Count
##  time.punc 0.534
##    time.iv 0.110
##    aspects 0.008
##  time.rand 0.004
##        sbp 0.002
##        age 0.000
##      nihss 0.000
aggr(df1, plot = T, numbers = T, prop = F, combined= F, sortVars = T, cex.axis=.7, gap=3)

## 
##  Variables sorted by number of missings: 
##   Variable Count
##  time.punc   267
##    time.iv    55
##    aspects     4
##  time.rand     2
##        sbp     1
##        age     0
##      nihss     0

 

Discuss

  • Of the 7 numeric variables, 5 have missing values.

  • However, the missingness in time.punc and time.iv is due to the experimental design (time.punc only applies to the Intervention group (trt); time.iv only applies to patients with iv.altep = Yes). Therefore, I must exclude them from the missingness analysis.

  • That leaves 5 variables, each with a missingness proportion < 5%.
    => Complete case analysis can be applied.
    => We can ignore the missingness and progress to the next step.

     

3.3.3 Step 3: Preliminary test for Assumption: Data distribution

For Pearson: if you want to establish whether the correlation coefficient is significant, there is one more assumption: the sampling distribution has to be normally distributed.

 

3.3.3.1 Step 3.1: Shapiro-Wilk test

Mechanism: the Shapiro-Wilk test compares the scores in the sample to a normally distributed set of scores with the same mean and standard deviation:

  • Null hypothesis (p > 0.05): the data are normally distributed.
  • Alternative hypothesis (p < 0.05): the data are not normally distributed.

Limitation: with large sample sizes it is very easy to get significant results from small deviations from normality, so a significant test doesn’t necessarily tell us whether the deviation from normality is enough to bias any statistical procedures that we apply to the data.
=> Plot your data (histograms / Q-Q plots) and try to make an informed decision about the extent of non-normality (and check the values of skew and kurtosis).
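For numeric skew and kurtosis, one option is psych::describe(); a minimal sketch (df1 is the numeric data frame created in Step 1):

library(psych)
describe(df1)   # one row per variable, including skew and kurtosis columns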

Command:
* For a numeric variable: shapiro.test(variable)
* For separate groups within one numeric variable: by(numeric variable, grouping factor, test function); a sketch follows below
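The by() form is not demonstrated later, so here is a sketch of what it would look like on this data set (one Shapiro-Wilk test of age per treatment group; data is the tibble from Step 1):

by(data$age, data$trt, shapiro.test)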

Practice: Testing the distribution of 7 numeric variables:

# List names of numeric variables:
names(df1)
## [1] "age"       "nihss"     "sbp"       "time.iv"   "aspects"   "time.rand"
## [7] "time.punc"
# Using Shapiro test:
shapiro.test(df1$age)
## 
##  Shapiro-Wilk normality test
## 
## data:  df1$age
## W = 0.97448, p-value = 1.184e-07
shapiro.test(df1$nihss)
## 
##  Shapiro-Wilk normality test
## 
## data:  df1$nihss
## W = 0.96203, p-value = 4.588e-10
shapiro.test(df1$sbp)
## 
##  Shapiro-Wilk normality test
## 
## data:  df1$sbp
## W = 0.99841, p-value = 0.9336
shapiro.test(df1$time.iv)
## 
##  Shapiro-Wilk normality test
## 
## data:  df1$time.iv
## W = 0.91375, p-value = 3.041e-15
shapiro.test(df1$aspects)
## 
##  Shapiro-Wilk normality test
## 
## data:  df1$aspects
## W = 0.8438, p-value < 2.2e-16
shapiro.test(df1$time.rand)
## 
##  Shapiro-Wilk normality test
## 
## data:  df1$time.rand
## W = 0.95965, p-value = 1.93e-10
shapiro.test(df1$time.punc)
## 
##  Shapiro-Wilk normality test
## 
## data:  df1$time.punc
## W = 0.93577, p-value = 1.435e-08

 

Discuss

  • sbp: p-value = 0.9336 > 0.05: normal distribution.

  • Other variables: p-value <0.05: non-normal distribution.

     

3.3.3.2 Step 3.2: Drawing Q-Q plot

There are 2 commands we can use:

Package: ggplot2

  • ggplot(data = , aes(sample = )) + geom_qq() + geom_qq_line() + labs( x = “Theoretical”, y = “Sample Quantiles -”)

Package: ggpubr

  • ggqqplot(variable, ylab = ” “)

Practice: Let’s try the 2 commands for age - similar plots are drawn:

ggplot(data = df1, aes(sample = age)) + 
  geom_qq() + 
  geom_qq_line() + 
  labs( x = "Theoretical", y = "Sample Quantiles - Age")

library("ggpubr")
ggqqplot(df1$age, ylab = "Age ")

 

Let’s use the command for the other numeric variables:

ggqqplot(df1$nihss, ylab = "nihss ")

ggqqplot(df1$sbp, ylab = "sbp ")

ggqqplot(df1$time.iv, ylab = "time.iv ")

ggqqplot(df1$aspects, ylab = "aspects ")

ggqqplot(df1$time.rand, ylab = "time.rand ")

ggqqplot(df1$time.punc, ylab = "time.punc ")

 

Discuss

  • The Q-Q plot of sbp: normal distribution.

  • Other variables: non-normal distribution.
    => The results of the Shapiro-Wilk test agree with the Q-Q plots.
    => The assumption of normal distribution is violated.
    => We should use a non-parametric approach: Spearman’s correlation coefficient (rs) or Kendall’s tau (τ), or bootstrap Pearson’s r (see the sketch below).
    => However, for demonstration, I will just continue with the Pearson correlation test.
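If we wanted to keep Pearson’s r despite the non-normality, bootstrapping its confidence interval is one option. A minimal sketch using the boot package (boot_r is a hypothetical helper name, not part of the original analysis):

library(boot)
# Statistic: Pearson's r for age vs nihss on each resampled data set
boot_r <- function(d, i) cor(d$age[i], d$nihss[i], use = "pairwise.complete.obs")
set.seed(123)                      # for reproducibility
boot_out <- boot(df1, boot_r, R = 2000)
boot.ci(boot_out, type = "bca")    # bias-corrected and accelerated CI for r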

     

3.3.4 Step 4: Pearson correlation test

Let’s get to the main part: we have 4 commands: cor(), cor.test(), corr.test() and rcorr(). I will apply them one by one to compare the results for age vs nihss and age vs sbp (where relevant):

3.3.4.1 cor()

cor(x,y, use = “string”, method = “correlation type”)

where use is set equal to a character string that specifies how missing values are handled. The strings can be:

  • everything (default): R will output an NA instead of a correlation coefficient for any correlations involving variables containing missing values;
  • all.obs: will use all observations and, therefore, returns an error message if there are any missing values in the data;
  • complete.obs: correlations are computed from only cases that are complete for all variables – sometimes known as excluding cases listwise (if a case has a missing value for any variable, then they are excluded from the whole analysis);
  • pairwise.complete.obs: in which correlations between pairs of variables are computed for cases that are complete for those two variables – sometimes known as excluding cases pairwise (if a participant has a score missing for a particular variable, then their data are excluded only from calculations involving the variable for which they have no score).

I will demonstrate 3 options of use through df1. However, I prefer use = "pairwise.complete.obs":

cor(df1, use = "everything", method = "pearson")
##                   age       nihss sbp time.iv aspects time.rand time.punc
## age        1.00000000 -0.03376615  NA      NA      NA        NA        NA
## nihss     -0.03376615  1.00000000  NA      NA      NA        NA        NA
## sbp                NA          NA   1      NA      NA        NA        NA
## time.iv            NA          NA  NA       1      NA        NA        NA
## aspects            NA          NA  NA      NA       1        NA        NA
## time.rand          NA          NA  NA      NA      NA         1        NA
## time.punc          NA          NA  NA      NA      NA        NA         1
cor(df1, use ="complete.obs", method = "pearson")
##                   age        nihss          sbp      time.iv       aspects
## age        1.00000000 -0.037087611  0.065222765  0.018257624  0.0964599084
## nihss     -0.03708761  1.000000000 -0.046851536  0.022913105 -0.0070076127
## sbp        0.06522276 -0.046851536  1.000000000 -0.001481055 -0.0588125399
## time.iv    0.01825762  0.022913105 -0.001481055  1.000000000  0.0214545875
## aspects    0.09645991 -0.007007613 -0.058812540  0.021454587  1.0000000000
## time.rand -0.06047223 -0.080975591  0.045421238  0.033862157  0.0076874989
## time.punc  0.09370001 -0.077125919  0.072707526 -0.145815362  0.0006809697
##              time.rand     time.punc
## age       -0.060472227  0.0937000138
## nihss     -0.080975591 -0.0771259191
## sbp        0.045421238  0.0727075264
## time.iv    0.033862157 -0.1458153617
## aspects    0.007687499  0.0006809697
## time.rand  1.000000000 -0.0116944381
## time.punc -0.011694438  1.0000000000
cor(df1, use = "pairwise.complete.obs", method = "pearson")
##                   age       nihss          sbp     time.iv      aspects
## age        1.00000000 -0.03376615  0.010030620 -0.02228718  0.008500860
## nihss     -0.03376615  1.00000000 -0.038767105  0.06151694 -0.043631014
## sbp        0.01003062 -0.03876710  1.000000000  0.07122853 -0.008519498
## time.iv   -0.02228718  0.06151694  0.071228535  1.00000000  0.030083740
## aspects    0.00850086 -0.04363101 -0.008519498  0.03008374  1.000000000
## time.rand -0.02424378 -0.07720899  0.028310877 -0.01314698  0.014546529
## time.punc  0.09933442 -0.04822361  0.064757047 -0.13864373 -0.014564781
##              time.rand    time.punc
## age       -0.024243782  0.099334424
## nihss     -0.077208987 -0.048223608
## sbp        0.028310877  0.064757047
## time.iv   -0.013146984 -0.138643731
## aspects    0.014546529 -0.014564781
## time.rand  1.000000000  0.005445479
## time.punc  0.005445479  1.000000000

 

library(corrplot)
a = cor(df1, use ="complete.obs", method = "pearson")
corrplot(a, method = 'number') # colorful number

library(corrplot)
a = cor(df1, use ="complete.obs", method = "pearson")
corrplot(a)

## leave blank on non-significant coefficient
## add all correlation coefficients
library(corrplot)
M = cor(df1, use ="complete.obs", method = "pearson")

testRes  = cor.mtest(df1, use ="complete.obs", method = "pearson")

corrplot(M, p.mat = testRes$p, method = 'circle', type = 'lower', insig='blank',
         order = 'AOE', diag = FALSE)$corrPos -> p1

text(p1$x, p1$y, round(p1$corr, 2))

Discuss

Looking at the result, we can easily notice that:

  • Each variable is perfectly correlated with itself (obviously) and so r = 1.
  • For use = "everything": sbp, time.iv, aspects, time.rand and time.punc contain missing values => their correlations are reported as NA.
  • For use = "complete.obs": the program deletes every row containing a missing value: 1+1+1+4+25+29+237 = 298 rows (please refer to the missing value checking section). => The Pearson correlation coefficient of nihss and age changes, and the correlations of the other variables can now be calculated.
  • For use = "pairwise.complete.obs": the program uses the complete cases for each pair of variables. Since age and nihss have no missing values, their coefficient here matches the use = "everything" result rather than the complete.obs one. This pairwise analysis is preferable to me.

For more visualization, please visit: https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html

 

Our target variables:

  • For age vs nihss: r = -0.03 indicates that age is weakly negatively related to nihss. R2 = 0.09% indicates that age shares 0.09% of the variability in nihss.

  • For age vs sbp: r = 0.01: age is weakly positively related to sbp. R2 = 0.01% indicates that age shares 0.01% of the variability in sbp.

     

3.3.4.2 cor.test()

cor.test(x, y, alternative = c(“two.sided”, “less”, “greater”), method = c(“pearson”, “kendall”, “spearman”), conf.level = 0.95, …)

cor.test(df1$age, df1$nihss)
## 
##  Pearson's product-moment correlation
## 
## data:  df1$age and df1$nihss
## t = -0.75395, df = 498, p-value = 0.4512
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.12109817  0.05408458
## sample estimates:
##         cor 
## -0.03376615
cor.test(df1$age, df1$sbp)
## 
##  Pearson's product-moment correlation
## 
## data:  df1$age and df1$sbp
## t = 0.22363, df = 497, p-value = 0.8231
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.07781638  0.09772306
## sample estimates:
##        cor 
## 0.01003062

 

Discuss

  • The Pearson correlation coefficients of age and nihss, and of age and sbp, are the same when using cor() and cor.test(): very weak relationships.

  • The p-values > 0.05: not statistically significant.

  • The confidence intervals contain 0: further evidence that r is not statistically significant.
    => Very weakly related and not statistically significant.

  • For age and nihss: r = -0.033 and p > 0.05. This p-value tells us that the probability of getting a correlation coefficient this big in a sample of 500 observations, if the null hypothesis were true (no relationship between these variables), is very high. Therefore, we fail to reject the null hypothesis and say that the relationship between age and nihss is not statistically significant.

  • For age and sbp: Similar conclusion.

     

3.3.4.3 corr.test()

Package: psych
Command: corr.test( x, y = NULL, use = “pairwise”, method=“pearson”, adjust=“holm”, alpha=.05, ci=TRUE, minlength=5, normal=TRUE)

library(psych)
print(corr.test(df1), short = F)
## Call:corr.test(x = df1)
## Correlation matrix 
##             age nihss   sbp time.iv aspects time.rand time.punc
## age        1.00 -0.03  0.01   -0.02    0.01     -0.02      0.10
## nihss     -0.03  1.00 -0.04    0.06   -0.04     -0.08     -0.05
## sbp        0.01 -0.04  1.00    0.07   -0.01      0.03      0.06
## time.iv   -0.02  0.06  0.07    1.00    0.03     -0.01     -0.14
## aspects    0.01 -0.04 -0.01    0.03    1.00      0.01     -0.01
## time.rand -0.02 -0.08  0.03   -0.01    0.01      1.00      0.01
## time.punc  0.10 -0.05  0.06   -0.14   -0.01      0.01      1.00
## Sample Size 
##           age nihss sbp time.iv aspects time.rand time.punc
## age       500   500 499     445     496       498       233
## nihss     500   500 499     445     496       498       233
## sbp       499   499 499     444     495       497       233
## time.iv   445   445 444     445     441       444       203
## aspects   496   496 495     441     496       494       233
## time.rand 498   498 497     444     494       498       231
## time.punc 233   233 233     203     233       231       233
## Probability values (Entries above the diagonal are adjusted for multiple tests.) 
##            age nihss  sbp time.iv aspects time.rand time.punc
## age       0.00  1.00 1.00    1.00    1.00      1.00         1
## nihss     0.45  0.00 1.00    1.00    1.00      1.00         1
## sbp       0.82  0.39 0.00    1.00    1.00      1.00         1
## time.iv   0.64  0.20 0.13    0.00    1.00      1.00         1
## aspects   0.85  0.33 0.85    0.53    0.00      1.00         1
## time.rand 0.59  0.09 0.53    0.78    0.75      0.00         1
## time.punc 0.13  0.46 0.33    0.05    0.82      0.93         0
## 
##  Confidence intervals based upon normal theory.  To get bootstrapped values, try cor.ci
##             raw.lower raw.r raw.upper raw.p lower.adj upper.adj
## age-nihss       -0.12 -0.03      0.05  0.45     -0.16      0.09
## age-sbp         -0.08  0.01      0.10  0.82     -0.10      0.12
## age-tim.v       -0.12 -0.02      0.07  0.64     -0.15      0.11
## age-aspct       -0.08  0.01      0.10  0.85     -0.09      0.11
## age-tm.rn       -0.11 -0.02      0.06  0.59     -0.15      0.10
## age-tm.pn       -0.03  0.10      0.22  0.13     -0.10      0.29
## nihss-sbp       -0.13 -0.04      0.05  0.39     -0.17      0.09
## nihss-tim.v     -0.03  0.06      0.15  0.20     -0.08      0.20
## nihss-aspct     -0.13 -0.04      0.04  0.33     -0.17      0.09
## nihss-tm.rn     -0.16 -0.08      0.01  0.09     -0.21      0.06
## nihss-tm.pn     -0.18 -0.05      0.08  0.46     -0.24      0.14
## sbp-tim.v       -0.02  0.07      0.16  0.13     -0.07      0.21
## sbp-aspct       -0.10 -0.01      0.08  0.85     -0.12      0.10
## sbp-tm.rn       -0.06  0.03      0.12  0.53     -0.10      0.15
## sbp-tm.pn       -0.06  0.06      0.19  0.33     -0.13      0.26
## tim.v-aspct     -0.06  0.03      0.12  0.53     -0.11      0.16
## tim.v-tm.rn     -0.11 -0.01      0.08  0.78     -0.13      0.11
## tim.v-tm.pn     -0.27 -0.14      0.00  0.05     -0.34      0.08
## aspct-tm.rn     -0.07  0.01      0.10  0.75     -0.10      0.13
## aspct-tm.pn     -0.14 -0.01      0.11  0.82     -0.19      0.16
## tm.rn-tm.pn     -0.12  0.01      0.13  0.93     -0.12      0.13

 

Discuss

  • With corr.test(), we get r, the sample size, the p-value and the CI.

  • The sample size lets us know how many observations were used for testing the association between each pair of variables (because of the missing values).

  • The p-values > 0.05: not significant.

  • The CIs cross 0: also indicate non-significance.

  • The results are just as we expected.

     

3.3.4.4 rcorr()

Package: Hmisc

Command: rcorr(x,y, type = “correlation type”)

library(Hmisc)
rcorr(as.matrix(df1))
##             age nihss   sbp time.iv aspects time.rand time.punc
## age        1.00 -0.03  0.01   -0.02    0.01     -0.02      0.10
## nihss     -0.03  1.00 -0.04    0.06   -0.04     -0.08     -0.05
## sbp        0.01 -0.04  1.00    0.07   -0.01      0.03      0.06
## time.iv   -0.02  0.06  0.07    1.00    0.03     -0.01     -0.14
## aspects    0.01 -0.04 -0.01    0.03    1.00      0.01     -0.01
## time.rand -0.02 -0.08  0.03   -0.01    0.01      1.00      0.01
## time.punc  0.10 -0.05  0.06   -0.14   -0.01      0.01      1.00
## 
## n
##           age nihss sbp time.iv aspects time.rand time.punc
## age       500   500 499     445     496       498       233
## nihss     500   500 499     445     496       498       233
## sbp       499   499 499     444     495       497       233
## time.iv   445   445 444     445     441       444       203
## aspects   496   496 495     441     496       494       233
## time.rand 498   498 497     444     494       498       231
## time.punc 233   233 233     203     233       231       233
## 
## P
##           age    nihss  sbp    time.iv aspects time.rand time.punc
## age              0.4512 0.8231 0.6392  0.8502  0.5894    0.1306   
## nihss     0.4512        0.3875 0.1952  0.3322  0.0852    0.4638   
## sbp       0.8231 0.3875        0.1340  0.8500  0.5289    0.3250   
## time.iv   0.6392 0.1952 0.1340         0.5286  0.7824    0.0485   
## aspects   0.8502 0.3322 0.8500 0.5286          0.7471    0.8250   
## time.rand 0.5894 0.0852 0.5289 0.7824  0.7471            0.9344   
## time.punc 0.1306 0.4638 0.3250 0.0485  0.8250  0.9344

 

3.3.4.5 R2: Coefficient of Determination

To calculate R2 by hand, we simply square the r-value and multiply by 100%.
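For example, for age vs nihss from the cor() output above:

r <- cor(df1$age, df1$nihss, use = "pairwise.complete.obs")
r^2 * 100   # percentage of the variability in nihss shared with age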

 

3.3.5 Step 5: Draw a scatter plot for visualization

Command: ggplot(data, aes(x, y)) + geom_point() + geom_smooth(method = "lm", colour = "Red") + labs(x = "", y = "", title = " ") + theme(plot.title = element_text(hjust = 0.5))

Example 1: Plot age vs nihss

ggplot(df1, aes(age, nihss)) + 
  geom_point() + 
  geom_smooth(method = "lm", colour = "Red") + 
  labs(x = "age", y = "nihss", title = " ") + 
  theme(plot.title = element_text(hjust = 0.5))

 

Example 2: Plot age vs sbp

ggplot(df1, aes(age, sbp)) + 
  geom_point() + 
  geom_smooth(method = "lm", colour = "Red") + 
  labs(x = "age", y = "sbp", title = " ") + 
  theme(plot.title = element_text(hjust = 0.5))

 

Summary

  • All 4 commands produce similar results for Pearson’s correlation coefficient r.
  • However, I prefer corr.test() > rcorr() > cor.test() > cor() because of the information provided in the result sections.
  • From r, calculate the coefficient of determination R2: X shares n% of the variability in Y, or X accounts for only n% of the variation in Y.
  • r, R2 cannot be used to infer causal relationships: X does not necessarily cause the variation in Y.
| Function    | Pearson | Spearman | Kendall | p-value | CI | Multiple correlations? | Sample size |
|:------------|:-------:|:--------:|:-------:|:-------:|:--:|:-----------------------:|:-----------:|
| cor()       | x       | x        | x       |         |    | x                        |             |
| cor.test()  | x       | x        | x       | x       | x  |                          |             |
| corr.test() | x       | x        | x       | x       | x  | x                        | x           |
| rcorr()     | x       | x        |         | x       |    | x                        | x           |

 

4 Spearman’s correlation coefficient (rs)

4.1 Introduction

  • Spearman’s correlation is a non-parametric statistic - sometimes called Spearman’s rho.

  • It can be used when the data have violated parametric assumptions, such as non-normally distributed data. This happens when at least one of your variables is on an ordinal level of measurement or when the data from one or both variables do not follow normal distributions.

  • Spearman’s test works by first ranking the data, and then applying Pearson’s equation to those ranks.
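This ranking mechanism is easy to verify: Pearson’s r computed on the ranks reproduces Spearman’s rho, since both use midranks for ties. A small made-up example:

x <- c(3, 1, 4, 1, 5, 9, 2, 6)
y <- c(2, 7, 1, 8, 2, 8, 1, 8)
cor(x, y, method = "spearman")              # rho
cor(rank(x), rank(y), method = "pearson")   # identical value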

     

4.2 Practice with R

Steps 1 to 3 are similar to the Pearson’s r test. I will start from step 4.

 

4.2.1 Step 4: Spearman’s correlation coefficient (rs)

4.2.1.1 cor()

cor(x,y, use = “string”, method = “correlation type”)

The use argument specifies how missing values are handled, exactly as described in section 3.3.4.1.

I will demonstrate the option use = "pairwise.complete.obs" through df1:

cor(df1, use = "pairwise.complete.obs", method = "spearman")
##                    age       nihss          sbp      time.iv      aspects
## age        1.000000000 -0.02915220  0.049699462 -0.009632336 -0.010716491
## nihss     -0.029152196  1.00000000 -0.062439507  0.079920020 -0.024695464
## sbp        0.049699462 -0.06243951  1.000000000  0.119992418  0.006328807
## time.iv   -0.009632336  0.07992002  0.119992418  1.000000000  0.032214687
## aspects   -0.010716491 -0.02469546  0.006328807  0.032214687  1.000000000
## time.rand -0.042617822 -0.07793713  0.031786886 -0.030225795 -0.009998565
## time.punc  0.096659241 -0.04526983  0.050351932 -0.135213032 -0.003709137
##               time.rand     time.punc
## age       -0.0426178221  0.0966592407
## nihss     -0.0779371258 -0.0452698337
## sbp        0.0317868859  0.0503519322
## time.iv   -0.0302257950 -0.1352130316
## aspects   -0.0099985648 -0.0037091373
## time.rand  1.0000000000 -0.0004096534
## time.punc -0.0004096534  1.0000000000

 

Discuss

Looking at the result, we can easily notice that each variable is perfectly correlated with itself (obviously), so rs = 1.

Our target variables:

  • For age vs nihss: rs = -0.029 indicates that age is weakly negatively related to nihss. Rs2 = 0.0841% indicates that age shares 0.0841% of the variability in the ranks of nihss.

  • For age vs sbp: rs = 0.05: age is weakly positively related to sbp. Rs2 = 0.25% indicates that age shares 0.25% of the variability in the ranks of sbp.

     

4.2.1.2 cor.test()

cor.test(x, y, alternative = c(“two.sided”, “less”, “greater”), method = c(“pearson”, “kendall”, “spearman”), conf.level = 0.95, …)

cor.test(df1$age, df1$nihss, method = "spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  df1$age and df1$nihss
## S = 21440585, p-value = 0.5155
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.0291522
cor.test(df1$age, df1$sbp, method = "spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  df1$age and df1$sbp
## S = 19679299, p-value = 0.2678
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## 0.04969946

 

Discuss

  • The Spearman correlation coefficients of age and nihss, and of age and sbp, are the same when using cor() and cor.test().

  • The p-values > 0.05: not statistically significant.

  • Note that cor.test() does not report a confidence interval for Spearman’s rho; see the corr.test() output below for CIs.
    => There is no statistically significant relationship between age and nihss, or between age and sbp.

     

4.2.1.3 corr.test()

Package: psych
Command: corr.test( x, y = NULL, use = “pairwise”, method=“spearman”, adjust=“holm”, alpha=.05, ci=TRUE, minlength=5, normal=TRUE)

library(psych)
print(corr.test(df1, method = "spearman"), short = F)
## Call:corr.test(x = df1, method = "spearman")
## Correlation matrix 
##             age nihss   sbp time.iv aspects time.rand time.punc
## age        1.00 -0.03  0.05   -0.01   -0.01     -0.04      0.10
## nihss     -0.03  1.00 -0.06    0.08   -0.02     -0.08     -0.05
## sbp        0.05 -0.06  1.00    0.12    0.01      0.03      0.05
## time.iv   -0.01  0.08  0.12    1.00    0.03     -0.03     -0.14
## aspects   -0.01 -0.02  0.01    0.03    1.00     -0.01      0.00
## time.rand -0.04 -0.08  0.03   -0.03   -0.01      1.00      0.00
## time.punc  0.10 -0.05  0.05   -0.14    0.00      0.00      1.00
## Sample Size 
##           age nihss sbp time.iv aspects time.rand time.punc
## age       500   500 499     445     496       498       233
## nihss     500   500 499     445     496       498       233
## sbp       499   499 499     444     495       497       233
## time.iv   445   445 444     445     441       444       203
## aspects   496   496 495     441     496       494       233
## time.rand 498   498 497     444     494       498       231
## time.punc 233   233 233     203     233       231       233
## Probability values (Entries above the diagonal are adjusted for multiple tests.) 
##            age nihss  sbp time.iv aspects time.rand time.punc
## age       0.00  1.00 1.00    1.00    1.00         1         1
## nihss     0.52  0.00 1.00    1.00    1.00         1         1
## sbp       0.27  0.16 0.00    0.24    1.00         1         1
## time.iv   0.84  0.09 0.01    0.00    1.00         1         1
## aspects   0.81  0.58 0.89    0.50    0.00         1         1
## time.rand 0.34  0.08 0.48    0.53    0.82         0         1
## time.punc 0.14  0.49 0.44    0.05    0.96         1         0
## 
##  Confidence intervals based upon normal theory.  To get bootstrapped values, try cor.ci
##             raw.lower raw.r raw.upper raw.p lower.adj upper.adj
## age-nihss       -0.12 -0.03      0.06  0.52     -0.15      0.09
## age-sbp         -0.04  0.05      0.14  0.27     -0.08      0.18
## age-tim.v       -0.10 -0.01      0.08  0.84     -0.13      0.11
## age-aspct       -0.10 -0.01      0.08  0.81     -0.13      0.11
## age-tm.rn       -0.13 -0.04      0.05  0.34     -0.17      0.09
## age-tm.pn       -0.03  0.10      0.22  0.14     -0.10      0.29
## nihss-sbp       -0.15 -0.06      0.03  0.16     -0.19      0.07
## nihss-tim.v     -0.01  0.08      0.17  0.09     -0.06      0.22
## nihss-aspct     -0.11 -0.02      0.06  0.58     -0.14      0.10
## nihss-tm.rn     -0.16 -0.08      0.01  0.08     -0.21      0.06
## nihss-tm.pn     -0.17 -0.05      0.08  0.49     -0.23      0.14
## sbp-tim.v        0.03  0.12      0.21  0.01     -0.02      0.26
## sbp-aspct       -0.08  0.01      0.09  0.89     -0.10      0.11
## sbp-tm.rn       -0.06  0.03      0.12  0.48     -0.09      0.16
## sbp-tm.pn       -0.08  0.05      0.18  0.44     -0.14      0.24
## tim.v-aspct     -0.06  0.03      0.13  0.50     -0.10      0.17
## tim.v-tm.rn     -0.12 -0.03      0.06  0.53     -0.16      0.10
## tim.v-tm.pn     -0.27 -0.14      0.00  0.05     -0.34      0.08
## aspct-tm.rn     -0.10 -0.01      0.08  0.82     -0.13      0.11
## aspct-tm.pn     -0.13  0.00      0.12  0.96     -0.15      0.14
## tm.rn-tm.pn     -0.13  0.00      0.13  1.00     -0.13      0.13

 

Discuss

  • With corr.test(), we get rs, the sample size, the p-value and the CI.

  • The sample size lets us know how many observations were used for testing the association between each pair of variables (because of the missing values).

  • The p-values: look at the entries below the diagonal for the raw (unadjusted) values; the diagonal itself is 0.

  • The results are mostly as we expected.
    => There is no statistically significant relationship between age and nihss, or between age and sbp.

  • However, for time.iv and sbp: rs = 0.12, raw p-value = 0.01, and the raw CI does not cross 0: statistically significant. This differs from Pearson’s r because time.iv is non-normally distributed, so the Spearman result is more trustworthy here than the Pearson one.
    => There is a significant positive relationship between time.iv and sbp: as time.iv increases, sbp increases as well. Furthermore, time.iv accounts for 0.12^2 = 1.44% of the variability in sbp.

     

4.2.1.4 rcorr()

Package: Hmisc

Command: rcorr(x,y, type = “correlation type”)

library(Hmisc)
rcorr(as.matrix(df1), type = "spearman")
##             age nihss   sbp time.iv aspects time.rand time.punc
## age        1.00 -0.03  0.05   -0.01   -0.01     -0.04      0.10
## nihss     -0.03  1.00 -0.06    0.08   -0.02     -0.08     -0.05
## sbp        0.05 -0.06  1.00    0.12    0.01      0.03      0.05
## time.iv   -0.01  0.08  0.12    1.00    0.03     -0.03     -0.14
## aspects   -0.01 -0.02  0.01    0.03    1.00     -0.01      0.00
## time.rand -0.04 -0.08  0.03   -0.03   -0.01      1.00      0.00
## time.punc  0.10 -0.05  0.05   -0.14    0.00      0.00      1.00
## 
## n
##           age nihss sbp time.iv aspects time.rand time.punc
## age       500   500 499     445     496       498       233
## nihss     500   500 499     445     496       498       233
## sbp       499   499 499     444     495       497       233
## time.iv   445   445 444     445     441       444       203
## aspects   496   496 495     441     496       494       233
## time.rand 498   498 497     444     494       498       231
## time.punc 233   233 233     203     233       231       233
## 
## P
##           age    nihss  sbp    time.iv aspects time.rand time.punc
## age              0.5155 0.2678 0.8394  0.8118  0.3426    0.1413   
## nihss     0.5155        0.1637 0.0922  0.5832  0.0823    0.4917   
## sbp       0.2678 0.1637        0.0114  0.8883  0.4795    0.4443   
## time.iv   0.8394 0.0922 0.0114         0.4998  0.5253    0.0544   
## aspects   0.8118 0.5832 0.8883 0.4998          0.8246    0.9551   
## time.rand 0.3426 0.0823 0.4795 0.5253  0.8246            0.9951   
## time.punc 0.1413 0.4917 0.4443 0.0544  0.9551  0.9951

 

Discuss: The result of rcorr() is similar to corr.test().

 

4.2.2 Step 5: Draw a scatter plot for visualization

Command: ggplot(data, aes(x, y)) + geom_point() + geom_smooth(method = "lm", colour = "Red") + labs(x = "", y = "", title = " ") + theme(plot.title = element_text(hjust = 0.5))

Example 1: Plot age vs nihss

ggplot(df1, aes(age, nihss)) + 
  geom_point() + 
  geom_smooth(method = "lm", colour = "Red") + 
  labs(x = "age", y = "nihss", title = " ") + 
  theme(plot.title = element_text(hjust = 0.5))

 

Example 2: Plot age vs sbp

ggplot(df1, aes(age, sbp)) + 
  geom_point() + 
  geom_smooth(method = "lm", colour = "Red") + 
  labs(x = "age", y = "sbp", title = " ") + 
  theme(plot.title = element_text(hjust = 0.5))

 

Summary

  • All 4 commands produce similar results for Spearman’s correlation coefficient rs.

  • However, I prefer corr.test() > rcorr() > cor.test() > cor() because of the information provided in the result sections.

  • From rs, calculate the coefficient of determination Rs2: X shares n% of the variability in Y, or X accounts for only n% of the variation in Y.

  • r, R2 cannot be used to infer causal relationships: X does not necessarily cause the variation in Y.

     

5 Kendall’s tau (τ)

5.1 Introduction

  • Another non-parametric correlation.

  • Should be used rather than Spearman’s coefficient when you have a small data set with a large number of tied ranks.

  • Although Spearman’s statistic is the more popular of the two coefficients, there is much to suggest that Kendall’s statistic is actually a better estimate of the correlation in the population.
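For instance, in this data set aspects takes only whole-number values from 0 to 10 across roughly 500 patients, so tied ranks are plentiful; a quick check (sketch):

table(df1$aspects)   # frequency of each score: many observations share the same value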

     

5.2 Practice with R

Steps 1 to 3 are similar to the Pearson’s r test. I will start from step 4.

5.2.1 Step 4: Kendall’s tau (τ)

5.2.1.1 cor()

cor(x,y, use = “string”, method = “correlation type”)

cor(df1, use = "pairwise.complete.obs", method = "kendall")
##                    age       nihss          sbp      time.iv      aspects
## age        1.000000000 -0.01870296  0.035677034 -0.006439801 -0.008510516
## nihss     -0.018702958  1.00000000 -0.042009528  0.053594262 -0.018916012
## sbp        0.035677034 -0.04200953  1.000000000  0.081829814  0.005073658
## time.iv   -0.006439801  0.05359426  0.081829814  1.000000000  0.023811504
## aspects   -0.008510516 -0.01891601  0.005073658  0.023811504  1.000000000
## time.rand -0.028191859 -0.05425169  0.021044476 -0.020902271 -0.006315963
## time.punc  0.066117986 -0.03196113  0.033534926 -0.091000978 -0.003156532
##              time.rand    time.punc
## age       -0.028191859  0.066117986
## nihss     -0.054251690 -0.031961132
## sbp        0.021044476  0.033534926
## time.iv   -0.020902271 -0.091000978
## aspects   -0.006315963 -0.003156532
## time.rand  1.000000000 -0.005674725
## time.punc -0.005674725  1.000000000

 

5.2.1.2 cor.test()

cor.test(x, y, alternative = c(“two.sided”, “less”, “greater”), method = c(“pearson”, “kendall”, “spearman”), conf.level = 0.95, …)

cor.test(df1$age, df1$nihss, method = "kendall")
## 
##  Kendall's rank correlation tau
## 
## data:  df1$age and df1$nihss
## z = -0.6011, p-value = 0.5478
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##         tau 
## -0.01870296
cor.test(df1$age, df1$sbp, method = "kendall")
## 
##  Kendall's rank correlation tau
## 
## data:  df1$age and df1$sbp
## z = 1.1748, p-value = 0.2401
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##        tau 
## 0.03567703

 

5.2.1.3 corr.test()

Package: psych
Command: corr.test( x, y = NULL, use = “pairwise”, method=“kendall”, adjust=“holm”, alpha=.05, ci=TRUE, minlength=5, normal=TRUE)

library(psych)
print(corr.test(df1, method = "kendall"), short = F)
## Call:corr.test(x = df1, method = "kendall")
## Correlation matrix 
##             age nihss   sbp time.iv aspects time.rand time.punc
## age        1.00 -0.02  0.04   -0.01   -0.01     -0.03      0.07
## nihss     -0.02  1.00 -0.04    0.05   -0.02     -0.05     -0.03
## sbp        0.04 -0.04  1.00    0.08    0.01      0.02      0.03
## time.iv   -0.01  0.05  0.08    1.00    0.02     -0.02     -0.09
## aspects   -0.01 -0.02  0.01    0.02    1.00     -0.01      0.00
## time.rand -0.03 -0.05  0.02   -0.02   -0.01      1.00     -0.01
## time.punc  0.07 -0.03  0.03   -0.09    0.00     -0.01      1.00
## Sample Size 
##           age nihss sbp time.iv aspects time.rand time.punc
## age       500   500 499     445     496       498       233
## nihss     500   500 499     445     496       498       233
## sbp       499   499 499     444     495       497       233
## time.iv   445   445 444     445     441       444       203
## aspects   496   496 495     441     496       494       233
## time.rand 498   498 497     444     494       498       231
## time.punc 233   233 233     203     233       231       233
## Probability values (Entries above the diagonal are adjusted for multiple tests.) 
##            age nihss  sbp time.iv aspects time.rand time.punc
## age       0.00  1.00 1.00    1.00    1.00      1.00         1
## nihss     0.68  0.00 1.00    1.00    1.00      1.00         1
## sbp       0.43  0.35 0.00    1.00    1.00      1.00         1
## time.iv   0.89  0.26 0.09    0.00    1.00      1.00         1
## aspects   0.85  0.67 0.91    0.62    0.00      1.00         1
## time.rand 0.53  0.23 0.64    0.66    0.89      0.00         1
## time.punc 0.31  0.63 0.61    0.20    0.96      0.93         0
## 
##  Confidence intervals based upon normal theory.  To get bootstrapped values, try cor.ci
##             raw.lower raw.r raw.upper raw.p lower.adj upper.adj
## age-nihss       -0.11 -0.02      0.07  0.68     -0.14      0.10
## age-sbp         -0.05  0.04      0.12  0.43     -0.10      0.17
## age-tim.v       -0.10 -0.01      0.09  0.89     -0.13      0.12
## age-aspct       -0.10 -0.01      0.08  0.85     -0.13      0.11
## age-tm.rn       -0.12 -0.03      0.06  0.53     -0.16      0.10
## age-tm.pn       -0.06  0.07      0.19  0.31     -0.13      0.26
## nihss-sbp       -0.13 -0.04      0.05  0.35     -0.17      0.09
## nihss-tim.v     -0.04  0.05      0.15  0.26     -0.09      0.19
## nihss-aspct     -0.11 -0.02      0.07  0.67     -0.14      0.10
## nihss-tm.rn     -0.14 -0.05      0.03  0.23     -0.19      0.08
## nihss-tm.pn     -0.16 -0.03      0.10  0.63     -0.22      0.16
## sbp-tim.v       -0.01  0.08      0.17  0.09     -0.06      0.22
## sbp-aspct       -0.08  0.01      0.09  0.91     -0.10      0.11
## sbp-tm.rn       -0.07  0.02      0.11  0.64     -0.10      0.15
## sbp-tm.pn       -0.10  0.03      0.16  0.61     -0.16      0.22
## tim.v-aspct     -0.07  0.02      0.12  0.62     -0.11      0.16
## tim.v-tm.rn     -0.11 -0.02      0.07  0.66     -0.15      0.11
## tim.v-tm.pn     -0.23 -0.09      0.05  0.20     -0.30      0.12
## aspct-tm.rn     -0.09 -0.01      0.08  0.89     -0.12      0.11
## aspct-tm.pn     -0.13  0.00      0.13  0.96     -0.13      0.13
## tm.rn-tm.pn     -0.13 -0.01      0.12  0.93     -0.16      0.15

 

Discuss

  • With corr.test(), we get r, the sample sizes, the p-values and the CIs in a single output.

  • Sample size: tells us how many observations were used to test each pairwise association (the counts differ between pairs because of missing values).

  • P-values: read the unadjusted p-values below the diagonal; the entries above the diagonal are adjusted for multiple tests.

  • The results are just as we expected.

  • The p-value for time.iv vs sbp is 0.09 and its CI crosses 0: non-significant. This result differs from Spearman’s rs.

     

5.2.2 Step 5: Draw a scatter plot for visualization

Command: ggplot(data, aes(x, y)) + geom_point() + geom_smooth(method = "lm", colour = "Red") + labs(x = "", y = "", title = "") + theme(plot.title = element_text(hjust = 0.5))

Example 1: Plot age vs nihss

ggplot(df1, aes(age, nihss)) + 
  geom_point() + 
  geom_smooth(method = "lm", colour = "Red") + 
  labs(x = "age", y = "nihss", title = " ") + 
  theme(plot.title = element_text(hjust = 0.5))

 

Example 2: Plot age vs sbp

ggplot(df1, aes(age, sbp)) + 
  geom_point() + 
  geom_smooth(method = "lm", colour = "Red") + 
  labs(x = "age", y = "sbp", title = " ") + 
  theme(plot.title = element_text(hjust = 0.5))

 

Summary

  • All three commands produce similar results for Kendall’s correlation coefficient τ.

  • However, I prefer corr.test() > cor.test() > cor() because of the amount of information provided in the output.

     

6 Bootstrapping correlations

6.1 Introduction

Another way to deal with data that do not meet the assumptions of Pearson’s r is to use bootstrapping. The boot() function takes the general form:

library(boot)
object <- boot(data, function, replications)

  • data: the dataframe to be used.

  • function: a function that you write to tell boot() what you want to bootstrap.

  • replications: a number specifying how many bootstrap samples you want to take (I usually set this value to 2000).

     

6.2 Practice with R

Steps 1 to 3 are similar to those for Pearson’s r test, so I will start from the final step, step 4.

Example: Bootstrap 2000 times, apply Kendall for age vs nihss:

# Write the statistic function: Kendall's tau between age and nihss for bootstrap sample i
bootTau <- function(df1, i) cor(df1$age[i], df1$nihss[i], use = "pairwise.complete.obs", method = "kendall")
library(boot)
boot_kendall <- boot(df1, bootTau, 2000)
boot_kendall
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = df1, statistic = bootTau, R = 2000)
## 
## 
## Bootstrap Statistics :
##        original        bias    std. error
## t1* -0.01870296 -0.0004003333  0.03193298
# To get the 95% confidence interval: 
boot.ci(boot_kendall, conf = 0.95)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 2000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = boot_kendall, conf = 0.95)
## 
## Intervals : 
## Level      Normal              Basic         
## 95%   (-0.0809,  0.0443 )   (-0.0818,  0.0449 )  
## 
## Level     Percentile            BCa          
## 95%   (-0.0823,  0.0444 )   (-0.0813,  0.0451 )  
## Calculations and Intervals on Original Scale

 

Explain the formula:

  • Creates an object called bootTau.

  • The first part of the function tells R what input to expect: we feed in a dataframe (df1) and a variable called i, which refers to a particular bootstrap sample.

  • The second part of the function specifies the cor() function, which is the thing we want to bootstrap.

  • cor() is specified in exactly the same way as usual, except that each variable has [i] added, which selects the observations belonging to a particular bootstrap sample.

  • If you want to bootstrap a Pearson or Spearman correlation, you do it in exactly the same way, except that you specify method = "pearson" or method = "spearman" when you define the function (see the sketch below).
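
For example, a minimal sketch of the same bootstrap for Spearman’s rho, assuming df1 and the boot package are already loaded as above (bootRho is a hypothetical name):

bootRho <- function(df1, i) cor(df1$age[i], df1$nihss[i], use = "pairwise.complete.obs", method = "spearman")
boot_rho <- boot(df1, bootRho, 2000)
boot.ci(boot_rho, conf = 0.95)   # bootstrapped 95% CIs for Spearman's rho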

     

Discuss the result

Output of boot_kendall:

  • The original value of Kendall’s tau = -0.0187.
  • We also get an estimate of the bias in that value (very small here) and the standard error (0.032) based on the bootstrap samples.

Output from boot.ci():

  • Four different confidence intervals (normal, basic bootstrapped, percentile and BCa). All of them cross zero, indicating a non-significant result.

=> age and nihss show a weak correlation that is not statistically significant.

 

7 Biserial, point-biserial and phi correlations

7.1 Introduction

Extremely important: these can be considered special cases of Pearson’s r.

  • The point-biserial correlation coefficient (rpb) is used when one variable is a discrete dichotomy - only A or B (e.g., pregnancy), with no continuum between the two categories. The point-biserial correlation is equivalent to calculating the Pearson correlation between a continuous and a dichotomous variable, so we can just use the standard cor.test() function in R, which outputs the correlation, a 95% confidence interval, and an independent t-test with its associated p-value.

  • The biserial correlation coefficient (rb) is used when one variable is a continuous dichotomy, i.e. there is an underlying continuum between the categories (e.g., passing or failing an exam by a large or small margin). In general, rpb < rb.

  • The phi coefficient is a measure of association between two binary variables (e.g., living/dead, black/white, success/failure). The correlation is based on frequency counts. The phi coefficient is identical to the Pearson coefficient in the case of a 2 x 2 data set, so we can still use the cor.test() function to calculate it.

7.2 Practice with R

7.2.1 Step 1: Data Wrangling

  1. Load the data into R and examine some basic information about the variables. When loading the data, I also define the type of each variable (numeric or text):
data <- read_excel("D:/Statistics/R/R data/fakestroke.xlsx", 
                 col_types = c("text", "text", "numeric",
                               "text", "numeric", "text", "text", 
                               "numeric", "numeric", "text", "numeric", 
                               "text", "numeric", "numeric", "text", 
                               "numeric", "numeric", "numeric")) 
names(data)
##  [1] "studyid"   "trt"       "age"       "sex"       "nihss"     "location" 
##  [7] "hx.isch"   "afib"      "dm"        "mrankin"   "sbp"       "iv.altep" 
## [13] "time.iv"   "aspects"   "ia.occlus" "extra.ica" "time.rand" "time.punc"
str(data)
## tibble [500 x 18] (S3: tbl_df/tbl/data.frame)
##  $ studyid  : chr [1:500] "z001" "z002" "z003" "z004" ...
##  $ trt      : chr [1:500] "Control" "Intervention" "Control" "Control" ...
##  $ age      : num [1:500] 53 51 68 28 91 34 75 89 75 26 ...
##  $ sex      : chr [1:500] "Male" "Male" "Female" "Male" ...
##  $ nihss    : num [1:500] 21 23 11 22 24 18 25 18 25 27 ...
##  $ location : chr [1:500] "Right" "Left" "Right" "Left" ...
##  $ hx.isch  : chr [1:500] "No" "No" "No" "No" ...
##  $ afib     : num [1:500] 0 1 0 0 0 0 0 0 1 0 ...
##  $ dm       : num [1:500] 0 0 0 0 0 0 0 0 0 0 ...
##  $ mrankin  : chr [1:500] "2" "0" "0" "0" ...
##  $ sbp      : num [1:500] 127 137 138 122 162 166 140 157 129 143 ...
##  $ iv.altep : chr [1:500] "Yes" "Yes" "No" "Yes" ...
##  $ time.iv  : num [1:500] 63 68 NA 78 121 78 97 NA 49 99 ...
##  $ aspects  : num [1:500] 10 10 10 10 8 5 10 9 6 10 ...
##  $ ia.occlus: chr [1:500] "M1" "M1" "ICA with M1" "ICA with M1" ...
##  $ extra.ica: num [1:500] 0 1 1 1 0 0 0 0 0 0 ...
##  $ time.rand: num [1:500] 139 118 178 160 214 154 122 147 271 141 ...
##  $ time.punc: num [1:500] NA 281 NA NA NA NA 268 NA NA 259 ...

 

  2. Select and convert to numeric all discrete categorical variables:

In this example, I just want to examine sex, nihss and location:

  • sex: 1 = Female; 2 = Male.
  • location: 1 = Right; 2 = Left.
# Select discrete categorical + numeric variables:
df2 <- data %>% select(age, sex, nihss, location)

# Convert discrete categorical variables to numeric (note: labels = c(0, 1) would still end up as 1:2 after as.numeric(), but the correlation result is the same whether the coding is 0/1 or 1/2):
## Hard way but can solve 1:2 problem:
### df2$sex[df2$sex == "Female"] <- 0
### df2$sex[df2$sex == "Male"] <- 1
### df2$sex = as.numeric(df2$sex)

### df2$location[df2$location == "Right"] <- 0
### df2$location[df2$location == "Left"] <- 1
### df2$location = as.numeric(df2$location) 

## Convenient way:  
df2$sex = factor(df2$sex, levels = c("Female","Male"), labels = c(1, 2)) %>%  as.numeric()

df2$location = factor(df2$location, levels = c("Right","Left"), labels = c(1, 2))  %>% as.numeric()

# Checking after converting
str(df2) 
## tibble [500 x 4] (S3: tbl_df/tbl/data.frame)
##  $ age     : num [1:500] 53 51 68 28 91 34 75 89 75 26 ...
##  $ sex     : num [1:500] 2 2 1 2 2 1 2 1 2 1 ...
##  $ nihss   : num [1:500] 21 23 11 22 24 18 25 18 25 27 ...
##  $ location: num [1:500] 1 2 1 2 1 2 1 1 2 1 ...

 

Discuss: The data are now ready for the next step. We still need Step 2 (missingness checking), but we don’t need Step 3 (normal distribution checking).

 

7.2.2 Step 2: Missingness checking

Similar to Step 2 in Pearson’s r.

 

7.2.3 Step 3: Calculate Correlation

Example 1: Point-biserial correlation between sex and nihss

cor.test(df2$sex, df2$nihss)
## 
##  Pearson's product-moment correlation
## 
## data:  df2$sex and df2$nihss
## t = 0.47402, df = 498, p-value = 0.6357
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.06657791  0.10872477
## sample estimates:
##        cor 
## 0.02123666
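
As noted in the introduction, the point-biserial test is equivalent to an independent t-test on the continuous variable across the two groups; a quick sketch to check this (the sign of t depends only on which group is coded first):

t.test(nihss ~ sex, data = df2, var.equal = TRUE)   # same |t| = 0.474, df = 498 and p = 0.64 as above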

 

Discuss

  • The point-biserial correlation coefficient rpb = 0.02 is very close to 0 => a negligible positive correlation.

  • rpb is non-significant (p = 0.64).

  • R2 = 0.0004 = 0.04%. Hence, we can conclude that sex accounts for only 0.04% of the variability in nihss.
    => There is no significant relationship between sex and nihss.

     

Example 2: Biserial correlation between sex and nihss. Although sex is not a continuous dichotomy, I will still use this variable for demonstration purposes:

  1. Calculate p, q, y:
a = table(df2$sex)
prop.table(a)
## 
##     1     2 
## 0.416 0.584

To calculate p, q and y, we use these proportions together with the values of the standard normal distribution from a z-table. Because the exact proportions 0.584 and 0.416 are not in the table, we use the nearest tabulated values instead:

  • p = 0.587
  • q = 0.413
  • y = 0.389

  2. Calculate rb:

Using the polyserial() function from the polycor package below, or calculating by hand with the formula:
rb = rpb x sqrt(p * q) / y

library(polycor)
polycor::polyserial(df2$sex, df2$nihss)
## [1] 0.02164936
  3. Calculate the SE of rb:
    SE = sqrt(p * q) / (y * sqrt(N)) = sqrt(0.587 * 0.413) / (0.389 * sqrt(500))
    SE = 0.057

  4. Calculate z:
    z = rb / SE = 0.021 / 0.057 = 0.37

  5. Look up z in the standard normal distribution for the one-tailed probability in the column “Smaller Portion” => p one-tail = 0.35569 => p two-tail = 0.71 > 0.05: not statistically significant (the small R sketch below reproduces this lookup with pnorm()).
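
Instead of a printed z-table, the same lookup can be done directly in R; a minimal sketch using the values computed above:

rb <- 0.0216                                      # biserial r from polycor::polyserial() above
se <- sqrt(0.587 * 0.413) / (0.389 * sqrt(500))   # standard error, N = 500
z  <- rb / se                                     # z ~ 0.38
2 * pnorm(-abs(z))                                # two-tailed p ~ 0.70: not significant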

 

Example 3: Phi correlation between sex and location

cor.test(df2$sex, df2$location)
## 
##  Pearson's product-moment correlation
## 
## data:  df2$sex and df2$location
## t = 0.34586, df = 498, p-value = 0.7296
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.07229239  0.10304699
## sample estimates:
##        cor 
## 0.01549643

 

Discuss

  • The phi correlation coefficient = 0.015 is very close to 0 => a negligible positive correlation.

  • Phi is non-significant (p = 0.73).

  • R2 = 0.0002 = 0.02%. Hence, we can conclude that sex accounts for only 0.02% of the variability in location.
    => No significant relationship between sex and location.
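
Since phi is based on frequency counts, it can also be computed straight from the 2 x 2 table; a short sketch assuming the psych package loaded earlier:

tab <- table(df2$sex, df2$location)   # 2 x 2 frequency table
psych::phi(tab)                       # ~ 0.02, matching the Pearson r on the 1/2 codes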

     

7.2.4 Step 4: Drawing scatterplot for visualization

Command: ggplot(data, aes(x, y)) + geom_point() + geom_smooth(colour = "Red") + labs(x = "", y = "", title = "") + theme(plot.title = element_text(hjust = 0.5))

Example: Plot scatterplot for sex and nihss

ggplot(df2, aes(nihss, sex)) + 
  geom_point() + 
  geom_smooth(colour = "Red") + 
  labs(x = "nihss", y = "sex", title = " ") + 
  theme(plot.title = element_text(hjust = 0.5))

 

8 Partial correlation

8.1 Introduction

A correlation between two variables in which the effects of other variables are held constant is known as a partial correlation.

If we remove the portion of variation that is also shared by location, we get a measure of the unique relationship between sex and nihss. We use partial correlations to find out the size of this unique portion of variance. Therefore, we could conduct a partial correlation between sex and nihss while ‘controlling’ for the effect of location.

Note:

  1. Neither the partial nor the semi-partial correlation calculation allows missing values, so we need to prepare a clean data set with all missing values removed.

  2. These partial correlations can be done even when some of the variables are dichotomous (including the ‘third’ variable); the variables can be continuous or discrete.

  • So, for example, we could look at the relationship between bladder relaxation (did the person wet themselves or not?) and the number of large tarantulas crawling up your leg, controlling for fear of spiders (the first variable is dichotomous, but the second and the ‘controlled for’ variables are continuous).

  • Equally, the first variable can be continuous while the second and the ‘controlled for’ variables are dichotomous.

     

Package: ggm
Command:
* pc = ggm::pcor(c("var1", "var2", "control1", "control2", etc.), var(dataframe))
* pc^2
* ggm::pcor.test(pc, number of control variables, sample size)

 

Another way:
Package: ppcor
Command: ppcor::pcor.test(df2$sex, df2$nihss, df2$location, method = "pearson")

8.2 Practice in R

Example: Partial correlation for sex vs nihss, control for the effect of location:

# Ensure the variables have no missing values:
df2 = na.omit(df2)

# Calculate the partial correlation between sex and nihss while controlling for location
## First way:
library(ggm)
pc = ggm::pcor(c("sex", "nihss", "location"), var(df2))
pc
## [1] 0.0225065
pc^2
## [1] 0.0005065427
ggm::pcor.test(pc, 1, 500)
## $tval
## [1] 0.5018758
## 
## $df
## [1] 497
## 
## $pvalue
## [1] 0.6159772
## Another way:
library(ppcor)
ppcor::pcor.test(df2$sex,df2$nihss,df2$location, method="pearson")
##    estimate   p.value statistic   n gp  Method
## 1 0.0225065 0.6159772 0.5018758 500  1 pearson

 

Discuss
The partial correlation between sex and nihss, controlling for the effect of location:

  • The partial correlation between sex and nihss is 0.022, slightly larger than the correlation when the effect of location is not controlled for (r = 0.021).
  • This correlation is not statistically significant (its p-value of 0.62 > 0.05).
  • The value of R2 for the partial correlation is 0.022^2 = 0.0005, which means that sex can now account for only 0.05% of the variance in nihss. When the effect of location was not controlled for, sex shared 0.021^2 = 0.00044, i.e. about 0.04% of the variation in nihss.

=> Controlling for location slightly increased the amount of variation in nihss shared by sex.
=> Sex alone explains only a very small part of the variation in nihss.
=> The weak relationship between sex and nihss is not attributable to location.

 

9 Semi-partial Correlation

9.1 Introduction

Semi-partial correlation: the correlation between all of Y and the part of X that is independent of Z; that is, the effect of Z has been removed from X but not from Y.

9.2 Practice with R

Example: Semi-partial correlation for sex vs nihss, controlling for the effect of location:

# Ensure the variables have no missing values:
df2 = na.omit(df2)

# Calculate the semi-partial correlation between sex and nihss while controlling for location

library(ppcor)
spcor.test(df2$sex,df2$nihss,df2$location, method="pearson")
##    estimate   p.value statistic   n gp  Method
## 1 0.0225038 0.6160195 0.5018155 500  1 pearson
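
Discuss: The semi-partial correlation (0.0225, p = 0.62) is essentially identical to the partial correlation here, and it is likewise not statistically significant.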

 

10 Rank-biserial correlation (rrb)

10.1 Introduction

The rank-biserial correlation coefficient, rrb, is used for dichotomous nominal data vs rankings (ordinal).

Package: rcompanion.
Command: wilcoxonRG(x = ordinal variable, g = nominal variable, verbose=TRUE).

 

10.2 Practice with R

Example: Rank-biserial correlation for sex vs aspects:

# Create a df3 dataframe with sex and aspects:
df3 = data %>% select(sex, aspects) %>% na.omit()
df3$sex = factor(df3$sex, levels = c("Female","Male"), labels = c(1, 2)) %>% as.numeric()

# Run the Rank Biserial rrb:
library(rcompanion)
wilcoxonRG(x = df3$aspects, g = df3$sex, verbose=TRUE, ci = T, R = 1000)
## 
## Levels: 1 2
## n for 1 = 208
## n for 2 = 288
## Mean of ranks for 1 = 247.1851
## Mean of ranks for 2 = 249.4497
## Difference in mean of ranks = -2.264557
## Total n = 496
## 2 * difference / total n = -0.00913
##         rg lower.ci upper.ci
## 1 -0.00913   -0.107   0.0977
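
Discuss: rg = -0.009 is essentially zero, and its bootstrapped 95% CI (-0.107, 0.098) crosses zero => no meaningful association between sex and aspects.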