knitr::opts_chunk$set(
echo = TRUE,
message = FALSE,
warning = FALSE
)
Previous sections:
Note: This analysis is for my own study purposes. In this section, I summarize some basic commands for correlation analysis in the R environment, based on several online sources.
R Packages:
# More packages will be shown during the analysis process
# Load the required library
library(tidyverse) # Data Wrangling
library(conflicted) # Dealing with conflict package
library(readxl) # Read Excel files
Dealing with Conflicts
There are a lot of packages here, and sometimes individual functions are
in conflict. R’s default conflict resolution system gives precedence to
the most recently loaded package. This can make it hard to detect
conflicts, particularly when they are introduced by an update to an
existing package.
conflict_prefer("filter", "dplyr")
conflict_prefer("select", "dplyr")
conflict_prefer("Predict", "rms")
conflict_prefer("impute_median", "simputation")
conflict_prefer("summarize", "dplyr")
First, I will add a table about parametric and non-parametric tests for independent groups. This table gives an overview of which approach is suitable for which variables. In this practice, I will focus on correlation analysis:
Interpreting the value of the coefficient:
A coefficient of +1 indicates a perfect positive relationship.
A coefficient of -1 indicates a perfect negative relationship.
A coefficient of zero indicates no linear relationship at all.
The correlation coefficient is a commonly used measure of the size of an effect:
± 0.1: a small effect.
± 0.3: a medium effect.
± 0.5: a large effect.
However, if you can, try to interpret the size of correlation within the context of the research you’ve done rather than blindly following these benchmarks.
Correlation coefficients give no indication of the direction of causality, for two reasons:
The third-variable problem: causality between two variables cannot be assumed because there may be other measured or unmeasured variables affecting the result.
Direction of causality: even if we could ignore the third-variable problem, the correlation coefficient doesn’t indicate in which direction causality operates.
Two types of correlation:
A bivariate correlation: a correlation between two variables (as described at the beginning of this chapter), e.g., Pearson’s product-moment correlation coefficient, Spearman’s rho and Kendall’s tau.
A partial correlation: the relationship between two variables while ‘controlling’ for the effect of one or more additional variables.
Extremely important:
We cannot make direct conclusions about causality from a
correlation.
Coefficient of determination, R2: a measure of the amount of variability in one variable that is shared by the other.
Example: the correlation coefficient (r) of X and Y is -0.4410, so R2 = (-0.4410)^2 = 0.194. This value tells us how much of the variability in Y is shared by X:
=> X shares 19.4% of the variability in Y.
=> Or: X accounts for only 19.4% of the variation in Y.
=> This leaves 80.6% of the variability still to be accounted for by other variables.
R2 cannot be used to infer causal relationships: X might well share 19.4% of the variation in Y, but it does not necessarily cause this variation.
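A quick sketch of that arithmetic in R:
r <- -0.4410
r^2       # 0.1945 -> X shares ~19.4% of the variability in Y
1 - r^2   # 0.8055 -> ~80.6% is left to other variables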
Correlation coefficients are effect sizes in themselves: r can be reported directly, with no further calculation.
For Pearson’s r, we can square the value to get the proportion of shared variance, R2.
For Spearman’s rs we can do this too, because it uses the same equation as Pearson’s r. However, the resulting Rs2 is the proportion of variance in the ranks that the two variables share. Having said this, Rs2 is usually a good approximation for R2 (especially in conditions of near-normal distributions).
Kendall’s τ, however, is not numerically similar to either r or rs, so τ2 does not tell us about the proportion of variance shared by two variables or by the ranks of those two variables. Kendall’s τ is roughly 66–75% of the size of both Spearman’s rs and Pearson’s r, whereas r and rs are generally of similar size.
=> If τ is used as an effect size, bear in mind that it is not comparable to r and rs and should not be squared.
Point-biserial and biserial correlations differ in size too (the biserial correlation is bigger than the point-biserial). => Be careful to decide whether your dichotomous variable has an underlying continuum or is a truly discrete variable.
There are 4 main functions that you can apply to calculate a correlation coefficient:
Function | Pearson | Spearman | Kendall | p-value | CI | Multiple correlations? | Sample size |
---|---|---|---|---|---|---|---|
cor() | x | x | x | | | x | |
cor.test() | x | x | x | x | x | | |
corr.test() | x | x | x | x | x | x | x |
rcorr() | x | x | | x | | x | x |
Command: cor(x, y = NULL, use = “everything”, method = c(“pearson”, “kendall”, “spearman”))
x: a numeric variable or dataframe.
y: another numeric variable.
use: set equal to a character string that specifies how missing values are handled. The strings can be:
“everything”: R will output an NA instead of a correlation coefficient for any correlations involving variables containing missing values;
“all.obs”: use all observations and, therefore, returns an error message if there are any missing values in the data;
“complete.obs”: correlations are computed from only cases that are complete for all variables – sometimes known as excluding cases listwise (if a case has a missing value for any variable, then they are excluded from the whole analysis);
“pairwise.complete.obs”: in which correlations between pairs of variables are computed for cases that are complete for those two variables – sometimes known as excluding cases pairwise (if a participant has a score missing for a particular variable or analysis, their data are excluded only from calculations involving the variable for which they have no score).
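A minimal sketch (toy vectors, not the study data) of how these options behave when one value is missing:
x <- c(1, 2, 3, 4, NA)
y <- c(2, 4, 5, 4, 5)
cor(x, y, use = "everything")            # returns NA because x has a missing value
cor(x, y, use = "complete.obs")          # drops the incomplete pair before correlating
cor(x, y, use = "pairwise.complete.obs") # same result here; differs from complete.obs only for matrices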
Test for association between paired samples, using one of Pearson’s product moment correlation coefficient, Kendall’s tau or Spearman’s rho.
Command: cor.test(x, y, alternative = c(“two.sided”, “less”, “greater”), method = c(“pearson”, “kendall”, “spearman”), conf.level = 0.95, …)
x: a numeric variable.
y: another numeric variable.
alternative: whether you want to do a two-tailed test (alternative = “two.sided” - default), or whether you predict that the correlation will be less than zero (i.e., negative) or more than zero (i.e., positive), in which case you can use alternative = “less” and alternative = “greater”, respectively.
method: the same as for cor().
conf.level: specify the width of the confidence interval computed for the correlation. The default is 0.95 (conf.level = 0.95).
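For example, a one-sided test predicting a positive correlation (a sketch with toy vectors):
x <- c(2, 4, 3, 6, 5, 8)
y <- c(1, 3, 4, 5, 7, 8)
cor.test(x, y, alternative = "greater", method = "pearson") # H1: true correlation > 0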
Although the cor() function finds the correlations for a matrix, it does not report probability values. cor.test() does, but for only one pair of variables at a time. corr.test() uses cor() to find the correlations for either complete or pairwise data and reports the sample sizes and probability values as well. For symmetric matrices, raw probabilities are reported below the diagonal and correlations adjusted for multiple comparisons above the diagonal. In the case of different x and ys, the default is to adjust the probabilities for multiple tests.
Package: psych
Command: corr.test( x, y = NULL, use = “pairwise”,
method=“pearson”, adjust=“holm”, alpha=.05, ci=TRUE, minlength=5,
normal=TRUE)
x: a matrix or dataframe.
y: a second matrix or dataframe with the same number of rows as x.
use=“pairwise”: the default value and will do pairwise deletion of cases. use=“complete” will select just complete cases.
method: method=“pearson” is the default value. The alternatives to be passed to cor are “spearman” and “kendall”. These last two are much slower, particularly for big data sets.
adjust: what adjustment for multiple tests should be used? (“holm”, “hochberg”, “hommel”, “bonferroni”, “BH”, “BY”, “fdr”, “none”). See p.adjust for details about why to use “holm” rather than “bonferroni”.
alpha: alpha level of confidence intervals.
rcorr() Computes a matrix of Pearson’s r or Spearman’s rho rank correlation coefficients for all possible pairs of columns of a matrix. Missing values are deleted in pairs rather than deleting all rows of x having any missing variables. Ranks are computed using efficient algorithms, using midranks for ties.
Package: Hmisc
Command: rcorr(x,y, type = “correlation type”)
Note:
The R package corrplot provides a visual exploratory tool for a correlation matrix, supporting automatic variable reordering to help detect hidden patterns among variables. It applies only to the output of the cor() command.
Package: corrplot
Command: corrplot(df)
df = cor(dataset)
Leave blanks for non-significant coefficients and add all correlation coefficients:
Command:
corrplot(df, p.mat = testRes$p, method = 'circle', type = 'lower', insig = 'blank', order = 'AOE', diag = FALSE)$corrPos -> p1
text(p1$x, p1$y, round(p1$corr, 2))
Main data used in this notes: fakestroke.csv
(Source: https://github.com/THOMASELOVE/432-data/blob/master/data/fakestroke.csv)
The fakestroke.csv file contains the following 18 variables for 500 patients:
# To make a table in R Markdown: 1st row: header; 2nd row: alignment; the remaining rows: content
## |Variable | Description |
## |:------- | :---------- |
## |studyid | Study ID # (z001 through z500) |
Variable | Type | Description |
---|---|---|
studyid | Category | Study ID # (z001 through z500) |
trt | Category | Treatment group (Intervention or Control) |
age | Numeric | Age in years |
sex | Category | Male or Female |
nihss | Numeric | NIH Stroke Scale Score (can range from 0-42; higher scores indicate more severe neurological deficits) |
location | Category | Stroke Location - Left or Right Hemisphere |
hx.isch | Category | History of Ischemic Stroke (Yes/No) |
afib | Category | Atrial Fibrillation (1 = Yes, 0 = No) |
dm | Category | Diabetes Mellitus (1 = Yes, 0 = No) |
mrankin | Category | Pre-stroke modified Rankin scale score (0, 1, 2 or > 2) indicating functional disability - complete range is 0 (no symptoms) to 6 (death) |
sbp | Numeric | Systolic blood pressure, in mm Hg |
iv.altep | Category | Treatment with IV alteplase (Yes/No) |
time.iv | Numeric | Time from stroke onset to start of IV alteplase (minutes) if iv.altep=Yes |
aspects | Category | Alberta Stroke Program Early Computed Tomography score, which measures extent of stroke from 0 - 10; higher scores indicate fewer early ischemic changes |
ia.occlus | Category | Intracranial arterial occlusion, based on vessel imaging - five categories |
extra.ica | Category | Extracranial ICA occlusion (1 = Yes, 0 = No) |
time.rand | Numeric | Time from stroke onset to study randomization, in minutes |
time.punc | Numeric | Time from stroke onset to groin puncture, in minutes (only if Intervention) |
Pearson correlation only indicates the degree of linear association between two continuous variables, not causation.
data <- read_excel("D:/Statistics/R/R data/fakestroke.xlsx",
col_types = c("text", "text", "numeric",
"text", "numeric", "text", "text",
"numeric", "numeric", "text", "numeric",
"text", "numeric", "numeric", "text",
"numeric", "numeric", "numeric"))
names(data)
## [1] "studyid" "trt" "age" "sex" "nihss" "location"
## [7] "hx.isch" "afib" "dm" "mrankin" "sbp" "iv.altep"
## [13] "time.iv" "aspects" "ia.occlus" "extra.ica" "time.rand" "time.punc"
str(data)
## tibble [500 x 18] (S3: tbl_df/tbl/data.frame)
## $ studyid : chr [1:500] "z001" "z002" "z003" "z004" ...
## $ trt : chr [1:500] "Control" "Intervention" "Control" "Control" ...
## $ age : num [1:500] 53 51 68 28 91 34 75 89 75 26 ...
## $ sex : chr [1:500] "Male" "Male" "Female" "Male" ...
## $ nihss : num [1:500] 21 23 11 22 24 18 25 18 25 27 ...
## $ location : chr [1:500] "Right" "Left" "Right" "Left" ...
## $ hx.isch : chr [1:500] "No" "No" "No" "No" ...
## $ afib : num [1:500] 0 1 0 0 0 0 0 0 1 0 ...
## $ dm : num [1:500] 0 0 0 0 0 0 0 0 0 0 ...
## $ mrankin : chr [1:500] "2" "0" "0" "0" ...
## $ sbp : num [1:500] 127 137 138 122 162 166 140 157 129 143 ...
## $ iv.altep : chr [1:500] "Yes" "Yes" "No" "Yes" ...
## $ time.iv : num [1:500] 63 68 NA 78 121 78 97 NA 49 99 ...
## $ aspects : num [1:500] 10 10 10 10 8 5 10 9 6 10 ...
## $ ia.occlus: chr [1:500] "M1" "M1" "ICA with M1" "ICA with M1" ...
## $ extra.ica: num [1:500] 0 1 1 1 0 0 0 0 0 0 ...
## $ time.rand: num [1:500] 139 118 178 160 214 154 122 147 271 141 ...
## $ time.punc: num [1:500] NA 281 NA NA NA NA 268 NA NA 259 ...
data <- data %>% mutate_if(is.character, as.factor)
str(data)
## tibble [500 x 18] (S3: tbl_df/tbl/data.frame)
## $ studyid : Factor w/ 500 levels "z001","z002",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ trt : Factor w/ 2 levels "Control","Intervention": 1 2 1 1 1 1 2 1 1 2 ...
## $ age : num [1:500] 53 51 68 28 91 34 75 89 75 26 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 1 2 2 1 2 1 2 1 ...
## $ nihss : num [1:500] 21 23 11 22 24 18 25 18 25 27 ...
## $ location : Factor w/ 2 levels "Left","Right": 2 1 2 1 2 1 2 2 1 2 ...
## $ hx.isch : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ afib : num [1:500] 0 1 0 0 0 0 0 0 1 0 ...
## $ dm : num [1:500] 0 0 0 0 0 0 0 0 0 0 ...
## $ mrankin : Factor w/ 4 levels "> 2","0","1",..: 4 2 2 2 2 4 2 2 4 2 ...
## $ sbp : num [1:500] 127 137 138 122 162 166 140 157 129 143 ...
## $ iv.altep : Factor w/ 2 levels "No","Yes": 2 2 1 2 2 2 2 1 2 2 ...
## $ time.iv : num [1:500] 63 68 NA 78 121 78 97 NA 49 99 ...
## $ aspects : num [1:500] 10 10 10 10 8 5 10 9 6 10 ...
## $ ia.occlus: Factor w/ 6 levels "A1 or A2","ICA with M1",..: 4 4 2 2 4 1 4 4 4 4 ...
## $ extra.ica: num [1:500] 0 1 1 1 0 0 0 0 0 0 ...
## $ time.rand: num [1:500] 139 118 178 160 214 154 122 147 271 141 ...
## $ time.punc: num [1:500] NA 281 NA NA NA NA 268 NA NA 259 ...
# Changing coding variable to categorical:
data$afib <- factor(data$afib , levels=0:1, labels=c("No", "Yes"))
data$dm <- factor(data$dm , levels=0:1, labels=c("No", "Yes"))
data$extra.ica <- factor(data$extra.ica, levels=0:1, labels=c("No", "Yes"))
# Checking after converting
str(data)
## tibble [500 x 18] (S3: tbl_df/tbl/data.frame)
## $ studyid : Factor w/ 500 levels "z001","z002",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ trt : Factor w/ 2 levels "Control","Intervention": 1 2 1 1 1 1 2 1 1 2 ...
## $ age : num [1:500] 53 51 68 28 91 34 75 89 75 26 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 1 2 2 1 2 1 2 1 ...
## $ nihss : num [1:500] 21 23 11 22 24 18 25 18 25 27 ...
## $ location : Factor w/ 2 levels "Left","Right": 2 1 2 1 2 1 2 2 1 2 ...
## $ hx.isch : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ afib : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 2 1 ...
## $ dm : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ mrankin : Factor w/ 4 levels "> 2","0","1",..: 4 2 2 2 2 4 2 2 4 2 ...
## $ sbp : num [1:500] 127 137 138 122 162 166 140 157 129 143 ...
## $ iv.altep : Factor w/ 2 levels "No","Yes": 2 2 1 2 2 2 2 1 2 2 ...
## $ time.iv : num [1:500] 63 68 NA 78 121 78 97 NA 49 99 ...
## $ aspects : num [1:500] 10 10 10 10 8 5 10 9 6 10 ...
## $ ia.occlus: Factor w/ 6 levels "A1 or A2","ICA with M1",..: 4 4 2 2 4 1 4 4 4 4 ...
## $ extra.ica: Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 1 1 1 ...
## $ time.rand: num [1:500] 139 118 178 160 214 154 122 147 271 141 ...
## $ time.punc: num [1:500] NA 281 NA NA NA NA 268 NA NA 259 ...
# Relevel factors:
data$trt = factor(data$trt, levels=c("Intervention", "Control"))
data$mrankin = factor(data$mrankin, levels=c("0", "1", "2", "> 2"))
data$ia.occlus = factor(data$ia.occlus, levels=c("Intracranial ICA", "ICA with M1",
"M1", "M2", "A1 or A2"))
# Check order of factor after relevel:
str(data)
## tibble [500 x 18] (S3: tbl_df/tbl/data.frame)
## $ studyid : Factor w/ 500 levels "z001","z002",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ trt : Factor w/ 2 levels "Intervention",..: 2 1 2 2 2 2 1 2 2 1 ...
## $ age : num [1:500] 53 51 68 28 91 34 75 89 75 26 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 1 2 2 1 2 1 2 1 ...
## $ nihss : num [1:500] 21 23 11 22 24 18 25 18 25 27 ...
## $ location : Factor w/ 2 levels "Left","Right": 2 1 2 1 2 1 2 2 1 2 ...
## $ hx.isch : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ afib : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 2 1 ...
## $ dm : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ mrankin : Factor w/ 4 levels "0","1","2","> 2": 3 1 1 1 1 3 1 1 3 1 ...
## $ sbp : num [1:500] 127 137 138 122 162 166 140 157 129 143 ...
## $ iv.altep : Factor w/ 2 levels "No","Yes": 2 2 1 2 2 2 2 1 2 2 ...
## $ time.iv : num [1:500] 63 68 NA 78 121 78 97 NA 49 99 ...
## $ aspects : num [1:500] 10 10 10 10 8 5 10 9 6 10 ...
## $ ia.occlus: Factor w/ 5 levels "Intracranial ICA",..: 3 3 2 2 3 5 3 3 3 3 ...
## $ extra.ica: Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 1 1 1 ...
## $ time.rand: num [1:500] 139 118 178 160 214 154 122 147 271 141 ...
## $ time.punc: num [1:500] NA 281 NA NA NA NA 268 NA NA 259 ...
# Select only numeric variables:
df1 = data %>% dplyr::select_if(is.numeric)
str(df1)
## tibble [500 x 7] (S3: tbl_df/tbl/data.frame)
## $ age : num [1:500] 53 51 68 28 91 34 75 89 75 26 ...
## $ nihss : num [1:500] 21 23 11 22 24 18 25 18 25 27 ...
## $ sbp : num [1:500] 127 137 138 122 162 166 140 157 129 143 ...
## $ time.iv : num [1:500] 63 68 NA 78 121 78 97 NA 49 99 ...
## $ aspects : num [1:500] 10 10 10 10 8 5 10 9 6 10 ...
## $ time.rand: num [1:500] 139 118 178 160 214 154 122 147 271 141 ...
## $ time.punc: num [1:500] NA 281 NA NA NA NA 268 NA NA 259 ...
Package: VIM
Command: aggr(x, plot = T, prop = T, numbers = F, combined = F, sortVars = T, cex.axis = .7, gap = 3)
For more information, please refer to the following link:
Dealing with missingness - Imputation: https://rpubs.com/minhtri/932007
Checking the percentage of missingness:
library(VIM)
aggr(df1, plot = F, numbers = T, prop = T)
##
## Missings in variables:
## Variable Count
## sbp 1
## time.iv 55
## aspects 4
## time.rand 2
## time.punc 267
aggr(df1, plot = T, numbers = T, prop = T, combined= F, sortVars = T, cex.axis=.7, gap=3)
##
## Variables sorted by number of missings:
## Variable Count
## time.punc 0.534
## time.iv 0.110
## aspects 0.008
## time.rand 0.004
## sbp 0.002
## age 0.000
## nihss 0.000
aggr(df1, plot = T, numbers = T, prop = F, combined= F, sortVars = T, cex.axis=.7, gap=3)
##
## Variables sorted by number of missings:
## Variable Count
## time.punc 267
## time.iv 55
## aspects 4
## time.rand 2
## sbp 1
## age 0
## nihss 0
Discuss
Five of the 7 numeric variables have missing values.
However, the missingness in time.punc and time.iv is due to the experimental design (time.punc applies only to the Intervention group (trt); time.iv applies only to patients with iv.altep = Yes). Therefore, I must exclude them from the missingness analysis.
That leaves 5 variables, in which the proportion of missingness is < 5%.
=> Complete case analysis can be applied.
=> We can ignore the missingness and progress to the next step.
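A quick sketch confirming this: after excluding the two design-driven variables, the remaining per-variable missingness is well below 5%:
# Proportion of missing values per variable, excluding time.punc and time.iv
colMeans(is.na(df1[, c("age", "nihss", "sbp", "aspects", "time.rand")]))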
For Pearson: if you want to establish whether the correlation coefficient is significant, there is one more assumption: the sampling distribution has to be normally distributed.
Mechanism: the Shapiro-Wilk test compares the scores in the sample to a normally distributed set of scores with the same mean and standard deviation.
Limitation: with large sample sizes it is very easy to get significant results from small deviations from normality, so a significant test doesn’t necessarily tell us whether the deviation from normality is enough to bias any statistical procedures that we apply to the data.
=> Plot your data (histograms / Q-Q plots) and try to make an informed decision about the extent of non-normality (and the values of skew and kurtosis).
Command:
* For a numeric variable: shapiro.test(variable)
* For separate groups within one numeric variable: by(numeric variable, group/categorical variable, name of test), as sketched below
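For instance, a sketch running the Shapiro-Wilk test on nihss separately for each sex group in the study data:
# Shapiro-Wilk test of nihss within each level of sex
by(data$nihss, data$sex, shapiro.test)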
Practice: Testing the distribution of 7 numeric variables:
# List names of numeric variables:
names(df1)
## [1] "age" "nihss" "sbp" "time.iv" "aspects" "time.rand"
## [7] "time.punc"
# Using Shapiro test:
shapiro.test(df1$age)
##
## Shapiro-Wilk normality test
##
## data: df1$age
## W = 0.97448, p-value = 1.184e-07
shapiro.test(df1$nihss)
##
## Shapiro-Wilk normality test
##
## data: df1$nihss
## W = 0.96203, p-value = 4.588e-10
shapiro.test(df1$sbp)
##
## Shapiro-Wilk normality test
##
## data: df1$sbp
## W = 0.99841, p-value = 0.9336
shapiro.test(df1$time.iv)
##
## Shapiro-Wilk normality test
##
## data: df1$time.iv
## W = 0.91375, p-value = 3.041e-15
shapiro.test(df1$aspects)
##
## Shapiro-Wilk normality test
##
## data: df1$aspects
## W = 0.8438, p-value < 2.2e-16
shapiro.test(df1$time.rand)
##
## Shapiro-Wilk normality test
##
## data: df1$time.rand
## W = 0.95965, p-value = 1.93e-10
shapiro.test(df1$time.punc)
##
## Shapiro-Wilk normality test
##
## data: df1$time.punc
## W = 0.93577, p-value = 1.435e-08
Discuss
sbp: p-value = 0.9336 > 0.05: normal
distribution.
Other variables: p-value <0.05: non-normal distribution.
We have two ways to draw a Q-Q plot:
Package: ggplot2
Package: ggpubr
Practice: Let’s try the two commands for age - similar plots are drawn:
ggplot(data = df1, aes(sample = age)) +
geom_qq() +
geom_qq_line() +
labs( x = "Theoretical", y = "Sample Quantiles - Age")
library("ggpubr")
ggqqplot(df1$age, ylab = "Age ")
Let’s use the command for the other numeric variables:
ggqqplot(df1$nihss, ylab = "nihss ")
ggqqplot(df1$sbp, ylab = "sbp ")
ggqqplot(df1$time.iv, ylab = "time.iv ")
ggqqplot(df1$aspects, ylab = "aspects ")
ggqqplot(df1$time.rand, ylab = "time.rand ")
ggqqplot(df1$time.punc, ylab = "time.punc ")
Discuss
The Q-Q plot of sbp: normal distribution.
Other variables: non-normal distribution.
=> The results of the Shapiro-Wilk test agree with the Q-Q plots.
=> The assumption of normal distribution is violated.
=> We should use a non-parametric approach: Spearman’s correlation coefficient (rs) or Kendall’s tau (τ).
=> However, for demonstration, I will just continue with the Pearson correlation test.
Let’s get to the main part. We have 4 types of commands: cor(), cor.test(), corr.test() and rcorr(). I will apply them one by one and compare the results for age vs nihss and age vs sbp (where relevant):
cor(x, y, use = "string", method = "correlation type")
where use is a character string specifying how missing values are handled, as listed earlier.
I will demonstrate 3 options of use on df1. However, I prefer use = "pairwise.complete.obs":
cor(df1, use = "everything", method = "pearson")
## age nihss sbp time.iv aspects time.rand time.punc
## age 1.00000000 -0.03376615 NA NA NA NA NA
## nihss -0.03376615 1.00000000 NA NA NA NA NA
## sbp NA NA 1 NA NA NA NA
## time.iv NA NA NA 1 NA NA NA
## aspects NA NA NA NA 1 NA NA
## time.rand NA NA NA NA NA 1 NA
## time.punc NA NA NA NA NA NA 1
cor(df1, use ="complete.obs", method = "pearson")
## age nihss sbp time.iv aspects
## age 1.00000000 -0.037087611 0.065222765 0.018257624 0.0964599084
## nihss -0.03708761 1.000000000 -0.046851536 0.022913105 -0.0070076127
## sbp 0.06522276 -0.046851536 1.000000000 -0.001481055 -0.0588125399
## time.iv 0.01825762 0.022913105 -0.001481055 1.000000000 0.0214545875
## aspects 0.09645991 -0.007007613 -0.058812540 0.021454587 1.0000000000
## time.rand -0.06047223 -0.080975591 0.045421238 0.033862157 0.0076874989
## time.punc 0.09370001 -0.077125919 0.072707526 -0.145815362 0.0006809697
## time.rand time.punc
## age -0.060472227 0.0937000138
## nihss -0.080975591 -0.0771259191
## sbp 0.045421238 0.0727075264
## time.iv 0.033862157 -0.1458153617
## aspects 0.007687499 0.0006809697
## time.rand 1.000000000 -0.0116944381
## time.punc -0.011694438 1.0000000000
cor(df1, use = "pairwise.complete.obs", method = "pearson")
## age nihss sbp time.iv aspects
## age 1.00000000 -0.03376615 0.010030620 -0.02228718 0.008500860
## nihss -0.03376615 1.00000000 -0.038767105 0.06151694 -0.043631014
## sbp 0.01003062 -0.03876710 1.000000000 0.07122853 -0.008519498
## time.iv -0.02228718 0.06151694 0.071228535 1.00000000 0.030083740
## aspects 0.00850086 -0.04363101 -0.008519498 0.03008374 1.000000000
## time.rand -0.02424378 -0.07720899 0.028310877 -0.01314698 0.014546529
## time.punc 0.09933442 -0.04822361 0.064757047 -0.13864373 -0.014564781
## time.rand time.punc
## age -0.024243782 0.099334424
## nihss -0.077208987 -0.048223608
## sbp 0.028310877 0.064757047
## time.iv -0.013146984 -0.138643731
## aspects 0.014546529 -0.014564781
## time.rand 1.000000000 0.005445479
## time.punc 0.005445479 1.000000000
library(corrplot)
a = cor(df1, use ="complete.obs", method = "pearson")
corrplot(a, method = 'number') # colorful number
library(corrplot)
a = cor(df1, use ="complete.obs", method = "pearson")
corrplot(a)
## leave blank on non-significant coefficient
## add all correlation coefficients
library(corrplot)
M = cor(df1, use ="complete.obs", method = "pearson")
testRes = cor.mtest(df1, use ="complete.obs", method = "pearson")
corrplot(M, p.mat = testRes$p, method = 'circle', type = 'lower', insig='blank',
order = 'AOE', diag = FALSE)$corrPos -> p1
text(p1$x, p1$y, round(p1$corr, 2))
Discuss
Looking at the result, we can easily notice that each variable is perfectly correlated with itself (obviously), so r = 1 on the diagonal.
For more visualization options, please visit: https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html
Our target variables:
For age vs nihss: r = -0.03 indicates that age is weakly negatively related to nihss. R2 = 0.09% indicates that age shares 0.09% of the variability in nihss.
For age vs sbp: r = 0.01: age is weakly positively related to sbp. R2 = 0.01% indicates that age shares 0.01% of the variability in sbp.
cor.test(x, y, alternative = c(“two.sided”, “less”, “greater”), method = c(“pearson”, “kendall”, “spearman”), conf.level = 0.95, …)
cor.test(df1$age, df1$nihss)
##
## Pearson's product-moment correlation
##
## data: df1$age and df1$nihss
## t = -0.75395, df = 498, p-value = 0.4512
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.12109817 0.05408458
## sample estimates:
## cor
## -0.03376615
cor.test(df1$age, df1$sbp)
##
## Pearson's product-moment correlation
##
## data: df1$age and df1$sbp
## t = 0.22363, df = 497, p-value = 0.8231
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.07781638 0.09772306
## sample estimates:
## cor
## 0.01003062
Discuss
The Pearson correlation coefficients of age and nihss, and of age and sbp, are the same when using cor() and cor.test(): a weak relationship.
The p-value > 0.05: not statistically significant.
The confidence interval contains 0: further evidence that r is not statistically significant.
=> Very weakly related and not statistically significant.
For age and nihss: r = -0.033 and p > 0.05. This p-value tells us that the probability of getting a correlation coefficient this big in a sample of 500 observations, if the null hypothesis were true (no relationship between these variables), is very high. Therefore, we fail to reject the null hypothesis and say that the relationship between age and nihss is not statistically significant.
For age and sbp: similar conclusion.
Package: psych
Command: corr.test( x, y = NULL, use = “pairwise”,
method=“pearson”, adjust=“holm”, alpha=.05, ci=TRUE, minlength=5,
normal=TRUE)
library(psych)
print(corr.test(df1), short = F)
## Call:corr.test(x = df1)
## Correlation matrix
## age nihss sbp time.iv aspects time.rand time.punc
## age 1.00 -0.03 0.01 -0.02 0.01 -0.02 0.10
## nihss -0.03 1.00 -0.04 0.06 -0.04 -0.08 -0.05
## sbp 0.01 -0.04 1.00 0.07 -0.01 0.03 0.06
## time.iv -0.02 0.06 0.07 1.00 0.03 -0.01 -0.14
## aspects 0.01 -0.04 -0.01 0.03 1.00 0.01 -0.01
## time.rand -0.02 -0.08 0.03 -0.01 0.01 1.00 0.01
## time.punc 0.10 -0.05 0.06 -0.14 -0.01 0.01 1.00
## Sample Size
## age nihss sbp time.iv aspects time.rand time.punc
## age 500 500 499 445 496 498 233
## nihss 500 500 499 445 496 498 233
## sbp 499 499 499 444 495 497 233
## time.iv 445 445 444 445 441 444 203
## aspects 496 496 495 441 496 494 233
## time.rand 498 498 497 444 494 498 231
## time.punc 233 233 233 203 233 231 233
## Probability values (Entries above the diagonal are adjusted for multiple tests.)
## age nihss sbp time.iv aspects time.rand time.punc
## age 0.00 1.00 1.00 1.00 1.00 1.00 1
## nihss 0.45 0.00 1.00 1.00 1.00 1.00 1
## sbp 0.82 0.39 0.00 1.00 1.00 1.00 1
## time.iv 0.64 0.20 0.13 0.00 1.00 1.00 1
## aspects 0.85 0.33 0.85 0.53 0.00 1.00 1
## time.rand 0.59 0.09 0.53 0.78 0.75 0.00 1
## time.punc 0.13 0.46 0.33 0.05 0.82 0.93 0
##
## Confidence intervals based upon normal theory. To get bootstrapped values, try cor.ci
## raw.lower raw.r raw.upper raw.p lower.adj upper.adj
## age-nihss -0.12 -0.03 0.05 0.45 -0.16 0.09
## age-sbp -0.08 0.01 0.10 0.82 -0.10 0.12
## age-tim.v -0.12 -0.02 0.07 0.64 -0.15 0.11
## age-aspct -0.08 0.01 0.10 0.85 -0.09 0.11
## age-tm.rn -0.11 -0.02 0.06 0.59 -0.15 0.10
## age-tm.pn -0.03 0.10 0.22 0.13 -0.10 0.29
## nihss-sbp -0.13 -0.04 0.05 0.39 -0.17 0.09
## nihss-tim.v -0.03 0.06 0.15 0.20 -0.08 0.20
## nihss-aspct -0.13 -0.04 0.04 0.33 -0.17 0.09
## nihss-tm.rn -0.16 -0.08 0.01 0.09 -0.21 0.06
## nihss-tm.pn -0.18 -0.05 0.08 0.46 -0.24 0.14
## sbp-tim.v -0.02 0.07 0.16 0.13 -0.07 0.21
## sbp-aspct -0.10 -0.01 0.08 0.85 -0.12 0.10
## sbp-tm.rn -0.06 0.03 0.12 0.53 -0.10 0.15
## sbp-tm.pn -0.06 0.06 0.19 0.33 -0.13 0.26
## tim.v-aspct -0.06 0.03 0.12 0.53 -0.11 0.16
## tim.v-tm.rn -0.11 -0.01 0.08 0.78 -0.13 0.11
## tim.v-tm.pn -0.27 -0.14 0.00 0.05 -0.34 0.08
## aspct-tm.rn -0.07 0.01 0.10 0.75 -0.10 0.13
## aspct-tm.pn -0.14 -0.01 0.11 0.82 -0.19 0.16
## tm.rn-tm.pn -0.12 0.01 0.13 0.93 -0.12 0.13
Discuss
With corr.test(), we get r, the sample size, the p-value, and the CI.
The sample size tells us how many observations were used for testing the association between each pair of variables (because of missing values).
The p-values > 0.05: not significant.
The CIs cross 0: also indicating non-significance.
The results are just as we expected.
Package: Hmisc
Command: rcorr(x,y, type = “correlation type”)
library(Hmisc)
rcorr(as.matrix(df1))
## age nihss sbp time.iv aspects time.rand time.punc
## age 1.00 -0.03 0.01 -0.02 0.01 -0.02 0.10
## nihss -0.03 1.00 -0.04 0.06 -0.04 -0.08 -0.05
## sbp 0.01 -0.04 1.00 0.07 -0.01 0.03 0.06
## time.iv -0.02 0.06 0.07 1.00 0.03 -0.01 -0.14
## aspects 0.01 -0.04 -0.01 0.03 1.00 0.01 -0.01
## time.rand -0.02 -0.08 0.03 -0.01 0.01 1.00 0.01
## time.punc 0.10 -0.05 0.06 -0.14 -0.01 0.01 1.00
##
## n
## age nihss sbp time.iv aspects time.rand time.punc
## age 500 500 499 445 496 498 233
## nihss 500 500 499 445 496 498 233
## sbp 499 499 499 444 495 497 233
## time.iv 445 445 444 445 441 444 203
## aspects 496 496 495 441 496 494 233
## time.rand 498 498 497 444 494 498 231
## time.punc 233 233 233 203 233 231 233
##
## P
## age nihss sbp time.iv aspects time.rand time.punc
## age 0.4512 0.8231 0.6392 0.8502 0.5894 0.1306
## nihss 0.4512 0.3875 0.1952 0.3322 0.0852 0.4638
## sbp 0.8231 0.3875 0.1340 0.8500 0.5289 0.3250
## time.iv 0.6392 0.1952 0.1340 0.5286 0.7824 0.0485
## aspects 0.8502 0.3322 0.8500 0.5286 0.7471 0.8250
## time.rand 0.5894 0.0852 0.5289 0.7824 0.7471 0.9344
## time.punc 0.1306 0.4638 0.3250 0.0485 0.8250 0.9344
To calculate R2 by hand, we simply square the r-value and multiply by 100%.
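A sketch of that calculation for age vs nihss:
# R2 (as a percentage) from the Pearson r of age and nihss
r <- cor(df1$age, df1$nihss, use = "pairwise.complete.obs")
r^2 * 100 # ~0.1% of the variability in nihss is shared by age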
Command: ggplot(data, aes(x, y)) + geom_point() + geom_smooth(method = "lm", colour = "Red") + labs(x = "", y = "", title = "") + theme(plot.title = element_text(hjust = 0.5))
Example 1: Plot age vs nihss
ggplot(df1, aes(age, nihss)) +
geom_point() +
geom_smooth(method = "lm", colour = "Red") +
labs(x = "age", y = "nihss", title = " ") +
theme(plot.title = element_text(hjust = 0.5))
Example 2: Plot age vs sbp
ggplot(df1, aes(age, sbp)) +
geom_point() +
geom_smooth(method = "lm", colour = "Red") +
labs(x = "age", y = "sbp", title = " ") +
theme(plot.title = element_text(hjust = 0.5))
Summary
Function | Pearson | Spearman | Kendall | p-value | CI | Multiple correlations? | Sample size |
---|---|---|---|---|---|---|---|
cor() | x | x | x | | | x | |
cor.test() | x | x | x | x | x | | |
corr.test() | x | x | x | x | x | x | x |
rcorr() | x | x | | x | | x | x |
Spearman’s rs is a non-parametric statistic - sometimes called Spearman’s rho.
It can be used when the data violate parametric assumptions, such as non-normally distributed data. This applies when at least one of your variables is measured on an ordinal level or when the data from one or both variables do not follow a normal distribution.
Spearman’s test works by first ranking the data, and then applying Pearson’s equation to those ranks.
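A sketch verifying this on the study data (age and nihss have no missing values, so no NA handling is needed):
cor(rank(df1$age), rank(df1$nihss), method = "pearson") # Pearson's r on the ranks...
cor(df1$age, df1$nihss, method = "spearman")            # ...equals Spearman's rho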
Steps 1 to 3 are similar to those for Pearson’s r. I will start from step 4.
cor(x, y, use = "string", method = "correlation type")
where use is a character string specifying how missing values are handled, as listed earlier.
I will demonstrate the option use = "pairwise.complete.obs" on df1:
cor(df1, use = "pairwise.complete.obs", method = "spearman")
## age nihss sbp time.iv aspects
## age 1.000000000 -0.02915220 0.049699462 -0.009632336 -0.010716491
## nihss -0.029152196 1.00000000 -0.062439507 0.079920020 -0.024695464
## sbp 0.049699462 -0.06243951 1.000000000 0.119992418 0.006328807
## time.iv -0.009632336 0.07992002 0.119992418 1.000000000 0.032214687
## aspects -0.010716491 -0.02469546 0.006328807 0.032214687 1.000000000
## time.rand -0.042617822 -0.07793713 0.031786886 -0.030225795 -0.009998565
## time.punc 0.096659241 -0.04526983 0.050351932 -0.135213032 -0.003709137
## time.rand time.punc
## age -0.0426178221 0.0966592407
## nihss -0.0779371258 -0.0452698337
## sbp 0.0317868859 0.0503519322
## time.iv -0.0302257950 -0.1352130316
## aspects -0.0099985648 -0.0037091373
## time.rand 1.0000000000 -0.0004096534
## time.punc -0.0004096534 1.0000000000
Discuss
Looking at the result, we can easily notice that each variable is perfectly correlated with itself (obviously) and so r = 1.
Our target variables:
For age vs nihss: rs = -0.029 indicates that age is weakly negatively related to nihss. Rs2 = 0.0841% indicates that age shares 0.0841% of the variability (in the ranks) of nihss.
For age vs sbp: rs = 0.05: age is weakly positively related to sbp. Rs2 = 0.25% indicates that age shares 0.25% of the variability (in the ranks) of sbp.
cor.test(x, y, alternative = c(“two.sided”, “less”, “greater”), method = c(“pearson”, “kendall”, “spearman”), conf.level = 0.95, …)
cor.test(df1$age, df1$nihss, method = "spearman")
##
## Spearman's rank correlation rho
##
## data: df1$age and df1$nihss
## S = 21440585, p-value = 0.5155
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.0291522
cor.test(df1$age, df1$sbp, method = "spearman")
##
## Spearman's rank correlation rho
##
## data: df1$age and df1$sbp
## S = 19679299, p-value = 0.2678
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.04969946
Discuss
The Spearman correlation coefficients of age and nihss, and of age and sbp, are the same when using cor() and cor.test().
The p-value > 0.05: not statistically significant. (Note that cor.test() does not report a confidence interval for Spearman’s rho; the CIs come from corr.test() below.)
=> There is no statistically significant relationship between age and nihss, or between age and sbp.
Package: psych
Command: corr.test( x, y = NULL, use = “pairwise”,
method=“spearman”, adjust=“holm”, alpha=.05, ci=TRUE, minlength=5,
normal=TRUE)
library(psych)
print(corr.test(df1, method = "spearman"), short = F)
## Call:corr.test(x = df1, method = "spearman")
## Correlation matrix
## age nihss sbp time.iv aspects time.rand time.punc
## age 1.00 -0.03 0.05 -0.01 -0.01 -0.04 0.10
## nihss -0.03 1.00 -0.06 0.08 -0.02 -0.08 -0.05
## sbp 0.05 -0.06 1.00 0.12 0.01 0.03 0.05
## time.iv -0.01 0.08 0.12 1.00 0.03 -0.03 -0.14
## aspects -0.01 -0.02 0.01 0.03 1.00 -0.01 0.00
## time.rand -0.04 -0.08 0.03 -0.03 -0.01 1.00 0.00
## time.punc 0.10 -0.05 0.05 -0.14 0.00 0.00 1.00
## Sample Size
## age nihss sbp time.iv aspects time.rand time.punc
## age 500 500 499 445 496 498 233
## nihss 500 500 499 445 496 498 233
## sbp 499 499 499 444 495 497 233
## time.iv 445 445 444 445 441 444 203
## aspects 496 496 495 441 496 494 233
## time.rand 498 498 497 444 494 498 231
## time.punc 233 233 233 203 233 231 233
## Probability values (Entries above the diagonal are adjusted for multiple tests.)
## age nihss sbp time.iv aspects time.rand time.punc
## age 0.00 1.00 1.00 1.00 1.00 1 1
## nihss 0.52 0.00 1.00 1.00 1.00 1 1
## sbp 0.27 0.16 0.00 0.24 1.00 1 1
## time.iv 0.84 0.09 0.01 0.00 1.00 1 1
## aspects 0.81 0.58 0.89 0.50 0.00 1 1
## time.rand 0.34 0.08 0.48 0.53 0.82 0 1
## time.punc 0.14 0.49 0.44 0.05 0.96 1 0
##
## Confidence intervals based upon normal theory. To get bootstrapped values, try cor.ci
## raw.lower raw.r raw.upper raw.p lower.adj upper.adj
## age-nihss -0.12 -0.03 0.06 0.52 -0.15 0.09
## age-sbp -0.04 0.05 0.14 0.27 -0.08 0.18
## age-tim.v -0.10 -0.01 0.08 0.84 -0.13 0.11
## age-aspct -0.10 -0.01 0.08 0.81 -0.13 0.11
## age-tm.rn -0.13 -0.04 0.05 0.34 -0.17 0.09
## age-tm.pn -0.03 0.10 0.22 0.14 -0.10 0.29
## nihss-sbp -0.15 -0.06 0.03 0.16 -0.19 0.07
## nihss-tim.v -0.01 0.08 0.17 0.09 -0.06 0.22
## nihss-aspct -0.11 -0.02 0.06 0.58 -0.14 0.10
## nihss-tm.rn -0.16 -0.08 0.01 0.08 -0.21 0.06
## nihss-tm.pn -0.17 -0.05 0.08 0.49 -0.23 0.14
## sbp-tim.v 0.03 0.12 0.21 0.01 -0.02 0.26
## sbp-aspct -0.08 0.01 0.09 0.89 -0.10 0.11
## sbp-tm.rn -0.06 0.03 0.12 0.48 -0.09 0.16
## sbp-tm.pn -0.08 0.05 0.18 0.44 -0.14 0.24
## tim.v-aspct -0.06 0.03 0.13 0.50 -0.10 0.17
## tim.v-tm.rn -0.12 -0.03 0.06 0.53 -0.16 0.10
## tim.v-tm.pn -0.27 -0.14 0.00 0.05 -0.34 0.08
## aspct-tm.rn -0.10 -0.01 0.08 0.82 -0.13 0.11
## aspct-tm.pn -0.13 0.00 0.12 0.96 -0.15 0.14
## tm.rn-tm.pn -0.13 0.00 0.13 1.00 -0.13 0.13
Discuss
With corr.test(), we get r, the sample size, the p-value, and the CI.
The sample size tells us how many observations were used for testing the association between each pair of variables (because of missing values).
The p-values: observe the raw p-values below the diagonal.
The results are mostly as we expected:
=> There is no statistically significant relationship between age and nihss, or between age and sbp.
However, for time.iv and sbp: rs = 0.12, p-value = 0.01, and the raw CI does not cross 0: statistically significant. This result differs from Pearson’s r; because time.iv has a non-normal distribution, the Spearman result is more trustworthy than the Pearson one here.
=> There is a significant positive relationship between time.iv and sbp: as time.iv increases, sbp tends to increase as well. Furthermore, time.iv accounts for 0.12^2 = 1.44% of the variability (in the ranks) of sbp.
Package: Hmisc
Command: rcorr(x,y, type = “correlation type”)
library(Hmisc)
rcorr(as.matrix(df1), type = "spearman")
## age nihss sbp time.iv aspects time.rand time.punc
## age 1.00 -0.03 0.05 -0.01 -0.01 -0.04 0.10
## nihss -0.03 1.00 -0.06 0.08 -0.02 -0.08 -0.05
## sbp 0.05 -0.06 1.00 0.12 0.01 0.03 0.05
## time.iv -0.01 0.08 0.12 1.00 0.03 -0.03 -0.14
## aspects -0.01 -0.02 0.01 0.03 1.00 -0.01 0.00
## time.rand -0.04 -0.08 0.03 -0.03 -0.01 1.00 0.00
## time.punc 0.10 -0.05 0.05 -0.14 0.00 0.00 1.00
##
## n
## age nihss sbp time.iv aspects time.rand time.punc
## age 500 500 499 445 496 498 233
## nihss 500 500 499 445 496 498 233
## sbp 499 499 499 444 495 497 233
## time.iv 445 445 444 445 441 444 203
## aspects 496 496 495 441 496 494 233
## time.rand 498 498 497 444 494 498 231
## time.punc 233 233 233 203 233 231 233
##
## P
## age nihss sbp time.iv aspects time.rand time.punc
## age 0.5155 0.2678 0.8394 0.8118 0.3426 0.1413
## nihss 0.5155 0.1637 0.0922 0.5832 0.0823 0.4917
## sbp 0.2678 0.1637 0.0114 0.8883 0.4795 0.4443
## time.iv 0.8394 0.0922 0.0114 0.4998 0.5253 0.0544
## aspects 0.8118 0.5832 0.8883 0.4998 0.8246 0.9551
## time.rand 0.3426 0.0823 0.4795 0.5253 0.8246 0.9951
## time.punc 0.1413 0.4917 0.4443 0.0544 0.9551 0.9951
Discuss: The result of rcorr() is similar to corr.test().
Command: ggplot(data, aes(x, y)) + geom_point() + geom_smooth(method = "lm", colour = "Red") + labs(x = "", y = "", title = "") + theme(plot.title = element_text(hjust = 0.5))
Example 1: Plot age vs nihss
ggplot(df1, aes(age, nihss)) +
geom_point() +
geom_smooth(method = "lm", colour = "Red") +
labs(x = "age", y = "nihss", title = " ") +
theme(plot.title = element_text(hjust = 0.5))
Example 2: Plot age vs sbp
ggplot(df1, aes(age, sbp)) +
geom_point() +
geom_smooth(method = "lm", colour = "Red") +
labs(x = "age", y = "sbp", title = " ") +
theme(plot.title = element_text(hjust = 0.5))
Summary
All 4 commands produce similar results for Spearman’s correlation coefficient rs.
However, I prefer corr.test() > rcorr() > cor.test() > cor() because of the information provided in the result sections.
From r, we can calculate the coefficient of determination R2: X shares a certain percentage of the variability in Y, or X accounts for that percentage of the variation in Y.
r and R2 cannot be used to infer causal relationships: X does not necessarily cause the variation in Y.
Kendall’s tau is another non-parametric correlation.
It should be used rather than Spearman’s coefficient when you have a small data set with a large number of tied ranks.
Although Spearman’s statistic is the more popular of the two coefficients, there is much to suggest that Kendall’s statistic is actually a better estimate of the correlation in the population.
Steps 1 to 3 are similar to those for Pearson’s r. I will start from step 4.
cor(x,y, use = “string”, method = “correlation type”)
cor(df1, use = "pairwise.complete.obs", method = "kendall")
## age nihss sbp time.iv aspects
## age 1.000000000 -0.01870296 0.035677034 -0.006439801 -0.008510516
## nihss -0.018702958 1.00000000 -0.042009528 0.053594262 -0.018916012
## sbp 0.035677034 -0.04200953 1.000000000 0.081829814 0.005073658
## time.iv -0.006439801 0.05359426 0.081829814 1.000000000 0.023811504
## aspects -0.008510516 -0.01891601 0.005073658 0.023811504 1.000000000
## time.rand -0.028191859 -0.05425169 0.021044476 -0.020902271 -0.006315963
## time.punc 0.066117986 -0.03196113 0.033534926 -0.091000978 -0.003156532
## time.rand time.punc
## age -0.028191859 0.066117986
## nihss -0.054251690 -0.031961132
## sbp 0.021044476 0.033534926
## time.iv -0.020902271 -0.091000978
## aspects -0.006315963 -0.003156532
## time.rand 1.000000000 -0.005674725
## time.punc -0.005674725 1.000000000
cor.test(x, y, alternative = c(“two.sided”, “less”, “greater”), method = c(“pearson”, “kendall”, “spearman”), conf.level = 0.95, …)
cor.test(df1$age, df1$nihss, method = "kendall")
##
## Kendall's rank correlation tau
##
## data: df1$age and df1$nihss
## z = -0.6011, p-value = 0.5478
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## -0.01870296
cor.test(df1$age, df1$sbp, method = "kendall")
##
## Kendall's rank correlation tau
##
## data: df1$age and df1$sbp
## z = 1.1748, p-value = 0.2401
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## 0.03567703
Package: psych
Command: corr.test( x, y = NULL, use = “pairwise”,
method=“kendall”, adjust=“holm”, alpha=.05, ci=TRUE, minlength=5,
normal=TRUE)
library(psych)
print(corr.test(df1, method = "kendall"), short = F)
## Call:corr.test(x = df1, method = "kendall")
## Correlation matrix
## age nihss sbp time.iv aspects time.rand time.punc
## age 1.00 -0.02 0.04 -0.01 -0.01 -0.03 0.07
## nihss -0.02 1.00 -0.04 0.05 -0.02 -0.05 -0.03
## sbp 0.04 -0.04 1.00 0.08 0.01 0.02 0.03
## time.iv -0.01 0.05 0.08 1.00 0.02 -0.02 -0.09
## aspects -0.01 -0.02 0.01 0.02 1.00 -0.01 0.00
## time.rand -0.03 -0.05 0.02 -0.02 -0.01 1.00 -0.01
## time.punc 0.07 -0.03 0.03 -0.09 0.00 -0.01 1.00
## Sample Size
## age nihss sbp time.iv aspects time.rand time.punc
## age 500 500 499 445 496 498 233
## nihss 500 500 499 445 496 498 233
## sbp 499 499 499 444 495 497 233
## time.iv 445 445 444 445 441 444 203
## aspects 496 496 495 441 496 494 233
## time.rand 498 498 497 444 494 498 231
## time.punc 233 233 233 203 233 231 233
## Probability values (Entries above the diagonal are adjusted for multiple tests.)
## age nihss sbp time.iv aspects time.rand time.punc
## age 0.00 1.00 1.00 1.00 1.00 1.00 1
## nihss 0.68 0.00 1.00 1.00 1.00 1.00 1
## sbp 0.43 0.35 0.00 1.00 1.00 1.00 1
## time.iv 0.89 0.26 0.09 0.00 1.00 1.00 1
## aspects 0.85 0.67 0.91 0.62 0.00 1.00 1
## time.rand 0.53 0.23 0.64 0.66 0.89 0.00 1
## time.punc 0.31 0.63 0.61 0.20 0.96 0.93 0
##
## Confidence intervals based upon normal theory. To get bootstrapped values, try cor.ci
## raw.lower raw.r raw.upper raw.p lower.adj upper.adj
## age-nihss -0.11 -0.02 0.07 0.68 -0.14 0.10
## age-sbp -0.05 0.04 0.12 0.43 -0.10 0.17
## age-tim.v -0.10 -0.01 0.09 0.89 -0.13 0.12
## age-aspct -0.10 -0.01 0.08 0.85 -0.13 0.11
## age-tm.rn -0.12 -0.03 0.06 0.53 -0.16 0.10
## age-tm.pn -0.06 0.07 0.19 0.31 -0.13 0.26
## nihss-sbp -0.13 -0.04 0.05 0.35 -0.17 0.09
## nihss-tim.v -0.04 0.05 0.15 0.26 -0.09 0.19
## nihss-aspct -0.11 -0.02 0.07 0.67 -0.14 0.10
## nihss-tm.rn -0.14 -0.05 0.03 0.23 -0.19 0.08
## nihss-tm.pn -0.16 -0.03 0.10 0.63 -0.22 0.16
## sbp-tim.v -0.01 0.08 0.17 0.09 -0.06 0.22
## sbp-aspct -0.08 0.01 0.09 0.91 -0.10 0.11
## sbp-tm.rn -0.07 0.02 0.11 0.64 -0.10 0.15
## sbp-tm.pn -0.10 0.03 0.16 0.61 -0.16 0.22
## tim.v-aspct -0.07 0.02 0.12 0.62 -0.11 0.16
## tim.v-tm.rn -0.11 -0.02 0.07 0.66 -0.15 0.11
## tim.v-tm.pn -0.23 -0.09 0.05 0.20 -0.30 0.12
## aspct-tm.rn -0.09 -0.01 0.08 0.89 -0.12 0.11
## aspct-tm.pn -0.13 0.00 0.13 0.96 -0.13 0.13
## tm.rn-tm.pn -0.13 -0.01 0.12 0.93 -0.16 0.15
Discuss
With corr.test(), we get r, the sample size, the p-value, and the CI.
The sample size tells us how many observations were used for testing the association between each pair of variables (because of missing values).
The p-values: observe the raw p-values below the diagonal.
The results are just as we expected.
The p-value for time.iv and sbp is 0.09 and its CI crosses 0: non-significant. This result is different from Spearman’s rs.
Command: ggplot(data, aes(x, y)) + geom_point() + geom_smooth(method = "lm", colour = "Red") + labs(x = "", y = "", title = "") + theme(plot.title = element_text(hjust = 0.5))
Example 1: Plot age vs nihss
ggplot(df1, aes(age, nihss)) +
geom_point() +
geom_smooth(method = "lm", colour = "Red") +
labs(x = "age", y = "nihss", title = " ") +
theme(plot.title = element_text(hjust = 0.5))
Example 2: Plot age vs sbp
ggplot(df1, aes(age, sbp)) +
geom_point() +
geom_smooth(method = "lm", colour = "Red") +
labs(x = "age", y = "sbp", title = " ") +
theme(plot.title = element_text(hjust = 0.5))
Summary
All 3 commands (cor(), cor.test(), corr.test()) produce similar results for Kendall’s correlation coefficient τ; rcorr() does not support Kendall’s tau.
However, I prefer corr.test() > cor.test() > cor() because of the information provided in the result sections.
Another way to deal with data that do not meet the assumptions of Pearson’s r is to use bootstrapping. The boot() function takes the general form:
library(boot)
object<-boot(data, function, replications)
data: the dataframe to be used.
function: a function that you write to tell boot() what you want
to bootstrap.
replications: a number specifying how many bootstrap samples you want to take (I usually set this value to 2000).
Steps 1 to 3 are similar to those for Pearson’s r. I will start from the final step, step 4.
Example: Bootstrap 2000 times, apply Kendall for age vs nihss:
bootTau <- function(df1, i) cor(df1$age[i], df1$nihss[i], use = "pairwise.complete.obs", method = "kendall")
library(boot)
boot_kendall<-boot(df1, bootTau, 2000)
boot_kendall
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = df1, statistic = bootTau, R = 2000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -0.01870296 -0.0004003333 0.03193298
# To get the 95% confidence interval:
boot.ci(boot_kendall, conf = 0.95)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 2000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = boot_kendall, conf = 0.95)
##
## Intervals :
## Level Normal Basic
## 95% (-0.0809, 0.0443 ) (-0.0818, 0.0449 )
##
## Level Percentile BCa
## 95% (-0.0823, 0.0444 ) (-0.0813, 0.0451 )
## Calculations and Intervals on Original Scale
Explain the formula:
It creates an object called bootTau.
The first part of the function tells R what input to expect: in this case we need to feed a dataframe (df1) into the function, plus a variable called i (which refers to a particular bootstrap sample).
The second part of the function specifies the cor() function, which is the thing we want to bootstrap.
cor() is specified in exactly the same way, except that for each variable we have added [i], which refers to a particular bootstrap sample.
If you want to bootstrap a Pearson or Spearman correlation you do it in exactly the same way, except that you specify method = "pearson" or method = "spearman" when you define the function, as sketched below.
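For example, a sketch of the same bootstrap using Spearman’s rho (bootRho is a hypothetical name):
# Same pattern as bootTau, but with method = "spearman"
bootRho <- function(df1, i) cor(df1$age[i], df1$nihss[i],
                                use = "pairwise.complete.obs", method = "spearman")
boot_spearman <- boot(df1, bootRho, 2000)
boot.ci(boot_spearman, conf = 0.95)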
Discuss the result
Output of boot_kendall: the original statistic (tau = -0.019) with a very small bias and a bootstrap standard error of about 0.032.
Output from boot.ci(): all four 95% confidence intervals contain zero.
=> age and nihss have a weak correlation that is not statistically significant.
Extremely important: these coefficients can be considered special cases of Pearson’s r.
Point-biserial correlation coefficient (rpb) is used when one variable is a discrete dichotomy - only A or B (e.g., pregnancy) – no continuum between two categories. The point-biserial correlation is equivalent to calculating the Pearson correlation between a continuous and a dichotomous variable - We can just use the standard cor.test function in R, which will output the correlation, a 95% confidence interval, and an independent t-test with associated p-value.
Biserial correlation coefficient (rb) is used when one variable is a continuous dichotomy – the degree of variable (e.g., large/small margin of passing or failing an exam). rpb < rb.
Phi Coefficient is a measure of association between two binary variables (i.e. living/dead, black/white, success/failure). The correlation is based on frequency counts. The phi coefficient is identical to the Pearson coefficient in the case of a 2 x 2 data set - We can still use the cor.test function to calculate the phi coefficient.
data <- read_excel("D:/Statistics/R/R data/fakestroke.xlsx",
col_types = c("text", "text", "numeric",
"text", "numeric", "text", "text",
"numeric", "numeric", "text", "numeric",
"text", "numeric", "numeric", "text",
"numeric", "numeric", "numeric"))
names(data)
## [1] "studyid" "trt" "age" "sex" "nihss" "location"
## [7] "hx.isch" "afib" "dm" "mrankin" "sbp" "iv.altep"
## [13] "time.iv" "aspects" "ia.occlus" "extra.ica" "time.rand" "time.punc"
str(data)
## tibble [500 x 18] (S3: tbl_df/tbl/data.frame)
## $ studyid : chr [1:500] "z001" "z002" "z003" "z004" ...
## $ trt : chr [1:500] "Control" "Intervention" "Control" "Control" ...
## $ age : num [1:500] 53 51 68 28 91 34 75 89 75 26 ...
## $ sex : chr [1:500] "Male" "Male" "Female" "Male" ...
## $ nihss : num [1:500] 21 23 11 22 24 18 25 18 25 27 ...
## $ location : chr [1:500] "Right" "Left" "Right" "Left" ...
## $ hx.isch : chr [1:500] "No" "No" "No" "No" ...
## $ afib : num [1:500] 0 1 0 0 0 0 0 0 1 0 ...
## $ dm : num [1:500] 0 0 0 0 0 0 0 0 0 0 ...
## $ mrankin : chr [1:500] "2" "0" "0" "0" ...
## $ sbp : num [1:500] 127 137 138 122 162 166 140 157 129 143 ...
## $ iv.altep : chr [1:500] "Yes" "Yes" "No" "Yes" ...
## $ time.iv : num [1:500] 63 68 NA 78 121 78 97 NA 49 99 ...
## $ aspects : num [1:500] 10 10 10 10 8 5 10 9 6 10 ...
## $ ia.occlus: chr [1:500] "M1" "M1" "ICA with M1" "ICA with M1" ...
## $ extra.ica: num [1:500] 0 1 1 1 0 0 0 0 0 0 ...
## $ time.rand: num [1:500] 139 118 178 160 214 154 122 147 271 141 ...
## $ time.punc: num [1:500] NA 281 NA NA NA NA 268 NA NA 259 ...
In this example, I just want to examine sex, nihss and location:
# Select discrete categorical + numeric variables:
df2 <- data %>% select(age, sex, nihss, location)
# Convert the coded categorical variables to numeric. (Note: we cannot get 0/1 codes via labels = 0:1, because as.numeric() on a factor returns the level index, i.e., 1:2; the correlation result is the same for 0:1 and 1:2 coding.):
## Hard way but can solve 1:2 problem:
### df2$sex[df2$sex == "Female"] <- 0
### df2$sex[df2$sex == "Male"] <- 1
### df2$sex = as.numeric(df2$sex)
### df2$location[df2$location == "Right"] <- 0
### df2$location[df2$location == "Left"] <- 1
### df2$location = as.numeric(df2$location)
## Convenient way:
df2$sex = factor(df2$sex, levels = c("Female","Male"), labels = c(1, 2)) %>% as.numeric()
df2$location = factor(df2$location, levels = c("Right","Left"), labels = c(1, 2)) %>% as.numeric()
# Checking after converting
str(df2)
## tibble [500 x 4] (S3: tbl_df/tbl/data.frame)
## $ age : num [1:500] 53 51 68 28 91 34 75 89 75 26 ...
## $ sex : num [1:500] 2 2 1 2 2 1 2 1 2 1 ...
## $ nihss : num [1:500] 21 23 11 22 24 18 25 18 25 27 ...
## $ location: num [1:500] 1 2 1 2 1 2 1 1 2 1 ...
Discuss: The data is now ready for the next step. We still need Step 2 (missingness checking), but we don’t need Step 3 (normal distribution checking).
Similar to Step 2 in Pearson’s r.
Example 1: Point-biserial correlation between sex and nihss
cor.test(df2$sex, df2$nihss)
##
## Pearson's product-moment correlation
##
## data: df2$sex and df2$nihss
## t = 0.47402, df = 498, p-value = 0.6357
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.06657791 0.10872477
## sample estimates:
## cor
## 0.02123666
Discuss
The point-biserial correlation coefficient rpb = 0.02 is close to 0 => very weak positive correlation.
rpb has a non-significant p-value of p = 0.6.
R2 = 0.0004 = 0.04%. Hence, we can conclude that sex accounts for 0.04% of the variability in nihss.
=> There is no significant relationship between sex and nihss.
Example 2: Biserial correlation between sex and nihss. Although sex is not a continuous dichotomy, I still use this variable for demonstration purposes:
a = table(df2$sex)
prop.table(a)
##
## 1 2
## 0.416 0.584
To calculate p, q and y, we use these proportions together with the values in the table of the standard normal distribution. Because the exact values 0.584 and 0.416 are not in the table, we instead use the nearest values we can find, which are 0.583 and 0.416:
Using the R formula, or calculating by hand:
**rb = rpb x sqrt(p*q) / y**
library(polycor)
polycor::polyserial(df2$sex, df2$nihss)
## [1] 0.02164936
Calculate the SE of rb:
SE = sqrt(p * q) / (y * sqrt(N)) = sqrt(0.584 * 0.416) / (0.389 * sqrt(500))
SE = 0.057
Calculate z:
z = rb / SE = 0.021 / 0.057 = 0.37
Look up this z in the standard normal table for the one-tailed probability in the column “Smaller Portion” => p one-tailed = 0.35569 => p two-tailed = 0.71 > 0.05: not statistically significant.
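A sketch of this hand calculation in R, using the values above (p and q from prop.table(), rb from polyserial()):
p <- 0.584; q <- 0.416
y <- dnorm(qnorm(q))                # density at the dichotomization point, ~0.39
rb <- 0.0216                        # biserial estimate from polyserial() above
se <- sqrt(p * q) / (y * sqrt(500)) # ~0.057
z <- rb / se                        # ~0.38
2 * pnorm(-abs(z))                  # two-tailed p ~ 0.70: not significant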
Example 3: Phi correlation between sex and location
cor.test(df2$sex, df2$location)
##
## Pearson's product-moment correlation
##
## data: df2$sex and df2$location
## t = 0.34586, df = 498, p-value = 0.7296
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.07229239 0.10304699
## sample estimates:
## cor
## 0.01549643
Discuss
The phi correlation coefficient = 0.015 is close to 0 => very weak positive correlation.
Phi has a non-significant p-value of p = 0.7.
R2 = 0.0002 = 0.02%. Hence, we can conclude that sex accounts for 0.02% of the variability in location.
=> No significant relationship between sex and location.
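As a cross-check (a sketch using df2), the magnitude of phi can also be recovered from the 2 x 2 contingency table as phi = sqrt(chi-square / n); note that the sign is lost this way:
# |phi| from the 2x2 table of sex by location (no continuity correction)
tab <- table(df2$sex, df2$location)
sqrt(chisq.test(tab, correct = FALSE)$statistic / sum(tab)) # ~0.0155, matching |r| above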
Command: ggplot(data, aes(x, y)) + geom_point() + geom_smooth(colour = "Red") + labs(x = "", y = "", title = "") + theme(plot.title = element_text(hjust = 0.5))
Example: Plot scatterplot for sex and nihss
ggplot(df2, aes(nihss, sex)) +
geom_point() +
geom_smooth(colour = "Red") +
labs(x = "nihss", y = "sex", title = " ") +
theme(plot.title = element_text(hjust = 0.5))
A correlation between two variables in which the effects of other variables are held constant is known as a partial correlation.
If we remove the portion of variation that is also shared by location, we get a measure of the unique relationship between sex and nihss. We use partial correlations to find out the size of this unique portion of variance. Therefore, we could conduct a partial correlation between sex and nihss while ‘controlling’ for the effect of location.
Note:
Neither the partial nor the semi-partial correlation calculation allows missing values, so we need to prepare a clean data set with all missing values removed.
These partial correlations can be computed when variables are dichotomous (including the ‘third’ variable); the variables can be continuous or discrete.
So, for example, we could look at the relationship between bladder relaxation (did the person wet themselves or not?) and the number of large tarantulas crawling up your leg, controlling for fear of spiders (the first variable is dichotomous, but the second variable and the ‘controlled for’ variables are continuous).
Also, the first variable can be continuous while the second variable and the ‘controlled for’ variables are dichotomous.
Package: ggm
Command:
* pc = ggm::pcor(c("var1", "var2", "control1", "control2", etc.), var(dataframe))
* pc^2
* ggm::pcor.test(pc, number of control variables, sample size)
Another way:
Package: ppcor
Command: ppcor::pcor.test(df2$sex, df2$nihss, df2$location, method = "pearson")
Example: Partial correlation for sex vs nihss, controlling for the effect of location:
# Ensure variables not have missing value:
df2 = na.omit(df2)
# Calculate the correlation between sex and nihss while controlling for location
## First way:
library(ggm)
pc = ggm::pcor(c("sex", "nihss", "location"), var(df2))
pc
## [1] 0.0225065
pc^2
## [1] 0.0005065427
ggm::pcor.test(pc, 1, 500)
## $tval
## [1] 0.5018758
##
## $df
## [1] 497
##
## $pvalue
## [1] 0.6159772
## Another way:
library(ppcor)
ppcor::pcor.test(df2$sex,df2$nihss,df2$location, method="pearson")
## estimate p.value statistic n gp Method
## 1 0.0225065 0.6159772 0.5018758 500 1 pearson
Discuss
The partial correlation between sex and nihss, controlling for the effect of location:
=> Including location slightly increases the correlation (from 0.021 to 0.023), i.e., the amount of variation in nihss shared by sex.
=> sex alone explains very little of the variation in nihss.
=> There is no significant relationship between sex and nihss after controlling for location.
Semi-partial Correlation: The semi-partial correlation coefficient is the correlation between all of Y and that part of X which is independent of Z. That is the effect of Z has been removed from X but not from Y.
Example: Semi-partial correlation for sex vs nihss, controlling for the effect of location:
# Ensure variables not have missing value:
df2 = na.omit(df2)
# Calculate the semi-partial correlation between sex and nihss while controlling for location
library(ppcor)
spcor.test(df2$sex,df2$nihss,df2$location, method="pearson")
## estimate p.value statistic n gp Method
## 1 0.0225038 0.6160195 0.5018155 500 1 pearson
The rank-biserial correlation coefficient, rrb, is used for dichotomous nominal data vs rankings (ordinal).
Package: rcompanion.
Command: wilcoxonRG(x = ordinal variable, g = nominal variable,
verbose=TRUE).
Example: Rank-biserial correlation for sex vs aspects:
# Create a df3 dataframe has sex and aspects:
df3 = data %>% select (sex, aspects) %>% na.omit()
df3$sex = factor(df3$sex, levels = c("Female","Male"), labels = c(1, 2)) %>% as.numeric()
# Run the Rank Biserial rrb:
library(rcompanion)
wilcoxonRG(x = df3$aspects, g = df3$sex, verbose=TRUE, ci = T, R = 1000)
##
## Levels: 1 2
## n for 1 = 208
## n for 2 = 288
## Mean of ranks for 1 = 247.1851
## Mean of ranks for 2 = 249.4497
## Difference in mean of ranks = -2.264557
## Total n = 496
## 2 * difference / total n = -0.00913
## rg lower.ci upper.ci
## 1 -0.00913 -0.107 0.0977
References:
https://www.rdocumentation.org/packages/Hmisc/versions/4.7-0/topics/rcorr
https://www.researchgate.net/post/Can_I_use_Pearsons_correlation_coefficient_to_know_the_relation_between_perception_and_gender_age_income
https://bookdown.org/chua/ber642_advanced_regression/running-correlations-in-r.html#point-biserial-correlation-phi-correlation
https://github.com/THOMASELOVE/432-data/blob/master/data/fakestroke.csv
http://faculty.cas.usf.edu/mbrannick/regression/Part3/Partials.html
https://www.andrews.edu/~calkins/math/edrm611/edrm13.htm