Foreword: The Elementary Statistics for Medical Students (ESMS) project
This is the third tutorial of our ESMS project, which aims to help Vietnamese medical students develop their skills in the R statistical programming language. Each week we publish a problem-based tutorial that shows the student how to answer a common study question using appropriate methods.
As in the two previous tutorials, our approach differs from what is usually taught in medical school: we focus on R code, graphical presentation and interpretation skills together. We also introduce R, which we consider an excellent tool for developing both creativity and knowledge in students. By writing down a statistical procedure in R, the learner can understand a method at a deeper level and take control of everything, down to the smallest detail.
Introduction
Pearson’s product-moment correlation coefficient (r) measures the strength and direction of the linear association between two continuous variables. The r coefficient is defined as the covariance of the two variables divided by the product of their standard deviations. This method is directly related to the simple linear regression model: we fit a straight line to the two variables, then measure how far the data points deviate from this line, that is, how well the data points fit the model.
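For reference, the definition given in the paragraph above can be written as the following formula. The notation is our own choice for illustration: cov(X,Y) is the sample covariance, and s_X and s_Y are the sample standard deviations of the two variables.

```latex
r = \frac{\operatorname{cov}(X,Y)}{s_X \, s_Y}
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The factor 1/(n-1) appears in both the covariance and the two standard deviations, so it cancels out of the ratio.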
Though correlation analysis is only one part of medical research and, in most cases, is not enough to establish a causal inference, this method still plays an important role. Pearson’s coefficient is the simplest way to explore the relationships among continuous variables in a dataset, and sometimes this leads to unexpected discoveries and allows us to develop new hypotheses.
This tutorial will guide you through a standard procedure for evaluating a linear relationship. We also introduce a bootstrapping method, and explain how to interpret and report the results.
Context
Our case study enrolled 34 patients with calcium oxalate crystals and 45 normal subjects. Their urine specimens were analyzed to evaluate physical characteristics such as specific gravity, pH, Osmolarity, Conductivity, and Urea and Calcium concentrations (mmol/L). Taking Osmolarity as the main outcome, we want to evaluate the relationship between this outcome and the other urinary indices.
Data preparation
The original data were provided by Andrews, D.F. and Herzberg, A.M. (1985) Data: A Collection of Problems from Many Fields for the Student and Research Worker. Springer-Verlag. It could be downloaded from the http://vincentarelbundock.github.io website.
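library(tidyverse)
# Load the urine dataset and convert it to a tibble
df=read.csv("http://vincentarelbundock.github.io/Rdatasets/csv/boot/urine.csv")%>%as_tibble()
names(df)=c("Id","ClassCOC","Gravity","Ph","Osmolarity","Conductivity","Urea","Calcium")
# Label the crystal indicator (0/1); plain assignment is used because library(tidyverse) does not attach %<>%
df$ClassCOC=df$ClassCOC%>%as.factor()%>%recode_factor(`0` = "Negative", `1` = "Positive")
# Make sure the six urinary indices are stored as numeric
df[,c(3:8)]=df[,c(3:8)]%>%lapply(.,as.numeric)
df%>%head()%>%knitr::kable()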
Id | ClassCOC | Gravity | Ph | Osmolarity | Conductivity | Urea | Calcium |
---|---|---|---|---|---|---|---|
1 | Negative | 1.021 | 4.91 | 725 | NA | 443 | 2.45 |
2 | Negative | 1.017 | 5.74 | 577 | 20.0 | 296 | 4.49 |
3 | Negative | 1.008 | 7.20 | 321 | 14.9 | 101 | 2.36 |
4 | Negative | 1.011 | 5.51 | 408 | 12.6 | 224 | 2.15 |
5 | Negative | 1.005 | 6.52 | 187 | 7.5 | 91 | 1.16 |
6 | Negative | 1.020 | 5.27 | 668 | 25.3 | 252 | 3.34 |
Step 1: Exploring data and Testing of Assumptions
Before analysing our data with Pearson’s method, we should make sure that the following assumptions are met:
Your two variables are continuous, or at least measured on an interval scale. If not, consider a nonparametric method such as Spearman’s rho coefficient.
Outliers should be avoided, as the correlation coefficient is sensitive to them.
There should be no missing values among the observations (Pearson’s method does not accept missing values, but this problem is easily handled by case-wise deletion of incomplete pairs).
If you want to make statistical inference using a null hypothesis test, your data should be approximately normally distributed.
In fact, there is no strict rule that limits the use of Pearson’s r, and the two variables do not need to be measured on the same scale. The most important rule is that the two variables must be paired: each observation needs a value for both variables.
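The pairing rule and the case-wise handling of missing values described above can be sketched as follows. This is only an illustration on made-up numbers (the vectors x and y are hypothetical, not taken from the urine dataset); it shows that dropping incomplete pairs by hand gives the same coefficient as asking cor() to do it.

```r
# Toy data (hypothetical values, not the urine dataset)
x <- c(1.2, 2.4, NA, 4.1, 5.3, 6.8)
y <- c(2.0, 2.9, 3.7, NA, 6.1, 7.2)

# Keep only the observations where both measurements are present
ok <- complete.cases(x, y)
n_pairs <- sum(ok)

# Pearson's r on the complete pairs, obtained in two equivalent ways
r_manual  <- cor(x[ok], y[ok], method = "pearson")
r_builtin <- cor(x, y, method = "pearson", use = "pairwise.complete.obs")

n_pairs
all.equal(r_manual, r_builtin)
```

Reporting the number of complete pairs alongside r is a good habit, since the degrees of freedom depend on it.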
Table 1: Descriptive analysis
lapply(df,Hmisc::describe)
## $Id
## X[[i]]
## n missing distinct Info Mean Gmd .05 .10
## 79 0 79 1 40 26.67 4.9 8.8
## .25 .50 .75 .90 .95
## 20.5 40.0 59.5 71.2 75.1
##
## lowest : 1 2 3 4 5, highest: 75 76 77 78 79
##
## $ClassCOC
## X[[i]]
## n missing distinct
## 79 0 2
##
## Value Negative Positive
## Frequency 45 34
## Proportion 0.57 0.43
##
## $Gravity
## X[[i]]
## n missing distinct Info Mean Gmd .05 .10
## 79 0 29 0.997 1.018 0.008223 1.008 1.008
## .25 .50 .75 .90 .95
## 1.012 1.018 1.023 1.026 1.029
##
## lowest : 1.005 1.006 1.007 1.008 1.009, highest: 1.029 1.031 1.033 1.034 1.040
##
## $Ph
## X[[i]]
## n missing distinct Info Mean Gmd .05 .10
## 79 0 70 1 6.028 0.808 5.072 5.234
## .25 .50 .75 .90 .95
## 5.530 5.940 6.385 6.922 7.403
##
## lowest : 4.76 4.81 4.90 4.91 5.09, highest: 7.38 7.61 7.90 7.92 7.94
##
## $Osmolarity
## X[[i]]
## n missing distinct Info Mean Gmd .05 .10
## 78 1 76 1 615 274.2 249.6 303.3
## .25 .50 .75 .90 .95
## 411.5 612.5 797.5 885.3 958.1
##
## lowest : 187 225 241 242 251, highest: 956 970 1032 1107 1236
##
## $Conductivity
## X[[i]]
## n missing distinct Info Mean Gmd .05 .10
## 78 1 63 1 20.9 9.164 8.355 10.600
## .25 .50 .75 .90 .95
## 14.375 21.400 26.775 29.960 33.630
##
## lowest : 5.1 7.5 8.1 8.4 8.8, highest: 33.6 33.8 35.8 35.9 38.0
##
## $Urea
## X[[i]]
## n missing distinct Info Mean Gmd .05 .10
## 79 0 73 1 266.4 150.5 85.8 100.2
## .25 .50 .75 .90 .95
## 160.0 260.0 372.0 432.6 474.3
##
## lowest : 10 64 72 75 87, highest: 473 486 516 550 620
##
## $Calcium
## X[[i]]
## n missing distinct Info Mean Gmd .05 .10
## 79 0 75 1 4.139 3.533 0.758 1.028
## .25 .50 .75 .90 .95
## 1.460 3.160 5.930 8.490 9.671
##
## lowest : 0.17 0.27 0.58 0.65 0.77, highest: 9.39 12.20 12.68 13.00 14.34
Check for linear relationship
First, we should verify whether a linear relationship might exist between two variables, because the Pearson’s correlation analysis is based on a linear model. A linear relationship could be visually verified using the scatter plot.
For example, if we want to evaluate the relationship between Osmolarity and Urea, the scatter plot could be built as follows:
df%>%ggplot(aes(x=Urea,y=Osmolarity))+geom_point()+theme_bw()
Though Pearson’s correlation does not distinguish between a “dependent” and an “independent” variable, most medical studies aim to evaluate a causal association or a key physio-pathological metric, such as a newly developed marker or an important clinical outcome. Such a target variable can be considered as “dependent”, and our analysis consists of matching this variable with the other parameters.
Fitting a regression line is not necessary to verify the linear relationship, as this is not a predictive study. We can simply inspect the distribution of data points and judge whether a linear relationship between the two variables is plausible. Visual inspection is usually sufficient at this stage, because a clearly curved or shapeless cloud of points is easy to recognise by eye. The scatter plot also helps to identify outliers that might distort the analysis.
The above example shows an evident linear relationship between the two variables. Here is another scatter plot that suggests no linear relationship.
df%>%ggplot(aes(x=Ph,y=Calcium))+geom_point()+theme_bw()
Note that data points following a clear but non-linear pattern (a curve, for example) also violate our assumption.
When the linear relationship assumption is violated, we can either use an alternative method such as Spearman’s rho coefficient, or transform the data using a suitable function, such as a Box-Cox or logarithmic transformation.
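The two remedies mentioned above can be illustrated with a short sketch. The data here are simulated (an exponential curve with multiplicative noise, chosen by us only for illustration), so the numbers say nothing about the urine dataset; the point is to compare Pearson’s r on the raw values, Spearman’s rho, and Pearson’s r after a logarithmic transformation.

```r
set.seed(1)

# Simulated curved (exponential) relationship, hypothetical data
x <- seq(1, 10, length.out = 50)
y <- exp(x / 2) * exp(rnorm(50, sd = 0.1))

# Pearson's r on the raw data underestimates this monotone but curved association
r_pearson <- cor(x, y, method = "pearson")

# Option 1: a rank-based coefficient, which only requires a monotone relationship
r_spearman <- cor(x, y, method = "spearman")

# Option 2: transform y so that the relationship becomes linear, then use Pearson's r
r_log <- cor(x, log(y), method = "pearson")

round(c(pearson = r_pearson, spearman = r_spearman, pearson_log = r_log), 3)
```

Whether a transformation is appropriate depends on the variable; a scatter plot of the transformed data should be checked in the same way as the original one.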
Checking for Normal distribution
The normality of the data can be tested by different methods, including the Shapiro-Wilk and Kolmogorov-Smirnov tests (widely used in SPSS). In this tutorial we adopt D’Agostino’s test, applied here to Osmolarity and Conductivity:
df%>%select(.,Osmolarity,Conductivity)%>%map(~fBasics::dagoTest(.))
## $Osmolarity
##
## Title:
## D'Agostino Normality Test
##
## Test Results:
## STATISTIC:
## Chi2 | Omnibus: 3.0753
## Z3 | Skewness: 0.492
## Z4 | Kurtosis: -1.6832
## P VALUE:
## Omnibus Test: 0.2149
## Skewness Test: 0.6227
## Kurtosis Test: 0.09234
##
## Description:
## Tue Feb 28 23:47:35 2017 by user: Admin
##
##
## $Conductivity
##
## Title:
## D'Agostino Normality Test
##
## Test Results:
## STATISTIC:
## Chi2 | Omnibus: 4.687
## Z3 | Skewness: -0.2236
## Z4 | Kurtosis: -2.1534
## P VALUE:
## Omnibus Test: 0.09599
## Skewness Test: 0.8231
## Kurtosis Test: 0.03129
##
## Description:
## Tue Feb 28 23:47:35 2017 by user: Admin
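The omnibus test does not reject normality for either variable (p > 0.05), so we treat both as approximately normally distributed. The assumption would be considered violated if the omnibus p-value were lower than 0.05. Note that the kurtosis component for Conductivity is below 0.05 (p=0.03), which suggests a somewhat flatter distribution than the normal one, although the omnibus test remains non-significant.
This could be reported as:
“Both variables were approximately normally distributed, as assessed by D’Agostino’s omnibus test (p=0.21 for Osmolarity and p=0.10 for Conductivity).”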
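For readers who prefer the Shapiro-Wilk test mentioned above, a minimal sketch with base R could look like this. The values are simulated by us (a hypothetical normal sample), so the result is not comparable with the D’Agostino output for the urine data.

```r
set.seed(42)

# Simulated measurements (hypothetical), drawn from a normal distribution
x <- rnorm(80, mean = 600, sd = 230)

# Shapiro-Wilk normality test from the base stats package
sw <- shapiro.test(x)

sw$statistic   # W statistic
sw$p.value     # a p-value below 0.05 would suggest a departure from normality
```

As with any normality test, a non-significant result does not prove normality; it only means that no clear departure was detected.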
A) Single pair correlation analysis
This section will show you how to carry out the simplest form of Pearson’s correlation analysis that involves only two variables, for example: between Osmolarity and Conductivity
cor.test(df$Osmolarity,df$Conductivity,method="pearson",alternative="two.sided")
##
## Pearson's product-moment correlation
##
## data: df$Osmolarity and df$Conductivity
## t = 12.67, df = 75, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7380665 0.8857619
## sample estimates:
## cor
## 0.8255694
A typical correlation output contains the following information:
The value of Pearson’s correlation coefficient (r): here r = 0.8255694
The p-value of the test against the null hypothesis of zero correlation: p-value < 2.2e-16
df = degrees of freedom = number of paired observations - 2 (the number of pairs is reduced when there are missing values); here df = 75, because 2 of the 79 cases had a missing value and were removed
The 95 percent confidence interval for the r coefficient: lower bound = 0.7380665 and upper bound = 0.8857619
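To see where these numbers come from, the t statistic and the confidence interval can be recomputed by hand from r and the number of complete pairs. The sketch below uses the standard formulas (a t statistic with n - 2 degrees of freedom and a Fisher z-transformation for the interval); the variable names are ours, and the two inputs are copied from the output above.

```r
r <- 0.8255694   # coefficient reported by cor.test above
n <- 77          # complete pairs: 79 subjects minus 2 with a missing value

# t statistic and two-sided p-value with n - 2 degrees of freedom
df_r   <- n - 2
t_stat <- r * sqrt(df_r) / sqrt(1 - r^2)
p_val  <- 2 * pt(-abs(t_stat), df = df_r)

# 95% confidence interval via Fisher's z-transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

round(t_stat, 2)
round(ci, 3)
```

The results should agree with the t value and the interval printed by cor.test, which is a useful check that the degrees of freedom were counted correctly.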
How to interpret this result
The Pearson’s r could range from -1 to +1.
A zero value indicates no linear association between the variables.
A positive value indicates a positive association (i.e. when one variable increases, so does the other).
A negative value indicates a negative association (i.e. one variable decreases as the other increases, and vice versa).
The closer r is to +1 or -1, the stronger the association. A value of exactly +1 or -1 means that the two variables are perfectly linearly related, so that the variance of one variable is totally explained by the variance of the other.
The closer r is to 0, the greater the dispersion of the data points around the fitted line and the weaker the association between the two variables.
An absolute value of r between 0.1 and 0.3 indicates a weak correlation, from 0.3 to 0.5 a medium correlation, and from 0.5 to 1 a strong correlation.
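The cut-offs listed above can be turned into a small labelling function. This is a hypothetical helper written for illustration (the name label_strength and the label “negligible” for values below 0.1 are our assumptions, since the text does not name that range).

```r
# Hypothetical helper: labels |r| using the cut-offs listed above
# (0.1 to 0.3 weak, 0.3 to 0.5 medium, 0.5 to 1 strong; "negligible" below 0.1 is our own label)
label_strength <- function(r) {
  a <- abs(r)
  lab <- cut(a,
             breaks = c(0, 0.1, 0.3, 0.5, 1),
             labels = c("negligible", "weak", "medium", "strong"),
             right = FALSE, include.lowest = TRUE)
  as.character(lab)
}

# Example on a few coefficients of different sizes and signs
label_strength(c(-0.24, 0.35, 0.83))
```

The sign is ignored on purpose: strength is judged on the absolute value, while direction is reported separately.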
In our example
As the sign of the Pearson correlation coefficient is positive, we can say that Osmolarity is positively correlated with Conductivity (i.e. Conductivity tends to be higher when Osmolarity is higher). A safer wording is: “higher values of Osmolarity are associated with higher values of Conductivity”, since the verbs “increase” or “decrease” are not always appropriate (a biomarker does not increase by itself; the disease causes the change).
As mentioned above, the size of r determines how strong the relationship is. As a simple rule, when the absolute value of r is higher than 0.6 we can interpret it as a “large” or “strong” correlation, and when it is below 0.5 we can describe a “weak” or “small” relationship. Such cut-offs are conventions rather than strict rules, so it is good practice to state which one you use.
Our result could be reported as: “There was a strong positive correlation between Osmolarity and Conductivity (r(75)=0.826; p<0.0001)”
Graphical presentation
df%>%ggplot(aes(x=Osmolarity,y=Conductivity))+geom_jitter(shape=21,size=3,color="black",fill="grey50",alpha=0.7)+geom_smooth(method="lm",color="red4",fill="red2",alpha=0.3)+theme_bw()
The effect size of Pearson’s coefficient
You may be surprised to discover that the effect size associated with Pearson’s r is exactly the coefficient of determination of a simple linear regression model, or R squared.
The R2 indicates how much of the variance in the dependent variable can be explained by the variance in the independent variable.
The following experiment shows that Pearson’s coefficient can be recovered as the square root of the R2 of a simple linear model (with the sign of the slope):
fit=lm(data=df,Conductivity~Osmolarity)
fit%>%summary()
##
## Call:
## lm(formula = Conductivity ~ Osmolarity, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.797 -2.652 0.105 2.569 11.227
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.973392 1.433314 2.772 0.00702 **
## Osmolarity 0.027594 0.002178 12.670 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.547 on 75 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.6816, Adjusted R-squared: 0.6773
## F-statistic: 160.5 on 1 and 75 DF, p-value: < 2.2e-16
fit%>%summary()%>%.$r.squared%>%sqrt()
## [1] 0.8255694
The R2 is 0.682 (the “Multiple R-squared” in the output above; 0.677 is the adjusted value), so we could interpret the effect size of r as: “Osmolarity could explain 68.2% of the variance in Conductivity”
Bootstrapping a paired correlation
The following procedure estimates a confidence interval for Pearson’s correlation coefficient by bootstrap resampling. Such a method is not mandatory, but it can be helpful if we have a small sample size or outliers, or if we simply want to check how stable the result is.
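# Bootstrap statistic: Pearson's r and its p-value on the resampled rows
# boot() supplies the data and the resampled row indices i; x and y are column names
corboot1=function(x,y,data,i){
d=data[i,]
xt=d[,x]
yt=d[,y]
coef=cor(xt,yt,method="pearson",use="pairwise.complete.obs")%>%.[1]
pval=psych::corr.test(xt,yt,use="pairwise")%>%.$p%>%.[1]
cbind(coef,pval)
}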
set.seed(123)
library(boot)
res1=boot(statistic=corboot1,x="Osmolarity",y="Conductivity",data=df,R=1000)%>%.$t%>%as_tibble()
names(res1)=c("Coefficient","Pvalue")
res1$Iteration=c(1:nrow(res1))
p1=res1%>%ggplot(aes(x=Coefficient))+geom_histogram(color="black",fill="gold")+geom_vline(xintercept=0.5,color="red4",linetype="dashed",size=1)+theme_bw()
p2=res1%>%ggplot(aes(x=Iteration,y=Coefficient))+geom_path(color="purple",size=0.8,alpha=0.7)+geom_hline(yintercept=mean(res1$Coefficient),color="red4",linetype="dashed",size=1)+theme_bw()
library(gridExtra)
grid.arrange(p1,p2)
Hmisc::describe(res1[,1])
## res1[, 1]
##
## 1 Variables 1000 Observations
## ---------------------------------------------------------------------------
## Coefficient
## n missing distinct Info Mean Gmd .05 .10
## 1000 0 1000 1 0.8269 0.04559 0.7580 0.7705
## .25 .50 .75 .90 .95
## 0.8008 0.8298 0.8550 0.8772 0.8894
##
## lowest : 0.6651462 0.6813705 0.6929213 0.7074255 0.7117356
## highest: 0.9119198 0.9128613 0.9132657 0.9244275 0.9272008
## ---------------------------------------------------------------------------
The bootstrap result indicates that the 5th and 95th percentiles of Pearson’s r between Osmolarity and Conductivity are 0.76 and 0.89 (min=0.67, max=0.93). Strictly speaking this is a 90% percentile interval; a 95% interval would use the 2.5th and 97.5th percentiles instead. These results are in good agreement with the 95% CI provided by the cor.test function (0.74 to 0.89). As the interval does not contain zero, we can confirm the significance of the relationship (this is an alternative way to verify significance without using a p-value).
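A 95% percentile interval can be read directly from the bootstrap replicates with quantile(), or obtained with boot.ci() from the boot package. The sketch below runs on simulated paired data (the data frame toy and the function boot_r are hypothetical names of ours), so its numbers are not those of the urine dataset.

```r
library(boot)

set.seed(123)

# Simulated paired data (hypothetical, not the urine dataset)
n <- 77
x <- rnorm(n)
y <- 0.8 * x + rnorm(n, sd = 0.6)
toy <- data.frame(x = x, y = y)

# Bootstrap statistic: Pearson's r computed on the resampled rows
boot_r <- function(data, i) cor(data$x[i], data$y[i])

res <- boot(data = toy, statistic = boot_r, R = 1000)

# 95% percentile interval: 2.5th and 97.5th percentiles of the bootstrap coefficients
ci95 <- unname(quantile(res$t[, 1], probs = c(0.025, 0.975)))
ci95

# The same kind of interval computed by boot.ci()
boot.ci(res, conf = 0.95, type = "perc")
```

The two intervals may differ slightly in the last decimals, because boot.ci() interpolates between order statistics in its own way.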
B) Pair-wise correlation analysis
In practice, Pearson’s coefficient for a single pair is rarely the key result (except when the study question focuses on this specific relationship). In most situations, the analysis consists of exploring the relationship between a given target variable and ALL remaining variables in the dataset. To avoid copy-pasting the same call, we can write a function that loops over the variables in R:
correlation=function(data,x,y){
n=ncol(data[,y])
cormtx=cbind(Target=rep(NA,n),Coefficient=rep(NA,n),Pvalue=rep(NA,n))%>%as.data.frame()
for (i in 1:n){
var=data[,x]
targ=data[,y]%>%.[,i]
coef=cor(var,targ,method="pearson",use="pairwise.complete.obs")%>%.[1]
pval=psych::corr.test(var,targ,use="pairwise")%>%.$p%>%.[1]
cormtx$Target[i]=colnames(targ)
cormtx$Coefficient[i]=coef
cormtx$Pvalue[i]=pval
}
return(cormtx)
}
correlation(data=df,x="Osmolarity",y=c("Ph","Calcium","Conductivity","Gravity","Urea"))
## Target Coefficient Pvalue
## 1 Ph -0.2373956 3.636796e-02
## 2 Calcium 0.5248635 8.097138e-07
## 3 Conductivity 0.8255694 0.000000e+00
## 4 Gravity 0.8710360 0.000000e+00
## 5 Urea 0.8893390 0.000000e+00
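In our example, we performed a pair-wise correlation analysis between Osmolarity and Ph, Calcium, Conductivity, Gravity and Urea. P-values displayed as 0 are too small to be represented by this calculation and should be reported as p<0.001 rather than as exactly zero.
The function “correlation” above can be reused for your own study.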
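If you prefer not to depend on the tidyverse and psych packages, a similar table can be built with base R only. The following is a sketch under our own assumptions: pairwise_cor is a hypothetical name, and the data are simulated. It relies on cor.test(), which drops incomplete pairs by itself, and it also returns the degrees of freedom needed for reporting.

```r
# Hypothetical base-R variant: correlates one target column with several others
pairwise_cor <- function(data, target, others) {
  rows <- lapply(others, function(v) {
    ct <- cor.test(data[[target]], data[[v]], method = "pearson")
    data.frame(Target = v,
               Coefficient = unname(ct$estimate),
               DF = unname(ct$parameter),   # number of complete pairs minus 2
               Pvalue = ct$p.value)
  })
  do.call(rbind, rows)
}

# Simulated data (hypothetical) with one missing value
set.seed(1)
toy <- data.frame(a = rnorm(30))
toy$b <- toy$a + rnorm(30, sd = 0.5)
toy$c <- rnorm(30)
toy$b[3] <- NA

out <- pairwise_cor(toy, target = "a", others = c("b", "c"))
out
```

Because the degrees of freedom are computed per pair, the table makes it visible when a missing value reduces the sample for one variable only.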
C) Correlation Matrix
A more comprehensive way to explore the whole matrix of numeric variables is explained in this section:
First, we can use the GGally package for this purpose. It offers an integrated approach that includes assumption checks, correlation coefficients and graphical presentation:
plotfuncLow <- function(data,mapping){
p <- ggplot(data = data,mapping=mapping)+geom_point(shape=21,color="black",fill="grey50")+geom_smooth(method="lm",color="red4",fill="red2",alpha=0.3)+theme_bw()
p
}
plotfuncmid <- function(data,mapping){
p <- ggplot(data = data,mapping=mapping)+geom_density(alpha=0.5,color="black",fill="red")+theme_bw()
p
}
library(GGally)
ggpairs(df,columns=3:8,lower=list(continuous=plotfuncLow),diag=list(continuous=plotfuncmid))
Simplified correlation matrix
The corrplot package provides an alternative method for representing the correlation matrix:
cor.mtest <- function(mat, conf.level = 0.95){
mat <- as.matrix(mat)
n <- ncol(mat)
p.mat <- lowCI.mat <- uppCI.mat <- matrix(NA, n, n)
diag(p.mat) <- 0
diag(lowCI.mat) <- diag(uppCI.mat) <- 1
for(i in 1:(n-1)){
for(j in (i+1):n){
tmp <- cor.test(mat[,i], mat[,j], conf.level = conf.level)
p.mat[i,j] <- p.mat[j,i] <- tmp$p.value
lowCI.mat[i,j] <- lowCI.mat[j,i] <- tmp$conf.int[1]
uppCI.mat[i,j] <- uppCI.mat[j,i] <- tmp$conf.int[2]
}
}
return(list(p.mat, lowCI.mat, uppCI.mat))
}
cormat<-df[,c(3:8)]%>%cor.mtest(.,0.95)
library("corrplot")
library(viridis)
df[,c(3:8)]%>%cor(.,method="pearson",use="pairwise.complete.obs")%>%corrplot(.,p.mat=cormat[[1]],sig.level=0.05,type="lower",method="pie",tl.col="black", tl.srt=45,col=viridis::plasma(n=100,begin =0.9, end = 0.4))
Finally, the numerical results can be obtained using either the cor function (base R) or the rcorr function (Hmisc package). Note: the latter only accepts a matrix as input.
df[,c(3:8)]%>%cor(.,method="pearson",use="pairwise.complete.obs")
## Gravity Ph Osmolarity Conductivity Urea
## Gravity 1.0000000 -0.2533402 0.8710360 0.5668070 0.8234770
## Ph -0.2533402 1.0000000 -0.2373956 -0.1172726 -0.2755569
## Osmolarity 0.8710360 -0.2373956 1.0000000 0.8255694 0.8893390
## Conductivity 0.5668070 -0.1172726 0.8255694 1.0000000 0.5189940
## Urea 0.8234770 -0.2755569 0.8893390 0.5189940 1.0000000
## Calcium 0.5256987 -0.1194878 0.5248635 0.3475249 0.5023267
## Calcium
## Gravity 0.5256987
## Ph -0.1194878
## Osmolarity 0.5248635
## Conductivity 0.3475249
## Urea 0.5023267
## Calcium 1.0000000
df[,c(3:8)]%>%as.matrix()%>%Hmisc::rcorr(.,type="pearson")
## Gravity Ph Osmolarity Conductivity Urea Calcium
## Gravity 1.00 -0.25 0.87 0.57 0.82 0.53
## Ph -0.25 1.00 -0.24 -0.12 -0.28 -0.12
## Osmolarity 0.87 -0.24 1.00 0.83 0.89 0.52
## Conductivity 0.57 -0.12 0.83 1.00 0.52 0.35
## Urea 0.82 -0.28 0.89 0.52 1.00 0.50
## Calcium 0.53 -0.12 0.52 0.35 0.50 1.00
##
## n
## Gravity Ph Osmolarity Conductivity Urea Calcium
## Gravity 79 79 78 78 79 79
## Ph 79 79 78 78 79 79
## Osmolarity 78 78 78 77 78 78
## Conductivity 78 78 77 78 78 78
## Urea 79 79 78 78 79 79
## Calcium 79 79 78 78 79 79
##
## P
## Gravity Ph Osmolarity Conductivity Urea Calcium
## Gravity 0.0243 0.0000 0.0000 0.0000 0.0000
## Ph 0.0243 0.0364 0.3065 0.0140 0.2942
## Osmolarity 0.0000 0.0364 0.0000 0.0000 0.0000
## Conductivity 0.0000 0.3065 0.0000 0.0000 0.0018
## Urea 0.0000 0.0140 0.0000 0.0000 0.0000
## Calcium 0.0000 0.2942 0.0000 0.0018 0.0000
As our study question focuses on Osmolarity as the target variable, we can read the correlation matrix as follows:
Except for Ph, which is weakly and negatively correlated with Osmolarity (r=-0.24), all other variables showed a positive correlation with our target variable: strong for Gravity, Conductivity and Urea (r=0.87, 0.83 and 0.89, respectively) and more moderate for Calcium (r=0.52). These correlations were all statistically significant.
Bootstrapping a correlation matrix
Bootstrapping a correlation matrix is a little more complicated than bootstrapping a single pair. We can do this with a three-step procedure: first, we define a corboot function that directly calls the cor function. Then we apply this corboot function to our target matrix. Finally, we use a loop to assign the names of the pairs and explore the final data frame.
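mat=df[,c(3:8)]
# Statistic for boot(): the full correlation matrix of the resampled rows
# (cor() keeps its default NA handling here, so pairs involving a missing value return NA)
corboot=function(data,i){
cor(data[i,])
}
set.seed(123)
library(boot)
# Each row of res is one bootstrap replicate; its 36 columns are the flattened 6 x 6 matrix
res=boot(statistic=corboot,data=mat,R=1000)%>%.$t%>%as_tibble()
ldf=data.frame(NULL)
n=0
# Split the 36 columns into 6 blocks of 6 and label each block with its variable name
for (i in 1:ncol(mat)){
var=colnames(mat)[i]
b=n+1
e=n+ncol(mat)
dt=res[,c(b:e)]
names(dt)=colnames(mat)
dt$Variable=var
ldf=rbind(ldf,dt)
n=n+ncol(mat)
}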
ldf=ldf%>%gather(Gravity:Calcium,key="Target",value="Coefficient")
ldf%>%ggplot(aes(x=Coefficient,fill=Variable))+geom_histogram(show.legend = F,color="black")+facet_grid(Variable~Target,scales="free")+geom_vline(xintercept=0,color="blue",linetype="dashed")+scale_x_continuous(limits = c(-1,1))
ldf2=ldf%>%unite(Pair,Variable,Target)
ldf2%>%split(.$Pair)%>%map(~Hmisc::describe(.$Coefficient))
## $Calcium_Calcium
## .$Coefficient
## n missing distinct Info Mean Gmd
## 1000 0 1 0 1 0
##
## Value 1
## Frequency 1000
## Proportion 1
##
## $Calcium_Conductivity
## .$Coefficient
## n missing distinct Info Mean Gmd .05 .10
## 377 623 377 1 0.3467 0.09782 0.1974 0.2244
## .25 .50 .75 .90 .95
## 0.2901 0.3529 0.4074 0.4549 0.4771
##
## lowest : 0.08957106 0.13694367 0.14097150 0.14522399 0.14978924
## highest: 0.53203186 0.53555428 0.54089242 0.54399090 0.56215332
##
## $Calcium_Gravity
## .$Coefficient
## n missing distinct Info Mean Gmd .05 .10
## 1000 0 1000 1 0.5239 0.1039 0.3676 0.3980
## .25 .50 .75 .90 .95
## 0.4644 0.5286 0.5900 0.6389 0.6640
##
## lowest : 0.1636405 0.1868441 0.1890655 0.2180693 0.2704973
## highest: 0.7278133 0.7395206 0.7419029 0.7452068 0.7814306
##
## $Calcium_Osmolarity
## .$Coefficient
## n missing distinct Info Mean Gmd .05 .10
## 382 618 382 1 0.5144 0.09846 0.3669 0.3959
## .25 .50 .75 .90 .95
## 0.4600 0.5162 0.5793 0.6205 0.6420
##
## lowest : 0.2243556 0.2270902 0.2784960 0.2892525 0.2978975
## highest: 0.6851170 0.6887083 0.7000109 0.7123718 0.7237371
##
## $Calcium_Ph
## .$Coefficient
## n missing distinct Info Mean Gmd .05 .10
## 1000 0 1000 1 -0.1219 0.1435 -0.32955 -0.28901
## .25 .50 .75 .90 .95
## -0.20982 -0.12038 -0.03506 0.03550 0.08541
##
## lowest : -0.5033614 -0.4894865 -0.4434884 -0.4292487 -0.4167988
## highest: 0.2024711 0.2079475 0.2247542 0.2249787 0.2340928
##
## $Calcium_Urea
## .$Coefficient
## n missing distinct Info Mean Gmd .05 .10
## 1000 0 1000 1 0.5012 0.1069 0.3419 0.3795
## .25 .50 .75 .90 .95
## 0.4377 0.5034 0.5659 0.6197 0.6484
##
## lowest : 0.1861203 0.1962297 0.1984450 0.2221892 0.2375499
## highest: 0.7297510 0.7304352 0.7505378 0.7672036 0.7854312
##
## $Conductivity_Calcium
## .$Coefficient
## n missing distinct Info Mean Gmd .05 .10
## 377 623 377 1 0.3467 0.09782 0.1974 0.2244
## .25 .50 .75 .90 .95
## 0.2901 0.3529 0.4074 0.4549 0.4771
##
## lowest : 0.08957106 0.13694367 0.14097150 0.14522399 0.14978924
## highest: 0.53203186 0.53555428 0.54089242 0.54399090 0.56215332
##
## $Conductivity_Conductivity
## .$Coefficient
## n missing distinct Info Mean Gmd
## 1000 0 1 0 1 0
##
## Value 1
## Frequency 1000
## Proportion 1
##
## $Conductivity_Gravity
## .$Coefficient
## n missing distinct Info Mean Gmd .05 .10
## 377 623 377 1 0.5708 0.1092 0.3932 0.4380
## .25 .50 .75 .90 .95
## 0.5123 0.5791 0.6318 0.6927 0.7274
##
## lowest : 0.2662446 0.2876707 0.3071494 0.3202558 0.3269361
## highest: 0.7661535 0.7668432 0.7709550 0.7886193 0.8145255
##
## $Conductivity_Osmolarity
## .$Coefficient
## n missing distinct Info Mean Gmd .05 .10
## 142 858 142 1 0.8239 0.0452 0.7632 0.7706
## .25 .50 .75 .90 .95
## 0.7976 0.8247 0.8541 0.8699 0.8768
##
## lowest : 0.6813705 0.7294604 0.7325744 0.7355072 0.7466821
## highest: 0.8963830 0.8967896 0.8974084 0.9115176 0.9118883
##
## $Conductivity_Ph
## .$Coefficient
## n missing distinct Info Mean Gmd .05
## 377 623 377 1 -0.1216 0.1086 -0.273668
## .10 .25 .50 .75 .90 .95
## -0.241532 -0.194595 -0.120748 -0.054709 -0.007805 0.034417
##
## lowest : -0.39273545 -0.36641813 -0.36575779 -0.36340982 -0.35213337
## highest: 0.08314969 0.12017868 0.16424574 0.17210259 0.19389733
##
## $Conductivity_Urea
## .$Coefficient
## n missing distinct Info Mean Gmd .05 .10
## 377 623 377 1 0.5218 0.09333 0.3742 0.4124
## .25 .50 .75 .90 .95
## 0.4694 0.5290 0.5843 0.6209 0.6470
##
## lowest : 0.2489203 0.2745463 0.2899516 0.3124759 0.3384205
## highest: 0.6935489 0.6936601 0.7037591 0.7075634 0.7443859
##
## $Gravity_Calcium
## .$Coefficient
## n missing distinct Info Mean Gmd .05 .10
## 1000 0 1000 1 0.5239 0.1039 0.3676 0.3980
## .25 .50 .75 .90 .95
## 0.4644 0.5286 0.5900 0.6389 0.6640
##
## lowest : 0.1636405 0.1868441 0.1890655 0.2180693 0.2704973
## highest: 0.7278133 0.7395206 0.7419029 0.7452068 0.7814306
##
## $Gravity_Conductivity
## .$Coefficient
## n missing distinct Info Mean Gmd .05 .10
## 377 623 377 1 0.5708 0.1092 0.3932 0.4380
## .25 .50 .75 .90 .95
## 0.5123 0.5791 0.6318 0.6927 0.7274
##
## lowest : 0.2662446 0.2876707 0.3071494 0.3202558 0.3269361
## highest: 0.7661535 0.7668432 0.7709550 0.7886193 0.8145255
##
## $Gravity_Gravity
## .$Coefficient
## n missing distinct Info Mean Gmd
## 1000 0 1 0 1 0
##
## Value 1
## Frequency 1000
## Proportion 1
##
## $Gravity_Osmolarity
## .$Coefficient
## n missing distinct Info Mean Gmd .05 .10
## 382 618 382 1 0.8737 0.05644 0.7863 0.8023
## .25 .50 .75 .90 .95
## 0.8444 0.8793 0.9113 0.9336 0.9459
##
## lowest : 0.7094249 0.7232943 0.7274568 0.7316258 0.7404852
## highest: 0.9598781 0.9606117 0.9611458 0.9641672 0.9672925
##
## $Gravity_Ph
## .$Coefficient
## n missing distinct Info Mean Gmd .05 .10
## 1000 0 1000 1 -0.2568 0.1103 -0.4201 -0.3794
## .25 .50 .75 .90 .95
## -0.3217 -0.2576 -0.1869 -0.1321 -0.1023
##
## lowest : -0.630683327 -0.549624205 -0.548206978 -0.523962612 -0.521183392
## highest: -0.026580885 -0.023248859 -0.020900122 -0.006203768 0.016260853
##
## $Gravity_Urea
## .$Coefficient
## n missing distinct Info Mean Gmd .05 .10
## 1000 0 1000 1 0.8255 0.07757 0.6966 0.7281
## .25 .50 .75 .90 .95
## 0.7846 0.8331 0.8823 0.9052 0.9150
##
## lowest : 0.5780316 0.5814722 0.5847227 0.5991420 0.6058019
## highest: 0.9391618 0.9395732 0.9396153 0.9464864 0.9555937
##
## $Osmolarity_Calcium
## .$Coefficient
## n missing distinct Info Mean Gmd .05 .10
## 382 618 382 1 0.5144 0.09846 0.3669 0.3959
## .25 .50 .75 .90 .95
## 0.4600 0.5162 0.5793 0.6205 0.6420
##
## lowest : 0.2243556 0.2270902 0.2784960 0.2892525 0.2978975
## highest: 0.6851170 0.6887083 0.7000109 0.7123718 0.7237371
##
## $Osmolarity_Conductivity
## .$Coefficient
## n missing distinct Info Mean Gmd .05 .10
## 142 858 142 1 0.8239 0.0452 0.7632 0.7706
## .25 .50 .75 .90 .95
## 0.7976 0.8247 0.8541 0.8699 0.8768
##
## lowest : 0.6813705 0.7294604 0.7325744 0.7355072 0.7466821
## highest: 0.8963830 0.8967896 0.8974084 0.9115176 0.9118883
##
## $Osmolarity_Gravity
## .$Coefficient
## n missing distinct Info Mean Gmd .05 .10
## 382 618 382 1 0.8737 0.05644 0.7863 0.8023
## .25 .50 .75 .90 .95
## 0.8444 0.8793 0.9113 0.9336 0.9459
##
## lowest : 0.7094249 0.7232943 0.7274568 0.7316258 0.7404852
## highest: 0.9598781 0.9606117 0.9611458 0.9641672 0.9672925
##
## $Osmolarity_Osmolarity
## .$Coefficient
## n missing distinct Info Mean Gmd
## 1000 0 1 0 1 0
##
## Value 1
## Frequency 1000
## Proportion 1
##
## $Osmolarity_Ph
## .$Coefficient
## n missing distinct Info Mean Gmd .05 .10
## 382 618 382 1 -0.2418 0.1061 -0.39200 -0.36302
## .25 .50 .75 .90 .95
## -0.30120 -0.24289 -0.17769 -0.12257 -0.08785
##
## lowest : -0.485944879 -0.485328177 -0.469680809 -0.466464992 -0.453425383
## highest: -0.019575968 -0.019513772 0.001365378 0.015483628 0.102608070
##
## $Osmolarity_Urea
## .$Coefficient
## n missing distinct Info Mean Gmd .05 .10
## 382 618 382 1 0.8885 0.02461 0.8515 0.8611
## .25 .50 .75 .90 .95
## 0.8749 0.8889 0.9034 0.9155 0.9222
##
## lowest : 0.8227201 0.8247502 0.8305188 0.8329580 0.8339577
## highest: 0.9374534 0.9376413 0.9386876 0.9417923 0.9501728
##
## $Ph_Calcium
## .$Coefficient
## n missing distinct Info Mean Gmd .05 .10
## 1000 0 1000 1 -0.1219 0.1435 -0.32955 -0.28901
## .25 .50 .75 .90 .95
## -0.20982 -0.12038 -0.03506 0.03550 0.08541
##
## lowest : -0.5033614 -0.4894865 -0.4434884 -0.4292487 -0.4167988
## highest: 0.2024711 0.2079475 0.2247542 0.2249787 0.2340928
##
## $Ph_Conductivity
## .$Coefficient
## n missing distinct Info Mean Gmd .05
## 377 623 377 1 -0.1216 0.1086 -0.273668
## .10 .25 .50 .75 .90 .95
## -0.241532 -0.194595 -0.120748 -0.054709 -0.007805 0.034417
##
## lowest : -0.39273545 -0.36641813 -0.36575779 -0.36340982 -0.35213337
## highest: 0.08314969 0.12017868 0.16424574 0.17210259 0.19389733
##
## $Ph_Gravity
## .$Coefficient
## n missing distinct Info Mean Gmd .05 .10
## 1000 0 1000 1 -0.2568 0.1103 -0.4201 -0.3794
## .25 .50 .75 .90 .95
## -0.3217 -0.2576 -0.1869 -0.1321 -0.1023
##
## lowest : -0.630683327 -0.549624205 -0.548206978 -0.523962612 -0.521183392
## highest: -0.026580885 -0.023248859 -0.020900122 -0.006203768 0.016260853
##
## $Ph_Osmolarity
## .$Coefficient
## n missing distinct Info Mean Gmd .05 .10
## 382 618 382 1 -0.2418 0.1061 -0.39200 -0.36302
## .25 .50 .75 .90 .95
## -0.30120 -0.24289 -0.17769 -0.12257 -0.08785
##
## lowest : -0.485944879 -0.485328177 -0.469680809 -0.466464992 -0.453425383
## highest: -0.019575968 -0.019513772 0.001365378 0.015483628 0.102608070
##
## $Ph_Ph
## .$Coefficient
## n missing distinct Info Mean Gmd
## 1000 0 1 0 1 0
##
## Value 1
## Frequency 1000
## Proportion 1
##
## $Ph_Urea
## .$Coefficient
## n missing distinct Info Mean Gmd .05 .10
## 1000 0 1000 1 -0.2749 0.1142 -0.43821 -0.40861
## .25 .50 .75 .90 .95
## -0.34662 -0.27352 -0.20821 -0.14382 -0.09705
##
## lowest : -0.5663802067 -0.5596691597 -0.5586663955 -0.5230563505 -0.5229045050
## highest: -0.0107582658 0.0007919951 0.0069800099 0.0352025917 0.0427327604
##
## $Urea_Calcium
## .$Coefficient
## n missing distinct Info Mean Gmd .05 .10
## 1000 0 1000 1 0.5012 0.1069 0.3419 0.3795
## .25 .50 .75 .90 .95
## 0.4377 0.5034 0.5659 0.6197 0.6484
##
## lowest : 0.1861203 0.1962297 0.1984450 0.2221892 0.2375499
## highest: 0.7297510 0.7304352 0.7505378 0.7672036 0.7854312
##
## $Urea_Conductivity
## .$Coefficient
## n missing distinct Info Mean Gmd .05 .10
## 377 623 377 1 0.5218 0.09333 0.3742 0.4124
## .25 .50 .75 .90 .95
## 0.4694 0.5290 0.5843 0.6209 0.6470
##
## lowest : 0.2489203 0.2745463 0.2899516 0.3124759 0.3384205
## highest: 0.6935489 0.6936601 0.7037591 0.7075634 0.7443859
##
## $Urea_Gravity
## .$Coefficient
## n missing distinct Info Mean Gmd .05 .10
## 1000 0 1000 1 0.8255 0.07757 0.6966 0.7281
## .25 .50 .75 .90 .95
## 0.7846 0.8331 0.8823 0.9052 0.9150
##
## lowest : 0.5780316 0.5814722 0.5847227 0.5991420 0.6058019
## highest: 0.9391618 0.9395732 0.9396153 0.9464864 0.9555937
##
## $Urea_Osmolarity
## .$Coefficient
## n missing distinct Info Mean Gmd .05 .10
## 382 618 382 1 0.8885 0.02461 0.8515 0.8611
## .25 .50 .75 .90 .95
## 0.8749 0.8889 0.9034 0.9155 0.9222
##
## lowest : 0.8227201 0.8247502 0.8305188 0.8329580 0.8339577
## highest: 0.9374534 0.9376413 0.9386876 0.9417923 0.9501728
##
## $Urea_Ph
## .$Coefficient
## n missing distinct Info Mean Gmd .05 .10
## 1000 0 1000 1 -0.2749 0.1142 -0.43821 -0.40861
## .25 .50 .75 .90 .95
## -0.34662 -0.27352 -0.20821 -0.14382 -0.09705
##
## lowest : -0.5663802067 -0.5596691597 -0.5586663955 -0.5230563505 -0.5229045050
## highest: -0.0107582658 0.0007919951 0.0069800099 0.0352025917 0.0427327604
##
## $Urea_Urea
## .$Coefficient
## n missing distinct Info Mean Gmd
## 1000 0 1 0 1 0
##
## Value 1
## Frequency 1000
## Proportion 1
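In the output above, every pair involving Osmolarity or Conductivity shows a large number of missing coefficients (for example n = 377, missing = 623). This happens because cor() is called with its default handling of missing values, so any resample that includes a row with a missing value yields NA for the affected pairs. A possible remedy, sketched below on simulated data (toy, boot_default and boot_pairwise are hypothetical names of ours), is to request pairwise deletion inside the bootstrap statistic.

```r
library(boot)

set.seed(123)

# Simulated data with two missing cells (hypothetical, not the urine dataset)
toy <- data.frame(u = rnorm(79))
toy$v <- toy$u + rnorm(79, sd = 0.5)
toy$w <- rnorm(79)
toy$u[1] <- NA
toy$v[2] <- NA

# Default NA handling: a missing value in the resample makes the affected coefficients NA
boot_default <- function(data, i) as.vector(cor(data[i, ]))

# Pairwise deletion inside each resample
boot_pairwise <- function(data, i) {
  as.vector(cor(data[i, ], use = "pairwise.complete.obs"))
}

res_default  <- boot(data = toy, statistic = boot_default,  R = 200)
res_pairwise <- boot(data = toy, statistic = boot_pairwise, R = 200)

# Share of missing bootstrap coefficients for each cell of the flattened 3 x 3 matrix
na_default  <- colMeans(is.na(res_default$t))
na_pairwise <- colMeans(is.na(res_pairwise$t))

round(na_default, 2)
round(na_pairwise, 2)
```

With pairwise deletion every replicate contributes a coefficient, so the percentiles of all pairs are based on the same number of bootstrap samples.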
Reporting the results
Putting everything together, we could report the results of our study as follows:
“The associations between Osmolarity and the other variables were evaluated by Pearson’s r coefficient.
Except for urinary pH, which showed a weak negative correlation with our target variable (r(76)=-0.24, p=0.036), strong, positive and significant correlations were found between Osmolarity and most other variables. For instance, Urea, Gravity and Conductivity could explain 79.1%, 75.9% and 68.2% of the variance in Osmolarity, respectively. The Calcium concentration showed a medium but significant correlation with Osmolarity (r=0.52).”
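To keep the reporting format consistent across many pairs, the statistics can be formatted by a small function. The helper below is hypothetical (report_r is our own name, and the rounding choices are an assumption rather than a fixed standard), and it is demonstrated on simulated data.

```r
# Hypothetical helper: formats a cor.test() result as "r(df) = value, p = value"
report_r <- function(ct) {
  p <- ct$p.value
  p_txt <- if (p < 0.001) "p < 0.001" else sprintf("p = %.3f", p)
  sprintf("r(%d) = %.2f, %s",
          as.integer(ct$parameter),   # degrees of freedom (complete pairs minus 2)
          unname(ct$estimate),        # Pearson's r
          p_txt)
}

# Simulated example (hypothetical data)
set.seed(7)
x <- rnorm(40)
y <- 0.5 * x + rnorm(40)
txt <- report_r(cor.test(x, y, method = "pearson"))
txt
```

Taking the degrees of freedom from the test object, rather than typing them by hand, avoids mistakes when some pairs have missing values.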
Exercise
Pick another target variable from the dataset and perform a pair-wise correlation analysis on it
Reproduce the same analysis on your own dataset
Thank you