Case study 3: Pearson’s product-moment correlation coefficient

Foreword: The Elementary Statistics for Medical Students (ESMS) project

This is the third tutorial of our ESMS project, which aims to help Vietnamese medical students develop their skills in the R statistical programming language. We publish problem-based tutorials weekly; each one shows students how to answer a common study question using appropriate methods.

As we have seen in the two previous tutorials, our approach differs from what is taught in medical school, as we focus on R code, graphical presentation and interpretation skills together. We also introduce R, which we consider an excellent tool for developing both creativity and knowledge in students. By writing out a statistical procedure in R, the learner can understand a method at a deeper level and take control of everything, down to the smallest detail.

Introduction

Pearson’s product-moment correlation coefficient (r) measures the strength and direction of the linear association between two continuous variables. The r coefficient is defined as the covariance of the two variables divided by the product of their standard deviations. This method is directly related to the linear regression model: we fit a simple linear model to the two variables, then measure how far the data points deviate from this line, that is, how well the data points fit our model.
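The definition above can be checked by hand. The following is a minimal sketch that uses the built-in mtcars dataset as a stand-in (an assumption made only so that the example runs without downloading anything); the variable names r_manual and r_builtin are ours.

```r
# Pearson's r computed from its definition, using the built-in mtcars data
x <- mtcars$wt
y <- mtcars$mpg

# Covariance of the two variables divided by the product of their standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))

# The same coefficient as returned by the cor function
r_builtin <- cor(x, y, method = "pearson")

# Both values should agree up to floating-point error
all.equal(r_manual, r_builtin)
```

Because cov and sd both use the n - 1 denominator, the two computations coincide.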

Though correlation analysis is only one part of medical research and, in most cases, is not enough to establish a causal inference, this method still plays an important role. Pearson’s coefficient is the simplest way to explore the relationships among continuous variables in a dataset; sometimes this leads to unexpected discoveries and allows us to develop new hypotheses.

This tutorial will guide you through a standard procedure for evaluating a linear relationship. We also introduce a bootstrapping method, and show how to interpret and report the results.

Context

Our case study enrolled 34 patients with calcium oxalate crystals and 45 normal subjects. Their urine specimens were analysed to evaluate physical characteristics such as specific gravity, pH, Osmolarity, Conductivity, and Urea and Calcium concentrations (mmol/L). Taking Osmolarity as the main outcome, we want to evaluate the relationship between this outcome and the other urinary indices.

Data preparation

The original data were provided by Andrews, D.F. and Herzberg, A.M. (1985) Data: A Collection of Problems from Many Fields for the Student and Research Worker. Springer-Verlag. They can be downloaded from the http://vincentarelbundock.github.io website.

library(tidyverse)

df=read.csv("http://vincentarelbundock.github.io/Rdatasets/csv/boot/urine.csv")%>%as_tibble()

names(df)=c("Id","ClassCOC","Gravity","Ph","Osmolarity","Conductivity","Urea","Calcium")
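# Recode the outcome as a labelled factor (plain assignment: %<>% belongs to magrittr, which library(tidyverse) does not attach)
df$ClassCOC=df$ClassCOC%>%as.factor()%>%recode_factor(.,`0` = "Negative", `1` = "Positive")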

df[,c(3:8)]=df[,c(3:8)]%>%lapply(.,as.numeric)

df%>%head()%>%knitr::kable()

Id	ClassCOC	Gravity	Ph	Osmolarity	Conductivity	Urea	Calcium
1	Negative	1.021	4.91	725	NA	443	2.45
2	Negative	1.017	5.74	577	20.0	296	4.49
3	Negative	1.008	7.20	321	14.9	101	2.36
4	Negative	1.011	5.51	408	12.6	224	2.15
5	Negative	1.005	6.52	187	7.5	91	1.16
6	Negative	1.020	5.27	668	25.3	252	3.34

Step 1: Exploring data and Testing of Assumptions

Before analysing our data with Pearson’s method, we need to make sure that the following assumptions are met:

The two variables are continuous, or at least measured on an interval scale. If not, consider a nonparametric method such as Spearman’s rho coefficient.
Outliers should be avoided, as the correlation coefficient is sensitive to them.
There should be no missing values among the observations (Pearson’s method does not accept missing values, but this problem is easily handled by case-wise deletion).
If you want to make statistical inferences using a null hypothesis test, your data should be normally distributed.

There is no strict rule that limits the use of Pearson’s r beyond these points; in particular, the two variables do not need to be measured on the same scale. The most important rule is that the two variables must be paired (incomplete pairs are not used).
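Case-wise handling of missing values can be sketched as follows. The two short vectors are made-up numbers used only for illustration; complete.cases and the use argument of cor are base R.

```r
# Two paired measurements with missing values (made-up numbers)
x <- c(1.2, 2.4, NA, 4.1, 5.3, 6.8)
y <- c(2.0, NA, 3.5, 4.4, 5.1, 7.2)

# Keep only the pairs in which both values are observed
ok <- complete.cases(x, y)
n_pairs <- sum(ok)

# Correlation on the complete pairs, done by hand ...
r_manual <- cor(x[ok], y[ok])

# ... and by letting cor drop the incomplete pairs itself
r_auto <- cor(x, y, use = "complete.obs")

all.equal(r_manual, r_auto)
```

Reporting n_pairs next to r tells the reader how many pairs actually entered the analysis.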

Table 1: Descriptive analysis

lapply(df,Hmisc::describe)

## $Id
## X[[i]] 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       79        0       79        1       40    26.67      4.9      8.8 
##      .25      .50      .75      .90      .95 
##     20.5     40.0     59.5     71.2     75.1 
## 
## lowest :  1  2  3  4  5, highest: 75 76 77 78 79
## 
## $ClassCOC
## X[[i]] 
##        n  missing distinct 
##       79        0        2 
##                             
## Value      Negative Positive
## Frequency        45       34
## Proportion     0.57     0.43
## 
## $Gravity
## X[[i]] 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       79        0       29    0.997    1.018 0.008223    1.008    1.008 
##      .25      .50      .75      .90      .95 
##    1.012    1.018    1.023    1.026    1.029 
## 
## lowest : 1.005 1.006 1.007 1.008 1.009, highest: 1.029 1.031 1.033 1.034 1.040
## 
## $Ph
## X[[i]] 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       79        0       70        1    6.028    0.808    5.072    5.234 
##      .25      .50      .75      .90      .95 
##    5.530    5.940    6.385    6.922    7.403 
## 
## lowest : 4.76 4.81 4.90 4.91 5.09, highest: 7.38 7.61 7.90 7.92 7.94
## 
## $Osmolarity
## X[[i]] 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       78        1       76        1      615    274.2    249.6    303.3 
##      .25      .50      .75      .90      .95 
##    411.5    612.5    797.5    885.3    958.1 
## 
## lowest :  187  225  241  242  251, highest:  956  970 1032 1107 1236
## 
## $Conductivity
## X[[i]] 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       78        1       63        1     20.9    9.164    8.355   10.600 
##      .25      .50      .75      .90      .95 
##   14.375   21.400   26.775   29.960   33.630 
## 
## lowest :  5.1  7.5  8.1  8.4  8.8, highest: 33.6 33.8 35.8 35.9 38.0
## 
## $Urea
## X[[i]] 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       79        0       73        1    266.4    150.5     85.8    100.2 
##      .25      .50      .75      .90      .95 
##    160.0    260.0    372.0    432.6    474.3 
## 
## lowest :  10  64  72  75  87, highest: 473 486 516 550 620
## 
## $Calcium
## X[[i]] 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       79        0       75        1    4.139    3.533    0.758    1.028 
##      .25      .50      .75      .90      .95 
##    1.460    3.160    5.930    8.490    9.671 
## 
## lowest :  0.17  0.27  0.58  0.65  0.77, highest:  9.39 12.20 12.68 13.00 14.34

Check for linear relationship

First, we should verify whether a linear relationship might exist between the two variables, because Pearson’s correlation analysis is based on a linear model. A linear relationship can be checked visually using a scatter plot.

For example, if we want to evaluate the relationship between Osmolarity and Urea, the scatter plot could be built as follows:

df%>%ggplot(aes(x=Urea,y=Osmolarity))+geom_point()+theme_bw()

Though Pearson’s correlation does not distinguish between “dependent” and “independent” variables, most medical studies aim to evaluate a causal association or a key pathophysiological metric, such as a newly developed marker or an important clinical outcome. Such a target variable can be considered “dependent”, and our analysis consists of matching this variable with the other parameters.

Fitting a regression line is not necessary to verify the linear relationship, as this is not a predictive study. We can simply inspect the distribution of the data points and judge whether a linear relationship between the two variables is plausible. Visual inspection is usually sufficient at this stage, as the eye readily detects curvature and clusters. The scatter plot can also help you to identify outliers that might alter the analysis.

The above example shows an evident linear relationship between the two variables. Here is another scatter plot that suggests no linear relationship.

df%>%ggplot(aes(x=Ph,y=Calcium))+geom_point()+theme_bw()

Data points that follow a clear but non-linear pattern also violate our assumption.

When the linear relationship assumption is violated, we can either use an alternative method such as Spearman’s rho coefficient, or transform our data using functions such as the Box-Cox or logarithmic transformations.
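Both options can be illustrated on simulated data. This is a minimal sketch with an artificial exponential relationship (the data and variable names are assumptions for illustration, not part of the urine dataset).

```r
set.seed(1)

# A monotone but strongly non-linear relationship
x <- runif(50, min = 0, max = 5)
y <- exp(x)

# Pearson's r underestimates the association because the pattern is not linear
r_pearson <- cor(x, y, method = "pearson")

# Spearman's rho works on ranks, so any strictly increasing pattern gives rho = 1
r_spearman <- cor(x, y, method = "spearman")

# A logarithmic transformation makes the relationship linear again
r_log <- cor(x, log(y), method = "pearson")

c(pearson = r_pearson, spearman = r_spearman, pearson_on_log = r_log)
```

Which option to choose depends on the study question: the rank-based coefficient describes monotone association, while the transformation keeps the analysis within the linear framework.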

Checking for Normal distribution

The normality of data can be tested by different methods, including the Shapiro-Wilk and Kolmogorov-Smirnov tests (widely used in SPSS). In this tutorial we adopt the D’Agostino test, applied here to Osmolarity and Conductivity:

df%>%select(.,Osmolarity,Conductivity)%>%map(~fBasics::dagoTest(.))

## $Osmolarity
## 
## Title:
##  D'Agostino Normality Test
## 
## Test Results:
##   STATISTIC:
##     Chi2 | Omnibus: 3.0753
##     Z3  | Skewness: 0.492
##     Z4  | Kurtosis: -1.6832
##   P VALUE:
##     Omnibus  Test: 0.2149 
##     Skewness Test: 0.6227 
##     Kurtosis Test: 0.09234 
## 
## Description:
##  Tue Feb 28 23:47:35 2017 by user: Admin
## 
## 
## $Conductivity
## 
## Title:
##  D'Agostino Normality Test
## 
## Test Results:
##   STATISTIC:
##     Chi2 | Omnibus: 4.687
##     Z3  | Skewness: -0.2236
##     Z4  | Kurtosis: -2.1534
##   P VALUE:
##     Omnibus  Test: 0.09599 
##     Skewness Test: 0.8231 
##     Kurtosis Test: 0.03129 
## 
## Description:
##  Tue Feb 28 23:47:35 2017 by user: Admin

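The omnibus test does not reject normality for either variable at the 0.05 level (p=0.21 for Osmolarity and p=0.10 for Conductivity). Note, however, that the kurtosis test for Conductivity gives p=0.03, so the departure from normality in this variable is borderline rather than absent. As a rule, the assumption is considered violated when the omnibus p-value is lower than 0.05.

This could be reported as:

“Both variables were approximately normally distributed, as assessed by D’Agostino omnibus test (p=0.21 and 0.10 for Osmolarity and Conductivity, respectively).”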
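The Shapiro-Wilk test mentioned earlier is available in base R as shapiro.test. The sketch below applies it to simulated values (an assumption made so that the example is self-contained; the mean and standard deviation are arbitrary), and shows how to extract the statistic and the p-value.

```r
set.seed(42)

# Simulated stand-in for a continuous urinary index (arbitrary mean and SD)
x <- rnorm(80, mean = 600, sd = 240)

# Shapiro-Wilk normality test from the stats package
sw <- shapiro.test(x)

# W statistic and p-value; a p-value below 0.05 would suggest non-normality
sw$statistic
sw$p.value
```

The interpretation follows the same logic as for the omnibus test: a small p-value is evidence against normality.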

A) Single pair correlation analysis

This section will show you how to carry out the simplest form of Pearson’s correlation analysis that involves only two variables, for example: between Osmolarity and Conductivity

# Incomplete pairs are dropped automatically by cor.test
cor.test(df$Osmolarity,df$Conductivity,method="pearson",alternative="two.sided")

## 
##  Pearson's product-moment correlation
## 
## data:  df$Osmolarity and df$Conductivity
## t = 12.67, df = 75, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7380665 0.8857619
## sample estimates:
##       cor 
## 0.8255694

A typical correlation output contains the following information:

The value of the Pearson correlation coefficient: r = 0.8255694

The significance level of the correlation coefficient: p-value < 2.2e-16

The degrees of freedom (df) = number of paired observations - 2 (the number of pairs is reduced when there are missing values). Here df = 75, because the 2 cases with missing values were removed (77 complete pairs).

The 95 percent confidence interval for the r coefficient: lower bound = 0.7380665 and upper bound = 0.8857619
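These quantities can be recomputed from r and the number of complete pairs alone. The sketch below starts from the values printed above (r = 0.8255694 and 77 pairs) and applies the usual t statistic and the Fisher z transformation for the confidence interval; the variable names are ours.

```r
r <- 0.8255694   # coefficient reported by cor.test
n <- 77          # complete pairs (79 subjects minus 2 with missing values)

# Degrees of freedom and t statistic
df_t <- n - 2
t_stat <- r * sqrt(df_t) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_val <- 2 * pt(-abs(t_stat), df = df_t)

# 95% confidence interval through the Fisher z transformation
z <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

round(c(t = t_stat, lower = ci[1], upper = ci[2]), 4)
```

The results should match the t value and the confidence bounds of the cor.test output up to rounding.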

How to interpret this result

The Pearson’s r could range from -1 to +1.

A zero value indicates no linear association between the variables.

A positive value indicates a positive association (i.e when a variable increases, so does the other variable).

A negative value indicates a negative association (i.e a variable would decrease as the other increases and vice-versa).

The closer r is to +1 or -1, the stronger the association. A value of exactly +1 or -1 means that the 2 variables are perfectly linearly related, i.e. the variance of one variable is totally explained by the variance of the other.

The closer r is to 0, the greater the dispersion of data points around the fitted line and the weaker the association between our 2 variables.

An absolute value of r between 0.1 and 0.3 indicates a weak correlation, from 0.3 to 0.5 indicates a medium association and from 0.5 to 1 indicates a strong correlation.
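This rule of thumb can be written as a small helper. The function name classify_r is hypothetical, and the label “negligible” for absolute values below 0.1 is our assumption, since the rule above does not name that range.

```r
# Hypothetical helper: label the strength and direction of a correlation coefficient
classify_r <- function(r) {
  strength <- cut(abs(r),
                  breaks = c(0, 0.1, 0.3, 0.5, 1),
                  labels = c("negligible", "weak", "medium", "strong"),
                  right = FALSE, include.lowest = TRUE)
  direction <- ifelse(r > 0, "positive", ifelse(r < 0, "negative", "none"))
  data.frame(r = r, strength = as.character(strength), direction = direction)
}

# Three coefficients of different sizes and signs
res <- classify_r(c(-0.2373956, 0.5248635, 0.8255694))
res
```

Such labels are conventions rather than statistical results, so the coefficient itself should always be reported alongside them.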

In our example

As the sign of the Pearson correlation coefficient is positive, we can say that Osmolarity is positively correlated with Conductivity (i.e. Conductivity tends to be higher when Osmolarity is higher). This is best phrased as “higher values of Osmolarity are associated with higher values of Conductivity”, since the verbs “increase” or “decrease” are not always appropriate (a biomarker does not increase by itself; the disease drives it).

As mentioned above, the absolute value of r determines how strong the relationship is. Following the rule given above, an absolute value of r of 0.5 or higher is interpreted as a “large” or “strong” correlation, and a value below 0.3 as a “weak” or “small” relationship.

Our result could be reported as: “There was a strong positive correlation between Osmolarity and Conductivity (r(75)=0.826; p<0.0001)”

Graphical presentation

df%>%ggplot(aes(x=Osmolarity,y=Conductivity))+geom_jitter(shape=21,size=3,color="black",fill="grey50",alpha=0.7)+geom_smooth(method="lm",color="red4",fill="red2",alpha=0.3)+theme_bw()

The effect size of Pearson’s coefficient

You may be surprised to discover that the effect size associated with Pearson’s r is exactly the coefficient of determination of a simple linear regression model, or R squared (R2).

The R2 indicates how much of the variance in the dependent variable can be explained by the variance in the independent variable.

The following experiment shows that the absolute value of Pearson’s coefficient equals the square root of the R2 of a simple linear model:

fit=lm(data=df,Conductivity~Osmolarity)

fit%>%summary()

## 
## Call:
## lm(formula = Conductivity ~ Osmolarity, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.797  -2.652   0.105   2.569  11.227 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.973392   1.433314   2.772  0.00702 ** 
## Osmolarity  0.027594   0.002178  12.670  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.547 on 75 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.6816, Adjusted R-squared:  0.6773 
## F-statistic: 160.5 on 1 and 75 DF,  p-value: < 2.2e-16

fit%>%summary()%>%.$r.squared%>%sqrt()

## [1] 0.8255694

The R2 is 0.682 (the adjusted R2 is 0.677), so we could interpret the effect size of r as: “Osmolarity could explain 68.2% of the variance in Conductivity”

Bootstrapping a paired correlation

The following procedure estimates a confidence interval for Pearson’s correlation coefficient by bootstrap resampling. Such a method is not mandatory, but it can be helpful when the sample size is small or outliers are present, or if you simply want to check how stable your result is.

#bootstrap 

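corboot1=function(x,y,data,i){
  # Rows resampled by boot(); data is expected to be a tibble
  d=data[i,]
  xt=d[,x]
  yt=d[,y]
  # Pearson's r and its p-value for the resampled pairs
  coef=cor(xt,yt,method="pearson",use="pairwise.complete.obs")%>%.[1]
  pval=psych::corr.test(xt,yt,use="pairwise")%>%.$p%>%.[1]
  # Return both statistics as a 1 x 2 matrix
  cbind(coef,pval)
}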

set.seed(123)
library(boot)
res1=boot(statistic=corboot1,x="Osmolarity",y="Conductivity",data=df,R=1000)%>%.$t%>%as_tibble()

names(res1)=c("Coefficient","Pvalue")
res1$Iteration=c(1:nrow(res1))

p1=res1%>%ggplot(aes(x=Coefficient))+geom_histogram(color="black",fill="gold")+geom_vline(xintercept=0.5,color="red4",linetype="dashed",size=1)+theme_bw()
p2=res1%>%ggplot(aes(x=Iteration,y=Coefficient))+geom_path(color="purple",size=0.8,alpha=0.7)+geom_hline(yintercept=mean(res1$Coefficient),color="red4",linetype="dashed",size=1)+theme_bw()

library(gridExtra)

grid.arrange(p1,p2)

Hmisc::describe(res1[,1])

## res1[, 1] 
## 
##  1  Variables      1000  Observations
## ---------------------------------------------------------------------------
## Coefficient 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1000        0     1000        1   0.8269  0.04559   0.7580   0.7705 
##      .25      .50      .75      .90      .95 
##   0.8008   0.8298   0.8550   0.8772   0.8894 
## 
## lowest : 0.6651462 0.6813705 0.6929213 0.7074255 0.7117356
## highest: 0.9119198 0.9128613 0.9132657 0.9244275 0.9272008
## ---------------------------------------------------------------------------

The bootstrap result indicates that the central 90% of the resampled coefficients between Osmolarity and Conductivity (5th to 95th percentiles) lie between 0.76 and 0.89 (min=0.67, max=0.93). This is in good agreement with the 95%CI provided by the cor.test function (0.74 to 0.89). As the interval does not contain zero, we can be confident in the significance of the relationship (this is an alternative way to verify significance without using a p-value).
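Hmisc::describe prints the 5th and 95th percentiles, whereas a 95% percentile interval uses the 2.5th and 97.5th percentiles of the bootstrap distribution. The sketch below shows two ways to obtain it, with the built-in mtcars data as a stand-in (an assumption made so that the example runs offline); cor_stat is a hypothetical helper name.

```r
library(boot)

# Statistic for boot(): Pearson's r on the resampled rows
cor_stat <- function(data, i) cor(data$wt[i], data$mpg[i])

set.seed(123)
b <- boot(data = mtcars, statistic = cor_stat, R = 1000)

# Option 1: empirical 2.5th and 97.5th percentiles of the replicates
ci_quantile <- quantile(b$t[, 1], probs = c(0.025, 0.975))

# Option 2: percentile interval computed by boot.ci (lower and upper bounds)
ci_boot <- boot.ci(b, conf = 0.95, type = "perc")$percent[4:5]

ci_quantile
ci_boot
```

The two intervals are expected to be close but not identical, because boot.ci interpolates between order statistics in a slightly different way from quantile.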

B) Pair-wise correlation analysis

In practice, Pearson’s coefficient for a single pair is rarely the key result (except when the study question focuses on this specific relationship). In most situations, the analysis consists of exploring the relationship between a given target variable and ALL remaining variables in the dataset. To avoid copy-pasting the same function call, we can write a loop in R:

correlation=function(data,x,y){
  # data: a tibble; x: name of the main variable; y: names of the variables to pair with it
  n=ncol(data[,y])
  cormtx=cbind(Target=rep(NA,n),Coefficient=rep(NA,n),Pvalue=rep(NA,n))%>%as.data.frame()
  var=data[,x]
  for (i in 1:n){
    # i-th variable of y, kept as a one-column tibble so that its name is available
    targ=data[,y]%>%.[,i]
    coef=cor(var,targ,method="pearson",use="pairwise.complete.obs")%>%.[1]
    pval=psych::corr.test(var,targ,use="pairwise")%>%.$p%>%.[1]
    cormtx$Target[i]=colnames(targ)
    cormtx$Coefficient[i]=coef
    cormtx$Pvalue[i]=pval
  }
  return(cormtx)
}

correlation(data=df,x="Osmolarity",y=c("Ph","Calcium","Conductivity","Gravity","Urea"))

##         Target Coefficient       Pvalue
## 1           Ph  -0.2373956 3.636796e-02
## 2      Calcium   0.5248635 8.097138e-07
## 3 Conductivity   0.8255694 0.000000e+00
## 4      Gravity   0.8710360 0.000000e+00
## 5         Urea   0.8893390 0.000000e+00

In our example, we perform a pair-wise correlation analysis between Osmolarity and Ph, Calcium, Conductivity, Gravity and Urea.

The function “correlation” above could be reused for your own study.
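For readers who prefer to rely on base R only, a variant of the same idea is sketched here. The function name cor_with_target is hypothetical, it uses cor.test instead of psych::corr.test, and it is demonstrated on the built-in mtcars data rather than on the urine dataset.

```r
# Hypothetical base-R variant: correlate one target variable with several others
cor_with_target <- function(data, target, others) {
  rows <- lapply(others, function(v) {
    ct <- cor.test(data[[target]], data[[v]], method = "pearson")
    data.frame(Variable = v,
               Coefficient = unname(ct$estimate),
               Pvalue = ct$p.value,
               N = unname(ct$parameter) + 2)  # df + 2 = number of complete pairs
  })
  do.call(rbind, rows)
}

out <- cor_with_target(mtcars, target = "mpg", others = c("wt", "hp", "qsec"))
out
```

Returning the number of complete pairs for each row makes it easier to notice when missing values reduce the sample for a given pair.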

C) Correlation Matrix

A more comprehensive way to explore the whole correlation matrix of many numeric variables is explained in this section.

First, we could use the GGally package for this purpose. It offers an integrated approach that includes assumption checks, correlation coefficients and graphical presentation:

plotfuncLow <- function(data,mapping){
  p <- ggplot(data = data,mapping=mapping)+geom_point(shape=21,color="black",fill="grey50")+geom_smooth(method="lm",color="red4",fill="red2",alpha=0.3)+theme_bw()
  p
}

plotfuncmid <- function(data,mapping){
  p <- ggplot(data = data,mapping=mapping)+geom_density(alpha=0.5,color="black",fill="red")+theme_bw()
  p
}

library(GGally)

ggpairs(df,columns=3:8,lower=list(continuous=plotfuncLow),diag=list(continuous=plotfuncmid))

Simplified correlation matrix

The corrplot package provides an alternative method for representing the correlation matrix:

cor.mtest <- function(mat, conf.level = 0.95){
  mat <- as.matrix(mat)
  n <- ncol(mat)
  p.mat <- lowCI.mat <- uppCI.mat <- matrix(NA, n, n)
  diag(p.mat) <- 0
  diag(lowCI.mat) <- diag(uppCI.mat) <- 1
  for(i in 1:(n-1)){
    for(j in (i+1):n){
      tmp <- cor.test(mat[,i], mat[,j], conf.level = conf.level)
      p.mat[i,j] <- p.mat[j,i] <- tmp$p.value
      lowCI.mat[i,j] <- lowCI.mat[j,i] <- tmp$conf.int[1]
      uppCI.mat[i,j] <- uppCI.mat[j,i] <- tmp$conf.int[2]
    }
  }
  return(list(p.mat, lowCI.mat, uppCI.mat))
}

cormat<-df[,c(3:8)]%>%cor.mtest(.,0.95)

library("corrplot")
library(viridis)

df[,c(3:8)]%>%cor(.,method="pearson",use="pairwise.complete.obs")%>%corrplot(.,p.mat=cormat[[1]],sig.level=0.05,type="lower",method="pie",tl.col="black", tl.srt=45,col=viridis::plasma(n=100,begin =0.9, end = 0.4))

Finally, the numerical results can be obtained using either the cor function (base R) or the rcorr function (Hmisc package). Note: the latter only accepts a matrix as input data.

df[,c(3:8)]%>%cor(.,method="pearson",use="pairwise.complete.obs")

##                 Gravity         Ph Osmolarity Conductivity       Urea
## Gravity       1.0000000 -0.2533402  0.8710360    0.5668070  0.8234770
## Ph           -0.2533402  1.0000000 -0.2373956   -0.1172726 -0.2755569
## Osmolarity    0.8710360 -0.2373956  1.0000000    0.8255694  0.8893390
## Conductivity  0.5668070 -0.1172726  0.8255694    1.0000000  0.5189940
## Urea          0.8234770 -0.2755569  0.8893390    0.5189940  1.0000000
## Calcium       0.5256987 -0.1194878  0.5248635    0.3475249  0.5023267
##                 Calcium
## Gravity       0.5256987
## Ph           -0.1194878
## Osmolarity    0.5248635
## Conductivity  0.3475249
## Urea          0.5023267
## Calcium       1.0000000

df[,c(3:8)]%>%as.matrix()%>%Hmisc::rcorr(.,type="pearson")

##              Gravity    Ph Osmolarity Conductivity  Urea Calcium
## Gravity         1.00 -0.25       0.87         0.57  0.82    0.53
## Ph             -0.25  1.00      -0.24        -0.12 -0.28   -0.12
## Osmolarity      0.87 -0.24       1.00         0.83  0.89    0.52
## Conductivity    0.57 -0.12       0.83         1.00  0.52    0.35
## Urea            0.82 -0.28       0.89         0.52  1.00    0.50
## Calcium         0.53 -0.12       0.52         0.35  0.50    1.00
## 
## n
##              Gravity Ph Osmolarity Conductivity Urea Calcium
## Gravity           79 79         78           78   79      79
## Ph                79 79         78           78   79      79
## Osmolarity        78 78         78           77   78      78
## Conductivity      78 78         77           78   78      78
## Urea              79 79         78           78   79      79
## Calcium           79 79         78           78   79      79
## 
## P
##              Gravity Ph     Osmolarity Conductivity Urea   Calcium
## Gravity              0.0243 0.0000     0.0000       0.0000 0.0000 
## Ph           0.0243         0.0364     0.3065       0.0140 0.2942 
## Osmolarity   0.0000  0.0364            0.0000       0.0000 0.0000 
## Conductivity 0.0000  0.3065 0.0000                  0.0000 0.0018 
## Urea         0.0000  0.0140 0.0000     0.0000              0.0000 
## Calcium      0.0000  0.2942 0.0000     0.0018       0.0000

As our study question focuses on Osmolarity as the target variable, we can read the correlation matrix as follows:

Except for Ph, which is weakly and negatively correlated with Osmolarity (r=-0.24), all other variables showed a strong positive correlation with our target variable (r=0.87, 0.83, 0.89 and 0.52 for Gravity, Conductivity, Urea and Calcium, respectively). These correlations were all statistically significant (p<0.05).
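Reading one column of a correlation matrix can also be done programmatically. The sketch below uses the built-in mtcars data with mpg as a stand-in target (an assumption for illustration), extracts the target's column, drops the self-correlation and sorts the remaining coefficients by absolute value.

```r
# Correlation matrix of a few numeric variables (mtcars used as a stand-in)
cm <- cor(mtcars[, c("mpg", "wt", "hp", "qsec", "drat")],
          use = "pairwise.complete.obs")

target <- "mpg"

# Column of the target variable, without its correlation with itself
r_target <- cm[setdiff(rownames(cm), target), target]

# Strongest associations first, whatever their sign
r_sorted <- r_target[order(abs(r_target), decreasing = TRUE)]
round(r_sorted, 2)
```

Sorting on the absolute value keeps strong negative correlations at the top of the list together with strong positive ones.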

Bootstrapping a correlation matrix

Bootstrapping a correlation matrix is a little more complicated than for a single pair. We can do this with a 3-step procedure: first, we introduce a corboot function that directly calls the cor function. Then we apply this corboot function to our target matrix. Finally, we use a loop to assign the names of the pairs and explore the final data frame.

mat=df[,c(3:8)]

corboot=function(data,i){
  cor(data[i,])
}
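A note on missing values: cor is called here with its default setting (use="everything"), so any resample that contains a row with a missing value returns NA for the affected pairs. With 79 rows, a given row is absent from a resample only about 37% of the time, which is consistent with the large "missing" counts reported for the pairs involving Osmolarity or Conductivity in the output further down. The following is a minimal sketch of an NA-tolerant variant on simulated data; the names corboot_default, corboot_na and toy are hypothetical.

```r
library(boot)

set.seed(123)

# Simulated data frame with a single missing value (for illustration only)
toy <- data.frame(a = rnorm(40), b = rnorm(40), c = rnorm(40))
toy$b[3] <- NA

# Default behaviour: NA as soon as the resample contains the incomplete row
corboot_default <- function(data, i) as.vector(cor(data[i, ]))

# NA-tolerant variant: each pair uses its own complete observations
corboot_na <- function(data, i) {
  as.vector(cor(data[i, ], use = "pairwise.complete.obs"))
}

res_default <- boot(data = toy, statistic = corboot_default, R = 200)$t
res_na <- boot(data = toy, statistic = corboot_na, R = 200)$t

# Share of missing replicates under each approach
c(default = mean(is.na(res_default)), tolerant = mean(is.na(res_na)))
```

With the tolerant variant every replicate contributes to the bootstrap distribution, at the cost of slightly different sample sizes from one pair to another.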

set.seed(123)
library(boot)
res=boot(statistic=corboot,data=mat,R=1000)%>%.$t%>%as_tibble()

list=c("V1","V2","V3","V4","V5","V6")

ldf=data.frame(NULL)

n=0

for (i in 1:6){
  var=colnames(mat)[i]
  b=n+1
  e=n+ncol(mat)
  dt=res[,c(b:e)]
  names(dt)=colnames(mat)
  dt$Variable=var
  ldf=rbind(ldf,dt)
  n=n+ncol(mat)
}

ldf=ldf%>%gather(Gravity:Calcium,key="Target",value="Coefficient")

ldf%>%ggplot(aes(x=Coefficient,fill=Variable))+geom_histogram(show.legend = F,color="black")+facet_grid(Variable~Target,scales="free")+geom_vline(xintercept=0,color="blue",linetype="dashed")+scale_x_continuous(limits = c(-1,1))

ldf2=ldf%>%unite(Pair,Variable,Target)
ldf2%>%split(.$Pair)%>%map(~Hmisc::describe(.$Coefficient))

## $Calcium_Calcium
## .$Coefficient 
##        n  missing distinct     Info     Mean      Gmd 
##     1000        0        1        0        1        0 
##                
## Value         1
## Frequency  1000
## Proportion    1
## 
## $Calcium_Conductivity
## .$Coefficient 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      377      623      377        1   0.3467  0.09782   0.1974   0.2244 
##      .25      .50      .75      .90      .95 
##   0.2901   0.3529   0.4074   0.4549   0.4771 
## 
## lowest : 0.08957106 0.13694367 0.14097150 0.14522399 0.14978924
## highest: 0.53203186 0.53555428 0.54089242 0.54399090 0.56215332
## 
## $Calcium_Gravity
## .$Coefficient 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1000        0     1000        1   0.5239   0.1039   0.3676   0.3980 
##      .25      .50      .75      .90      .95 
##   0.4644   0.5286   0.5900   0.6389   0.6640 
## 
## lowest : 0.1636405 0.1868441 0.1890655 0.2180693 0.2704973
## highest: 0.7278133 0.7395206 0.7419029 0.7452068 0.7814306
## 
## $Calcium_Osmolarity
## .$Coefficient 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      382      618      382        1   0.5144  0.09846   0.3669   0.3959 
##      .25      .50      .75      .90      .95 
##   0.4600   0.5162   0.5793   0.6205   0.6420 
## 
## lowest : 0.2243556 0.2270902 0.2784960 0.2892525 0.2978975
## highest: 0.6851170 0.6887083 0.7000109 0.7123718 0.7237371
## 
## $Calcium_Ph
## .$Coefficient 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1000        0     1000        1  -0.1219   0.1435 -0.32955 -0.28901 
##      .25      .50      .75      .90      .95 
## -0.20982 -0.12038 -0.03506  0.03550  0.08541 
## 
## lowest : -0.5033614 -0.4894865 -0.4434884 -0.4292487 -0.4167988
## highest:  0.2024711  0.2079475  0.2247542  0.2249787  0.2340928
## 
## $Calcium_Urea
## .$Coefficient 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1000        0     1000        1   0.5012   0.1069   0.3419   0.3795 
##      .25      .50      .75      .90      .95 
##   0.4377   0.5034   0.5659   0.6197   0.6484 
## 
## lowest : 0.1861203 0.1962297 0.1984450 0.2221892 0.2375499
## highest: 0.7297510 0.7304352 0.7505378 0.7672036 0.7854312
## 
## $Conductivity_Calcium
## .$Coefficient 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      377      623      377        1   0.3467  0.09782   0.1974   0.2244 
##      .25      .50      .75      .90      .95 
##   0.2901   0.3529   0.4074   0.4549   0.4771 
## 
## lowest : 0.08957106 0.13694367 0.14097150 0.14522399 0.14978924
## highest: 0.53203186 0.53555428 0.54089242 0.54399090 0.56215332
## 
## $Conductivity_Conductivity
## .$Coefficient 
##        n  missing distinct     Info     Mean      Gmd 
##     1000        0        1        0        1        0 
##                
## Value         1
## Frequency  1000
## Proportion    1
## 
## $Conductivity_Gravity
## .$Coefficient 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      377      623      377        1   0.5708   0.1092   0.3932   0.4380 
##      .25      .50      .75      .90      .95 
##   0.5123   0.5791   0.6318   0.6927   0.7274 
## 
## lowest : 0.2662446 0.2876707 0.3071494 0.3202558 0.3269361
## highest: 0.7661535 0.7668432 0.7709550 0.7886193 0.8145255
## 
## $Conductivity_Osmolarity
## .$Coefficient 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      142      858      142        1   0.8239   0.0452   0.7632   0.7706 
##      .25      .50      .75      .90      .95 
##   0.7976   0.8247   0.8541   0.8699   0.8768 
## 
## lowest : 0.6813705 0.7294604 0.7325744 0.7355072 0.7466821
## highest: 0.8963830 0.8967896 0.8974084 0.9115176 0.9118883
## 
## $Conductivity_Ph
## .$Coefficient 
##         n   missing  distinct      Info      Mean       Gmd       .05 
##       377       623       377         1   -0.1216    0.1086 -0.273668 
##       .10       .25       .50       .75       .90       .95 
## -0.241532 -0.194595 -0.120748 -0.054709 -0.007805  0.034417 
## 
## lowest : -0.39273545 -0.36641813 -0.36575779 -0.36340982 -0.35213337
## highest:  0.08314969  0.12017868  0.16424574  0.17210259  0.19389733
## 
## $Conductivity_Urea
## .$Coefficient 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      377      623      377        1   0.5218  0.09333   0.3742   0.4124 
##      .25      .50      .75      .90      .95 
##   0.4694   0.5290   0.5843   0.6209   0.6470 
## 
## lowest : 0.2489203 0.2745463 0.2899516 0.3124759 0.3384205
## highest: 0.6935489 0.6936601 0.7037591 0.7075634 0.7443859
## 
## $Gravity_Calcium
## .$Coefficient 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1000        0     1000        1   0.5239   0.1039   0.3676   0.3980 
##      .25      .50      .75      .90      .95 
##   0.4644   0.5286   0.5900   0.6389   0.6640 
## 
## lowest : 0.1636405 0.1868441 0.1890655 0.2180693 0.2704973
## highest: 0.7278133 0.7395206 0.7419029 0.7452068 0.7814306
## 
## $Gravity_Conductivity
## .$Coefficient 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      377      623      377        1   0.5708   0.1092   0.3932   0.4380 
##      .25      .50      .75      .90      .95 
##   0.5123   0.5791   0.6318   0.6927   0.7274 
## 
## lowest : 0.2662446 0.2876707 0.3071494 0.3202558 0.3269361
## highest: 0.7661535 0.7668432 0.7709550 0.7886193 0.8145255
## 
## $Gravity_Gravity
## .$Coefficient 
##        n  missing distinct     Info     Mean      Gmd 
##     1000        0        1        0        1        0 
##                
## Value         1
## Frequency  1000
## Proportion    1
## 
## $Gravity_Osmolarity
## .$Coefficient 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      382      618      382        1   0.8737  0.05644   0.7863   0.8023 
##      .25      .50      .75      .90      .95 
##   0.8444   0.8793   0.9113   0.9336   0.9459 
## 
## lowest : 0.7094249 0.7232943 0.7274568 0.7316258 0.7404852
## highest: 0.9598781 0.9606117 0.9611458 0.9641672 0.9672925
## 
## $Gravity_Ph
## .$Coefficient 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1000        0     1000        1  -0.2568   0.1103  -0.4201  -0.3794 
##      .25      .50      .75      .90      .95 
##  -0.3217  -0.2576  -0.1869  -0.1321  -0.1023 
## 
## lowest : -0.630683327 -0.549624205 -0.548206978 -0.523962612 -0.521183392
## highest: -0.026580885 -0.023248859 -0.020900122 -0.006203768  0.016260853
## 
## $Gravity_Urea
## .$Coefficient 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1000        0     1000        1   0.8255  0.07757   0.6966   0.7281 
##      .25      .50      .75      .90      .95 
##   0.7846   0.8331   0.8823   0.9052   0.9150 
## 
## lowest : 0.5780316 0.5814722 0.5847227 0.5991420 0.6058019
## highest: 0.9391618 0.9395732 0.9396153 0.9464864 0.9555937
## 
## $Osmolarity_Calcium
## .$Coefficient 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      382      618      382        1   0.5144  0.09846   0.3669   0.3959 
##      .25      .50      .75      .90      .95 
##   0.4600   0.5162   0.5793   0.6205   0.6420 
## 
## lowest : 0.2243556 0.2270902 0.2784960 0.2892525 0.2978975
## highest: 0.6851170 0.6887083 0.7000109 0.7123718 0.7237371
## 
## $Osmolarity_Conductivity
## .$Coefficient 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      142      858      142        1   0.8239   0.0452   0.7632   0.7706 
##      .25      .50      .75      .90      .95 
##   0.7976   0.8247   0.8541   0.8699   0.8768 
## 
## lowest : 0.6813705 0.7294604 0.7325744 0.7355072 0.7466821
## highest: 0.8963830 0.8967896 0.8974084 0.9115176 0.9118883
## 
## $Osmolarity_Gravity
## .$Coefficient 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      382      618      382        1   0.8737  0.05644   0.7863   0.8023 
##      .25      .50      .75      .90      .95 
##   0.8444   0.8793   0.9113   0.9336   0.9459 
## 
## lowest : 0.7094249 0.7232943 0.7274568 0.7316258 0.7404852
## highest: 0.9598781 0.9606117 0.9611458 0.9641672 0.9672925
## 
## $Osmolarity_Osmolarity
## .$Coefficient 
##        n  missing distinct     Info     Mean      Gmd 
##     1000        0        1        0        1        0 
##                
## Value         1
## Frequency  1000
## Proportion    1
## 
## $Osmolarity_Ph
## .$Coefficient 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      382      618      382        1  -0.2418   0.1061 -0.39200 -0.36302 
##      .25      .50      .75      .90      .95 
## -0.30120 -0.24289 -0.17769 -0.12257 -0.08785 
## 
## lowest : -0.485944879 -0.485328177 -0.469680809 -0.466464992 -0.453425383
## highest: -0.019575968 -0.019513772  0.001365378  0.015483628  0.102608070
## 
## $Osmolarity_Urea
## .$Coefficient 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      382      618      382        1   0.8885  0.02461   0.8515   0.8611 
##      .25      .50      .75      .90      .95 
##   0.8749   0.8889   0.9034   0.9155   0.9222 
## 
## lowest : 0.8227201 0.8247502 0.8305188 0.8329580 0.8339577
## highest: 0.9374534 0.9376413 0.9386876 0.9417923 0.9501728
## 
## $Ph_Calcium
## .$Coefficient 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1000        0     1000        1  -0.1219   0.1435 -0.32955 -0.28901 
##      .25      .50      .75      .90      .95 
## -0.20982 -0.12038 -0.03506  0.03550  0.08541 
## 
## lowest : -0.5033614 -0.4894865 -0.4434884 -0.4292487 -0.4167988
## highest:  0.2024711  0.2079475  0.2247542  0.2249787  0.2340928
## 
## $Ph_Conductivity
## .$Coefficient 
##         n   missing  distinct      Info      Mean       Gmd       .05 
##       377       623       377         1   -0.1216    0.1086 -0.273668 
##       .10       .25       .50       .75       .90       .95 
## -0.241532 -0.194595 -0.120748 -0.054709 -0.007805  0.034417 
## 
## lowest : -0.39273545 -0.36641813 -0.36575779 -0.36340982 -0.35213337
## highest:  0.08314969  0.12017868  0.16424574  0.17210259  0.19389733
## 
## $Ph_Gravity
## .$Coefficient 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1000        0     1000        1  -0.2568   0.1103  -0.4201  -0.3794 
##      .25      .50      .75      .90      .95 
##  -0.3217  -0.2576  -0.1869  -0.1321  -0.1023 
## 
## lowest : -0.630683327 -0.549624205 -0.548206978 -0.523962612 -0.521183392
## highest: -0.026580885 -0.023248859 -0.020900122 -0.006203768  0.016260853
## 
## $Ph_Osmolarity
## .$Coefficient 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      382      618      382        1  -0.2418   0.1061 -0.39200 -0.36302 
##      .25      .50      .75      .90      .95 
## -0.30120 -0.24289 -0.17769 -0.12257 -0.08785 
## 
## lowest : -0.485944879 -0.485328177 -0.469680809 -0.466464992 -0.453425383
## highest: -0.019575968 -0.019513772  0.001365378  0.015483628  0.102608070
## 
## $Ph_Ph
## .$Coefficient 
##        n  missing distinct     Info     Mean      Gmd 
##     1000        0        1        0        1        0 
##                
## Value         1
## Frequency  1000
## Proportion    1
## 
## $Ph_Urea
## .$Coefficient 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1000        0     1000        1  -0.2749   0.1142 -0.43821 -0.40861 
##      .25      .50      .75      .90      .95 
## -0.34662 -0.27352 -0.20821 -0.14382 -0.09705 
## 
## lowest : -0.5663802067 -0.5596691597 -0.5586663955 -0.5230563505 -0.5229045050
## highest: -0.0107582658  0.0007919951  0.0069800099  0.0352025917  0.0427327604
## 
## $Urea_Calcium
## .$Coefficient 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1000        0     1000        1   0.5012   0.1069   0.3419   0.3795 
##      .25      .50      .75      .90      .95 
##   0.4377   0.5034   0.5659   0.6197   0.6484 
## 
## lowest : 0.1861203 0.1962297 0.1984450 0.2221892 0.2375499
## highest: 0.7297510 0.7304352 0.7505378 0.7672036 0.7854312
## 
## $Urea_Conductivity
## .$Coefficient 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      377      623      377        1   0.5218  0.09333   0.3742   0.4124 
##      .25      .50      .75      .90      .95 
##   0.4694   0.5290   0.5843   0.6209   0.6470 
## 
## lowest : 0.2489203 0.2745463 0.2899516 0.3124759 0.3384205
## highest: 0.6935489 0.6936601 0.7037591 0.7075634 0.7443859
## 
## $Urea_Gravity
## .$Coefficient 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1000        0     1000        1   0.8255  0.07757   0.6966   0.7281 
##      .25      .50      .75      .90      .95 
##   0.7846   0.8331   0.8823   0.9052   0.9150 
## 
## lowest : 0.5780316 0.5814722 0.5847227 0.5991420 0.6058019
## highest: 0.9391618 0.9395732 0.9396153 0.9464864 0.9555937
## 
## $Urea_Osmolarity
## .$Coefficient 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      382      618      382        1   0.8885  0.02461   0.8515   0.8611 
##      .25      .50      .75      .90      .95 
##   0.8749   0.8889   0.9034   0.9155   0.9222 
## 
## lowest : 0.8227201 0.8247502 0.8305188 0.8329580 0.8339577
## highest: 0.9374534 0.9376413 0.9386876 0.9417923 0.9501728
## 
## $Urea_Ph
## .$Coefficient 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1000        0     1000        1  -0.2749   0.1142 -0.43821 -0.40861 
##      .25      .50      .75      .90      .95 
## -0.34662 -0.27352 -0.20821 -0.14382 -0.09705 
## 
## lowest : -0.5663802067 -0.5596691597 -0.5586663955 -0.5230563505 -0.5229045050
## highest: -0.0107582658  0.0007919951  0.0069800099  0.0352025917  0.0427327604
## 
## $Urea_Urea
## .$Coefficient 
##        n  missing distinct     Info     Mean      Gmd 
##     1000        0        1        0        1        0 
##                
## Value         1
## Frequency  1000
## Proportion    1
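
The summaries above appear to describe 1000 bootstrap replicates of each coefficient, with the `.05` and `.95` columns bounding a 90% percentile interval. As a minimal sketch of how such replicates might be produced and turned into a 95% percentile interval, the code below uses simulated data. The names `x`, `y` and `boot_r` are illustrative assumptions, not the objects used in this tutorial.

```r
# Simulated data stand in for two continuous measurements (illustrative only)
set.seed(123)
n <- 79
x <- rnorm(n)
y <- 0.8 * x + rnorm(n, sd = 0.6)

# Resample row indices with replacement and recompute Pearson's r each time
boot_r <- replicate(1000, {
  idx <- sample.int(n, size = n, replace = TRUE)
  cor(x[idx], y[idx], method = "pearson")
})

# Percentile 95% confidence limits and the bootstrap median
boot_ci <- quantile(boot_r, probs = c(0.025, 0.5, 0.975))
print(boot_ci)
```

Resampling the indices, rather than `x` and `y` separately, keeps each pair of observations together, so the replicates reflect the association between the two variables.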

Reporting the results

Putting everything together, we could report the results of our study as follows:

“The associations between Osmolarity and the other variables were evaluated using Pearson’s r coefficient.

Except for urinary pH, which showed a weak negative correlation with the target variable (r(76) = -0.24, p = 0.03), strong, positive and significant correlations were found between Osmolarity and the other variables. For instance, Urea, Gravity and Conductivity might explain 79.2%, 75.7% and 68.9% of the variance in Osmolarity, respectively. Calcium concentration showed a moderate but significant correlation with Osmolarity.”
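
The quantities in such a report (r, its degrees of freedom, the p-value and the percentage of shared variance) can all be read from a single `cor.test()` result. The sketch below shows one possible way to assemble them. It runs on simulated values, so the numbers it prints are not the study's results, and the names `ph`, `osmo` and `report` are assumptions made for illustration.

```r
# Simulated stand-ins for urinary pH and osmolarity (not the tutorial's data)
set.seed(42)
n <- 78
ph <- rnorm(n, mean = 6, sd = 0.7)
osmo <- 600 - 80 * (ph - 6) + rnorm(n, sd = 230)

# Pearson's product-moment correlation test
res <- cor.test(ph, osmo, method = "pearson")

r  <- unname(res$estimate)                 # correlation coefficient
df <- as.integer(unname(res$parameter))    # degrees of freedom, n - 2
p  <- res$p.value                          # two-sided p-value
r2 <- r^2                                  # proportion of shared variance

# Assemble a one-line summary in the r(df) = ..., p = ... style
report <- sprintf("r(%d) = %.2f, p = %.3f, r^2 = %.1f%%", df, r, p, 100 * r2)
print(report)
```

The degrees of freedom equal n - 2, which is why a coefficient reported as r(76) corresponds to 78 complete pairs of observations.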

Exercise

Pick another target variable from the dataset and perform a pairwise correlation analysis on it.
Reproduce the same analysis on your own dataset.

Thank you

ESMS - Case 3: Pearson product-moment correlation coefficient

Lê Đông Nhật Nam

28 Feb 2017
