For this exercise, the data from the survey obtained from MS398 (Biostatistics) students from 2010 and 2016 shall be used.

First, we clean up the R environment to ensure that we will be running the correct dataset in our downstream analysis.

rm(list=ls())

Step 1: Open data file and create the data project

We will then set our work directory. This is the directory that contains the input files and will also be the location where output files are saved. getwd() checks to see if we are in the right directory.

setwd("C:/Users/April Mae Tabonda/Documents/MS Marine Science/Biostat/PLP/RMDs/PLP_9 Correlation Exercise")
getwd()
## [1] "C:/Users/April Mae Tabonda/Documents/MS Marine Science/Biostat/PLP/RMDs/PLP_9 Correlation Exercise"

Example: Correlation

For this example, we will look into the correlation of three sets of select variables from the student survey.

Step 1

survey <-read.csv("STDNTSURVEY.csv", header = T, sep=",")
attach(survey)
names(survey)
##  [1] "PROG"         "FULLPART"     "SEM"          "STAGE"       
##  [5] "GWA"          "FINSUPPORT"   "EMPLOYED"     "SEX"         
##  [9] "CIVILSTATUS"  "RELIGION"     "AGE"          "AVESLEEP"    
## [13] "EXERCISE"     "UNSTRESS"     "STUDYAREA"    "HOURSNET"    
## [17] "SOCIALNET"    "CUPRICELUNCH" "VIANDLUNCH"   "BFASTDRINK"  
## [21] "SOCIALDRINK"  "SMOKER"       "TVSTA"        "MOBILEPROV"  
## [25] "NUMGADGETS"   "WEIGHT"       "HEIGHT"       "WAIST"       
## [29] "PALM"         "LEGLEN"       "ARMLEN"       "ARMSPAN"     
## [33] "SHOULDERWD"   "SHOESIZE"

Step 2: Load required libraries

library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(psych)
## 
## Attaching package: 'psych'
## The following object is masked from 'package:Hmisc':
## 
##     describe
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

Step 3: Select first set of variables to analyze

We will now select the first set of variables for our correlation analysis. For this set, we will look into weight, arm span, height, and waist.

selvar1<-c("WEIGHT","ARMSPAN","HEIGHT","WAIST")
corrvar1 <- survey[selvar1]
head(corrvar1)
##   WEIGHT ARMSPAN HEIGHT WAIST
## 1  155.0    71.0   71.0  30.5
## 2  155.0    68.0   64.0  34.0
## 3  113.2    59.5   61.5  26.0
## 4  184.0    72.0   72.0  32.5
## 5   96.0    58.0   59.0  26.5
## 6  132.0    67.0   66.0  31.0

Step 4: Assess the state of the data through plots

Pairwise scatter plots, histogram, locally-smoothed regression and Pearso correlation.

pairs.panels(corrvar1)

Scatter plot

plot(corrvar1$WEIGHT,corrvar1$HEIGHT,
     main="Scatterplot of Weight vs. Height",
     xlab="Weight",ylab="HEIGHT",pch=19)

Step 5: Do the same for the second and third sets or variables

We will now repeat the first set of steps for variable sets 2 and 3.

Select variables

Select Variables Set 2

selvar2 <-c("GWA","HOURSNET","AVESLEEP")
corrvar2 <-survey[selvar2]
head(corrvar2)
##    GWA HOURSNET AVESLEEP
## 1 1.20        5      4.0
## 2 1.75        4      6.0
## 3 1.60        5      6.0
## 4   NA        4      9.0
## 5 1.25       12      6.5
## 6 1.50       12      6.0

Select Variables Set 3

selvar3 <-c("ARMSPAN","ARMLEN","LEGLEN",
            "WEIGHT","HEIGHT","SHOULDERWD",
            "WAIST","SHOESIZE")
corrvar3 <-survey[selvar3]
head(corrvar3)
##   ARMSPAN ARMLEN LEGLEN WEIGHT HEIGHT SHOULDERWD WAIST SHOESIZE
## 1    71.0   31.2   38.0  155.0   71.0       19.0  30.5     11.5
## 2    68.0   26.4   34.5  155.0   64.0       18.0  34.0      9.0
## 3    59.5   26.0   32.4  113.2   61.5       16.0  26.0     10.0
## 4    72.0   29.9   39.0  184.0   72.0       18.0  32.5     11.5
## 5    58.0   25.0   29.4   96.0   59.0       14.9  26.5      8.5
## 6    67.0   28.7   37.0  132.0   66.0       17.0  31.0     11.5

Generate plots

Pairs panel for variable set 2.

pairs.panels(corrvar2)

Scatter plot for variable set 2

plot(corrvar2$GWA,corrvar2$AVESLEEP,
     main="Scatterplot of Grades vs. Ave. Sleeping Hrs.",
     xlab="Ave.Grades",ylab="Ave. Sleeping Hrs.", pch=19)

Step 6: Correlation using Hmisc

We performed a Pearson correlation for each of the three sets of variables chosen. Use the rcorr() function from the Hmisc package to implement this procedure.

For variable set 1:

Pearson correlation with variable labels.

myQs1<-cbind(survey$WEIGHT,survey$ARMSPAN,survey$HEIGHT,survey$WAIST)
rcorr(myQs1,type="pearson")
##      [,1] [,2] [,3] [,4]
## [1,] 1.00 0.18 0.64 0.81
## [2,] 0.18 1.00 0.33 0.02
## [3,] 0.64 0.33 1.00 0.46
## [4,] 0.81 0.02 0.46 1.00
## 
## n= 86 
## 
## 
## P
##      [,1]   [,2]   [,3]   [,4]  
## [1,]        0.1017 0.0000 0.0000
## [2,] 0.1017        0.0018 0.8480
## [3,] 0.0000 0.0018        0.0000
## [4,] 0.0000 0.8480 0.0000

For variable set 2:

Pearson correlation with no variable labels.

myQs2 <-cbind(survey$GWA, survey$HOURSNET, survey$AVESLEEP)
rcorr(myQs2, type="pearson")
##       [,1]  [,2] [,3]
## [1,]  1.00 -0.11 0.13
## [2,] -0.11  1.00 0.03
## [3,]  0.13  0.03 1.00
## 
## n
##      [,1] [,2] [,3]
## [1,]   76   76   76
## [2,]   76   86   86
## [3,]   76   86   86
## 
## P
##      [,1]   [,2]   [,3]  
## [1,]        0.3259 0.2797
## [2,] 0.3259        0.7900
## [3,] 0.2797 0.7900

Pearson correlation with variable labels.

myQsc <-as.matrix(corrvar3)
rcorr(myQsc, type="pearson")
##            ARMSPAN ARMLEN LEGLEN WEIGHT HEIGHT SHOULDERWD WAIST SHOESIZE
## ARMSPAN       1.00   0.41   0.66   0.18   0.33       0.16  0.02     0.31
## ARMLEN        0.41   1.00   0.37   0.36   0.50       0.14  0.18     0.35
## LEGLEN        0.66   0.37   1.00   0.09   0.11       0.05 -0.06     0.14
## WEIGHT        0.18   0.36   0.09   1.00   0.64       0.45  0.81     0.56
## HEIGHT        0.33   0.50   0.11   0.64   1.00       0.51  0.46     0.58
## SHOULDERWD    0.16   0.14   0.05   0.45   0.51       1.00  0.32     0.49
## WAIST         0.02   0.18  -0.06   0.81   0.46       0.32  1.00     0.37
## SHOESIZE      0.31   0.35   0.14   0.56   0.58       0.49  0.37     1.00
## 
## n= 86 
## 
## 
## P
##            ARMSPAN ARMLEN LEGLEN WEIGHT HEIGHT SHOULDERWD WAIST  SHOESIZE
## ARMSPAN            0.0000 0.0000 0.1017 0.0018 0.1423     0.8480 0.0041  
## ARMLEN     0.0000         0.0004 0.0006 0.0000 0.1868     0.0936 0.0011  
## LEGLEN     0.0000  0.0004        0.4097 0.2998 0.6743     0.6000 0.1858  
## WEIGHT     0.1017  0.0006 0.4097        0.0000 0.0000     0.0000 0.0000  
## HEIGHT     0.0018  0.0000 0.2998 0.0000        0.0000     0.0000 0.0000  
## SHOULDERWD 0.1423  0.1868 0.6743 0.0000 0.0000            0.0026 0.0000  
## WAIST      0.8480  0.0936 0.6000 0.0000 0.0000 0.0026            0.0005  
## SHOESIZE   0.0041  0.0011 0.1858 0.0000 0.0000 0.0000     0.0005

**Step 7: Correlation matrix visualization

To visualize the correlation matrix, use the package ggpcorrlot. For this part, we will use the first set of variables. Then load the required package.

library(ggcorrplot)

We will then compute a correlation matrix.

corr <-round(cor(corrvar1),1)
head(corr[,1:4])
##         WEIGHT ARMSPAN HEIGHT WAIST
## WEIGHT     1.0     0.2    0.6   0.8
## ARMSPAN    0.2     1.0    0.3   0.0
## HEIGHT     0.6     0.3    1.0   0.5
## WAIST      0.8     0.0    0.5   1.0

After this, we can get the correlation matrix p-values.

p.mat <-cor_pmat(corrvar1)
head(p.mat[,1:4])
##               WEIGHT     ARMSPAN       HEIGHT        WAIST
## WEIGHT  0.000000e+00 0.101660934 2.192698e-11 1.612333e-21
## ARMSPAN 1.016609e-01 0.000000000 1.760706e-03 8.479829e-01
## HEIGHT  2.192698e-11 0.001760706 0.000000e+00 8.898182e-06
## WAIST   1.612333e-21 0.847982920 8.898182e-06 0.000000e+00

We can then visualize the correlation matrix using different shapes. For this example, we will use the default square, as well as the circle method.

method= square which is the deafult

ggcorrplot(corr)

To set the method = circle, use the code below.

ggcorrplot(corr, method="circle")

We can also reorder the matrix for easier analysis. Here, we will reorder the matrix based on hierarchical clustering.

ggcorrplot(corr,hc.order = T, outline.col="white")

We can also alter the correlogram layout.

To show the lower triangle, use the code below.

ggcorrplot(corr, hc.order = T, outline.col="white", type="lower")

To show upper triangle, use the code below.

ggcorrplot(corr, hc.order = T, outline.col="white", type="upper")

For further aesthetics, we can use the themes and define color schemes.

ggcorrplot(corr,hc.order = T,outline.col="white",type="lower",
           ggtheme=ggplot2::theme_gray(),
           colors=c("#6D9EC1","white","#E46726"))

To aid in analysis, we can also add correlation coefficients in the plot.

ggcorrplot(corr, hc.order=T, outline.col="white", type="lower",
           lab=TRUE)

Furthermore, we can also show the correlation coefficients significance level.

Barring non-significant coefficient.

ggcorrplot(corr,hc.order = T, outline.color = "white", type="lower",
p.mat=p.mat)

Leave blank on non-significant coefficient.

ggcorrplot(corr,hc.order = T, outline.col="white", type="lower",
           insig="blank")