For this exercise, the data from the survey obtained from MS398 (Biostatistics) students from 2010 and 2016 shall be used.
First, we clean up the R environment to ensure that we will be running the correct dataset in our downstream analysis.
rm(list=ls())
Step 1: Open data file and create the data project
We will then set our work directory. This is the directory that contains the input files and will also be the location where output files are saved. getwd() checks to see if we are in the right directory.
setwd("C:/Users/April Mae Tabonda/Documents/MS Marine Science/Biostat/PLP/RMDs/PLP_9 Correlation Exercise")
getwd()
## [1] "C:/Users/April Mae Tabonda/Documents/MS Marine Science/Biostat/PLP/RMDs/PLP_9 Correlation Exercise"
For this example, we will look into the correlation of three sets of select variables from the student survey.
Step 1
survey <-read.csv("STDNTSURVEY.csv", header = T, sep=",")
attach(survey)
names(survey)
## [1] "PROG" "FULLPART" "SEM" "STAGE"
## [5] "GWA" "FINSUPPORT" "EMPLOYED" "SEX"
## [9] "CIVILSTATUS" "RELIGION" "AGE" "AVESLEEP"
## [13] "EXERCISE" "UNSTRESS" "STUDYAREA" "HOURSNET"
## [17] "SOCIALNET" "CUPRICELUNCH" "VIANDLUNCH" "BFASTDRINK"
## [21] "SOCIALDRINK" "SMOKER" "TVSTA" "MOBILEPROV"
## [25] "NUMGADGETS" "WEIGHT" "HEIGHT" "WAIST"
## [29] "PALM" "LEGLEN" "ARMLEN" "ARMSPAN"
## [33] "SHOULDERWD" "SHOESIZE"
Step 2: Load required libraries
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
library(psych)
##
## Attaching package: 'psych'
## The following object is masked from 'package:Hmisc':
##
## describe
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
Step 3: Select first set of variables to analyze
We will now select the first set of variables for our correlation analysis. For this set, we will look into weight, arm span, height, and waist.
selvar1<-c("WEIGHT","ARMSPAN","HEIGHT","WAIST")
corrvar1 <- survey[selvar1]
head(corrvar1)
## WEIGHT ARMSPAN HEIGHT WAIST
## 1 155.0 71.0 71.0 30.5
## 2 155.0 68.0 64.0 34.0
## 3 113.2 59.5 61.5 26.0
## 4 184.0 72.0 72.0 32.5
## 5 96.0 58.0 59.0 26.5
## 6 132.0 67.0 66.0 31.0
Step 4: Assess the state of the data through plots
Pairwise scatter plots, histogram, locally-smoothed regression and Pearso correlation.
pairs.panels(corrvar1)
Scatter plot
plot(corrvar1$WEIGHT,corrvar1$HEIGHT,
main="Scatterplot of Weight vs. Height",
xlab="Weight",ylab="HEIGHT",pch=19)
Step 5: Do the same for the second and third sets or variables
We will now repeat the first set of steps for variable sets 2 and 3.
Select variables
Select Variables Set 2
selvar2 <-c("GWA","HOURSNET","AVESLEEP")
corrvar2 <-survey[selvar2]
head(corrvar2)
## GWA HOURSNET AVESLEEP
## 1 1.20 5 4.0
## 2 1.75 4 6.0
## 3 1.60 5 6.0
## 4 NA 4 9.0
## 5 1.25 12 6.5
## 6 1.50 12 6.0
Select Variables Set 3
selvar3 <-c("ARMSPAN","ARMLEN","LEGLEN",
"WEIGHT","HEIGHT","SHOULDERWD",
"WAIST","SHOESIZE")
corrvar3 <-survey[selvar3]
head(corrvar3)
## ARMSPAN ARMLEN LEGLEN WEIGHT HEIGHT SHOULDERWD WAIST SHOESIZE
## 1 71.0 31.2 38.0 155.0 71.0 19.0 30.5 11.5
## 2 68.0 26.4 34.5 155.0 64.0 18.0 34.0 9.0
## 3 59.5 26.0 32.4 113.2 61.5 16.0 26.0 10.0
## 4 72.0 29.9 39.0 184.0 72.0 18.0 32.5 11.5
## 5 58.0 25.0 29.4 96.0 59.0 14.9 26.5 8.5
## 6 67.0 28.7 37.0 132.0 66.0 17.0 31.0 11.5
Generate plots
Pairs panel for variable set 2.
pairs.panels(corrvar2)
plot(corrvar2$GWA,corrvar2$AVESLEEP,
main="Scatterplot of Grades vs. Ave. Sleeping Hrs.",
xlab="Ave.Grades",ylab="Ave. Sleeping Hrs.", pch=19)
Step 6: Correlation using Hmisc
We performed a Pearson correlation for each of the three sets of variables chosen. Use the rcorr() function from the Hmisc package to implement this procedure.
For variable set 1:
Pearson correlation with variable labels.
myQs1<-cbind(survey$WEIGHT,survey$ARMSPAN,survey$HEIGHT,survey$WAIST)
rcorr(myQs1,type="pearson")
## [,1] [,2] [,3] [,4]
## [1,] 1.00 0.18 0.64 0.81
## [2,] 0.18 1.00 0.33 0.02
## [3,] 0.64 0.33 1.00 0.46
## [4,] 0.81 0.02 0.46 1.00
##
## n= 86
##
##
## P
## [,1] [,2] [,3] [,4]
## [1,] 0.1017 0.0000 0.0000
## [2,] 0.1017 0.0018 0.8480
## [3,] 0.0000 0.0018 0.0000
## [4,] 0.0000 0.8480 0.0000
For variable set 2:
Pearson correlation with no variable labels.
myQs2 <-cbind(survey$GWA, survey$HOURSNET, survey$AVESLEEP)
rcorr(myQs2, type="pearson")
## [,1] [,2] [,3]
## [1,] 1.00 -0.11 0.13
## [2,] -0.11 1.00 0.03
## [3,] 0.13 0.03 1.00
##
## n
## [,1] [,2] [,3]
## [1,] 76 76 76
## [2,] 76 86 86
## [3,] 76 86 86
##
## P
## [,1] [,2] [,3]
## [1,] 0.3259 0.2797
## [2,] 0.3259 0.7900
## [3,] 0.2797 0.7900
Pearson correlation with variable labels.
myQsc <-as.matrix(corrvar3)
rcorr(myQsc, type="pearson")
## ARMSPAN ARMLEN LEGLEN WEIGHT HEIGHT SHOULDERWD WAIST SHOESIZE
## ARMSPAN 1.00 0.41 0.66 0.18 0.33 0.16 0.02 0.31
## ARMLEN 0.41 1.00 0.37 0.36 0.50 0.14 0.18 0.35
## LEGLEN 0.66 0.37 1.00 0.09 0.11 0.05 -0.06 0.14
## WEIGHT 0.18 0.36 0.09 1.00 0.64 0.45 0.81 0.56
## HEIGHT 0.33 0.50 0.11 0.64 1.00 0.51 0.46 0.58
## SHOULDERWD 0.16 0.14 0.05 0.45 0.51 1.00 0.32 0.49
## WAIST 0.02 0.18 -0.06 0.81 0.46 0.32 1.00 0.37
## SHOESIZE 0.31 0.35 0.14 0.56 0.58 0.49 0.37 1.00
##
## n= 86
##
##
## P
## ARMSPAN ARMLEN LEGLEN WEIGHT HEIGHT SHOULDERWD WAIST SHOESIZE
## ARMSPAN 0.0000 0.0000 0.1017 0.0018 0.1423 0.8480 0.0041
## ARMLEN 0.0000 0.0004 0.0006 0.0000 0.1868 0.0936 0.0011
## LEGLEN 0.0000 0.0004 0.4097 0.2998 0.6743 0.6000 0.1858
## WEIGHT 0.1017 0.0006 0.4097 0.0000 0.0000 0.0000 0.0000
## HEIGHT 0.0018 0.0000 0.2998 0.0000 0.0000 0.0000 0.0000
## SHOULDERWD 0.1423 0.1868 0.6743 0.0000 0.0000 0.0026 0.0000
## WAIST 0.8480 0.0936 0.6000 0.0000 0.0000 0.0026 0.0005
## SHOESIZE 0.0041 0.0011 0.1858 0.0000 0.0000 0.0000 0.0005
**Step 7: Correlation matrix visualization
To visualize the correlation matrix, use the package ggpcorrlot. For this part, we will use the first set of variables. Then load the required package.
library(ggcorrplot)
We will then compute a correlation matrix.
corr <-round(cor(corrvar1),1)
head(corr[,1:4])
## WEIGHT ARMSPAN HEIGHT WAIST
## WEIGHT 1.0 0.2 0.6 0.8
## ARMSPAN 0.2 1.0 0.3 0.0
## HEIGHT 0.6 0.3 1.0 0.5
## WAIST 0.8 0.0 0.5 1.0
After this, we can get the correlation matrix p-values.
p.mat <-cor_pmat(corrvar1)
head(p.mat[,1:4])
## WEIGHT ARMSPAN HEIGHT WAIST
## WEIGHT 0.000000e+00 0.101660934 2.192698e-11 1.612333e-21
## ARMSPAN 1.016609e-01 0.000000000 1.760706e-03 8.479829e-01
## HEIGHT 2.192698e-11 0.001760706 0.000000e+00 8.898182e-06
## WAIST 1.612333e-21 0.847982920 8.898182e-06 0.000000e+00
We can then visualize the correlation matrix using different shapes. For this example, we will use the default square, as well as the circle method.
method= square which is the deafult
ggcorrplot(corr)
To set the method = circle, use the code below.
ggcorrplot(corr, method="circle")
We can also reorder the matrix for easier analysis. Here, we will reorder the matrix based on hierarchical clustering.
ggcorrplot(corr,hc.order = T, outline.col="white")
We can also alter the correlogram layout.
To show the lower triangle, use the code below.
ggcorrplot(corr, hc.order = T, outline.col="white", type="lower")
To show upper triangle, use the code below.
ggcorrplot(corr, hc.order = T, outline.col="white", type="upper")
For further aesthetics, we can use the themes and define color schemes.
ggcorrplot(corr,hc.order = T,outline.col="white",type="lower",
ggtheme=ggplot2::theme_gray(),
colors=c("#6D9EC1","white","#E46726"))
To aid in analysis, we can also add correlation coefficients in the plot.
ggcorrplot(corr, hc.order=T, outline.col="white", type="lower",
lab=TRUE)
Furthermore, we can also show the correlation coefficients significance level.
Barring non-significant coefficient.
ggcorrplot(corr,hc.order = T, outline.color = "white", type="lower",
p.mat=p.mat)
Leave blank on non-significant coefficient.
ggcorrplot(corr,hc.order = T, outline.col="white", type="lower",
insig="blank")