Correlation (as we will be using it, the Pearson correlation) is a valuable tool for analyzing data. Perhaps ironically, it is rarely reported in academic work (which typically favors various types of regression), but it is a useful tool when investigating data … as long as you are aware of the dangers.
The idea behind correlation is that as the value of one measurement varies, so does the value of another measurement:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
cars_example<-mtcars %>% select(hp,wt) %>% arrange(-hp,wt)
cars_example
## hp wt
## Maserati Bora 335 3.570
## Ford Pantera L 264 3.170
## Duster 360 245 3.570
## Camaro Z28 245 3.840
## Chrysler Imperial 230 5.345
## Lincoln Continental 215 5.424
## Cadillac Fleetwood 205 5.250
## Merc 450SL 180 3.730
## Merc 450SLC 180 3.780
## Merc 450SE 180 4.070
## Ferrari Dino 175 2.770
## Hornet Sportabout 175 3.440
## Pontiac Firebird 175 3.845
## AMC Javelin 150 3.435
## Dodge Challenger 150 3.520
## Merc 280 123 3.440
## Merc 280C 123 3.440
## Lotus Europa 113 1.513
## Mazda RX4 110 2.620
## Mazda RX4 Wag 110 2.875
## Hornet 4 Drive 110 3.215
## Volvo 142E 109 2.780
## Valiant 105 3.460
## Toyota Corona 97 2.465
## Merc 230 95 3.150
## Datsun 710 93 2.320
## Porsche 914-2 91 2.140
## Fiat X1-9 66 1.935
## Fiat 128 66 2.200
## Toyota Corolla 65 1.835
## Merc 240D 62 3.190
## Honda Civic 52 1.615
The relationship doesn't have to be perfect (as in this case): as HP goes up, weight tends to go up as well. There are counterexamples (the Chrysler Imperial, for example), but the relationship generally holds true.
The essence of correlation is “relationships”.
cor(cars_example)
## hp wt
## hp 1.0000000 0.6587479
## wt 0.6587479 1.0000000
We see that the correlation between "HP" and "HP" is 1 (that is, 100%), because they are the same variable. The correlation between HP and WT is 0.66 (you can call this "66%" in class, although the "percent" terminology is not used in the literature).
What does this mean? There is some positive relationship between HP and WT (as one increases, the other tends to increase as well).
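To make the 0.66 less mysterious, here is a quick sanity check (a minimal sketch using base R): the Pearson correlation is simply the covariance of the two variables divided by the product of their standard deviations.
# Pearson r = covariance / (sd of hp * sd of wt)
cov(cars_example$hp,cars_example$wt)/(sd(cars_example$hp)*sd(cars_example$wt))
# this should match cor(cars_example$hp,cars_example$wt), roughly 0.659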
Can we run a correlation analysis on the whole dataset?
cor(mtcars)
## mpg cyl disp hp drat wt
## mpg 1.0000000 -0.8521620 -0.8475514 -0.7761684 0.68117191 -0.8676594
## cyl -0.8521620 1.0000000 0.9020329 0.8324475 -0.69993811 0.7824958
## disp -0.8475514 0.9020329 1.0000000 0.7909486 -0.71021393 0.8879799
## hp -0.7761684 0.8324475 0.7909486 1.0000000 -0.44875912 0.6587479
## drat 0.6811719 -0.6999381 -0.7102139 -0.4487591 1.00000000 -0.7124406
## wt -0.8676594 0.7824958 0.8879799 0.6587479 -0.71244065 1.0000000
## qsec 0.4186840 -0.5912421 -0.4336979 -0.7082234 0.09120476 -0.1747159
## vs 0.6640389 -0.8108118 -0.7104159 -0.7230967 0.44027846 -0.5549157
## am 0.5998324 -0.5226070 -0.5912270 -0.2432043 0.71271113 -0.6924953
## gear 0.4802848 -0.4926866 -0.5555692 -0.1257043 0.69961013 -0.5832870
## carb -0.5509251 0.5269883 0.3949769 0.7498125 -0.09078980 0.4276059
## qsec vs am gear carb
## mpg 0.41868403 0.6640389 0.59983243 0.4802848 -0.55092507
## cyl -0.59124207 -0.8108118 -0.52260705 -0.4926866 0.52698829
## disp -0.43369788 -0.7104159 -0.59122704 -0.5555692 0.39497686
## hp -0.70822339 -0.7230967 -0.24320426 -0.1257043 0.74981247
## drat 0.09120476 0.4402785 0.71271113 0.6996101 -0.09078980
## wt -0.17471588 -0.5549157 -0.69249526 -0.5832870 0.42760594
## qsec 1.00000000 0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs 0.74453544 1.0000000 0.16834512 0.2060233 -0.56960714
## am -0.22986086 0.1683451 1.00000000 0.7940588 0.05753435
## gear -0.21268223 0.2060233 0.79405876 1.0000000 0.27407284
## carb -0.65624923 -0.5696071 0.05753435 0.2740728 1.00000000
# in this case they can all be correlated because they are numerical values. if
# any variable is stored as characters you cannot correlate the whole file
We can also create a scatterplot with some of these variables:
pairs(~mpg+cyl+disp+hp+drat+wt,data=mtcars)
cor(mtcars$mpg,mtcars$drat)
## [1] 0.6811719
Short answer: yes, if all the variables are numeric. If not, you'll have to "SELECT" the numeric ones into a new dataset.
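For example, one way to keep only the numeric columns (a sketch; select(where(...)) assumes dplyr 1.0 or later):
# hypothetical clean-up step: keep only the numeric columns before calling cor()
numeric_only<-mtcars %>% select(where(is.numeric))
cor(numeric_only)
# mtcars is already all numeric, so this just reproduces the matrix above;
# on a mixed data frame it would drop the character columns first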
What do we see here? Any observations?
Notice: CYL and DISP are very strongly correlated in the positive direction (as CYL goes up, DISP goes up)
WT and MPG are strongly (and negatively) correlated (as WT goes up, MPG goes down and vice-versa)
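You can also pull out a single pairwise correlation directly instead of scanning the whole matrix:
cor(mtcars$wt,mtcars$mpg)
# about -0.87, the same value as the wt/mpg entry in the matrix above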
Correlations are good for discovering relationships in the data, but they don't necessarily "explain" the data, so to speak. "Explaining" data ultimately requires human judgment.
Example: we know there is a relationship between CYL and DISP. As one goes up, the other goes up too. But does CYL cause DISP to go up? Or does DISP cause CYL to go up?
You can’t tell this just from the data – but DISP (engine displacement) gets bigger because CYL (number of cylinders) gets bigger.
The field of knowledge about cause-and-effect is called “CAUSALITY” and it is mostly beyond the scope of this course, except to say the following:
This is ultimately why, as stated previously, you can't write an academic paper that just says "there is a correlation between X and Y, therefore X causes Y" or something similar. Correlation is more of a first step in examining the data: it's a descriptive statistic that helps you see how variables might be related to each other.
Determining cause and effect is actually quite difficult. Linear regression can indirectly help with this, however, and we will discuss it in depth over the next few weeks.
cor.test(cars_example$hp,cars_example$wt,method = "pearson")
##
## Pearson's product-moment correlation
##
## data: cars_example$hp and cars_example$wt
## t = 4.7957, df = 30, p-value = 4.146e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4025113 0.8192573
## sample estimates:
## cor
## 0.6587479
# roughly speaking, we are 95% confident that the true correlation is between .4025 and .8192
# if the confidence interval includes 0, then the true correlation
# could be zero (i.e. the variables may not be correlated at all)
We see more information when running cor.test (Pearson's product-moment correlation test): the correlation is highly significant (very unlikely to be due to chance). Also, the confidence interval does not "cross zero", which means the result is reliably positive.
Problem: Pearson's correlation is only strictly valid if the original variables are (approximately) normally distributed and the relationship between them is linear.
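A quick way to eyeball the linearity assumption (a minimal sketch using base R plotting) is a simple scatterplot of the two variables:
# does the hp/wt relationship look roughly linear rather than curved?
plot(cars_example$hp,cars_example$wt,xlab="hp",ylab="wt")
For normality, histograms with density overlays give a quick visual check: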
hist(cars_example$hp,probability = T)
lines(density(cars_example$hp),col="red",lwd=3)
hist(cars_example$wt,probability = T)
lines(density(cars_example$wt),col="red",lwd=3)
Immediately we can tell that HP is not normally distributed (it is skewed to the right).
We have a few options here. We can either transform the variables (as we discussed last week) or we can use a different correlation method – one that doesn’t involve parametric assumptions like normality.
shapiro.test(cars_example$hp)
##
## Shapiro-Wilk normality test
##
## data: cars_example$hp
## W = 0.93342, p-value = 0.04881
(p < 0.05, so HP is probably not normally distributed)
shapiro.test(cars_example$wt)
##
## Shapiro-Wilk normality test
##
## data: cars_example$wt
## W = 0.94326, p-value = 0.09265
(p > 0.05, so WT is statistically indistinguishable from normal)
Let's create a new variable using "MUTATE", which transforms HP via the log transformation:
cars_example<-cars_example %>% mutate(HP_LOG_TRANSFORM=log(hp))
head(cars_example)
## hp wt HP_LOG_TRANSFORM
## Maserati Bora 335 3.570 5.814131
## Ford Pantera L 264 3.170 5.575949
## Duster 360 245 3.570 5.501258
## Camaro Z28 245 3.840 5.501258
## Chrysler Imperial 230 5.345 5.438079
## Lincoln Continental 215 5.424 5.370638
shapiro.test(cars_example$HP_LOG_TRANSFORM)
##
## Shapiro-Wilk normality test
##
## data: cars_example$HP_LOG_TRANSFORM
## W = 0.97026, p-value = 0.5065
hist(cars_example$HP_LOG_TRANSFORM,probability = T)
lines(density(cars_example$HP_LOG_TRANSFORM),col="red",lwd=3)
The log-transformed HP now looks much closer to normal (the Shapiro-Wilk test no longer rejects normality). We can then run cor.test on the transformed variable:
cor.test(cars_example$HP_LOG_TRANSFORM,cars_example$wt)
##
## Pearson's product-moment correlation
##
## data: cars_example$HP_LOG_TRANSFORM and cars_example$wt
## t = 5.6149, df = 30, p-value = 4.108e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4892538 0.8518867
## sample estimates:
## cor
## 0.7158277
You'll notice that the correlation has increased somewhat and the confidence interval has narrowed: the estimate is more precise now that HP has been brought closer to a normal distribution.
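If you want to see the narrowing directly, you can pull the confidence intervals out of the two test objects (a small sketch; conf.int is a standard component of the object returned by cor.test):
ci_raw<-cor.test(cars_example$hp,cars_example$wt)$conf.int
ci_log<-cor.test(cars_example$HP_LOG_TRANSFORM,cars_example$wt)$conf.int
diff(ci_raw) # width of the interval for raw hp, about 0.42
diff(ci_log) # width of the interval for log-transformed hp, about 0.36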
OR we can try a non-parametric correlation test, such as SPEARMAN'S RHO correlation:
cor.test(cars_example$hp,cars_example$wt,method="spearman")
## Warning in cor.test.default(cars_example$hp, cars_example$wt, method =
## "spearman"): Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: cars_example$hp and cars_example$wt
## S = 1229.4, p-value = 1.954e-07
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.7746767
Notice that the SPEARMAN’s RHO result and the LOG TRANSFORMED result are (somewhat) similar.
SPEARMAN'S RHO gives us a warning if there are "ties" (repeated values) in the data (if one car has the same HP as another, etc.)
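You can check for ties yourself with a quick frequency table (a small sketch using base R):
# which hp values are shared by more than one car?
hp_counts<-table(cars_example$hp)
hp_counts[hp_counts>1]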
Another method is KENDALL'S TAU (which copes better with tied ranks):
cor.test(cars_example$hp,cars_example$wt,method="kendall")
## Warning in cor.test.default(cars_example$hp, cars_example$wt, method =
## "kendall"): Cannot compute exact p-value with ties
##
## Kendall's rank correlation tau
##
## data: cars_example$hp and cars_example$wt
## z = 4.845, p-value = 1.266e-06
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## 0.6113081
We still get this warning, but Kendall's tau is somewhat better suited to data with ties. Even then, the correlation is still both significant and positive, so no harm done if we choose the "wrong" statistical method; the answers are still mostly the same.
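As a quick recap, here are the three estimates side by side (a small sketch; the numbers simply repeat the results computed above):
c(pearson_log=cor(cars_example$HP_LOG_TRANSFORM,cars_example$wt),
  spearman=cor(cars_example$hp,cars_example$wt,method="spearman"),
  kendall=cor(cars_example$hp,cars_example$wt,method="kendall"))
# roughly 0.72, 0.77 and 0.61 respectively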
Correlation analysis allows us to see different relationships within the data.
This is useful when initially looking at the data, running descriptive statistics and trying to understand your data. Correlation analysis should be part of the “descriptive statistics” that you run at the beginning of every project.
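One possible way to fold this into your descriptive statistics (a sketch, not a required step) is to print a rounded correlation matrix up front so it is easier to scan:
# rounded to two decimals for readability
round(cor(mtcars),2)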
This is neat, but don't be fooled: there are some major exceptions.
The most obvious exception is attempting to correlate measurements that are neither continuous nor ordinal (i.e., unordered categories).
For example, we want to see if there is a correlation between having red hair and liking Strawberry ice cream. We asked 10 students for their favorite ice cream, and we measured their hair color (RED or NOT RED):
flavor1<-c("Chocolate","Chocolate","Vanilla","Strawberry","Strawberry","Vanilla","Chocolate","Chocolate","Strawberry","Chocolate")
flavor_id<-c(1,1,3,2,2,3,1,1,2,1)
student_id<-c(1:10)
red_hair<-c(0,1,0,0,0,1,1,0,0,0)
test_frame<-data.frame(flavor_id,flavor1,student_id,red_hair)
test_frame
## flavor_id flavor1 student_id red_hair
## 1 1 Chocolate 1 0
## 2 1 Chocolate 2 1
## 3 3 Vanilla 3 0
## 4 2 Strawberry 4 0
## 5 2 Strawberry 5 0
## 6 3 Vanilla 6 1
## 7 1 Chocolate 7 1
## 8 1 Chocolate 8 0
## 9 2 Strawberry 9 0
## 10 1 Chocolate 10 0
We see that:
Red hair = 1, NOT red hair = 0
BUT
Chocolate = 1, Strawberry = 2, Vanilla = 3
So, if we try to find correlations between them:
cor.test(test_frame$flavor_id,test_frame$red_hair,method="kendall")
## Warning in cor.test.default(test_frame$flavor_id, test_frame$red_hair, method =
## "kendall"): Cannot compute exact p-value with ties
##
## Kendall's rank correlation tau
##
## data: test_frame$flavor_id and test_frame$red_hair
## z = -0.24744, p-value = 0.8046
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## -0.07838618
Don't be fooled: this result is meaningless, because it doesn't make sense to "score" ice cream flavors as "1, 2, 3" for Chocolate, Strawberry and Vanilla. Is Vanilla "better" than Strawberry? Is Chocolate "worse" than Strawberry?
For this kind of analysis to work, we have to convert "strawberry" into a binary variable (like red hair). Observe:
test_frame<-test_frame %>% mutate(STRAWBERRY_VARIABLE=ifelse(flavor_id==2,1,0))
test_frame
## flavor_id flavor1 student_id red_hair STRAWBERRY_VARIABLE
## 1 1 Chocolate 1 0 0
## 2 1 Chocolate 2 1 0
## 3 3 Vanilla 3 0 0
## 4 2 Strawberry 4 0 1
## 5 2 Strawberry 5 0 1
## 6 3 Vanilla 6 1 0
## 7 1 Chocolate 7 1 0
## 8 1 Chocolate 8 0 0
## 9 2 Strawberry 9 0 1
## 10 1 Chocolate 10 0 0
Now the correlation analysis makes more sense. We are measuring a binary version of red hair (0 for no red hair, 1 for red hair) against a binary version of strawberry preference (0 for not strawberry, 1 for strawberry).
cor.test(test_frame$STRAWBERRY_VARIABLE,test_frame$red_hair,method="kendall")
## Warning in cor.test.default(test_frame$STRAWBERRY_VARIABLE,
## test_frame$red_hair, : Cannot compute exact p-value with ties
##
## Kendall's rank correlation tau
##
## data: test_frame$STRAWBERRY_VARIABLE and test_frame$red_hair
## z = -1.2857, p-value = 0.1985
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## -0.4285714
We fixed the analysis design, and the sample now shows a negative correlation between liking strawberry ice cream and having red hair. BUT WAIT, look: the p-value is much greater than 0.05. This means that while the estimated correlation is negative, the result is not statistically significant.
We could report this as “There is no significant relationship between red hair color and preference for strawberry ice cream”. No fancy graph is needed – the sentence alone will work.