Correlation (as we will be using it, the Pearson correlation) is a valuable tool for analyzing data. Perhaps ironically, it is rarely reported in academic work (which typically favors various types of regression), but it is a useful tool when investigating data … as long as you are aware of the dangers.
The idea behind correlation is that as the value of one measurement varies, so does the value of another measurement:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
cars_example<-mtcars %>% select(hp,wt) %>% arrange(-hp,wt)
cars_example
## hp wt
## Maserati Bora 335 3.570
## Ford Pantera L 264 3.170
## Duster 360 245 3.570
## Camaro Z28 245 3.840
## Chrysler Imperial 230 5.345
## Lincoln Continental 215 5.424
## Cadillac Fleetwood 205 5.250
## Merc 450SL 180 3.730
## Merc 450SLC 180 3.780
## Merc 450SE 180 4.070
## Ferrari Dino 175 2.770
## Hornet Sportabout 175 3.440
## Pontiac Firebird 175 3.845
## AMC Javelin 150 3.435
## Dodge Challenger 150 3.520
## Merc 280 123 3.440
## Merc 280C 123 3.440
## Lotus Europa 113 1.513
## Mazda RX4 110 2.620
## Mazda RX4 Wag 110 2.875
## Hornet 4 Drive 110 3.215
## Volvo 142E 109 2.780
## Valiant 105 3.460
## Toyota Corona 97 2.465
## Merc 230 95 3.150
## Datsun 710 93 2.320
## Porsche 914-2 91 2.140
## Fiat X1-9 66 1.935
## Fiat 128 66 2.200
## Toyota Corolla 65 1.835
## Merc 240D 62 3.190
## Honda Civic 52 1.615
The relationship doesn't have to be perfect (as in this case): as HP goes up, weight tends to go up as well. There are counterexamples (the Chrysler Imperial, for example), but the relationship generally holds true.
The essence of correlation is “relationships”.
cor(cars_example)
## hp wt
## hp 1.0000000 0.6587479
## wt 0.6587479 1.0000000
We see that the correlation between "HP" and "HP" is 1 (that is, 100%), because they are the same variable. The correlation between HP and WT is 0.66 (you can call this "66%" in class, although the "percent" terminology is not used in the literature).
What does this mean? There is some positive relationship between HP and WT (as one increases, the other tends to increase as well).
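To make the 0.66 less mysterious, here is a quick sanity check (a minimal sketch using base R): the Pearson correlation is simply the covariance of the two variables divided by the product of their standard deviations.
# Pearson r = covariance / (sd of hp * sd of wt)
cov(cars_example$hp,cars_example$wt)/(sd(cars_example$hp)*sd(cars_example$wt))
# this should match cor(cars_example$hp,cars_example$wt), roughly 0.659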
Can we run a correlation analysis on the whole dataset?
cor(mtcars)
## mpg cyl disp hp drat wt
## mpg 1.0000000 -0.8521620 -0.8475514 -0.7761684 0.68117191 -0.8676594
## cyl -0.8521620 1.0000000 0.9020329 0.8324475 -0.69993811 0.7824958
## disp -0.8475514 0.9020329 1.0000000 0.7909486 -0.71021393 0.8879799
## hp -0.7761684 0.8324475 0.7909486 1.0000000 -0.44875912 0.6587479
## drat 0.6811719 -0.6999381 -0.7102139 -0.4487591 1.00000000 -0.7124406
## wt -0.8676594 0.7824958 0.8879799 0.6587479 -0.71244065 1.0000000
## qsec 0.4186840 -0.5912421 -0.4336979 -0.7082234 0.09120476 -0.1747159
## vs 0.6640389 -0.8108118 -0.7104159 -0.7230967 0.44027846 -0.5549157
## am 0.5998324 -0.5226070 -0.5912270 -0.2432043 0.71271113 -0.6924953
## gear 0.4802848 -0.4926866 -0.5555692 -0.1257043 0.69961013 -0.5832870
## carb -0.5509251 0.5269883 0.3949769 0.7498125 -0.09078980 0.4276059
## qsec vs am gear carb
## mpg 0.41868403 0.6640389 0.59983243 0.4802848 -0.55092507
## cyl -0.59124207 -0.8108118 -0.52260705 -0.4926866 0.52698829
## disp -0.43369788 -0.7104159 -0.59122704 -0.5555692 0.39497686
## hp -0.70822339 -0.7230967 -0.24320426 -0.1257043 0.74981247
## drat 0.09120476 0.4402785 0.71271113 0.6996101 -0.09078980
## wt -0.17471588 -0.5549157 -0.69249526 -0.5832870 0.42760594
## qsec 1.00000000 0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs 0.74453544 1.0000000 0.16834512 0.2060233 -0.56960714
## am -0.22986086 0.1683451 1.00000000 0.7940588 0.05753435
## gear -0.21268223 0.2060233 0.79405876 1.0000000 0.27407284
## carb -0.65624923 -0.5696071 0.05753435 0.2740728 1.00000000
# in this case they can all be correlated because they are numerical values. if
# any variable is stored as characters you cannot correlate the whole file
We can also create a scatterplot with some of these variables:
pairs(~mpg+cyl+disp+hp+drat+wt,data=mtcars)
cor(mtcars$mpg,mtcars$drat)
## [1] 0.6811719
Short answer: yes, if all the variables are numeric. If not, you'll have to "SELECT" the numeric ones into a new dataset.
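For example, one way to keep only the numeric columns (a sketch; select(where(...)) assumes dplyr 1.0 or later):
# hypothetical clean-up step: keep only the numeric columns before calling cor()
numeric_only<-mtcars %>% select(where(is.numeric))
cor(numeric_only)
# mtcars is already all numeric, so this just reproduces the matrix above;
# on a mixed data frame it would drop the character columns first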
What do we see here? Any observations?
Notice: CYL and DISP are very strongly correlated in the positive direction (as CYL goes up, DISP goes up)
WT and MPG are strongly (and negatively) correlated (as WT goes up, MPG goes down and vice-versa)
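You can also pull out a single pairwise correlation directly instead of scanning the whole matrix:
cor(mtcars$wt,mtcars$mpg)
# about -0.87, the same value as the wt/mpg entry in the matrix above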
Correlations are good for discovering relationships in the data, but they don't necessarily "explain" the data, so to speak. "Explaining" data ultimately requires human judgment.
Example: we know there is a relationship between CYL and DISP. As one goes up, the other goes up too. But does CYL cause DISP to go up? Or does DISP cause CYL to go up?
You can’t tell this just from the data – but DISP (engine displacement) gets bigger because CYL (number of cylinders) gets bigger.
The field of knowledge about cause-and-effect is called “CAUSALITY” and it is mostly beyond the scope of this course, except to say the following:
This is ultimately why, as stated previously, you can't write an academic paper that just says "there is a correlation between X and Y, therefore X causes Y" or something similar. Correlation is more of a first step in examining the data: it's a descriptive statistic that helps you see how variables might be related to each other.
Determining cause and effect is actually quite difficult. Linear regression can indirectly help with this, however, and we will discuss it in depth over the next few weeks.
cor.test(cars_example$hp,cars_example$wt,method = "pearson")
##
## Pearson's product-moment correlation
##
## data: cars_example$hp and cars_example$wt
## t = 4.7957, df = 30, p-value = 4.146e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4025113 0.8192573
## sample estimates:
## cor
## 0.6587479
# roughly speaking, we are 95% confident that the true correlation is between .4025 and .8192
# if the confidence interval includes 0, then the true correlation
# could be zero (i.e. the variables may not be correlated at all)
We see more information when running cor.test (Pearson's product-moment correlation test): the correlation is highly significant (very unlikely to be due to chance). Also, the confidence interval does not "cross zero", which means the result is reliably positive.
Problem: Pearson's correlation is only strictly valid if the original variables are (approximately) normally distributed and the relationship between them is linear.
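A quick way to eyeball the linearity assumption (a minimal sketch using base R plotting) is a simple scatterplot of the two variables:
# does the hp/wt relationship look roughly linear rather than curved?
plot(cars_example$hp,cars_example$wt,xlab="hp",ylab="wt")
For normality, histograms with density overlays give a quick visual check: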
hist(cars_example$hp,probability = T)
lines(density(cars_example$hp),col="red",lwd=3)
hist(cars_example$wt,probability = T)
lines(density(cars_example$wt),col="red",lwd=3)
Immediately we can tell that HP is not normally distributed (it is skewed to the right).
We have a few options here. We can either transform the variables (as we discussed last week) or we can use a different correlation method – one that doesn’t involve parametric assumptions like normality.
shapiro.test(cars_example$hp)
##
## Shapiro-Wilk normality test
##
## data: cars_example$hp
## W = 0.93342, p-value = 0.04881
(p < 0.05, so HP is probably not normally distributed)
shapiro.test(cars_example$wt)
##
## Shapiro-Wilk normality test
##
## data: cars_example$wt
## W = 0.94326, p-value = 0.09265
(p > 0.05, so WT is statistically indistinguishable from normal)
Let's create a new variable using "MUTATE", which transforms HP via the log transformation:
cars_example<-cars_example %>% mutate(HP_LOG_TRANSFORM=log(hp))
head(cars_example)
## hp wt HP_LOG_TRANSFORM
## Maserati Bora 335 3.570 5.814131
## Ford Pantera L 264 3.170 5.575949
## Duster 360 245 3.570 5.501258
## Camaro Z28 245 3.840 5.501258
## Chrysler Imperial 230 5.345 5.438079
## Lincoln Continental 215 5.424 5.370638
shapiro.test(cars_example$HP_LOG_TRANSFORM)
##
## Shapiro-Wilk normality test
##
## data: cars_example$HP_LOG_TRANSFORM
## W = 0.97026, p-value = 0.5065
hist(cars_example$HP_LOG_TRANSFORM,probability = T)
lines(density(cars_example$HP_LOG_TRANSFORM),col="red",lwd=3)
The log-transformed HP now looks much closer to normal (the Shapiro-Wilk test no longer rejects normality). We can then run cor.test on the transformed variable:
cor.test(cars_example$HP_LOG_TRANSFORM,cars_example$wt)
##
## Pearson's product-moment correlation
##
## data: cars_example$HP_LOG_TRANSFORM and cars_example$wt
## t = 5.6149, df = 30, p-value = 4.108e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4892538 0.8518867
## sample estimates:
## cor
## 0.7158277
You'll notice that the correlation has increased somewhat and the confidence interval has narrowed: the estimate is more precise now that HP has been brought closer to a normal distribution.
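If you want to see the narrowing directly, you can pull the confidence intervals out of the two test objects (a small sketch; conf.int is a standard component of the object returned by cor.test):
ci_raw<-cor.test(cars_example$hp,cars_example$wt)$conf.int
ci_log<-cor.test(cars_example$HP_LOG_TRANSFORM,cars_example$wt)$conf.int
diff(ci_raw) # width of the interval for raw hp, about 0.42
diff(ci_log) # width of the interval for log-transformed hp, about 0.36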
OR we can try a non-parametric correlation test, such as SPEARMAN'S RHO correlation:
cor.test(cars_example$hp,cars_example$wt,method="spearman")
## Warning in cor.test.default(cars_example$hp, cars_example$wt, method =
## "spearman"): Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: cars_example$hp and cars_example$wt
## S = 1229.4, p-value = 1.954e-07
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.7746767
Notice that the SPEARMAN’s RHO result and the LOG TRANSFORMED result are (somewhat) similar.
SPEARMAN'S RHO gives us a warning if there are "ties" (repeated values) in the data (if one car has the same HP as another, etc.)
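You can check for ties yourself with a quick frequency table (a small sketch using base R):
# which hp values are shared by more than one car?
hp_counts<-table(cars_example$hp)
hp_counts[hp_counts>1]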
Another method is KENDALL'S TAU (which copes better with tied ranks):
cor.test(cars_example$hp,cars_example$wt,method="kendall")
## Warning in cor.test.default(cars_example$hp, cars_example$wt, method =
## "kendall"): Cannot compute exact p-value with ties
##
## Kendall's rank correlation tau
##
## data: cars_example$hp and cars_example$wt
## z = 4.845, p-value = 1.266e-06
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## 0.6113081
We still get this warning, but Kendall's tau is somewhat better suited to data with ties. Even then, the correlation is still both significant and positive, so no harm done if we choose the "wrong" statistical method; the answers are still mostly the same.
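As a quick recap, here are the three estimates side by side (a small sketch; the numbers simply repeat the results computed above):
c(pearson_log=cor(cars_example$HP_LOG_TRANSFORM,cars_example$wt),
  spearman=cor(cars_example$hp,cars_example$wt,method="spearman"),
  kendall=cor(cars_example$hp,cars_example$wt,method="kendall"))
# roughly 0.72, 0.77 and 0.61 respectively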
Correlation analysis allows us to see different relationships within the data.
This is useful when initially looking at the data, running descriptive statistics and trying to understand your data. Correlation analysis should be part of the “descriptive statistics” that you run at the beginning of every project.
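One possible way to fold this into your descriptive statistics (a sketch, not a required step) is to print a rounded correlation matrix up front so it is easier to scan:
# rounded to two decimals for readability
round(cor(mtcars),2)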
This is neat, but don't be fooled: there are some major exceptions.
The most obvious exception is attempting to correlate measurements that are neither continuous nor ordinal (i.e., unordered categories).
For example, we want to see if there is a correlation between having red hair and liking Strawberry ice cream. We asked 10 students for their favorite ice cream, and we measured their hair color (RED or NOT RED):
flavor1<-c("Chocolate","Chocolate","Vanilla","Strawberry","Strawberry","Vanilla","Chocolate","Chocolate","Strawberry","Chocolate")
flavor_id<-c(1,1,3,2,2,3,1,1,2,1)
student_id<-c(1:10)
red_hair<-c(0,1,0,0,0,1,1,0,0,0)
test_frame<-data.frame(flavor_id,flavor1,student_id,red_hair)
test_frame
## flavor_id flavor1 student_id red_hair
## 1 1 Chocolate 1 0
## 2 1 Chocolate 2 1
## 3 3 Vanilla 3 0
## 4 2 Strawberry 4 0
## 5 2 Strawberry 5 0
## 6 3 Vanilla 6 1
## 7 1 Chocolate 7 1
## 8 1 Chocolate 8 0
## 9 2 Strawberry 9 0
## 10 1 Chocolate 10 0
We see that:
Red hair = 1, NOT red hair = 0
BUT
Chocolate = 1, Strawberry = 2, Vanilla = 3
So, if we try to find correlations between them:
cor.test(test_frame$flavor_id,test_frame$red_hair,method="kendall")
## Warning in cor.test.default(test_frame$flavor_id, test_frame$red_hair, method =
## "kendall"): Cannot compute exact p-value with ties
##
## Kendall's rank correlation tau
##
## data: test_frame$flavor_id and test_frame$red_hair
## z = -0.24744, p-value = 0.8046
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## -0.07838618
Don't be fooled: this result is meaningless, because it doesn't make sense to "score" ice cream flavors as "1, 2, 3" for Chocolate, Strawberry and Vanilla. Is Vanilla "better" than Strawberry? Is Chocolate "worse" than Strawberry?
For this kind of analysis to work, we have to convert "strawberry" into a binary variable (like red hair). Observe:
test_frame<-test_frame %>% mutate(STRAWBERRY_VARIABLE=ifelse(flavor_id==2,1,0))
test_frame
## flavor_id flavor1 student_id red_hair STRAWBERRY_VARIABLE
## 1 1 Chocolate 1 0 0
## 2 1 Chocolate 2 1 0
## 3 3 Vanilla 3 0 0
## 4 2 Strawberry 4 0 1
## 5 2 Strawberry 5 0 1
## 6 3 Vanilla 6 1 0
## 7 1 Chocolate 7 1 0
## 8 1 Chocolate 8 0 0
## 9 2 Strawberry 9 0 1
## 10 1 Chocolate 10 0 0
Now the correlation analysis makes more sense. We are measuring a binary version of red hair (0 for no red hair, 1 for red hair) against a binary version of strawberry preference (0 for not strawberry, 1 for strawberry).
cor.test(test_frame$STRAWBERRY_VARIABLE,test_frame$red_hair,method="kendall")
## Warning in cor.test.default(test_frame$STRAWBERRY_VARIABLE,
## test_frame$red_hair, : Cannot compute exact p-value with ties
##
## Kendall's rank correlation tau
##
## data: test_frame$STRAWBERRY_VARIABLE and test_frame$red_hair
## z = -1.2857, p-value = 0.1985
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## -0.4285714
We fixed the analysis design, and the sample now shows a negative correlation between liking strawberry ice cream and having red hair. BUT WAIT, look: the p-value is much greater than 0.05. This means that while the estimated correlation is negative, the result is not statistically significant.
We could report this as “There is no significant relationship between red hair color and preference for strawberry ice cream”. No fancy graph is needed – the sentence alone will work.