CORRELATION

Correlation (as we will be using it, the Pearson Correlation) is a valuable tool for analyzing data. Perhaps ironically, it is rarely reported in academic work (usually in favor of some type of regression), but it is useful when investigating data … as long as you are aware of the dangers. Note: the data have to be normally distributed, because the p-value that comes with a Pearson correlation is calculated from parametric quantities (means and standard deviations).

The idea behind correlation is that as the value of one measurement varies, so does the value of another measurement:

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
cars_example<-mtcars %>% select(hp,wt) %>% arrange(-hp,wt)
#the minus sign in front of hp sorts hp in descending order; without the minus sign, it's ascending

cars_example
##                      hp    wt
## Maserati Bora       335 3.570
## Ford Pantera L      264 3.170
## Duster 360          245 3.570
## Camaro Z28          245 3.840
## Chrysler Imperial   230 5.345
## Lincoln Continental 215 5.424
## Cadillac Fleetwood  205 5.250
## Merc 450SL          180 3.730
## Merc 450SLC         180 3.780
## Merc 450SE          180 4.070
## Ferrari Dino        175 2.770
## Hornet Sportabout   175 3.440
## Pontiac Firebird    175 3.845
## AMC Javelin         150 3.435
## Dodge Challenger    150 3.520
## Merc 280            123 3.440
## Merc 280C           123 3.440
## Lotus Europa        113 1.513
## Mazda RX4           110 2.620
## Mazda RX4 Wag       110 2.875
## Hornet 4 Drive      110 3.215
## Volvo 142E          109 2.780
## Valiant             105 3.460
## Toyota Corona        97 2.465
## Merc 230             95 3.150
## Datsun 710           93 2.320
## Porsche 914-2        91 2.140
## Fiat X1-9            66 1.935
## Fiat 128             66 2.200
## Toyota Corolla       65 1.835
## Merc 240D            62 3.190
## Honda Civic          52 1.615

The relationship doesn’t have to be perfect (and in this case it isn’t): as HP goes up, weight tends to go up. There are counterexamples (the Chrysler Imperial, for example), but the relationship generally holds true.

The essence of correlation is “relationships”.

cor(cars_example)
##           hp        wt
## hp 1.0000000 0.6587479
## wt 0.6587479 1.0000000
#this works for this data frame because the variables are numeric

We see that the correlation between “HP” and “HP” is 1 (that is, 100%), because they are the same variable. The correlation between HP and WT is 0.66 (you can call this “66%” in class, although the “percent” terminology is not used in the literature).

What does this mean? There is some positive relationship between HP and WT (as one increases, the other tends to increase as well).
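
To build intuition, here is a tiny simulated example (made-up data, not part of mtcars): one variable that rises with x and one that falls with x, so cor() returns a clearly positive and a clearly negative value.

set.seed(1)
x <- 1:50
y_up   <- x + rnorm(50, sd = 5)    #noisy but generally increasing with x
y_down <- -x + rnorm(50, sd = 5)   #noisy but generally decreasing with x
cor(x, y_up)    #close to +1 (positive relationship)
cor(x, y_down)  #close to -1 (negative relationship)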

Can we run a correlation analysis on the whole dataset?

cor(mtcars)
##             mpg        cyl       disp         hp        drat         wt
## mpg   1.0000000 -0.8521620 -0.8475514 -0.7761684  0.68117191 -0.8676594
## cyl  -0.8521620  1.0000000  0.9020329  0.8324475 -0.69993811  0.7824958
## disp -0.8475514  0.9020329  1.0000000  0.7909486 -0.71021393  0.8879799
## hp   -0.7761684  0.8324475  0.7909486  1.0000000 -0.44875912  0.6587479
## drat  0.6811719 -0.6999381 -0.7102139 -0.4487591  1.00000000 -0.7124406
## wt   -0.8676594  0.7824958  0.8879799  0.6587479 -0.71244065  1.0000000
## qsec  0.4186840 -0.5912421 -0.4336979 -0.7082234  0.09120476 -0.1747159
## vs    0.6640389 -0.8108118 -0.7104159 -0.7230967  0.44027846 -0.5549157
## am    0.5998324 -0.5226070 -0.5912270 -0.2432043  0.71271113 -0.6924953
## gear  0.4802848 -0.4926866 -0.5555692 -0.1257043  0.69961013 -0.5832870
## carb -0.5509251  0.5269883  0.3949769  0.7498125 -0.09078980  0.4276059
##             qsec         vs          am       gear        carb
## mpg   0.41868403  0.6640389  0.59983243  0.4802848 -0.55092507
## cyl  -0.59124207 -0.8108118 -0.52260705 -0.4926866  0.52698829
## disp -0.43369788 -0.7104159 -0.59122704 -0.5555692  0.39497686
## hp   -0.70822339 -0.7230967 -0.24320426 -0.1257043  0.74981247
## drat  0.09120476  0.4402785  0.71271113  0.6996101 -0.09078980
## wt   -0.17471588 -0.5549157 -0.69249526 -0.5832870  0.42760594
## qsec  1.00000000  0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs    0.74453544  1.0000000  0.16834512  0.2060233 -0.56960714
## am   -0.22986086  0.1683451  1.00000000  0.7940588  0.05753435
## gear -0.21268223  0.2060233  0.79405876  1.0000000  0.27407284
## carb -0.65624923 -0.5696071  0.05753435  0.2740728  1.00000000

This is an interesting way to see if there are relationships between variables in a dataframe, but it is just a starting point!
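
With 11 variables the matrix is hard to scan by eye. One possible way to pull out only the strong relationships (a sketch, not part of the original notes; the 0.8 cutoff is arbitrary):

cor_m <- cor(mtcars)
cor_m[upper.tri(cor_m, diag = TRUE)] <- NA   #keep each pair of variables only once
strong <- which(abs(cor_m) > 0.8, arr.ind = TRUE)
data.frame(var1 = rownames(cor_m)[strong[, 1]],
           var2 = colnames(cor_m)[strong[, 2]],
           r    = cor_m[strong])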

We can also create a scatterplot with some of these variables:

pairs(~mpg+cyl+disp+hp+drat+wt,data=mtcars)

To answer the earlier question: yes, cor() works on a whole data frame as long as all the variables are numeric. If some are not, you’ll have to “SELECT” the numeric ones into a new dataset first.
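
For example (a sketch using the built-in iris data, which is not part of these notes): iris has a text column (Species), so cor(iris) would fail until we select only the numeric columns.

iris_numeric <- iris %>% select(where(is.numeric))  #drop the Species factor column
cor(iris_numeric)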

Back to the pairs plot: what do we see here? Any observations?

Notice: CYL and DISP are very strongly correlated in the positive direction (as CYL goes up, DISP goes up)

WT and MPG are strongly (and negatively) correlated (as WT goes up, MPG goes down and vice-versa)

What is the catch?

Correlations are good for discovering relationships in the data – but they don’t necessarily “explain” the data, so to speak. “Explaining” data ultimately requires human judgment.

Example: we know there is a relationship between CYL and DISP. As one goes up, the other goes up too. But does CYL cause DISP to go up? Or does DISP cause CYL to go up?

You can’t tell this just from the data – but from knowledge of how engines work, we know DISP (engine displacement) gets bigger because CYL (number of cylinders) gets bigger.

The field of knowledge about cause-and-effect is called “CAUSALITY” and it is mostly beyond the scope of this course, except to say the following:

CORRELATION DOES NOT NECESSARILY EQUAL CAUSATION

This is ultimately why – as stated previously – you can’t write an academic paper that just says “there is a correlation between X and Y, therefore X causes Y” or something similar. Correlation is more of a first step in examining the data – it’s a descriptive statistic that helps you see how variables might be related to each other.

Determining cause and effect is actually quite difficult. Linear regression can indirectly help with this, however, and we will discuss it in depth over the next few weeks.
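
As a tiny preview (a sketch only – we will cover lm() properly in the coming weeks), a simple linear regression of weight on horsepower quantifies the same relationship the correlation hinted at:

hp_wt_model <- lm(wt ~ hp, data = mtcars)   #predict weight from horsepower
summary(hp_wt_model)                        #output not shown here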

cor.test(cars_example$hp,cars_example$wt,method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  cars_example$hp and cars_example$wt
## t = 4.7957, df = 30, p-value = 4.146e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4025113 0.8192573
## sample estimates:
##       cor 
## 0.6587479

Note: Pearson is the default method that cor.test uses in R. The confidence interval works like the margin of error on a poll: we are 95% confident that the true correlation lies between .402 and .819, and the point estimate is .658. The null hypothesis in a cor.test is that the correlation is 0.

We see more information when running a Pearson’s correlation test: the correlation is highly significant (very unlikely to be due to chance). Also, the confidence interval does not “cross zero”, which means we can be confident the true correlation is positive.

Problem: Pearson’s correlation is only valid if the original variables are normally distributed and the relationship between variables is linear.

hist(cars_example$hp,probability = T)
lines(density(cars_example$hp),col="red",lwd=3)

hist(cars_example$wt,probability = T)
lines(density(cars_example$wt),col="red",lwd=3)

Immediately we can tell that HP is not normally distributed. Our Pearson correlation of .66 – and especially its p-value – may therefore not be trustworthy!

We have a few options here. We can either transform the variables (as we discussed last week) or we can use a different correlation method – one that doesn’t involve parametric assumptions like normality.

shapiro.test(cars_example$hp)
## 
##  Shapiro-Wilk normality test
## 
## data:  cars_example$hp
## W = 0.93342, p-value = 0.04881

(HP is mathematically distinguishable from normal: the p-value is just below .05, so we reject the null hypothesis of normality.)
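
A second visual check that isn’t in the original notes but is worth knowing: a Q-Q plot compares the HP values to what a perfectly normal distribution would look like; points that drift away from the line suggest non-normality.

qqnorm(cars_example$hp)
qqline(cars_example$hp, col = "red", lwd = 3)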

shapiro.test(cars_example$wt)
## 
##  Shapiro-Wilk normality test
## 
## data:  cars_example$wt
## W = 0.94326, p-value = 0.09265

(WT is statistically indistinguishable from normal: the p-value is above .05, so we cannot reject normality.)

Let’s create a new variable using “MUTATE” which transforms HP via the Log-Transformation:

cars_example<-cars_example %>% mutate(HP_LOG_TRANSFORM=log(hp))
#this overwrites cars_example with a modified copy of itself -- use carefully!

head(cars_example)
##                      hp    wt HP_LOG_TRANSFORM
## Maserati Bora       335 3.570         5.814131
## Ford Pantera L      264 3.170         5.575949
## Duster 360          245 3.570         5.501258
## Camaro Z28          245 3.840         5.501258
## Chrysler Imperial   230 5.345         5.438079
## Lincoln Continental 215 5.424         5.370638
shapiro.test(cars_example$HP_LOG_TRANSFORM)
## 
##  Shapiro-Wilk normality test
## 
## data:  cars_example$HP_LOG_TRANSFORM
## W = 0.97026, p-value = 0.5065
hist(cars_example$HP_LOG_TRANSFORM,probability = T)
lines(density(cars_example$HP_LOG_TRANSFORM),col="red",lwd=3)

More normal. The p-value is way above .05. We can then run the cor.test on them:

cor.test(cars_example$HP_LOG_TRANSFORM,cars_example$wt)
## 
##  Pearson's product-moment correlation
## 
## data:  cars_example$HP_LOG_TRANSFORM and cars_example$wt
## t = 5.6149, df = 30, p-value = 4.108e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4892538 0.8518867
## sample estimates:
##       cor 
## 0.7158277

The p-value is even smaller! Normalizing the data does wonders.

You’ll notice that the correlation has increased somewhat and the confidence interval has narrowed – the estimate is more precise now that HP has been transformed toward normality.

OR we can try a non-parametric correlation test, such as Spearman’s RHO correlation (it is based on ranks rather than standard deviations, so it doesn’t require normality).

cor.test(cars_example$hp,cars_example$wt,method="spearman")
## Warning in cor.test.default(cars_example$hp, cars_example$wt, method =
## "spearman"): Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  cars_example$hp and cars_example$wt
## S = 1229.4, p-value = 1.954e-07
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.7746767

Rho is Spearman’s correlation coefficient – interpret it the same way you would interpret Pearson’s r.
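
A useful way to see why normality isn’t required (an illustrative sketch): Spearman’s rho is simply Pearson’s correlation computed on the ranks of the data.

cor(rank(cars_example$hp), rank(cars_example$wt))           #Pearson on ranks...
cor(cars_example$hp, cars_example$wt, method = "spearman")  #...matches the built-in Spearman (0.7746767)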

Notice that the SPEARMAN’s RHO result and the LOG TRANSFORMED result are (somewhat) similar.

SPEARMAN’s RHO gives us a warning (not an error) if there are “ties” (repeated values) in the data (if one car has the same HP as another, etc.), because the exact p-value cannot be computed.
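
If the ties warning bothers you, you can ask cor.test for the approximate p-value explicitly (a small sketch; exact = FALSE is a standard cor.test argument). The rho estimate is the same, and the warning goes away:

cor.test(cars_example$hp, cars_example$wt, method = "spearman", exact = FALSE)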

Another method is KENDALL’S TAU, which handles ties in the ranks (this is Erik’s favorite – you can use it on data that isn’t normal AND whose relationship isn’t strictly linear).

cor.test(cars_example$hp,cars_example$wt,method="kendall")
## Warning in cor.test.default(cars_example$hp, cars_example$wt, method =
## "kendall"): Cannot compute exact p-value with ties
## 
##  Kendall's rank correlation tau
## 
## data:  cars_example$hp and cars_example$wt
## z = 4.845, p-value = 1.266e-06
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##       tau 
## 0.6113081

If you use cor.test in your paper, specify which method you used. Erik feels this is more accurate than the methods we used previously because Kendall’s Tau makes fewer assumptions.

We still get the ties warning, but this result is somewhat more accurate. Even so, the correlation is still both significant and positive, so no harm is done if we choose the “wrong” statistical method – the answers are largely the same.

WHAT DOES ALL THIS MEAN?

Correlation analysis allows us to see different relationships within the data.

This is useful when initially looking at the data, running descriptive statistics and trying to understand your data. Correlation analysis should be part of the “descriptive statistics” that you run at the beginning of every project.

Don’t run a correlation test on thousands of variables – run correlations on variables you think may be related to each other.

EXCEPTIONS

This is neat, but don’t be fooled: there are some major exceptions.

The most obvious exception is attempting to correlate measurements that are neither continuous nor ordinal – that is, purely categorical measurements.

For example, we want to see if there is a correlation between having red hair and liking Strawberry ice cream. We asked 10 students for their favorite ice cream, and we measured their hair color (RED or NOT RED):

flavor1<-c("Chocolate","Chocolate","Vanilla","Strawberry","Strawberry","Vanilla","Chocolate","Chocolate","Strawberry","Chocolate")
flavor_id<-c(1,1,3,2,2,3,1,1,2,1)
student_id<-c(1:10)
red_hair<-c(0,1,0,0,0,1,1,0,0,0)

test_frame<-data.frame(flavor_id,flavor1,student_id,red_hair)

test_frame
##    flavor_id    flavor1 student_id red_hair
## 1          1  Chocolate          1        0
## 2          1  Chocolate          2        1
## 3          3    Vanilla          3        0
## 4          2 Strawberry          4        0
## 5          2 Strawberry          5        0
## 6          3    Vanilla          6        1
## 7          1  Chocolate          7        1
## 8          1  Chocolate          8        0
## 9          2 Strawberry          9        0
## 10         1  Chocolate         10        0

We see that

Red hair = 1, NOT red hair = 0

BUT

Chocolate = 1, Strawberry = 2, Vanilla = 3

So, if we try to find correlations between them:

cor.test(test_frame$flavor_id,test_frame$red_hair,method="pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  test_frame$flavor_id and test_frame$red_hair
## t = -0.079057, df = 8, p-value = 0.9389
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.6461985  0.6124606
## sample estimates:
##         cor 
## -0.02793994

Don’t be fooled – this result is meaningless, because it doesn’t make sense to “score” ice cream as 1, 2, 3 for Chocolate, Strawberry and Vanilla. Is vanilla “better” than strawberry? Is chocolate “worse” than strawberry? Also, the red hair variable is not continuous.

For this kind of analysis to work, we have to convert “strawberry” into a binary variable (like red hair). Observe:

test_frame<-test_frame %>% mutate(STRAWBERRY_VARIABLE=ifelse(flavor_id==2,1,0))
test_frame
##    flavor_id    flavor1 student_id red_hair STRAWBERRY_VARIABLE
## 1          1  Chocolate          1        0                   0
## 2          1  Chocolate          2        1                   0
## 3          3    Vanilla          3        0                   0
## 4          2 Strawberry          4        0                   1
## 5          2 Strawberry          5        0                   1
## 6          3    Vanilla          6        1                   0
## 7          1  Chocolate          7        1                   0
## 8          1  Chocolate          8        0                   0
## 9          2 Strawberry          9        0                   1
## 10         1  Chocolate         10        0                   0

Now the correlation analysis makes more sense. We are measuring a binary version of red hair (0 for no red hair, 1 for red hair) against a binary version of strawberry preference (0 for not strawberry, 1 for strawberry).

cor.test(test_frame$STRAWBERRY_VARIABLE,test_frame$red_hair,method="kendall")
## Warning in cor.test.default(test_frame$STRAWBERRY_VARIABLE,
## test_frame$red_hair, : Cannot compute exact p-value with ties
## 
##  Kendall's rank correlation tau
## 
## data:  test_frame$STRAWBERRY_VARIABLE and test_frame$red_hair
## z = -1.2857, p-value = 0.1985
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##        tau 
## -0.4285714

Kendall’s Tau is sometimes called a “matching” (rank-agreement) correlation because it counts how often pairs of observations are ordered the same way on both variables versus the opposite way. The null hypothesis is that there is no relationship between red hair and strawberry ice cream. The p-value is high, so we can’t reject the null hypothesis.
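
If you want to see the “matching” idea in action, here is a rough sketch (not from the original notes) that counts concordant and discordant pairs by hand. This gives tau-a, which ignores ties, so it comes out to -0.2 here rather than the -0.43 that cor.test reports (cor.test uses tau-b, which corrects for ties).

x <- test_frame$STRAWBERRY_VARIABLE
y <- test_frame$red_hair
idx <- combn(length(x), 2)                         #every pair of students
agree <- sign(x[idx[1, ]] - x[idx[2, ]]) * sign(y[idx[1, ]] - y[idx[2, ]])
concordant <- sum(agree > 0)                       #pairs ordered the same way on both variables
discordant <- sum(agree < 0)                       #pairs ordered in opposite ways
(concordant - discordant) / choose(length(x), 2)   #tau-a = -0.2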

We fixed the analysis design, and the estimated correlation between liking strawberry ice cream and having red hair is negative. BUT WAIT, look: the p-value is much greater than 0.05. This means that while the estimate is negative, the result is not statistically significant.

We could report this as “There is no significant relationship between red hair color and preference for strawberry ice cream”. No fancy graph is needed – the sentence alone will work.

SUMMARY OF CORRELATIONS

  1. Pearson’s - continuous data, linear relationship, data is normal (or transformed to normal)
  2. Spearman’s RHO - Non-normal distribution
  3. Kendall’s TAU - data has ties, non-normal, small sample sizes
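
As a quick sanity check (a sketch, not part of the original notes), you can run all three methods on the same pair of variables and confirm they tell the same basic story for HP and WT:

sapply(c("pearson", "spearman", "kendall"),
       function(m) cor(cars_example$hp, cars_example$wt, method = m))
#the three estimates (0.66, 0.77, 0.61) differ in size but all point in the same direction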

HOMEWORK

  1. Use your approved data-set
  2. Select some variables of interest and see if there are any obvious correlations using the COR command [more than 1, less than 5]. If I wanted to use the ownership type (nonprofit/government/private), I would need to create three separate columns with 0s and 1s for each separate category, like we did with the ice cream example (see the dummy-variable sketch after this list). Question for Erik – can I combine types into categories, such as lumping all the different kinds of nonprofits into one?
  3. Examine the same variables visually using the PAIRS command
  4. Select two variables that seem correlated (positively or negatively) and examine them using PEARSON, SPEARMAN or KENDALL (depending on which is more appropriate) [use the summary of correlations above to justify the explanation]
  5. Explain your findings and justify your choice in selecting the correlation method
  6. publish to Rpubs and submit via CANVAS
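
For homework item 2, here is a hypothetical sketch of the dummy-variable step (the data frame name my_data and the column ownership_type are made up – substitute your own): each ownership category becomes its own 0/1 column, just like the strawberry example.

my_data <- my_data %>%
  mutate(IS_NONPROFIT  = ifelse(ownership_type == "nonprofit",  1, 0),  #hypothetical column names
         IS_GOVERNMENT = ifelse(ownership_type == "government", 1, 0),
         IS_PRIVATE    = ifelse(ownership_type == "private",    1, 0))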

Week 7 In-Class Exercises

faith<-faithful
hist(faith$waiting)

hist(faith$eruptions)

boxplot(faith$waiting)

plot(faith$eruptions, faith$waiting)

cor(faith$waiting,faith$eruptions)
## [1] 0.9008112
 cor.test(faith$eruptions, faith$waiting, method="kendall")
## 
##  Kendall's rank correlation tau
## 
## data:  faith$eruptions and faith$waiting
## z = 13.902, p-value < 2.2e-16
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##       tau 
## 0.5747674

What a tiny p-value! (2.2e-16 is the smallest value R will report here – the p-value is essentially zero.) We can reject the null hypothesis!

hist(log(faith$eruptions))
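
As a follow-up check (a sketch; output not shown): the eruptions variable is bimodal (two clusters of short and long eruptions), so even a log transform is unlikely to make it normal – which is another reason the rank-based Kendall test above was a reasonable choice.

shapiro.test(log(faith$eruptions))   #expect a very small p-value despite the transform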