The idea of correlation was discussed in the lecture and the textbook. This is the idea that two things that we measure can be somehow related to one another.
Some of the relationships we can measure are meaningful and might reflect a causal relationship where one thing causes a change in another. Some of the relationships are spurious and do not reflect a causal relationship.
In this lab, you will learn how to compute correlations between two variables in software and then ask questions about the correlations you observe.
The goals for this lab are the following:
We use data from the World Happiness Report. They just released the 10th Anniversary Edition of their report for data for 2022.
The 2018 and 2022 data sets from Happiness Report are already available to you in this lab’s folder on Cocalc.com.
R has the cor function for computing correlation
coefficient r (Pearson’s coefficient) between any two
variables. This same function computes other correlation versions, but
we’ll skip those here. To use the function, you need two variables with
numbers like these:
x <- c(5,3,2,5,4,6,4,8,7)
y <- c(6,5,6,7,8,9,8,10,12)
cor(x,y)
## [1] 0.8022222
Create two lists of 5 numbers and ask R to calculate the correlation. Use variables w and z to represent your lists.
Enter your code below:
You can use the runif() function to sample random
numbers from a uniform distribution between a minimum and a maximum
value.
In the example below, sample 15 (n=15) random numbers between the range 8 (min = 8) and 30 (max=30). Every time you run this code, the 15 values in x will be re-sampled, and 15 new random numbers.
x <- runif(n=15, min=8, max=30)
x
## [1] 11.30572 20.02493 19.67985 10.28725 27.07067 27.45102 23.40553 17.73924
## [9] 24.71006 25.60447 17.05964 13.34574 14.60198 14.86911 28.70322
You can compute the correlation between two sets of random numbers by
first sampling random numbers into each variable and then running the
cor() function.
x <- runif(n=15, min=8, max=30)
y <- runif(n=15, min=8, max=30)
cor(x,y)
## [1] 0.1561992
Running the above code will give different values for the correlation each time because the numbers in x and y are always randomly different.
As you learned from the textbook, we can find correlations by chance alone, even when there is no actual correlation between the variables. For example, if we sampled randomly into x and then sampled some numbers randomly into y. We know they aren’t related because we randomly sampled the numbers. However, doing this creates some correlations by chance.
Let’s explore the idea that correlations between two measures can arise by chance alone by generating a sample of 50 correlations and then estimate the mean of these correlations generated by chance. We should expect that the mean of these correlations should be zero because they rise from randomly selected x and y distributions.
One way to estimate the mean of the correlations is to repeat the above code many times. For example, if you run the above code 50 times, you could save the correlations each time, then compute the mean. This would estimate the mean of correlations that can be produced by chance. How can you repeat the above code many times to solve this problem?
We can do this using a for loop. The code below shows
how to repeat everything inside the for loop 50 times. The variable
i is an index that goes from 1 to 50. The
saved_value variable starts as an empty variable, and then
we put a value into it (at index position i, from 1 to 50). In this
code, we put the correlations of x and y into the
saved_value variable. At the end of the simulation, the
save_value variable contains 50 numbers. The
mean() function is used to find the mean value for the 50
simulations.
Doing this will allow you to run the simulation 50 times and find the mean correlation that arises by chance.
saved_value <- c() #make an empty variable
for (i in 1:50){
x <- runif(n=15, min=8, max=30)
y <- runif(n=15, min=8, max=30)
saved_value[i] <- cor(x,y)
}
saved_value
## [1] 0.197354529 0.314166022 -0.237419317 0.193519479 0.110485387
## [6] 0.099368396 -0.224797280 -0.727794282 0.022174253 0.364544551
## [11] 0.176770579 0.063885830 0.467487926 0.261369272 0.128532368
## [16] 0.269863577 0.234850039 0.167752832 -0.281779922 -0.242647863
## [21] 0.058138588 0.249817708 0.370316476 0.088809782 -0.324128431
## [26] -0.120780235 0.096036104 -0.348387605 -0.356161642 -0.166270355
## [31] -0.127472605 0.376108854 0.435273519 -0.122272093 0.432920882
## [36] -0.004847941 -0.003145139 -0.528588245 0.252776704 0.261889997
## [41] -0.015062336 0.223625040 0.099466188 0.056224865 0.273408441
## [46] -0.154305963 -0.481217924 0.378886779 0.357404275 0.195712990
mean(saved_value)
## [1] 0.05623726
You can see that the mean is not always zero. Sometimes it looks like there is a correlation when we know there shouldn’t be. You can keep re-doing these simulations by re-building (re-knitting) your R Markdown document. You simulate the outcomes as often as you press the “Build” button.
Create two lists of 100 numbers and use R to calculate the standard deviation of the correlations.
Enter your code below:
Let’s take a look at some correlations in real data. We are going to look at responses to a questionnaire about happiness that was sent around the world from the world happiness report
We load the data into a data frame. Reminder, the following assumes that you have downloaded the data.
library(data.table)
whr18_data <- fread('WHR2018.csv')
If the Internet is enabled in your project, you can also load the data using the following URL
library(data.table)
whr_data <- fread("https://raw.githubusercontent.com/CrumpLab/statisticsLab/master/data/WHR2018.csv")
dim(whr18_data)
## [1] 1562 19
colnames(whr18_data)
## [1] "country"
## [2] "year"
## [3] "Life Ladder"
## [4] "Log GDP per capita"
## [5] "Social support"
## [6] "Healthy life expectancy at birth"
## [7] "Freedom to make life choices"
## [8] "Generosity"
## [9] "Perceptions of corruption"
## [10] "Positive affect"
## [11] "Negative affect"
## [12] "Confidence in national government"
## [13] "Democratic Quality"
## [14] "Delivery Quality"
## [15] "Standard deviation of ladder by country-year"
## [16] "Standard deviation/Mean of ladder by country-year"
## [17] "GINI index (World Bank estimate)"
## [18] "GINI index (World Bank estimate), average 2000-15"
## [19] "gini of household income reported in Gallup, by wp5-year"
head(whr18_data)
## country year Life Ladder Log GDP per capita Social support
## 1: Afghanistan 2008 3.723590 7.168690 0.4506623
## 2: Afghanistan 2009 4.401778 7.333790 0.5523084
## 3: Afghanistan 2010 4.758381 7.386629 0.5390752
## 4: Afghanistan 2011 3.831719 7.415019 0.5211036
## 5: Afghanistan 2012 3.782938 7.517126 0.5206367
## 6: Afghanistan 2013 3.572100 7.503376 0.4835519
## Healthy life expectancy at birth Freedom to make life choices Generosity
## 1: 49.20966 0.7181143 0.18181947
## 2: 49.62443 0.6788964 0.20361446
## 3: 50.00896 0.6001272 0.13763019
## 4: 50.36730 0.4959014 0.17532922
## 5: 50.70926 0.5309350 0.24715924
## 6: 51.04298 0.5779554 0.07473493
## Perceptions of corruption Positive affect Negative affect
## 1: 0.8816863 0.5176372 0.2581955
## 2: 0.8500354 0.5839256 0.2370924
## 3: 0.7067661 0.6182654 0.2753238
## 4: 0.7311085 0.6113873 0.2671747
## 5: 0.7756198 0.7103847 0.2679191
## 6: 0.8232041 0.6205848 0.2733281
## Confidence in national government Democratic Quality Delivery Quality
## 1: 0.6120721 -1.929690 -1.655084
## 2: 0.6115452 -2.044093 -1.635025
## 3: 0.2993574 -1.991810 -1.617176
## 4: 0.3073857 -1.919018 -1.616221
## 5: 0.4354402 -1.842996 -1.404078
## 6: 0.4828473 -1.879709 -1.403036
## Standard deviation of ladder by country-year
## 1: 1.774662
## 2: 1.722688
## 3: 1.878622
## 4: 1.785360
## 5: 1.798283
## 6: 1.223690
## Standard deviation/Mean of ladder by country-year
## 1: 0.4765997
## 2: 0.3913617
## 3: 0.3948027
## 4: 0.4659422
## 5: 0.4753669
## 6: 0.3425687
## GINI index (World Bank estimate)
## 1: NA
## 2: NA
## 3: NA
## 4: NA
## 5: NA
## 6: NA
## GINI index (World Bank estimate), average 2000-15
## 1: NA
## 2: NA
## 3: NA
## 4: NA
## 5: NA
## 6: NA
## gini of household income reported in Gallup, by wp5-year
## 1: NA
## 2: 0.4419058
## 3: 0.3273182
## 4: 0.3367642
## 5: 0.3445396
## 6: 0.3043685
You should see that there is data for many different countries across a few different years. There are many different kinds of measures, each given a name. I’ll show you some examples of asking about correlations with this data; then, you get to ask and answer your questions.
For the year 2017 only, does a country’s measure for “freedom to make life choices” correlate with the country’s measure for “Confidence in national government”?
Let’s find out. We calculate the correlation:
x <- whr18_data$`Freedom to make life choices`
y <- whr18_data$`Confidence in national government`
cor(x,y)
## [1] NA
What happened here? The correlation was NA (meaning
undefined).
Let’s look at the first 100 values of the variable
x.
head(x,100)
## [1] 0.7181143 0.6788964 0.6001272 0.4959014 0.5309350 0.5779554 0.5085140
## [8] 0.3889276 0.5225662 0.4270109 0.5286047 0.5252233 0.5689584 0.4874963
## [15] 0.6015121 0.6318303 0.7346484 0.7038507 0.7298189 0.7496110 0.5926958
## [22] 0.5295613 0.5866635 NA NA 0.4366705 0.5837018 0.4560286
## [29] 0.4095549 0.3745416 0.7330037 0.6528326 0.6782222 0.6366464 0.7302582
## [36] 0.8158020 0.7474984 0.7372504 0.7450578 0.8812237 0.8477022 0.8319662
## [43] 0.5201978 0.6054108 0.4621566 0.4414127 0.4592567 0.4645247 0.5018638
## [50] 0.5040818 0.5064871 0.5510266 0.6109869 0.6136971 0.9349733 0.8906820
## [57] 0.9157333 0.9320586 0.9445865 0.9351463 0.9333792 0.9229323 0.9218710
## [64] 0.9223157 0.9105502 0.9413823 0.8790693 0.8959798 0.9393556 0.9197039
## [71] 0.9217345 0.8850269 0.9003052 0.8885140 0.8900306 0.7715282 0.5220463
## [78] 0.6010433 0.4981384 0.5010711 0.5374843 0.5988591 0.6719570 0.7327729
## [85] 0.7642895 0.7125732 0.7310305 0.8959314 0.8620029 0.8698704 0.6818229
## [92] 0.8092059 NA 0.8495212 0.8886911 0.9058585 0.6116642 0.6045383
## [99] 0.6060122 0.6309311
The correlation was NA because there are some missing
data points. We can remove all the rows with missing data first, then do
the correlation.
We will do this in a couple of steps. First, create our own data
frame with only the numbers we want to analyze. We can select the
columns we want to keep using select. Then we use
filter to remove the rows with NAs.
library(dplyr)
smaller_df <- whr18_data %>%
select(country, year,
`Freedom to make life choices`,
`Confidence in national government`) %>%
filter(!is.na(`Freedom to make life choices`),
!is.na(`Confidence in national government`))
smaller_df <- smaller_df %>% filter(year=='2017')
sx <- smaller_df$`Freedom to make life choices`
sy <- smaller_df$`Confidence in national government`
r <- cor(sx,sy)
r
## [1] 0.4419082
Now we see the correlation is 0.4419082.
This shows that as Freedom to make life choices
increases in a country, that country’s confidence in its national
government also increases. This is a positive correlation.
Find the correlation for the newest data report (2022) from the World Happiness Report. Choose the variables of your interest.
Enter your code below:
The material used in this document contains excerpts and modifications from: