Today we learn to work with correlations. So far, we’ve focused on accessing data in a dataset, describing variables, and calculating summary statistics (mean, median, SD, etc.). We’ve also learned to create crosstabs using the table command. These steps are critical to any data analysis project. But once we have a grasp of the type of data we have and the distributions of different variables, we want to know how different variables relate to each other. Specifically, we want to ask questions and find their answers in the data.
The question of whether and how two or more variables are related could take multiple forms, but it’s all based on the same general idea: are changes in one variable related to changes in another variable? For continuous variables, change could mean two or more variables going up or down together. For mixed categorical-continuous variables, it could be whether a continuous variable has higher or lower means at different levels of a categorical variable. In either case we follow a similar process: (1) find the amount of change, (2) standardize it and convert it to a relevant test statistic, and (3) calculate the probability of finding such a test statistic, given our sample size.
Now there is a lot of variation in these three steps depending on the type of data or the goal of the analysis. But the basic premise is the same. And, in nearly all cases, we interpret the results of our analysis similarly: given certain assumptions, and assuming there is no relationship between our variables, what is the probability of finding a test statistic of a certain size (or larger) in a sample of a certain size?
Note: What we make of the relationship between variables depends on the nature of our data. Our interpretation depends on how the data was collected, its purpose, etc.
Today we’ll learn these concepts and apply them to correlations. We’ll learn how to calculate correlations in R, what to make of the results, and how to report them.
Correlation - basics
By now we know two things that are important when describing a distribution of data: central tendency (where most numbers fall, for example the mean) and variability (how much the numbers vary around that measure of central tendency, for example the standard deviation). To calculate the variance, we use the process below:
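In formula form, this is the usual sample variance, with the same n - 1 divisor you’ll see in the code that follows:
\[var_{x} = \frac{\sum\limits_{i=1}^{n}{(x_i-\overline{x})^2}}{n-1}\]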
Let’s look at an example:
sample.data<-c(1,2,2,4,4,5,4,9,6,4,3,1,6,8,3)
sample.data.mean<-mean(sample.data) # the mean of the sample
sample.data.deviation<-sample.data-sample.data.mean # deviation of each data point from the mean
sample.data.deviation.squared<-(sample.data.deviation)^2 # squared deviations
sum(sample.data.deviation.squared)/(length(sample.data)-1) # sum of squared deviations divided by n - 1
[1] 5.552381
All of the above can be summarized in one line:
var(sample.data)
[1] 5.552381
We also learned that taking the square root of the variance (sqrt(variance)) gives us the standard deviation (sd()), which is in the same units as our data. Now we learn about a very similar concept: covariance. As the name implies, it is related to variance, but the prefix co tells us that it’s about joint variance. Instead of telling us to what extent a variable deviates from its mean, covariance tells us to what extent the deviations of two variables from their means are related to each other. Let’s have a look at how it is calculated:
\[cov_{x,y} = \frac{\sum\limits_{i=1}^{n}{(x_i-\overline{x}) \cdot (y_i-\overline{y})} }{n-1}\]
As you can see, most of the steps are similar to calculating the variance, except that instead of squaring the deviations, we multiply the deviations of the two variables with one another. This is important: the two variables need to have the same number of data points (the same length) so we can calculate their covariance. If a value of one variable is missing, the deviation of the other variable cannot be multiplied by anything, so that pair has to be omitted. Let’s look at an example below:
x <- c(1,2,2,4,4,5,4,9,6,4,3,1,6,8,3)
y <- c(1,3,6,7,4,2,5,6,7,4,2,2,4,5,6)
x.mean <- mean(x)
y.mean <- mean(y)
x.deviation <- x - x.mean
y.deviation <- y - y.mean
sum(x.deviation * y.deviation) / (length(x)-1) # or length(y); it doesn't matter because they have the same length
[1] 2.17619
This can be summarized into one line:
cov(x,y)
[1] 2.17619
But here’s a question: what does the number 2.17 mean? If you recall, the problem with variance was that it came in squared units and was hard to interpret. With covariance, we have a similar problem. Although higher values mean the variables vary together more, it is hard to compare with the covariance between other pairs of variables. To compare, we need a common scale. Just as the standard deviation gave us a common measure to compare things, we can standardize the covariance to get a number that is more comparable across different sets of variables. How do we do this? We divide the covariance by the standard deviations of the two variables:
\[ r = \frac{cov_{x,y}}{SD(X) SD(Y)}\] \(r\), or the Pearson correlation coefficient, is a standardized covariance. It has the following properties: it is unitless, it always falls between -1 and +1, its sign tells us the direction of the relationship, and values closer to -1 or +1 indicate a stronger linear relationship (0 means no linear relationship).
When we standardize the covariance, we get:
cov(x,y)/(sd(x)*sd(y))
[1] 0.4749596
But this can also be written as:
cor(x,y)
[1] 0.4749596
So the two variables x and y are correlated at 0.47. But what does this mean? Looking at the plot below might give us a sense:
plot(x,y)
abline(a = 1, b = cor(x,y)) # abline draws a line with intercept a and slope b; here the slope is the correlation coefficient
abline(a = 2, b = cor(x,y)) # same slope, higher intercept
abline(a = 3, b = cor(x,y))
There seems to be a positive linear trend between x and y: they tend to increase together (when x deviates above its mean, y also tends to deviate above its mean). The lines we drew all have a slope equal to the correlation coefficient, 0.47, and they roughly follow the trend in the data.
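To make the link between slope and correlation exact (an optional aside): if we first standardize both variables, converting them to z-scores, the best-fitting straight line through the points has a slope exactly equal to r. A minimal sketch using base R’s scale() function:
plot(scale(x), scale(y)) # both variables now have mean 0 and SD 1
abline(a = 0, b = cor(x,y)) # line through the origin with slope r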
Pearson \(r\) is an estimate, so it has some error. Recall that for sample means, we used \(SE = SD/\sqrt{n}\) to get a sense of how good our estimate of the true population mean is. For Pearson \(r\) too, we can calculate a standard error, as seen below (n is the number of observations):
\[SE_r =\sqrt{\frac{1-r^2}{n-2}}\] Note: You do not need to memorize this. But it is important to see how different statistical methods follow a similar logic.
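To make this concrete, here is the same formula translated into R for our x and y from above (the object names r.xy and n are just for illustration):
r.xy <- cor(x,y) # 0.4749596, from before
n <- length(x) # 15 observations
sqrt((1 - r.xy^2) / (n - 2)) # standard error of r, roughly 0.244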
Recall that the mean and its standard error allowed us to find how likely a given value is. Using z-scores, we would look at the normal distribution and find the percentiles. We could also calculate the 95% confidence interval for the mean (95% CI [Mean - 1.96 * SE, Mean + 1.96 * SE]).
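As a quick refresher, here is that recipe in R for sample.data from earlier (1.96 is the approximate cutoff for the middle 95% of the normal distribution):
m <- mean(sample.data)
se <- sd(sample.data) / sqrt(length(sample.data)) # standard error of the mean
c(m - 1.96 * se, m + 1.96 * se) # approximate 95% confidence interval for the mean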
With \(r\) and its standard error ready, we can do the same with Pearson’s r. The only difference is that we use a different distribution, called the t-distribution. This is pretty much like the normal distribution, but it becomes flatter (with heavier tails) when we have fewer data points (the details are not that important here). So we calculate what is called a \(t\)-statistic. The t-statistic has its own distribution and probabilities, so we can calculate the likelihood of finding a correlation of size r in a sample of size n.
\[t=\frac{r}{SE_r} = \frac{r}{\sqrt{(1-r^2)/(n-2)}}\]
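Here is that calculation done by hand in R, so you can see where the numbers in the output below come from (pt() gives the cumulative probability of the t-distribution; doubling it gives the two-sided p-value):
r.xy <- cor(x,y)
n <- length(x)
t.stat <- r.xy / sqrt((1 - r.xy^2) / (n - 2)) # roughly 1.946
2 * pt(-abs(t.stat), df = n - 2) # two-sided p-value, roughly 0.074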
To do this in R, we use the cor.test command:
cor.test(x,y)
Pearson's product-moment correlation
data: x and y
t = 1.946, df = 13, p-value = 0.0736
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.04929754 0.79403133
sample estimates:
cor
0.4749596
As you can see, the correlation is the same as before. However, we also get other information that we can use in our reporting: \(r\) = 0.4749596, 95% CI [-0.04929754, 0.79403133], \(t\)(13) = 1.946, \(p\) = 0.0736. Assuming x and y are truly unrelated, the probability of finding a t-value of 1.946 or more extreme in a sample of 15 observations (which gives 13 degrees of freedom) is 0.0736.
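In case you are curious where the confidence interval in the output comes from: for Pearson’s r, cor.test builds it on the Fisher z-transformation of r rather than on r plus or minus 1.96 * SE directly (r is bounded at -1 and +1, so its sampling distribution is skewed). A rough sketch of that calculation, purely for the curious:
r.xy <- cor(x,y)
n <- length(x)
z <- atanh(r.xy) # Fisher z-transform of r
z.se <- 1 / sqrt(n - 3) # standard error on the z scale
tanh(c(z - 1.96 * z.se, z + 1.96 * z.se)) # back-transform; roughly -0.05 to 0.79, matching the output above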
Let’s load a new dataset and look at some correlations. Load the data set telecom_churn.csv (a minimal loading sketch is shown below) and answer the questions that follow.
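A minimal way to read the file in (assuming telecom_churn.csv is in your working directory; the object name churn is just a suggestion):
churn <- read.csv("telecom_churn.csv") # read the data into a data frame
str(churn) # check the variable names and types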
Each row represents a customer, and each column contains attributes related to that customer, as described in the column descriptions. The variables are:
Questions: