November 2023

What is Correlation?

Correlation is a statistical measure which expresses the extent to which two variables are linearly related.

Correlation tests are useful for describing simple relationships among data; however, it is important to note that correlation does not tell us about cause and effect.

Correlation is described with the correlation coefficient which ranges from -1 to 1.

Once the correlation coefficient is obtained, its magnitude tells us the strength of the linear relationship, and the p-value from a correlation test tells us whether it is statistically significant.
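As a quick illustration of the two ends of that range, the sketch below uses made-up values (the vector `x` and both linear functions of it are hypothetical, not taken from either data set) to show that an exact increasing line gives r = 1 and an exact decreasing line gives r = -1.

```r
# Hypothetical values, chosen only to show the extremes of r
x <- 1:10

r_pos <- cor(x, 2 * x + 1)   # y rises exactly linearly with x
r_neg <- cor(x, -3 * x + 5)  # y falls exactly linearly with x

c(positive = r_pos, negative = r_neg)
```

Real data almost never reach either extreme; values closer to 0 indicate a weaker linear relationship.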

Calculating the Correlation Coefficient

The formula for the correlation coefficient is as follows:

r = \(\frac{n(\sum xy)-(\sum x)(\sum y)}{\sqrt{[n\sum x^2-(\sum x)^2] [n\sum y^2-(\sum y)^2]}}\)

However, the correlation coefficient can easily be calculated with most statistical software.
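To connect the formula to software, here is a minimal sketch that evaluates it term by term and compares the result with R's built-in `cor()`. It uses R's bundled `mtcars` data as a stand-in (an assumption for illustration; it is not the Auto.csv data used later in these slides).

```r
# Stand-in data: R's built-in mtcars, NOT the data set used in these slides
x <- mtcars$wt   # car weight, in 1000 lbs
y <- mtcars$mpg  # miles per gallon
n <- length(x)   # number of paired observations

# Computational formula for Pearson's r, term by term
numerator   <- n * sum(x * y) - sum(x) * sum(y)
denominator <- sqrt((n * sum(x^2) - sum(x)^2) * (n * sum(y^2) - sum(y)^2))
r_manual    <- numerator / denominator

# Built-in equivalent
r_builtin <- cor(x, y)

# TRUE when the two values agree up to floating-point error
isTRUE(all.equal(r_manual, r_builtin))
```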

Using RStudio to Find the Correlation Coefficient

To demonstrate using RStudio to calculate the correlation coefficient, we will examine two data sets: the first is a data set of Adidas sales in the US, and the second is a smaller data set of cars' miles per gallon and weight.

Adidas Data Set

Adidas Data Set: Scatter Plots and Correlation

Before calculating the correlation for a pair of variables, we should check whether the association between the variables is approximately linear.

We can do this using a scatter plot.

Examining this graph, we see no apparent linear relationship between the variables we have chosen, so we may want to examine other variables instead.

Adidas Data Set: Testing Other Variables

Once again, there appears to be no linear relationship between the chosen variables. Without a linear relationship, the correlation coefficient is not a meaningful summary of the association, so we do not compute it for these variables.
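To see why the linearity check matters, consider this small sketch with hypothetical values (they are not from the Adidas data): `y` is completely determined by `x`, yet the correlation coefficient is essentially zero because the pattern is a curve rather than a line.

```r
# Hypothetical data: y depends on x exactly, but through a curve
x <- seq(-3, 3, by = 0.5)
y <- x^2

# Pearson's r only measures the linear trend, and here there is none
r_curved <- cor(x, y)
r_curved
```

A scatter plot of these points would show the curve immediately, which is why we look at the plot before trusting r.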

MPG vs Car Weight Scatter Plot

Next, we will examine the data set containing each car's miles per gallon and weight for a linear relationship.

From the graph we can see that an approximately linear relationship does exist, so we can reasonably compute the correlation coefficient.

Computing the Correlation Coefficient

Now that we have identified a linear relationship we can compute the correlation coefficient.

As mentioned earlier, the equation is:

r = \(\frac{n(\sum xy)-(\sum x)(\sum y)}{\sqrt{[n\sum x^2-(\sum x)^2] [n\sum y^2-(\sum y)^2]}}\)

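Where n = 300, the number of complete (mpg, weight) pairs; this agrees with the df = n - 2 = 298 reported by cor.test() in the Further Testing section.

Although this is tedious to do by hand, software can easily calculate r.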

Our computed r is:

-0.8783

Interpreting the Correlation Coefficient

Our correlation coefficient, r = -0.88, indicates a strong negative linear relationship between our variables.

This means that as car weight increases, the car's miles per gallon tends to decrease, and vice versa.

Given our findings, indicating a strong relationship between our variables, we may want to conduct further testing.

Further Testing

Now that we know there is a strong correlation, we can use further testing to gain a greater understanding of our data. We can do this using the following code.

cor.test(mpg_vs_weight$mpg, mpg_vs_weight$weight)
## 
##  Pearson's product-moment correlation
## 
## data:  mpg_vs_weight$mpg and mpg_vs_weight$weight
## t = -31.709, df = 298, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9018288 -0.8495329
## sample estimates:
##        cor 
## -0.8782815

Further Testing Interpretation

As seen on the last slide, the results of our correlation test were as follows:

t = -31.709

df = 298

p < 2.2e-16, tested at \(\alpha\) = .05

Because our p-value is far below \(\alpha\) = .05, we reject the null hypothesis that the true correlation is 0 and conclude that the correlation is statistically significant.
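For readers curious where these numbers come from, the sketch below rebuilds them from the reported r and df, using the standard relationship \(t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}\) and Fisher's z transformation for the confidence interval. The variable names are illustrative, and the inputs are rounded values copied from the output, so the results match only to a few decimal places.

```r
# Values copied from the cor.test() output
r  <- -0.8782815
df <- 298
n  <- df + 2            # Pearson's test uses df = n - 2

# t statistic for H0: true correlation = 0
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with df degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df)

# 95% confidence interval via Fisher's z transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

round(t_stat, 3)  # compare with t = -31.709
round(ci, 4)      # compare with the reported 95 percent interval
p_value < 0.05    # significance decision at alpha = .05
```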

My Code For Plotly Plot

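Code for importing the data and libraries:

# Read the car data (expects Auto.csv in the working directory)
mpg_vs_weight <- read.csv("Auto.csv")
library(plotly)  # interactive plots
library(dplyr)   # data manipulation; also provides the %>% pipe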

Code for the plot:

# Scatter plot of miles per gallon (x) against weight (y);
# the ~ formulas refer to columns of mpg_vs_weight
plot_ly(data = mpg_vs_weight, x = ~mpg,
        y = ~weight,
        type = "scatter", mode = "markers",
        marker = list(size = 5, color = "red")) %>%
  layout(title = "Miles per Gallon vs Weight",
         xaxis = list(title = "Miles per Gallon"),
         yaxis = list(title = "Weight"))