Correlation

Title: Main characteristics of correlations

Synopsis: This document is meant to help remember the main characteristics of correlations.

Where does the correlation coefficient come from?

Sir Francis Galton

Galton produced over 340 papers and books. He also created the statistical concept of correlation and widely promoted regression toward the mean. He was the first to apply statistical methods to the study of human differences and inheritance of intelligence, and introduced the use of questionnaires and surveys for collecting data on human communities, which he needed for genealogical and biographical works and for his anthropometric studies.

Galton's first study was about the comparison of measurements.

He wanted to know whether there was a relationship between the physical measurements of related adults. Galton looked at the relationship between parents and their adult children in terms of a physical measurement (height). Because the two variables were both heights, and therefore on the same scale, Galton could compare them directly. He came up with a statistic he called ‘r’.

The problem was that he couldn't compare variables on different scales. Some years later he had the opportunity to work on variables that actually were on different scales: this time the relationship between physical characteristics within the same person, specifically arm length and height. He realised that if he standardized both of the original variables (turned them into z-scores), they would end up on the same scale and could therefore be compared just as before. So two continuous variables, regardless of their scale, could be compared to one another.

Pearson was Galton's student and was given the task of formalising the method for calculating ‘r’. Since he was the one who formalised it, he got all the credit.

(To see how this is calculated visit: Foundations of Data Analysis - Part 1 - Week 3: Bivariate Distributions > Lecture Videos > The Origin of “r”. )

Pearson states that ‘r’ is equal to the sum of the products of the z-scores of the two variables, divided by the number of observations minus one.

r = sum(z_x * z_y) / (n - 1)

So that's where the correlation coefficient, or Pearson Correlation Coefficient, comes from.
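The z-score formula can be checked numerically. The sketch below is only an illustration: it uses two columns of R's built-in mtcars data (an assumption made here; any two numeric variables would do) and compares the hand-computed value with the result of cor().

```r
# Illustrative check on the built-in mtcars data (not the Galton data).
x <- mtcars$wt
y <- mtcars$mpg
n <- length(x)

# Convert each variable to z-scores: subtract the mean, divide by the sample sd.
zx <- (x - mean(x)) / sd(x)
zy <- (y - mean(y)) / sd(y)

# Sum of the products of the z-scores, divided by n - 1.
r_manual <- sum(zx * zy) / (n - 1)

r_manual
cor(x, y)

# The two values agree up to floating-point rounding.
all.equal(r_manual, cor(x, y))
```

Note that sd() also divides by n - 1, which is why the denominator in the formula is n - 1 and not n.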

Correlation has no unit (it is dimensionless). Correlation is not affected by the scale of the variable. A relationship is a relationship regardless of that scale (remember the z-score).

Correlation is not affected by the scale of the variable because, to calculate the correlation between two variables, you first convert their values to z-scores, which eliminates the scale effect.
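A quick way to see the scale invariance is to change the units of both variables and recompute the correlation. The heights and weights below are made-up values used only for illustration.

```r
# Made-up illustrative data: heights in inches, weights in pounds.
height_in <- c(61, 64, 66, 68, 70, 72, 74)
weight_lb <- c(110, 128, 135, 150, 160, 178, 190)

# The same measurements expressed in centimetres and kilograms.
height_cm <- height_in * 2.54
weight_kg <- weight_lb * 0.45359237

r_original  <- cor(height_in, weight_lb)
r_converted <- cor(height_cm, weight_kg)

# Changing the units does not change the correlation.
all.equal(r_original, r_converted)
```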

Correlation does not imply causation.

Correlation describes a relationship.

Cor(X, Y) = Cor(Y, X)

It ranges from -1 to +1 and is represented by ‘r’:

Negative correlation: Rxy < 0 (the dependent variable (y) decreases as the independent variable (x) increases)
Positive correlation: Rxy > 0 (the dependent variable (y) increases as the independent variable (x) increases)
Perfect positive correlation: Rxy = 1
Rxy = 0: there is no linear correlation
Non-linear correlation: ‘r’ does not measure this type of relationship
Cor(X, Y) = 1 and Cor(X, Y) = -1 only when the X and Y observations fall perfectly on a positively or negatively sloped line, respectively

Cor(X, Y) measures the strength of the linear relationship between the X and Y data, with stronger relationships as Cor(X, Y) heads towards -1 or 1.
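The two extreme cases can be reproduced with points that lie exactly on a line. The slopes and intercepts below are arbitrary values chosen for illustration.

```r
# Points that fall exactly on a straight line (arbitrary slopes and intercepts).
x      <- 1:10
y_up   <- 2 * x + 3      # positive slope
y_down <- -0.5 * x + 7   # negative slope

r_up   <- cor(x, y_up)    # equals 1 up to floating-point rounding
r_down <- cor(x, y_down)  # equals -1 up to floating-point rounding

# Symmetry: the order of the arguments does not matter.
same_both_ways <- isTRUE(all.equal(cor(x, y_up), cor(y_up, x)))

c(r_up = r_up, r_down = r_down)
same_both_ways
```

The size of the slope plays no role: only its sign and how tightly the points follow the line affect ‘r’.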

So you should only use the Pearson Correlation Coefficient if the relationship in your data is linear. The first evaluation is to draw a scatterplot, not only to see whether there is a relationship but also to see whether that relationship is linear.

The question is: what happens to ‘r’ when that relationship is not linear?

Or in other words:

How does non-linearity affect the Pearson Correlation Coefficient?

Non-linearity occurs, for example, when the relationship ascends up to a certain point and descends after it. Pearson's ‘r’ sees no relationship there. A good example is the Yerkes-Dodson curve, an inverted U-shaped relationship.

It is important to consider this because if we find a low correlation (.43) we could conclude that there is no relationship. In the case of an inverted U shape there is a relationship, but it is not captured by the Pearson coefficient because it is not linear. Hence the need for a graph that lets us see this.

The danger here is that, up to a certain point, an inverted U-shaped relationship actually looks linear: think about one half of the curve. If a study fails to consider the whole spectrum of the data, it will conclude that the relationship is linear.
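Both effects can be seen with a small synthetic example. The data below are made up for illustration (a perfectly symmetric inverted U); they are not taken from any study.

```r
# Made-up data: a symmetric inverted U, y = 25 - x^2.
x <- seq(-5, 5, by = 0.5)
y <- 25 - x^2

# Over the whole range the ascending and descending halves cancel out,
# so r is essentially 0 even though y depends completely on x.
r_full <- cor(x, y)

# Looking only at the ascending half, the relationship looks strongly linear.
left   <- x <= 0
r_left <- cor(x[left], y[left])

c(full_range = r_full, ascending_half = r_left)

# The scatterplot reveals what the single number hides.
plot(x, y, pch = 20, bty = "l")
```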

So, draw your picture. Visualize the relationship. And decide: is the Pearson Correlation Coefficient a good choice?

(For more details see: Foundations of Data Analysis - Part 1 - Week 3: Bivariate Distributions > Lecture Videos > Linearity and the Correlation Coefficient)

The joint behavior of two quantitative variables can be observed in a scatterplot and measured with the Pearson Correlation Coefficient (r).

It can also be used if one of the variables is continuous and the other is dichotomous (man/woman coded as 0/1).
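As a sketch of the dichotomous case, the built-in mtcars data has a 0/1 column, am (0 = automatic, 1 = manual transmission), that can be correlated with the continuous mpg column. Using mtcars here is an assumption made for illustration; it is not the data discussed in the course.

```r
# am is a 0/1 variable: 0 = automatic, 1 = manual transmission.
r_binary <- cor(mtcars$mpg, mtcars$am)
r_binary

# The sign of r tells which group has the higher mean of the continuous variable.
group_means <- tapply(mtcars$mpg, mtcars$am, mean)
group_means
```

A positive value means that the group coded as 1 has the higher mean, so the coding of the two categories determines the sign of ‘r’.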

Scatterplots and the r value

The correlation coefficient then emerged as a standard metric of the linear relationship between two numeric variables.

The question is how to set up a graph and how to tie what we see in it to the actual correlation value.

Keep in mind that correlation does not imply causation.

Correlation is nothing more than a numerical value for the relationship between two numerical variables.

The point here is to define which variable goes on which axis.

The idea is to identify which variable drives the other. Typically the variable of interest (dependent variable) goes on the y-axis and the driving variable (independent variable) goes on the x-axis.

A practical example using the original Galton dataset

# To load the Galton dataset you can
  Galton = read.csv("galton.csv")
# or
  
  #library(UsingR)
  #data(Galton)


str(Galton)

## 'data.frame':    928 obs. of  3 variables:
##  $ X     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ parent: num  70.5 68.5 65.5 64.5 64 67.5 67.5 67.5 66.5 66.5 ...
##  $ child : num  61.7 61.7 61.7 61.7 61.7 62.2 62.2 62.2 62.2 62.2 ...

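# child (dependent) on the y-axis, parent (independent) on the x-axis
plot(child ~ parent, data=Galton, bty="l", pch=20)
# the fitted line must use the same formula as the plot: child ~ parent
abline(lm(child ~ parent, data=Galton), lty=1, lwd=2, col="red")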

In this case the correlation is:

  # See ?cor for more details
  cor(Galton$parent,Galton$child)

## [1] 0.4587624

  cor(Galton$child,Galton$parent)

## [1] 0.4587624

Correlation matrix

You can create a graphical correlation matrix.

    #install.packages("corrplot")
    library(corrplot)
    M <- cor(mtcars)
    corrplot.mixed(M)

    #Use ?corrplot for more details

See the corrplot package documentation for the wide variety of settings of the corrplot() function.

Or a table of correlations

    M <- cor(mtcars[,1:3])
    M

##             mpg        cyl       disp
## mpg   1.0000000 -0.8521620 -0.8475514
## cyl  -0.8521620  1.0000000  0.9020329
## disp -0.8475514  0.9020329  1.0000000

    #View(M)

Matrix of scatterplots

You can also create a matrix of scatterplots:

    # pairs() expects the raw data, not the correlation matrix
    pairs(mtcars)

Hypothesis test for linear correlation

H0: ρ = 0 - the correlation is equal to 0
H1: ρ ≠ 0 - the correlation is different from 0

To check this, a t-test is performed.

Look at the p-value of the test: if it is greater than 0.05, H0 is not rejected.
If it is smaller, H0 is rejected in favor of H1.
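In R this test is available through cor.test(). The sketch below applies it to two columns of the built-in mtcars data, chosen here only for illustration.

```r
# Test H0: correlation = 0 against H1: correlation != 0 (two-sided t-test).
result <- cor.test(mtcars$wt, mtcars$mpg, method = "pearson")

result$estimate    # the sample correlation r
result$statistic   # the t statistic
result$p.value     # the p-value of the test

# Decision at the 5% level: TRUE means H0 is rejected.
reject_h0 <- result$p.value < 0.05
reject_h0
```

Printing result directly also shows the confidence interval for the correlation.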

Note:

Correlation is not the same as cause and effect. Two variables can be highly correlated and yet have no cause-and-effect relationship between them.

How does non-linearity affect the Pearson Correlation Coefficient?

It cannot be used when the relationship is not linear, because it will provide a misleading r value. To avoid misusing the Pearson correlation, first visualize the relationship, ask yourself “Is the Pearson correlation a good choice?”, and use it only if it is appropriate for your data.

A little more about the Pearson Correlation Coefficient

Use the Pearson correlation only if the relationship between the variables is linear.

Before applying cor() to a data set, create a scatterplot (plot()) and visually check whether the data look correlated, even superficially. If the plot already shows that the relationship is not linear, the Pearson correlation does not apply.
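A classic illustration of why the plot has to come first is the anscombe data set that ships with R: four x/y pairs with nearly the same Pearson correlation but very different shapes. The sketch below is one possible way to look at it and is not part of the original course material.

```r
# anscombe has columns x1..x4 and y1..y4: four separate x/y pairs.
r_values <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
round(r_values, 3)

# Plot the four pairs side by side; only some of them look linear.
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i), pch = 20, bty = "l")
}
par(op)
```

The four correlations are almost identical, so cor() alone cannot tell these data sets apart.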

For all columns

    ```r
    cor(longley, method = "pearson")
    ```
    
    ```
    ##              GNP.deflator       GNP Unemployed Armed.Forces Population
    ## GNP.deflator    1.0000000 0.9915892  0.6206334    0.4647442  0.9791634
    ## GNP             0.9915892 1.0000000  0.6042609    0.4464368  0.9910901
    ## Unemployed      0.6206334 0.6042609  1.0000000   -0.1774206  0.6865515
    ## Armed.Forces    0.4647442 0.4464368 -0.1774206    1.0000000  0.3644163
    ## Population      0.9791634 0.9910901  0.6865515    0.3644163  1.0000000
    ## Year            0.9911492 0.9952735  0.6682566    0.4172451  0.9939528
    ## Employed        0.9708985 0.9835516  0.5024981    0.4573074  0.9603906
    ##                   Year  Employed
    ## GNP.deflator 0.9911492 0.9708985
    ## GNP          0.9952735 0.9835516
    ## Unemployed   0.6682566 0.5024981
    ## Armed.Forces 0.4172451 0.4573074
    ## Population   0.9939528 0.9603906
    ## Year         1.0000000 0.9713295
    ## Employed     0.9713295 1.0000000
    ```

For two specific columns

    ```r
    with(longley, cor(Employed,GNP,method="pearson"))
    ```
    
    ```
    ## [1] 0.9835516
    ```

Creating a correlation matrix

  #str(longley)
  columnNames = c("Unemployed","Population","Employed")
  cor(longley[,columnNames])

##            Unemployed Population  Employed
## Unemployed  1.0000000  0.6865515 0.5024981
## Population  0.6865515  1.0000000 0.9603906
## Employed    0.5024981  0.9603906 1.0000000

Spearman Correlation coefficient

If the data are ordinal or do not meet the assumptions for the Pearson correlation, use the Spearman Correlation Coefficient (rho).
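Spearman's coefficient works on the ranks of the values, so it captures any monotonic relationship and not only a linear one. The made-up example below (an exponential curve chosen for illustration) shows the difference between the two methods.

```r
# Made-up data: y always increases with x, but not along a straight line.
x <- 1:15
y <- exp(x / 3)

r_pearson  <- cor(x, y, method = "pearson")
r_spearman <- cor(x, y, method = "spearman")

# Spearman is the Pearson correlation computed on the ranks.
r_on_ranks <- cor(rank(x), rank(y), method = "pearson")

c(pearson = r_pearson, spearman = r_spearman, on_ranks = r_on_ranks)
```

Because the ranks of x and y agree perfectly, Spearman's rho is 1 while Pearson's ‘r’ stays below 1.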

For all columns

    ```r
    cor(longley, method = "spearman")
    ```
    
    ```
    ##              GNP.deflator       GNP Unemployed Armed.Forces Population
    ## GNP.deflator    1.0000000 0.9970588  0.6647059    0.2205882  0.9970588
    ## GNP             0.9970588 1.0000000  0.6382353    0.2235294  0.9941176
    ## Unemployed      0.6647059 0.6382353  1.0000000   -0.3411765  0.6852941
    ## Armed.Forces    0.2205882 0.2235294 -0.3411765    1.0000000  0.2264706
    ## Population      0.9970588 0.9941176  0.6852941    0.2264706  1.0000000
    ## Year            0.9970588 0.9941176  0.6852941    0.2264706  1.0000000
    ## Employed        0.9823529 0.9852941  0.5647059    0.2264706  0.9764706
    ##                   Year  Employed
    ## GNP.deflator 0.9970588 0.9823529
    ## GNP          0.9941176 0.9852941
    ## Unemployed   0.6852941 0.5647059
    ## Armed.Forces 0.2264706 0.2264706
    ## Population   1.0000000 0.9764706
    ## Year         1.0000000 0.9764706
    ## Employed     0.9764706 1.0000000
    ```

For two specific columns

    ```r
    with(longley, cor(Employed,GNP,method="spearman"))
    ```
    
    ```
    ## [1] 0.9852941
    ```