Correlation

options(scipen=999)
library(ggplot2)
library(dplyr)
library(reshape2)
library(DMwR)
library(knitr)
library(kableExtra)

BaseballDf <- read.csv("https://raw.githubusercontent.com/akulapa/Data621-Week05-Discussion/master/moneyball-training-data.csv", header= TRUE, stringsAsFactors = F)
BaseballDf$TEAM_BATTING_1B = BaseballDf$TEAM_BATTING_H - BaseballDf$TEAM_BATTING_2B - BaseballDf$TEAM_BATTING_3B - BaseballDf$TEAM_BATTING_HR

BaseballDf_kNN <- BaseballDf %>% 
  select(TARGET_WINS, TEAM_BATTING_2B,TEAM_BATTING_3B,TEAM_BATTING_HR,TEAM_BATTING_BB,TEAM_BATTING_SO,TEAM_BASERUN_SB,TEAM_PITCHING_H,TEAM_PITCHING_SO,TEAM_FIELDING_E,TEAM_FIELDING_DP,TEAM_BATTING_1B)

BaseballDf_kNN <- knnImputation(BaseballDf_kNN, 15, meth='weighAvg')

for(i in 1:ncol(BaseballDf_kNN)){
  BaseballDf_kNN[is.na(BaseballDf_kNN[,i]), i] <- as.numeric(as.character(BaseballDf_kNN[is.na(BaseballDf_kNN[,i]), i]))
}

df <- BaseballDf_kNN %>% 
  select(TEAM_PITCHING_H, TARGET_WINS)

cor.df <- data.frame(Name = "Pearson's", "cor" = round(cor(df, method = "pearson", use = "pairwise.complete.obs")[2],2))
cor.df <- rbind(cor.df, data.frame(Name = "Spearman's", "cor" = round(cor(df, method = "spearman", use = "pairwise.complete.obs")[2],2)))
cor.df <- rbind(cor.df, data.frame(Name = "Kendall's Tau", "cor" = round(cor(df, method = "kendall", use = "pairwise.complete.obs")[2],2)))

While working on Data621 project-1 using moneyball dataset, I noticed correlation coefficient computed using Pearson, Spearman and Kendall's Tau methods is not same. Especially correlation coefficient between variables TEAM_PITCHING_H and TARGET_WINS is different.

cor.df %>% 
  kable("html",caption = "Correlation Coefficient", aligh="c") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center", font_size = 12)

Correlation Coefficient
Name	cor
Pearson’s	-0.11
Spearman’s	0.21
Kendall’s Tau	0.15

Though correlation between variables is not strong, it does vary from method to method. Pearson's method show negative correlation whereas Spearman's and Kendall's method show positive correlation between variables. This observation geared me towards writing this blog.

In this blog post, I will discuss

Data Normality Tests
Correlation Tests
Comparision between Pearson, Spearman and Kendall's Tau correlation coefficients.

Data Normality Tests

Step one, inspect the data distribution using scatter-plot. The pattern we are trying to identify does TARGET_WINS value increase or decrease with a change in TEAM_PITCHING_H variable.

df <- BaseballDf_kNN %>% 
  select(TEAM_PITCHING_H, TARGET_WINS)

df.lm <-lm(TARGET_WINS~TEAM_PITCHING_H, data=df)


if (coef(df.lm)[2] < 0) {
  z <- list(xx = format(coef(df.lm)[1], digits = 4),
          yy = format(abs(coef(df.lm)[2]), digits = 4),
          r2 = format(summary(df.lm)$r.squared, digits = 3));
  eq <- substitute(italic(hat(y)) == xx - yy %.% italic(x)*","~~italic(r)^2~"="~r2,z)  
} else {
    z <- list(xx = format(coef(df.lm)[1], digits = 4),
        yy = format(abs(coef(df.lm)[2]), digits = 4),
        r2 = format(summary(df.lm)$r.squared, digits = 3));
    
  eq <- substitute(italic(hat(y)) == xx + yy %.% italic(x)*","~~italic(R)^2~"="~r2,z) 
}


df.lmeq <- as.character(as.expression(eq))

ggplot(df, aes(x=TEAM_PITCHING_H,y=TARGET_WINS)) + 
  geom_point(shape=1, size=1, color="black", alpha=1/2) +
    geom_smooth(method=lm, se=T, color="red", size=0.5) +
  labs(title = sprintf("Scatter Plot"), subtitle = "TEAM_PITCHING_H Vs. TARGET_WINS") + xlab("TEAM_PITCHING_H") + ylab("TARGET_WINS") +
  annotate("text", x = 20000, y = 75, label = df.lmeq, colour="red", size = 3.5, parse=T) + theme(
  panel.background = element_rect(fill = "lightblue",
                                colour = "lightblue",
                                size = 0.5, linetype = "solid"),
  panel.grid.major = element_line(size = 0.5, linetype = 'solid',
                                colour = "white"), 
  panel.grid.minor = element_line(size = 0.25, linetype = 'solid',
                                colour = "white"))

Looking at above plot, one can notice

Data is not spread evenly along x-axis. Shaded area suggests relation is not linear.
The plot does not suggest any monotonic correlation either, as data is not spread evenly along x-axis.
It suggests the existence of outliers and outliers have a great impact on the linear model.
\(R^2\) value is 0.01, suggesting the linear model explains only 1% of TARGET_WINS.

However, \(\beta_1\) value is -0.001231, indicates for every hit allowed during pitching, chances of winning the game are decreased by 0.001231.

Step two,

Let’s test if two variables are normally distributed. I will be using Shapiro-Wilk test. The null hypothesis of this test is that the population is normally distributed.

\(H_0:\) Data is normally distributed.

\(H_A:\) Data is not normally distributed.

#Test TEAM_PITCHING_H
shapiro.test(df$TEAM_PITCHING_H)

## 
##  Shapiro-Wilk normality test
## 
## data:  df$TEAM_PITCHING_H
## W = 0.24611, p-value < 0.00000000000000022

#Test TARGET_WINS
shapiro.test(df$TARGET_WINS)

## 
##  Shapiro-Wilk normality test
## 
## data:  df$TARGET_WINS
## W = 0.98825, p-value = 0.000000000001007

For both variables p-value is less than significance level 0.05, indicating data is not normally distributed.

#Test TEAM_PITCHING_H

ggplot(data=as.data.frame(qqnorm( df$TEAM_PITCHING_H , plot=F)), mapping=aes(x=x, y=y)) + 
    geom_point(shape=1, size=1, color="black", alpha=1/2) + geom_smooth(method="lm", se=FALSE, color = "red", size = 0.5) + theme(
  panel.background = element_rect(fill = "lightblue",
                                colour = "lightblue",
                                size = 0.5, linetype = "solid"),
  panel.grid.major = element_line(size = 0.5, linetype = 'solid',
                                colour = "white"), 
  panel.grid.minor = element_line(size = 0.25, linetype = 'solid',
                                colour = "white")) + labs(title = sprintf("Q-Q Normal Plot"), subtitle = "TEAM_PITCHING_H")

#Test TARGET_WINS
ggplot(data=as.data.frame(qqnorm( df$TARGET_WINS , plot=F)), mapping=aes(x=x, y=y)) + 
    geom_point(shape=1, size=1, color="black", alpha=1/2) + geom_smooth(method="lm", se=FALSE, color = "red", size = 0.5) + theme(
  panel.background = element_rect(fill = "lightblue",
                                colour = "lightblue",
                                size = 0.5, linetype = "solid"),
  panel.grid.major = element_line(size = 0.5, linetype = 'solid',
                                colour = "white"), 
  panel.grid.minor = element_line(size = 0.25, linetype = 'solid',
                                colour = "white")) + labs(title = sprintf("Q-Q Normal Plot"), subtitle = "TARGET_WINS")

QQ-Plot for both variables show data is not normally distributed.

Correlation Tests

While examining a relationship between two quantitative variables, we are looking for

The strength of the relationship(\(\rho\)) - correlation coefficient. Value of the coefficient ranges from -1 to 1.
Direction of the relationship - positive or negative. -1 indicates very strong inverse relation and 1 indicates a very strong proportional relation between variables.

Null hypothesis for correlation test is there is no relationship between the variables. Alternative is there is some relationship between variables.

\(H_0: \rho = 0\) Hits allowed during pitching has no impact on output of a baseball game.

\(H_A: \rho \ne 0\) Hits allowed during pitching does impact the output of a baseball game.

Pearson’s correlation test

#Pearson correlation test:
cortest <- cor.test(df$TEAM_PITCHING_H, df$TARGET_WINS, method = "pearson")
cortest

## 
##  Pearson's product-moment correlation
## 
## data:  df$TEAM_PITCHING_H and df$TARGET_WINS
## t = -5.2745, df = 2274, p-value = 0.0000001457
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.15034489 -0.06916252
## sample estimates:
##        cor 
## -0.1099371

p-value of Pearson's correlation test is 0.0000001457, which is less than significance level 0.05. We reject the null hypothesis (\(H_0\)) and accept the alternative hypothesis.

Spearman’s correlation test

#Spearman correlation test:
cortest <- cor.test(df$TEAM_PITCHING_H, df$TARGET_WINS, method = "spearman")
cortest

## 
##  Spearman's rank correlation rho
## 
## data:  df$TEAM_PITCHING_H and df$TARGET_WINS
## S = 1545200000, p-value < 0.00000000000000022
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.2136212

p-value of Spearman's correlation test is 0.00000000000000022, which is less than significance level 0.05. We reject the null hypothesis (\(H_0\)) and accept the alternative hypothesis.

Kendall’s Tau correlation test

#Kendall's Tau correlation test:
cortest <- cor.test(df$TEAM_PITCHING_H, df$TARGET_WINS, method = "kendall")
cortest

## 
##  Kendall's rank correlation tau
## 
## data:  df$TEAM_PITCHING_H and df$TARGET_WINS
## z = 10.719, p-value < 0.00000000000000022
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##       tau 
## 0.1514105

p-value of Kendall's correlation test is 0.00000000000000022, which is less than significance level 0.05. We reject the null hypothesis (\(H_0\)) and accept the alternative hypothesis.

Comparision between `Pearson`, `Spearman` and `Kendall's Tau` correlation coefficients.

Pearson's correlation coefficient is best when data is linear. Spearman's and Kendall's correlation coefficient explains relationship better when data is monotonic, meaning data correlated in a nonlinear manner may be exponential patterns.
Pearson's coefficient is effected by outliers. Whereas Spearman's and Kendall's coefficient are resistant to the existence of outliers in the data, they are calculated using rank and not actual data.
Correlation tests only provide insights into if a relationship exists or not. They do not suggest which one is better.
Finally, to analyze the relationships of the dataset we have used, baseball domain knowledge is essential. Even though Spearman's(0.21) and Kendall's(0.15) correlation coefficient show better value in terms relationship and direction (positive) they only explain observed co-variation. It does not provide evidence for causal relationship between TEAM_PITCHING_H and TARGET_WINS. In other words, allowing more hits while pitching does not improve chances of winning the game. In fact opposite is true, it decreases chances of winning.
I lean towards using Pearson's correlation coefficient, even though -0.11 is very week inverse relation. It explains proper relationship between variables, that is variables are inversely related. I also suggest conducting all three test before drawing any inference.

Formulas

Pearson’s coefficent,

\[r_p = \frac{N\sum XY - (\sum X)(\sum Y)}{\sqrt{ [N\sum X^2 - (\sum X)^2] [N\sum Y^2 - (\sum Y)^2]} }\]

Spearman’s coefficent, where the \(d_i\) difference between ranks.

\[r_s = 1 - \frac{6\sum {d_i}^2}{n(n^2 - 1)}\]

Kendall’s Tau coefficent,

\[\tau = \frac{C - D}{\sqrt{(C+D+X_0)(C+D+Y_0)}} \]

Where, \[C = \sum Concordance\ pairs,\ D = \sum Discordance\ pairs,\ X_0 = \sum X-ties, ,\ Y_0 = \sum Y-ties\]

Concordance pair(\(C\)) is defined as \(X_j > X_i\) and \(Y_j > Y_i\) or \(X_j < X_i\) and \(Y_j < Y_i\), where \(j > i\)

Discordance pair(\(D\)) is \(X_j > X_i\) and \(Y_j < Y_i\) or \(X_j < X_i\) and \(Y_j > Y_i\), where \(j > i\)

X-ties(\(X_0\)) is \(X_j = X_i\) and \(Y_j \ne Y_i\), where \(j > i\)

Y-ties(\(Y_0\)) is \(X_j \ne X_i\) and \(Y_j = Y_i\), where \(j > i\)

\(n\) = Observations

Correlation

Pavan Akula

March 16, 2018

Data Normality Tests

Correlation Tests

Comparision between `Pearson`, `Spearman` and `Kendall's Tau` correlation coefficients.

Formulas

References

Correlation

Pavan Akula

March 16, 2018

Data Normality Tests

Correlation Tests

Comparision between Pearson, Spearman and Kendall's Tau correlation coefficients.

Formulas

References

Comparision between `Pearson`, `Spearman` and `Kendall's Tau` correlation coefficients.