options(scipen=999)
library(ggplot2)
library(dplyr)
library(reshape2)
library(DMwR)
library(knitr)
library(kableExtra)
BaseballDf <- read.csv("https://raw.githubusercontent.com/akulapa/Data621-Week05-Discussion/master/moneyball-training-data.csv", header= TRUE, stringsAsFactors = F)
BaseballDf$TEAM_BATTING_1B = BaseballDf$TEAM_BATTING_H - BaseballDf$TEAM_BATTING_2B - BaseballDf$TEAM_BATTING_3B - BaseballDf$TEAM_BATTING_HR
BaseballDf_kNN <- BaseballDf %>%
select(TARGET_WINS, TEAM_BATTING_2B,TEAM_BATTING_3B,TEAM_BATTING_HR,TEAM_BATTING_BB,TEAM_BATTING_SO,TEAM_BASERUN_SB,TEAM_PITCHING_H,TEAM_PITCHING_SO,TEAM_FIELDING_E,TEAM_FIELDING_DP,TEAM_BATTING_1B)
BaseballDf_kNN <- knnImputation(BaseballDf_kNN, 15, meth='weighAvg')
for(i in 1:ncol(BaseballDf_kNN)){
BaseballDf_kNN[is.na(BaseballDf_kNN[,i]), i] <- as.numeric(as.character(BaseballDf_kNN[is.na(BaseballDf_kNN[,i]), i]))
}
df <- BaseballDf_kNN %>%
select(TEAM_PITCHING_H, TARGET_WINS)
cor.df <- data.frame(Name = "Pearson's", "cor" = round(cor(df, method = "pearson", use = "pairwise.complete.obs")[2],2))
cor.df <- rbind(cor.df, data.frame(Name = "Spearman's", "cor" = round(cor(df, method = "spearman", use = "pairwise.complete.obs")[2],2)))
cor.df <- rbind(cor.df, data.frame(Name = "Kendall's Tau", "cor" = round(cor(df, method = "kendall", use = "pairwise.complete.obs")[2],2)))
While working on Data621 project-1 with the moneyball dataset, I noticed that the correlation coefficients computed with the Pearson, Spearman and Kendall's Tau methods are not the same. In particular, the coefficient between the variables TEAM_PITCHING_H and TARGET_WINS differs from method to method.
cor.df %>%
kable("html", caption = "Correlation Coefficient", align = "c") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center", font_size = 12)
Name | cor |
---|---|
Pearson’s | -0.11 |
Spearman’s | 0.21 |
Kendall’s Tau | 0.15 |
Though the correlation between the variables is not strong, it varies from method to method. Pearson's method shows a negative correlation, whereas Spearman's and Kendall's methods show a positive correlation between the variables. This observation prompted me to write this blog post.
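One way such a sign disagreement can arise is a single extreme point. The sketch below uses made-up numbers (not the moneyball data) with ten points on a rising line and one point far to the right and low. It only illustrates the mechanism; it does not establish that this is what happens with TEAM_PITCHING_H.

```r
# Synthetic data (hypothetical, not taken from the moneyball set):
# ten points on a rising line plus one extreme point.
x <- c(1:10, 100)
y <- c(1:10, -20)

pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")
kendall  <- cor(x, y, method = "kendall")

# Pearson is pulled below zero by the single extreme point,
# while both rank-based coefficients stay positive.
print(c(pearson = pearson, spearman = spearman, kendall = kendall))
```

Because the rank-based methods only see that the last point has the largest x and the smallest y, its distance from the rest of the data does not matter to them.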
In this blog post, I will discuss the Pearson, Spearman and Kendall's Tau correlation coefficients.
Step one, inspect the data distribution using a scatter plot. The pattern we are trying to identify is whether the TARGET_WINS value increases or decreases as the TEAM_PITCHING_H variable changes.
df <- BaseballDf_kNN %>%
select(TEAM_PITCHING_H, TARGET_WINS)
df.lm <-lm(TARGET_WINS~TEAM_PITCHING_H, data=df)
# Values shown in the equation label: intercept, absolute slope and R-squared
z <- list(xx = format(coef(df.lm)[1], digits = 4),
yy = format(abs(coef(df.lm)[2]), digits = 4),
r2 = format(summary(df.lm)$r.squared, digits = 3))
# Choose the sign printed in front of the slope from the sign of the fitted slope
if (coef(df.lm)[2] < 0) {
eq <- substitute(italic(hat(y)) == xx - yy %.% italic(x)*","~~italic(r)^2~"="~r2,z)
} else {
eq <- substitute(italic(hat(y)) == xx + yy %.% italic(x)*","~~italic(r)^2~"="~r2,z)
}
df.lmeq <- as.character(as.expression(eq))
ggplot(df, aes(x=TEAM_PITCHING_H,y=TARGET_WINS)) +
geom_point(shape=1, size=1, color="black", alpha=1/2) +
geom_smooth(method=lm, se=T, color="red", size=0.5) +
labs(title = sprintf("Scatter Plot"), subtitle = "TEAM_PITCHING_H Vs. TARGET_WINS") + xlab("TEAM_PITCHING_H") + ylab("TARGET_WINS") +
annotate("text", x = 20000, y = 75, label = df.lmeq, colour="red", size = 3.5, parse=T) + theme(
panel.background = element_rect(fill = "lightblue",
colour = "lightblue",
size = 0.5, linetype = "solid"),
panel.grid.major = element_line(size = 0.5, linetype = 'solid',
colour = "white"),
panel.grid.minor = element_line(size = 0.25, linetype = 'solid',
colour = "white"))
Looking at the plot above, the shaded area around the fitted line suggests that the relation is not linear.
However, the \(\beta_1\) value is -0.001231, which indicates that for every additional hit allowed during pitching, the predicted TARGET_WINS decreases by 0.001231.
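As a quick sense of scale for that slope, and assuming the fitted straight line is taken at face value, 100 additional hits allowed change the predicted TARGET_WINS by about an eighth of a win:

\[\Delta \hat{y} = \beta_1 \times 100 = -0.001231 \times 100 = -0.1231\]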
Step two, let’s test whether the two variables are normally distributed. I will be using the Shapiro-Wilk test. The null hypothesis of this test is that the population is normally distributed.
\(H_0:\) Data is normally distributed.
\(H_A:\) Data is not normally distributed.
#Test TEAM_PITCHING_H
shapiro.test(df$TEAM_PITCHING_H)
##
## Shapiro-Wilk normality test
##
## data: df$TEAM_PITCHING_H
## W = 0.24611, p-value < 0.00000000000000022
#Test TARGET_WINS
shapiro.test(df$TARGET_WINS)
##
## Shapiro-Wilk normality test
##
## data: df$TARGET_WINS
## W = 0.98825, p-value = 0.000000000001007
For both variables the p-value is less than the significance level of 0.05, so we reject the null hypothesis: neither variable appears to be normally distributed.
#Test TEAM_PITCHING_H
ggplot(data=as.data.frame(qqnorm( df$TEAM_PITCHING_H , plot=F)), mapping=aes(x=x, y=y)) +
geom_point(shape=1, size=1, color="black", alpha=1/2) + geom_smooth(method="lm", se=FALSE, color = "red", size = 0.5) + theme(
panel.background = element_rect(fill = "lightblue",
colour = "lightblue",
size = 0.5, linetype = "solid"),
panel.grid.major = element_line(size = 0.5, linetype = 'solid',
colour = "white"),
panel.grid.minor = element_line(size = 0.25, linetype = 'solid',
colour = "white")) + labs(title = sprintf("Q-Q Normal Plot"), subtitle = "TEAM_PITCHING_H")
#Test TARGET_WINS
ggplot(data=as.data.frame(qqnorm( df$TARGET_WINS , plot=F)), mapping=aes(x=x, y=y)) +
geom_point(shape=1, size=1, color="black", alpha=1/2) + geom_smooth(method="lm", se=FALSE, color = "red", size = 0.5) + theme(
panel.background = element_rect(fill = "lightblue",
colour = "lightblue",
size = 0.5, linetype = "solid"),
panel.grid.major = element_line(size = 0.5, linetype = 'solid',
colour = "white"),
panel.grid.minor = element_line(size = 0.25, linetype = 'solid',
colour = "white")) + labs(title = sprintf("Q-Q Normal Plot"), subtitle = "TARGET_WINS")
The Q-Q plots for both variables also show that the data are not normally distributed.
While examining a relationship between two quantitative variables, we are looking for two things:
The strength of the relationship (\(\rho\)), given by the correlation coefficient. The value of the coefficient ranges from -1 to 1.
The direction of the relationship, positive or negative. -1 indicates a very strong inverse relation and 1 indicates a very strong direct relation between the variables.
The null hypothesis for a correlation test is that there is no relationship between the variables. The alternative is that there is some relationship between the variables.
\(H_0: \rho = 0\) Hits allowed during pitching are not related to the outcome of a baseball game.
\(H_A: \rho \ne 0\) Hits allowed during pitching are related to the outcome of a baseball game.
Pearson’s correlation test
#Pearson correlation test:
cortest <- cor.test(df$TEAM_PITCHING_H, df$TARGET_WINS, method = "pearson")
cortest
##
## Pearson's product-moment correlation
##
## data: df$TEAM_PITCHING_H and df$TARGET_WINS
## t = -5.2745, df = 2274, p-value = 0.0000001457
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.15034489 -0.06916252
## sample estimates:
## cor
## -0.1099371
The p-value of Pearson's correlation test is 0.0000001457, which is less than the significance level of 0.05. We reject the null hypothesis (\(H_0\)) in favor of the alternative hypothesis.
Spearman’s correlation test
#Spearman correlation test:
cortest <- cor.test(df$TEAM_PITCHING_H, df$TARGET_WINS, method = "spearman")
cortest
##
## Spearman's rank correlation rho
##
## data: df$TEAM_PITCHING_H and df$TARGET_WINS
## S = 1545200000, p-value < 0.00000000000000022
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.2136212
The p-value of Spearman's correlation test is reported as less than 0.00000000000000022, which is below the significance level of 0.05. We reject the null hypothesis (\(H_0\)) in favor of the alternative hypothesis.
Kendall’s Tau correlation test
#Kendall's Tau correlation test:
cortest <- cor.test(df$TEAM_PITCHING_H, df$TARGET_WINS, method = "kendall")
cortest
##
## Kendall's rank correlation tau
##
## data: df$TEAM_PITCHING_H and df$TARGET_WINS
## z = 10.719, p-value < 0.00000000000000022
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## 0.1514105
The p-value of Kendall's correlation test is reported as less than 0.00000000000000022, which is below the significance level of 0.05. We reject the null hypothesis (\(H_0\)) in favor of the alternative hypothesis.
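The value 0.00000000000000022 printed in these outputs is a display floor rather than the p-value itself. A minimal sketch of reading the stored value from the object returned by `cor.test`, using hypothetical toy data in place of the moneyball columns:

```r
# Hypothetical toy data, used only to show how to read the test object.
x <- 1:200
y <- x + rep(c(-3, 3), 100)

res <- cor.test(x, y, method = "pearson")

# The numeric p-value and the coefficient are stored in the result.
res$p.value
res$estimate

# TRUE when the p-value falls below machine epsilon (about 2.2e-16),
# which is when the printed output switches to the "<" form.
res$p.value < .Machine$double.eps
```

The same `$p.value` and `$estimate` fields are available for `method = "spearman"` and `method = "kendall"`.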
Pearson, Spearman and Kendall's Tau correlation coefficients.
Pearson's correlation coefficient is best suited to data with a linear relationship. Spearman's and Kendall's correlation coefficients describe the relationship better when it is monotonic but not necessarily linear, for example an exponential pattern.
Pearson's coefficient is affected by outliers, whereas Spearman's and Kendall's coefficients are more resistant to outliers in the data because they are calculated from ranks and not the actual values.
Spearman's (0.21) and Kendall's (0.15) correlation coefficients show a stronger, positive relationship, but they only describe observed co-variation. They do not provide evidence of a causal relationship between TEAM_PITCHING_H and TARGET_WINS. In other words, allowing more hits while pitching does not improve the chances of winning the game. In fact the opposite is true: it decreases the chances of winning.
Pearson's correlation coefficient of -0.11 indicates a very weak inverse relation, but it captures the expected direction of the relationship: the variables are inversely related. I also suggest conducting all three tests before drawing any inference.
Pearson’s coefficient,
\[r_p = \frac{n\sum XY - (\sum X)(\sum Y)}{\sqrt{ [n\sum X^2 - (\sum X)^2] [n\sum Y^2 - (\sum Y)^2]} }\]
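As a sanity check on this formula, the sketch below evaluates it term by term on a small made-up sample and compares the result with R's built-in `cor`. The data and variable names are illustrative only.

```r
# Hypothetical sample, chosen only to exercise the formula.
x <- c(2, 4, 5, 7, 9)
y <- c(1, 3, 4, 8, 9)
n_obs <- length(x)  # number of observations

# Numerator and denominator written exactly as in the formula above.
num <- n_obs * sum(x * y) - sum(x) * sum(y)
den <- sqrt((n_obs * sum(x^2) - sum(x)^2) * (n_obs * sum(y^2) - sum(y)^2))
r_manual <- num / den

r_builtin <- cor(x, y, method = "pearson")
print(c(manual = r_manual, builtin = r_builtin))
```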
Spearman’s coefficient, where \(d_i\) is the difference between the ranks of the two variables for observation \(i\).
\[r_s = 1 - \frac{6\sum {d_i}^2}{n(n^2 - 1)}\]
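The rank-difference formula can be checked the same way. The sketch below uses made-up values without ties, since this shortcut form is exact only when there are no tied ranks; with ties, the coefficient is computed as Pearson's correlation on the ranks.

```r
# Hypothetical sample with no tied values in either variable.
x <- c(10, 20, 30, 40, 50, 60)
y <- c(1.2, 0.9, 3.5, 2.8, 6.1, 5.0)
n_obs <- length(x)

# d_i: difference between the rank of x_i and the rank of y_i.
d <- rank(x) - rank(y)
rs_manual <- 1 - 6 * sum(d^2) / (n_obs * (n_obs^2 - 1))

rs_builtin <- cor(x, y, method = "spearman")
print(c(manual = rs_manual, builtin = rs_builtin))
```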
Kendall’s Tau coefficient,
\[\tau = \frac{C - D}{\sqrt{(C+D+X_0)(C+D+Y_0)}} \]
Where, \[C = \text{number of concordant pairs},\ D = \text{number of discordant pairs},\ X_0 = \text{number of X-ties},\ Y_0 = \text{number of Y-ties}\]
A concordant pair (\(C\)) is one where \(X_j > X_i\) and \(Y_j > Y_i\), or \(X_j < X_i\) and \(Y_j < Y_i\), where \(j > i\)
A discordant pair (\(D\)) is one where \(X_j > X_i\) and \(Y_j < Y_i\), or \(X_j < X_i\) and \(Y_j > Y_i\), where \(j > i\)
An X-tie (\(X_0\)) is a pair where \(X_j = X_i\) and \(Y_j \ne Y_i\), where \(j > i\)
A Y-tie (\(Y_0\)) is a pair where \(X_j \ne X_i\) and \(Y_j = Y_i\), where \(j > i\)
\(n\) is the number of observations.
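The pair definitions above can be turned into a direct count. The sketch below loops over every pair with \(j > i\) in a small made-up sample that contains one X-tie and one Y-tie, then compares the result with `cor(..., method = "kendall")`. Variable names are illustrative, and the double loop is meant for clarity, not speed.

```r
# Hypothetical sample with one tie in x and one tie in y.
x <- c(1, 2, 2, 3, 4, 5)
y <- c(2, 1, 3, 3, 5, 4)
n_obs <- length(x)

conc <- 0; disc <- 0; x_ties <- 0; y_ties <- 0
for (i in 1:(n_obs - 1)) {
  for (j in (i + 1):n_obs) {
    dx <- sign(x[j] - x[i])
    dy <- sign(y[j] - y[i])
    if (dx * dy > 0) {
      conc <- conc + 1          # both move in the same direction
    } else if (dx * dy < 0) {
      disc <- disc + 1          # they move in opposite directions
    } else if (dx == 0 && dy != 0) {
      x_ties <- x_ties + 1      # tied on X only
    } else if (dx != 0 && dy == 0) {
      y_ties <- y_ties + 1      # tied on Y only
    }
    # pairs tied on both X and Y are left out of every count
  }
}

tau_manual <- (conc - disc) /
  sqrt((conc + disc + x_ties) * (conc + disc + y_ties))
tau_builtin <- cor(x, y, method = "kendall")
print(c(manual = tau_manual, builtin = tau_builtin))
```

Here the counts come out as 11 concordant pairs, 2 discordant pairs, 1 X-tie and 1 Y-tie, so the coefficient is \((11 - 2)/\sqrt{14 \times 14} = 9/14\).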