options(scipen=999)
library(ggplot2)
library(dplyr)
library(reshape2)
library(DMwR)
library(knitr)
library(kableExtra)

BaseballDf <- read.csv("https://raw.githubusercontent.com/akulapa/Data621-Week05-Discussion/master/moneyball-training-data.csv", header= TRUE, stringsAsFactors = F)
BaseballDf$TEAM_BATTING_1B = BaseballDf$TEAM_BATTING_H - BaseballDf$TEAM_BATTING_2B - BaseballDf$TEAM_BATTING_3B - BaseballDf$TEAM_BATTING_HR

BaseballDf_kNN <- BaseballDf %>% 
  select(TARGET_WINS, TEAM_BATTING_2B,TEAM_BATTING_3B,TEAM_BATTING_HR,TEAM_BATTING_BB,TEAM_BATTING_SO,TEAM_BASERUN_SB,TEAM_PITCHING_H,TEAM_PITCHING_SO,TEAM_FIELDING_E,TEAM_FIELDING_DP,TEAM_BATTING_1B)

BaseballDf_kNN <- knnImputation(BaseballDf_kNN, 15, meth='weighAvg')

for(i in 1:ncol(BaseballDf_kNN)){
  BaseballDf_kNN[is.na(BaseballDf_kNN[,i]), i] <- as.numeric(as.character(BaseballDf_kNN[is.na(BaseballDf_kNN[,i]), i]))
}

df <- BaseballDf_kNN %>% 
  select(TEAM_PITCHING_H, TARGET_WINS)

cor.df <- data.frame(Name = "Pearson's", "cor" = round(cor(df, method = "pearson", use = "pairwise.complete.obs")[2],2))
cor.df <- rbind(cor.df, data.frame(Name = "Spearman's", "cor" = round(cor(df, method = "spearman", use = "pairwise.complete.obs")[2],2)))
cor.df <- rbind(cor.df, data.frame(Name = "Kendall's Tau", "cor" = round(cor(df, method = "kendall", use = "pairwise.complete.obs")[2],2)))

While working on Data621 project-1 using moneyball dataset, I noticed correlation coefficient computed using Pearson, Spearman and Kendall's Tau methods is not same. Especially correlation coefficient between variables TEAM_PITCHING_H and TARGET_WINS is different.

cor.df %>% 
  kable("html",caption = "Correlation Coefficient", aligh="c") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center", font_size = 12)
Correlation Coefficient
Name cor
Pearson’s -0.11
Spearman’s 0.21
Kendall’s Tau 0.15

Though correlation between variables is not strong, it does vary from method to method. Pearson's method show negative correlation whereas Spearman's and Kendall's method show positive correlation between variables. This observation geared me towards writing this blog.

In this blog post, I will discuss

Data Normality Tests

Step one, inspect the data distribution using scatter-plot. The pattern we are trying to identify does TARGET_WINS value increase or decrease with a change in TEAM_PITCHING_H variable.

df <- BaseballDf_kNN %>% 
  select(TEAM_PITCHING_H, TARGET_WINS)

df.lm <-lm(TARGET_WINS~TEAM_PITCHING_H, data=df)


if (coef(df.lm)[2] < 0) {
  z <- list(xx = format(coef(df.lm)[1], digits = 4),
          yy = format(abs(coef(df.lm)[2]), digits = 4),
          r2 = format(summary(df.lm)$r.squared, digits = 3));
  eq <- substitute(italic(hat(y)) == xx - yy %.% italic(x)*","~~italic(r)^2~"="~r2,z)  
} else {
    z <- list(xx = format(coef(df.lm)[1], digits = 4),
        yy = format(abs(coef(df.lm)[2]), digits = 4),
        r2 = format(summary(df.lm)$r.squared, digits = 3));
    
  eq <- substitute(italic(hat(y)) == xx + yy %.% italic(x)*","~~italic(R)^2~"="~r2,z) 
}


df.lmeq <- as.character(as.expression(eq))

ggplot(df, aes(x=TEAM_PITCHING_H,y=TARGET_WINS)) + 
  geom_point(shape=1, size=1, color="black", alpha=1/2) +
    geom_smooth(method=lm, se=T, color="red", size=0.5) +
  labs(title = sprintf("Scatter Plot"), subtitle = "TEAM_PITCHING_H Vs. TARGET_WINS") + xlab("TEAM_PITCHING_H") + ylab("TARGET_WINS") +
  annotate("text", x = 20000, y = 75, label = df.lmeq, colour="red", size = 3.5, parse=T) + theme(
  panel.background = element_rect(fill = "lightblue",
                                colour = "lightblue",
                                size = 0.5, linetype = "solid"),
  panel.grid.major = element_line(size = 0.5, linetype = 'solid',
                                colour = "white"), 
  panel.grid.minor = element_line(size = 0.25, linetype = 'solid',
                                colour = "white"))

Looking at above plot, one can notice

However, \(\beta_1\) value is -0.001231, indicates for every hit allowed during pitching, chances of winning the game are decreased by 0.001231.

Step two,

Let’s test if two variables are normally distributed. I will be using Shapiro-Wilk test. The null hypothesis of this test is that the population is normally distributed.

\(H_0:\) Data is normally distributed.

\(H_A:\) Data is not normally distributed.

#Test TEAM_PITCHING_H
shapiro.test(df$TEAM_PITCHING_H)
## 
##  Shapiro-Wilk normality test
## 
## data:  df$TEAM_PITCHING_H
## W = 0.24611, p-value < 0.00000000000000022
#Test TARGET_WINS
shapiro.test(df$TARGET_WINS)
## 
##  Shapiro-Wilk normality test
## 
## data:  df$TARGET_WINS
## W = 0.98825, p-value = 0.000000000001007

For both variables p-value is less than significance level 0.05, indicating data is not normally distributed.

#Test TEAM_PITCHING_H

ggplot(data=as.data.frame(qqnorm( df$TEAM_PITCHING_H , plot=F)), mapping=aes(x=x, y=y)) + 
    geom_point(shape=1, size=1, color="black", alpha=1/2) + geom_smooth(method="lm", se=FALSE, color = "red", size = 0.5) + theme(
  panel.background = element_rect(fill = "lightblue",
                                colour = "lightblue",
                                size = 0.5, linetype = "solid"),
  panel.grid.major = element_line(size = 0.5, linetype = 'solid',
                                colour = "white"), 
  panel.grid.minor = element_line(size = 0.25, linetype = 'solid',
                                colour = "white")) + labs(title = sprintf("Q-Q Normal Plot"), subtitle = "TEAM_PITCHING_H")

#Test TARGET_WINS
ggplot(data=as.data.frame(qqnorm( df$TARGET_WINS , plot=F)), mapping=aes(x=x, y=y)) + 
    geom_point(shape=1, size=1, color="black", alpha=1/2) + geom_smooth(method="lm", se=FALSE, color = "red", size = 0.5) + theme(
  panel.background = element_rect(fill = "lightblue",
                                colour = "lightblue",
                                size = 0.5, linetype = "solid"),
  panel.grid.major = element_line(size = 0.5, linetype = 'solid',
                                colour = "white"), 
  panel.grid.minor = element_line(size = 0.25, linetype = 'solid',
                                colour = "white")) + labs(title = sprintf("Q-Q Normal Plot"), subtitle = "TARGET_WINS")

QQ-Plot for both variables show data is not normally distributed.

Correlation Tests

While examining a relationship between two quantitative variables, we are looking for

Null hypothesis for correlation test is there is no relationship between the variables. Alternative is there is some relationship between variables.

\(H_0: \rho = 0\) Hits allowed during pitching has no impact on output of a baseball game.

\(H_A: \rho \ne 0\) Hits allowed during pitching does impact the output of a baseball game.

Pearson’s correlation test

#Pearson correlation test:
cortest <- cor.test(df$TEAM_PITCHING_H, df$TARGET_WINS, method = "pearson")
cortest
## 
##  Pearson's product-moment correlation
## 
## data:  df$TEAM_PITCHING_H and df$TARGET_WINS
## t = -5.2745, df = 2274, p-value = 0.0000001457
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.15034489 -0.06916252
## sample estimates:
##        cor 
## -0.1099371

p-value of Pearson's correlation test is 0.0000001457, which is less than significance level 0.05. We reject the null hypothesis (\(H_0\)) and accept the alternative hypothesis.

Spearman’s correlation test

#Spearman correlation test:
cortest <- cor.test(df$TEAM_PITCHING_H, df$TARGET_WINS, method = "spearman")
cortest
## 
##  Spearman's rank correlation rho
## 
## data:  df$TEAM_PITCHING_H and df$TARGET_WINS
## S = 1545200000, p-value < 0.00000000000000022
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.2136212

p-value of Spearman's correlation test is 0.00000000000000022, which is less than significance level 0.05. We reject the null hypothesis (\(H_0\)) and accept the alternative hypothesis.

Kendall’s Tau correlation test

#Kendall's Tau correlation test:
cortest <- cor.test(df$TEAM_PITCHING_H, df$TARGET_WINS, method = "kendall")
cortest
## 
##  Kendall's rank correlation tau
## 
## data:  df$TEAM_PITCHING_H and df$TARGET_WINS
## z = 10.719, p-value < 0.00000000000000022
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##       tau 
## 0.1514105

p-value of Kendall's correlation test is 0.00000000000000022, which is less than significance level 0.05. We reject the null hypothesis (\(H_0\)) and accept the alternative hypothesis.

Comparision between Pearson, Spearman and Kendall's Tau correlation coefficients.

Formulas

Pearson’s coefficent,

\[r_p = \frac{N\sum XY - (\sum X)(\sum Y)}{\sqrt{ [N\sum X^2 - (\sum X)^2] [N\sum Y^2 - (\sum Y)^2]} }\]

Spearman’s coefficent, where the \(d_i\) difference between ranks.

\[r_s = 1 - \frac{6\sum {d_i}^2}{n(n^2 - 1)}\]

Kendall’s Tau coefficent,

\[\tau = \frac{C - D}{\sqrt{(C+D+X_0)(C+D+Y_0)}} \]

Where, \[C = \sum Concordance\ pairs,\ D = \sum Discordance\ pairs,\ X_0 = \sum X-ties, ,\ Y_0 = \sum Y-ties\]

Concordance pair(\(C\)) is defined as \(X_j > X_i\) and \(Y_j > Y_i\) or \(X_j < X_i\) and \(Y_j < Y_i\), where \(j > i\)

Discordance pair(\(D\)) is \(X_j > X_i\) and \(Y_j < Y_i\) or \(X_j < X_i\) and \(Y_j > Y_i\), where \(j > i\)

X-ties(\(X_0\)) is \(X_j = X_i\) and \(Y_j \ne Y_i\), where \(j > i\)

Y-ties(\(Y_0\)) is \(X_j \ne X_i\) and \(Y_j = Y_i\), where \(j > i\)

\(n\) = Observations

References