This is an R Markdown
Notebook. When you execute code within the notebook, the results appear
beneath the code.
Week 8 Data Dive
library(tidyverse)
library(ggplot2)
library(dplyr)
library(tidyverse)
library(ggthemes)
library(ggrepel)
library(effsize)
library(pwrss)
library(tidyverse)
library(ggthemes)
library(ggrepel)
library(AmesHousing)
library(boot)
library(broom)
library(lindia)
# remove scientific notation
options(scipen = 6)
# default theme, unless otherwise noted
theme_set(theme_minimal())
df<- read_delim("/Users/matthewjobe/Downloads/quasi_winshares.csv", delim = ",")
Rows: 98796 Columns: 16── Column specification ───────────────────────────────────
Delimiter: ","
chr (8): name_common, player_ID, team_ID, lg_ID, def_po...
dbl (8): age, year_ID, pct_PT, WAR162, quasi_ws, stint_...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
library(dplyr) #Changing all instances where a player has multiple positions, to "UTL"
library(stringr) #UTL stands for utility player
df |>
filter(year_ID>2004 & year_ID<2019)|>
mutate(def_pos = ifelse(str_detect(def_pos, ","), "UTL", def_pos))|>
group_by(def_pos)|> #Grouping based on defensive position
summarize(count1=n(),#count of player by year
mean(WAR162))
df
df <- df |>
mutate(def_pos = ifelse(str_detect(def_pos, ","), "UTL", def_pos))
# Check if changes were applied
df |> filter(def_pos == "UTL") |> head()
NA
The two columns I will be looking at this week are the “WAR162” and
“def_pos” columns. WAR162 stand Wins Above Replacement per 162 team
games. The def-pos column represents what position they player plays.
Because some players have multiple positions, I changed the pertaining
values to “UTL” (utility player).
WAR162 is the response variable and def_pos is the explanatory
variable. I am interested in whether the position played has any effect
on WAR162.
Null hypothesis: Average WAR162 is equal across all positions.
df <- df |> drop_na() #getting ride of rows with missing values
df_new <- df |> filter(year_ID > 2004)
df_new |>
filter(year_ID>2004 )|>
ggplot() +
geom_boxplot(mapping = aes(y = WAR162, x = def_pos)) +
#scale_y_log10(labels = \(x) paste('$', x / 1000, 'K')) +
annotation_logticks(sides = 'l') +
labs(x = "Position",
y = "WAR162")

Above, we can see that some positions typically have higher WAR162
statistics that others. I want to know if the differences are
significant enough to challenge the hypothesis that they are all
equal.
df_new |> ggplot(aes(x = WAR162)) +
geom_histogram(bins = 40, fill = "blue", alpha = 0.5) +
labs(title = "Histogram of WAR162")

The histogram above shows a WAR162 distribution. It seems be centered
at 0, and have a slight right skew. This means most players contribute a
neutral value to their ream, and there are more players with very high
WAR162 stats than low ones.
m <- aov(WAR162 ~ def_pos, data = df_new)
summary(m)
Df Sum Sq Mean Sq F value Pr(>F)
def_pos 9 2822 313.60 159.1 <2e-16 ***
Residuals 21277 41927 1.97
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
With the above code we get a very small p-value of essentially zero,
which indicates that is unlikely that all of the mean WAR162’s by
position are the same and at least one is different.
Because the p-value is so small I reject the null hypothesis that
average WAR162 is equal across all positions. This suggests that one or
more position has significantly different WAR162 compared to other.
This is interesting as defensive position is associated with their
WAR162 statistics. When constructing a team, certain positions may have
less value than others. It also ties into salary, as teams may be
willing to pay more for a “premium position”.
pairwise.t.test(df_new$WAR162, df_new$def_pos, p.adjust.method = "bonferroni")
Pairwise comparisons using t tests with pooled SD
data: df_new$WAR162 and df_new$def_pos
1B 2B 3B C CF LF
2B 2.7e-05 - - - - -
3B 1.8e-14 0.14524 - - - -
C < 2e-16 < 2e-16 < 2e-16 - - -
CF < 2e-16 0.00065 1.00000 < 2e-16 - -
LF 1.00000 3.9e-05 1.2e-12 3.8e-11 < 2e-16 -
P < 2e-16 < 2e-16 < 2e-16 0.07561 < 2e-16 1.7e-08
RF 0.02550 1.00000 0.01222 < 2e-16 3.8e-05 0.01178
SS 6.6e-16 0.19170 1.00000 < 2e-16 1.00000 2.6e-13
UTL < 2e-16 < 2e-16 < 2e-16 0.00358 < 2e-16 5.9e-07
P RF SS
2B - - -
3B - - -
C - - -
CF - - -
LF - - -
P - - -
RF < 2e-16 - -
SS < 2e-16 0.01509 -
UTL 1.00000 < 2e-16 < 2e-16
P value adjustment method: bonferroni
Above, a multiple pairwise t-tests to compares each grouping of row
and column. It tells us that there are definitely some significant
differences such as:
Catcher is significantly different from every other
position.
Pitcher is significantly different from all roles except
“utility” players and catchers.
Utility players are significantly different from all roles except
pitcher.
Left Field is significantly different from all positions except
First Base
First Base is significantly different from all positions except
Left Field
# we use a "_" so we don't overwrite the original function
boot_ci <- function (v, func = median, conf = 0.95, n_iter = 100) {
# the function we want to run on each iteration i
boot_func <- \(x, i) func(x[i])
# the boot instance, running the function on each iteration
b <- boot(v, boot_func, R = n_iter)
b <- boot.ci(b, conf = conf, type = "perc")
# return just the lower and upper values
return(c("lower" = b$percent[4],
"upper" = b$percent[5]))
}
df_ci <- df_new |>
group_by(def_pos) |>
summarise(ci_lower = boot_ci(WAR162, mean)['lower'],
mean_war = mean(WAR162),
ci_upper = boot_ci(WAR162, mean)['upper'])
df_ci
df_ci |>
ggplot() +
geom_errorbarh(mapping = aes(y = def_pos,
xmin=ci_lower, xmax=ci_upper,
color = '95% C.I.'), height = 0.5) +
geom_point(mapping = aes(x = mean_war, y = def_pos,
color = 'Group Mean'),
shape = '|',
size = 5) +
scale_color_manual(values=c('black', 'orange')) +
labs(title = "WAR162 By Defensive Position",
x = "WAR162",
y = "Defensive Position",
color = '')

The visual above shows us the WAR162’s by position. It makes it much
easier to spot some of the differences in WAR162 by position. The
visualization supports the statements above:
Catcher is significantly different from every other
position.
Pitcher is significantly different from all roles except
“utility” players and catchers.
Utility players are significantly different from all roles except
pitcher.
Left Field is significantly different from all positions except
First Base
First Base is significantly different from all positions except
Left Field
LINEAR REGRESSION
Next I will build a linear regression model for the variables share
of playing time and “WAR162”. This will estimate a relationship between
the two variables, showing how much WAR162 changes with playing
time.
df_new |>
ggplot(mapping = aes(x = pct_PT, y = WAR162)) +
geom_point(size = 2) +
geom_smooth(method = "lm", se = FALSE, color = 'darkblue') +
labs(title = "Linear Relationship Between Playing Time and WAR162",
x = "Percent of Playing Time",
y = "WAR162") +
theme_minimal()

model <- lm(WAR162 ~ pct_PT, data = df_new)
summary(model)
Call:
lm(formula = WAR162 ~ pct_PT, data = df_new)
Residuals:
Min 1Q Median 3Q Max
-5.2702 -0.4408 0.0741 0.4078 7.9911
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.443377 0.010095 -43.92 <2e-16 ***
pct_PT 0.543718 0.003525 154.26 <2e-16 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9963 on 21285 degrees of freedom
Multiple R-squared: 0.5279, Adjusted R-squared: 0.5278
F-statistic: 2.38e+04 on 1 and 21285 DF, p-value: < 2.2e-16
There does appear to be a linear association between share of playing
time and WAR162. There is an estimated slope of 0.544, which means that
for every 1 percent point in increase in pct_PT, WAR162 increases by
0.544. The small p-value also shows me that pct_PT (playing time) has a
statistically significant relationship with WAR162. Additionally, we can
see that Multiple R-Squared value is about 0.53. This means that
approximately 53% of the variance in WAR162 is explained by pct_PT.
# the line estimated above
lm_ <- \(x) (1 / 2) * x + 3
df_new |>
ggplot(mapping = aes(x = pct_PT, y = WAR162)) +
geom_point(size = 2) +
geom_segment(mapping = aes(xend = pct_PT, y = WAR162, yend = lm_(WAR162),
color = 'error'), linewidth = 1) +
geom_smooth(method = "lm", se = FALSE, color = 'red', linewidth = 0.5) +
scale_color_manual(values = c('darkblue')) +
labs(color = '')

There does appear to be quite a bit of error in the visualization
above.
beta_0 <- 3
beta_1 <- 1/2
lm_ <- \(x) beta_1 * x + beta_0
df_new |>
ggplot() +
geom_point(mapping = aes(x = pct_PT, y = WAR162), size = 2) +
geom_smooth(mapping = aes(x = pct_PT, y = WAR162), method = "lm", se = FALSE, color = 'red', linewidth = 0.5) +
geom_rect(mapping = aes(xmin = pct_PT,
xmax = pct_PT + abs(WAR162 - lm_(pct_PT)),
ymin = lm_(pct_PT),
ymax = WAR162),
fill = NA, color = 'darkblue') +
labs(color = '')

There are so many data points which makes it very hard to understand
the two visualizations above.
model <- lm(WAR162 ~ pct_PT, df_new)
model$coefficients
(Intercept) pct_PT
-0.4433769 0.5437183
From this output, we see that the expected WAR162, when the share of
playing time is zero, is -0.44. For every 1 percent point in increase in
pct_PT, WAR162 increases by 0.544.
I believe this is a good fit because the slope of 0.544 indicates a
meaning relationship. Additionally, there was a very small p-value which
shows that the predictor (pct_PT) is statistically significant.
However, the multiple R-squared value of 0.53 tells us that pct_PT
explains 53% of the variance in WAR162. This leaves the rest of the
variance unexplained, and there could be other factors influencing
WAR162.
I think that using multiple explanatory variables would have be an
optimal way of building this model. Other factors such as team and age
could also influence WAR162. Originally I was going to look at the
relationship with age and WAR162, but it was not linear. As players get
older, their WAR162 stats start to become lower.
---
title: "R Notebook"
output: html_notebook
---

This is an [R Markdown](http://rmarkdown.rstudio.com) Notebook. When you execute code within the notebook, the results appear beneath the code.

## Week 8 Data Dive

```{r}
library(tidyverse)
library(ggplot2)
library(dplyr)
library(tidyverse)
library(ggthemes)
library(ggrepel)
library(effsize)
library(pwrss)
library(tidyverse)
library(ggthemes)
library(ggrepel)
library(AmesHousing)
library(boot)
library(broom)
library(lindia)

# remove scientific notation
options(scipen = 6)

# default theme, unless otherwise noted
theme_set(theme_minimal())

df<- read_delim("/Users/matthewjobe/Downloads/quasi_winshares.csv", delim = ",")
```

```{r}
library(dplyr) #Changing all instances where a player has multiple positions, to "UTL"
library(stringr) #UTL stands for utility player




df |>
  filter(year_ID>2004 & year_ID<2019)|>
  mutate(def_pos = ifelse(str_detect(def_pos, ","), "UTL", def_pos))|>
  group_by(def_pos)|> #Grouping based on defensive position
  summarize(count1=n(),#count of player by year
  mean(WAR162))
df
```

```{r}
df <- df |> 
  mutate(def_pos = ifelse(str_detect(def_pos, ","), "UTL", def_pos))

# Check if changes were applied
df |> filter(def_pos == "UTL") |> head()

```

The two columns I will be looking at this week are the "WAR162" and "def_pos" columns. WAR162 stand Wins Above Replacement per 162 team games. The def-pos column represents what position they player plays. Because some players have multiple positions, I changed the pertaining values to "UTL" (utility player).

WAR162 is the response variable and def_pos is the explanatory variable. I am interested in whether the position played has any effect on WAR162.

Null hypothesis: Average WAR162 is equal across all positions.

```{r}
df <- df |> drop_na() #getting ride of rows with missing values
df_new <- df |> filter(year_ID > 2004)

```

```{r}
df_new |>
  filter(year_ID>2004 )|>
  ggplot() +
  geom_boxplot(mapping = aes(y = WAR162, x = def_pos)) +
  #scale_y_log10(labels = \(x) paste('$', x / 1000, 'K')) +
  annotation_logticks(sides = 'l') +
  labs(x = "Position",
       y = "WAR162")
```

Above, we can see that some positions typically have higher WAR162 statistics that others. I want to know if the differences are significant enough to challenge the hypothesis that they are all equal.

```{r}
df_new |> ggplot(aes(x = WAR162)) + 
  geom_histogram(bins = 40, fill = "blue", alpha = 0.5) +
  labs(title = "Histogram of WAR162")
```

The histogram above shows a WAR162 distribution. It seems be centered at 0, and have a slight right skew. This means most players contribute a neutral value to their ream, and there are more players with very high WAR162 stats than low ones.

```{r}
m <- aov(WAR162 ~ def_pos, data = df_new)
summary(m)
```

With the above code we get a very small p-value of essentially zero, which indicates that is unlikely that all of the mean WAR162's by position are the same and at least one is different.

Because the p-value is so small I reject the null hypothesis that average WAR162 is equal across all positions. This suggests that one or more position has significantly different WAR162 compared to other.

This is interesting as defensive position is associated with their WAR162 statistics. When constructing a team, certain positions may have less value than others. It also ties into salary, as teams may be willing to pay more for a "premium position".

```{r}
pairwise.t.test(df_new$WAR162, df_new$def_pos, p.adjust.method = "bonferroni")
```

Above, a multiple pairwise t-tests to compares each grouping of row and column. It tells us that there are definitely some significant differences such as:

1)  Catcher is significantly different from every other position.

2)  Pitcher is significantly different from all roles except "utility" players and catchers.

3)  Utility players are significantly different from all roles except pitcher.

4)  Left Field is significantly different from all positions except First Base

5)  First Base is significantly different from all positions except Left Field

```{r}
# we use a "_" so we don't overwrite the original function
boot_ci <- function (v, func = median, conf = 0.95, n_iter = 100) {
  # the function we want to run on each iteration i
  boot_func <- \(x, i) func(x[i])
  
  # the boot instance, running the function on each iteration
  b <- boot(v, boot_func, R = n_iter)
  b <- boot.ci(b, conf = conf, type = "perc")
  
  # return just the lower and upper values
  return(c("lower" = b$percent[4],
           "upper" = b$percent[5]))
}
```

```{r}
df_ci <- df_new |>
  group_by(def_pos) |>
  summarise(ci_lower = boot_ci(WAR162, mean)['lower'],
            mean_war = mean(WAR162),
            ci_upper = boot_ci(WAR162, mean)['upper'])

df_ci
```

```{r}
df_ci |>
  ggplot() +
  geom_errorbarh(mapping = aes(y = def_pos, 
                               xmin=ci_lower, xmax=ci_upper,
                               color = '95% C.I.'), height = 0.5) +
  geom_point(mapping = aes(x = mean_war, y = def_pos,
                           color = 'Group Mean'),
             shape = '|',
             size = 5) +
  scale_color_manual(values=c('black', 'orange')) +
  labs(title = "WAR162 By Defensive Position",
       x = "WAR162",
       y = "Defensive Position",
       color = '')
```

The visual above shows us the WAR162's by position. It makes it much easier to spot some of the differences in WAR162 by position. The visualization supports the statements above:

1)  Catcher is significantly different from every other position.

2)  Pitcher is significantly different from all roles except "utility" players and catchers.

3)  Utility players are significantly different from all roles except pitcher.

4)  Left Field is significantly different from all positions except First Base

5)  First Base is significantly different from all positions except Left Field

## LINEAR REGRESSION

Next I will build a linear regression model for the variables share of playing time and "WAR162". This will estimate a relationship between the two variables, showing how much WAR162 changes with playing time.

```{r}
df_new |>  
  ggplot(mapping = aes(x = pct_PT, y = WAR162)) +  
  geom_point(size = 2) +  
  geom_smooth(method = "lm", se = FALSE, color = 'darkblue') +  
  labs(title = "Linear Relationship Between Playing Time and WAR162",  
       x = "Percent of Playing Time",  
       y = "WAR162") +  
  theme_minimal()
```

```{r}
model <- lm(WAR162 ~ pct_PT, data = df_new)
summary(model)
```

There does appear to be a linear association between share of playing time and WAR162. There is an estimated slope of 0.544, which means that for every 1 percent point in increase in pct_PT, WAR162 increases by 0.544. The small p-value also shows me that pct_PT (playing time) has a statistically significant relationship with WAR162. Additionally, we can see that Multiple R-Squared value is about 0.53. This means that approximately 53% of the variance in WAR162 is explained by pct_PT.

```{r}
# the line estimated above
lm_ <- \(x) (1 / 2) * x + 3

df_new |>
  ggplot(mapping = aes(x = pct_PT, y = WAR162)) +
  geom_point(size = 2) +
  geom_segment(mapping = aes(xend = pct_PT, y = WAR162, yend = lm_(WAR162), 
                             color = 'error'), linewidth = 1) +
  geom_smooth(method = "lm", se = FALSE, color = 'red', linewidth = 0.5) + 
  scale_color_manual(values = c('darkblue')) +
  labs(color = '')
```

There does appear to be quite a bit of error in the visualization above.

```{r}
beta_0 <- 3
beta_1 <- 1/2

lm_ <- \(x) beta_1 * x + beta_0

df_new |>
  ggplot() +
  geom_point(mapping = aes(x = pct_PT, y = WAR162), size = 2) +
  geom_smooth(mapping = aes(x = pct_PT, y = WAR162), method = "lm", se = FALSE, color = 'red', linewidth = 0.5) + 
  geom_rect(mapping = aes(xmin = pct_PT, 
                          xmax = pct_PT + abs(WAR162 - lm_(pct_PT)),
                          ymin = lm_(pct_PT), 
                          ymax = WAR162), 
            fill = NA, color = 'darkblue') +
  labs(color = '')
```

There are so many data points which makes it very hard to understand the two visualizations above.

```{r}
model <- lm(WAR162 ~ pct_PT, df_new)
model$coefficients
```

From this output, we see that the expected WAR162, when the share of playing time is zero, is -0.44. For every 1 percent point in increase in pct_PT, WAR162 increases by 0.544.

I believe this is a good fit because the slope of 0.544 indicates a meaning relationship. Additionally, there was a very small p-value which shows that the predictor (pct_PT) is statistically significant.

However, the multiple R-squared value of 0.53 tells us that pct_PT explains 53% of the variance in WAR162. This leaves the rest of the variance unexplained, and there could be other factors influencing WAR162.

I think that using multiple explanatory variables would have be an optimal way of building this model. Other factors such as team and age could also influence WAR162. Originally I was going to look at the relationship with age and WAR162, but it was not linear. As players get older, their WAR162 stats start to become lower.
