library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
dataset <- read_delim("/Users/matthewjobe/Downloads/quasi_winshares.csv", delim = ",")
## Rows: 98796 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): name_common, player_ID, team_ID, lg_ID, def_pos, franch_id, prev_fr...
## dbl (8): age, year_ID, pct_PT, WAR162, quasi_ws, stint_ID, year_acq, year_left
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary(dataset) #getting a quick summary of the dataset
## name_common age player_ID year_ID
## Length:98796 Min. : 0.00 Length:98796 Min. :1901
## Class :character 1st Qu.:25.00 Class :character 1st Qu.:1947
## Mode :character Median :27.00 Mode :character Median :1980
## Mean :27.87 Mean :1973
## 3rd Qu.:31.00 3rd Qu.:2002
## Max. :58.00 Max. :2019
## NA's :1
## team_ID lg_ID pct_PT WAR162
## Length:98796 Length:98796 Min. : 0.0000 Min. :-4.2035
## Class :character Class :character 1st Qu.: 0.5101 1st Qu.:-0.1376
## Mode :character Mode :character Median : 1.8586 Median : 0.1921
## Mean : 2.5568 Mean : 0.8487
## 3rd Qu.: 4.3258 3rd Qu.: 1.3448
## Max. :14.6878 Max. :15.4913
##
## def_pos quasi_ws stint_ID franch_id
## Length:98796 Min. : 0.000 Min. :1.000 Length:98796
## Class :character 1st Qu.: 0.000 1st Qu.:1.000 Class :character
## Mode :character Median : 3.000 Median :1.000 Mode :character
## Mean : 6.187 Mean :1.076
## 3rd Qu.:10.000 3rd Qu.:1.000
## Max. :58.000 Max. :5.000
##
## prev_franch year_acq year_left next_franch
## Length:98796 Min. :1901 Min. :1901 Length:98796
## Class :character 1st Qu.:1945 1st Qu.:1949 Class :character
## Mode :character Median :1978 Median :1982 Mode :character
## Mean :1972 Mean :1975
## 3rd Qu.:2001 3rd Qu.:2003
## Max. :2019 Max. :2019
##
In the code below, we will find the mean WAR162 of players who are younger than 35 years old.
dataset |>
  filter(age < 35) |>    # players younger than 35
  pluck("WAR162") |>     # extract the WAR162 column as a vector
  mean()
## [1] 0.8514536
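The same filter-then-average step can also be written with `summarise()`, which keeps the result in a tibble and reports how many rows went into the mean. The sketch below runs on a small made-up tibble (`toy` is hypothetical, not the real data) and illustrates one thing worth knowing about `filter()`: rows where `age` is `NA` are dropped, because the comparison `age < 35` evaluates to `NA` for them.

```r
library(dplyr)

# Hypothetical toy data standing in for the real dataset
toy <- tibble(
  age    = c(24, 30, 36, NA, 33),
  WAR162 = c(1.0, 2.0, 5.0, 9.0, 3.0)
)

under_35 <- toy |>
  filter(age < 35) |>             # rows with NA age are dropped here
  summarise(
    n_rows      = n(),            # how many rows survived the filter
    mean_WAR162 = mean(WAR162)    # average WAR162 of those rows
  )

under_35
```

Reporting `n_rows` next to the mean makes it easier to judge how much data the average rests on.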
In the code below, we will find the age of the oldest player in this data set who had an above-average WAR162.
dataset |>
  filter(WAR162 > mean(WAR162)) |>   # keep rows where WAR162 is above the overall average
  pluck("age") |>                    # extract the age column as a vector
  max()                              # oldest age with an above-average WAR162
## [1] 48
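`max()` returns only the age. To see which row it belongs to, one option is `slice_max()`, which keeps the whole row (or rows, if several tie). This is a minimal sketch on made-up data; `toy` and the player names are hypothetical.

```r
library(dplyr)

# Hypothetical toy data standing in for the real dataset
toy <- tibble(
  name_common = c("Player A", "Player B", "Player C", "Player D"),
  age         = c(25, 41, 44, 29),
  WAR162      = c(3.0, 2.5, -0.5, 0.2)
)

oldest_above_avg <- toy |>
  filter(WAR162 > mean(WAR162)) |>   # mean is computed over all rows of toy
  slice_max(age, n = 1)              # keep the oldest remaining row(s); ties are kept

oldest_above_avg
```

Here Player C is the oldest overall but has a below-average WAR162, so the filter removes that row before the maximum is taken.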
Would filtering out observations that fall outside the last 15 years be an appropriate way of filtering this data set? There are around 98,000 rows in this data set, and I need to bring this number down to about 20,000 to make it more manageable.
Would changing the def_pos of players who had multiple positions to “UTY” for utility be a good way to clean that column?
Do statistics like WAR162 or quasi_ws (quasi win shares) tend to increase or decrease as more seasons are played?
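One way the “UTY” recoding could look is sketched below. It rests on an assumption that may not match the real file: that a player with multiple positions shows up as several rows for the same `player_ID` and `year_ID` with different `def_pos` values. The toy data, the helper column `multi_pos`, and the output column `def_pos_clean` are all hypothetical.

```r
library(dplyr)

# Hypothetical toy data; assumes multiple positions appear as multiple rows
toy <- tibble(
  player_ID = c("a01", "a01", "b02", "c03", "c03"),
  year_ID   = c(2010, 2010, 2010, 2011, 2011),
  def_pos   = c("SS", "2B", "C", "LF", "LF")
)

recoded <- toy |>
  group_by(player_ID, year_ID) |>
  mutate(
    multi_pos     = n_distinct(def_pos) > 1,            # TRUE if the player-season has 2+ positions
    def_pos_clean = if_else(multi_pos, "UTY", def_pos)  # recode only those player-seasons
  ) |>
  ungroup()

recoded
```

Writing to a new column rather than overwriting `def_pos` keeps the original positions available for checking the recode.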
dataset |>
  filter(year_ID > 2004) |>   # keep only the last 15 seasons (2005 to 2019)
  nrow()                      # how many rows would be included
## [1] 21414
Based on this filtering method, we would still have 21,414 rows, which is much more manageable than the almost 100,000 we had before.
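The earlier question about whether WAR162 or quasi_ws tends to rise or fall across seasons could be approached by averaging both columns per `year_ID`. The sketch below shows the shape of that summary on made-up numbers; `toy` and its values are hypothetical and say nothing about the real trend.

```r
library(dplyr)

# Hypothetical toy data standing in for the real dataset
toy <- tibble(
  year_ID  = c(2017, 2017, 2018, 2018, 2019, 2019),
  WAR162   = c(1.0, 3.0, 0.5, 1.5, 2.0, 4.0),
  quasi_ws = c(4, 10, 2, 6, 8, 12)
)

by_season <- toy |>
  group_by(year_ID) |>
  summarise(
    mean_WAR162   = mean(WAR162),     # season average of WAR162
    mean_quasi_ws = mean(quasi_ws),   # season average of quasi win shares
    .groups = "drop"
  ) |>
  arrange(year_ID)                    # order seasons chronologically

by_season
```

The resulting one-row-per-season table can be passed straight to `ggplot()` with `geom_line()` to look at the trend.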
In the visual below, we will take a look at how WAR162 and quasi_ws are related. WAR162 measures a player's value by comparing them to a replacement-level player (Wins Above Replacement), and Quasi Winshare is
dataset |>
  filter(year_ID > 2004) |>
  ggplot() +
  geom_point(mapping = aes(x = WAR162, y = quasi_ws), color = 'darkred') +
  labs(title = "WAR162 vs. Quasi Winshare",
       x = 'WAR162', y = 'Quasi Winshare') +
  theme_classic()
In the visual above, there appears to be a positive correlation between WAR162 and Quasi Winshare for the 2005 to 2019 seasons (the filter keeps year_ID > 2004): higher WAR162 is associated with higher Quasi Winshares. No formal test was run here, so this is a visual impression rather than a statement of statistical significance. This makes me wonder whether performance in these columns is associated with the number of years a player spends with a franchise.
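A scatterplot only suggests an association. A numeric check could use base R's `cor()` for the correlation coefficient and `cor.test()` for a test of whether it differs from zero. The sketch below uses made-up values (`toy` is hypothetical), so its result says nothing about the real data.

```r
# Hypothetical toy data standing in for the filtered dataset
toy <- data.frame(
  WAR162   = c(-0.5, 0.2, 1.0, 2.5, 4.0),
  quasi_ws = c(0, 2, 6, 12, 20)
)

# Pearson correlation, ignoring any rows with missing values
r <- cor(toy$WAR162, toy$quasi_ws, use = "complete.obs")

# Test of the null hypothesis that the true correlation is zero
test <- cor.test(toy$WAR162, toy$quasi_ws)

r
test$p.value
```

On the real data, the same two calls would be applied to the rows kept by `filter(year_ID > 2004)`.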