Week 2 Data Dive

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
dataset <- read_delim("/Users/matthewjobe/Downloads/quasi_winshares.csv", delim = ",")
## Rows: 98796 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): name_common, player_ID, team_ID, lg_ID, def_pos, franch_id, prev_fr...
## dbl (8): age, year_ID, pct_PT, WAR162, quasi_ws, stint_ID, year_acq, year_left
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Summary Statistics

summary(dataset) #getting a quick summary of the dataset
##  name_common             age         player_ID            year_ID    
##  Length:98796       Min.   : 0.00   Length:98796       Min.   :1901  
##  Class :character   1st Qu.:25.00   Class :character   1st Qu.:1947  
##  Mode  :character   Median :27.00   Mode  :character   Median :1980  
##                     Mean   :27.87                      Mean   :1973  
##                     3rd Qu.:31.00                      3rd Qu.:2002  
##                     Max.   :58.00                      Max.   :2019  
##                     NA's   :1                                        
##    team_ID             lg_ID               pct_PT            WAR162       
##  Length:98796       Length:98796       Min.   : 0.0000   Min.   :-4.2035  
##  Class :character   Class :character   1st Qu.: 0.5101   1st Qu.:-0.1376  
##  Mode  :character   Mode  :character   Median : 1.8586   Median : 0.1921  
##                                        Mean   : 2.5568   Mean   : 0.8487  
##                                        3rd Qu.: 4.3258   3rd Qu.: 1.3448  
##                                        Max.   :14.6878   Max.   :15.4913  
##                                                                           
##    def_pos             quasi_ws         stint_ID      franch_id        
##  Length:98796       Min.   : 0.000   Min.   :1.000   Length:98796      
##  Class :character   1st Qu.: 0.000   1st Qu.:1.000   Class :character  
##  Mode  :character   Median : 3.000   Median :1.000   Mode  :character  
##                     Mean   : 6.187   Mean   :1.076                     
##                     3rd Qu.:10.000   3rd Qu.:1.000                     
##                     Max.   :58.000   Max.   :5.000                     
##                                                                        
##  prev_franch           year_acq      year_left    next_franch       
##  Length:98796       Min.   :1901   Min.   :1901   Length:98796      
##  Class :character   1st Qu.:1945   1st Qu.:1949   Class :character  
##  Mode  :character   Median :1978   Median :1982   Mode  :character  
##                     Mean   :1972   Mean   :1975                     
##                     3rd Qu.:2001   3rd Qu.:2003                     
##                     Max.   :2019   Max.   :2019                     
## 

In the code below, we will find the mean WAR162 of players who are younger than 35 years old.

dataset|>   
  filter(age<35)|> #players younger than 35
  pluck("WAR162") |>          #get the WAR162
  mean()
## [1] 0.8514536
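
The summary above shows one missing value in age, and filter() silently drops rows where the condition evaluates to NA. As a rough sketch of how the under-35 mean could be compared against the 35-and-over group while handling that missing age explicitly, the code below uses a small made-up tibble (`toy` and its values are assumptions for illustration, not rows from the real data set).

```r
library(dplyr)

# Made-up rows for illustration only; not taken from quasi_winshares.csv
toy <- tibble(
  age    = c(24, 29, 33, 36, 41, NA),
  WAR162 = c(1.0, 2.0, 3.0, 0.5, 1.5, 4.0)
)

age_summary <- toy |>
  filter(!is.na(age)) |>               # drop the unknown age on purpose
  mutate(under_35 = age < 35) |>       # TRUE for players younger than 35
  group_by(under_35) |>
  summarise(
    n           = n(),                 # rows in each age group
    mean_WAR162 = mean(WAR162, na.rm = TRUE)
  )

age_summary
```

Putting both groups side by side makes it easier to judge whether the under-35 mean differs much from the mean for older players.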

In the code below, we will find the age of the oldest player in this data set who had an above-average WAR162.

dataset|>
  filter(WAR162>mean(WAR162))|> #filter where WAR162 is greater than avg
  pluck("age")|> #pluck the age
  max() #oldest age to have above average WAR162
## [1] 48
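
The pipeline above returns only the age. If we also want to see which player-season it belongs to, one option is dplyr's slice_max(), which keeps the full row. This is a minimal sketch on a made-up tibble (`toy` and the player names are hypothetical), not output from the real data.

```r
library(dplyr)

# Made-up rows for illustration only; player names are hypothetical
toy <- tibble(
  name_common = c("Player A", "Player B", "Player C", "Player D"),
  age         = c(25, 40, 48, 30),
  WAR162      = c(0.1, 2.0, 1.5, -0.5)
)

oldest_above_avg <- toy |>
  filter(WAR162 > mean(WAR162, na.rm = TRUE)) |>  # above-average WAR162 only
  slice_max(age, n = 1, with_ties = TRUE) |>      # keep the oldest row(s)
  select(name_common, age, WAR162)

oldest_above_avg
```

Setting with_ties = TRUE keeps every row that shares the maximum age, which matters if several players are tied for oldest.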

Three Novel Questions

  1. Would filtering out observations that fall outside the last 15 years be an appropriate way to reduce this data set? There are nearly 99,000 rows in this data set, and I need to bring this number down to about 20,000 to make it more manageable.

  2. Would changing the def_pos of players who had multiple positions to “UTY” for utility be a good way to clean that column?

  3. Do statistics like WAR162 or quasi_ws (quasi win shares) tend to increase or decrease as more seasons are played?

The code below checks question 1 by counting the rows from the last 15 seasons.

dataset|>
  filter(year_ID>2004)|> #filter dataset to last 15 seasons only (2005-2019)
  nrow() #how many rows would be included
## [1] 21414

Based on this filtering method, we would still have 21,414 rows, which is much more manageable than the almost 100,000 we had before.
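
For question 3, one possible starting point is to average WAR162 and quasi_ws within each season and then look at how those averages move over time. The sketch below does this on a small made-up tibble (`toy` and its values are assumptions), and fits a simple linear trend to the season means as a rough indicator of direction.

```r
library(dplyr)

# Made-up rows for illustration only; not taken from quasi_winshares.csv
toy <- tibble(
  year_ID  = c(2005, 2005, 2006, 2006, 2007, 2007),
  WAR162   = c(1.0, 3.0, 0.0, 2.0, 2.0, 4.0),
  quasi_ws = c(4, 8, 2, 6, 10, 12)
)

by_season <- toy |>
  group_by(year_ID) |>
  summarise(
    mean_WAR162   = mean(WAR162, na.rm = TRUE),   # season average of WAR162
    mean_quasi_ws = mean(quasi_ws, na.rm = TRUE)  # season average of quasi_ws
  )

# Slope of the season means against year: positive suggests an upward trend
war_slope <- coef(lm(mean_WAR162 ~ year_ID, data = by_season))[["year_ID"]]

by_season
war_slope
```

A slope on season means is only a coarse summary, so a line plot of by_season would be a sensible follow-up before drawing conclusions.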

Visual Summaries

In the visual below, we will take a look at how WAR162 and quasi_ws are related. WAR162 measures a player's value by comparing them to a replacement player (Wins Above Replacement), and quasi_ws holds the player's quasi win shares.

dataset|>
  filter(year_ID>2004)|>
  ggplot()+
  geom_point(mapping=aes(x=WAR162, y= quasi_ws), color= 'darkred')+
  labs(title=" WAR162 vs. Quasi Winshare",
       x= 'WAR162', y= 'Quasi Winshare')+
  theme_classic()

In the visual above, we can see there appears to be a positive association between WAR162 and Quasi Winshare for the 2005 to 2019 seasons: higher WAR162 tends to come with higher Quasi Winshares. The plot alone does not establish statistical significance, so a formal test would be needed to confirm it. This makes me wonder whether performance in these columns is associated with the number of years played with a franchise.
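
To put a number on the relationship seen in the scatter plot, one option is the Pearson correlation from base R's cor(), with cor.test() supplying a p-value. The sketch below runs both on a small made-up data frame (`toy` and its values are assumptions), so the numbers it produces say nothing about the real data set.

```r
# Made-up values for illustration only; not taken from quasi_winshares.csv
toy <- data.frame(
  WAR162   = c(-0.5, 0.2, 1.1, 2.4, 3.8, 5.0),
  quasi_ws = c(1, 3, 6, 11, 17, 24)
)

# Pearson correlation, ignoring any rows with missing values
r <- cor(toy$WAR162, toy$quasi_ws, use = "complete.obs")

# Test of the null hypothesis that the true correlation is zero
test_result <- cor.test(toy$WAR162, toy$quasi_ws)

r
test_result$p.value
```

On the real data the same two calls would take the filtered WAR162 and quasi_ws columns. With over 21,000 rows, even a weak correlation can produce a very small p-value, so the size of r is worth reporting alongside it.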