R Final

R Markdown

I want to find out what best predicts a team’s success: offensive stats or defensive stats? Then I want to dig in and see which individual statistic is the best predictor: on-base percentage, slugging percentage, or runs allowed? I hope to find one or two stats that can help me predict a team’s success.

library(readr)
library(ggplot2)
library(ggthemes)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

urlfile2="https://vincentarelbundock.github.io/Rdatasets/csv/openintro/mlb_teams.csv"

#Team Data
teamdata<-read_csv(url(urlfile2))

## New names:
## • `` -> `...1`

## Rows: 2784 Columns: 42
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (8): league_id, division_id, division_winner, wild_card_winner, league_...
## dbl (34): ...1, year, rank, games_played, home_games, wins, losses, runs_sco...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(teamdata)

## # A tibble: 6 × 42
##    ...1  year leagu…¹ divis…²  rank games…³ home_…⁴  wins losses divis…⁵ wild_…⁶
##   <dbl> <dbl> <chr>   <chr>   <dbl>   <dbl>   <dbl> <dbl>  <dbl> <chr>   <chr>  
## 1     1  1876 NL      <NA>        4      70      NA    39     31 <NA>    <NA>   
## 2     2  1876 NL      <NA>        1      66      NA    52     14 <NA>    <NA>   
## 3     3  1876 NL      <NA>        8      65      NA     9     56 <NA>    <NA>   
## 4     4  1876 NL      <NA>        2      69      NA    47     21 <NA>    <NA>   
## 5     5  1876 NL      <NA>        5      69      NA    30     36 <NA>    <NA>   
## 6     6  1876 NL      <NA>        6      57      NA    21     35 <NA>    <NA>   
## # … with 31 more variables: league_winner <chr>, world_series_winner <chr>,
## #   runs_scored <dbl>, at_bats <dbl>, hits <dbl>, doubles <dbl>, triples <dbl>,
## #   homeruns <dbl>, walks <dbl>, strikeouts_by_batters <dbl>,
## #   stolen_bases <dbl>, caught_stealing <dbl>, batters_hit_by_pitch <dbl>,
## #   sacrifice_flies <dbl>, opponents_runs_scored <dbl>,
## #   earned_runs_allowed <dbl>, earned_run_average <dbl>, complete_games <dbl>,
## #   shutouts <dbl>, saves <dbl>, outs_pitches <dbl>, hits_allowed <dbl>, …
## # ℹ Use `colnames()` to see all variable names
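The on-base and slugging percentages mentioned in the introduction are not columns in this dataset, but they can be derived from columns listed above. The following is a minimal sketch using the standard formulas. The data frame `toy` and its numbers are made up for illustration (not real MLB data); only the column names are taken from `teamdata`.

```r
library(dplyr)

# Toy rows with made-up numbers (NOT real MLB data); column names follow teamdata.
toy <- data.frame(
  at_bats              = c(5500, 5400),
  hits                 = c(1500, 1350),
  doubles              = c(300, 250),
  triples              = c(30, 20),
  homeruns             = c(200, 150),
  walks                = c(550, 450),
  batters_hit_by_pitch = c(50, 40),
  sacrifice_flies      = c(45, 40)
)

rates <- toy %>%
  mutate(
    # OBP = (H + BB + HBP) / (AB + BB + HBP + SF)
    on_base_pct = (hits + walks + batters_hit_by_pitch) /
      (at_bats + walks + batters_hit_by_pitch + sacrifice_flies),
    # Total bases = 1B + 2*2B + 3*3B + 4*HR, which equals H + 2B + 2*3B + 3*HR
    total_bases = hits + doubles + 2 * triples + 3 * homeruns,
    # SLG = total bases / AB
    slugging_pct = total_bases / at_bats
  )

print(rates[, c("on_base_pct", "slugging_pct")])
```

If any of the input columns is NA for a season, the derived rate will be NA for that season as well.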

#Step 1: For now, I want to look only at season data for World Series winning teams, so I will filter the world_series_winner column to keep only the rows with a value of Y.

WSwinner<- teamdata %>% filter(world_series_winner == "Y")
#WSwinner
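A quick sanity check on this filter is to confirm that each season contributes exactly one winner and to list any seasons with none (there was no World Series in 1994, for example). This is only a sketch: the `toy` data frame is made up, and how the real dataset encodes a season without a winner (NA or "N") is an assumption here.

```r
library(dplyr)

# Made-up rows for illustration; the NA encoding for 1994 is an assumption.
toy <- data.frame(
  year                = c(1993, 1993, 1994, 1994, 1995, 1995),
  world_series_winner = c("Y", "N", NA, NA, "Y", "N")
)

# Same filter as in Step 1; rows where the column is NA are dropped as well.
winners <- toy %>% filter(world_series_winner == "Y")

# One row per season, with the number of winners found in that season.
winners_per_year <- winners %>% count(year)

# Seasons present in the data that have no winner at all.
missing_years <- setdiff(unique(toy$year), winners$year)

print(winners_per_year)
print(missing_years)
```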

#Step 2: Now I want to clean the data. I’m not interested in columns like rank or division winner: we already know these teams won the World Series, so we can drop some of the irrelevant columns. I also want to keep only data from 1970 to the present, because a few important data points are null for seasons prior to 1970.

WSwinner1<- WSwinner[,c(2,3,6:9,14:40,42)]
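Selecting columns by position works, but it silently breaks if the column order of the CSV ever changes. A hedged alternative is to drop the unwanted columns by name. In the sketch below, `toy` is a made-up stand-in with only a few of the real columns, and `ball_park` is an assumed name for column 41, which is not visible in the `head()` output; `any_of()` simply ignores names that are not present.

```r
library(dplyr)

# Made-up stand-in for teamdata with a handful of its columns.
toy <- data.frame(
  idx                 = 1:2,
  year                = c(1999, 2000),
  league_id           = c("AL", "NL"),
  division_id         = c("E", "W"),
  rank                = c(1, 2),
  wins                = c(98, 90),
  world_series_winner = c("Y", "N"),
  runs_scored         = c(900, 800)
)
# readr names the unnamed first CSV column "...1"; mimic that here.
names(toy)[1] <- "...1"

# Columns the analysis does not need ("ball_park" is an assumed name).
drop_cols <- c("...1", "division_id", "rank", "division_winner",
               "wild_card_winner", "league_winner", "world_series_winner",
               "ball_park")

# any_of() tolerates names that are missing from the data frame.
kept <- toy %>% select(-any_of(drop_cols))

print(names(kept))
```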

WS1970_2020<-WSwinner1%>% filter(year >= 1970)

#WS1970_2020
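The 1970 cutoff rests on some columns being null in earlier seasons. One way to check that claim is to count missing values per column before and after the cutoff. This is a minimal sketch on made-up data; which columns are actually missing before 1970 in the real dataset is not shown here, so the NA pattern below is purely hypothetical.

```r
library(dplyr)

# Made-up rows; the NA pattern is hypothetical, not taken from the real data.
toy <- data.frame(
  year            = c(1950, 1960, 1975, 1990),
  hits            = c(1400, 1350, 1450, 1500),
  caught_stealing = c(NA, NA, 60, 55),
  sacrifice_flies = c(NA, 40, 45, 50)
)

na_by_era <- toy %>%
  # Label each season as before or after the cutoff used in Step 2.
  mutate(era = if_else(year >= 1970, "1970 on", "pre-1970")) %>%
  group_by(era) %>%
  # Count NAs in every column except year (the grouping column is skipped).
  summarise(across(-year, ~ sum(is.na(.x))), .groups = "drop")

print(na_by_era)
```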

#Step 3: Now that we have cleaned the dataset, we can look at the summary and start analyzing what may be contributing to a team’s success.

#summary(WS1970_2020)

#Step 3 cont.: Looking at the data, we see an outlier. The 2020 season was severely impacted by the COVID-19 pandemic, which led to a shortened season that would skew our findings. I would argue that its results should not be included: with a lack of fan attendance, a shortened schedule, starters being pulled due to infection, and so on, it just doesn’t make sense to include it. So we filter out 2020 and run the summary again.

WS1970_2019<-WS1970_2020%>% filter(year != 2020)

#summary(WS1970_2019)
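Dropping 2020 is the choice made in this analysis. For comparison only, a different way to handle seasons of unequal length is to convert counting stats to per-game rates using games_played. The sketch below uses made-up numbers and only addresses season length, not the other concerns raised above (attendance, players missing games).

```r
library(dplyr)

# Made-up totals for a full season and a 60-game season (NOT real MLB data).
toy <- data.frame(
  year                  = c(2019, 2020),
  games_played          = c(162, 60),
  runs_scored           = c(810, 300),
  opponents_runs_scored = c(648, 240)
)

per_game <- toy %>%
  mutate(
    # Dividing by games played puts both seasons on the same scale.
    runs_per_game     = runs_scored / games_played,
    opp_runs_per_game = opponents_runs_scored / games_played
  )

print(per_game)
```

In this toy example the raw totals differ widely between the two seasons, while the per-game rates are identical.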

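#Step 4: Now I want to create a matching subset covering all teams in the same seasons (1970 to 2019) and compare its summary with the World Series winners’ summary. Note that this subset still includes the World Series winners themselves.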

NWSwinner1<- teamdata[,c(2,3,6:9,14:40,42)]
nWS1970_2020<-NWSwinner1%>% filter(year >= 1970)
Seasons1970_2019<-nWS1970_2020%>% filter(year != 2020)
#summary(Seasons1970_2019)
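Because the all-teams subset is built from the full `teamdata`, the winners are part of the comparison group. If a winners versus non-winners comparison is wanted instead, the winners can be excluded before the world_series_winner column is dropped. A minimal sketch on made-up data follows; the explicit `is.na()` check is there because `filter()` would otherwise discard rows where the column is NA.

```r
library(dplyr)

# Made-up rows for illustration (NOT real MLB data).
toy <- data.frame(
  year                = c(2018, 2018, 2018, 2019, 2019),
  world_series_winner = c("Y", "N", "N", "Y", NA),
  runs_scored         = c(876, 700, 650, 873, 720)
)

# Keep every team that is not flagged as a winner, including NA rows.
non_winners <- toy %>%
  filter(is.na(world_series_winner) | world_series_winner != "Y")

all_mean <- mean(toy$runs_scored)          # winners included
non_mean <- mean(non_winners$runs_scored)  # winners excluded

print(c(all_teams = all_mean, non_winners = non_mean))
```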

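#Step 5: I will now compute the means and medians of three offensive stats (runs scored, hits, home runs) and three defensive stats (strikeouts by pitchers, home runs allowed, opponents’ runs scored), for both the World Series winners and all teams.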

offWS<-{
WS1970_2019%>%
  summarize(wsavgruns=mean(runs_scored),
            wsmedruns=median(runs_scored),
            wsavghits=mean(hits),
            wsmedhits=median(hits),
            wsavghr=mean(homeruns),
            wsmedhr=median(homeruns))}
defWS<-{
WS1970_2019%>%
  summarize(wsavgstrikeout=mean(strikeouts_by_pitchers),
            wsmedstrikeout=median(strikeouts_by_pitchers),
            wsavghra=mean(homeruns_allowed),
            wsmedhra=median(homeruns_allowed),
            wsavgors=mean(opponents_runs_scored),
            wsmedors=median(opponents_runs_scored))}
offnWS<-{
Seasons1970_2019%>%
  summarize(nwsavgruns=mean(runs_scored),
            nwsmedruns=median(runs_scored),
            nwsavghits=mean(hits),
            nwsmedhits=median(hits),
            nwsavghr=mean(homeruns),
            nwsmedhr=median(homeruns))}
defnwS<-{
Seasons1970_2019%>%
  summarize(nwsavgstrikeout=mean(strikeouts_by_pitchers),
            nwsmedstrikeout=median(strikeouts_by_pitchers),
            nwsavghra=mean(homeruns_allowed),
            nwsmedhra=median(homeruns_allowed),
            nwsavgors=mean(opponents_runs_scored),
            nwsmedors=median(opponents_runs_scored))}
#offWS
#defWS
#offnWS
#defnwS
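The four summary tables above can also be produced with one helper and stacked into a single comparison table, which makes the winners and all-teams rows easier to read side by side. This is a sketch under assumptions: `summarise_stats()` is a hypothetical helper, the two data frames hold made-up numbers, and only two of the six stats are shown.

```r
library(dplyr)

# Made-up numbers standing in for WS1970_2019 and Seasons1970_2019.
winners  <- data.frame(runs_scored = c(800, 850, 900),
                       hits        = c(1450, 1500, 1550))
everyone <- data.frame(runs_scored = c(650, 700, 750, 800, 850, 900),
                       hits        = c(1350, 1400, 1450, 1450, 1500, 1550))

# Hypothetical helper: mean and median of each listed column in one call.
summarise_stats <- function(df) {
  df %>%
    summarise(across(c(runs_scored, hits),
                     list(avg = mean, med = median),
                     .names = "{.fn}_{.col}"))
}

# Stack both summaries; the argument names become the "group" column.
comparison <- bind_rows(
  "WS winners" = summarise_stats(winners),
  "All teams"  = summarise_stats(everyone),
  .id = "group"
)

print(comparison)
```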

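#Step 6: Compute the league average of each of these stats by year, then plot the World Series winners (blue points) against the league average (red line).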

avgrunbyyear<-{
  Seasons1970_2019%>%
    group_by(year)%>%
    summarize(avgruns=mean(runs_scored))
}
avghitsbyyear<-{
  Seasons1970_2019%>%
    group_by(year)%>%
    summarize(avghit=mean(hits))
} 
  
avghrbyyear<-{
  Seasons1970_2019%>%
    group_by(year)%>%
    summarize(avghr=mean(homeruns))
}

avghrabyyear<-{
  Seasons1970_2019%>%
    group_by(year)%>%
    summarize(avghra=mean(homeruns_allowed))
}

avgsobyyear<-{
  Seasons1970_2019%>%
    group_by(year)%>%
    summarize(avgso=mean(strikeouts_by_pitchers))
}

avgorsbyyear<-{
  Seasons1970_2019%>%
    group_by(year)%>%
    summarize(avgors=mean(opponents_runs_scored))
}
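The per-year averages above are computed one data frame at a time. As a sketch of an equivalent, more compact form, a single grouped `summarise()` with `across()` can return all yearly averages in one table. The data and the `avg_` naming scheme below are illustrative assumptions, not the names used elsewhere in this report.

```r
library(dplyr)

# Made-up rows: two teams per season (NOT real MLB data).
toy <- data.frame(
  year        = c(2000, 2000, 2001, 2001),
  runs_scored = c(700, 800, 750, 850),
  homeruns    = c(150, 170, 160, 200)
)

league_avg_by_year <- toy %>%
  group_by(year) %>%
  # One mean per listed column and season; output columns get an avg_ prefix.
  summarise(across(c(runs_scored, homeruns), mean, .names = "avg_{.col}"),
            .groups = "drop")

print(league_avg_by_year)
```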


g1<-ggplot()+
  geom_point(data = WS1970_2019, aes(x=year, y=runs_scored), color='blue') + 
  geom_line(data = avgrunbyyear, aes(x=year, y=avgruns), color='red')
g1

g2<-ggplot()+
  geom_point(data = WS1970_2019, aes(x=year, y=hits), color='blue') + 
  geom_line(data = avghitsbyyear, aes(x=year, y=avghit), color='red')
g2

g3<-ggplot()+
  geom_point(data = WS1970_2019, aes(x=year, y=homeruns), color='blue') + 
  geom_line(data = avghrbyyear, aes(x=year, y=avghr), color='red')
g3

g4<-ggplot()+
  geom_point(data = WS1970_2019, aes(x=year, y=homeruns_allowed), color='blue') + 
  geom_line(data = avghrabyyear, aes(x=year, y=avghra), color='red')
g4

g5<-ggplot()+
  geom_point(data = WS1970_2019, aes(x=year, y=strikeouts_by_pitchers), color='blue') + 
  geom_line(data = avgsobyyear, aes(x=year, y=avgso), color='red')
g5

g6<-ggplot()+
  geom_point(data = WS1970_2019, aes(x=year, y=opponents_runs_scored), color='blue') + 
  geom_line(data = avgorsbyyear, aes(x=year, y=avgors), color='red')
g6
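The six plots share the same structure: winners as blue points, league average as a red line. A small helper can capture that pattern and add axis labels. This is a sketch only; `plot_winners_vs_league()` is a hypothetical function, and the toy data frames stand in for `WS1970_2019` and `avgrunbyyear`.

```r
library(ggplot2)

# Hypothetical helper: stat and avg_col are column names given as strings.
plot_winners_vs_league <- function(winners, league_avg, stat, avg_col) {
  ggplot() +
    # World Series winners, one blue point per season.
    geom_point(data = winners,
               aes(x = year, y = .data[[stat]]), color = "blue") +
    # League average for the same stat, drawn as a red line.
    geom_line(data = league_avg,
              aes(x = year, y = .data[[avg_col]]), color = "red") +
    labs(x = "Season", y = stat,
         title = paste("World Series winners vs. league average:", stat))
}

# Made-up data for illustration (NOT real MLB data).
toy_winners <- data.frame(year = 2000:2002, runs_scored = c(820, 790, 860))
toy_avg     <- data.frame(year = 2000:2002, avgruns = c(780, 770, 775))

p <- plot_winners_vs_league(toy_winners, toy_avg, "runs_scored", "avgruns")
```

With the real objects, a call such as `plot_winners_vs_league(WS1970_2019, avgrunbyyear, "runs_scored", "avgruns")` would be expected to reproduce g1 with labels added.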

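If we look at the last graph (g6), we can see that the league-average opponents’ runs scored is above the actual opponents’ runs scored for most of the World Series champions. In other words, most champions allowed fewer runs than the league average. I want to look at one more defensive stat (hits allowed) and one more offensive stat (strikeouts by batters). Right now it appears that defense is key to winning championships.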
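Reading the comparison off a graph can be backed up with a count. The sketch below joins each winner to its season's league average and counts how many winners fall below it. The numbers are made up, so the resulting share says nothing about the real data; with the real objects, `WS1970_2019` and `avgorsbyyear` would take the place of the toy data frames.

```r
library(dplyr)

# Made-up values for illustration (NOT real MLB data).
toy_winners <- data.frame(year = 2000:2003,
                          opponents_runs_scored = c(650, 700, 780, 640))
toy_avg     <- data.frame(year = 2000:2003,
                          avgors = c(750, 760, 770, 740))

vs_league <- toy_winners %>%
  # Attach the league average for the same season to each winner.
  inner_join(toy_avg, by = "year") %>%
  # TRUE when the winner allowed fewer runs than the league average.
  mutate(below_avg = opponents_runs_scored < avgors)

summary_counts <- vs_league %>%
  summarise(n_winners   = n(),
            n_below     = sum(below_avg),
            share_below = mean(below_avg))

print(summary_counts)
```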

avghitsallowedbyyear<-{
  Seasons1970_2019%>%
    group_by(year)%>%
    summarize(avgha=mean(hits_allowed))
}

avgsobbyyear<-{
  Seasons1970_2019%>%
    group_by(year)%>%
    summarize(avgsobb=mean(strikeouts_by_batters))
}

g7<-ggplot()+
  geom_point(data = WS1970_2019, aes(x=year, y=hits_allowed), color='blue') + 
  geom_line(data = avghitsallowedbyyear, aes(x=year, y=avgha), color='red')
g7

g8<-ggplot()+
  geom_point(data = WS1970_2019, aes(x=year, y=strikeouts_by_batters), color='blue') + 
  geom_line(data = avgsobbyyear, aes(x=year, y=avgsobb), color='red')
g8

It appears that defensive stats are really good indicators of a team’s chances of winning the World Series. There is also an interesting finding for strikeouts by batters: from the mid-90s on, the graph suggests that it is an important variable. I think a future deep dive would need to look at walks and then compare star hitters’ numbers.

g7

g6

Based on these findings, I would suggest that the best indicator of a team’s chances of winning the World Series is a defensive stat: opponents’ runs scored. As we can see in graph 6 (g6), only 2 World Series winners since 1970 allowed more runs than the league average.
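One complementary check, which is not part of the analysis above, is to look at how strongly each stat correlates with regular-season wins across all teams. The sketch below uses made-up numbers purely to show the mechanics; it makes no claim about what the real data would show, and correlation with wins is a different measure of success than winning the World Series.

```r
library(dplyr)

# Made-up team seasons for illustration (NOT real MLB data).
toy <- data.frame(
  wins                  = c(100, 92, 85, 78, 70, 62),
  runs_scored           = c(850, 760, 800, 720, 700, 690),
  opponents_runs_scored = c(620, 650, 700, 730, 780, 820)
)

# Pearson correlation of every other column with wins.
cor_with_wins <- toy %>%
  summarise(across(-wins, ~ cor(.x, wins)))

print(cor_with_wins)
```

A value near -1 for opponents_runs_scored would mean that teams allowing fewer runs tend to win more games, while a value near +1 for runs_scored would mean the same for scoring more.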


Neil Hodgkinson

2022-07-28
