knitr::opts_chunk$set(echo = TRUE)
Some columns are self explanatory (Cather and Called PItches) but SAA stands for Strikes above average. This means that, for Gregg Zaun’s case 159 more strikes are called when he is cathcing on aveage in his career. SAA/150 pitches means that for 150 pitches, on average, 5.3 more strikes are called. The higher the number this is the better it is assumed the pitcher is. Finally Funs/150 pitches means that for every 150 pitches, in Gregg’s case .85 runs are scored by the opposing team. The top three cathers at framing appear to be Gregg Zaun, Jeff Mathis, and Yadier Molina.
I think that pitch framing was overvalued. The correlation, while above average, is relatively low and the fact that the number of pitches ranges between 2400 to 6000 the SAA for catchers who recieve the least amount of pitches could be inflated.
In the graph “Draft prospect survival curves, time until nth career NHL game” shows how many games until players essenaill plateuau in regard to their skill development. Players who players who start playing in the NHL (the red line) have less of a probability of developing their skills than other players. This impliesthat they are already good when the entered the league. Players start their first NHL game later have a higher change of developing their skills because they started at a lower skill level than the 1st players. It appears to be that 10% of players just drafted play the first game.
In regards to drafting players, if you are a North American player you want to be drafted earlier rather than later because if you are drafted later you have less of a change of playing in a NHL game. Since the formula for hazard ratio is (Probability pic HR time t)/(Proabability pic HR time t-1), the probabiliyt pic HR time t decresases as hazard ratio decreases. The NHL is undervaluing North American players.
#install.packages('RCurl');install.packages('tidyverse')
library(RCurl); library(tidyverse)
gitURL<- "https://raw.githubusercontent.com/statsbylopez/StatsSports/master/Data/pbp_data_hockey.rds"
nhl_shots <- readRDS(gzcon(url(gitURL)))
names(nhl_shots)
## [1] "season" "game_id" "game_date" "session"
## [5] "event_index" "game_period" "game_seconds" "event_type"
## [9] "home_team" "away_team" "home_skaters" "away_skaters"
## [13] "home_score" "away_score" "event_detail" "event_team"
## [17] "event_player_1" "event_player_2" "coords_x" "coords_y"
## [21] "home_goalie" "away_goalie" "event_circle" "event_distance"
## [25] "event_angle" "shot_prob"
#dim(nhl_shots)
head(nhl_shots)
## season game_id game_date session event_index game_period game_seconds
## 1 20172018 2017020001 2017-10-04 R 11 1 38
## 2 20172018 2017020001 2017-10-04 R 15 1 49
## 3 20172018 2017020001 2017-10-04 R 23 1 63
## 4 20172018 2017020001 2017-10-04 R 31 1 75
## 5 20172018 2017020001 2017-10-04 R 36 1 106
## 6 20172018 2017020001 2017-10-04 R 46 1 127
## event_type home_team away_team home_skaters away_skaters home_score
## 1 SHOT WPG TOR 5 5 0
## 2 SHOT WPG TOR 5 5 0
## 3 SHOT WPG TOR 5 5 0
## 4 MISS WPG TOR 5 5 0
## 5 SHOT WPG TOR 5 5 0
## 6 MISS WPG TOR 5 5 0
## away_score event_detail event_team event_player_1 event_player_2 coords_x
## 1 0 Wrist WPG JOSH.MORRISSEY <NA> -37
## 2 0 Wrist WPG SHAWN.MATTHIAS <NA> -76
## 3 0 Backhand WPG BRYAN.LITTLE <NA> -74
## 4 0 Wrist WPG TOBIAS.ENSTROM <NA> -34
## 5 0 Wrist TOR ERIC.FEHR <NA> 79
## 6 0 Slap TOR RON.HAINSEY <NA> 36
## coords_y home_goalie away_goalie event_circle event_distance
## 1 -27 STEVE.MASON FREDERIK.ANDERSEN 2 58.6
## 2 2 STEVE.MASON FREDERIK.ANDERSEN 1 13.2
## 3 11 STEVE.MASON FREDERIK.ANDERSEN 1 18.6
## 4 -18 STEVE.MASON FREDERIK.ANDERSEN 2 57.9
## 5 -2 STEVE.MASON FREDERIK.ANDERSEN 9 10.2
## 6 27 STEVE.MASON FREDERIK.ANDERSEN 8 59.5
## event_angle shot_prob
## 1 27.4 0.006120646
## 2 8.7 0.091498807
## 3 36.3 0.053753931
## 4 18.1 0.014817776
## 5 11.3 0.156606615
## 6 27.0 0.012393842
nhl_shots%>%
group_by(event_player_1)%>%
summarise(num_shots=sum(event_type=="SHOT"))%>%
arrange(-num_shots)
## # A tibble: 1,051 x 2
## event_player_1 num_shots
## <chr> <int>
## 1 BRENT.BURNS 602
## 2 ALEX.OVECHKIN 591
## 3 TYLER.SEGUIN 586
## 4 NATHAN.MACKINNON 569
## 5 PATRICK.KANE 555
## 6 BRENDAN.GALLAGHER 515
## 7 VLADIMIR.TARASENKO 514
## 8 EVANDER.KANE 513
## 9 JACK.EICHEL 495
## 10 DOUGIE.HAMILTON 494
## # ... with 1,041 more rows
nhl_shots%>%
group_by(away_goalie)%>%
filter(event_team!=away_team)%>%
summarise(goalieA_shots=sum(event_type=="SHOT"))%>%
arrange(-goalieA_shots)
## # A tibble: 112 x 2
## away_goalie goalieA_shots
## <chr> <int>
## 1 FREDERIK.ANDERSEN 1849
## 2 HENRIK.LUNDQVIST 1794
## 3 CONNOR.HELLEBUYCK 1749
## 4 SERGEI.BOBROVSKY 1703
## 5 DEVAN.DUBNYK 1575
## 6 JACOB.MARKSTROM 1575
## 7 ANDREI.VASILEVSKIY 1555
## 8 CAREY.PRICE 1530
## 9 PEKKA.RINNE 1521
## 10 JAKE.ALLEN 1512
## # ... with 102 more rows
nhl_shots%>%
group_by(home_goalie)%>%
filter(event_team==away_team)%>%
summarise(goalieH_shots=sum(event_type=="SHOT"))%>%
arrange(-goalieH_shots)
## # A tibble: 107 x 2
## home_goalie goalieH_shots
## <chr> <int>
## 1 CONNOR.HELLEBUYCK 2011
## 2 FREDERIK.ANDERSEN 1956
## 3 JOHN.GIBSON 1947
## 4 ANDREI.VASILEVSKIY 1916
## 5 JIMMY.HOWARD 1810
## 6 BRADEN.HOLTBY 1756
## 7 DEVAN.DUBNYK 1754
## 8 JACOB.MARKSTROM 1747
## 9 SERGEI.BOBROVSKY 1723
## 10 CRAIG.ANDERSON 1696
## # ... with 97 more rows
nhl_shots%>%
group_by(event_detail)%>%
filter(event_type=="GOAL")%>%
summarise(mean_prob=mean(shot_prob))%>%
arrange(-mean_prob)
## # A tibble: 8 x 2
## event_detail mean_prob
## <chr> <dbl>
## 1 <NA> 0.475
## 2 Backhand 0.203
## 3 Wrist 0.185
## 4 Tip-In 0.153
## 5 Deflected 0.148
## 6 Snap 0.140
## 7 Wrap-around 0.0998
## 8 Slap 0.0821
The shooter who took the most shots was Darnell Nurse with 374
Connor Hellebuyck faced the highest number of shots
the Backhand has the highest probability of making a goal while the slap shot has the lowest probability. A wrist shot has the second highest probability and the wrap-around has the second lowest probability.
first_shots <- filter(nhl_shots, game_id <= 2017020500)
current_shots <- first_shots %>%
group_by(event_player_1) %>%
summarise(n_shots_past = n(),
shot_p_past = mean(event_type == "GOAL")) %>%
filter(n_shots_past >= 150)
future_shots <- nhl_shots %>%
filter(game_id > 2017020500, event_player_1 %in% current_shots$event_player_1)%>%
group_by(event_player_1) %>%
summarise(n_shots_future = n(),
shot_p_future = mean(event_type == "GOAL"))
nhl_players <- current_shots %>% inner_join(future_shots)
## Joining, by = "event_player_1"
head(current_shots)
## # A tibble: 6 x 3
## event_player_1 n_shots_past shot_p_past
## <chr> <int> <dbl>
## 1 ALEX.OVECHKIN 202 0.109
## 2 BRENT.BURNS 175 0.0286
## 3 EVANDER.KANE 173 0.0809
## 4 JACK.EICHEL 155 0.0710
## 5 JEFF.SKINNER 157 0.0701
## 6 MAX.PACIORETTY 177 0.0452
p_bar<-mean(current_shots$shot_p_past)
#p_bar
p_hat<-current_shots$shot_p_past
#p_hat
N<-current_shots$n_shots_past
#N
sigma_sq<-sd(p_hat)^2
#sigma_sq
#estimator
c<-(N/0.025)/(N/0.025+1/sigma_sq)
mean(c)
## [1] 0.8423165
current_shots$shot_p_past_MLE<-current_shots$shot_p_past
current_shots$shot_p_past_JS<-p_bar+c*(p_hat-p_bar)
mean(current_shots$shot_p_past_MLE)
## [1] 0.07557165
current_shots[1:3,]
## # A tibble: 3 x 5
## event_player_1 n_shots_past shot_p_past shot_p_past_MLE shot_p_past_JS
## <chr> <int> <dbl> <dbl> <dbl>
## 1 ALEX.OVECHKIN 202 0.109 0.109 0.104
## 2 BRENT.BURNS 175 0.0286 0.0286 0.0359
## 3 EVANDER.KANE 173 0.0809 0.0809 0.0801
current_shots%>%
ungroup()%>%
mutate(abs_error_mle=abs(shot_p_past_MLE-future_shots$shot_p_future),
abs_error_js=abs(shot_p_past_JS-future_shots$shot_p_future))%>%
summarise(mae_mle=mean(abs_error_mle),
mae_js=mean(abs_error_js))
## # A tibble: 1 x 2
## mae_mle mae_js
## <dbl> <dbl>
## 1 0.0165 0.0147
url <- getURL("https://raw.githubusercontent.com/statsbylopez/StatsSports/master/Data/sb_shot_data.csv")
wwc_shot <- read.csv(text = url)
names(wwc_shot)
## [1] "period" "minute" "second"
## [4] "possession" "duration" "possession_team.name"
## [7] "play_pattern.id" "play_pattern.name" "player.id"
## [10] "player.name" "position.name" "shot.statsbomb_xg"
## [13] "shot.first_time" "shot.technique.name" "shot.outcome.name"
## [16] "shot.type.name" "shot.body_part.name" "match_id"
## [19] "location.x" "location.y" "location.x.GK"
## [22] "location.y.GK" "player.name.GK" "DistToGoal"
## [25] "DistToKeeper" "AngleToGoal" "avevelocity"
## [28] "distance.ToD1" "DefendersBehindBall" "TimeInPoss"
dim(wwc_shot)
## [1] 1314 30
wwc_shot%>%
group_by(player.name)%>%
filter(possession_team.name=="England Women's")%>%
summarise(goals_made=sum(shot.outcome.name=="Goal"),
shooting_pct=sum(shot.statsbomb_xg))%>%
arrange(-shooting_pct)
## # A tibble: 18 x 3
## player.name goals_made shooting_pct
## <fct> <int> <dbl>
## 1 Ellen White 6 3.59
## 2 Nikita Parris 1 2.71
## 3 Jodie Taylor 1 1.65
## 4 Jill Scott 1 1.01
## 5 Stephanie Houghton 1 0.909
## 6 Bethany Mead 0 0.325
## 7 Rachel Daly 0 0.317
## 8 Toni Duggan 0 0.247
## 9 Georgia Stanway 0 0.220
## 10 Francesca Kirby 1 0.175
## 11 Lucy Bronze 1 0.130
## 12 Julia Zigiotti-Olme 0 0.125
## 13 Jade Moore 0 0.0943
## 14 Alex Greenwood 1 0.0808
## 15 Keira Walsh 0 0.0449
## 16 Erin Cuthbert 0 0.0379
## 17 Karen Julia Carney 0 0.0222
## 18 Yuika Sugasawa 0 0.0125
wwc_shot<-wwc_shot%>%
mutate(is_goal = shot.outcome.name == "Goal")
England_shot <- wwc_shot %>%
filter(possession_team.name == "England Women's",
player.name=="Ellen White")
p1 <- ggplot(England_shot, aes(location.x, location.y, shape=is_goal )) +
geom_point()
p1
Ellen White had the highest expected goals with 3.59. She over perfomred by 2.41 goals along with Franscesca Kirby, Lucy Bronze, and Alex Greenwood
library(splines)
wwc_shot <- wwc_shot %>%
mutate(is_goal = shot.outcome.name == "Goal")
fit1 <- glm(is_goal ~ avevelocity + minute, data = wwc_shot, family = "binomial")
fit2 <- glm(is_goal ~ ns(avevelocity, 5) + minute, data = wwc_shot, family = "binomial")
fit3 <- glm(is_goal ~ avevelocity + ns(minute, 5), data = wwc_shot, family = "binomial")
fit4 <- glm(is_goal ~ ns(avevelocity, 5) + ns(minute, 5), data = wwc_shot, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
fit1
##
## Call: glm(formula = is_goal ~ avevelocity + minute, family = "binomial",
## data = wwc_shot)
##
## Coefficients:
## (Intercept) avevelocity minute
## -2.193336 -0.002242 0.002626
##
## Degrees of Freedom: 1313 Total (i.e. Null); 1311 Residual
## Null Deviance: 904.2
## Residual Deviance: 902.9 AIC: 908.9
fit2
##
## Call: glm(formula = is_goal ~ ns(avevelocity, 5) + minute, family = "binomial",
## data = wwc_shot)
##
## Coefficients:
## (Intercept) ns(avevelocity, 5)1 ns(avevelocity, 5)2
## -1.898402 0.363549 -0.828138
## ns(avevelocity, 5)3 ns(avevelocity, 5)4 ns(avevelocity, 5)5
## 4.333052 -10.340908 -24.026911
## minute
## 0.002746
##
## Degrees of Freedom: 1313 Total (i.e. Null); 1307 Residual
## Null Deviance: 904.2
## Residual Deviance: 889.2 AIC: 903.2
fit3
##
## Call: glm(formula = is_goal ~ avevelocity + ns(minute, 5), family = "binomial",
## data = wwc_shot)
##
## Coefficients:
## (Intercept) avevelocity ns(minute, 5)1 ns(minute, 5)2 ns(minute, 5)3
## -2.113520 -0.002279 -0.145179 -0.012127 -0.176490
## ns(minute, 5)4 ns(minute, 5)5
## 0.889296 1.170555
##
## Degrees of Freedom: 1313 Total (i.e. Null); 1307 Residual
## Null Deviance: 904.2
## Residual Deviance: 898.9 AIC: 912.9
fit4
##
## Call: glm(formula = is_goal ~ ns(avevelocity, 5) + ns(minute, 5), family = "binomial",
## data = wwc_shot)
##
## Coefficients:
## (Intercept) ns(avevelocity, 5)1 ns(avevelocity, 5)2
## -1.80465 0.29724 -0.89697
## ns(avevelocity, 5)3 ns(avevelocity, 5)4 ns(avevelocity, 5)5
## 5.05708 -16.56670 -37.04217
## ns(minute, 5)1 ns(minute, 5)2 ns(minute, 5)3
## -0.13779 0.02824 -0.13456
## ns(minute, 5)4 ns(minute, 5)5
## 1.04613 1.14499
##
## Degrees of Freedom: 1313 Total (i.e. Null); 1303 Residual
## Null Deviance: 904.2
## Residual Deviance: 885 AIC: 907
Based on the AIC scores fit 2 is the better model to show to the coach. The average velocity is broken up into 5 segments each. Since average velocity is most likely not linear using a spline term for that variable is benefitial.
#install.packages('ggpubr')
library("ggpubr")
## Loading required package: magrittr
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
ggplot(wwc_shot,aes(x=is_goal,y=avevelocity))+
coord_cartesian(ylim=c(0,75))+
geom_boxplot()
wwc_shot%>%
summarise(cor_var=cor(is_goal,avevelocity),
r_sq=cor_var^2)
## cor_var r_sq
## 1 -0.01356582 0.0001840315
From the box plot above most goals are made when the average belocity is between 10 and 20. Any shot more powerful than that tends to miss the goal. Not under the spline model as velocity increases by 1 meter/second (or foot/second depending on how it is measured) the odds of a goal going in decreases by .2%. However, when the splice model is applied, under the first segment and the third segment as the velocity increase there is a better odds of the shot to be a goal.
#install.packages('generalhoslem')
library(generalhoslem)
## Loading required package: reshape
##
## Attaching package: 'reshape'
## The following object is masked from 'package:dplyr':
##
## rename
## The following objects are masked from 'package:tidyr':
##
## expand, smiths
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
logitgof(wwc_shot$is_goal,fitted(fit2))
##
## Hosmer and Lemeshow test (binary model)
##
## data: wwc_shot$is_goal, fitted(fit2)
## X-squared = 9.1857, df = 8, p-value = 0.3269
The p value is 0.32 which is above 0.05 so this model is a good fit.
#install.packages('BradleyTerry2')
library(broom)
library(BradleyTerry2)
##
## Attaching package: 'BradleyTerry2'
## The following object is masked from 'package:MASS':
##
## glmmPQL
head(icehockey)
## date visitor v_goals opponent o_goals conference result
## 1 20091008 Quinnipiac 4 Ohio State 2 NC 1
## 2 20091008 Rensselaer 2 Massachusetts 5 NC 0
## 3 20091009 Air Force 1 Bemidji State 3 NC 0
## 4 20091009 Alab-Huntsville 3 Notre Dame 2 NC 1
## 5 20091009 Army 4 Nebraska-Omaha 6 NC 0
## 6 20091009 Bowling Green 2 Minnesota State 3 NC 0
## home.ice
## 1 TRUE
## 2 TRUE
## 3 TRUE
## 4 TRUE
## 5 TRUE
## 6 TRUE
dim(icehockey)
## [1] 1083 8
homeBT <- BTm(result,
data.frame(team = visitor, home.ice = 0),
data.frame(team = opponent, home.ice = home.ice),
~ team + home.ice,
id = "team", data = icehockey)
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
tidy(homeBT)
## # A tibble: 58 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 teamAir Force -1.17 0.665 -1.75 0.0796
## 2 teamAlab-Huntsville -0.449 0.609 -0.738 0.461
## 3 teamAmerican Int'l -2.66 0.721 -3.68 0.000230
## 4 teamAlaska 0.747 0.563 1.33 0.185
## 5 teamArmy -1.60 0.671 -2.38 0.0173
## 6 teamBoston College 1.33 0.622 2.14 0.0325
## 7 teamBowling Green -0.839 0.617 -1.36 0.174
## 8 teamBrown -0.578 0.613 -0.943 0.346
## 9 teamBemidji State 0.885 0.589 1.50 0.133
## 10 teamBoston University 0.540 0.605 0.893 0.372
## # ... with 48 more rows
tidy(homeBT) %>% tail()
## # A tibble: 6 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 teamUnion 0.260 0.593 0.438 0.661
## 2 teamVermont 0.786 0.604 1.30 0.193
## 3 teamWisconsin 1.51 0.514 2.94 0.00328
## 4 teamWestern Michigan -0.304 0.597 -0.509 0.610
## 5 teamYale 0.519 0.629 0.825 0.409
## 6 home.ice 0.403 0.0709 5.69 0.0000000131
head(BTabilities(homeBT), 10)
## ability s.e.
## Alaska Anchorage 0.0000000 0.0000000
## Air Force -1.1654793 0.6647895
## Alab-Huntsville -0.4490054 0.6085037
## American Int'l -2.6554773 0.7208618
## Alaska 0.7468260 0.5630714
## Army -1.5974521 0.6713962
## Boston College 1.3303922 0.6222599
## Bowling Green -0.8385947 0.6168775
## Brown -0.5780410 0.6128835
## Bemidji State 0.8853896 0.5886569
exp(BTabilities(homeBT))
## ability s.e.
## Alaska Anchorage 1.00000000 1.000000
## Air Force 0.31177320 1.944081
## Alab-Huntsville 0.63826266 1.837680
## American Int'l 0.07026529 2.056205
## Alaska 2.11029137 1.756058
## Army 0.20241159 1.956968
## Boston College 3.78252664 1.863134
## Bowling Green 0.43231762 1.853133
## Brown 0.56099626 1.845746
## Bemidji State 2.42392853 1.801567
## Boston University 1.71680397 1.831684
## Bentley 0.16902278 1.966297
## Canisius 0.28381333 1.936726
## Colorado College 2.21945309 1.641583
## Colgate 0.72643606 1.803560
## Clarkson 0.36277622 1.835664
## Cornell 2.03035360 1.852776
## Connecticut 0.08856109 2.019265
## Dartmouth 0.43441365 1.867028
## Denver 5.24912503 1.679575
## Ferris State 2.10379459 1.782883
## Harvard 0.43657336 1.872348
## Holy Cross 0.19282274 1.956077
## Lake Superior 1.22286150 1.770265
## Massachusetts 1.68483975 1.842970
## Minnesota Duluth 2.29803336 1.648588
## Maine 1.85807072 1.814684
## Mercyhurst 0.20491604 1.937892
## Michigan 2.31275488 1.730413
## Minnesota State 1.53035947 1.642416
## UMass Lowell 1.80335896 1.828004
## Miami 4.93906019 1.790529
## Minnesota 2.08858213 1.638189
## Merrimack 1.49842386 1.848816
## Michigan State 1.89648770 1.771067
## Michigan Tech 0.40614831 1.804866
## North Dakota 3.99634938 1.654418
## Northeastern 1.64146457 1.854773
## New Hampshire 2.32847182 1.824355
## Niagara 0.63901348 1.817804
## Northern Michigan 2.35766614 1.755086
## Nebraska-Omaha 1.85642965 1.748304
## Notre Dame 1.15725100 1.774632
## Ohio State 1.56486396 1.770957
## Princeton 0.58045243 1.876601
## Providence 0.80385126 1.862217
## Quinnipiac 0.84248308 1.793050
## Robert Morris 0.65582605 1.808401
## Rensselaer 0.80698602 1.790481
## RIT 0.65859283 1.934463
## St. Cloud State 3.25880926 1.647483
## Sacred Heart 0.46356873 1.938510
## St. Lawrence 0.87097950 1.793989
## Union 1.29646572 1.809295
## Vermont 2.19434490 1.829474
## Wisconsin 4.53576033 1.672390
## Western Michigan 0.73765741 1.817049
## Yale 1.67954887 1.874806
The odds of a team winning at home is .402 times higher.
The probability of Alabama Huntsville beating Air Force with no home-ice advantage is 67.3%. The probability of Alabama Huntsville beating Air Force with home-ice advantage is 75.5%. The probability of Alabama Huntsville beating Air Force with Air Force having home-ice advantage is 58%