Note: This lab is something of a preview to the third chapter of Scorecasting, How Competitive are Competitive Sports. I’ll give you one or two questions on that chapter on the Tuesday back from break.
Let’s start by loading the data and the packages. You know the drill:
p <- read.csv('/home/rstudioshared/shared_files/data/premierleague.csv')
library(dplyr); library(ggplot2)
Now, take a look at the data. It contains seasonal data for Premier League teams going back to the year 2000: wins, losses and draws, goals for and goals against, and points (three points for a win and one point for a draw).
The Big Questions
There are two questions that we are going to try to answer today. The first is how well a team’s record one year predicts their record the next year. In other words, are the same teams good time and again? The second question is whether there’s a better way to predict a team’s record.
To answer either of these questions we’ll need to join data from one year to data from the year after. In other words, we want to join this dataset to itself but we need to stagger it. We’ll do this in two steps:
p <- p %>% mutate(Season_after = Season+1)
p <- left_join(p, p, by=c("Team"="Team", "Season_after"="Season"))
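If the staggered self-join feels abstract, here is a minimal sketch of the same two steps on a tiny made-up table. The `toy` data frame and its values are invented for illustration and are not part of the Premier League file.

```r
library(dplyr)

# Made-up data: two teams, three seasons each
toy <- data.frame(
  Team   = rep(c("A", "B"), each = 3),
  Season = rep(2014:2016, times = 2),
  Pts    = c(60, 65, 70, 40, 38, 45)
)

# Step 1: record which season follows each row
toy <- toy %>% mutate(Season_after = Season + 1)

# Step 2: match each row to the same team's row from the following season
joined <- left_join(toy, toy, by = c("Team" = "Team", "Season_after" = "Season"))

# Pts.x is the original season's total, Pts.y is the following season's total
joined %>% select(Team, Season, Pts.x, Pts.y)
```

Rows for 2016 have no following season in the toy table, so their `Pts.y` is `NA`. The same thing happens in the real data whenever a team has no row for the next season.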
Next, since the 2017 season is not yet complete, let’s remove all rows where the “season after” is 2017 or 2018:
p <- p %>% filter(Season_after <= 2016)
Now is a good time to take another look at your dataset and make sure that you understand what you’ve created. Columns ending in .x come from the original season and columns ending in .y come from the season after.
You can probably guess what comes next…
ggplot(p, aes(Pts.x, Pts.y))+geom_point()+geom_smooth(method="lm")
or, if we’re interested in points per game:
ggplot(p, aes(Pts.x/G.x, Pts.y/G.y))+geom_point()+geom_smooth(method="lm")
Now, to quantify this relationship:
cor(p$Pts.x/p$G.x, p$Pts.y/p$G.y, use="complete.obs")
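Why the `use="complete.obs"` argument? Because we used a left join, any team with no row for the following season (presumably because it dropped out of the league) has `NA` in its `.y` columns. The sketch below uses made-up numbers to show how `cor()` treats missing values with and without that argument.

```r
# Made-up points-per-game values; NA marks a team with no following season
ppg_this <- c(1.2, 2.0, 0.9, 1.6, 1.1)
ppg_next <- c(1.3, 1.8, NA, 1.5, 1.0)

# By default a single missing value makes the whole result NA
r_default <- cor(ppg_this, ppg_next)

# "complete.obs" drops any pair with a missing value before computing
r_complete <- cor(ppg_this, ppg_next, use = "complete.obs")

r_default
r_complete
```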
Let’s also create a linear model (in other words, a best fit line). In the code below, pay attention to the I() function. Inside a model formula, operators such as + and / have special meanings, so we wrap arithmetic in I() to tell R to treat it as an ordinary calculation.
lm(I(Pts.y/G.y) ~ I(Pts.x/G.x), data=p)
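One way to read a model like this is to store it and then ask it for its coefficients and predictions. Here is a minimal sketch on made-up data. The column names `pts_prev`, `g_prev`, `pts_next` and `g_next` are hypothetical stand-ins for the `.x` and `.y` columns, and the numbers are invented, so the output says nothing about the real league.

```r
# Made-up data for illustration only (not the Premier League file)
toy <- data.frame(
  pts_prev = c(30, 45, 60, 75, 50, 40),
  g_prev   = rep(38, 6),
  pts_next = c(38, 44, 58, 70, 52, 41),
  g_next   = rep(38, 6)
)

# Same shape of model as above: next-year points per game on prior-year points per game
fit <- lm(I(pts_next / g_next) ~ I(pts_prev / g_prev), data = toy)

# Intercept and slope of the best fit line
coef(fit)

# Predictions for prior-year points per game of 0 and 3 (0 and 114 points in 38 games).
# newdata must supply the raw columns that the formula uses.
new_teams <- data.frame(pts_prev = c(0, 114), g_prev = 38)
preds <- predict(fit, newdata = new_teams)
preds
```

The predictions are just intercept + slope × (points per game), so you can also work them out by hand from `coef(fit)`.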
According to the model, if a team lost every game one year (getting 0 points per game), how many points per game would they be expected to get the next year?
According to the model, if a team won every game (getting 3 points per game), how many points per game would they be expected to get the next year?
Do teams that are high scoring tend to allow more goals or fewer goals?
Do teams that are high scoring one year tend to be high scoring the next? Or is preventing goals more consistent from one year to the next?
How consistent is goal differential (goals scored minus goals allowed) from one year to the next? How does this compare to the consistency in team points? Is this what you would expect?
Can you devise a way to predict points per game in one year based on numbers from the previous year that’s better than simply using points per game from the previous year? Devise the best method you can for predicting team success based on prior year performance.
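If you are not sure how to compare two prediction methods, here is one possible approach, sketched on simulated data: fit a baseline model and a candidate model, then compare how much of the variation each explains. Everything in this sketch is an assumption made for illustration. The column names (`gd_prev`, `ppg_prev`, `ppg_next`) are hypothetical and the numbers are randomly generated, so the result tells you nothing about the real league.

```r
set.seed(1)  # reproducible made-up data

# Simulated stand-in for the joined table; all column names are hypothetical
n <- 60
toy <- data.frame(gd_prev = rnorm(n, mean = 0, sd = 0.6))       # prior-year goal differential per game
toy$ppg_prev <- 1.4 + 0.6 * toy$gd_prev + rnorm(n, sd = 0.15)   # prior-year points per game
toy$ppg_next <- 1.4 + 0.45 * toy$gd_prev + rnorm(n, sd = 0.25)  # next-year points per game

# Baseline: prior-year points per game only
fit_base <- lm(ppg_next ~ ppg_prev, data = toy)

# Candidate: add prior-year goal differential per game as a second predictor
fit_gd <- lm(ppg_next ~ ppg_prev + gd_prev, data = toy)

# Adjusted R-squared penalizes extra predictors, so it is a fairer comparison
c(base    = summary(fit_base)$adj.r.squared,
  with_gd = summary(fit_gd)$adj.r.squared)
```

Plain R-squared never goes down when you add a predictor, which is why the sketch compares adjusted R-squared instead. With the real data you would build the predictors from the `.x` columns and the outcome from the `.y` columns.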