DATA607 Week 1 assignment: choose one of the provided datasets on fivethirtyeight.com that you find interesting:
FiveThirtyEight’s Club Soccer Predictions site was designed to forecast the outcomes of soccer matches. The data is broken down into several variables, including the Soccer Power Index (SPI), an overall measure of a team’s strength, as well as each team’s probability of winning and projected score.
The following table lists the variables in the dataset, as described on the creators’ GitHub:
| Attributes | Description |
|---|---|
| season | Season Year |
| date | Date of Match |
| league_id | Unique Id # for League |
| league | League Name |
| team1 | Home Team |
| team2 | Visiting Team |
| spi1 | FiveThirtyEight’s Soccer Power Index (Overall Strength) for Home Team |
| spi2 | FiveThirtyEight’s Soccer Power Index (Overall Strength) for Visiting Team |
| prob1 | Probability Home Team Wins |
| prob2 | Probability Visiting Team Wins |
| probtie | Probability of a Tie |
| proj_score1 | Projected goals scored by Home Team |
| proj_score2 | Projected goals scored by Visiting Team |
| importance1 | A measure of how much the outcome of the match will change Home team’s statistical outlook on the season |
| importance2 | A measure of how much the outcome of the match will change Visiting team’s statistical outlook on the season |
| score1 | Actual goals scored by Home Team |
| score2 | Actual goals scored by Visiting Team |
| xg1 | An estimate of how many goals Home Team “should” have scored, given the shots they took in that match |
| xg2 | An estimate of how many goals Visiting Team “should” have scored, given the shots they took in that match |
| nsxg1 | An estimate of how many goals Home Team “should” have scored based on non-shooting actions they took around the opposing team’s goal: passes, interceptions, take-ons and tackles |
| nsxg2 | An estimate of how many goals Visiting Team “should” have scored based on non-shooting actions they took around the opposing team’s goal: passes, interceptions, take-ons and tackles |
| adj_score1 | Home Team score adjusted for the conditions under which each goal was scored: goals scored when a team has more players on the field, and goals scored late in a match when a team is already leading, are given reduced value |
| adj_score2 | Visiting Team score adjusted for the conditions under which each goal was scored: goals scored when a team has more players on the field, and goals scored late in a match when a team is already leading, are given reduced value |
spi_matches <- read.csv("https://projects.fivethirtyeight.com/soccer-api/club/spi_matches.csv", header = TRUE, sep = ",")
head(spi_matches)
## season date league_id league team1
## 1 2016 2016-07-09 7921 FA Women's Super League Liverpool Women
## 2 2016 2016-07-10 7921 FA Women's Super League Arsenal Women
## 3 2016 2016-07-10 7921 FA Women's Super League Chelsea FC Women
## 4 2016 2016-07-16 7921 FA Women's Super League Liverpool Women
## 5 2016 2016-07-17 7921 FA Women's Super League Chelsea FC Women
## 6 2016 2016-07-24 7921 FA Women's Super League Reading
## team2 spi1 spi2 prob1 prob2 probtie proj_score1 proj_score2
## 1 Reading 51.56 50.42 0.4389 0.2767 0.2844 1.39 1.05
## 2 Notts County Ladies 46.61 54.03 0.3572 0.3608 0.2819 1.27 1.28
## 3 Birmingham City 59.85 54.64 0.4799 0.2487 0.2714 1.53 1.03
## 4 Notts County Ladies 53.00 52.35 0.4289 0.2699 0.3013 1.27 0.94
## 5 Arsenal Women 59.43 60.99 0.4124 0.3157 0.2719 1.45 1.24
## 6 Birmingham City 50.75 55.03 0.3821 0.3200 0.2979 1.22 1.09
## importance1 importance2 score1 score2 xg1 xg2 nsxg1 nsxg2 adj_score1
## 1 NA NA 2 0 NA NA NA NA NA
## 2 NA NA 2 0 NA NA NA NA NA
## 3 NA NA 1 1 NA NA NA NA NA
## 4 NA NA 0 0 NA NA NA NA NA
## 5 NA NA 1 2 NA NA NA NA NA
## 6 NA NA 1 1 NA NA NA NA NA
## adj_score2
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
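As a quick sanity check on the three probability columns, the sketch below rebuilds the first three rows of the output above by hand (so it runs without a network connection) and checks that `prob1`, `prob2`, and `probtie` sum to roughly 1 for each match. The data frame name `first_rows` and the 0.001 rounding tolerance are assumptions chosen for illustration, not part of the original analysis.

```r
# First three rows of the printed output, typed in by hand (hypothetical name)
first_rows <- data.frame(
  team1   = c("Liverpool Women", "Arsenal Women", "Chelsea FC Women"),
  team2   = c("Reading", "Notts County Ladies", "Birmingham City"),
  prob1   = c(0.4389, 0.3572, 0.4799),
  prob2   = c(0.2767, 0.3608, 0.2487),
  probtie = c(0.2844, 0.2819, 0.2714),
  stringsAsFactors = FALSE
)

# Home win, away win, and tie probabilities for one match should sum to about 1
prob_total <- first_rows$prob1 + first_rows$prob2 + first_rows$probtie

# Allow a small tolerance (an assumption) for rounding in the published values
probs_ok <- all(abs(prob_total - 1) < 0.001)
print(probs_ok)
```

The same check could be run on the full `spi_matches` data frame by swapping it in for `first_rows`.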
To decide which matches can be considered the greatest upsets, I chose to compare the Home Team’s SPI and probability of winning against the actual final score. My subset will include games where the Home Team’s Soccer Power Index and probability of winning are both lower than those of their opponent. Additionally, my subset will only include games where the Home Team won despite these statistics.
# Identifying match upsets: Home Team has lower SPI and lower win probability, but wins
upset <- subset(spi_matches, spi1 < spi2 & prob1 < prob2 & score1 > score2, select = c(season, team1, team2, spi1, spi2, score1, score2, prob1))
# Concatenating the home and visiting scores into a single column
upset$score1 <- paste(upset$score1, "-", upset$score2)
# Removing unneeded column
upset$score2 <- NULL
head(upset, 3)
## season team1 team2 spi1 spi2 score1 prob1
## 2 2016 Arsenal Women Notts County Ladies 46.61 54.03 2 - 0 0.3572
## 13 2016 Hull City Leicester City 53.57 66.81 2 - 1 0.3459
## 22 2016 Metz Lille 54.34 66.10 3 - 2 0.3444
The `upset` data frame contains matches where the Home Team had a lower SPI and still won. To assess each match more clearly, I want to see the difference in SPI between the two teams. The scores were concatenated above since they are not used in any calculations. Columns will be renamed for readability.
# Absolute SPI gap between the two teams
upset$spi3 <- abs(upset$spi1 - upset$spi2)
# Removing unneeded columns
upset$spi1 <- NULL
upset$spi2 <- NULL
colnames(upset) <- c("Season", "Home Team", "Visiting Team", "Score", "Odds", "SPI Difference")
head(upset, 4)
## Season Home Team Visiting Team Score Odds SPI Difference
## 2 2016 Arsenal Women Notts County Ladies 2 - 0 0.3572 7.42
## 13 2016 Hull City Leicester City 2 - 1 0.3459 13.24
## 22 2016 Metz Lille 3 - 2 0.3444 11.76
## 40 2016 Burnley Liverpool 2 - 0 0.2079 21.02
## [1] "The average Odds for games with upsets is: 0.283556659765355 with a standard deviation of 0.0646828335253701"
## [1] "The average difference in SPI for games with upsets is: 13.89983436853 with a standard deviation of 7.35144345861862"
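The code that produced the two summary lines above is not shown. One way to build sentences like these is with `mean()`, `sd()`, and `paste()`. The sketch below uses only the four rows printed earlier, under the hypothetical name `upset_sample`, so its numbers will differ from the full-data figures.

```r
# Four rows printed above, typed in by hand (hypothetical name);
# check.names = FALSE keeps the space in "SPI Difference"
upset_sample <- data.frame(
  "Odds" = c(0.3572, 0.3459, 0.3444, 0.2079),
  "SPI Difference" = c(7.42, 13.24, 11.76, 21.02),
  check.names = FALSE
)

# Mean and standard deviation of the Home Team's win probability
odds_summary <- paste(
  "The average Odds for games with upsets is:", mean(upset_sample$Odds),
  "with a standard deviation of", sd(upset_sample$Odds)
)

# Column names containing spaces need backticks when used with $
spi_summary <- paste(
  "The average difference in SPI for games with upsets is:",
  mean(upset_sample$`SPI Difference`),
  "with a standard deviation of", sd(upset_sample$`SPI Difference`)
)

print(odds_summary)
print(spi_summary)
```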
In my view, the biggest Home Team upsets of the past 4 years are matches where:
NOTE: The subset is already filtered to include only matches where the Home Team is at a disadvantage in both probability of winning and SPI.
I believe the ten matches below reflect the biggest upsets in soccer over the last 4 years. I acknowledge that my final data set, biggest_upsets, included 161 observations, and that my top ten is based on my decision to sort by probability of winning, which I believe carries more weight than SPI.
| Row | Season | Home Team | Visiting Team | Score | Odds | SPI Difference |
|---|---|---|---|---|---|---|
| 2096 | 2017 | Boston Breakers | Seattle Reign FC | 3 - 0 | 0.0497 | 20.16 |
| 2020 | 2017 | Boston Breakers | Sky Blue FC | 1 - 0 | 0.1268 | 10.60 |
| 18237 | 2018 | Newport County | Mansfield Town | 1 - 0 | 0.1293 | 15.97 |
| 33217 | 2020 | Atlanta United 2 | Miami FC | 4 - 3 | 0.1383 | 20.61 |
| 19266 | 2018 | Port Vale | Mansfield Town | 2 - 1 | 0.1444 | 14.18 |
| 14280 | 2018 | Alavés | Real Madrid | 1 - 0 | 0.1453 | 21.13 |
| 25286 | 2019 | Atlanta United 2 | Indy Eleven | 2 - 1 | 0.1462 | 18.47 |
| 17132 | 2018 | Cheltenham Town | Milton Keynes Dons | 3 - 1 | 0.1463 | 15.74 |
| 18447 | 2018 | Notts County | Mansfield Town | 1 - 0 | 0.1488 | 15.82 |
| 15960 | 2018 | Stevenage | Milton Keynes Dons | 3 - 2 | 0.1500 | 15.89 |
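Ranking by probability of winning, as described above, amounts to ordering the rows by `Odds` from lowest to highest and keeping the first ten. The sketch below shows that step on five rows copied from the table and deliberately entered out of order. The name `upsets_sample` is hypothetical, and the filtering that produced the 161-row `biggest_upsets` data set is not reproduced here.

```r
# Five rows from the table above, entered out of order (hypothetical name)
upsets_sample <- data.frame(
  "Home Team" = c("Cheltenham Town", "Newport County", "Boston Breakers",
                  "Port Vale", "Atlanta United 2"),
  "Visiting Team" = c("Milton Keynes Dons", "Mansfield Town", "Seattle Reign FC",
                      "Mansfield Town", "Miami FC"),
  "Odds" = c(0.1463, 0.1293, 0.0497, 0.1444, 0.1383),
  "SPI Difference" = c(15.74, 15.97, 20.16, 14.18, 20.61),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# order() returns row positions from lowest to highest Odds,
# so the least likely Home Team wins come first
ranked <- upsets_sample[order(upsets_sample$Odds), ]

# head() keeps at most the first 10 rows; with five rows, all are returned
top_upsets <- head(ranked, 10)
print(top_upsets)
```

To weight SPI as a tie-breaker, a second key could be added, for example `order(upsets_sample$Odds, -upsets_sample$"SPI Difference")`.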