Assignment Requirements

DATA607 Week 1 assignment: choose one of the provided datasets on fivethirtyeight.com that you find interesting:

  1. Take the data, and create one or more code blocks.
    • You should finish with a data frame that contains a subset of the columns in your selected dataset. If there is an obvious target (aka response or dependent) variable, you should include this in your set of columns.
    • You should include (or add if necessary) meaningful column names and replace (if necessary) any non-intuitive abbreviations used in the data that you selected. For example: if you had instead been tasked with working with the UCI mushroom dataset, you would include the target column for edible or poisonous, and transform “e” values to “edible.”
    • Your deliverable is the R code to perform these transformation tasks.
  2. Make sure that the original data file is accessible through your code—for example, stored in a GitHub repository or AWS S3 bucket and referenced in your code. If the code references data on your local machine, then your work is not reproducible!
  3. Start your R Markdown document with a two to three sentence “Overview” or “Introduction” description of what the article that you chose is about, and include a link to the article.
  4. Finish with a “Conclusions” or “Findings and Recommendations” text block that includes what you might do to extend, verify, or update the work from the selected article.
  5. Each of your text blocks should include at least one header and additional non-header text.
  6. You’re of course welcome, but not required, to include additional information, such as exploratory data analysis graphics (which we will cover later in the course).
  7. Place your solution into a single R Markdown (.Rmd) file and publish your solution out to rpubs.com.
  8. Post the .Rmd file in your GitHub repository, and provide the appropriate URLs to your GitHub repository and your rpubs.com file in your assignment link.

Overview

FiveThirtyEight’s Club Soccer Predictions site was designed to forecast futbol match outcomes. The dataset includes several variables, among them the Soccer Power Index (SPI), an overall measure of a team’s strength, as well as the Home and Visiting teams’ probabilities of winning and their projected scores.

Data Dictionary

The following lists the variables in the dataset, as documented in the creator’s GitHub repository:

Attribute     Description
season        Season year
date          Date of match
league_id     Unique ID number for league
league        League name
team1         Home Team
team2         Visiting Team
spi1          FiveThirtyEight’s Soccer Power Index (overall strength) for Home Team
spi2          FiveThirtyEight’s Soccer Power Index (overall strength) for Visiting Team
prob1         Probability Home Team wins
prob2         Probability Visiting Team wins
probtie       Probability of a tie
proj_score1   Projected goals scored by Home Team
proj_score2   Projected goals scored by Visiting Team
importance1   A measure of how much the outcome of the match will change Home Team’s statistical outlook on the season
importance2   A measure of how much the outcome of the match will change Visiting Team’s statistical outlook on the season
score1        Actual goals scored by Home Team
score2        Actual goals scored by Visiting Team
xg1           An estimate of how many goals Home Team “should” have scored, given the shots they took in that match
xg2           An estimate of how many goals Visiting Team “should” have scored, given the shots they took in that match
nsxg1         An estimate of how many goals Home Team “should” have scored based on non-shooting actions they took around the opposing team’s goal: passes, interceptions, take-ons and tackles
nsxg2         An estimate of how many goals Visiting Team “should” have scored based on non-shooting actions they took around the opposing team’s goal: passes, interceptions, take-ons and tackles
adj_score1    Home Team score adjusted for the conditions under which each goal was scored: goals scored when a team has more players on the field, or late in a match when a team is already leading, are given less value
adj_score2    Visiting Team score adjusted for the conditions under which each goal was scored: goals scored when a team has more players on the field, or late in a match when a team is already leading, are given less value

Load Data

Data Set Title: spi_matches

Source Information:

  1. Creators: Jay Boice
  2. Github: jayb
  3. Most recent update: 6/2/2020
spi_matches <- read.csv("https://projects.fivethirtyeight.com/soccer-api/club/spi_matches.csv", header = TRUE, sep = ",")
head(spi_matches)
##   season       date league_id                  league            team1
## 1   2016 2016-07-09      7921 FA Women's Super League  Liverpool Women
## 2   2016 2016-07-10      7921 FA Women's Super League    Arsenal Women
## 3   2016 2016-07-10      7921 FA Women's Super League Chelsea FC Women
## 4   2016 2016-07-16      7921 FA Women's Super League  Liverpool Women
## 5   2016 2016-07-17      7921 FA Women's Super League Chelsea FC Women
## 6   2016 2016-07-24      7921 FA Women's Super League          Reading
##                 team2  spi1  spi2  prob1  prob2 probtie proj_score1 proj_score2
## 1             Reading 51.56 50.42 0.4389 0.2767  0.2844        1.39        1.05
## 2 Notts County Ladies 46.61 54.03 0.3572 0.3608  0.2819        1.27        1.28
## 3     Birmingham City 59.85 54.64 0.4799 0.2487  0.2714        1.53        1.03
## 4 Notts County Ladies 53.00 52.35 0.4289 0.2699  0.3013        1.27        0.94
## 5       Arsenal Women 59.43 60.99 0.4124 0.3157  0.2719        1.45        1.24
## 6     Birmingham City 50.75 55.03 0.3821 0.3200  0.2979        1.22        1.09
##   importance1 importance2 score1 score2 xg1 xg2 nsxg1 nsxg2 adj_score1
## 1          NA          NA      2      0  NA  NA    NA    NA         NA
## 2          NA          NA      2      0  NA  NA    NA    NA         NA
## 3          NA          NA      1      1  NA  NA    NA    NA         NA
## 4          NA          NA      0      0  NA  NA    NA    NA         NA
## 5          NA          NA      1      2  NA  NA    NA    NA         NA
## 6          NA          NA      1      1  NA  NA    NA    NA         NA
##   adj_score2
## 1         NA
## 2         NA
## 3         NA
## 4         NA
## 5         NA
## 6         NA

Subsetting the Data Set

What was the biggest Home Team upset over the past 4 years?

Initial subset

To decide which match(es) can be considered the greatest upset, I compare the Home Team’s SPI and probability of winning against the actual final score. My subset includes games where the Home Team’s Soccer Power Index and probability of winning are both lower than those of its opponent. Additionally, the subset only includes games where the Home Team won despite those statistics.

# Identify upsets: the Home Team had the lower SPI and win probability but still won
upset <- subset(spi_matches,
                spi1 < spi2 & prob1 < prob2 & score1 > score2,
                select = c(season, team1, team2, spi1, spi2, score1, score2, prob1))
# Combine both scores into a single "home - away" string
upset$score1 <- paste(upset$score1, "-", upset$score2)
# Remove the now-redundant column
upset$score2 <- NULL

head(upset,3)
##    season         team1               team2  spi1  spi2 score1  prob1
## 2    2016 Arsenal Women Notts County Ladies 46.61 54.03  2 - 0 0.3572
## 13   2016     Hull City      Leicester City 53.57 66.81  2 - 1 0.3459
## 22   2016          Metz               Lille 54.34 66.10  3 - 2 0.3444
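
To make the filter logic easier to follow, here is a minimal sketch on a toy data frame. The team names and numbers are hypothetical and are not taken from spi_matches. It also illustrates that subset() treats a missing value in the condition as FALSE, so matches without a recorded score are dropped.

```r
# Toy data with hypothetical teams and values (not from spi_matches)
toy <- data.frame(
  team1  = c("A", "B", "C", "D"),
  team2  = c("W", "X", "Y", "Z"),
  spi1   = c(40, 70, 45, 50),
  spi2   = c(60, 50, 65, 80),
  prob1  = c(0.25, 0.55, 0.30, 0.20),
  prob2  = c(0.50, 0.20, 0.45, 0.55),
  score1 = c(2, 3, 0, NA),
  score2 = c(1, 0, 1, NA),
  stringsAsFactors = FALSE
)

# Inside subset(), columns can be referenced by their bare names
toy_upset <- subset(toy, spi1 < spi2 & prob1 < prob2 & score1 > score2)

# Team B was the favorite, team C lost, and team D has no score (NA),
# so only the match hosted by team A qualifies as an upset
print(toy_upset)
```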

Calculations and Renaming

The upset data frame contains matches where the Home Team had the lower SPI. I am interested in the difference in SPI per match for a clearer assessment. The scores were concatenated above since they are not used in any calculations. Columns will be renamed for better understanding.

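# Absolute SPI gap between the two teams
upset$spi3 <- abs(upset$spi1 - upset$spi2)
# Remove the columns that are no longer needed
upset$spi1 <- NULL
upset$spi2 <- NULL
# "Odds" holds the Home Team's pre-match win probability (prob1)
colnames(upset) <- c("Season", "Home Team", "Visiting Team", "Score", "Odds", "SPI Difference")
head(upset, 4)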
##    Season     Home Team       Visiting Team Score   Odds SPI Difference
## 2    2016 Arsenal Women Notts County Ladies 2 - 0 0.3572           7.42
## 13   2016     Hull City      Leicester City 2 - 1 0.3459          13.24
## 22   2016          Metz               Lille 3 - 2 0.3444          11.76
## 40   2016       Burnley           Liverpool 2 - 0 0.2079          21.02
## [1] "The average Odds for games with upsets is: 0.283556659765355 with a standard deviation of 0.0646828335253701"
## [1] "The average difference in SPI for games with upsets is: 13.89983436853 with a standard deviation of 7.35144345861862"
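
The chunk that printed the two summary lines above is not shown in the report. A minimal sketch of how such lines might be produced follows, assuming a hypothetical helper named summarize_column and toy values rather than the real upset data.

```r
# Hypothetical helper: the original code that printed the summary is not shown
summarize_column <- function(values, label) {
  paste("The average", label, "for games with upsets is:", mean(values),
        "with a standard deviation of", sd(values))
}

# Toy win probabilities (not from the dataset)
toy_odds <- c(0.20, 0.30, 0.40)

summary_line <- summarize_column(toy_odds, "Odds")
print(summary_line)
```

With the real data, the calls would presumably be summarize_column(upset$Odds, "Odds") and summarize_column(upset$`SPI Difference`, "difference in SPI").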

In my view, the biggest Home Team upsets of the past 4 years are matches where:

  1. The Home Team’s probability of winning is at least one standard deviation below the average among our upsets.
  2. The difference in SPI between the teams is at least one standard deviation above the mean.

NOTE: The subset is already filtered to include only matches where the Home Team was at a disadvantage in both probability of winning and SPI.

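# NOTE: `<` keeps SPI gaps BELOW mean + 1 sd; requiring a gap at least 1 sd above the mean needs `>` and would change the results below
biggest_upsets <- subset(upset, Odds < (mean(upset$Odds) - sd(upset$Odds)) &
                             `SPI Difference` < (mean(upset$`SPI Difference`) + sd(upset$`SPI Difference`)))
# Sort by the Home Team's win probability, lowest first
biggest_upsets <- biggest_upsets[order(biggest_upsets$Odds), ]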
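
As a worked check, the cutoffs used by this filter can be derived from the summary statistics printed earlier. The sketch below hard-codes those reported values and tests one row from the head(upset, 4) output (Burnley vs. Liverpool, 2016) against the comparisons as they are written in the code, both of which use `<`.

```r
# Summary statistics copied from the printed output earlier in the report
mean_odds <- 0.283556659765355
sd_odds   <- 0.0646828335253701
mean_spi  <- 13.89983436853
sd_spi    <- 7.35144345861862

odds_cutoff <- mean_odds - sd_odds  # roughly 0.219
spi_cutoff  <- mean_spi + sd_spi    # roughly 21.25

# Burnley vs. Liverpool (2016): Odds 0.2079, SPI Difference 21.02
burnley_odds_ok <- 0.2079 < odds_cutoff
burnley_spi_ok  <- 21.02 < spi_cutoff

print(c(odds_cutoff = odds_cutoff, spi_cutoff = spi_cutoff))
print(c(burnley_odds_ok, burnley_spi_ok))
```

Both comparisons hold for this match, so it passes the filter as coded even though its SPI gap of 21.02 sits just under the cutoff of mean plus one standard deviation.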

Conclusions

I believe the ten matches below reflect the biggest Home Team upsets in futbol over the last 4 years. I acknowledge that my final data set, biggest_upsets, included 161 observations, and that my top ten is based on my decision to sort by probability of winning, which I believe carries more weight than SPI.

       Season  Home Team         Visiting Team       Score  Odds    SPI Difference
2096   2017    Boston Breakers   Seattle Reign FC    3 - 0  0.0497  20.16
2020   2017    Boston Breakers   Sky Blue FC         1 - 0  0.1268  10.60
18237  2018    Newport County    Mansfield Town      1 - 0  0.1293  15.97
33217  2020    Atlanta United 2  Miami FC            4 - 3  0.1383  20.61
19266  2018    Port Vale         Mansfield Town      2 - 1  0.1444  14.18
14280  2018    Alavés            Real Madrid         1 - 0  0.1453  21.13
25286  2019    Atlanta United 2  Indy Eleven         2 - 1  0.1462  18.47
17132  2018    Cheltenham Town   Milton Keynes Dons  3 - 1  0.1463  15.74
18447  2018    Notts County      Mansfield Town      1 - 0  0.1488  15.82
15960  2018    Stevenage         Milton Keynes Dons  3 - 2  0.1500  15.89
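
The code that selected the top ten rows is not shown in the report. One plausible way to do it, sketched here on a toy data frame with hypothetical teams and values, is to sort by Odds and keep the first rows with head().

```r
# Toy stand-in for biggest_upsets (hypothetical teams and values)
toy_upsets <- data.frame(
  `Home Team`      = c("A", "B", "C", "D"),
  Odds             = c(0.15, 0.05, 0.21, 0.13),
  `SPI Difference` = c(15.9, 20.2, 8.4, 10.6),
  check.names = FALSE,       # keep the spaces in the column names
  stringsAsFactors = FALSE
)

# Sort by win probability, lowest first, then keep the first two rows
ranked  <- toy_upsets[order(toy_upsets$Odds), ]
top_two <- head(ranked, 2)
print(top_two)
```

On the real data, the equivalent call would be head(biggest_upsets, 10), since biggest_upsets is already sorted by Odds.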