## Introduction

This data set contains predictions of soccer match outcomes based on FiveThirtyEight’s Soccer Power Index (SPI). The index rates each team’s offensive and defensive strength, and the two ratings are combined to generate win probabilities and expected scores for upcoming matches.

`spi_matches.csv` contains match-by-match SPI ratings and forecasts going back to 2016. Article link: https://projects.fivethirtyeight.com/soccer-predictions/

I selected the columns most relevant to the forecasts: match date, team names, the SPI rating of each team, each team’s win probability, and the tie probability. The columns were renamed to make them more intuitive.

spi_matches <- readr::read_csv("https://projects.fivethirtyeight.com/soccer-api/club/spi_matches.csv")
## Rows: 68913 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (3): league, team1, team2
## dbl  (19): season, league_id, spi1, spi2, prob1, prob2, probtie, proj_score1...
## date  (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Select a subset of the columns (date, team names, SPI ratings, win and tie probabilities)
spi_matches_subset <- dplyr::select(spi_matches, date, team1, team2, spi1, spi2, prob1, prob2, probtie)

# Rename columns for better clarity, explicitly calling dplyr's rename function
spi_matches_subset <- dplyr::rename(spi_matches_subset,
    Date = date,
    Team_1 = team1,
    Team_2 = team2,
    Team_1_SPI = spi1,
    Team_2_SPI = spi2,
    Team_1_Win_Prob = prob1,
    Team_2_Win_Prob = prob2,
    Tie_Prob = probtie
)

# Display first few rows of the cleaned dataset
head(spi_matches_subset)
## # A tibble: 6 × 8
##   Date       Team_1 Team_2 Team_1_SPI Team_2_SPI Team_1_Win_Prob Team_2_Win_Prob
##   <date>     <chr>  <chr>       <dbl>      <dbl>           <dbl>           <dbl>
## 1 2016-07-09 Liver… Readi…       51.6       50.4           0.439           0.277
## 2 2016-07-10 Arsen… Notts…       46.6       54.0           0.357           0.361
## 3 2016-07-10 Chels… Birmi…       59.8       54.6           0.480           0.249
## 4 2016-07-16 Liver… Notts…       53         52.4           0.429           0.270
## 5 2016-07-17 Chels… Arsen…       59.4       61.0           0.412           0.316
## 6 2016-07-24 Readi… Birmi…       50.8       55.0           0.382           0.32 
## # ℹ 1 more variable: Tie_Prob <dbl>
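
Because a match has three possible outcomes, the two win probabilities and the tie probability should sum to about one, so the tie probability can be recovered as the remainder. The sketch below is illustrative only: it re-enters the two win probabilities from the first three printed rows (rounded as displayed), and `Implied_Tie_Prob` and `Favored` are hypothetical column names that are not part of the data set.

```r
# Toy rows re-entered from the printed output above (values rounded as displayed).
toy <- data.frame(
  Team_1_Win_Prob = c(0.439, 0.357, 0.480),
  Team_2_Win_Prob = c(0.277, 0.361, 0.249)
)

# Implied tie probability: whatever remains after both win probabilities.
toy$Implied_Tie_Prob <- 1 - toy$Team_1_Win_Prob - toy$Team_2_Win_Prob

# Most likely outcome per match: position of the largest of the three probabilities.
outcomes <- c("Team_1", "Team_2", "Tie")
probs <- as.matrix(toy[, c("Team_1_Win_Prob", "Team_2_Win_Prob", "Implied_Tie_Prob")])
toy$Favored <- outcomes[max.col(probs, ties.method = "first")]

print(toy)
```

On the real data, comparing `Tie_Prob` against `1 - Team_1_Win_Prob - Team_2_Win_Prob` would flag any rows where the three probabilities do not add up.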

## Conclusion

In future work, this data set could be used with machine learning techniques to examine how well SPI ratings correlate with actual match outcomes.
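
One possible starting point, offered as a minimal sketch rather than a finished model, is a logistic regression of the match result on the SPI difference. It assumes the full file also provides final scores in columns named `score1` and `score2`, which are not visible in the truncated column specification above. `add_outcome` and `fit_spi_model` are hypothetical helper names, and the demo rows are simulated purely so that the code runs on its own.

```r
# Derive the modelling columns from raw match rows.
# ASSUMPTION: `score1`/`score2` hold final scores; verify against the real file.
add_outcome <- function(matches) {
  # Drop fixtures without a recorded result.
  played <- matches[!is.na(matches$score1) & !is.na(matches$score2), ]
  played$spi_diff <- played$spi1 - played$spi2
  played$team1_win <- as.integer(played$score1 > played$score2)
  played
}

# Logistic regression: P(team 1 wins) as a function of the SPI difference.
fit_spi_model <- function(matches) {
  glm(team1_win ~ spi_diff, data = add_outcome(matches), family = binomial())
}

# Simulated demo data (NOT real results), only to show the call pattern.
set.seed(1)
n <- 500
demo <- data.frame(spi1 = runif(n, 30, 90), spi2 = runif(n, 30, 90))
demo$score1 <- rpois(n, lambda = exp(0.02 * (demo$spi1 - demo$spi2)))
demo$score2 <- rpois(n, lambda = exp(0.02 * (demo$spi2 - demo$spi1)))

model <- fit_spi_model(demo)
print(coef(model))
```

On the real data the call would be `fit_spi_model(spi_matches)`, provided the assumed score columns exist. Draws and team 2 wins are both coded as 0 here, so a multinomial model would be needed to predict ties separately.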