What and why

For the first assignment of DATA607, I selected one of the datasets from https://data.fivethirtyeight.com/ and followed the eight-step instructions on the assignment page. I chose the 2022 World Cup Predictions dataset, which has two CSV files: one for the match data and one for the forecast data. I am interested in soccer, so this dataset appealed to me.

The data can be downloaded as a zip file from the GitHub page listed below. The two CSV addresses that follow are the ones used to load the data into this project.

GitHub: data/world-cup-2022 at master · fivethirtyeight/data (github.com)
Match: https://projects.fivethirtyeight.com/soccer-api/international/2022/wc_matches.csv
Forecast: https://projects.fivethirtyeight.com/soccer-api/international/2022/wc_forecasts.csv

Code for reproducibility purposes

This code snippet downloads the data from the web and stores it in two data frames, one for the forecast data and one for the match data. Because the files are read directly from public URLs, anyone with an internet connection can reproduce these steps.

# Written by Koohyar Pooladvand
# Semester: Spring 2024
# What: first assignment for DATA607

# Run the following commands if the packages are not installed
#install.packages("RCurl")
#install.packages("tidyverse")
#install.packages("data.table")

library(RCurl)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ tidyr::complete() masks RCurl::complete()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ dplyr::lag()      masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(data.table)

## 
## Attaching package: 'data.table'
## 
## The following objects are masked from 'package:lubridate':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday, week,
##     yday, year
## 
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## 
## The following object is masked from 'package:purrr':
## 
##     transpose

# Load the forecast data: download the CSV as text, then parse it into a data frame
Int_data_handle <- getURL("https://projects.fivethirtyeight.com/soccer-api/international/2022/wc_forecasts.csv")
Forcast_data <- read.csv(text = Int_data_handle, header = TRUE)

print("The size of the forecast data frame, followed by a look at it using glimpse")

## [1] "The size of the forecast data frame, followed by a look at it using glimpse"

dim(Forcast_data)

## [1] 256  22

glimpse(Forcast_data)

## Rows: 256
## Columns: 22
## $ ï..forecast_timestamp <chr> "2022-12-18 17:56:03 UTC", "2022-12-18 17:56:03 …
## $ team                  <chr> "Argentina", "France", "Morocco", "Croatia", "En…
## $ group                 <chr> "C", "D", "F", "F", "B", "A", "H", "G", "E", "A"…
## $ spi                   <dbl> 89.64860, 88.30043, 73.16416, 78.82038, 87.82131…
## $ global_o              <dbl> 2.83610, 2.96765, 1.74313, 2.20264, 2.71564, 2.5…
## $ global_d              <dbl> 0.39397, 0.54381, 0.53433, 0.60290, 0.44261, 0.5…
## $ sim_wins              <dbl> 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, …
## $ sim_ties              <dbl> 0, 0, 1, 2, 1, 1, 0, 0, 0, 0, 0, 0, 2, 1, 1, 1, …
## $ sim_losses            <dbl> 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, …
## $ sim_goal_diff         <dbl> 3, 3, 3, 3, 7, 4, 2, 2, 1, 1, 1, -1, 1, 6, 0, 0,…
## $ goals_scored          <dbl> 5, 6, 4, 4, 9, 5, 6, 3, 4, 5, 4, 3, 2, 9, 4, 2, …
## $ goals_against         <dbl> 2, 3, 1, 1, 2, 1, 4, 1, 3, 4, 3, 4, 1, 3, 4, 2, …
## $ group_1               <dbl> 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, …
## $ group_2               <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, …
## $ group_3               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ group_4               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ make_round_of_16      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ make_quarters         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ make_semis            <dbl> 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ make_final            <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ win_league            <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ timestamp             <chr> "2022-12-18 17:56:44 UTC", "2022-12-18 17:56:44 …
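The first column name in the output above, `ï..forecast_timestamp`, most likely comes from a UTF-8 byte-order mark at the start of the CSV being read as ordinary characters. As a minimal sketch, assuming that is the cause, the prefix can be stripped from the column names after loading. The data frame `demo` is a hypothetical stand-in, not the real forecast data.

```r
# Hypothetical stand-in for a data frame whose first column name
# picked up a mangled byte-order mark (the "ï.." prefix).
demo <- data.frame(a = "2022-12-18 17:56:03 UTC", team = "Argentina",
                   stringsAsFactors = FALSE)
names(demo)[1] <- "\u00ef..forecast_timestamp"

# Remove the "ï.." prefix if present; all other names are left untouched.
names(demo) <- sub("^\u00ef\\.\\.", "", names(demo))

print(names(demo))
```

The same `sub()` call could be applied to `names(Forcast_data)` right after loading, before any column is referenced by name.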

# Rename the spi column (column 4) to soccer_power_index
colnames(Forcast_data)[4] <- "soccer_power_index"
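Renaming by position works as long as `spi` stays in column 4. A slightly more robust variant, sketched here on a small hypothetical stand-in (`demo`, with values taken from the first two rows of the glimpse output), matches on the column name instead.

```r
# Hypothetical three-column stand-in for the forecast data frame.
demo <- data.frame(team  = c("Argentina", "France"),
                   group = c("C", "D"),
                   spi   = c(89.64860, 88.30043),
                   stringsAsFactors = FALSE)

# Rename by matching the column name rather than relying on its position.
names(demo)[names(demo) == "spi"] <- "soccer_power_index"

print(names(demo))
```

If no column is called `spi`, the assignment changes nothing, so a column reorder in the source file would not silently rename the wrong column.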



# Load the match data: download the CSV as text, then parse it into a data frame
Int_data_handle <- getURL("https://projects.fivethirtyeight.com/soccer-api/international/2022/wc_matches.csv")
Match_data <- read.csv(text = Int_data_handle, header = TRUE)
print("The size of the match data frame, followed by a look at it using glimpse")

## [1] "The size of the match data frame, followed by a look at it using glimpse"

dim(Match_data)

## [1] 64 20

glimpse(Match_data)

## Rows: 64
## Columns: 20
## $ date        <chr> "2022-11-20", "2022-11-21", "2022-11-21", "2022-11-21", "2…
## $ league_id   <int> 1908, 1908, 1908, 1908, 1908, 1908, 1908, 1908, 1908, 1908…
## $ league      <chr> "FIFA World Cup", "FIFA World Cup", "FIFA World Cup", "FIF…
## $ team1       <chr> "Qatar", "England", "Senegal", "USA", "Argentina", "Denmar…
## $ team2       <chr> "Ecuador", "Iran", "Netherlands", "Wales", "Saudi Arabia",…
## $ spi1        <dbl> 51.00, 85.96, 73.84, 74.83, 87.21, 80.02, 74.30, 87.71, 75…
## $ spi2        <dbl> 72.74, 62.17, 86.01, 65.58, 56.87, 65.85, 68.28, 60.83, 78…
## $ prob1       <dbl> 0.2369, 0.6274, 0.2235, 0.4489, 0.7228, 0.5001, 0.4238, 0.…
## $ prob2       <dbl> 0.5045, 0.1187, 0.5053, 0.2591, 0.0807, 0.2054, 0.2802, 0.…
## $ probtie     <dbl> 0.2586, 0.2539, 0.2712, 0.2920, 0.1966, 0.2945, 0.2960, 0.…
## $ proj_score1 <dbl> 1.13, 1.70, 0.99, 1.42, 2.11, 1.44, 1.37, 2.09, 1.18, 2.14…
## $ proj_score2 <dbl> 1.75, 0.58, 1.63, 1.01, 0.54, 0.82, 1.06, 0.65, 1.34, 1.06…
## $ score1      <int> 0, 6, 0, 1, 1, 0, 0, 4, 0, 1, 7, 1, 1, 0, 3, 2, 0, 1, 1, 0…
## $ score2      <int> 2, 2, 2, 1, 2, 0, 0, 1, 0, 2, 0, 0, 0, 0, 2, 0, 2, 3, 1, 0…
## $ xg1         <dbl> 0.23, 1.04, 0.70, 0.33, 1.63, 0.66, 0.45, 3.03, 0.28, 3.10…
## $ xg2         <dbl> 1.14, 1.45, 0.68, 1.78, 0.15, 1.16, 1.02, 0.26, 0.88, 1.20…
## $ nsxg1       <dbl> 0.24, 1.50, 1.22, 0.48, 2.40, 1.33, 1.19, 3.01, 0.54, 3.10…
## $ nsxg2       <dbl> 1.35, 0.32, 1.83, 0.95, 0.53, 0.69, 0.49, 0.30, 0.64, 0.85…
## $ adj_score1  <dbl> 0.00, 5.78, 0.00, 1.05, 1.05, 0.00, 0.00, 4.18, 0.00, 1.05…
## $ adj_score2  <dbl> 2.10, 2.10, 1.58, 1.05, 2.10, 0.00, 0.00, 1.05, 0.00, 2.10…

Including Plots

The plot below shows the Soccer Power Index (SPI) of the four countries that made it to the semi-finals of the World Cup:

# Select the last two matches (rows) and columns 4 to 7 (team1, team2, spi1, spi2).
# The teams in these two matches are the four semi-finalists; plot their SPI.
n_matches <- nrow(Match_data)
world_cup_semi <- Match_data[c(n_matches - 1, n_matches), 4:7]

barplot(c(world_cup_semi$spi1, world_cup_semi$spi2),
        names.arg = c(world_cup_semi$team1, world_cup_semi$team2),
        main = "World Cup semi-finalists: Soccer Power Index (SPI)",
        xlab = "Country", ylab = "SPI", col = "blue")
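The row selection above can also be written with `tail()`, which avoids computing row indices by hand. The sketch below uses a hypothetical data frame `matches` with made-up team names and SPI values, laid out like columns 4 to 7 of `Match_data`.

```r
# Hypothetical stand-in with the same four columns as Match_data[, 4:7];
# team names and SPI values are made up for illustration.
matches <- data.frame(
  team1 = c("A", "B", "C", "D"),
  team2 = c("E", "F", "G", "H"),
  spi1  = c(70, 71, 72, 73),
  spi2  = c(60, 61, 62, 63),
  stringsAsFactors = FALSE
)

# tail() returns the last n rows of the data frame.
last_two <- tail(matches, 2)

# Combine both sides of each match into one named vector.
spi_values <- c(last_two$spi1, last_two$spi2)
names(spi_values) <- c(last_two$team1, last_two$team2)

print(spi_values)
```

Passing a named vector such as `spi_values` to `barplot()` labels the bars with the names, so `names.arg` is not needed in that case.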


First Assignment DATA607

Koohyar Pooladvand

2024-02-06
