What and why

For the first assignment of DATA607, I selected one of the datasets from https://data.fivethirtyeight.com/ and followed the eight-step instructions on the assignment page. I chose the 2022 World Cup Predictions dataset, which has two CSV files: one for the match data and one for the forecast data. I am interested in soccer, so this dataset appealed to me.

The data can be downloaded from the website as a zip file. The code below instead reads the two CSV files directly from their web addresses, so the data is pulled in as needed.

Code for reproducibility purposes

This code snippet shows how to download the data from the web and store it in data frames, one for the forecast data and one for the match data. Anyone with an internet connection can access the data:

# Written by Koohyar Pooladvand
# Semester: Spring 2024
# What: first assignment for DATA607

# Run the following commands if the packages are not installed
#install.packages("RCurl")
#install.packages("tidyverse")
#install.packages("data.table")

library(RCurl)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ tidyr::complete() masks RCurl::complete()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ dplyr::lag()      masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(data.table)
## 
## Attaching package: 'data.table'
## 
## The following objects are masked from 'package:lubridate':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday, week,
##     yday, year
## 
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## 
## The following object is masked from 'package:purrr':
## 
##     transpose
# Load the forecast data
Int_data_handle <- getURL("https://projects.fivethirtyeight.com/soccer-api/international/2022/wc_forecasts.csv")
Forcast_data<-data.frame(read.csv(text=Int_data_handle, header=TRUE))

print("The size of the forecast data frame, followed by a look at it using glimpse")
## [1] "The size of the forecast data frame, followed by a look at it using glimpse"
dim(Forcast_data)
## [1] 256  22
glimpse(Forcast_data)
## Rows: 256
## Columns: 22
## $ ï..forecast_timestamp <chr> "2022-12-18 17:56:03 UTC", "2022-12-18 17:56:03 …
## $ team                  <chr> "Argentina", "France", "Morocco", "Croatia", "En…
## $ group                 <chr> "C", "D", "F", "F", "B", "A", "H", "G", "E", "A"…
## $ spi                   <dbl> 89.64860, 88.30043, 73.16416, 78.82038, 87.82131…
## $ global_o              <dbl> 2.83610, 2.96765, 1.74313, 2.20264, 2.71564, 2.5…
## $ global_d              <dbl> 0.39397, 0.54381, 0.53433, 0.60290, 0.44261, 0.5…
## $ sim_wins              <dbl> 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, …
## $ sim_ties              <dbl> 0, 0, 1, 2, 1, 1, 0, 0, 0, 0, 0, 0, 2, 1, 1, 1, …
## $ sim_losses            <dbl> 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, …
## $ sim_goal_diff         <dbl> 3, 3, 3, 3, 7, 4, 2, 2, 1, 1, 1, -1, 1, 6, 0, 0,…
## $ goals_scored          <dbl> 5, 6, 4, 4, 9, 5, 6, 3, 4, 5, 4, 3, 2, 9, 4, 2, …
## $ goals_against         <dbl> 2, 3, 1, 1, 2, 1, 4, 1, 3, 4, 3, 4, 1, 3, 4, 2, …
## $ group_1               <dbl> 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, …
## $ group_2               <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, …
## $ group_3               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ group_4               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ make_round_of_16      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ make_quarters         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ make_semis            <dbl> 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ make_final            <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ win_league            <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ timestamp             <chr> "2022-12-18 17:56:44 UTC", "2022-12-18 17:56:44 …
# Rename the spi column (column 4) to soccer_power_index
colnames(Forcast_data)[4] <- "soccer_power_index"



# load the match data 
Int_data_handle <- getURL("https://projects.fivethirtyeight.com/soccer-api/international/2022/wc_matches.csv")
Match_data<-data.frame(read.csv(text=Int_data_handle, header=TRUE))
print("The size of Match dataframe, and let's take a look at it using glimpse")
## [1] "The size of Match dataframe, and let's take a look at it using glimpse"
dim(Match_data)
## [1] 64 20
glimpse(Match_data)
## Rows: 64
## Columns: 20
## $ date        <chr> "2022-11-20", "2022-11-21", "2022-11-21", "2022-11-21", "2…
## $ league_id   <int> 1908, 1908, 1908, 1908, 1908, 1908, 1908, 1908, 1908, 1908…
## $ league      <chr> "FIFA World Cup", "FIFA World Cup", "FIFA World Cup", "FIF…
## $ team1       <chr> "Qatar", "England", "Senegal", "USA", "Argentina", "Denmar…
## $ team2       <chr> "Ecuador", "Iran", "Netherlands", "Wales", "Saudi Arabia",…
## $ spi1        <dbl> 51.00, 85.96, 73.84, 74.83, 87.21, 80.02, 74.30, 87.71, 75…
## $ spi2        <dbl> 72.74, 62.17, 86.01, 65.58, 56.87, 65.85, 68.28, 60.83, 78…
## $ prob1       <dbl> 0.2369, 0.6274, 0.2235, 0.4489, 0.7228, 0.5001, 0.4238, 0.…
## $ prob2       <dbl> 0.5045, 0.1187, 0.5053, 0.2591, 0.0807, 0.2054, 0.2802, 0.…
## $ probtie     <dbl> 0.2586, 0.2539, 0.2712, 0.2920, 0.1966, 0.2945, 0.2960, 0.…
## $ proj_score1 <dbl> 1.13, 1.70, 0.99, 1.42, 2.11, 1.44, 1.37, 2.09, 1.18, 2.14…
## $ proj_score2 <dbl> 1.75, 0.58, 1.63, 1.01, 0.54, 0.82, 1.06, 0.65, 1.34, 1.06…
## $ score1      <int> 0, 6, 0, 1, 1, 0, 0, 4, 0, 1, 7, 1, 1, 0, 3, 2, 0, 1, 1, 0…
## $ score2      <int> 2, 2, 2, 1, 2, 0, 0, 1, 0, 2, 0, 0, 0, 0, 2, 0, 2, 3, 1, 0…
## $ xg1         <dbl> 0.23, 1.04, 0.70, 0.33, 1.63, 0.66, 0.45, 3.03, 0.28, 3.10…
## $ xg2         <dbl> 1.14, 1.45, 0.68, 1.78, 0.15, 1.16, 1.02, 0.26, 0.88, 1.20…
## $ nsxg1       <dbl> 0.24, 1.50, 1.22, 0.48, 2.40, 1.33, 1.19, 3.01, 0.54, 3.10…
## $ nsxg2       <dbl> 1.35, 0.32, 1.83, 0.95, 0.53, 0.69, 0.49, 0.30, 0.64, 0.85…
## $ adj_score1  <dbl> 0.00, 5.78, 0.00, 1.05, 1.05, 0.00, 0.00, 4.18, 0.00, 1.05…
## $ adj_score2  <dbl> 2.10, 2.10, 1.58, 1.05, 2.10, 0.00, 0.00, 1.05, 0.00, 2.10…
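The first column in the forecast output above appears as ï..forecast_timestamp, which is typically the sign of a UTF-8 byte order mark (BOM) at the start of the file, and the spi column is renamed by position. The sketch below shows one possible way to handle both: it reads a file with base R's fileEncoding = "UTF-8-BOM" and renames a column by name. It uses a small made-up CSV written to a temporary file rather than the real download, so the file contents and the toy_forecast name are assumptions for illustration only.

```r
# Toy CSV with a UTF-8 byte order mark in front (illustrative data, not the real file)
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
csv_text <- "forecast_timestamp,team,group,spi\n2022-12-18,Argentina,C,89.6\n2022-12-18,France,D,88.3\n"
tmp <- tempfile(fileext = ".csv")
writeBin(c(bom, charToRaw(csv_text)), tmp)

# "UTF-8-BOM" tells R to strip the byte order mark while reading the file
toy_forecast <- read.csv(tmp, header = TRUE, fileEncoding = "UTF-8-BOM")

# Rename by name instead of by position, so a change in column order does not matter
names(toy_forecast)[names(toy_forecast) == "spi"] <- "soccer_power_index"

print(names(toy_forecast))
unlink(tmp)
```

Renaming by name is a design choice rather than a requirement: the positional version in the report works as long as spi stays in the fourth column.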

Including Plots

The plot below shows the SPI of the four countries that made it to the semi-finals of the World Cup:

# Take the last two matches in Match_data; their four teams are the World Cup semi-finalists.
# Keep columns 4 to 7 (team1, team2, spi1, spi2) and plot the SPI values.
world_cup_semi <- Match_data[(nrow(Match_data) - 1):nrow(Match_data), 4:7]

barplot(c(world_cup_semi$spi1,world_cup_semi$spi2),
        names.arg=c(world_cup_semi$team1,world_cup_semi$team2),
        main = "World Cup semi-finalist Soccer Power Index (SPI)", 
        xlab = "Country", ylab = "SPI", col = "blue")
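As a hedged alternative to indexing rows and columns by position, the same selection can be sketched with tail() and column names. The toy_matches data frame below is a made-up stand-in for Match_data with placeholder teams and values; only the column names team1, team2, spi1, and spi2 come from the real data.

```r
# Toy stand-in for Match_data (placeholder teams and values, for illustration only)
toy_matches <- data.frame(
  team1 = c("Team A", "Team C", "Team B", "Team A"),
  team2 = c("Team B", "Team D", "Team D", "Team C"),
  spi1  = c(80, 85, 78, 81),
  spi2  = c(78, 70, 71, 86),
  stringsAsFactors = FALSE
)

# Last two matches, selected with tail() and column names instead of indices
last_two <- tail(toy_matches, 2)[, c("team1", "team2", "spi1", "spi2")]

# One named SPI value per team, in the order barplot() will draw them
spi_by_team <- setNames(c(last_two$spi1, last_two$spi2),
                        c(last_two$team1, last_two$team2))
print(spi_by_team)

# barplot() uses the vector names as bar labels when names.arg is not given
barplot(spi_by_team, main = "SPI of the teams in the last two matches (toy data)",
        xlab = "Country", ylab = "SPI", col = "blue")
```

Selecting by name keeps the code readable and avoids silently picking the wrong columns if the file layout changes.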


Conclusion

This project demonstrates how to create a bar chart showing the SPI of the four countries that reached the semi-final stage of the 2022 World Cup. The data is obtained from an online source and stored in data frames using R. I also learned how to use R Markdown to document my code, how to upload my project to GitHub, and how to publish it on RPubs.

To read the data, I used the read.csv function from base R, but I also loaded the data.table package in case I wanted to use the fread function instead. I also used the base R barplot function to plot the final result.
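For reference, a minimal sketch of the fread alternative is shown here. It parses a short made-up CSV string instead of the real download, so the values and the toy_dt and toy_df names are assumptions for illustration; fread() also accepts a file path or URL as its input.

```r
library(data.table)

# Toy CSV text standing in for the downloaded file (illustrative values only)
csv_text <- "team,group,spi\nArgentina,C,89.6\nFrance,D,88.3\n"

# fread() can parse CSV text directly and returns a data.table
toy_dt <- fread(text = csv_text)

# Convert to a plain data frame if the rest of the code expects one
toy_df <- as.data.frame(toy_dt)
print(toy_df)
```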