This R Markdown Notebook is my project prosposal for the Data Wrangling with R I’m looking to explore European Soccer Database using dplyr and ggplot2 to predict the result of games using historical data.
The data used is provided through Kaggle datasets.
The data gives information on matches played across 11 different leagues.
How leagues work?
The data consists of multiple tables. You can find a full description of variables being used here.
library("RSQLite")
library("dplyr")
library("readr")
library("tidyverse")
library("tibble")
library('stringr')
library('lubridate')
temp <- tempfile()
download.file("https://kaggle2.blob.core.windows.net/datasets/63/589/database.sqlite.zip",
temp, method = 'curl')
# Extract to get the sqlite file
data <- unzip(temp)
filename <- str_c(temp, str_sub(data, 2))
# Connect to the sqlite file
con = dbConnect(RSQLite::SQLite(), dbname = unzip(temp))
# get a list of all tables
dbListTables(con)
## [1] "Country" "League" "Match"
## [4] "Player" "Player_Attributes" "Team"
## [7] "Team_Attributes" "sqlite_sequence"
# Read all tables
myQuery <- dbSendQuery(con, "SELECT * FROM Player")
Player <- dbFetch(myQuery, n = -1)
myQuery <- dbSendQuery(con, "SELECT * FROM Country")
Country <- dbFetch(myQuery, n = -1)
myQuery <- dbSendQuery(con, "SELECT * FROM League")
League <- dbFetch(myQuery, n = -1)
myQuery <- dbSendQuery(con, "SELECT * FROM Player_Attributes")
Player_Attributes <- dbFetch(myQuery, n = -1)
myQuery <- dbSendQuery(con, "SELECT * FROM Match")
Match <- dbFetch(myQuery, n = -1)
myQuery <- dbSendQuery(con, "SELECT * FROM Team_Attributes")
Team_Attributes <- dbFetch(myQuery, n = -1)
myQuery <- dbSendQuery(con, "SELECT * FROM Team")
Team <- dbFetch(myQuery, n = -1)
# Clean up the environment
rm(temp, filename, data, con, myQuery)
# Find the country id for England's Premier League
England_id <- League %>%
filter(name == "England Premier League") %>%
select(id) %>%
as.double()
# Filter the matches data for only England
Match <- Match %>%
as_tibble %>%
filter(league_id == England_id)
# Match table has:
# season column - this should be a factor variable
# date column is a string - this should be a date variable
# stage column - this should be a factor variable
# result column - this new column is defined as:
# if home_goal > away_goal : home
# else if home_goal == away_goal : draw
# else away
getResult <- function(home_goal, away_goal){
return (ifelse(home_goal > away_goal,
'home',
(ifelse(home_goal == away_goal,
'draw', 'away'))
))
}
tidy_Match <- Match %>%
mutate(season = (season %>% factor(levels = unique(season))),
date = as.Date(date),
stage = (stage %>% factor(levels = c(1:38))),
result = getResult(home_team_goal, away_team_goal)) %>%
mutate(result = factor(result, levels = c("home", "draw", "away"))) %>%
select(matches("id|season|stage|date|result|team_goal|goal"))
The goal field in the Match table is in xml form. Extracting useful information from this is not simple.