The data was most likely encoded this way because the dataset is UK-based, so it makes sense that it uses UK odds. Had I not found this out, I would not have been able to use or interpret the betting data at all.
2. After looking through the documentation, I was still confused by some of the Asian handicap columns. AHh, for example, has a lot of negative values. None of the UK betting sites I checked show negative numbers, so I am not sure what they mean.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.2.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
pl <- read_csv("C:/Users/bfunk/Downloads/E0.csv")
## Rows: 380 Columns: 106
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Div, Date, HomeTeam, AwayTeam, FTR, HTR, Referee
## dbl (98): FTHG, FTAG, HTHG, HTAG, HS, AS, HST, AST, HF, AF, HC, AC, HY, AY,...
## time (1): Time
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
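# Count negative values in every numeric column; na.rm = TRUE keeps a column
# that contains NAs from returning NA and being dropped by the filter
neg <- pl |>
  summarise(across(where(is.numeric), ~ sum(.x < 0, na.rm = TRUE))) |>
  pivot_longer(everything(), names_to = "col", values_to = "neg") |>
  filter(neg > 0)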
neg
## # A tibble: 2 × 2
## col neg
## <chr> <int>
## 1 AHh 216
## 2 AHCh 216
Only two columns, AHh and AHCh, contain negative values, with 216 each out of 380 rows. The plots below show this.
ggplot(neg, aes(x = reorder(col, neg), y = neg)) +
  geom_col() +
  coord_flip() +
  labs(x = "Column", y = "Number of negative values", title = "Negative values by column")
hist(pl$AHh,
     breaks = 30,
     main = "Distribution of AHh column",
     xlab = "AHh")
The histogram shows how large a share of AHh is negative, which makes it evident that the Asian handicap columns are on a different scale from the other odds columns and cannot be compared with them directly. To compare them, I would first have to research how Asian handicap values are quoted.
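To put numbers on what the histogram shows, the values can be tabulated by sign. The following is a minimal sketch that uses a small made-up vector, `ahh_example`, in place of `pl$AHh`. Both the vector and its values are hypothetical and do not come from E0.csv.

```r
# Hypothetical handicap-style values standing in for pl$AHh
ahh_example <- c(-1.5, -0.75, -0.25, 0, 0.25, 0.5, 1, -2)

# Label each value by sign; fixing the levels keeps all three categories
# in the table even when one of them has no values
sign_label <- factor(sign(ahh_example),
                     levels = c(-1, 0, 1),
                     labels = c("negative", "zero", "positive"))

# Counts and shares per sign (table() drops NA values by default)
sign_counts <- table(sign_label)
sign_share  <- round(prop.table(sign_counts), 3)
sign_counts
sign_share
```

Swapping `ahh_example` for `pl$AHh` would give the same breakdown for the real column.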
Isolating qualitative columns
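# Named cat_cols rather than cat to avoid masking base::cat()
cat_cols <- names(pl)[sapply(pl, function(x) is.character(x) || is.factor(x))]
cat_cols
## [1] "Div" "Date" "HomeTeam" "AwayTeam" "FTR" "HTR" "Referee"
# Explicit missing values (NA) per column
sapply(pl[cat_cols], function(x) sum(is.na(x)))
## Div Date HomeTeam AwayTeam FTR HTR Referee
## 0 0 0 0 0 0 0
# Implicit missing values (empty or whitespace-only strings) per column
sapply(pl[cat_cols], function(x) sum(trimws(as.character(x)) == "", na.rm = TRUE))
## Div Date HomeTeam AwayTeam FTR HTR Referee
## 0 0 0 0 0 0 0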
There are no missing values, explicit or implicit. I checked through the dataset and everything is very clean. The qualitative values are essential to telling the story of each row, and with any of them missing it would be difficult to contextualize the rest. The information is also pretty basic, so there is no real way for data to go missing from those columns without a huge mistake. There are also no empty groups.
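One way to check for empty groups is to count matches per team with fixed factor levels, so that a team with no matches shows up as 0 instead of disappearing from the table. This is only a sketch on a hypothetical three-team fixture list: the `fixtures` data frame stands in for `pl`, and only the column names `HomeTeam` and `AwayTeam` are taken from the dataset.

```r
# Hypothetical three-team fixture list standing in for pl
fixtures <- data.frame(
  HomeTeam = c("A", "A", "B", "B", "C", "C"),
  AwayTeam = c("B", "C", "A", "C", "A", "B"),
  stringsAsFactors = FALSE
)

# Every team seen in either column becomes a factor level, so a team with
# no home (or away) matches is counted as 0 instead of being left out
teams <- sort(unique(c(fixtures$HomeTeam, fixtures$AwayTeam)))
home_counts <- table(factor(fixtures$HomeTeam, levels = teams))
away_counts <- table(factor(fixtures$AwayTeam, levels = teams))

# Teams with no home matches or no away matches (the empty groups)
empty_groups <- teams[as.vector(home_counts) == 0 | as.vector(away_counts) == 0]
empty_groups
```

On the real data, `fixtures` would be replaced by `pl`. An empty result would support the claim that no team is missing from either side of the fixture list.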
max(pl$MaxA)
## [1] 50
The MaxA column has a maximum of 50.00, the price for Bournemouth to beat Man City, which implies roughly a 2% chance (1/50 if the value is read as decimal odds). This is the most extreme value in the column. In a match between only two teams, giving one club a 2% chance is absurdly low.
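The conversion from odds to an implied probability depends on how the odds are quoted. As a rough sketch, the two helper functions below (hypothetical names, not part of any package) show the arithmetic for decimal and fractional odds. Whether MaxA is decimal or fractional is treated as an assumption here.

```r
# Implied probability from decimal odds, where the payout includes the stake
implied_prob_decimal <- function(odds) 1 / odds

# Implied probability from fractional odds written as num/den (e.g. 50/1)
implied_prob_fractional <- function(num, den = 1) den / (num + den)

implied_prob_decimal(50)        # 1/50
implied_prob_fractional(50, 1)  # 1/51
```

Either reading gives a figure close to 2%. Bookmaker prices also include a margin, so the implied probability is not an exact estimate of the true chance.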