ddw5

For the second half, I did not really understand the numbers behind the betting odds at first. For example, B365H, B365D, and B365A are all formatted like x.xx. I do not bet on sports, and the only format I was familiar with is American odds, such as +200 (bet 100 to win 200). After reading the documentation and going to the bet365 website, I found that these are decimal odds: if the odds are x and the stake is y, then profit = (x * y) - y.
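The profit formula can be sketched as a small helper. This is only an illustration: the function names `decimal_profit` and `american_to_decimal` are my own and do not come from the dataset or its documentation.

```r
# Profit from a winning bet at decimal odds: the payout is odds * stake,
# and the profit is the payout minus the original stake.
decimal_profit <- function(odds, stake) {
  odds * stake - stake
}

# Convert American (moneyline) odds to decimal odds.
# Positive lines are the profit on a 100 stake; negative lines are the
# stake needed to win 100.
american_to_decimal <- function(american) {
  ifelse(american > 0, 1 + american / 100, 1 + 100 / abs(american))
}

decimal_profit(3.00, 100)   # 200, the same profit as American +200
american_to_decimal(200)    # 3
```

So American +200 and decimal 3.00 describe the same bet, which is why the x.xx values looked unfamiliar at first.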

The data is most likely encoded this way because the dataset is UK based, so it uses the odds format shown on UK betting sites such as bet365. If I had not found this out, I would not be able to use or interpret the betting data at all.

2. After looking through the documentation, I was still confused by some of the Asian handicap columns. For example, AHh has a lot of negative values. None of the UK betting sites I checked show negative numbers, so I am not sure what these values mean.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.2.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

pl <- read_csv("C:/Users/bfunk/Downloads/E0.csv")

## Rows: 380 Columns: 106
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (7): Div, Date, HomeTeam, AwayTeam, FTR, HTR, Referee
## dbl  (98): FTHG, FTAG, HTHG, HTAG, HS, AS, HST, AST, HF, AF, HC, AC, HY, AY,...
## time  (1): Time
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

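# Count the negative values in every numeric column, then keep only the
# columns that contain at least one.
neg <- pl |>
  summarise(across(where(is.numeric), ~ sum(.x < 0))) |>
  pivot_longer(everything(), names_to = "col", values_to = "neg") |>
  filter(neg > 0)
neg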

## # A tibble: 2 × 2
##   col     neg
##   <chr> <int>
## 1 AHh     216
## 2 AHCh    216

Only two columns, AHh and AHCh, contain negative values, with 216 each out of 380 rows. The plots below show this.

ggplot(neg, aes(x = reorder(col, neg), y = neg)) +
  geom_col() +
  coord_flip() +
  labs(x = "Column", y = "Number of negative values", title = "Negative values by column")

hist(pl$AHh,
     breaks = 30,
     main = "Distribution of AHh column",
     xlab = "AHh")

Given how many negative values this histogram shows, it is evident that the Asian handicap columns are not directly comparable with the other odds columns. If I really wanted to compare them, I would have to do a lot more work and research to figure out how.

Isolating qualitative columns

cat <- names(pl)[sapply(pl, function(x) is.character(x) || is.factor(x))]
cat

## [1] "Div"      "Date"     "HomeTeam" "AwayTeam" "FTR"      "HTR"      "Referee"

sapply(pl[cat], function(x) sum(is.na(x)))

##      Div     Date HomeTeam AwayTeam      FTR      HTR  Referee 
##        0        0        0        0        0        0        0

sapply(pl[cat], function(x) sum(trimws(as.character(x)) == "", na.rm = TRUE))

##      Div     Date HomeTeam AwayTeam      FTR      HTR  Referee 
##        0        0        0        0        0        0        0

There are no missing values, explicit or implicit: none of the qualitative columns contains an NA or an empty string. I checked through the dataset and everything is very clean. The qualitative values are essential to telling the story of each row, and with any of them missing it would be difficult to contextualize the rest. The information is also basic enough that it would take a huge mistake for any of it to go missing. There are also no empty groups.
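One way the empty-group check could be done is sketched below. It is not the exact check I ran; `toy` is a made-up data frame standing in for `pl`, and the factor levels are assumed for illustration.

```r
# Hypothetical toy data, not the real E0.csv: FTR is declared with all three
# possible results, but no draw ("D") actually occurs.
toy <- data.frame(
  HomeTeam = c("Arsenal", "Chelsea", "Arsenal"),
  FTR = factor(c("H", "A", "H"), levels = c("H", "D", "A"))
)

# table() on a factor keeps levels that never occur, with a count of zero,
# so empty groups show up as zero entries.
ftr_counts <- table(toy$FTR)
empty_levels <- names(ftr_counts)[ftr_counts == 0]
empty_levels   # "D"
```

Because `read_csv()` reads these columns as character rather than factor, the expected levels have to be supplied by hand for an empty group to be detectable at all.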

max(pl$MaxA)

## [1] 50

The MaxA column has a maximum value of 50.00, which gives Bournemouth an implied probability of about 2% (1/50) of beating Man City. In fractional terms that is 49/1. This is the biggest outlier in this column. In a match between only two teams, giving one club only a 2% chance is absurdly low.
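The 2% figure comes from inverting the decimal odds. A minimal sketch, with helper names (`implied_prob`, `decimal_to_fractional`) that are my own:

```r
# Implied probability of an outcome priced at decimal odds:
# a fair bet at odds x pays out x per unit staked, so p = 1 / x.
implied_prob <- function(odds) {
  1 / odds
}

# Fractional odds quote profit per unit staked, which is decimal odds minus 1.
decimal_to_fractional <- function(odds) {
  odds - 1
}

implied_prob(50)            # 0.02, i.e. 2%
decimal_to_fractional(50)   # 49, i.e. 49/1
```

This ignores the bookmaker's margin, so the true probability the bookmaker had in mind is, if anything, slightly lower than 2%.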

2026-02-13