As a soccer fan, I picked three books related to data analysis in soccer.
1) The first book is “The Numbers Game: Why Everything You Know About Soccer Is Wrong”, written by Chris Anderson, a former professional goalkeeper, and David Sally, a behavioral analyst. Published in 2013, the book investigates which numbers and statistics actually matter most in predicting the winner of a soccer game.
2) The second is “Soccermatics: Mathematical Adventures in the Beautiful Game” by David Sumpter, originally published in 2016. The author, in collaboration with world-famous soccer players, uses mathematical modeling to explain the tactical aspects of the game.
3) The last book, “Expected Goals: The Story of How Data Conquered Football and Changed the Game Forever”, was published in 2022 by Rory Smith. Here, Smith explains how expected-goals metrics were developed in soccer and how clubs started using them to improve their performance.
library(rvest) # HTML parsing; also provides the %>% pipe used below
html_data <- read_html("books.html")
# Pull each table column out by its position within a row.
# html_text() returns all text inside a cell as one string,
# so several attributes in the same cell run together.
titles_h <- html_data %>% html_elements("tbody tr td:nth-child(1)") %>% html_text(trim = TRUE)
authors_h <- html_data %>% html_elements("tbody tr td:nth-child(2)") %>% html_text(trim = TRUE)
attributes_h <- html_data %>% html_elements("tbody tr td:nth-child(3)") %>% html_text(trim = TRUE)
df_html <- data.frame(
  title = titles_h,
  authors = authors_h,
  attributes = attributes_h,
  stringsAsFactors = FALSE
)
print(df_html)
## title
## 1 The Numbers Game: Why Everything You Know About Soccer Is Wrong
## 2 Soccermatics: Mathematical Adventures in the Beautiful Game
## 3 Expected Goals: The Story of How Data Conquered Football and Changed the Game Forever
## authors
## 1 Chris Anderson, David Sally
## 2 David Sumpter
## 3 Rory Smith
## attributes
## 1 Myth-busting, data-driven look at which statistics actually matter in footballAccessible to fans and practitioners; example-rich, ~400 pages
## 2 Applies mathematical modelling, networks and probability to explain tacticsReadable for non-specialists but includes diagrams and equations
## 3 Journalistic history of the xG revolution and how clubs/media adopted analyticsCovers the modern analytics era with contemporary examples (editions ~2022–2023)
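In the HTML output above, the attributes of each book run together with no separator (for example "footballAccessible"). One possible way to keep them apart is to read the items inside each cell individually and join them explicitly. The sketch below is only an illustration: it assumes the attributes are stored as `<li>` items inside the third table cell, which may not match the real books.html, and the inline table is a hypothetical stand-in for that file.

```r
library(rvest)

# Hypothetical stand-in for books.html: one row, attributes as <li> items.
html_doc <- minimal_html('
<table><tbody>
  <tr>
    <td>Soccermatics</td>
    <td>David Sumpter</td>
    <td><ul>
      <li>Applies mathematical modelling</li>
      <li>Readable for non-specialists</li>
    </ul></td>
  </tr>
</tbody></table>')

# Select every attributes cell (third column of each row).
cells <- html_elements(html_doc, "tbody tr td:nth-child(3)")

# For each cell, read its <li> items one by one and join them with "; ".
attributes_joined <- vapply(
  cells,
  function(cell) {
    items <- html_text(html_elements(cell, "li"), trim = TRUE)
    paste(items, collapse = "; ")
  },
  character(1)
)
print(attributes_joined)
```

Joining with "; " mirrors the separator used for the XML file, so the two data frames would then hold the same kind of string.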
library(xml2) # XML parsing
xml_data <- read_xml("books.xml")
titles <- xml_find_all(xml_data, "//book/title") %>% xml_text()
# A book can have several <author> children: join them with ", "
authors <- xml_find_all(xml_data, "//book/authors") %>%
  lapply(function(node) xml_find_all(node, "./author") %>% xml_text() %>% paste(collapse = ", ")) %>%
  unlist()
# Same idea for the <attribute> children, joined with "; "
attributes <- xml_find_all(xml_data, "//book/attributes") %>%
  lapply(function(node) xml_find_all(node, "./attribute") %>% xml_text() %>% paste(collapse = "; ")) %>%
  unlist()
df_xml <- data.frame(
  title = titles,
  authors = authors,
  attributes = attributes,
  stringsAsFactors = FALSE
)
print(df_xml)
## title
## 1 The Numbers Game: Why Everything You Know About Soccer Is Wrong
## 2 Soccermatics: Mathematical Adventures in the Beautiful Game
## 3 Expected Goals: The Story of How Data Conquered Football and Changed the Game Forever
## authors
## 1 Chris Anderson, David Sally
## 2 David Sumpter
## 3 Rory Smith
## attributes
## 1 Myth-busting, data-driven look at which statistics actually matter in football; Accessible to fans and practitioners; example-rich, ~400 pages
## 2 Applies mathematical modelling, networks and probability to explain tactics; Readable for non-specialists but includes diagrams and equations
## 3 Journalistic history of the xG revolution and how clubs/media adopted analytics; Covers the modern analytics era with contemporary examples (editions ~2022–2023)
library(jsonlite) # package for JSON file
books_json <- fromJSON("books.json")
df_json <- as.data.frame(books_json)
print(df_json)
## title
## 1 The Numbers Game: Why Everything You Know About Soccer Is Wrong
## 2 Soccermatics: Mathematical Adventures in the Beautiful Game
## 3 Expected Goals: The Story of How Data Conquered Football and Changed the Game Forever
## authors
## 1 Chris Anderson, David Sally
## 2 David Sumpter
## 3 Rory Smith
## attributes
## 1 Myth-busting, data-driven look at which statistics actually matter in football, Accessible to fans and practitioners; example-rich, ~400 pages
## 2 Applies mathematical modelling, networks and probability to explain tactics, Readable for non-specialists but includes diagrams and equations
## 3 Journalistic history of the xG revolution and how clubs/media adopted analytics, Covers the modern analytics era with contemporary examples (editions ~2022–2023)
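The printed JSON data frame looks much like the other two, but printing does not reveal how each column is stored. A quick check of the column classes, followed by collapsing any list columns into plain strings, could look like the sketch below. The inline JSON is a hypothetical stand-in for books.json and assumes that authors and attributes are stored as JSON arrays.

```r
library(jsonlite)

# Hypothetical stand-in for books.json: authors and attributes are arrays.
json_text <- '[
  {"title": "The Numbers Game",
   "authors": ["Chris Anderson", "David Sally"],
   "attributes": ["Myth-busting", "Accessible", "Example-rich"]},
  {"title": "Soccermatics",
   "authors": ["David Sumpter"],
   "attributes": ["Mathematical modelling", "Readable"]}
]'
books <- fromJSON(json_text)

# Which columns are lists rather than atomic vectors?
print(sapply(books, class))

# Collapse every list column to one string per row.
is_list_col <- vapply(books, is.list, logical(1))
books[is_list_col] <- lapply(
  books[is_list_col],
  function(col) vapply(col, paste, character(1), collapse = "; ")
)
str(books)
```

After this step every column is a character vector, which is the shape produced by the HTML and XML code.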
ANSWER: With three data frames, the simplest approach is to compare them in pairs.
# dplyr is loaded here, although identical() itself is part of base R
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Let's proceed to the comparison by pairs
compare_json_xml <- identical(df_json, df_xml)
compare_json_html <- identical(df_json, df_html)
compare_xml_html <- identical(df_xml, df_html)
cat("Are JSON and XML identical? ", compare_json_xml, "\n")
## Are JSON and XML identical? FALSE
cat("Are JSON and HTML identical? ", compare_json_html, "\n")
## Are JSON and HTML identical? FALSE
cat("Are XML and HTML identical? ", compare_xml_html, "\n")
## Are XML and HTML identical? FALSE
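The comparison shows that although all three sources describe the same data, the resulting data frames are not identical once loaded. The XML and HTML data frames both store plain character strings, but their attribute text differs: the HTML version runs each book's attributes together with no separator, whereas the XML version joins them with “; ”. The JSON data frame likely stores authors and attributes as list columns, which would make it differ in type from the other two.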
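Because identical() only returns TRUE or FALSE, it does not say where two data frames differ. A minimal sketch of a more informative check, using only base R, is shown below. The three small data frames are hypothetical miniatures built by hand to mimic the situation described above; they are not the real df_xml, df_html and df_json.

```r
# Hypothetical miniature versions of the three data frames.
df_a <- data.frame(title = "Soccermatics", attributes = "modelling; readable",
                   stringsAsFactors = FALSE)
df_b <- data.frame(title = "Soccermatics", attributes = "modellingreadable",
                   stringsAsFactors = FALSE)
df_c <- data.frame(title = "Soccermatics", stringsAsFactors = FALSE)
df_c$attributes <- list(c("modelling", "readable"))  # list column

# Compare the column classes side by side.
print(sapply(list(a = df_a, b = df_b, c = df_c), function(d) sapply(d, class)))

# all.equal() describes the mismatch instead of returning a bare FALSE.
print(all.equal(df_a, df_b))  # same types, different strings
print(all.equal(df_a, df_c))  # character column versus list column

# Once the list column is flattened with the same separator, a and c match.
df_c$attributes <- vapply(df_c$attributes, paste, character(1), collapse = "; ")
print(identical(df_a, df_c))
```

The same idea applies to the real data frames: normalise the column types and separators first, then repeat the pairwise comparison.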