WEEK 7 ASSIGNMENT

Book selections

As a soccer fan, I picked three books related to data analysis in soccer.

1) The first book is titled “The Numbers Game: Why Everything You Know About Soccer Is Wrong” and was written by Chris Anderson, a former professional goalkeeper, and David Sally, a behavioral analyst. Published in 2013, it investigates which numbers and statistics actually matter most in predicting the winner of a soccer game.

2) The second one is called “Soccermatics: Mathematical Adventures in the Beautiful Game”, written by David Sumpter and originally published in 2016. Drawing on collaboration with world-famous soccer players, the author uses mathematical modeling to explain the tactical side of the game.

3) The last book, by Rory Smith, was published in 2022 and is titled “Expected Goals: The Story of How Data Conquered Football and Changed the Game Forever”. Here, Smith explains how expected-goals metrics were developed in soccer and how clubs started using them to enhance their performance.

Let’s load the three files containing the books’ information separately, in HTML, XML and JSON formats.

A) Let’s start with the HTML format.

library(rvest)      # package for HTML file

html_data <- read_html("books.html")

titles_h <- html_data %>% html_elements("tbody tr td:nth-child(1)") %>% html_text(trim = TRUE)
authors_h <- html_data %>% html_elements("tbody tr td:nth-child(2)") %>% html_text(trim = TRUE)
attributes_h <- html_data %>% html_elements("tbody tr td:nth-child(3)") %>% html_text(trim = TRUE)

df_html <- data.frame(
  title = titles_h,
  authors = authors_h,
  attributes = attributes_h,
  stringsAsFactors = FALSE
)

print(df_html)

##                                                                                   title
## 1                       The Numbers Game: Why Everything You Know About Soccer Is Wrong
## 2                           Soccermatics: Mathematical Adventures in the Beautiful Game
## 3 Expected Goals: The Story of How Data Conquered Football and Changed the Game Forever
##                       authors
## 1 Chris Anderson, David Sally
## 2               David Sumpter
## 3                  Rory Smith
##                                                                                                                                                        attributes
## 1                    Myth-busting, data-driven look at which statistics actually matter in footballAccessible to fans and practitioners; example-rich, ~400 pages
## 2                     Applies mathematical modelling, networks and probability to explain tacticsReadable for non-specialists but includes diagrams and equations
## 3 Journalistic history of the xG revolution and how clubs/media adopted analyticsCovers the modern analytics era with contemporary examples (editions ~2022–2023)
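The attributes printed above run together (“footballAccessible”), which suggests that the third cell holds several child elements whose text html_text() concatenates without a separator. Here is a minimal sketch of one way to keep a separator, assuming (hypothetically, since books.html is not shown here) that each attribute sits in its own <li>. The inline page and the name attributes_sep are illustrative only.

```r
library(rvest)

# Hypothetical stand-in for books.html: the third cell of each row holds a list
page <- minimal_html('
<table><tbody>
  <tr><td>Book A</td><td>Ann, Bob</td>
      <td><ul><li>first note</li><li>second note</li></ul></td></tr>
  <tr><td>Book B</td><td>Cy</td>
      <td><ul><li>third note</li></ul></td></tr>
</tbody></table>')

# Select the third cell of every row, as in the code above
cells <- html_elements(page, "tbody tr td:nth-child(3)")

# For each cell, read its <li> children one by one and join them with "; "
attributes_sep <- vapply(cells, function(cell) {
  items <- html_elements(cell, "li")
  paste(html_text(items, trim = TRUE), collapse = "; ")
}, character(1))

print(attributes_sep)
```

If the real file uses different markup inside the cell (for example <br> or <p>), the "li" selector would need to change accordingly.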

B) Next, we will load the books in XML format.

library(xml2)       # package for XML file 

xml_data <- read_xml("books.xml")

titles <- xml_find_all(xml_data, "//book/title") %>% xml_text()
authors <- xml_find_all(xml_data, "//book/authors") %>%
  lapply(function(node) xml_find_all(node, "./author") %>% xml_text() %>% paste(collapse = ", ")) %>%
  unlist()
attributes <- xml_find_all(xml_data, "//book/attributes") %>%
  lapply(function(node) xml_find_all(node, "./attribute") %>% xml_text() %>% paste(collapse = "; ")) %>%
  unlist()

df_xml <- data.frame(
  title = titles,
  authors = authors,
  attributes = attributes,
  stringsAsFactors = FALSE
)

print(df_xml)

##                                                                                   title
## 1                       The Numbers Game: Why Everything You Know About Soccer Is Wrong
## 2                           Soccermatics: Mathematical Adventures in the Beautiful Game
## 3 Expected Goals: The Story of How Data Conquered Football and Changed the Game Forever
##                       authors
## 1 Chris Anderson, David Sally
## 2               David Sumpter
## 3                  Rory Smith
##                                                                                                                                                          attributes
## 1                    Myth-busting, data-driven look at which statistics actually matter in football; Accessible to fans and practitioners; example-rich, ~400 pages
## 2                     Applies mathematical modelling, networks and probability to explain tactics; Readable for non-specialists but includes diagrams and equations
## 3 Journalistic history of the xG revolution and how clubs/media adopted analytics; Covers the modern analytics era with contemporary examples (editions ~2022–2023)
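The XPath queries above imply a layout in which each <book> has a <title>, an <authors> wrapper with <author> children, and an <attributes> wrapper with <attribute> children. As a sketch under that assumption (the inline XML is a hypothetical stand-in for books.xml, and collapse_children is an illustrative helper, not part of xml2), the same frame could be built by looping over the books:

```r
library(xml2)

# Hypothetical stand-in for books.xml, shaped to match the XPath queries above
xml_txt <- '<books>
  <book>
    <title>Book A</title>
    <authors><author>Ann</author><author>Bob</author></authors>
    <attributes><attribute>x</attribute><attribute>y</attribute></attributes>
  </book>
  <book>
    <title>Book B</title>
    <authors><author>Cy</author></authors>
    <attributes><attribute>z</attribute></attributes>
  </book>
</books>'

doc   <- read_xml(xml_txt)
books <- xml_find_all(doc, "//book")

# Illustrative helper: join the text of all nodes matching `xpath` under one book
collapse_children <- function(book, xpath, sep) {
  paste(xml_text(xml_find_all(book, xpath)), collapse = sep)
}

df_sketch <- data.frame(
  title      = xml_text(xml_find_first(books, "./title")),
  authors    = vapply(books, collapse_children, character(1),
                      xpath = "./authors/author", sep = ", "),
  attributes = vapply(books, collapse_children, character(1),
                      xpath = "./attributes/attribute", sep = "; "),
  stringsAsFactors = FALSE
)

print(df_sketch)
```

Iterating per <book> keeps the rows aligned even if one book has no <authors> or <attributes> wrapper; in that case the helper returns an empty string for that row instead of shortening the column.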

C) Finally, we will load the books in JSON format.

library(jsonlite)   # package for JSON file

books_json <- fromJSON("books.json")

df_json <- as.data.frame(books_json)

print(df_json)

##                                                                                   title
## 1                       The Numbers Game: Why Everything You Know About Soccer Is Wrong
## 2                           Soccermatics: Mathematical Adventures in the Beautiful Game
## 3 Expected Goals: The Story of How Data Conquered Football and Changed the Game Forever
##                       authors
## 1 Chris Anderson, David Sally
## 2               David Sumpter
## 3                  Rory Smith
##                                                                                                                                                          attributes
## 1                    Myth-busting, data-driven look at which statistics actually matter in football, Accessible to fans and practitioners; example-rich, ~400 pages
## 2                     Applies mathematical modelling, networks and probability to explain tactics, Readable for non-specialists but includes diagrams and equations
## 3 Journalistic history of the xG revolution and how clubs/media adopted analytics, Covers the modern analytics era with contemporary examples (editions ~2022–2023)
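The comma-joined authors and attributes printed above are how R displays list columns, which fromJSON() can produce when a field holds JSON arrays. The following sketch shows how such columns could be detected and flattened to plain strings. The inline JSON is a hypothetical stand-in for books.json, and the ", " separator is an arbitrary choice.

```r
library(jsonlite)

# Hypothetical stand-in for books.json: authors and attributes are arrays
json_txt <- '[
  {"title": "Book A", "authors": ["Ann", "Bob"], "attributes": ["x", "y"]},
  {"title": "Book B", "authors": ["Cy"],         "attributes": ["z"]}
]'

df <- fromJSON(json_txt)

# Record the class of every column before flattening
classes_before <- vapply(df, class, character(1))
print(classes_before)

# Find the list columns and collapse each element to a single string
is_list_col <- vapply(df, is.list, logical(1))
df[is_list_col] <- lapply(df[is_list_col], function(col) {
  vapply(col, paste, character(1), collapse = ", ")
})

print(vapply(df, class, character(1)))
print(df)
```

After flattening, every column is a character vector, so the frame can be compared with frames built from the other formats on equal terms.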

COMPARISON OF ALL THREE DATA FRAME FORMATS

ANSWER: Since we have three data frames, the simplest way to compare them is pairwise.

# Load dplyr (identical(), used below, comes from base R)

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# Let's proceed to the comparison by pairs

compare_json_xml  <- identical(df_json, df_xml)

compare_json_html <- identical(df_json, df_html)

compare_xml_html  <- identical(df_xml, df_html)

cat("Are JSON and XML identical? ", compare_json_xml, "\n")

## Are JSON and XML identical?  FALSE

cat("Are JSON and HTML identical? ", compare_json_html, "\n")

## Are JSON and HTML identical?  FALSE

cat("Are XML and HTML identical? ", compare_xml_html, "\n")

## Are XML and HTML identical?  FALSE
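The three pairwise checks above could also be generated rather than written out by hand. Here is a minimal base R sketch, using small hypothetical frames in place of df_json, df_xml and df_html:

```r
# Hypothetical miniature frames standing in for the three loaded data frames
frames <- list(
  json = data.frame(x = 1:2),
  xml  = data.frame(x = 1:2),
  html = data.frame(x = c(1L, 3L))
)

# All unordered pairs of frame names: json/xml, json/html, xml/html
pairs <- combn(names(frames), 2, simplify = FALSE)

# Run identical() on each pair and label the results
results <- vapply(pairs, function(p) {
  identical(frames[[p[1]]], frames[[p[2]]])
}, logical(1))
names(results) <- vapply(pairs, paste, character(1), collapse = " vs ")

print(results)
```

With the real data, the list would hold df_json, df_xml and df_html, and the same code would scale to any number of formats.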

CONCLUSION

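The comparison shows that although all three sources describe the same data, the resulting data frames are not identical once loaded. The XML and HTML data frames both hold plain character columns, but their attributes strings differ: the XML version joins the two attributes with “; ”, whereas the HTML version runs them together with no separator. The JSON data frame likely stores authors and attributes as list columns rather than character strings, so it matches neither of the other two.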
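Because identical() only returns a single TRUE or FALSE, it does not say where two frames differ. A short base R sketch of how the column types and per-column differences could be inspected, using hypothetical one-row frames (df_chr and df_lst are illustrative names):

```r
# Hypothetical frame with plain character columns (like the XML/HTML results)
df_chr <- data.frame(title = "Book A", authors = "Ann, Bob",
                     stringsAsFactors = FALSE)

# Hypothetical frame with a list column (like the JSON result may have)
df_lst <- data.frame(title = "Book A", stringsAsFactors = FALSE)
df_lst$authors <- list(c("Ann", "Bob"))

# Whole-frame check: one logical value, with no detail
whole <- identical(df_chr, df_lst)
print(whole)

# Column classes side by side: rows are columns, columns are frames
col_classes <- sapply(list(chr = df_chr, lst = df_lst),
                      function(d) sapply(d, class))
print(col_classes)

# Column-by-column check shows which columns actually differ
per_column <- mapply(identical, df_chr, df_lst)
print(per_column)
```

Applied to the real frames, the same calls would show whether the mismatch comes from column types, from the separators inside the strings, or from both.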