This assignment demonstrates how to manually create structured data files in HTML and JSON formats, load them into R data frames, and rigorously check whether both sources yield identical data. To implement it, I chose to investigate the evolution of striker performance metrics in the Premier League since 2020 through a comparative look at three selected books on sports analytics.
APPROACH
We will follow a structured data science approach with the following steps.
Identify three books about soccer published since 2020. The ones I chose are Football Hackers (2020), The Expected Goals Philosophy (2021), and Soccermatics (2022).
Manually compile and structure the books’ data in two separate formats: an HTML file and a JSON file representing the same data.
Load the HTML and JSON data into two separate R data frames using the “rvest” and “jsonlite” packages.
Perform a logical comparison to check whether the information is identical across both formats.
LOADING USEFUL LIBRARIES
library(rvest)    # HTML parsing
library(jsonlite) # JSON parsing
library(dplyr)    # Data manipulation
Warning: package 'dplyr' was built under R version 4.5.2
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(tibble) # Modern data frames
Warning: package 'tibble' was built under R version 4.5.2
library(knitr) # Table rendering
Warning: package 'knitr' was built under R version 4.5.2
BOOK SELECTION
I chose these three books because they represent the evolution of data-driven soccer analysis, with direct relevance to striker performance metrics used in the Premier League post-2020.
Football Hackers (2020) — Biermann chronicles the global data revolution inside football clubs, with extensive coverage of expected goals (xG) and striker evaluation metrics pioneered by clubs like Liverpool and Brentford. The updated 2020 edition incorporates post-2018 Premier League data.
The Expected Goals Philosophy (2021) — This multi-author work provides the most rigorous statistical treatment of xG as a striker evaluation tool. Tippett, Ramineni, and Raman break down shot quality, positional play, and finishing efficiency across Premier League seasons.
Soccermatics (2022) — Sumpter’s updated edition applies mathematical modelling (network theory, Markov chains, Poisson models) to explain how Premier League strikers create and convert chances, making it essential reading for analytically-minded practitioners.
Let’s manually create an HTML file and a JSON file, each containing a table with the books’ data.
Let’s start with the HTML format
Since I am not comfortable with this format, I asked LLMs to help create the data files manually. Here is the prompt that was used: “Use this table to manually create an HTML file.” Below is the table that I provided:
Result from Claude Sonnet 4.6
The HTML file was hand-crafted with a semantic <table> element. Key structural decisions:
books.html
├── <thead> → Column labels (title, authors, year, genre, rating)
├── <tbody> → One <tr> per book
│ ├── <td id="1"> → Football Hackers
│ ├── <td id="2"> → Expected Goals Philosophy ← multi-author row
│ └── <td id="3"> → Soccermatics
└── Styling → CSS for Premier League visual identity
The multi-author book (Book 2) is marked with class="multi-author" for visual distinction.
Result from Gemini 3
<!DOCTYPE html>
Soccer Analytics Books Selected
| ID | Title | Author(s) | Year | Genre | Rating |
|----|-------|-----------|------|-------|--------|
| 1 | Football Hackers: The Science and Art of a Data Revolution | Christoph Biermann | 2020 | Sports Analytics | 4.6 / 5.0 |
| 2 | The Expected Goals Philosophy | James Tippett, Ravi Ramineni & Ashwin Raman | 2021 | Statistical Analysis | 4.3 / 5.0 |
| 3 | Soccermatics: Mathematical Adventures in the Beautiful Game | David Sumpter | 2022 | Mathematical Modelling | 4.5 / 5.0 |
Next, let’s create the JSON format
Since I am not comfortable with this format either, I asked LLMs to help create the data file manually, using the same prompt as before with the target format changed to JSON.
Each book object contains: id, title, authors, multiple_authors (boolean), year, genre, rating.
b) Constructed data file
{
  "dataset": {
    "title": "Soccer Analytics Books Selected",
    "description": "Manually curated list of books relevant to football data science, xG modelling, and mathematical analysis of the game.",
    "created": "2025-03-15",
    "total_books": 3
  },
  "books": [
    {
      "id": 1,
      "title": "The Expected Goals Philosophy",
      "authors": ["James Tippett"],
      "multiple_authors": false,
      "year": 2020,
      "genre": "Sports Analytics",
      "rating_numeric": 4.4,
      "rating_max": 5.0
    },
    {
      "id": 2,
      "title": "Net Gains",
      "authors": ["Ryan O'Hanlon"],
      "multiple_authors": false,
      "year": 2022,
      "genre": "Sports Science",
      "rating_numeric": 4.2,
      "rating_max": 5.0
    },
    {
      "id": 3,
      "title": "Soccer Analytics: An Introduction Using R",
      "authors": ["Clive Beggs"],
      "multiple_authors": false,
      "year": 2024,
      "genre": "Data Science",
      "rating_numeric": 4.5,
      "rating_max": 5.0
    }
  ]
}
Result from Gemini 3
{
  "books_selected": [
    {
      "id": 1,
      "title": "Football Hackers: The Science and Art of a Data Revolution",
      "authors": "Christoph Biermann",
      "year": 2020,
      "genre": "Sports Analytics",
      "rating": 4.6
    },
    {
      "id": 2,
      "title": "The Expected Goals Philosophy: A Game-Changing Way of Analysing Football",
      "authors": "James Tippett, Ravi Ramineni & Ashwin Raman",
      "year": 2021,
      "genre": "Statistical Analysis",
      "rating": 4.3
    },
    {
      "id": 3,
      "title": "Soccermatics: Mathematical Adventures in the Beautiful Game",
      "authors": "David Sumpter",
      "year": 2022,
      "genre": "Mathematical Modelling",
      "rating": 4.5
    }
  ]
}
Let’s Load the HTML data into an R data frame
# Read the HTML file
html_url <- "https://raw.githubusercontent.com/Pascaltafo2025/Week-7--Premier-League-Strikers-Performance-Analysis-using-HTML-and-JSON-Data/refs/heads/main/books.html"
Books_html_raw <- read_html(html_url)

# Extract the <table> and parse it into a data frame
df_books_html <- Books_html_raw |>
  html_element("table#books-table") |>  # target the specific table by id
  html_table(trim = TRUE)               # trim whitespace automatically

df_books_html
# A tibble: 3 × 6
`#` Title `Author(s)` Year Genre Rating
<int> <chr> <chr> <int> <chr> <chr>
1 1 Football Hackers: The Science and Art of… Christoph … 2020 Spor… "★★★★…
2 2 The Expected Goals Philosophy: A Game-Ch… James Tipp… 2021 Stat… "★★★★…
3 3 Soccermatics: Mathematical Adventures in… David Sump… 2022 Math… "★★★★…
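As a minimal offline sketch of the same parsing step, the snippet below runs rvest on a small inline table. The HTML string and the names demo_html and demo_df are hypothetical stand-ins, not the contents of the actual books.html; only the table id books-table is taken from the code above.

```r
library(rvest)

# Hypothetical, cut-down stand-in for books.html (not the real file)
demo_html <- '
<table id="books-table">
  <thead>
    <tr><th>Title</th><th>Year</th></tr>
  </thead>
  <tbody>
    <tr><td>Football Hackers</td><td>2020</td></tr>
    <tr><td>Soccermatics</td><td>2022</td></tr>
  </tbody>
</table>'

# minimal_html() wraps the fragment in a full HTML document
demo_page <- minimal_html(demo_html)

# Select the table by its id, then convert it to a data frame;
# the <th> cells in the header row become the column names
demo_df <- demo_page |>
  html_element("table#books-table") |>
  html_table(trim = TRUE)

demo_df
```

Working from an inline string like this makes it easier to check the selector and the resulting column names before pointing the code at the hosted file.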
Let’s Load the JSON data into a separate R data frame
# Read the JSON file
json_url <- "https://raw.githubusercontent.com/Pascaltafo2025/Week-7--Premier-League-Strikers-Performance-Analysis-using-HTML-and-JSON-Data/refs/heads/main/books.json"
Books_json <- fromJSON(json_url)
Books_json
$dataset
$dataset$title
[1] "Premier League Striker Analytics — Reference Books"
$dataset$description
[1] "Manually curated list of books relevant to football data science, xG modelling, and mathematical analysis of the game."
$dataset$created
[1] "2025-03-15"
$dataset$total_books
[1] 3
$books
id title
1 1 Football Hackers: The Science and Art of a Data Revolution
2 2 The Expected Goals Philosophy: A Game-Changing Way of Analysing Football
3 3 Soccermatics: Mathematical Adventures in the Beautiful Game
authors multiple_authors year
1 Christoph Biermann FALSE 2020
2 James Tippett, Ravi Ramineni, Ashwin Raman TRUE 2021
3 David Sumpter FALSE 2022
genre rating_stars rating_numeric rating_max
1 Sports Analytics 5 4.6 5
2 Statistical Analysis 4 4.3 5
3 Mathematical Modelling 5 4.5 5
# The books array sits under Books_json$books
cat("JSON top-level keys:", paste(names(Books_json), collapse = ", "), "\n")
JSON top-level keys: dataset, books
cat("Dataset metadata:\n")
Dataset metadata:
str(Books_json$dataset)
List of 4
$ title : chr "Premier League Striker Analytics — Reference Books"
$ description: chr "Manually curated list of books relevant to football data science, xG modelling, and mathematical analysis of the game."
$ created : chr "2025-03-15"
$ total_books: int 3
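The nesting seen above (a dataset object plus a books array) can be reproduced with a small inline example. This is a sketch with made-up titles and authors, not the real books.json; it assumes jsonlite's default simplification, under which an array of objects becomes a data frame and per-book author arrays of differing lengths become a list column.

```r
library(jsonlite)

# Hypothetical JSON with the same two top-level keys as books.json
json_demo <- '{
  "dataset": {"total_books": 2},
  "books": [
    {"id": 1, "title": "Book A", "authors": ["Author One"]},
    {"id": 2, "title": "Book B", "authors": ["Author Two", "Author Three"]}
  ]
}'

parsed <- fromJSON(json_demo)

names(parsed)                # top-level keys
class(parsed$books)          # the array of objects is simplified to a data frame
class(parsed$books$authors)  # the nested author arrays stay as a list column
```

This is why the books data frame can be pulled out with `$books`, while the authors column needs extra handling before it can match a plain character column.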
Compare the two data frames and determine whether they are identical
In this step, I used Claude Sonnet 4.6 to learn how to compare two data frames built from HTML and JSON data files.
First, we need to clean both data frames and rename their columns so that they match, which makes the comparison easier.
Let’s clean the HTML data frame
# Clean and rename to match the JSON schema
df_Books_html_Clean <- df_books_html |>
  select(-1) |>  # drop the row-number column "#"
  rename(
    title   = `Title`,
    authors = `Author(s)`,
    year    = `Year`,
    genre   = `Genre`,
    rating  = `Rating`
  ) |>
  mutate(
    # Rating column in HTML contains "★★★★★\n4.6 / 5.0"; extract the numeric part
    rating = as.numeric(sub(".*(\\d\\.\\d)\\s*/\\s*5\\.0.*", "\\1", rating)),
    year   = as.integer(year)
  ) |>
  as_tibble()

df_Books_html_Clean
# A tibble: 3 × 5
title authors year genre rating
<chr> <chr> <int> <chr> <dbl>
1 Football Hackers: The Science and Art of a Data Re… Christ… 2020 Spor… 4.6
2 The Expected Goals Philosophy: A Game-Changing Way… James … 2021 Stat… 4.3
3 Soccermatics: Mathematical Adventures in the Beaut… David … 2022 Math… 4.5
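To see what the regular expression in the cleaning step does, here is a small base R sketch. The input strings are hypothetical stand-ins for the Rating cells (asterisks replace the star characters); the pattern is the one used above.

```r
# Hypothetical Rating cells: a decoration, a newline, then "x.y / 5.0"
rating_raw <- c("*****\n4.6 / 5.0", "****\n4.3 / 5.0")

# The pattern matches the whole cell and captures the "d.d" number
# that sits immediately before "/ 5.0"; "\\1" keeps only that capture
rating_num <- as.numeric(
  sub(".*(\\d\\.\\d)\\s*/\\s*5\\.0.*", "\\1", rating_raw)
)

rating_num
```

If a cell does not match the pattern, sub() returns it unchanged and as.numeric() then yields NA with a warning, which is a useful signal that the HTML layout has changed.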
Let’s clean the JSON data frame
# Extract and tidy the books data frame
df_Books_json_clean <- Books_json$books |>
  as_tibble() |>
  select(title, authors, year, genre, rating_numeric) |>  # keep same cols as HTML df
  mutate(
    year           = as.integer(year),
    rating_numeric = as.numeric(rating_numeric)
  )

df_Books_json_clean
# A tibble: 3 × 5
title authors year genre rating_numeric
<chr> <list> <int> <chr> <dbl>
1 Football Hackers: The Science and Art of a… <chr> 2020 Spor… 4.6
2 The Expected Goals Philosophy: A Game-Chan… <chr> 2021 Stat… 4.3
3 Soccermatics: Mathematical Adventures in t… <chr> 2022 Math… 4.5
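The <list> type shown for authors means each cell holds a character vector of names rather than a single string. A quick way to inspect such a column, sketched here in base R with made-up values, is lengths(), which can also be cross-checked against a multiple_authors flag like the one in the JSON file.

```r
# Hypothetical list column: one character vector of authors per book
authors <- list(
  "Author One",
  c("Author Two", "Author Three", "Author Four")
)
multiple_authors <- c(FALSE, TRUE)  # hypothetical flag values

# Number of authors stored in each cell
n_authors <- lengths(authors)

# TRUE when the flag agrees with the actual author counts
flag_ok <- identical(n_authors > 1, multiple_authors)

n_authors
flag_ok
```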
Let’s compare both data frames.
We will run a definitive test of whether the two data frames match. The technique consists of sorting each table by title, so that rows are compared in the same order, and then using the identical() function to test the result.
# Sort both by title to neutralize any ordering differences
df_html_sorted <- df_Books_html_Clean |> arrange(title)
df_json_sorted <- df_Books_json_clean |> arrange(title)

test_result <- identical(df_html_sorted, df_json_sorted)
test_result
[1] FALSE
Interpretation:
The comparison returns FALSE, which means the two data frames are not identical. identical() requires an exact match of values, column names, column types, and attributes, so a single difference is enough to produce FALSE. To find the cause, let’s compare the structure of the two data frames.
title authors year genre rating_numeric
"character" "list" "integer" "character" "numeric"
Interpretation:
The structure comparison shows that both data frames have the same dimensions (3 rows by 5 columns). However, their column names and types are not identical. The fifth column is named “rating” in the HTML data frame but “rating_numeric” in the JSON data frame. Likewise, the authors column is of type “character” in the HTML data frame but of type “list” in the JSON data frame.
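The code behind the structure output is not shown above, so the following is only a sketch of one way to run such a check in base R. The one-row data frames html_like and json_like are hypothetical and merely mimic the column names and types reported in this section.

```r
# Hypothetical HTML-style data frame: authors is a single string
html_like <- data.frame(
  title = "Book A", authors = "Author One, Author Two",
  year = 2021L, genre = "Demo", rating = 4.3,
  stringsAsFactors = FALSE
)

# Hypothetical JSON-style data frame: authors is a list column
json_like <- data.frame(
  title = "Book A", year = 2021L, genre = "Demo", rating_numeric = 4.3,
  stringsAsFactors = FALSE
)
json_like$authors <- list(c("Author One", "Author Two"))
json_like <- json_like[, c("title", "authors", "year", "genre", "rating_numeric")]

# 1. Same number of rows and columns?
same_dim <- identical(dim(html_like), dim(json_like))

# 2. Column names present in the HTML version but not in the JSON version
name_diff <- setdiff(names(html_like), names(json_like))

# 3. Columns whose class differs, compared position by position
html_types <- sapply(html_like, class)
json_types <- sapply(json_like, class)
type_diff  <- names(json_types)[unname(html_types) != unname(json_types)]

same_dim
name_diff
type_diff
```

Splitting the check into dimensions, names, and types points directly at the columns that make identical() fail, instead of leaving only a bare FALSE to interpret.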
CONCLUSION
The two manually constructed data files, HTML and JSON, represent the same three books on soccer data analytics. The rvest package parsed the HTML <table> into an R data frame, and jsonlite parsed the nested JSON structure into a list whose books element needed only light tidying. Finally, I compared both data frames using the identical() function in R, and the result was FALSE. This shows that the two data frames, as loaded and cleaned here, are not identical: they differ in one column name and one column type. A natural next step for this assignment would be to either construct the data files or clean the data frames so that the two become identical.
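As a possible follow-up along those lines, this sketch harmonizes a JSON-style data frame with an HTML-style one and re-runs identical(). The data frames html_df and json_df are invented for illustration; the two fixes (renaming rating_numeric to rating and collapsing the authors list column) target the two differences reported earlier, and the ", " separator is an assumption that would have to match the HTML file.

```r
# Hypothetical HTML-style data frame
html_df <- data.frame(
  title   = c("Book A", "Book B"),
  authors = c("Author One", "Author Two, Author Three"),
  year    = c(2020L, 2021L),
  rating  = c(4.6, 4.3),
  stringsAsFactors = FALSE
)

# Hypothetical JSON-style data frame: list column and a different rating name
json_df <- data.frame(
  title          = c("Book A", "Book B"),
  year           = c(2020L, 2021L),
  rating_numeric = c(4.6, 4.3),
  stringsAsFactors = FALSE
)
json_df$authors <- list("Author One", c("Author Two", "Author Three"))

json_fixed <- json_df

# Fix 1: collapse each author vector into one comma-separated string
json_fixed$authors <- vapply(
  json_fixed$authors, paste, character(1), collapse = ", "
)

# Fix 2: use the same column name as the HTML data frame
names(json_fixed)[names(json_fixed) == "rating_numeric"] <- "rating"

# Put the columns in the same order before comparing
json_fixed <- json_fixed[names(html_df)]

before <- identical(html_df, json_df)
after  <- identical(html_df, json_fixed)
c(before = before, after = after)
```

With real data, the comparison after harmonizing would only succeed if the values themselves also agree, including details such as "&" versus "," between author names.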