The goal of today’s lab is to load the same data from three different sources. From this we can learn the differences between each file type and find out if there are any differences in what the loaded data looks like.
## Warning: package 'xml2' was built under R version 4.4.3
## Warning: package 'jsonlite' was built under R version 4.4.3
## Warning: package 'arsenal' was built under R version 4.4.3
The data extraction portion of this lab is important because it depends on the form of the data: html, xml, or json. Loading the html and using “html_table” on it creates an untidy, but usable data frame. Loading the xml using “read_xml”, “xmlParse”, and “xmlToList” turns it into a wide data frame. The “xmlToDataFrame” function did not work here because of the two author tags. The json file turned into a properly formed data frame at first glance, but it actually combined any books with multiple authors into rows where authors is a vector.
html_url <- 'https://raw.githubusercontent.com/Megabuster/Data607/refs/heads/main/data/assignment6/books.html'
xml_url <- 'https://raw.githubusercontent.com/Megabuster/Data607/refs/heads/main/data/assignment6/books.xml'
json_url <- 'https://raw.githubusercontent.com/Megabuster/Data607/refs/heads/main/data/assignment6/books.json'
raw_html_df <- read_html(html_url) %>% html_table() %>% as.data.frame()
raw_html_df
## title
## 1 American Woman's Home: Or, Principles of Domestic Science
## 2
## 3 The handbook of soap manufacture
## 4
## 5 Elevator Systems of the Eiffel Tower
## author language ebook_release_year copyright_status
## 1 Catharine Esther Beecher English 2004 Public domain in the USA
## 2 Harriet Beecher Stowe NA
## 3 W. H. Simmons English 2007 Public domain in the USA
## 4 H. A. Appleton NA
## 5 Robert M. Vogel English 2010 Public domain in the USA
raw_xml <- read_xml(xml_url) %>% xmlParse(xml_url)
raw_xml_df <- xmlToList(raw_xml) %>% ldply(data.frame)
raw_xml_df
## .id title
## 1 row American Woman's Home: Or, Principles of Domestic Science
## 2 row The handbook of soap manufacture
## 3 row Elevator Systems of the Eiffel Tower
## author author.1 language ebook_release_year
## 1 Catharine Esther Beecher Harriet Beecher Stowe English 2004
## 2 W. H. Simmons H. A. Appleton English 2007
## 3 Robert M. Vogel <NA> English 2010
## copyright_status
## 1 Public domain in the USA
## 2 Public domain in the USA
## 3 Public domain in the USA
## title
## 1 American Woman's Home: Or, Principles of Domestic Science
## 2 The handbook of soap manufacture
## 3 Elevator Systems of the Eiffel Tower
## author language ebook_release_year
## 1 Catharine Esther Beecher, Harriet Beecher Stowe English 2004
## 2 W. H. Simmons, H. A. Appleton English 2007
## 3 Robert M. Vogel English 2010
## copyright_status
## 1 Public domain in the USA
## 2 Public domain in the USA
## 3 Public domain in the USA
Let’s compare the loaded data. I’m choosing now to do a comparison because this is roughly the first instance where all three instances of the data can be compared since they are now data frames. There are three data frames, thus three comparisons to compare them all.
## $frame.summary.table
## version arg ncol nrow
## 1 x raw_html_df 5 5
## 2 y raw_xml_df 7 3
##
## $comparison.summary.table
## statistic value
## 1 Number of by-variables 0
## 2 Number of non-by variables in common 5
## 3 Number of variables compared 4
## 4 Number of variables in x but not y 0
## 5 Number of variables in y but not x 2
## 6 Number of variables compared with some values unequal 4
## 7 Number of variables compared with all values equal 0
## 8 Number of observations in common 3
## 9 Number of observations in x but not y 2
## 10 Number of observations in y but not x 0
## 11 Number of observations with some compared variables unequal 2
## 12 Number of observations with all compared variables equal 1
## 13 Number of values unequal 6
##
## $vars.ns.table
## version variable position class
## 7 y .id 1 character
## 8 y author.1 4 character
##
## $vars.nc.table
## var.x pos.x class.x var.y pos.y class.y
## 4 ebook_release_year 4 integer ebook_release_year 6 character
##
## $obs.table
## version ..row.names.. observation
## 4 x 4 4
## 5 x 5 5
##
## $diffs.byvar.table
## var.x var.y n NAs
## 1 title title 2 0
## 2 author author 2 0
## 3 language language 1 0
## 4 copyright_status copyright_status 1 0
## Compare Object
##
## Function Call:
## comparedf(x = raw_html_df, y = raw_xml_df)
##
## Shared: 5 non-by variables and 3 observations.
## Not shared: 2 variables and 2 observations.
##
## Differences found in 4/4 variables compared.
## 0 variables compared have non-identical attributes.
## Compare Object
##
## Function Call:
## comparedf(x = raw_html_df, y = raw_json_df)
##
## Shared: 5 non-by variables and 3 observations.
## Not shared: 0 variables and 2 observations.
##
## Differences found in 4/4 variables compared.
## 0 variables compared have non-identical attributes.
## Compare Object
##
## Function Call:
## comparedf(x = raw_xml_df, y = raw_json_df)
##
## Shared: 5 non-by variables and 3 observations.
## Not shared: 2 variables and 0 observations.
##
## Differences found in 0/3 variables compared.
## 0 variables compared have non-identical attributes.
The results of the comparisons are heavily dependent on the loading processes chosen for each file format. In this case, it is not surprising that all three data frames are very different. However, there are some similarities. “Shared: 5 non-by variables and 3 observations.” The observations were consistently found. A snippet of a summary is shown as well. For larger data frames where it may be difficult to see differences, the “summary” function in tandem with “comparef” provides far more details that can be analyzed.
In order to be complete with this example, let’s see the process of fixing the data frames so that they are tidy and actually comparable.
The html data frame needs values to be imputed to not have missing data if some books have multiple authors.
html_df <- raw_html_df %>% mutate_at(c('title', 'language', 'copyright_status'), ~na_if(., '')) %>%
fill(c(title, language, ebook_release_year, copyright_status), .direction = 'down')
html_df
## title
## 1 American Woman's Home: Or, Principles of Domestic Science
## 2 American Woman's Home: Or, Principles of Domestic Science
## 3 The handbook of soap manufacture
## 4 The handbook of soap manufacture
## 5 Elevator Systems of the Eiffel Tower
## author language ebook_release_year copyright_status
## 1 Catharine Esther Beecher English 2004 Public domain in the USA
## 2 Harriet Beecher Stowe English 2004 Public domain in the USA
## 3 W. H. Simmons English 2007 Public domain in the USA
## 4 H. A. Appleton English 2007 Public domain in the USA
## 5 Robert M. Vogel English 2010 Public domain in the USA
Change the xml data frame to a longer form using “melt” and exclude rows that expect extra authors.
xml_df <- raw_xml_df %>%
select(colnames(raw_xml_df)[2:7]) %>%
melt(id.vars = c('title', 'language', 'ebook_release_year', 'copyright_status'), value.name = 'author', na.rm = TRUE)
xml_df$variable = NULL
xml_df
## title language
## 1 American Woman's Home: Or, Principles of Domestic Science English
## 2 The handbook of soap manufacture English
## 3 Elevator Systems of the Eiffel Tower English
## 4 American Woman's Home: Or, Principles of Domestic Science English
## 5 The handbook of soap manufacture English
## ebook_release_year copyright_status author
## 1 2004 Public domain in the USA Catharine Esther Beecher
## 2 2007 Public domain in the USA W. H. Simmons
## 3 2010 Public domain in the USA Robert M. Vogel
## 4 2004 Public domain in the USA Harriet Beecher Stowe
## 5 2007 Public domain in the USA H. A. Appleton
Simply unnest the author column for the json data frame to tidy it.
## # A tibble: 5 × 5
## title author language ebook_release_year copyright_status
## <chr> <chr> <chr> <int> <chr>
## 1 American Woman's Home: Or… Catha… English 2004 Public domain i…
## 2 American Woman's Home: Or… Harri… English 2004 Public domain i…
## 3 The handbook of soap manu… W. H.… English 2007 Public domain i…
## 4 The handbook of soap manu… H. A.… English 2007 Public domain i…
## 5 Elevator Systems of the E… Rober… English 2010 Public domain i…
The data frames are now effectively the same for most meaningful analyses. The html and json based data frames are equal. The xml data frame shows a slight difference from the other two because the column order is different.
## $frame.summary.table
## version arg ncol nrow
## 1 x html_df 5 5
## 2 y xml_df 5 5
##
## $comparison.summary.table
## statistic value
## 1 Number of by-variables 0
## 2 Number of non-by variables in common 5
## 3 Number of variables compared 4
## 4 Number of variables in x but not y 0
## 5 Number of variables in y but not x 0
## 6 Number of variables compared with some values unequal 2
## 7 Number of variables compared with all values equal 2
## 8 Number of observations in common 5
## 9 Number of observations in x but not y 0
## 10 Number of observations in y but not x 0
## 11 Number of observations with some compared variables unequal 4
## 12 Number of observations with all compared variables equal 1
## 13 Number of values unequal 8
##
## $vars.ns.table
## [1] version variable position class
## <0 rows> (or 0-length row.names)
##
## $vars.nc.table
## var.x pos.x class.x var.y pos.y class.y
## 4 ebook_release_year 4 integer ebook_release_year 3 character
##
## $obs.table
## [1] version ..row.names.. observation
## <0 rows> (or 0-length row.names)
##
## $diffs.byvar.table
## var.x var.y n NAs
## 1 title title 4 0
## 2 author author 4 0
## 3 language language 0 0
## 4 copyright_status copyright_status 0 0
## Compare Object
##
## Function Call:
## comparedf(x = html_df, y = xml_df)
##
## Shared: 5 non-by variables and 5 observations.
## Not shared: 0 variables and 0 observations.
##
## Differences found in 2/4 variables compared.
## 0 variables compared have non-identical attributes.
## Compare Object
##
## Function Call:
## comparedf(x = html_df, y = json_df)
##
## Shared: 5 non-by variables and 5 observations.
## Not shared: 0 variables and 0 observations.
##
## Differences found in 0/5 variables compared.
## 0 variables compared have non-identical attributes.
## Compare Object
##
## Function Call:
## comparedf(x = xml_df, y = json_df)
##
## Shared: 5 non-by variables and 5 observations.
## Not shared: 0 variables and 0 observations.
##
## Differences found in 2/4 variables compared.
## 0 variables compared have non-identical attributes.
There are many ways to load data stored in different file types into R. Some methods are able to create data frames immediately. The “xmlToDataFrame” had some issues dealing with there being two author tags. All options loaded the data, just in different forms. With some tidying, each data file could eventually become equivalent data frames.
There were also multiple forms the data could have within the original files. These were choices that I had to make myself. I could have nested some of the data deeper into the HTML or XML files. This would be akin to the work needed to derive such data from websites where I do not have control over the form of the data hosted. Further work could be to practice this process with various websites.