advertools (a Python package)
This is a quick overview and example usage of two functions that can help in an SEO or website content analysis process.
We will go through two functions:
sitemap_to_df: As the name suggests, it converts XML sitemaps to data frames. It retrieves all available tags, and also works recursively with sitemap index files.
url_to_df: Once the sitemap is downloaded, you might want to explore the URLs further. This function splits URLs into their components, so you can analyze them more easily and check whatever makes sense to you.
The functions are independent and don’t need to be used together.
Please check the reticulate package documentation on how to use Python in R. There is not much setup beyond ensuring that you have Python installed and specifying its path or environment. You can easily install advertools by running the following from the command line:
pip install advertools
# OR:
pip3 install advertools
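If you prefer to stay in R, reticulate can also handle the setup and installation. A minimal sketch, assuming you want to point reticulate at a specific Python (the path below is only an example; adjust it to your machine):
library(reticulate)
# point reticulate at your Python installation (example path)
use_python("/usr/local/bin/python3", required = TRUE)
# install advertools into that environment from within R
py_install("advertools")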
Let’s start by exploring the cooking sitemaps of the New York Times.
library(reticulate)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.0.6 ✓ dplyr 1.0.3
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(stringr)
adv <- import("advertools")
You can provide a regular sitemap URL or a sitemap index URL, and it can be gzipped or not. If you provide a sitemap index, the function will recursively retrieve all the sitemaps it points to. In all cases, you simply provide the URL of the sitemap*.
nyt_cooking <- "https://www.nytimes.com/sitemaps/new/cooking.xml.gz"
cooking_sitemap <- adv$sitemap_to_df(nyt_cooking)
names(cooking_sitemap)
## [1] "loc" "lastmod" "sitemap" "download_date"
loc: The location (URL) listed in the sitemap. This is a required tag.
lastmod: If accurate, this provides great insight into the publishing trends and activity of the website.
sitemap: The sitemap from which the URL in this row was extracted.
download_date: The date when you downloaded the sitemap. This can be helpful if you download the same sitemap(s) multiple times and want to compare changes, if any (a short sketch of that comparison follows the preview below).
cooking_sitemap
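As a side note on download_date: if you fetch the same sitemap again at a later time, you can compare the two data frames directly. A minimal sketch, where cooking_sitemap_2 stands for a hypothetical second download:
# cooking_sitemap_2 is a hypothetical second download of the same sitemap
# URLs present in the newer download but not in the older one
anti_join(cooking_sitemap_2, cooking_sitemap, by = "loc")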
Assuming lastmod is accurate (or to check whether it is), we can get a view of how frequently the website publishes new pages. We can extract the years, months, and weekday names as follows:
cooking_sitemap$year_month <- cooking_sitemap$lastmod %>%
  ymd_hms() %>%
  strftime("%Y-%m")
cooking_sitemap$weekday <- cooking_sitemap$lastmod %>%
  ymd_hms() %>%
  weekdays()
cooking_sitemap %>%
select(lastmod, year_month, weekday) %>%
sample_n(10)
Now we can use the year_month column to count how many articles they modified in each of the available months:
cooking_sitemap %>%
group_by(year_month) %>%
count()
Pretty much all URLs were modified in September 2020.
This can mean one of two things. Either the lastmod tag is not accurate, and therefore not very helpful, or they actually made a change to all pages, which really were last modified in that month. It could be a tag added or some other site-wide change.
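One quick way to quantify that, as a sketch using the columns we already created:
cooking_sitemap %>%
  count(year_month, sort = TRUE) %>%
  mutate(share = round(n / sum(n), 3))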
We can also check the frequency of updates by weekday:
cooking_sitemap %>%
group_by(weekday) %>%
count(sort = TRUE)
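Note that count orders the weekdays by frequency here, not in calendar order. If you prefer calendar order, one option is to convert weekday to a factor with explicit levels; a sketch, assuming an English locale for the day names:
cooking_sitemap %>%
  mutate(weekday = factor(weekday,
                          levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
                                     "Friday", "Saturday", "Sunday"))) %>%
  count(weekday)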
Remember that we have the sitemap column, which we can check to see if we can get any hints.
Here is a sample of those sitemaps:
sample(cooking_sitemap$sitemap, 15)
## [1] "https://www.nytimes.com/sitemaps/new/cooking-2013-12.xml.gz"
## [2] "https://www.nytimes.com/sitemaps/new/cooking-2013-09.xml.gz"
## [3] "https://www.nytimes.com/sitemaps/new/cooking-2017-07.xml.gz"
## [4] "https://www.nytimes.com/sitemaps/new/cooking-1996-07.xml.gz"
## [5] "https://www.nytimes.com/sitemaps/new/cooking-2010-03.xml.gz"
## [6] "https://www.nytimes.com/sitemaps/new/cooking-1994-07.xml.gz"
## [7] "https://www.nytimes.com/sitemaps/new/cooking-1984-09.xml.gz"
## [8] "https://www.nytimes.com/sitemaps/new/cooking-2011-12.xml.gz"
## [9] "https://www.nytimes.com/sitemaps/new/cooking-2007-01.xml.gz"
## [10] "https://www.nytimes.com/sitemaps/new/cooking-1990-11.xml.gz"
## [11] "https://www.nytimes.com/sitemaps/new/cooking-2013-01.xml.gz"
## [12] "https://www.nytimes.com/sitemaps/new/cooking-2014-07.xml.gz"
## [13] "https://www.nytimes.com/sitemaps/new/cooking-1990-06.xml.gz"
## [14] "https://www.nytimes.com/sitemaps/new/cooking-1992-04.xml.gz"
## [15] "https://www.nytimes.com/sitemaps/new/cooking-2002-03.xml.gz"
Some sitemaps are organized thematically (product, category, etc.), some by date, as in this case; there is no real rule, and people organize them in many different ways.
The good news here is that we have years and months in the sitemap URLs, which might be useful.
Let’s check whether the distribution of URLs across these sitemaps looks more natural than what lastmod showed:
cooking_sitemap %>%
group_by(sitemap) %>%
count() %>%
summary()
## sitemap n
## Length:468 Min. : 1.00
## Class :character 1st Qu.: 29.00
## Mode :character Median : 44.00
## Mean : 44.73
## 3rd Qu.: 59.00
## Max. :137.00
So it seems we have 468 different sitemaps.
The number of URLs in each varies between 1 and 137, with an average of 44.73.
This looks natural. So, based on these two observations, it seems that they have a fairly stable publishing pace, and that they probably made a single change to almost all the pages sometime in September 2020.
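Since the year and month are embedded in each sitemap filename, we can also pull them out directly and count URLs per period; a sketch using stringr:
cooking_sitemap %>%
  mutate(sitemap_month = str_extract(sitemap, "\\d{4}-\\d{2}")) %>%
  count(sitemap_month, sort = TRUE)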
The url_to_df function splits a list of URLs into their components, each in its respective column.
url_df <- adv$url_to_df(cooking_sitemap$loc)
url_df
names(url_df)
## [1] "url" "scheme" "netloc" "path" "query" "fragment" "dir_1"
## [8] "dir_2"
The url column shows the original URLs as they are, with no change (actually, it also decodes them for easier reading, which is an option that can be set to FALSE if you want).
Columns beyond fragment fall into two groups:
dir_<n>: Typically, URLs are split by slashes in the form example.com/dir_1/dir_2/.../dir_n/. Usually, the closer a directory is to the domain, the more general its content; it’s often something like /category/sub-category/sub-sub-category/product-name. Sometimes there is also metadata in the URL, in the form of a language code, year, month, date, website section (blog, community, etc.), or author name. These don’t tell us anything about the content of the page itself, but they are very useful for categorizing it, where available.
query_*: Query parameters are not always used, but when they are, they can provide interesting insights into the content of pages, and possibly other things. Sometimes the parameters and values are encrypted (or poorly named), in which case not much can be inferred. If query parameters exist in the URLs, each one gets its own column, with its values in that column. All such names are prefixed with query_ to make it clear what they are and to avoid collisions with other column names in the data frame.
We can explore the counts of values in each column:
count(url_df, scheme)
count(url_df, netloc)
It seems all URLs are under one subdomain.
Let’s now check the contents of dir_1:
count(url_df, dir_1)
Pretty much everything is under “recipes”.
We can easily count the values of dir_2 in the rows where dir_1 is equal to “guides”:
url_df %>%
filter(dir_1 == "guides") %>%
select(dir_1, dir_2)
Finally, we can extract the slugs, which are the values of dir_2 where dir_1 is not equal to “guides”, i.e. the recipe pages:
slugs <- url_df %>%
filter(dir_1 != "guides") %>%
select(dir_2)
slugs
So we have a number followed by the name of the recipe. With some string manipulation and regex, we can remove the leading numbers and hyphen, replace the remaining hyphens with spaces, paste everything into one string, and then split it into words.
Here are the first forty words:
words <- str_replace_all(slugs$dir_2, replacement = "", pattern = "\\d+\\-") %>%
  str_replace_all(replacement = " ", pattern = "-") %>%
  paste(collapse = " ") %>%
  str_split(" ")
head(words[[1]], 40)
## [1] "tofu" "makhani" "indian" "butter" "tofu" "lemon"
## [7] "cream" "cheese" "cookies" "pressure" "cooker" "chicken"
## [13] "soup" "with" "lemon" "and" "rice" "pressure"
## [19] "cooker" "sweet" "potato" "coconut" "curry" "soup"
## [25] "pressure" "cooker" "miso" "chicken" "ramen" "with"
## [31] "bok" "choy" "baked" "ziti" "with" "sausage"
## [37] "meatballs" "and" "spinach" "speculoos"
We can now use table to get the most frequently used words in the recipes:
words_df <- words[[1]] %>%
table() %>%
sort(decreasing = TRUE) %>%
as.data.frame()
names(words_df) <- c("word", "frequency")
words_df
So, it’s a lot of salad, chicken, and sauce!
You can easily add a new column that counts percentages, and cumulative percentages of each of those words for a better perspective on their relative importance in the dataset.
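For example, a minimal sketch of those two columns:
words_df %>%
  mutate(percentage = frequency / sum(frequency),
         cum_percentage = cumsum(percentage))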
Now we remove some stopwords and visualize the top twenty words used in the recipe URLs.
top20 <- words_df %>%
subset(!(word %in% c("with", "and", "in", "of"))) %>%
head(20)
top20 <- top20 %>%
arrange(desc(frequency))
top20
ggplot(top20) +
geom_col(aes(word, frequency)) +
ggtitle("Most Used Words in cooking.nytimes.com URLs (out of 20,923 URLs)") +
theme(axis.text.x = element_text(angle = 90))
* Bonus: If you give the function a robots.txt file URL, it will extract the sitemaps listed in that file and retrieve everything into one data frame.
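For example, a hypothetical call (the robots.txt URL here is only for illustration):
robots_sitemaps <- adv$sitemap_to_df("https://www.nytimes.com/robots.txt")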