This is a quick overview and example usage of two functions that can help in an SEO or website content analysis process.

We will go through two functions:

sitemap_to_df: converts an XML sitemap (or sitemap index) into a data.frame
url_to_df: splits a list of URLs into their components, each in its own column

The functions are independent and don’t need to be used together.

Please check the reticulate package documentation on how to use Python in R. There is not much setup beyond ensuring that you have Python installed and specifying its path or environment. You can easily install advertools by running the following from the command line:

pip install advertools

# OR:

pip3 install advertools
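
If reticulate does not pick up the right Python on its own, you can point it at a specific environment or interpreter before importing anything. A minimal sketch (the environment name and path below are placeholders for your own setup):

# point reticulate at the Python environment that has advertools installed
reticulate::use_virtualenv("advertools-env", required = TRUE)

# or specify the path to a Python binary directly:
# reticulate::use_python("/usr/local/bin/python3", required = TRUE)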

Let’s start by exploring the cooking sitemaps of the New York Times.

library(reticulate)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.0.6     ✓ dplyr   1.0.3
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(stringr)


adv <- import("advertools")

Convert an XML sitemap to a data.frame:

You can provide a regular sitemap URL or a sitemap index URL, and it can be zipped or not. If you provide a sitemap index, the function will recursively retrieve all the sitemaps it points to. In all cases, you simply provide the URL of the sitemap*.

nyt_cooking <- "https://www.nytimes.com/sitemaps/new/cooking.xml.gz"
cooking_sitemap <- adv$sitemap_to_df(nyt_cooking)
names(cooking_sitemap)
## [1] "loc"           "lastmod"       "sitemap"       "download_date"
cooking_sitemap

Extract months and weekday names

Assuming lastmod is accurate (or to check whether it is), we can get a view of how frequently the website publishes new pages. We can extract the years, months, and weekday names as follows:

cooking_sitemap$year_month <- cooking_sitemap$lastmod %>% 
  ymd_hms %>% 
  strftime("%Y-%m")

cooking_sitemap$weekday <- cooking_sitemap$lastmod %>% 
  ymd_hms %>% 
  weekdays()

cooking_sitemap %>% 
  select(lastmod, year_month, weekday) %>% 
  sample_n(10)

Now we can use the year_month column to count how many articles they modified in each of the available months:

cooking_sitemap %>% 
  group_by(year_month) %>% 
  count()

Pretty much all URLs were modified in September 2020.
This can mean one of two things: either the lastmod tag is not accurate (and therefore not very helpful), or they really did make a site-wide change, such as adding a tag, and the pages were truly last modified in that month.

We can also check the frequency of updates by weekday:

cooking_sitemap %>% 
  group_by(weekday) %>% 
  count(sort = TRUE)

Remember that we also have the sitemap column, which we can check to see if it gives us any hints.

Here is a sample of those sitemaps:

sample(cooking_sitemap$sitemap, 15)
##  [1] "https://www.nytimes.com/sitemaps/new/cooking-2013-12.xml.gz"
##  [2] "https://www.nytimes.com/sitemaps/new/cooking-2013-09.xml.gz"
##  [3] "https://www.nytimes.com/sitemaps/new/cooking-2017-07.xml.gz"
##  [4] "https://www.nytimes.com/sitemaps/new/cooking-1996-07.xml.gz"
##  [5] "https://www.nytimes.com/sitemaps/new/cooking-2010-03.xml.gz"
##  [6] "https://www.nytimes.com/sitemaps/new/cooking-1994-07.xml.gz"
##  [7] "https://www.nytimes.com/sitemaps/new/cooking-1984-09.xml.gz"
##  [8] "https://www.nytimes.com/sitemaps/new/cooking-2011-12.xml.gz"
##  [9] "https://www.nytimes.com/sitemaps/new/cooking-2007-01.xml.gz"
## [10] "https://www.nytimes.com/sitemaps/new/cooking-1990-11.xml.gz"
## [11] "https://www.nytimes.com/sitemaps/new/cooking-2013-01.xml.gz"
## [12] "https://www.nytimes.com/sitemaps/new/cooking-2014-07.xml.gz"
## [13] "https://www.nytimes.com/sitemaps/new/cooking-1990-06.xml.gz"
## [14] "https://www.nytimes.com/sitemaps/new/cooking-1992-04.xml.gz"
## [15] "https://www.nytimes.com/sitemaps/new/cooking-2002-03.xml.gz"

Some sitemaps are organized thematically (product, category, etc.), others by date, as in this case; there is really no rule, and people organize them in many different ways.

The good news here is that we have years and months in the sitemap URLs, which might be useful.

Let’s check if the distribution of those elements looks more natural than the case of lastmod:

cooking_sitemap %>% 
  group_by(sitemap) %>% 
  count() %>% 
  summary()
##    sitemap                n         
##  Length:468         Min.   :  1.00  
##  Class :character   1st Qu.: 29.00  
##  Mode  :character   Median : 44.00  
##                     Mean   : 44.73  
##                     3rd Qu.: 59.00  
##                     Max.   :137.00

So it seems we have 468 different sitemaps.

The number of URLs in each varies between 1 and 137, with an average of 44.73.

This looks natural. So, based on these two observations, it seems that they have a fairly stable publishing pace, and that they probably made a single change to almost all the pages sometime in September 2020.
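
As a further check, we could pull the year and month out of the sitemap URLs themselves and count URLs per publication month. A minimal sketch with stringr and dplyr, assuming the "YYYY-MM" pattern holds for all the sitemap file names:

cooking_sitemap %>% 
  mutate(sitemap_month = str_extract(sitemap, "\\d{4}-\\d{2}")) %>% 
  count(sitemap_month, sort = TRUE)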

Analyzing URLs

The url_to_df function splits a list of URLs into their components, each in its respective column.

url_df <- adv$url_to_df(cooking_sitemap$loc) 
url_df
names(url_df)
## [1] "url"      "scheme"   "netloc"   "path"     "query"    "fragment" "dir_1"   
## [8] "dir_2"

The url column shows the original URLs essentially unchanged (by default they are also decoded for easier reading, an option you can turn off if you want).
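
If you prefer the raw, encoded URLs, a sketch of how that might look (I'm assuming the argument is named decode here; check the advertools documentation for the exact name):

url_df_raw <- adv$url_to_df(cooking_sitemap$loc, decode = FALSE)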

Columns beyond fragment fall into two groups: directory columns (dir_1, dir_2, and so on, one for each directory in the URL path), and query parameter columns (one per parameter; none appear here because these URLs have no query strings).

We can explore the counts of values in each column:

count(url_df, scheme)
count(url_df, netloc)

It seems all URLs are under one subdomain.

Let’s now check the contents of dir_1:

count(url_df, dir_1)

Pretty much everything is under “recipes”.

We can easily look at the values of dir_2 in the rows where dir_1 is equal to “guides”:

url_df %>% 
  filter(dir_1 == "guides") %>% 
  select(dir_1, dir_2)

Finally, we can extract slugs, which are the values of dir_2 where dir_1 is not equal to “guides”, i.e. the recipe pages:

slugs <- url_df %>% 
  filter(dir_1 != "guides") %>%
  select(dir_2)
slugs

So each slug starts with a number, followed by the name of the recipe. With some string manipulation and regex, we can remove the leading numbers and their hyphens, replace the remaining hyphens with spaces, paste everything into one string, and then split it to get the individual words.

Here are the first forty words:

words <-  str_replace_all(slugs$dir_2 , replacement = "", pattern  = "\\d+\\-") %>% 
  str_replace_all(replacement = " ", pattern = "-") %>% 
  paste(collapse = " ") %>% 
  str_split(" ")

head(words[[1]], 40)
##  [1] "tofu"      "makhani"   "indian"    "butter"    "tofu"      "lemon"    
##  [7] "cream"     "cheese"    "cookies"   "pressure"  "cooker"    "chicken"  
## [13] "soup"      "with"      "lemon"     "and"       "rice"      "pressure" 
## [19] "cooker"    "sweet"     "potato"    "coconut"   "curry"     "soup"     
## [25] "pressure"  "cooker"    "miso"      "chicken"   "ramen"     "with"     
## [31] "bok"       "choy"      "baked"     "ziti"      "with"      "sausage"  
## [37] "meatballs" "and"       "spinach"   "speculoos"

We can now use table to get the most frequently used words in the recipes:

words_df <- words[[1]] %>% 
  table() %>%
  sort(decreasing = TRUE) %>% 
  as.data.frame()

names(words_df) <- c("word", "frequency")
words_df

So, it’s a lot of salad, chicken, and sauce!

You can easily add new columns with the percentage and cumulative percentage of each of those words, for a better perspective on their relative importance in the dataset.
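
Something like the following (a minimal dplyr sketch) would do it:

words_df <- words_df %>% 
  mutate(pct = frequency / sum(frequency),
         cum_pct = cumsum(pct))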

Now, we remove some stopwords and visualize the top twenty words used on the recipes website.

top20 <- words_df %>% 
  subset(!(word %in% c("with", "and", "in", "of"))) %>% 
  head(20)

top20 <- top20 %>% 
  arrange(desc(frequency))
top20
ggplot(top20) +
  geom_col(aes(word, frequency)) +
  ggtitle("Most Used Words in cooking.nytimes.com URLs (out of 20,923 URLs)") +
  theme(axis.text.x = element_text(angle = 90)) 

* Bonus: If you give the function a robots.txt file URL, it will extract the sitemap URLs listed in that file, retrieve them all, and return everything in one data.frame.
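
A minimal sketch of that, using the main NYTimes robots.txt file (this fetches every sitemap listed in the file, so it can take a while to run):

# pass a robots.txt URL instead of a sitemap URL
nyt_robots <- "https://www.nytimes.com/robots.txt"
nyt_sitemaps <- adv$sitemap_to_df(nyt_robots)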