It can be helpful to understand a website’s page and folder structure. That understanding helps when deciding how to organise your site taxonomy, wrangling a sensible navigation, implementing content groups in analytics, visualising your ecommerce funnel, and much more.
However, most websites consist of hundreds, if not thousands, of pages. Using R and the collapsibleTree package, we can convert a complicated list of URLs into a more visual representation of our website. This gives us a simpler way to navigate and understand the site’s overall structure.
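In the following document, we’ll:
- Gather a list of URLs from a website using Screaming Frog.
- Tidy those URLs into one column per folder level using the tidyverse.
- Visualise the site structure using the collapsibleTree package.
For this example, I am using the website of Simo Ahava, a prolific writer on Google Analytics and Google Tag Manager, and a leading contributor to the digital analytics community.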
To get started, we need a list of URLs from Simo’s website. We can use the SEO spider tool Screaming Frog to do this.
We launch Screaming Frog, enter the address of Simo Ahava’s site (http://www.simoahava.com) and click Start to spider the site.
The Screaming Frog interface
Once the spider has finished, we are given a list of all URLs found on Simo’s website, which we can export to a CSV file.
For more information on using Screaming Frog, check out the guide at https://www.screamingfrog.co.uk/seo-spider/user-guide/.
The collapsibleTree package requires data in a specific format in order to generate a visualisation. We’re going to generate a mind-map-esque diagram, and will need to specify what the levels of that diagram are. We need a separate column for each level. Within our list of URLs, the folder hierarchy of the website is delimited by the / character.
So we could say that for https://www.simoahava.com/tags/gtmtips/:
- tags is the level 1 folder
- gtmtips is the level 2 folder
We therefore need to separate our list of URLs into separate columns for each level.
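As a minimal sketch of that idea, the snippet below splits the example URL above into level columns. The `example` tibble and the choice of three levels are made up for illustration, and stripping the trailing slash is a simplification of my own rather than part of the main workflow further down.

```r
library(tibble)
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical one-row data frame holding the URL discussed above
example <- tibble(url = "https://www.simoahava.com/tags/gtmtips/")

example_levels <- example %>%
  # Drop the scheme and domain, then the trailing slash, leaving only the path
  mutate(
    path = str_remove(url, "^https?://[^/]+/"),
    path = str_remove(path, "/$")
  ) %>%
  # One column per folder level; levels the URL does not reach are set to NA
  separate(path, into = paste0("level", 1:3), sep = "/", fill = "right")

example_levels
```

Here `level1` holds `tags`, `level2` holds `gtmtips`, and `level3` is `NA` because this URL is only two folders deep.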
Fortunately, the tidyverse has a few very helpful functions which will make this easy to do.
First of all, we need to read in the file which was exported from Screaming Frog. The code below shows how to do this, then outputs the head of the resulting dataframe.
library(tidyverse)
library(stringr)
library(collapsibleTree)
library(networkD3)
# Download the list of pages
if (!file.exists("simo_pages.csv")) {
  download.file("https://storage.googleapis.com/pageflows/simo_pages.csv", destfile = "simo_pages.csv")
}
# Read in the csv
simo_file <- read_csv("simo_pages.csv", skip = 1)
# Remove query strings, remove all before + including .com, remove duplicates
simo_tidied <- simo_file %>%
  mutate(add = str_replace_all(Address, c("\\?.*" = ""))) %>%
  mutate(add = str_replace_all(add, ".*\\.com\\/", "")) %>%
  select(add, title_1 = `Title 1`) %>%
  distinct() %>%
  arrange(add)
simo_tidied %>% head(50)
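To see what the two regular expressions in the tidying step do, here is a small sketch that applies them to a handful of made-up addresses. The `addresses` vector, including the `?page=2` query string, is invented for illustration and is not taken from the real export.

```r
library(stringr)

# Made-up addresses, shaped like the Address column of the export
addresses <- c(
  "https://www.simoahava.com/tags/gtmtips/",
  "https://www.simoahava.com/tags/gtmtips/?page=2",
  "https://www.simoahava.com/"
)

# Step 1: drop the query string (everything from the first "?")
no_query <- str_replace_all(addresses, "\\?.*", "")

# Step 2: drop everything up to and including ".com/"
paths <- str_replace_all(no_query, ".*\\.com\\/", "")

# Step 3: remove duplicates
unique(paths)
```

The first two addresses collapse into the same path, `tags/gtmtips/`, while the homepage becomes an empty string.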
Now, we can apply those helpful functions to the add column and create our widened dataframe. We use the separate() function to do that. We also convert all text to lowercase using str_to_lower(), just to avoid any duplication of names which may appear in uppercase in the spider results.
simo_levels <- simo_tidied %>%
  separate(add, sep = "\\/",
           into = paste0("level", 1:5)) %>%
  mutate_at(vars(contains('level')), list(str_to_lower))
simo_levels
Now, we can pass our tidied dataframe to the collapsibleTree() function. We pass the following:
- df: the name of our dataframe
- hierarchy: the columns which define the order and hierarchy of our tree network (levels 1-5)
- tooltip: whether a tooltip is shown when hovering over a leaf node
- fill: the colour to fill any nodes which have children
- width: the width of the resulting diagram
- height: the height of the resulting diagram
- zoomable: whether you can zoom in and out of the diagram
The resulting output is shown after the code below. You can click on any blue-filled circles to expand into subcategories.
simo_tree <- collapsibleTree(
  df = simo_levels,
  hierarchy = paste0("level", 1:5),
  tooltip = TRUE,
  fill = "#64ABC2", # Let's use the highlight colour from Simo's site for parent nodes
  width = 1500,
  height = 1000,
  zoomable = TRUE
)
simo_tree
Our example above is a little tricky to navigate, so you can view a full-width version at https://storage.googleapis.com/pageflows/simo_tree.html.
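If you would like a standalone page like this for your own tree, one option is `saveWidget()` from the htmlwidgets package. The sketch below is not necessarily how the hosted version was built. It assumes htmlwidgets is installed, and `toy_levels`, `toy_tree` and the output file name are made up so that the example runs without the real data.

```r
library(collapsibleTree)
library(htmlwidgets)

# Tiny made-up hierarchy standing in for simo_levels
toy_levels <- data.frame(
  level1 = c("tags", "tags", "categories"),
  level2 = c("gtmtips", "analytics", "gtm-tips"),
  stringsAsFactors = FALSE
)

toy_tree <- collapsibleTree(
  df = toy_levels,
  hierarchy = c("level1", "level2"),
  fill = "#64ABC2",
  zoomable = TRUE
)

# Write the widget to an HTML file that can be opened in a browser.
# selfcontained = FALSE keeps the JavaScript in a folder next to the file.
out_file <- file.path(tempdir(), "toy_tree.html")
saveWidget(toy_tree, file = out_file, selfcontained = FALSE)
```

Swapping `toy_tree` for `simo_tree` and choosing your own file path would save the full diagram instead.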
From the diagram, we’re able to quickly understand a few aspects of Simo’s site:
- The most densely-populated leaf nodes (categories/analytics, categories/gtm-tips)
- The main areas of top-level navigation
What stands out to me is how focused the content of this site is. Blog content covering GA, GTM and additional areas takes the lion’s share of the site hierarchy.
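The same kind of observation can be checked numerically by counting pages per folder. The sketch below uses a small made-up `toy_levels` table with invented rows, not the real crawl; the same `count()` call could be applied to the levelled dataframe built earlier.

```r
library(tibble)
library(dplyr)

# Made-up rows standing in for the levelled URL data
toy_levels <- tribble(
  ~level1,      ~level2,
  "categories", "analytics",
  "categories", "analytics",
  "categories", "gtm-tips",
  "tags",       "gtmtips"
)

# Number of pages under each level 1 / level 2 combination, largest first
level_counts <- toy_levels %>%
  count(level1, level2, sort = TRUE)

level_counts
```

The folders with the largest `n` are the ones that appear most densely populated in the tree.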
I hope you find this a useful method to make sense of the often-complicated structures of websites. Check out the collapsibleTree package for more detail on how to build your own visualisations, the tidyverse site for more information about some of the functions I used, and of course make sure you subscribe to Simo’s Blog!