The Objectives

Preface

As a Public Relations Data Analyst, I constantly deal with vast amounts of scattered media monitoring data. For the time being, PR agencies (and hopefully their clients) have not had high expectations regarding metrics and measurement. But I believe that one day we will be urged to understand our daily media monitoring datasets more deeply: to find hidden patterns that have never been figured out before.

At present, media monitoring datasets serve a single purpose: reporting all online news articles generated by our clients' PR events. To ensure accountability, we must list the URL of each article.

But here's the thing: media coverage is reported as the 'number of articles'. If we include a dedicated column for media names in advance when building the dataset, we can easily track (count) coverage by each media name later on; if we don't, monitoring coverage per outlet becomes challenging.
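For illustration, once that dedicated media-name column exists, counting coverage per outlet becomes a one-liner with dplyr. This is only a sketch: dat and medianame are hypothetical names, not columns from our actual dataset.

# hypothetical dataset with one row per article and a `medianame` column
dat %>%
  count(medianame, sort = TRUE)   # coverage per media name, most-covered first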

At some point, we might need to manually list each media name based on the article’s URL after we have gathered all the headlines and URLs.

So, we have two problems here:

  • We need an automated workflow for this task, especially to reduce the human error that comes with manual listing

  • We need to figure out how to track coverage in as much depth as possible


In search of solutions: Eureka!

The DNS Hierarchy Example (src: educative.io)

To gain a better understanding, I suggest we take a look at this and this documentation :)


By observing patterns in the domain/DNS hierarchy, I discovered an approach that can give us valuable insights. We can use Regular Expressions (Regex) to strip http://, https://, and everything from the first slash (/) onward from each URL entry. We can then split what remains into its instances at every dot (.).

After cleaning the data with Regex, we can filter domain names and subdomains with a few lines of code. I prefer the R tidyverse packages for their higher-level functions and parameters.
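Before running this on the real data, here is a minimal sketch of the idea using a made-up URL (the object names below are illustrative only):

example_url <- "https://tekno.example.com/read/2021/08/11/some-article"
no_scheme   <- gsub("http[s]?://", "", example_url)        # drop http:// or https://
host_only   <- sapply(strsplit(no_scheme, "/"), "[", 1)    # keep only the part before the first slash
strsplit(host_only, "\\.")                                  # split the remaining instances on dots
#> [[1]]
#> [1] "tekno"   "example" "com"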


Speaking across domain knowledge

In the Public Relations scope, subdomains can be thought of as a "Media Desk" or "Media Region". Taking The New York Times (http://nytimes.com) as an example, they have several media desks, namely:

  • cooking.nytimes.com
  • parenting.nytimes.com
  • advertising.nytimes.com
  • cityandsuburban.nytimes.com, and so on…

Likewise, Indonesian media tend to name their subdomains after media desks or district/city names. Let's take Tribunnews (http://tribunnews.com) as an example. We can find some media regions here, such as:

  • jabar.tribunnews.com (West Java)
  • kaltim.tribunnews.com (East Borneo)
  • jogja.tribunnews.com (Special Region of Yogyakarta), and so on…


Speaking about the workflow

To keep the explanation of the data manipulation workflow simple, this portfolio relies heavily on Regex cleansing and dplyr filtering. Regex cleansing helps us eliminate unwanted instances such as 'https' and slashes.

Equally important, dplyr helps me do the 'subdomain and root domain filtering': my goal is to retrieve and classify these two elements for a more in-depth analysis of media desks.


Okayy, without further ado, let’s dive deeper!

Regex Cleansing on http[s]://

Library load

library(dplyr)
library(googlesheets4)
library(tidyr)
library(stringi)
library(stringr)
library(purrr)
library(janitor)

Preliminary: Grammatical Manipulation

I created a custom operator that functions as the negation of %in%, which I call %out%. Essentially, %out% can be used within dplyr::filter() to exclude all the values specified in the call.

# initiating %out%

`%out%` <- function(a,b) ! a %in% b
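A quick sanity check of the operator with toy values:

c("com", "id", "detik") %out% c("com", "id")
#> [1] FALSE FALSE  TRUE

# typical use inside a pipe: keep rows whose value is NOT in the given set
# df %>% filter(instance2 %out% c("co", "my", "go"))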

Data Load

gs4_deauth()
options(gargle_oauth_email = "gemforwork329@gmail.com")
dat_URL <-
  read_sheet(
    "https://docs.google.com/spreadsheets/d/1mKhCrBM5-uXZ_aRa72tCubgeSTZigipTG4DhLWr7t48/edit#gid=0"
  )

dat_URL


Here are the articles from our client; the campaign was about a mobile phone product launch and prelaunch. As we can see, we have gathered 4,222 different URLs from various media sources. Yep, just the URLs.

url_list <- dat_URL %>%
  select(URL) %>%
  as.list() %>%
  unlist()

Removing http[s]:// pattern


cleanurl3 <- gsub("http[s]?://","",url_list)
cleanurl3[1:5]
#>                                                                                                                          URL1 
#>                 "tekno.kompas.com/read/2021/08/11/21150047/harga-samsung-galaxy-z-fold-3-dan-z-flip-3-di-indonesia?page=all." 
#>                                                                                                                          URL2 
#> "www.idntimes.com/tech/gadget/alfonsus-adi-putra-2/produk-terbaru-yang-diluncurkan-dalam-samsung-galaxy-unpacked-2021-ini?q=" 
#>                                                                                                                          URL3 
#>                                            "www.antaranews.com/berita/2321406/samsung-luncurkan-galaxy-z-fold-3-dan-z-flip-3" 
#>                                                                                                                          URL4 
#>      "www.jawapos.com/oto-dan-tekno/gadget/11/08/2021/samsung-rilis-ponsel-lipat-berikutnya-galaxy-z-fold-3-dan-z-flip-3-5g/" 
#>                                                                                                                          URL5 
#>              "kumparan.com/kumparantech/samsung-galaxy-z-fold3-dan-z-flip3-resmi-rilis-ini-harganya-di-indonesia-1wJ1Ri6x0xL"


Removing the www. pattern


cleanurl4 <- gsub("www\\.", "", cleanurl3)  # escape the dot so only a literal "www." is removed
cleanurl4[1:5]
#>                                                                                                                      URL1 
#>             "tekno.kompas.com/read/2021/08/11/21150047/harga-samsung-galaxy-z-fold-3-dan-z-flip-3-di-indonesia?page=all." 
#>                                                                                                                      URL2 
#> "idntimes.com/tech/gadget/alfonsus-adi-putra-2/produk-terbaru-yang-diluncurkan-dalam-samsung-galaxy-unpacked-2021-ini?q=" 
#>                                                                                                                      URL3 
#>                                            "antaranews.com/berita/2321406/samsung-luncurkan-galaxy-z-fold-3-dan-z-flip-3" 
#>                                                                                                                      URL4 
#>      "jawapos.com/oto-dan-tekno/gadget/11/08/2021/samsung-rilis-ponsel-lipat-berikutnya-galaxy-z-fold-3-dan-z-flip-3-5g/" 
#>                                                                                                                      URL5 
#>          "kumparan.com/kumparantech/samsung-galaxy-z-fold3-dan-z-flip3-resmi-rilis-ini-harganya-di-indonesia-1wJ1Ri6x0xL"


Removing everything from the first slash (/) onward


pre_result_regex <- sapply(strsplit(cleanurl4,"/"),"[",1)
pre_result_regex[1:5]
#>               URL1               URL2               URL3               URL4 
#> "tekno.kompas.com"     "idntimes.com"   "antaranews.com"      "jawapos.com" 
#>               URL5 
#>     "kumparan.com"


Splitting all instances separated by dots (.)


result_regex <- as.list(strsplit(pre_result_regex, "\\."))
result_regex[1:5]
#> $URL1
#> [1] "tekno"  "kompas" "com"   
#> 
#> $URL2
#> [1] "idntimes" "com"     
#> 
#> $URL3
#> [1] "antaranews" "com"       
#> 
#> $URL4
#> [1] "jawapos" "com"    
#> 
#> $URL5
#> [1] "kumparan" "com"


This marks the end of the Regex cleansing process. We can now convert the result_regex list into a data frame.

Defining URL Hierarchies

Putting each cleaned entry into columns (as a data frame)

df_media <- as.data.frame(stringi::stri_list2matrix(result_regex, byrow = TRUE), stringsAsFactors=FALSE) ; head(df_media, 30)
nrow(df_media)
#> [1] 4222

Renaming columns (instead of calling them V1 to V4)

names(df_media) <- paste0("instance", seq_along(df_media)) ; head(df_media, 20)


I am interested in inspecting all the unique values in the instance2 column, assuming there are many domain names inside. At this phase, analysts should be open to exploring all possibilities in order to gain deeper EDA insights.


instance2_unique <- df_media %>%
  select(instance2) %>%
  distinct()
instance2_unique

I have around one hundred values, and it's apparent that there are still non-domain-name values in the instance2 column (e.g., com, id, co, web, my, net), which should be identified as top-level domains (TLDs). We have gained further insight from this data pattern.
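To see how dominant those TLD-like values are, a quick frequency count helps (a small exploratory sketch on the same df_media object):

df_media %>%
  count(instance2, sort = TRUE) %>%   # how often each instance2 value occurs
  head(10)                            # the top values should be TLDs such as "com" and "id"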


But now, how about values on the instance1 column?

instance1_unique <- df_media %>%
  select(instance1) %>%
  distinct(); instance1_unique


After inspecting both the instance1 and instance2 columns, we can infer that, for the most part, instance1 contains numerous domain names, such as kumparan, beritasatu, cnnindonesia, and antaranews.

I am confident that I can also identify several media desks among the three-instance domains (xxx.xxx.xxx) simply by filtering rows whose instance2 value appears in instance1_unique$instance1. Thanks to the data pattern, this operation lets us extract domains according to that list.


Let’s find our first (instance1, instance2, instance3) discoveries

(using instance1_unique$instance1 as domain name identifier)

domain_three_instances <- df_media %>%
  select(instance1, instance2, instance3, instance4) %>%
  # filtering root domains (instance2) that contain matching values
  # of instance1_unique$instance1
  filter(instance2 %in% instance1_unique$instance1) %>%
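  # `instance4 %in% NA` keeps only the rows where instance4 is NA
  # (`instance4 == NA` would evaluate to NA, so %in% is used instead)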
  filter(instance4 %in% NA) %>%
  select(-instance4) %>%
  # Taking out two instances domain that still remain
  filter(instance3 %out% NA) %>%
  distinct() %>%
  arrange(instance2); domain_three_instances

Bingo! We got our first filtered list of three-instance domains. While instance2 contains all the domain names (media names), we also get a sense of how varied the media desks are (instance1). In the Public Relations scope, the variety of online media desks (a.k.a. subdomains) is notoriously unpredictable.


Re-catching three-instance domains

(using instance2_unique$instance2 as the root domain identifier)

This may fetch domains that contain a single TLD

domain_three_instances_1 <- df_media %>%
  select(instance1, instance2, instance3, instance4) %>%
  # filtering root domains (instance2) that contain matching values
  # of instance2_unique$instance2
 
  filter(instance2 %in% instance2_unique$instance2) %>%
  filter(instance4 %in% NA) %>%
  select(-instance4) %>%
  filter(instance3 %out% NA) %>%
  distinct() %>%
 
  # Negate our previous findings so that we can retrieve the new ones
  filter(instance2 %out% domain_three_instances$instance2) %>%
 
  # Negate co, my, and go, as these are SLD/TLD parts rather than domain names
  filter(instance2 %out% c("co", "my", "go")) %>%
  arrange(instance2)


Another batch of three-instance domains

These contain both an SLD and a TLD (.my.id, .co.id, .com.my, etc.).
In this case, the domain names should end up in the instance2 column after renaming.
I need this consistent pattern so I can bind/aggregate all the results later.

domain_three_instances_2 <- df_media %>%
  select(instance1, instance2, instance3, instance4) %>%
  filter(instance2 %in% instance2_unique$instance2) %>%
  filter(instance4 %in% NA) %>%
  select(-instance4) %>%
  filter(instance3 %out% NA) %>%
  distinct() %>%
  filter(instance2 %out% domain_three_instances$instance2) %>%
  filter(instance2 %in% c("co", "my", "sg", "de", "com", "gov", "go")) %>%
 
  # Temporary colnames
  rename(two = instance1) %>%
  rename(three = instance2) %>%
  rename(four = instance3) %>%
 
  # Normalized colnames (instance2 as root domain)
  rename(instance2 = two) %>%
  rename(instance3 = three) %>%
  rename(instance4 = four) %>%
  arrange(instance2) ; domain_three_instances_2


The easiest: Four-Instance Domains

This step collects all domains that contain four instances (xxx.xxx.xxx.xxx) by simply dropping the rows whose instance4 value is missing.

domain_four_instances <- df_media %>%
  select(instance1, instance2, instance3, instance4) %>%
  tidyr::drop_na(instance4) %>%
  distinct() %>%
  arrange(instance2)

domain_four_instances

Compiling all the objects

three_instances_final <- rbind(domain_three_instances, domain_three_instances_1)
three_and_fourins <- bind_rows(domain_four_instances, three_instances_final)
to_filter_twoins <- three_and_fourins %>%
  select(instance2) %>%
  distinct()
# two-instance .com domains (e.g., kumparan.com): shift the media name into instance2 and "com" into instance3
domain_two_instances_com <- df_media %>%
  select(instance1, instance2, instance3, instance4) %>%
  filter(instance3 %in% NA) %>%
  select(-instance3) %>%
  filter(instance2 %in% "com") %>%
  distinct() %>%
  rename(two = instance1) %>%
  rename(three = instance2) %>%
  rename(instance2 = two) %>%
  rename(instance3 = three)

# remaining two-instance domains, excluding any instance2 value already captured as a root domain above
domain_two_instances_1 <- df_media %>%
  select(instance1, instance2, instance3, instance4) %>%
  filter(instance3 %in% NA) %>%
  select(-instance3) %>%
  filter(instance2 %out% to_filter_twoins$instance2) %>%
  select(-instance4) %>%
  distinct() %>%
  rename(two = instance1) %>%
  rename(three = instance2) %>%
  rename(instance2 = two) %>%
  rename(instance3 = three)



two_instances <- bind_rows(domain_two_instances_com, domain_two_instances_1)
# bind everything together; overlapping rows from the intermediate objects are de-duplicated later with distinct()
all_instances <- bind_rows(three_and_fourins, domain_three_instances_1, domain_three_instances, domain_three_instances_2, domain_four_instances, two_instances)

all_instances




Varieties of media names and desks: the result

with instance2 = domain name

all_instances <- all_instances %>%
  filter(instance2 %out% "") %>%
  mutate(final_domain = purrr::pmap_chr(., ~ c(...) %>%
                                   na.omit %>%
                                   paste(collapse = "."))) %>%
  mutate(medianame = instance2) %>%
  arrange(medianame)

unique_medianames <- all_instances %>%
  distinct(); unique_medianames

Equipping the dataset with cleaned values

Recalling all the URLs in order

cleanurl3 <- gsub("http[s]?://","",url_list)
cleanurl4 <- gsub("www\\.", "", cleanurl3)  # escape the dot, as before

final_domain <- sapply(strsplit(cleanurl4,"/"),"[",1)

final_domain <- as.data.frame(final_domain)

final_domain <- final_domain %>%
  select(final_domain) %>%
  rename(cleaned_domain = final_domain)

final_domain$NO <- 1:nrow(final_domain)

unique_medianames <- unique_medianames %>%
  select(final_domain, medianame) %>%
  rename(cleaned_domain = final_domain)
to_viz <- merge(final_domain, unique_medianames, by = "cleaned_domain", all.x = FALSE)

to_viz %>% arrange(NO)


Can we spot The Problem?

We obtained 4,223 rows instead of 4,222. I suspect the issue lies in the medianame column: a row appears to have been duplicated because identical values in the cleaned_domain column were matched to more than one media name.
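To illustrate why the row count can grow, here is a minimal sketch with toy data (not our actual tables): when the lookup table maps the same cleaned_domain to more than one medianame, merge() returns one row per match.

urls   <- data.frame(cleaned_domain = c("a.com", "b.com"))
lookup <- data.frame(cleaned_domain = c("a.com", "a.com", "b.com"),
                     medianame      = c("a", "com", "b"))
merge(urls, lookup, by = "cleaned_domain", all.x = FALSE)
# returns 3 rows instead of 2: the duplicated key inflates the result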


Finding the fault

identify <- to_viz %>%
  select(cleaned_domain, medianame) %>%
  distinct()

the_duplicated_one <-
  identify %>% select(cleaned_domain) %>% janitor::get_dupes() %>% distinct() ; the_duplicated_one
identify %>%
  filter(cleaned_domain %in% the_duplicated_one$cleaned_domain)

Phew! That was probably a mistake in filtering. I should make a note to figure it out later. For now, though, this condition is not critical: we can still remove the unwanted value from the medianame column by negating it (in other words, 'com' should not be considered a media name).


Taking out .com

to_viz <- to_viz %>%
  filter(medianame %out% "com")
to_viz <- to_viz %>%
  arrange(NO); to_viz


Done! We got 4,222 entries, which matches our input data.
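A quick sanity check along these lines (using the objects defined above):

nrow(to_viz) == nrow(dat_URL)   # should be TRUE once every URL maps to exactly one media name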

Initiating Treemap Data Structure

Using a special function called as.treemapDF.

I adapted some lower-level R code from Stack Overflow (I apologize, but I forgot to keep the link to the original answer), and I include that code in the appendix as well.


count_domains <- to_viz %>%
  select(cleaned_domain, medianame) %>%
  rename(`Media Outlet` = medianame) %>%
  rename(`URL Simplified` = cleaned_domain) %>%
  group_by(`Media Outlet`, `URL Simplified`) %>%
  summarise("values" = n())

media_treemap <- as.treemapDF(count_domains, valueCol = "values") %>% as.data.frame()

# replace NA parents with "" (plotly treats an empty string as the root node's parent)
new_parent <- media_treemap$parents %>%
  tidyr::replace_na(., "")

media_treemap <- bind_cols(media_treemap, new_parent) %>%
  rename(new_parent = ...5)   # bind_cols names the unnamed new column "...5"

Initiating Plotly Treemap

library(plotly)
media_plotly_treemap <-
  plot_ly(
    data = media_treemap,
    type = "treemap",
    ids = ~ ids,
    labels = ~ labels %>% stringr::str_wrap(width = 15),
    parents = ~ new_parent,
    values = ~ values,
    domain = list(column = 2)
  ) %>% layout(title = list(text="Media Desk Spread: Foldable Phone Product Launch and Prelaunch", y = 0.98, x = 0.55, xanchor = 'center', yanchor =  'top'))  %>%
  layout(uniformtext = list(minsize = 10)) %>% layout(colorway = ~
                                                               c("#005E7C",
                                                                 "#001242",
                                                                 "#632e60",
                                                                 "#162F20")) %>%
  config(displayModeBar = FALSE)

The Plotly Treemap Visualization!

media_plotly_treemap

The Data Intuition

This is the Plotly treemap visualization generated from our data. It clearly depicts the variety of media and media desks. From left to right, you can see the media names sorted by coverage, each with its own set of media desks, namely: Detik, Kurio, Teknosignal, Kompas, Line (Line Today, the media aggregator), Siapgrak, Tribunnews, Pikiran-rakyat, Grid, and many others.


Speaking of coverage, it is represented by area. What draws me to this treemap visualization is Plotly's interactivity, which makes it hoverable and clickable (and tidier than the other packages I tried). I hope you enjoy exploring the interactivity of this plot.

Deeper Analysis on Media Desks

I am interested in analyzing several media names: Detik, Kompas, Tribunnews, and Antaranews, since those outlets show a variety of media desks in the treemap above.

Initiating captured data for further analysis

varieties_and_numbers <-
to_viz %>%
  group_by(cleaned_domain, medianame) %>%
  summarise("coverage" = n()) %>%
  arrange(desc(coverage))


The Analysis

Detik

detik <-
varieties_and_numbers %>%
  filter(medianame %in% "detik")

detik
sum(detik$coverage)
#> [1] 145


The first media name we have is Detik.com. As visualized on the treemap, Detik has eight different media desks, including inet.detik.com, wolipop.detik.com, finance.detik.com, travel.detik.com, hot.detik.com, health.detik.com, sport.detik.com, and 20.detik.com. In terms of coverage, Detik generates a significant amount of buzz with 145 articles in total.

When it comes to the distribution of media desks, Detik focuses on inet.detik.com (Inet, covering InterNET and tech-related topics, with 70 articles), followed by wolipop.detik.com (which features lifestyle content, with 20 articles). These media desk placements appear reasonable, as all the articles are related to a smartphone product launch and prelaunch.


Detik’s coverage was followed by Kurio and Teknosignal. But none of them have media desks. So, I am heading to the next one, which is Kompas.

Kompas

varieties_and_numbers %>%
  filter(medianame %in% "kompas")

In terms of coverage, Kompas published 105 articles on this product launch and prelaunch campaign. Their media desks don't seem to be very diverse, with most of the coverage featured on tekno.kompas.com, followed by kompas.tv, beta.kompas.tv, and indeks.kompas.com.


Tribunnews

The next data point worth analyzing is Tribunnews. This kind of media prefers to organize desks by region (city/province name). In this case, however, they tended to place our client's articles on tribunnews.com, which is their main domain.


varieties_and_numbers %>%
  filter(medianame %in% "tribunnews")

Pikiran Rakyat

I found something interesting with this media. Pikiran-rakyat has numerous media desks, with the highest coverage attributed to mantrasukabumi.pikiran-rakyat.com (13 articles). This raises questions for further analysis: Did they distribute articles based on domain performance, or is there another preference?


varieties_and_numbers %>%
  filter(medianame %in% "pikiran-rakyat")

Grid

Grid also prefers to place our client's articles on nextren.grid.id (pronounced 'next trend'), followed by infokomputer.grid.id (focusing on computer and gadget information). Their desk placements also appear to be reasonable.


varieties_and_numbers %>%
  filter(medianame %in% "grid")

Antaranews

Antaranews follows the same pattern as Tribunnews. In terms of coverage and desk variety, they were not the biggest.

varieties_and_numbers %>%
  filter(medianame %in% "antaranews")



Another Graphing Option: Flourish Hierarchy Chart (treemap)

I am using values from count_domains. This visualization is embedded from Flourish, a client-side graphing provider similar to Plotly Express. Please refresh this page if the treemap does not appear.

# writexl::write_xlsx(count_domains, "to_flourish.xlsx")

Appendix

Plotly Hierarchical Data Structure

Custom function: as.treemapDF

as.treemapDF <- function(DF, valueCol = NULL){
  require(data.table)
 
  colNamesDF <- names(DF)
 
  if(is.data.table(DF)){
    DT <- copy(DF)
  } else {
    DT <- data.table(DF, stringsAsFactors = FALSE)
  }
 
  DT[, root := ""]
  colNamesDT <- names(DT)
 
  if(is.null(valueCol)){
    setcolorder(DT, c("root", colNamesDF))
  } else {
    setnames(DT, valueCol, "values", skip_absent=TRUE)
    setcolorder(DT, c("root", setdiff(colNamesDF, valueCol), "values"))
  }
 
  hierarchyCols <- setdiff(colNamesDT, "values")
  hierarchyList <- list()
 
  for(i in seq_along(hierarchyCols)){
    currentCols <- colNamesDT[1:i]
    if(is.null(valueCol)){
      currentDT <- unique(DT[, ..currentCols][, values := .N, by = currentCols], by = currentCols)
    } else {
      currentDT <- DT[, lapply(.SD, sum, na.rm = TRUE), by=currentCols, .SDcols = "values"]
    }
    setnames(currentDT, length(currentCols), "labels")
    hierarchyList[[i]] <- currentDT
  }
 
  hierarchyDT <- rbindlist(hierarchyList, use.names = TRUE, fill = TRUE)
 
  parentCols <- setdiff(names(hierarchyDT), c("labels", "values", valueCol))
  hierarchyDT[, parents := apply(.SD, 1, function(x){fifelse(all(is.na(x)), yes = NA_character_, no = paste(x[!is.na(x)], sep = ":", collapse = " - "))}), .SDcols = parentCols]
  hierarchyDT[, ids := apply(.SD, 1, function(x){paste(x[!is.na(x)], collapse = " - ")}), .SDcols = c("parents", "labels")]
  hierarchyDT[, c(parentCols) := NULL]
  return(hierarchyDT)
}
count_domains
media_treemap