Packages

library(bigrquery)
library(DBI)
library(tidyverse)
library(tidytext)
library(lubridate)
library(topicmodels)
library(countrycode)
library(SnowballC)
library(Matrix)
library(RecordLinkage)
library(textmineR)
library(kableExtra)

This project uses several R packages to support data acquisition, cleaning, analysis, and visualization. For example, bigrquery and DBI query data from Google BigQuery, tidytext enables text mining and topic modeling, and ggplot2 (loaded with the tidyverse) handles data visualization.
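Several display helpers referenced throughout this report (knit_table(), plus the colors and titles plot-styling objects) are defined in a setup chunk that is not shown in this section. A minimal sketch of what they might look like, assuming kableExtra is used for table styling; the palette hex values here are illustrative placeholders, not the report's actual definitions:

# Sketch of the report's display helpers (actual definitions live in the setup chunk)
knit_table <- function(df) {
  kableExtra::kbl(df) |>
    kableExtra::kable_styling(bootstrap_options = c("striped", "condensed"))
}

# Plot styling objects used by the ggplot2 chunks below
# (placeholder hex values -- the report's real palette may differ)
colors <- c("#1f2c8f", "#5b7fce", "#aac4e9")
titles <- "#1f2c8f"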

Introduction

In this project, we explore how search behavior in the United States reflects interest in major pop culture events. Using unsupervised topic modeling, we analyze Google’s top search terms to identify recurring themes and trends. By linking search activity with globally documented events, we aim to uncover how societal attention manifests in digital behavior.

Data Sources

  • Google BigQuery – Top 25 Search Terms
    Google’s public dataset on the top 25 search queries per day, available via BigQuery.
  • GDELT 2.0 – Our Global World in Realtime
    A large-scale global event database tracking news events from around the world in real time, including media coverage of pop culture, conflict, disasters, and more.

Data Acquisition

Google BigQuery – Top Search Terms

We use the bigrquery package to access Google’s public dataset of the most popular daily search terms. Make sure you have a Google Cloud project with billing enabled and the BigQuery API turned on; authentication here uses a service-account JSON key file.

# Authenticate with Google Cloud using a service-account key file (non-interactive)
bq_auth(path = "data607finalproject-6c666e5b6214.json")
project_id <- "data607finalproject"

query <- "SELECT
  'top_terms' as term_type,
  term,
  score,
  refresh_date,
  dma_id,
  dma_name,
  -- keep only the most recent week's (rank, week) pair for each term
  ARRAY_AGG(STRUCT(rank, week) ORDER BY week DESC LIMIT 1) x
FROM
    `bigquery-public-data.google_trends.top_terms`
WHERE refresh_date >= '2025-05-06' and refresh_date <= '2025-05-11'
AND regexp_contains(dma_name, 'New York|Los Angeles')
GROUP BY ALL

UNION ALL

SELECT
  'rising_terms' as term_type,
  term as term_rising,
  score as score_rising,
  refresh_date,
  dma_id,
  dma_name,
  ARRAY_AGG(STRUCT(rank,week) ORDER BY week DESC LIMIT 1) x
FROM
    `bigquery-public-data.google_trends.top_rising_terms`
WHERE refresh_date >= '2025-05-06' and refresh_date <= '2025-05-11'
AND regexp_contains(dma_name, 'New York|Los Angeles')
GROUP BY ALL
ORDER BY
    (select week from unnest(x)) desc,
    (SELECT rank FROM UNNEST(x)),
    refresh_date desc
"
)

# Dry-run the query first to estimate how many bytes (and how much cost) it will process
job <- bq_perform_query_dry_run(
  billing = project_id,
  query = query,
  use_legacy_sql = FALSE
)
job  # estimated bytes processed

google_search_data <- bq_project_query(project_id, query)
google_search_df <- bq_table_download(google_search_data)

knit_table(head(google_search_df))
term_type term score refresh_date dma_id dma_name x
top_terms sabu 100 2025-05-11 501 New York NY 1, 20219
top_terms sabu 100 2025-05-11 803 Los Angeles CA 1, 20219
rising_terms habeas corpus 100 2025-05-11 501 New York NY 1, 20219
rising_terms habeas corpus 100 2025-05-11 803 Los Angeles CA 1, 20219
top_terms barcelona vs real madrid 100 2025-05-11 501 New York NY 2, 20219
top_terms barcelona vs real madrid 100 2025-05-11 803 Los Angeles CA 2, 20219

GDELT 2.0: Our Global World in Realtime

Getting the latest file name:

# Get the name of the most recent GDELT CSV file
latest_file <- read_lines("http://data.gdeltproject.org/gdeltv2/lastupdate.txt")
latest_file
## [1] "58335 ceb398cb084a6c1665071e85971a2279 http://data.gdeltproject.org/gdeltv2/20250517023000.export.CSV.zip"  
## [2] "71232 f41e2d9e275493a78fe4c0185a41a1b4 http://data.gdeltproject.org/gdeltv2/20250517023000.mentions.CSV.zip"
## [3] "3307005 f76e6bc23ff338aef37ecc69d1672c1e http://data.gdeltproject.org/gdeltv2/20250517023000.gkg.csv.zip"


Getting event mentions within the same timeframe as the Google search terms:

all_files <- tibble(
  time = (
    seq(ymd_hms('20250506000000'), ymd_hms('20250512000000'), by = '1 day'))) |>
  mutate(
    time = format(time, format = '%Y%m%d%H%M%S'),
    file_url = paste0('http://data.gdeltproject.org/gdeltv2/', time, '.mentions.CSV.zip'),
    destfile = paste0("Data/GDELT/gdelt_", time, ".zip")
    )
knit_table(all_files) |> scroll_box(width = "100%", box_css = "border: 1px solid #FFFFFF;")
time file_url destfile
20250506000000 http://data.gdeltproject.org/gdeltv2/20250506000000.mentions.CSV.zip Data/GDELT/gdelt_20250506000000.zip
20250507000000 http://data.gdeltproject.org/gdeltv2/20250507000000.mentions.CSV.zip Data/GDELT/gdelt_20250507000000.zip
20250508000000 http://data.gdeltproject.org/gdeltv2/20250508000000.mentions.CSV.zip Data/GDELT/gdelt_20250508000000.zip
20250509000000 http://data.gdeltproject.org/gdeltv2/20250509000000.mentions.CSV.zip Data/GDELT/gdelt_20250509000000.zip
20250510000000 http://data.gdeltproject.org/gdeltv2/20250510000000.mentions.CSV.zip Data/GDELT/gdelt_20250510000000.zip
20250511000000 http://data.gdeltproject.org/gdeltv2/20250511000000.mentions.CSV.zip Data/GDELT/gdelt_20250511000000.zip
20250512000000 http://data.gdeltproject.org/gdeltv2/20250512000000.mentions.CSV.zip Data/GDELT/gdelt_20250512000000.zip


Download and Unzip:

# Constructing the full URL to download the ZIP file
export_url_line <- latest_file[1]

# Extract the URL from the line using strsplit
export_url <- strsplit(export_url_line, " ")[[1]][3]

# Download the export ZIP file to your working directory
download.file(export_url, destfile = "Data/GDELT/gdelt_latest.zip", mode = "wb")

# Unzip the file
unzip("Data/GDELT/gdelt_latest.zip", exdir='Data/GDELT')


Read CSV into R:

# Extract the filename from the URL (dropping the .zip extension)
csv_file <- paste0('Data/GDELT/', gsub(".zip", "", basename(export_url), fixed = TRUE))

# The export file is tab-delimited with no header row
gdelt_df <- read_tsv(csv_file, col_names = FALSE)


GDELT 2.0 – Structured Load
GDELT 2.0 is a massive global event database that captures media coverage of events in real time from across the world. The export files are updated every 15 minutes and are tab-delimited without column headers, so we must assign names manually using the GDELT schema.

In this chunk, we load the most recent export file and apply column names for easier exploration:

gdelt_cols <- c(
  "GLOBALEVENTID", "Day", "MonthYear", "Year", "FractionDate", 
  "Actor1Code", "Actor1Name", "Actor1CountryCode", "Actor1KnownGroupCode", 
  "Actor1EthnicCode", "Actor1Religion1Code", "Actor1Religion2Code", 
  "Actor1Type1Code", "Actor1Type2Code", "Actor1Type3Code",
  "Actor2Code", "Actor2Name", "Actor2CountryCode", "Actor2KnownGroupCode", 
  "Actor2EthnicCode", "Actor2Religion1Code", "Actor2Religion2Code", 
  "Actor2Type1Code", "Actor2Type2Code", "Actor2Type3Code",
  "IsRootEvent", "EventCode", "EventBaseCode", "EventRootCode", 
  "QuadClass", "GoldsteinScale", "NumMentions", "NumSources", 
  "NumArticles", "AvgTone", "Actor1Geo_Type", "Actor1Geo_FullName", 
  "Actor1Geo_CountryCode", "Actor1Geo_ADM1Code", "Actor1Geo_ADM2Code",
  "Actor1Geo_Lat", "Actor1Geo_Long", "Actor1Geo_FeatureID", "Actor2Geo_Type", 
  "Actor2Geo_FullName", "Actor2Geo_CountryCode", "Actor2Geo_ADM1Code",
  "Actor2Geo_ADM2Code", "Actor2Geo_Lat", "Actor2Geo_Long", "Actor2Geo_FeatureID", 
  "ActionGeo_Type", "ActionGeo_FullName", "ActionGeo_CountryCode", 
  "ActionGeo_ADM1Code", "Action2Geo_ADM2Code", "ActionGeo_Lat", "ActionGeo_Long", 
  "ActionGeo_FeatureID", "DATEADDED", "SOURCEURL"
)

# Read the file with tab delimiter and custom column names
gdelt_df <- read_delim(csv_file, delim = "\t", col_names = gdelt_cols, show_col_types = FALSE)

# Preview first rows
knit_table(head(gdelt_df)) |> scroll_box(width = "100%", box_css = "border: 1px solid #FFFFFF;")
GLOBALEVENTID Day MonthYear Year FractionDate Actor1Code Actor1Name Actor1CountryCode Actor1KnownGroupCode Actor1EthnicCode Actor1Religion1Code Actor1Religion2Code Actor1Type1Code Actor1Type2Code Actor1Type3Code Actor2Code Actor2Name Actor2CountryCode Actor2KnownGroupCode Actor2EthnicCode Actor2Religion1Code Actor2Religion2Code Actor2Type1Code Actor2Type2Code Actor2Type3Code IsRootEvent EventCode EventBaseCode EventRootCode QuadClass GoldsteinScale NumMentions NumSources NumArticles AvgTone Actor1Geo_Type Actor1Geo_FullName Actor1Geo_CountryCode Actor1Geo_ADM1Code Actor1Geo_ADM2Code Actor1Geo_Lat Actor1Geo_Long Actor1Geo_FeatureID Actor2Geo_Type Actor2Geo_FullName Actor2Geo_CountryCode Actor2Geo_ADM1Code Actor2Geo_ADM2Code Actor2Geo_Lat Actor2Geo_Long Actor2Geo_FeatureID ActionGeo_Type ActionGeo_FullName ActionGeo_CountryCode ActionGeo_ADM1Code Action2Geo_ADM2Code ActionGeo_Lat ActionGeo_Long ActionGeo_FeatureID DATEADDED SOURCEURL
1244378511 20240517 202405 2024 2024.375 NA NA NA NA NA NA NA NA NA NA AFGINSTAL TALIBAN AFG TAL NA NA NA INS NA NA 0 042 042 04 1 1.9 3 1 3 -0.831202 0 NA NA NA NA NA NA NA 4 Kabul, Kabol, Afghanistan AF AF13 3580 34.5167 69.1833 -3378435 4 Kabul, Kabol, Afghanistan AF AF13 3580 34.5167 69.1833 -3378435 2.025052e+13 https://wset.com/news/local/trump-admins-suspension-of-federal-refugee-program-upends-afghan-community-in-virginia-may-2025
1244378512 20240517 202405 2024 2024.375 AFG AFGHAN AFG NA NA NA NA NA NA NA AFGINSTAL TALIBAN AFG TAL NA NA NA INS NA NA 0 042 042 04 1 1.9 3 1 3 -0.831202 4 Kabul, Kabol, Afghanistan AF AF13 3580 34.5167 69.1833 -3378435 4 Kabul, Kabol, Afghanistan AF AF13 3580 34.5167 69.1833 -3378435 4 Kabul, Kabol, Afghanistan AF AF13 3580 34.5167 69.1833 -3378435 2.025052e+13 https://wset.com/news/local/trump-admins-suspension-of-federal-refugee-program-upends-afghan-community-in-virginia-may-2025
1244378513 20240517 202405 2024 2024.375 AFG AFGHANISTAN AFG NA NA NA NA NA NA NA AFGINSTAL TALIBAN AFG TAL NA NA NA INS NA NA 0 042 042 04 1 1.9 2 1 2 -0.831202 4 Kabul, Kabol, Afghanistan AF AF13 3580 34.5167 69.1833 -3378435 3 Fredericksburg, Virginia, United States US USVA VA163 37.8929 -79.4378 1494946 4 Kabul, Kabol, Afghanistan AF AF13 3580 34.5167 69.1833 -3378435 2.025052e+13 https://wset.com/news/local/trump-admins-suspension-of-federal-refugee-program-upends-afghan-community-in-virginia-may-2025
1244378514 20240517 202405 2024 2024.375 AFG AFGHANISTAN AFG NA NA NA NA NA NA NA AFGINSTAL TALIBAN AFG TAL NA NA NA INS NA NA 0 042 042 04 1 1.9 2 1 2 -0.831202 4 Saigon, H? Chíinh, Vietnam, Republic Of VM VM20 74101 10.7500 106.6670 -3730078 4 Saigon, H? Chíinh, Vietnam, Republic Of VM VM20 74101 10.7500 106.6670 -3730078 4 Saigon, H? Chíinh, Vietnam, Republic Of VM VM20 74101 10.7500 106.6670 -3730078 2.025052e+13 https://wset.com/news/local/trump-admins-suspension-of-federal-refugee-program-upends-afghan-community-in-virginia-may-2025
1244378515 20240517 202405 2024 2024.375 AFGINSTAL TALIBAN AFG TAL NA NA NA INS NA NA NA NA NA NA NA NA NA NA NA NA 0 043 043 04 1 2.8 3 1 3 -0.831202 4 Kabul, Kabol, Afghanistan AF AF13 3580 34.5167 69.1833 -3378435 0 NA NA NA NA NA NA NA 4 Kabul, Kabol, Afghanistan AF AF13 3580 34.5167 69.1833 -3378435 2.025052e+13 https://wset.com/news/local/trump-admins-suspension-of-federal-refugee-program-upends-afghan-community-in-virginia-may-2025
1244378516 20240517 202405 2024 2024.375 AFGINSTAL TALIBAN AFG TAL NA NA NA INS NA NA AFG AFGHAN AFG NA NA NA NA NA NA NA 0 043 043 04 1 2.8 3 1 3 -0.831202 4 Kabul, Kabol, Afghanistan AF AF13 3580 34.5167 69.1833 -3378435 4 Kabul, Kabol, Afghanistan AF AF13 3580 34.5167 69.1833 -3378435 4 Kabul, Kabol, Afghanistan AF AF13 3580 34.5167 69.1833 -3378435 2.025052e+13 https://wset.com/news/local/trump-admins-suspension-of-federal-refugee-program-upends-afghan-community-in-virginia-may-2025

In the following code chunk, we repeat the same process to load the event-mention export files and apply column names.

The Mentions table records each individual mention of an event in an article. It includes a Confidence field that records GDELT’s confidence in its extraction of that event from that particular article.

mention_gdelt_cols <- c(
  "GLOBALEVENTID", "EventTimeDate", "MentionTimeDate", "MentionType",
  "MentionSourceName", "MentionIdentifier",
  "SentenceID", "Actor1CharOffset", "Actor2CharOffset", "ActionCharOffset", 
  "InRawText", "Confidence", "MentionDocLen", 
  "MentionDocTone", "MentionDocTranslationInfo", "Extras"
)

# Download, unzip, and read each day's mentions file, then stack them into one dataframe
mention_gdelt_df <- map_dfr(seq_len(nrow(all_files)), function(i) {
  download.file(
    all_files$file_url[i],
    destfile = all_files$destfile[i],
    mode = "wb", quiet = TRUE)
  # Unzip the file
  unzip(all_files$destfile[i], exdir = 'Data/GDELT')
  csv_file <- paste0('Data/GDELT/',
                     gsub(".zip", "", basename(all_files$file_url[i]), fixed = TRUE))
  # Tab-delimited, no header; the compact col_types string matches mention_gdelt_cols
  read_delim(
    csv_file, delim = "\t", col_names = mention_gdelt_cols,
    col_types = "cccdccdddddddd??")
})
knit_table(head(mention_gdelt_df)) |> scroll_box(width = "100%", box_css = "border: 1px solid #FFFFFF;")
GLOBALEVENTID EventTimeDate MentionTimeDate MentionType MentionSourceName MentionIdentifier SentenceID Actor1CharOffset Actor2CharOffset ActionCharOffset InRawText Confidence MentionDocLen MentionDocTone MentionDocTranslationInfo Extras
1173957950 20240506004500 20250506000000 1 wwno.org https://www.wwno.org/local-regional-news/2025-05-05/new-orleans-got-388-million-in-federal-pandemic-aid-was-that-money-well-spent 9 -1 2207 2153 1 60 14002 0.4860804 NA NA
1174053899 20240506131500 20250506000000 1 nydailynews.com https://www.nydailynews.com/2025/05/05/met-gala-2025-adrienne-adams-brad-lander-nyc-politicians/ 4 -1 611 629 0 20 2332 3.7234043 NA NA
1241978379 20250506000000 20250506000000 1 bignewsnetwork.com http://www.bignewsnetwork.com/news/278204974/canada-alberta-premier-says-separation-referendum-possible-in-2026 1 64 209 127 0 20 1114 -2.6881720 NA NA
1173957950 20240506004500 20250506000000 1 wwno.org https://www.wwno.org/local-regional-news/2025-05-05/new-orleans-got-388-million-in-federal-pandemic-aid-was-that-money-well-spent 9 2207 -1 2153 1 60 14002 0.4860804 NA NA
1174054437 20240506131500 20250506000000 1 marketscreener.com https://www.marketscreener.com/news/latest/Diana-Ross-Simone-Biles-shine-at-Met-Ball-celebration-of-Black-menswear-49831578/ 8 1767 -1 1796 1 60 2152 3.5714286 NA NA
1174033236 20240506111500 20250506000000 1 wwno.org https://www.wwno.org/local-regional-news/2025-05-05/new-orleans-got-388-million-in-federal-pandemic-aid-was-that-money-well-spent 9 2205 2132 2151 0 40 14002 0.4860804 NA NA


Load GDELT Reference Table Data

Define reference table data file sources

cameo_country_code_txt <- "https://www.gdeltproject.org/data/lookups/CAMEO.country.txt"
fips_country_code_txt <- 'https://www.gdeltproject.org/data/lookups/FIPS.country.txt'
cameo_code_type_txt <- 'https://www.gdeltproject.org/data/lookups/CAMEO.type.txt'
cameo_known_code_groups_txt <- 'https://www.gdeltproject.org/data/lookups/CAMEO.knowngroup.txt'
cameo_ethnic_code_txt <- 'https://www.gdeltproject.org/data/lookups/CAMEO.ethnic.txt'
cameo_religion_code_txt <- 'https://www.gdeltproject.org/data/lookups/CAMEO.religion.txt'
cameo_event_code_txt <- 'https://www.gdeltproject.org/data/lookups/CAMEO.eventcodes.txt'
cameo_gold_stein_scale_txt <- 'https://www.gdeltproject.org/data/lookups/CAMEO.goldsteinscale.txt'
adm1_codes_txt <- 'http://efele.net/maps/fips-10/data/fips-414.txt'
adm2_codes_txt <- 'https://download.geonames.org/export/dump/admin2Codes.txt'

Actor Data Reference Tables

# Actor Country Code column reference values
cameo_country_ref <- read.csv(
  cameo_country_code_txt, sep="\t", header=T,
  col.names = c('Country', 'Country_Desc'))

# Actor Known Code Group column reference values
cameo_known_code_groups_ref <- read.csv(
  cameo_known_code_groups_txt, sep="\t", header=T,
  col.names = c('KnownGroup', 'KnownGroup_Desc'))

# Actor Ethnic Code column reference values
cameo_ethnic_ref <- read.csv(cameo_ethnic_code_txt, sep="\t", header=T,
                            col.names = c('Ethnic', 'Ethnic_Desc'))

# Actor Religion Code (1 and 2) column reference values
cameo_religion_ref <- read.csv(cameo_religion_code_txt, sep="\t", header=T,
                              col.names = c('Religion', 'Religion_Desc'))

# Actor Type Code (1, 2, and 3) column reference values
cameo_type_ref <- read.csv(cameo_code_type_txt, sep="\t", header=T,
                                col.names = c('TypeCode', 'Type'))

Event Data Reference Tables

# Event Code (including base and root event codes) column reference values
cameo_event_ref <- read.csv(cameo_event_code_txt, sep="\t", header=T ,
                           colClasses = "character",
                                col.names = c('EventCode', 'Event'))

# Event Goldstein scale column reference values
cameo_gold_stein_ref <- read.csv(cameo_gold_stein_scale_txt, sep="\t", header=T,
                                col.names = c('EventCode', 'GoldSteinScale'))

# Event quad class (primary classification) column reference values
quadclass_ref <- tribble(
  ~"QuadClass", ~"Quad",
  1, 'Verbal Cooperation',
  2, 'Material Cooperation',
  3, 'Verbal Conflict',
  4, 'Material Conflict'
)

Event Geography Data Reference Tables

# Event geographic resolution column reference values
geo_type_ref <- tribble(
  ~"Geo_Type", ~"Geo_Type_Desc",
  1, 'COUNTRY',
  2, 'USSTATE',
  3, 'USCITY',
  4, 'WORLDCITY',
  5, 'WORLDSTATE'
)

# FIPS country column reference values
fips_country_ref <- read.csv(fips_country_code_txt, sep="\t", header=F,
                            col.names = c('GEO_Country', 'GEO_Country_Desc'))

# GEO ADM1 code column reference values
# create U.S. adm1 codes using 'US' followed by the state abbreviation
us_fips_adm1_ref <- tibble(
  'Country' = 'US',
  'division' = 'state',
  fips_adm1 = state.name,
  state_abb = state.abb
  ) |>
  mutate(fips_adm1_code = paste0(Country, state_abb)) |>
  select(fips_adm1_code, division, fips_adm1)

# read in global adm1 codes and join to custom U.S. codes
fips_adm1_ref <- read.csv(
  adm1_codes_txt, sep="_", header=F, na.strings = "",
  col.names = c('fips_adm1_code',
                'first_fips_version',
                'last_fips_version',
                'division',
                'designation_2',
                'designation_3',
                'designation_4',
                'fips_adm1',
                'conventional_name',
                'former name')) |>
  select('fips_adm1_code', 'division', 'fips_adm1') |>
  rbind(us_fips_adm1_ref)

Tidy & Clean Data

Google BigQuery

  • Unnest x so that rank and week become their own columns in a long-format dataframe
  • Fill null score values with zero
  • Create a search_id column from the associated row number
long_google_search_df <-  google_search_df |>
  unnest(x) |>
  mutate(
    score = replace_na(score, 0),
    search_id = row_number())

knit_table(head(long_google_search_df))
term_type term score refresh_date dma_id dma_name rank week search_id
top_terms sabu 100 2025-05-11 501 New York NY 1 2025-05-11 1
top_terms sabu 100 2025-05-11 803 Los Angeles CA 1 2025-05-11 2
rising_terms habeas corpus 100 2025-05-11 501 New York NY 1 2025-05-11 3
rising_terms habeas corpus 100 2025-05-11 803 Los Angeles CA 1 2025-05-11 4
top_terms barcelona vs real madrid 100 2025-05-11 501 New York NY 2 2025-05-11 5
top_terms barcelona vs real madrid 100 2025-05-11 803 Los Angeles CA 2 2025-05-11 6

Subset refresh date into its own table for reference.

refresh_date_ref <- long_google_search_df |>
  select(refresh_date) |>
  distinct()

knit_table(head(refresh_date_ref))
refresh_date
2025-05-11
2025-05-10
2025-05-09
2025-05-08
2025-05-07
2025-05-06

Create Designated Market Area (DMA) reference table.

dma_ref_df <- long_google_search_df |>
  select(dma_id, dma_name) |>
  distinct()

knit_table(head(dma_ref_df))
dma_id dma_name
501 New York NY
803 Los Angeles CA

Create final Google top term dataframe

top_term_df <- long_google_search_df |>
  filter(term_type == "top_terms")

knit_table(head(top_term_df))
term_type term score refresh_date dma_id dma_name rank week search_id
top_terms sabu 100 2025-05-11 501 New York NY 1 2025-05-11 1
top_terms sabu 100 2025-05-11 803 Los Angeles CA 1 2025-05-11 2
top_terms barcelona vs real madrid 100 2025-05-11 501 New York NY 2 2025-05-11 5
top_terms barcelona vs real madrid 100 2025-05-11 803 Los Angeles CA 2 2025-05-11 6
top_terms barcelona - real madrid 100 2025-05-11 501 New York NY 3 2025-05-11 9
top_terms barcelona - real madrid 100 2025-05-11 803 Los Angeles CA 3 2025-05-11 10

Create final Google rising term dataframe

rising_term_df <- long_google_search_df |>
  filter(term_type == "rising_terms")

knit_table(head(rising_term_df))
term_type term score refresh_date dma_id dma_name rank week search_id
rising_terms habeas corpus 100 2025-05-11 501 New York NY 1 2025-05-11 3
rising_terms habeas corpus 100 2025-05-11 803 Los Angeles CA 1 2025-05-11 4
rising_terms derek carr 98 2025-05-11 501 New York NY 2 2025-05-11 7
rising_terms derek carr 72 2025-05-11 803 Los Angeles CA 2 2025-05-11 8
rising_terms tripe 12 2025-05-11 501 New York NY 3 2025-05-11 11
rising_terms tripe 21 2025-05-11 803 Los Angeles CA 3 2025-05-11 12

GDELT 2.0

Note: GDELT data is already tidy in its raw state, so no manual tidying was required.

GDELT Event Data Cleaning

  • Transforms Day into the date data type
  • Transforms MonthYear into a year-month string
  • Transforms GLOBALEVENTID into a string
  • Transforms all categorical code columns into the factor data type
  • Filters events to those occurring within the past 14 days
gdelt_df_cleaned <- gdelt_df |>
  mutate(across(contains('Code'), as.factor)) |>
  mutate(across(contains('Geo_Type'), as.factor)) |>
  mutate(
    GLOBALEVENTID = as.character(GLOBALEVENTID),
    Day = as.Date(as.character(Day), '%Y%m%d'),
    MonthYear = format(Day, '%Y-%m'),
    QuadClass = as.factor(QuadClass)
  ) |> 
  filter(Day >= today() - days(14) & Day <= today())

knit_table(head(gdelt_df_cleaned)) |> scroll_box(
  width = "100%", box_css = "border: 1px solid #FFFFFF;")
GLOBALEVENTID Day MonthYear Year FractionDate Actor1Code Actor1Name Actor1CountryCode Actor1KnownGroupCode Actor1EthnicCode Actor1Religion1Code Actor1Religion2Code Actor1Type1Code Actor1Type2Code Actor1Type3Code Actor2Code Actor2Name Actor2CountryCode Actor2KnownGroupCode Actor2EthnicCode Actor2Religion1Code Actor2Religion2Code Actor2Type1Code Actor2Type2Code Actor2Type3Code IsRootEvent EventCode EventBaseCode EventRootCode QuadClass GoldsteinScale NumMentions NumSources NumArticles AvgTone Actor1Geo_Type Actor1Geo_FullName Actor1Geo_CountryCode Actor1Geo_ADM1Code Actor1Geo_ADM2Code Actor1Geo_Lat Actor1Geo_Long Actor1Geo_FeatureID Actor2Geo_Type Actor2Geo_FullName Actor2Geo_CountryCode Actor2Geo_ADM1Code Actor2Geo_ADM2Code Actor2Geo_Lat Actor2Geo_Long Actor2Geo_FeatureID ActionGeo_Type ActionGeo_FullName ActionGeo_CountryCode ActionGeo_ADM1Code Action2Geo_ADM2Code ActionGeo_Lat ActionGeo_Long ActionGeo_FeatureID DATEADDED SOURCEURL
1244378528 2025-05-10 2025-05 2025 2025.356 USA UNITED STATES USA NA NA NA NA NA NA NA USAGOV UNITED STATES USA NA NA NA NA GOV NA NA 1 043 043 04 1 2.8 1 1 1 -7.917889 3 Newark, New Jersey, United States US USNJ NJ013 40.73570 -74.172400 878762 3 Newark, New Jersey, United States US USNJ NJ013 40.73570 -74.172400 878762 3 Newark, New Jersey, United States US USNJ NJ013 40.73570 -74.172400 878762 2.025052e+13 https://newrepublic.com/post/195380/donald-trump-justice-department-charge-member-congress-ice
1244378529 2025-05-10 2025-05 2025 2025.356 USA UNITED STATES USA NA NA NA NA NA NA NA USAGOV UNITED STATES USA NA NA NA NA GOV NA NA 1 043 043 04 1 2.8 1 1 1 -7.917889 3 Newark, New Jersey, United States US USNJ NJ013 40.73570 -74.172400 878762 1 Jersey JE JE NA 49.21667 -2.116667 JE 3 Newark, New Jersey, United States US USNJ NJ013 40.73570 -74.172400 878762 2.025052e+13 https://newrepublic.com/post/195380/donald-trump-justice-department-charge-member-congress-ice
1244378530 2025-05-10 2025-05 2025 2025.356 USAGOV UNITED STATES USA NA NA NA NA GOV NA NA USA UNITED STATES USA NA NA NA NA NA NA NA 1 042 042 04 1 1.9 1 1 1 -7.917889 1 Jersey JE JE NA 49.21667 -2.116667 JE 3 Newark, New Jersey, United States US USNJ NJ013 40.73570 -74.172400 878762 3 Newark, New Jersey, United States US USNJ NJ013 40.73570 -74.172400 878762 2.025052e+13 https://newrepublic.com/post/195380/donald-trump-justice-department-charge-member-congress-ice
1244378531 2025-05-10 2025-05 2025 2025.356 USAGOV UNITED STATES USA NA NA NA NA GOV NA NA USA JERSEY USA NA NA NA NA NA NA NA 1 112 112 11 3 -2.0 1 1 1 -7.917889 3 Newark, New Jersey, United States US USNJ NJ013 40.73570 -74.172400 878762 3 Newark, New Jersey, United States US USNJ NJ013 40.73570 -74.172400 878762 1 Jersey JE JE NA 49.21667 -2.116667 JE 2.025052e+13 https://newrepublic.com/post/195380/donald-trump-justice-department-charge-member-congress-ice
1244378532 2025-05-10 2025-05 2025 2025.356 USAGOV UNITED STATES USA NA NA NA NA GOV NA NA USA JERSEY USA NA NA NA NA NA NA NA 1 112 112 11 3 -2.0 1 1 1 -7.917889 1 Jersey JE JE NA 49.21667 -2.116667 JE 1 Jersey JE JE NA 49.21667 -2.116667 JE 1 Jersey JE JE NA 49.21667 -2.116667 JE 2.025052e+13 https://newrepublic.com/post/195380/donald-trump-justice-department-charge-member-congress-ice
1244378533 2025-05-16 2025-05 2025 2025.373 IRQSPY IRAQ IRQ NA NA NA NA SPY NA NA SYR DAMASCUS SYR NA NA NA NA NA NA NA 0 040 040 04 1 1.0 1 1 1 -1.829925 4 Damascus, Dimashq, Syria SY SY13 26324 33.50000 36.300000 -2541260 4 Damascus, Dimashq, Syria SY SY13 26324 33.50000 36.300000 -2541260 4 Damascus, Dimashq, Syria SY SY13 26324 33.50000 36.300000 -2541260 2.025052e+13 https://www.zerohedge.com/geopolitical/jordans-king-warned-us-against-assassinating-syrias-sharaa-trump-meeting

GDELT Event Mentions Data Cleaning

  • Transforms EventTimeDate and MentionTimeDate into datetime strings
  • Transforms GLOBALEVENTID into a string
  • Subsets the dataframe to the event level by keeping, for each event, the mention with the highest confidence
  • Filters the dataframe to mentions sourced from the web (MentionType of 1) with an extraction confidence of at least 60%
  • Creates an Article column by extracting the article name from the mention source URL
mention_gdelt_df_cleaned = mention_gdelt_df |>
  mutate(
    GLOBALEVENTID = as.character(GLOBALEVENTID),
    EventTimeDate = format(lubridate::ymd_hms(as.character(EventTimeDate)), '%Y-%m-%d %H:%M:%S'),
    MentionTimeDate = format(lubridate::ymd_hms(as.character(MentionTimeDate)), '%Y-%m-%d %H:%M:%S')
    ) |>
  # keep one mention per event: the highest-confidence web mention
  group_by(GLOBALEVENTID) |>
  arrange(GLOBALEVENTID, desc(Confidence), desc(MentionType), MentionSourceName) |>
  slice_head(n = 1) |>
  ungroup() |>
  filter(MentionType == 1 & Confidence >= 60)

mention_gdelt_df_cleaned = mention_gdelt_df_cleaned |>
  # strip trailing slashes, numeric IDs, file extensions, and query strings from the URL
  mutate(Article = gsub('\\/$|\\/([A-Za-z]|\\d)*$|\\/*\\d*$|\\/([A-Z]|\\d)*\\/$|\\/*-*\\d*(\\.[a-z]+$|\\/$|$)|\\/*\\d*\\/_.*$|\\/\\?.*$', "", MentionIdentifier)) |>
  # remove date-stamped suffixes and other site-specific leftovers
  mutate(Article = gsub('-2025-\\d+.*$|-2025\\d+-.*$|\\/\\d*-*p5lyys.*$', "", Article)) |>
  # remove UUID-style slug segments
  mutate(Article = gsub('\\/([a-z]|\\d|article)*_*-*([a-z]|\\d)*-([a-z]|\\d)*-([a-z]|\\d)*-([a-z]|\\d){4}-([a-z]|\\d)*$', "", Article)) |>
  # drop the protocol, domain, and leading numeric path, keeping the article slug
  mutate(Article = gsub('(https://|http://).*\\..*\\/\\d*-*', '', Article)) |>
  # turn hyphenated slugs into readable words
  mutate(Article = str_replace_all(Article, '-', ' ')) |>
  mutate(Article = str_replace(Article, '\\.', '')) |>
  # discard values that are not real article names
  mutate(Article = na_if(Article, 'index?more=')) |>
  mutate(Article = ifelse(str_count(Article) <= 1, NA, Article)) |>
  mutate(Article = ifelse(str_count(Article, '\\w+') <= 1, NA, Article))

knit_table(head(mention_gdelt_df_cleaned)) |> scroll_box(
  width = "100%", box_css = "border: 1px solid #FFFFFF;")
GLOBALEVENTID EventTimeDate MentionTimeDate MentionType MentionSourceName MentionIdentifier SentenceID Actor1CharOffset Actor2CharOffset ActionCharOffset InRawText Confidence MentionDocLen MentionDocTone MentionDocTranslationInfo Extras Article
1173957950 2024-05-06- 00:45:00 2025-05-06- 00:00:00 1 wwno.org https://www.wwno.org/local-regional-news/2025-05-05/new-orleans-got-388-million-in-federal-pandemic-aid-was-that-money-well-spent 9 -1 2207 2153 1 60 14002 0.4860804 NA NA new orleans got 388 million in federal pandemic aid was that money well spent
1174054437 2024-05-06- 13:15:00 2025-05-06- 00:00:00 1 marketscreener.com https://www.marketscreener.com/news/latest/Diana-Ross-Simone-Biles-shine-at-Met-Ball-celebration-of-Black-menswear-49831578/ 8 1767 -1 1796 1 60 2152 3.5714286 NA NA Diana Ross Simone Biles shine at Met Ball celebration of Black menswear
1174227361 2024-05-07- 07:15:00 2025-05-07- 00:00:00 1 bangkokpost.com https://www.bangkokpost.com/thailand/general/3019210/govt-ready-for-new-talks-on-peace-in-deep-south 5 -1 823 941 1 60 3054 -4.4680851 NA NA govt ready for new talks on peace in deep south
1174261394 2024-05-07- 11:00:00 2025-05-07- 00:00:00 1 bangkokpost.com https://www.bangkokpost.com/thailand/general/3019210/govt-ready-for-new-talks-on-peace-in-deep-south 5 823 -1 978 1 60 3054 -4.4680851 NA NA govt ready for new talks on peace in deep south
1174400771 2024-05-08- 00:15:00 2025-05-08- 00:00:00 1 kenw.org https://www.kenw.org/npr-news/2025-05-07/3-former-memphis-police-officers-found-not-guilty-in-the-death-of-tyre-nichols 6 1055 -1 1037 1 100 5634 -9.5744681 NA NA former memphis police officers found not guilty in the death of tyre nichols
1174418451 2024-05-08- 02:00:00 2025-05-08- 00:00:00 1 cbs12.com https://cbs12.com/news/connect-to-congress/house-gop-continues-to-push-colleges-to-do-more-to-combat-antisemitism-on-campus-haverford-college-depaul-university-california-polytechnic-state-university-house-education-committee 6 2151 2158 2133 1 100 4985 -5.7397959 NA NA house gop continues to push colleges to do more to combat antisemitism on campus haverford college depaul university california polytechnic state university house education committee

Join Data

Join the Google BigQuery data with the GDELT data

gsearch_gdelt <- left_join(
  long_google_search_df, 
  gdelt_df_cleaned,
  by=c('refresh_date'='Day'))

Exploratory Data Analysis

Stop Words
Stop words are common words that should be excluded from text analysis, such as “the”, “a”, “and”, etc. The tidytext package provides stop word lists in multiple languages; in this analysis, stop words were removed in both English and Spanish.

# getting stop words for english and spanish 
multi_lang_stopwords <- bind_rows(
  get_stopwords(language = "en"),
  get_stopwords(language = "es")
)

Tokenization
In text mining, tokenization refers to breaking text down into smaller units. Because the Google search data already consists of short phrases, we tokenize at the word level.

tokenized_gsearch_gdelt <- gsearch_gdelt |> 
  unnest_tokens(output = word, input = term) |> 
  anti_join(multi_lang_stopwords, by = "word") |> 
  relocate(search_id, word)

Frequency Analysis
After tokenizing the text, a frequency analysis is conducted to better understand which searches appear most frequently, both at the word level and at the whole-search level.

most_freq_tokens <- tokenized_gsearch_gdelt |> 
  count(word, sort = TRUE) |> 
  slice_max(order_by = n, n = 25) |> 
  mutate(color_group = factor(row_number() %% 3)) 

most_freq_searches <- gsearch_gdelt |> 
  count(term, sort = TRUE) |> 
  slice_max(order_by = n, n = 25) |> 
  mutate(color_group = factor(row_number() %% 3)) 

Frequency Plots

ggplot(most_freq_tokens, aes(x = reorder(word, -n), y = n, fill = color_group)) +
  geom_col() +
  scale_fill_manual(values = colors) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 90, hjust = 1, color = titles),
    plot.title = element_text(face = "bold"),
    text = element_text(color = titles)
  ) +
  labs(y = "Count",
       x = "Tokenized Terms (Words)",
       title = "Word Frequency in Google Searches",
       subtitle = "Searches are limited to the NYC and LA DMAs, May 6-11, 2025") +
  guides(fill = "none")

ggplot(most_freq_searches, aes(x = reorder(term, -n), y = n, fill = color_group)) +
  geom_col() +
  scale_fill_manual(values = colors) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 90, hjust = 1, color = titles),
    plot.title = element_text(face = "bold"),
    text = element_text(color = titles)
  ) +
  labs(y = "Count",
       x = "Google Searches",
       title = "Google Searches Frequency",
       subtitle = "Searches are limited to the NYC and LA DMAs, May 6-11, 2025") +
  guides(fill = "none")

tokenized_gsearch_gdelt |> 
  group_by(refresh_date) |> 
  count(word, sort = TRUE) |> 
  slice_max(order_by = n, n = 5) |> 
  ungroup() |> 
  mutate(word = reorder_within(word, n, refresh_date),
         date = factor(refresh_date, levels=sort(unique(tokenized_gsearch_gdelt$refresh_date),decreasing=TRUE),ordered = TRUE))|>
  ggplot(aes(x = word, y = n, fill = word)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ date, scales = "free_y") +
  coord_flip() +
  scale_x_reordered() +
  scale_y_continuous(expand = c(0,0)) +
  theme_minimal()+
  labs(y = "Count",
       x = "Unique words",
       title = "Most frequently googled words in past two weeks",
       subtitle = "Stop words removed from the list")

gsearch_gdelt |> 
  filter(term_type == "top_terms",
         refresh_date >= "2025-05-03") |> 
  group_by(refresh_date) |> 
  count(term, sort = TRUE) |> 
  slice_max(order_by = n, n = 3) |> 
  ungroup() |> 
  mutate(word = reorder_within(term, n, refresh_date),
         date = factor(refresh_date, levels = sort(unique(gsearch_gdelt$refresh_date), decreasing = TRUE), ordered = TRUE)) |>
  ggplot(aes(x = word, y = n, fill = word)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ date, scales = "free_y") +
  coord_flip() +
  scale_x_reordered() +
  scale_y_continuous(expand = c(0,0)) +
  theme_minimal()+
  labs(y = "Count",
       x = "Top Searched Terms",
       title = "Top Google Searches in Past 8 Days in LA and NYC",
       subtitle = "")

gsearch_gdelt |> 
  filter(term_type == "rising_terms",
         refresh_date >= "2025-05-03") |> 
  group_by(refresh_date) |> 
  count(term, sort = TRUE) |> 
  slice_max(order_by = n, n = 3) |> 
  ungroup() |> 
  mutate(word = reorder_within(term, n, refresh_date),
         date = factor(refresh_date, levels = sort(unique(gsearch_gdelt$refresh_date), decreasing = TRUE), ordered = TRUE)) |>
  ggplot(aes(x = word, y = n, fill = word)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ date, scales = "free_y") +
  coord_flip() +
  scale_x_reordered() +
  scale_y_continuous(expand = c(0,0)) +
  theme_minimal()+
  labs(y = "Count",
       x = "Top Searched Terms",
       title = "Rising Google Searches in Past 8 Days in LA and NYC",
       subtitle = "")

Statistical Analysis

LDA Modeling

Word Stemming
Word stemming is another common text mining technique: each token is reduced to its root form, known as the stem. This normalizes related terms, so that play, playing, played, and plays are all stemmed to the same root.

wrd_stem <- tokenized_gsearch_gdelt |> 
  mutate(word = SnowballC::wordStem(word))
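
As a quick sanity check, wordStem() can be applied to a handful of illustrative words, and we can compare vocabulary sizes before and after stemming (the example words are invented; the exact stems depend on the Snowball variant used):

# Illustration: inflected variants collapse toward a common stem
SnowballC::wordStem(c("play", "playing", "played", "plays"))

# How much stemming shrinks the vocabulary of our tokens
n_distinct(tokenized_gsearch_gdelt$word)
n_distinct(wrd_stem$word)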

Document Term Matrix
A document-term matrix (DTM) represents the frequency of each term across a collection of documents; here, each refresh_date is treated as a document.

# creating document term matrix 
dtm_tokenized_terms <- wrd_stem |> 
  count(refresh_date, word) |> 
  cast_dtm(document = refresh_date, term = word, value = n)

# Convert the DTM to a regular (dense) matrix
mat <- as.matrix(dtm_tokenized_terms)
# Convert to a sparse matrix (dgCMatrix) so textmineR::FitLdaModel can use it below
dtm_mat <- Matrix(mat, sparse = TRUE)
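
To make the structure concrete, here is a minimal toy example (invented documents, not from our data) of casting word counts into a document-term matrix:

# Toy example: two tiny "documents" cast to a document-term matrix
toy_counts <- tibble(doc = c("d1", "d1", "d2"),
                     word = c("pope", "leo", "pope")) |>
  count(doc, word)
toy_dtm <- cast_dtm(toy_counts, document = doc, term = word, value = n)
toy_dtm  # a 2 x 2 DocumentTermMatrix: rows are documents, columns are terms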

Latent Dirichlet Allocation (LDA)
LDA is a topic modeling method used to identify latent topics and their distributions across a corpus.

Corpus: a corpus is a collection of texts used in natural language processing.

In LDA, k is the number of topics assumed to appear in the corpus. Choosing k calls for some additional analysis, shown below.

# developing list of potential k values to fit in LDA model
k_list <- seq(2, 20, by = 1)

# testing fit for k values in LDA models
fit_test_k_models <- lapply(k_list, function(k) {
  FitLdaModel(dtm = dtm_mat, #using converted matrix
              k = k,
              iterations = 200,
              burnin = 175,
              alpha = 0.1,
              beta = 0.05,
              optimize_alpha = TRUE,
              calc_likelihood = TRUE,
              calc_coherence = TRUE,
              calc_r2 = TRUE)
})
names(fit_test_k_models) <- paste0("k_", k_list)

# Extracting final log-likelihood value
log_likelihood_list <- sapply(fit_test_k_models, function(x) tail(x$log_likelihood, 1)[2])
log_likelihood <- as.numeric(log_likelihood_list)

# generating summary of model performance
model_summary <- data.frame(
  k = k_list,
  coherence = sapply(fit_test_k_models, function(x) mean(x$coherence)), # mean topic coherence
  r2 = sapply(fit_test_k_models, function(x) x$r2), # R-squared: how well the model reproduces the observed DTM
  log_likelihood = log_likelihood # log-likelihood of the observed data under the model
)

ggplot(model_summary, aes(x = k)) +
  geom_line(aes(y = coherence), color = "blue") +
  geom_line(aes(y = r2 * max(coherence)), color = "green", linetype = "dashed") +
  geom_line(aes(y = log_likelihood / max(abs(log_likelihood)) * max(coherence)),
            color = "red", linetype = "dotted") +
  ylab("Normalized Metric (for comparison)") +
  ggtitle("LDA Tuning: Coherence, R2, and Log-Likelihood")

# Fit an LDA model with k = 15 topics (seed fixed for reproducibility)
lda_model_k15 <- LDA(dtm_tokenized_terms, k = 15, control = list(seed = 1234))
topics_lda_model_k15 <- tidy(lda_model_k15, matrix = "beta")

# Top terms per topic
top_words_k15 <- topics_lda_model_k15 |>   
  group_by(topic) |> 
  top_n(10, beta) |> 
  ungroup()
top_terms_lda_k15 <- topics_lda_model_k15 |> 
  group_by(topic) |> 
  slice_max(beta, n = 3, with_ties = FALSE) |> 
  ungroup() |> 
  arrange(topic, -beta)

top_terms_lda_k15 |> 
  mutate(term = reorder_within(term, beta, topic)) |> 
  group_by(topic, term) |>     
  arrange(desc(beta)) |>   
  ungroup() |> 
  ggplot(aes(beta, term, fill = as.factor(topic))) +
  geom_col(show.legend = FALSE) +
  scale_y_reordered() +
  labs(title = "Top 3 terms in each LDA topic",
       subtitle = "LDA model where k=15",
       x = expression(beta), y = NULL) +
  facet_wrap(~ topic, ncol = 4, scales = "free")

# Fit an LDA model with k = 11 topics (seed fixed for reproducibility)
lda_model_k11 <- LDA(dtm_tokenized_terms, k = 11, control = list(seed = 1234))
topics_lda_model_k11 <- tidy(lda_model_k11, matrix = "beta")

# Top terms per topic
top_words_k11 <- topics_lda_model_k11 |>   
  group_by(topic) |> 
  top_n(10, beta) |> 
  ungroup()
top_terms_lda_k11 <- topics_lda_model_k11 |> 
  group_by(topic) |> 
  slice_max(beta, n = 3, with_ties = FALSE) |> 
  ungroup() |> 
  arrange(topic, -beta)

top_terms_lda_k11 |> 
  mutate(term = reorder_within(term, beta, topic)) |> 
  group_by(topic, term) |>     
  arrange(desc(beta)) |>   
  ungroup() |> 
  ggplot(aes(beta, term, fill = as.factor(topic))) +
  geom_col(show.legend = FALSE) +
  scale_y_reordered() +
  labs(title = "Top 3 terms in each LDA topic",
       subtitle = "LDA model where k=11",
       x = expression(beta), y = NULL) +
  facet_wrap(~ topic, ncol = 4, scales = "free")

String Similarity Analysis

We assess string similarity between the Google Search top rising terms and the GDELT event article names, where each article name was extracted from the source URL of the event mention with the highest confidence within the analysis time frame.

Remove Stop Words
Tokenize both the top rising terms and the event article names so that common stop words can be removed, then join the remaining tokens back together. Removing redundant or commonly shared words helps produce a more accurate similarity score.

token_str_gsearch = rising_term_df |>
  unnest_tokens(output = clean_term, input = term) |>
  anti_join(multi_lang_stopwords, by = c("clean_term" = "word")) |>
  group_by(search_id) |>
  summarise(clean_term = paste0(clean_term, collapse = ' ')) |>
  ungroup() |>
  select(search_id, clean_term) |>
  distinct() |>
  left_join(rising_term_df, by = "search_id") |>
  select(clean_term, term, refresh_date) |>
  distinct()

token_str_gdelt = mention_gdelt_df_cleaned |>
  drop_na(Article) |>
  unnest_tokens(output = clean_article, input = Article) |> 
  anti_join(multi_lang_stopwords, by = c("clean_article" = "word")) |> 
  group_by(GLOBALEVENTID) |>
  summarise(clean_article = paste0(clean_article, collapse = ' ')) |>
  ungroup() |>
  select(GLOBALEVENTID, clean_article) |>
  distinct() |>
  left_join(mention_gdelt_df_cleaned, by = "GLOBALEVENTID") |>
  select(clean_article, Article, MentionTimeDate) |>
  mutate(MentionTimeDate= as.Date(MentionTimeDate)) |>
  distinct()

String Similarity Metrics

  • Levenshtein Distance (edit-distance-based):
    A string metric for measuring the difference between two sequences. The Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other; the more edits required, the less similar the two strings. RecordLinkage’s levenshteinSim() rescales this distance into a similarity score between 0 and 1.

  • Jaro–Winkler Distance:
    A string metric based on the number of matching characters and transpositions between two sequences, with a higher weight assigned to characters that match at the beginning of the strings. A short demo of both metrics follows this list.
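
A minimal demonstration on one pair from our results (see the table further below), showing how the two metrics diverge when strings share a prefix but differ in length:

# The shared "pope leo" prefix earns a high Jaro-Winkler score, while the
# extra trailing words drag the Levenshtein similarity down
jarowinkler("pope leo", "pope leo xiv pick name", r = 0.4)  # ~0.873
levenshteinSim("pope leo", "pope leo xiv pick name")        # ~0.364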


Calculate similarity scores

sim_df = inner_join(rising_term_df,
            token_str_gdelt,
            by=c('refresh_date'='MentionTimeDate')) |>
  left_join(token_str_gsearch, by = c("term", "refresh_date")) |>
  select(term, clean_term, clean_article) |>
  distinct() |>
  mutate(jw_score = jarowinkler(term, clean_article, r=0.4)) |>
  mutate(lev_score = levenshteinSim(term, clean_article)) |>
  arrange(desc(jw_score), desc(lev_score))
knit_table(head(sim_df))
term clean_term clean_article jw_score lev_score
pope leo xiv pope leo xiv pope leo xiv pick name 0.9090909 0.5454545
pope leo xiii pope leo xiii pope leo xiv pick name 0.8937063 0.5454545
india pakistan india pakistan india pakistan brink wider conflict 0.8800000 0.4000000
pope leo pope leo pope leo predecessors 0.8761905 0.3809524
pope leo xiv pope leo xiv pope leo xiv chicago augustinian 0.8750000 0.3750000
pope leo pope leo pope leo xiv pick name 0.8727273 0.3636364

Display Levenshtein Distance Metric Similarity Results

  • View counts of rising terms with a Levenshtein similarity score greater than 0.4
  • View the heatmap matrix comparing Levenshtein similarity scores between rising terms and GDELT article mention names, for pairs with a score greater than 0.4
sim_df |>
  filter(lev_score > 0.4) |>
  group_by(clean_term) |>
  tally() |>
  arrange(desc(n)) |>
  mutate(color_group = factor(row_number() %% 3)) |>
  ggplot(aes(x=reorder(clean_term, n), y=n, fill = color_group)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  theme_classic() +
  scale_fill_manual(values = colors) +
  theme(
    plot.title = element_text(face = "bold"),
    text = element_text(color = titles)
  ) +
  labs(
    title = 'Levenshtein Distance: Rising Terms & Event Article',
    subtitle = 'Similarity Score > 0.4',
    y = 'Term Count',
    x = 'Rising Term'
  )

filt_sim_df = sim_df |>
  filter(lev_score > 0.4) 

sim_df |>
  filter(term %in% filt_sim_df$term & clean_article %in% filt_sim_df$clean_article) |>
  ggplot(aes(x=term, y=reorder(clean_article, lev_score), fill=lev_score)) + 
  scale_y_discrete(labels = function(x) str_wrap(x, width = 45)) +
  geom_tile() +
  scale_fill_gradient(low = "#aac4e9", high = "#1f2c8f")+
  theme_classic() +
  theme(axis.text.x=element_text(angle=50, hjust=1)) +
  theme(
    plot.title = element_text(face = "bold"),
    text = element_text(color = titles)
  ) +
  labs(
    title = 'Levenshtein Distance: Rising Terms & Event Article',
    subtitle = 'Similarity Score > 0.4',
    y = 'GDELT Event Source Article',
    x = 'Rising Term',
    fill='Levenshtein Score'
  )

Display Jaro–Winkler Distance Metric Similarity Results

  • View counts of rising terms with a Jaro–Winkler similarity score greater than 0.85
  • View the heatmap matrix comparing Jaro–Winkler similarity scores between rising terms and GDELT article mention names, for pairs with a score greater than 0.85
sim_df |>
  filter(jw_score > 0.85) |>
  group_by(clean_term) |>
  tally() |>
  arrange(desc(n)) |>
  mutate(color_group = factor(row_number() %% 3)) |>
  ggplot(aes(x=reorder(clean_term, n), y=n, fill = color_group)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  theme_classic() + 
  scale_fill_manual(values = colors) +
  theme(
    plot.title = element_text(face = "bold"),
    text = element_text(color = titles)
  ) +
  labs(
    title = 'Jaro-Winkler Distance: Rising Terms & Event Article',
    subtitle = 'Similarity Score > 0.85',
    y = 'Term Count',
    x = 'Rising Term'
  )

filt_sim_df = sim_df |>
  filter(jw_score > 0.85) 

sim_df |>
  filter(term %in% filt_sim_df$term & clean_article %in% filt_sim_df$clean_article) |>
  ggplot(aes(x=term, y=reorder(clean_article, jw_score), fill=jw_score)) + 
  scale_y_discrete(labels = function(x) str_wrap(x, width = 45)) +
  geom_tile() +
  scale_fill_gradient(low = "#aac4e9", high = "#1f2c8f")+
  theme_classic() +
  theme(
    plot.title = element_text(face = "bold", hjust=0.5),
    plot.subtitle = element_text(hjust=0.5),
    text = element_text(color = titles)) +
  theme(axis.text.x=element_text(angle=50, hjust=1)) +
  labs(
    title = 'Jaro-Winkler Distance: Rising Terms & Event Article',
    subtitle = 'Similarity Score > 0.85',
    y = 'GDELT Event Source Article',
    x = 'Rising Term',
    fill='Jaro-Winkler Score'
  ) 

Conclusion

Data Join:
Due to the difference in size and complexity of the Google search and GDELT data, joining the dataframes for analysis proved difficult. For example, the Google search dataset provides a daily refresh of the top 25 searched terms from the previous day, whereas GDELT publishes a CSV file every 15 minutes containing hundreds of global events. Additionally, GDELT’s location data did not match the Google search data exactly: observations could be at the country, state, or city level, or lack a geographic area altogether, which made it difficult to compare the two datasets by geography.

The string similarity analysis was our solution for comparing reported events and top search terms without a join. Given more time, we would build out logic to extract the GDELT data more precisely for the same time frame as the Google search data. This would involve querying the GDELT database directly rather than downloading CSV files, and analyzing the difference between joining on when an event occurred versus when it was mentioned.
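
As a sketch of that direction: GDELT 2.0 is also published as a public BigQuery dataset, so the same bigrquery workflow used for the search terms could pull mentions for the exact analysis window. The table and field names below follow the public gdelt-bq.gdeltv2 dataset, but should be verified before use:

# Hypothetical: query GDELT mentions directly for the analysis window
gdelt_query <- "
SELECT GLOBALEVENTID, MentionTimeDate, MentionSourceName, MentionIdentifier, Confidence
FROM `gdelt-bq.gdeltv2.eventmentions`
WHERE MentionTimeDate BETWEEN 20250506000000 AND 20250512000000
  AND MentionType = 1
  AND Confidence >= 60
"
gdelt_mentions <- bq_table_download(bq_project_query(project_id, gdelt_query))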


Latent Dirichlet Allocation (LDA) Modeling:
The LDA results show that topic modeling of Google searches is possible but would require additional tidying. For example, certain terms appeared in multiple topic categories, indicating that duplicate terms may have been fed into the model. LDA can be effective when the corpus is large enough, but it may not be the best option for short texts. To improve this analysis, another method such as single-topic LDA with clustering may be preferable. For more information about this modeling approach, please see the stLDA-C GitHub page.
String Similarity:
The results show the Jaro–Winkler similarity scores to be much higher than the Levenshtein scores. This is because the Jaro–Winkler method considers only matching characters and any required transpositions, while heavily weighting a shared prefix. Because the GDELT article names are significantly longer than the rising term strings, the Jaro–Winkler similarity scores can be misleading.
Next Steps:

  • Reducing the string length of the GDELT article names would make the similarity scores more meaningful and interpretable. This could be done by extracting keywords from the article name.
  • Exploring string similarity measures better suited to this use case could also address current limitations. For example, the Jaccard measure compares two strings by the number of tokens they share. This may be more apt for this analysis, since we are comparing phrases rather than single-word strings; a minimal sketch follows this list.
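
A minimal sketch of such a token-level Jaccard similarity (illustrative only, not part of the analysis above):

# Token-level Jaccard similarity: shared tokens / total distinct tokens
jaccard_sim <- function(a, b) {
  ta <- unique(strsplit(tolower(a), "\\s+")[[1]])
  tb <- unique(strsplit(tolower(b), "\\s+")[[1]])
  length(intersect(ta, tb)) / length(union(ta, tb))
}

jaccard_sim("pope leo", "pope leo xiv pick name")  # 2 shared / 5 distinct = 0.4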