library(bigrquery)
library(DBI)
library(tidyverse)
library(tidytext)
library(lubridate)
library(topicmodels)
library(countrycode)
library(SnowballC)
library(Matrix)
library(RecordLinkage)
library(textmineR)
library(kableExtra)
This project uses several R packages to support data acquisition,
cleaning, analysis, and visualization. For example,
bigrquery and DBI are used to query data from
Google BigQuery, tidytext enables text mining and topic
modeling, and ggplot2 is used for data visualization.
In this project, we explore how search behavior in the United States reflects interest in major pop culture events. Using unsupervised topic modeling, we analyze Google’s top search terms to identify recurring themes and trends. By linking search activity with globally documented events, we aim to uncover how societal attention manifests in digital behavior.
We use the bigrquery package to access Google’s public
dataset of the most popular daily search terms. Make sure you have a
Google Cloud project set up with billing enabled, along with credentials
for the BigQuery API (here, a service account JSON key).
# Authenticate with Google Cloud (interactive)
bq_auth(path = "data607finalproject-6c666e5b6214.json")
project_id <- "data607finalproject"
query <- as.character(
"SELECT
'top_terms' as term_type,
term,
score,
refresh_date,
dma_id,
dma_name,
ARRAY_AGG(STRUCT(rank,week) ORDER BY week DESC LIMIT 1) x
FROM
`bigquery-public-data.google_trends.top_terms`
WHERE refresh_date >= '2025-05-06' and refresh_date <= '2025-05-11'
AND regexp_contains(dma_name, 'New York|Los Angeles')
GROUP BY ALL
union all
SELECT
'rising_terms' as term_type,
term as term_rising,
score as score_rising,
refresh_date,
dma_id,
dma_name,
ARRAY_AGG(STRUCT(rank,week) ORDER BY week DESC LIMIT 1) x
FROM
`bigquery-public-data.google_trends.top_rising_terms`
WHERE refresh_date >= '2025-05-06' and refresh_date <= '2025-05-11'
AND regexp_contains(dma_name, 'New York|Los Angeles')
GROUP BY ALL
ORDER BY
(select week from unnest(x)) desc,
(SELECT rank FROM UNNEST(x)),
refresh_date desc
"
)
# Run dry run query to estimate billing
job <- bq_perform_query_dry_run(
billing = project_id,
query = query,
use_legacy_sql = FALSE
)
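Before running the query for real, it can help to check the dry-run estimate, since BigQuery bills by bytes scanned. A minimal sketch, assuming the dry run returns the estimated bytes to be processed as a bq_bytes value:
# Inspect the dry-run estimate (assumed to be the bytes the query would scan)
estimated_gb <- as.numeric(job) / 1e9
cat("Estimated data processed:", round(estimated_gb, 3), "GB\n")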
google_search_data <- bq_project_query(project_id, query)
google_search_df <- bq_table_download(google_search_data)
knit_table(head(google_search_df))
| term_type | term | score | refresh_date | dma_id | dma_name | x |
|---|---|---|---|---|---|---|
| top_terms | sabu | 100 | 2025-05-11 | 501 | New York NY | 1, 20219 |
| top_terms | sabu | 100 | 2025-05-11 | 803 | Los Angeles CA | 1, 20219 |
| rising_terms | habeas corpus | 100 | 2025-05-11 | 501 | New York NY | 1, 20219 |
| rising_terms | habeas corpus | 100 | 2025-05-11 | 803 | Los Angeles CA | 1, 20219 |
| top_terms | barcelona vs real madrid | 100 | 2025-05-11 | 501 | New York NY | 2, 20219 |
| top_terms | barcelona vs real madrid | 100 | 2025-05-11 | 803 | Los Angeles CA | 2, 20219 |
Getting the latest file name:
# Get the name of the most recent GDELT CSV file
latest_file <- read_lines("http://data.gdeltproject.org/gdeltv2/lastupdate.txt")
latest_file
## [1] "58335 ceb398cb084a6c1665071e85971a2279 http://data.gdeltproject.org/gdeltv2/20250517023000.export.CSV.zip"
## [2] "71232 f41e2d9e275493a78fe4c0185a41a1b4 http://data.gdeltproject.org/gdeltv2/20250517023000.mentions.CSV.zip"
## [3] "3307005 f76e6bc23ff338aef37ecc69d1672c1e http://data.gdeltproject.org/gdeltv2/20250517023000.gkg.csv.zip"
Getting event mentions within the same timeframe as the Google search terms:
all_files = tibble(
time = (
seq(ymd_hms('20250506000000'),ymd_hms('20250512000000'), by = '1 day'))) |>
mutate(
time = format(time, format = '%Y%m%d%H%M%S'),
file_url = paste0('http://data.gdeltproject.org/gdeltv2/', time, '.mentions.CSV.zip'),
destfile = paste0("Data/GDELT/gdelt_", time, ".zip")
)
knit_table(all_files) |> scroll_box(width = "100%", box_css = "border: 1px solid #FFFFFF;")
| time | file_url | destfile |
|---|---|---|
| 20250506000000 | http://data.gdeltproject.org/gdeltv2/20250506000000.mentions.CSV.zip | Data/GDELT/gdelt_20250506000000.zip |
| 20250507000000 | http://data.gdeltproject.org/gdeltv2/20250507000000.mentions.CSV.zip | Data/GDELT/gdelt_20250507000000.zip |
| 20250508000000 | http://data.gdeltproject.org/gdeltv2/20250508000000.mentions.CSV.zip | Data/GDELT/gdelt_20250508000000.zip |
| 20250509000000 | http://data.gdeltproject.org/gdeltv2/20250509000000.mentions.CSV.zip | Data/GDELT/gdelt_20250509000000.zip |
| 20250510000000 | http://data.gdeltproject.org/gdeltv2/20250510000000.mentions.CSV.zip | Data/GDELT/gdelt_20250510000000.zip |
| 20250511000000 | http://data.gdeltproject.org/gdeltv2/20250511000000.mentions.CSV.zip | Data/GDELT/gdelt_20250511000000.zip |
| 20250512000000 | http://data.gdeltproject.org/gdeltv2/20250512000000.mentions.CSV.zip | Data/GDELT/gdelt_20250512000000.zip |
Download and Unzip:
# Constructing the full URL to download the ZIP file
export_url_line <- latest_file[1]
# Extract the URL from the line using strsplit
export_url <- strsplit(export_url_line, " ")[[1]][3]
# Download the export ZIP file to your working directory
download.file(export_url, destfile = "Data/GDELT/gdelt_latest.zip", mode = "wb")
# Unzip the file
unzip("Data/GDELT/gdelt_latest.zip", exdir='Data/GDELT')
Read CSV into R:
# Extract filename only from URL
csv_file <- paste0('Data/GDELT/', gsub(".zip", "", basename(export_url)))
# Read the CSV file with correct case-sensitive name
gdelt_df <- read_csv(csv_file, col_names = FALSE)
GDELT 2.0 – Structured Load
GDELT 2.0 is a massive global event database that captures media
coverage of events in real time from across the world. The export files
are updated every 15 minutes and are tab-delimited without column
headers, so we must assign names manually using the GDELT schema.
In this chunk, we load the most recent export file and apply column names for easier exploration:
gdelt_cols <- c(
"GLOBALEVENTID", "Day", "MonthYear", "Year", "FractionDate",
"Actor1Code", "Actor1Name", "Actor1CountryCode", "Actor1KnownGroupCode",
"Actor1EthnicCode", "Actor1Religion1Code", "Actor1Religion2Code",
"Actor1Type1Code", "Actor1Type2Code", "Actor1Type3Code",
"Actor2Code", "Actor2Name", "Actor2CountryCode", "Actor2KnownGroupCode",
"Actor2EthnicCode", "Actor2Religion1Code", "Actor2Religion2Code",
"Actor2Type1Code", "Actor2Type2Code", "Actor2Type3Code",
"IsRootEvent", "EventCode", "EventBaseCode", "EventRootCode",
"QuadClass", "GoldsteinScale", "NumMentions", "NumSources",
"NumArticles", "AvgTone", "Actor1Geo_Type", "Actor1Geo_FullName",
"Actor1Geo_CountryCode", "Actor1Geo_ADM1Code", "Actor1Geo_ADM2Code",
"Actor1Geo_Lat", "Actor1Geo_Long", "Actor1Geo_FeatureID", "Actor2Geo_Type",
"Actor2Geo_FullName", "Actor2Geo_CountryCode", "Actor2Geo_ADM1Code",
"Actor2Geo_ADM2Code", "Actor2Geo_Lat", "Actor2Geo_Long", "Actor2Geo_FeatureID",
"ActionGeo_Type", "ActionGeo_FullName", "ActionGeo_CountryCode",
"ActionGeo_ADM1Code", "Action2Geo_ADM2Code", "ActionGeo_Lat", "ActionGeo_Long",
"ActionGeo_FeatureID", "DATEADDED", "SOURCEURL"
)
# Read the file with tab delimiter and custom column names
gdelt_df <- read_delim(csv_file, delim = "\t", col_names = gdelt_cols, show_col_types = FALSE)
# Preview first rows
knit_table(head(gdelt_df)) |> scroll_box(width = "100%", box_css = "border: 1px solid #FFFFFF;")
| GLOBALEVENTID | Day | MonthYear | Year | FractionDate | Actor1Code | Actor1Name | Actor1CountryCode | Actor1KnownGroupCode | Actor1EthnicCode | Actor1Religion1Code | Actor1Religion2Code | Actor1Type1Code | Actor1Type2Code | Actor1Type3Code | Actor2Code | Actor2Name | Actor2CountryCode | Actor2KnownGroupCode | Actor2EthnicCode | Actor2Religion1Code | Actor2Religion2Code | Actor2Type1Code | Actor2Type2Code | Actor2Type3Code | IsRootEvent | EventCode | EventBaseCode | EventRootCode | QuadClass | GoldsteinScale | NumMentions | NumSources | NumArticles | AvgTone | Actor1Geo_Type | Actor1Geo_FullName | Actor1Geo_CountryCode | Actor1Geo_ADM1Code | Actor1Geo_ADM2Code | Actor1Geo_Lat | Actor1Geo_Long | Actor1Geo_FeatureID | Actor2Geo_Type | Actor2Geo_FullName | Actor2Geo_CountryCode | Actor2Geo_ADM1Code | Actor2Geo_ADM2Code | Actor2Geo_Lat | Actor2Geo_Long | Actor2Geo_FeatureID | ActionGeo_Type | ActionGeo_FullName | ActionGeo_CountryCode | ActionGeo_ADM1Code | Action2Geo_ADM2Code | ActionGeo_Lat | ActionGeo_Long | ActionGeo_FeatureID | DATEADDED | SOURCEURL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1244378511 | 20240517 | 202405 | 2024 | 2024.375 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | AFGINSTAL | TALIBAN | AFG | TAL | NA | NA | NA | INS | NA | NA | 0 | 042 | 042 | 04 | 1 | 1.9 | 3 | 1 | 3 | -0.831202 | 0 | NA | NA | NA | NA | NA | NA | NA | 4 | Kabul, Kabol, Afghanistan | AF | AF13 | 3580 | 34.5167 | 69.1833 | -3378435 | 4 | Kabul, Kabol, Afghanistan | AF | AF13 | 3580 | 34.5167 | 69.1833 | -3378435 | 2.025052e+13 | https://wset.com/news/local/trump-admins-suspension-of-federal-refugee-program-upends-afghan-community-in-virginia-may-2025 |
| 1244378512 | 20240517 | 202405 | 2024 | 2024.375 | AFG | AFGHAN | AFG | NA | NA | NA | NA | NA | NA | NA | AFGINSTAL | TALIBAN | AFG | TAL | NA | NA | NA | INS | NA | NA | 0 | 042 | 042 | 04 | 1 | 1.9 | 3 | 1 | 3 | -0.831202 | 4 | Kabul, Kabol, Afghanistan | AF | AF13 | 3580 | 34.5167 | 69.1833 | -3378435 | 4 | Kabul, Kabol, Afghanistan | AF | AF13 | 3580 | 34.5167 | 69.1833 | -3378435 | 4 | Kabul, Kabol, Afghanistan | AF | AF13 | 3580 | 34.5167 | 69.1833 | -3378435 | 2.025052e+13 | https://wset.com/news/local/trump-admins-suspension-of-federal-refugee-program-upends-afghan-community-in-virginia-may-2025 |
| 1244378513 | 20240517 | 202405 | 2024 | 2024.375 | AFG | AFGHANISTAN | AFG | NA | NA | NA | NA | NA | NA | NA | AFGINSTAL | TALIBAN | AFG | TAL | NA | NA | NA | INS | NA | NA | 0 | 042 | 042 | 04 | 1 | 1.9 | 2 | 1 | 2 | -0.831202 | 4 | Kabul, Kabol, Afghanistan | AF | AF13 | 3580 | 34.5167 | 69.1833 | -3378435 | 3 | Fredericksburg, Virginia, United States | US | USVA | VA163 | 37.8929 | -79.4378 | 1494946 | 4 | Kabul, Kabol, Afghanistan | AF | AF13 | 3580 | 34.5167 | 69.1833 | -3378435 | 2.025052e+13 | https://wset.com/news/local/trump-admins-suspension-of-federal-refugee-program-upends-afghan-community-in-virginia-may-2025 |
| 1244378514 | 20240517 | 202405 | 2024 | 2024.375 | AFG | AFGHANISTAN | AFG | NA | NA | NA | NA | NA | NA | NA | AFGINSTAL | TALIBAN | AFG | TAL | NA | NA | NA | INS | NA | NA | 0 | 042 | 042 | 04 | 1 | 1.9 | 2 | 1 | 2 | -0.831202 | 4 | Saigon, H? Chíinh, Vietnam, Republic Of | VM | VM20 | 74101 | 10.7500 | 106.6670 | -3730078 | 4 | Saigon, H? Chíinh, Vietnam, Republic Of | VM | VM20 | 74101 | 10.7500 | 106.6670 | -3730078 | 4 | Saigon, H? Chíinh, Vietnam, Republic Of | VM | VM20 | 74101 | 10.7500 | 106.6670 | -3730078 | 2.025052e+13 | https://wset.com/news/local/trump-admins-suspension-of-federal-refugee-program-upends-afghan-community-in-virginia-may-2025 |
| 1244378515 | 20240517 | 202405 | 2024 | 2024.375 | AFGINSTAL | TALIBAN | AFG | TAL | NA | NA | NA | INS | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0 | 043 | 043 | 04 | 1 | 2.8 | 3 | 1 | 3 | -0.831202 | 4 | Kabul, Kabol, Afghanistan | AF | AF13 | 3580 | 34.5167 | 69.1833 | -3378435 | 0 | NA | NA | NA | NA | NA | NA | NA | 4 | Kabul, Kabol, Afghanistan | AF | AF13 | 3580 | 34.5167 | 69.1833 | -3378435 | 2.025052e+13 | https://wset.com/news/local/trump-admins-suspension-of-federal-refugee-program-upends-afghan-community-in-virginia-may-2025 |
| 1244378516 | 20240517 | 202405 | 2024 | 2024.375 | AFGINSTAL | TALIBAN | AFG | TAL | NA | NA | NA | INS | NA | NA | AFG | AFGHAN | AFG | NA | NA | NA | NA | NA | NA | NA | 0 | 043 | 043 | 04 | 1 | 2.8 | 3 | 1 | 3 | -0.831202 | 4 | Kabul, Kabol, Afghanistan | AF | AF13 | 3580 | 34.5167 | 69.1833 | -3378435 | 4 | Kabul, Kabol, Afghanistan | AF | AF13 | 3580 | 34.5167 | 69.1833 | -3378435 | 4 | Kabul, Kabol, Afghanistan | AF | AF13 | 3580 | 34.5167 | 69.1833 | -3378435 | 2.025052e+13 | https://wset.com/news/local/trump-admins-suspension-of-federal-refugee-program-upends-afghan-community-in-virginia-may-2025 |
In the following code chunk, we repeat the same process as above to load the event mention export file and apply column names.
The Mentions table records each individual mention of an event in an article. It includes a Confidence field that records GDELT’s confidence in its extraction of that event from that particular article.
mention_gdelt_cols <- c(
"GLOBALEVENTID", "EventTimeDate", "MentionTimeDate", "MentionType",
"MentionSourceName", "MentionIdentifier",
"SentenceID", "Actor1CharOffset", "Actor2CharOffset", "ActionCharOffset",
"InRawText", "Confidence", "MentionDocLen",
"MentionDocTone", "MentionDocTranslationInfo", "Extras"
)
for (i in 1:nrow(all_files)){
download.file(
all_files$file_url[i],
destfile = all_files$destfile[i],
mode = "wb", quiet = TRUE)
# Unzip the file
unzip(all_files$destfile[i], exdir='Data/GDELT')
csv_file <- paste0('Data/GDELT/', gsub(".zip", "", basename(all_files$file_url[i])))
# Read the CSV file with correct case-sensitive name
if (i == 1) {
mention_gdelt_df <- read_delim(
csv_file, delim = "\t", col_names = mention_gdelt_cols,
col_types = list('c', 'c', 'c', 'd', 'c', 'c', 'd', 'd', 'd', 'd', 'd', 'd', 'd', 'd', '?', '?'))
}
else{
temp_df <- read_delim(
csv_file, delim = "\t", col_names = mention_gdelt_cols,
col_types = list('c', 'c', 'c', 'd', 'c', 'c', 'd', 'd', 'd', 'd', 'd', 'd', 'd', 'd', '?', '?'))
mention_gdelt_df = rbind(mention_gdelt_df, temp_df)
}
}
knit_table(head(mention_gdelt_df)) |> scroll_box(width = "100%", box_css = "border: 1px solid #FFFFFF;")
| GLOBALEVENTID | EventTimeDate | MentionTimeDate | MentionType | MentionSourceName | MentionIdentifier | SentenceID | Actor1CharOffset | Actor2CharOffset | ActionCharOffset | InRawText | Confidence | MentionDocLen | MentionDocTone | MentionDocTranslationInfo | Extras |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1173957950 | 20240506004500 | 20250506000000 | 1 | wwno.org | https://www.wwno.org/local-regional-news/2025-05-05/new-orleans-got-388-million-in-federal-pandemic-aid-was-that-money-well-spent | 9 | -1 | 2207 | 2153 | 1 | 60 | 14002 | 0.4860804 | NA | NA |
| 1174053899 | 20240506131500 | 20250506000000 | 1 | nydailynews.com | https://www.nydailynews.com/2025/05/05/met-gala-2025-adrienne-adams-brad-lander-nyc-politicians/ | 4 | -1 | 611 | 629 | 0 | 20 | 2332 | 3.7234043 | NA | NA |
| 1241978379 | 20250506000000 | 20250506000000 | 1 | bignewsnetwork.com | http://www.bignewsnetwork.com/news/278204974/canada-alberta-premier-says-separation-referendum-possible-in-2026 | 1 | 64 | 209 | 127 | 0 | 20 | 1114 | -2.6881720 | NA | NA |
| 1173957950 | 20240506004500 | 20250506000000 | 1 | wwno.org | https://www.wwno.org/local-regional-news/2025-05-05/new-orleans-got-388-million-in-federal-pandemic-aid-was-that-money-well-spent | 9 | 2207 | -1 | 2153 | 1 | 60 | 14002 | 0.4860804 | NA | NA |
| 1174054437 | 20240506131500 | 20250506000000 | 1 | marketscreener.com | https://www.marketscreener.com/news/latest/Diana-Ross-Simone-Biles-shine-at-Met-Ball-celebration-of-Black-menswear-49831578/ | 8 | 1767 | -1 | 1796 | 1 | 60 | 2152 | 3.5714286 | NA | NA |
| 1174033236 | 20240506111500 | 20250506000000 | 1 | wwno.org | https://www.wwno.org/local-regional-news/2025-05-05/new-orleans-got-388-million-in-federal-pandemic-aid-was-that-money-well-spent | 9 | 2205 | 2132 | 2151 | 0 | 40 | 14002 | 0.4860804 | NA | NA |
Load GDELT Reference Table Data
Define reference table data file sources
cameo_country_code_txt <- "https://www.gdeltproject.org/data/lookups/CAMEO.country.txt"
fips_country_code_txt <- 'https://www.gdeltproject.org/data/lookups/FIPS.country.txt'
cameo_code_type_txt <- 'https://www.gdeltproject.org/data/lookups/CAMEO.type.txt'
cameo_known_code_groups_txt <- 'https://www.gdeltproject.org/data/lookups/CAMEO.knowngroup.txt'
cameo_ethnic_code_txt <- 'https://www.gdeltproject.org/data/lookups/CAMEO.ethnic.txt'
cameo_religion_code_txt <- 'https://www.gdeltproject.org/data/lookups/CAMEO.religion.txt'
cameo_event_code_txt <- 'https://www.gdeltproject.org/data/lookups/CAMEO.eventcodes.txt'
cameo_gold_stein_scale_txt <- 'https://www.gdeltproject.org/data/lookups/CAMEO.goldsteinscale.txt'
adm1_codes_txt <- 'http://efele.net/maps/fips-10/data/fips-414.txt'
adm2_codes_txt <- 'https://download.geonames.org/export/dump/admin2Codes.txt'
Actor Data Reference Tables
# Actor Country Code column reference values
cameo_country_ref <- read.csv(
cameo_country_code_txt, sep="\t", header=T,
col.names = c('Country', 'Country_Desc'))
# Actor Known Code Group column reference values
cameo_known_code_groups_ref <- read.csv(
cameo_known_code_groups_txt, sep="\t", header=T,
col.names = c('KnownGroup', 'KnownGroup_Desc'))
# Actor Ethnic Code column reference values
cameo_ethnic_ref <- read.csv(cameo_ethnic_code_txt, sep="\t", header=T,
col.names = c('Ethnic', 'Ethnic_Desc'))
# Actor Religion Code (1 and 2) column reference values
cameo_religion_ref <- read.csv(cameo_religion_code_txt, sep="\t", header=T,
col.names = c('Religion', 'Religion_Desc'))
# Actor Type Code (1, 2, and 3) column reference values
cameo_type_ref <- read.csv(cameo_code_type_txt, sep="\t", header=T,
col.names = c('TypeCode', 'Type'))
Event Data Reference Tables
# Event Code (including base and root event codes) column reference values
cameo_event_ref <- read.csv(cameo_event_code_txt, sep="\t", header=T ,
colClasses = "character",
col.names = c('EventCode', 'Event'))
# Event Goldstein Scale column reference values
cameo_gold_stein_ref <- read.csv(cameo_gold_stein_scale_txt, sep="\t", header=T,
col.names = c('EventCode', 'GoldSteinScale'))
# Event quad class (primary classification) column reference values
quadclass_ref <- tribble(
~"QuadClass", ~"Quad",
1, 'Verbal Cooperation',
2, 'Material Cooperation',
3, 'Verbal Conflict',
4, 'Material Conflict'
)
Event Geography Data Reference Tables
# Event geographic resolution column reference values
geo_type_ref <- tribble(
~"Geo_Type", ~"Geo_Type_Desc",
1, 'COUNTRY',
2, 'USSTATE',
3, 'USCITY',
4, 'WORLDCITY',
5, 'WORLDSTATE'
)
# FIPS country column reference values
fips_country_ref <- read.csv(fips_country_code_txt, sep="\t", header=F,
col.names = c('GEO_Country', 'GEO_Country_Desc'))
# GEO ADM1 code column reference values
# create U.S. adm1 codes using 'US' followed by the state abbreviation
us_fips_adm1_ref <- tibble(
'Country' = 'US',
'division' = 'state',
fips_adm1 = state.name,
state_abb = state.abb
) |>
mutate(fips_adm1_code = paste0(Country, state_abb)) |>
select(fips_adm1_code, division, fips_adm1)
# read in global adm1 codes and join to custom U.S. codes
fips_adm1_ref <- read.csv(
adm1_codes_txt, sep="_", header=F, na.strings = "",
col.names = c('fips_adm1_code',
'first fips version',
'last fips version',
'division',
'designation_2',
'designation_3',
'designation4',
'fips_adm1',
'conventional name',
'former name')) |>
select('fips_adm1_code', 'division', 'fips_adm1') |>
rbind(us_fips_adm1_ref)
Unnest the nested rank and week values, replace missing score values with zero, and add a search_id column from the associated row number:
long_google_search_df <- google_search_df |>
unnest(x) |>
mutate(
score = replace_na(score, 0),
search_id = row_number())
knit_table(head(long_google_search_df))
| term_type | term | score | refresh_date | dma_id | dma_name | rank | week | search_id |
|---|---|---|---|---|---|---|---|---|
| top_terms | sabu | 100 | 2025-05-11 | 501 | New York NY | 1 | 2025-05-11 | 1 |
| top_terms | sabu | 100 | 2025-05-11 | 803 | Los Angeles CA | 1 | 2025-05-11 | 2 |
| rising_terms | habeas corpus | 100 | 2025-05-11 | 501 | New York NY | 1 | 2025-05-11 | 3 |
| rising_terms | habeas corpus | 100 | 2025-05-11 | 803 | Los Angeles CA | 1 | 2025-05-11 | 4 |
| top_terms | barcelona vs real madrid | 100 | 2025-05-11 | 501 | New York NY | 2 | 2025-05-11 | 5 |
| top_terms | barcelona vs real madrid | 100 | 2025-05-11 | 803 | Los Angeles CA | 2 | 2025-05-11 | 6 |
Subset refresh date into its own table for reference.
refresh_date_ref <- long_google_search_df |>
select(refresh_date) |>
distinct()
knit_table(head(refresh_date_ref))
| refresh_date |
|---|
| 2025-05-11 |
| 2025-05-10 |
| 2025-05-09 |
| 2025-05-08 |
| 2025-05-07 |
| 2025-05-06 |
Create Designated Market Area (DMA) reference table.
dma_ref_df <- long_google_search_df |>
select(dma_id, dma_name) |>
distinct()
knit_table(head(dma_ref_df))
| dma_id | dma_name |
|---|---|
| 501 | New York NY |
| 803 | Los Angeles CA |
Create final Google top term dataframe
top_term_df <- long_google_search_df |>
filter(term_type == "top_terms")
knit_table(head(top_term_df))
| term_type | term | score | refresh_date | dma_id | dma_name | rank | week | search_id |
|---|---|---|---|---|---|---|---|---|
| top_terms | sabu | 100 | 2025-05-11 | 501 | New York NY | 1 | 2025-05-11 | 1 |
| top_terms | sabu | 100 | 2025-05-11 | 803 | Los Angeles CA | 1 | 2025-05-11 | 2 |
| top_terms | barcelona vs real madrid | 100 | 2025-05-11 | 501 | New York NY | 2 | 2025-05-11 | 5 |
| top_terms | barcelona vs real madrid | 100 | 2025-05-11 | 803 | Los Angeles CA | 2 | 2025-05-11 | 6 |
| top_terms | barcelona - real madrid | 100 | 2025-05-11 | 501 | New York NY | 3 | 2025-05-11 | 9 |
| top_terms | barcelona - real madrid | 100 | 2025-05-11 | 803 | Los Angeles CA | 3 | 2025-05-11 | 10 |
Create final Google rising term dataframe
rising_term_df <- long_google_search_df |>
filter(term_type == "rising_terms")
knit_table(head(rising_term_df))
| term_type | term | score | refresh_date | dma_id | dma_name | rank | week | search_id |
|---|---|---|---|---|---|---|---|---|
| rising_terms | habeas corpus | 100 | 2025-05-11 | 501 | New York NY | 1 | 2025-05-11 | 3 |
| rising_terms | habeas corpus | 100 | 2025-05-11 | 803 | Los Angeles CA | 1 | 2025-05-11 | 4 |
| rising_terms | derek carr | 98 | 2025-05-11 | 501 | New York NY | 2 | 2025-05-11 | 7 |
| rising_terms | derek carr | 72 | 2025-05-11 | 803 | Los Angeles CA | 2 | 2025-05-11 | 8 |
| rising_terms | tripe | 12 | 2025-05-11 | 501 | New York NY | 3 | 2025-05-11 | 11 |
| rising_terms | tripe | 21 | 2025-05-11 | 803 | Los Angeles CA | 3 | 2025-05-11 | 12 |
Note: The GDELT data maintains a tidy format in its raw state, so no manual tidying tasks were needed.
GDELT Event Data Cleaning
In this chunk, we convert Day into a date data type, MonthYear into a year-month string, and GLOBALEVENTID into a string, then keep only events from the past two weeks:
gdelt_df_cleaned <- gdelt_df |>
mutate_at(vars(contains('Code')), ~as.factor(.)) |>
mutate_at(vars(contains('Geo_Type')), ~as.factor(.)) |>
mutate(
GLOBALEVENTID = as.character(GLOBALEVENTID),
Day = as.Date(as.character(Day), '%Y%m%d'),
MonthYear = format(Day, '%Y-%m'),
QuadClass = as.factor(QuadClass),
) |>
filter(Day >= today() - days(14) & Day <= today())
knit_table(head(gdelt_df_cleaned)) |> scroll_box(
width = "100%", box_css = "border: 1px solid #FFFFFF;")
| GLOBALEVENTID | Day | MonthYear | Year | FractionDate | Actor1Code | Actor1Name | Actor1CountryCode | Actor1KnownGroupCode | Actor1EthnicCode | Actor1Religion1Code | Actor1Religion2Code | Actor1Type1Code | Actor1Type2Code | Actor1Type3Code | Actor2Code | Actor2Name | Actor2CountryCode | Actor2KnownGroupCode | Actor2EthnicCode | Actor2Religion1Code | Actor2Religion2Code | Actor2Type1Code | Actor2Type2Code | Actor2Type3Code | IsRootEvent | EventCode | EventBaseCode | EventRootCode | QuadClass | GoldsteinScale | NumMentions | NumSources | NumArticles | AvgTone | Actor1Geo_Type | Actor1Geo_FullName | Actor1Geo_CountryCode | Actor1Geo_ADM1Code | Actor1Geo_ADM2Code | Actor1Geo_Lat | Actor1Geo_Long | Actor1Geo_FeatureID | Actor2Geo_Type | Actor2Geo_FullName | Actor2Geo_CountryCode | Actor2Geo_ADM1Code | Actor2Geo_ADM2Code | Actor2Geo_Lat | Actor2Geo_Long | Actor2Geo_FeatureID | ActionGeo_Type | ActionGeo_FullName | ActionGeo_CountryCode | ActionGeo_ADM1Code | Action2Geo_ADM2Code | ActionGeo_Lat | ActionGeo_Long | ActionGeo_FeatureID | DATEADDED | SOURCEURL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1244378528 | 2025-05-10 | 2025-05 | 2025 | 2025.356 | USA | UNITED STATES | USA | NA | NA | NA | NA | NA | NA | NA | USAGOV | UNITED STATES | USA | NA | NA | NA | NA | GOV | NA | NA | 1 | 043 | 043 | 04 | 1 | 2.8 | 1 | 1 | 1 | -7.917889 | 3 | Newark, New Jersey, United States | US | USNJ | NJ013 | 40.73570 | -74.172400 | 878762 | 3 | Newark, New Jersey, United States | US | USNJ | NJ013 | 40.73570 | -74.172400 | 878762 | 3 | Newark, New Jersey, United States | US | USNJ | NJ013 | 40.73570 | -74.172400 | 878762 | 2.025052e+13 | https://newrepublic.com/post/195380/donald-trump-justice-department-charge-member-congress-ice |
| 1244378529 | 2025-05-10 | 2025-05 | 2025 | 2025.356 | USA | UNITED STATES | USA | NA | NA | NA | NA | NA | NA | NA | USAGOV | UNITED STATES | USA | NA | NA | NA | NA | GOV | NA | NA | 1 | 043 | 043 | 04 | 1 | 2.8 | 1 | 1 | 1 | -7.917889 | 3 | Newark, New Jersey, United States | US | USNJ | NJ013 | 40.73570 | -74.172400 | 878762 | 1 | Jersey | JE | JE | NA | 49.21667 | -2.116667 | JE | 3 | Newark, New Jersey, United States | US | USNJ | NJ013 | 40.73570 | -74.172400 | 878762 | 2.025052e+13 | https://newrepublic.com/post/195380/donald-trump-justice-department-charge-member-congress-ice |
| 1244378530 | 2025-05-10 | 2025-05 | 2025 | 2025.356 | USAGOV | UNITED STATES | USA | NA | NA | NA | NA | GOV | NA | NA | USA | UNITED STATES | USA | NA | NA | NA | NA | NA | NA | NA | 1 | 042 | 042 | 04 | 1 | 1.9 | 1 | 1 | 1 | -7.917889 | 1 | Jersey | JE | JE | NA | 49.21667 | -2.116667 | JE | 3 | Newark, New Jersey, United States | US | USNJ | NJ013 | 40.73570 | -74.172400 | 878762 | 3 | Newark, New Jersey, United States | US | USNJ | NJ013 | 40.73570 | -74.172400 | 878762 | 2.025052e+13 | https://newrepublic.com/post/195380/donald-trump-justice-department-charge-member-congress-ice |
| 1244378531 | 2025-05-10 | 2025-05 | 2025 | 2025.356 | USAGOV | UNITED STATES | USA | NA | NA | NA | NA | GOV | NA | NA | USA | JERSEY | USA | NA | NA | NA | NA | NA | NA | NA | 1 | 112 | 112 | 11 | 3 | -2.0 | 1 | 1 | 1 | -7.917889 | 3 | Newark, New Jersey, United States | US | USNJ | NJ013 | 40.73570 | -74.172400 | 878762 | 3 | Newark, New Jersey, United States | US | USNJ | NJ013 | 40.73570 | -74.172400 | 878762 | 1 | Jersey | JE | JE | NA | 49.21667 | -2.116667 | JE | 2.025052e+13 | https://newrepublic.com/post/195380/donald-trump-justice-department-charge-member-congress-ice |
| 1244378532 | 2025-05-10 | 2025-05 | 2025 | 2025.356 | USAGOV | UNITED STATES | USA | NA | NA | NA | NA | GOV | NA | NA | USA | JERSEY | USA | NA | NA | NA | NA | NA | NA | NA | 1 | 112 | 112 | 11 | 3 | -2.0 | 1 | 1 | 1 | -7.917889 | 1 | Jersey | JE | JE | NA | 49.21667 | -2.116667 | JE | 1 | Jersey | JE | JE | NA | 49.21667 | -2.116667 | JE | 1 | Jersey | JE | JE | NA | 49.21667 | -2.116667 | JE | 2.025052e+13 | https://newrepublic.com/post/195380/donald-trump-justice-department-charge-member-congress-ice |
| 1244378533 | 2025-05-16 | 2025-05 | 2025 | 2025.373 | IRQSPY | IRAQ | IRQ | NA | NA | NA | NA | SPY | NA | NA | SYR | DAMASCUS | SYR | NA | NA | NA | NA | NA | NA | NA | 0 | 040 | 040 | 04 | 1 | 1.0 | 1 | 1 | 1 | -1.829925 | 4 | Damascus, Dimashq, Syria | SY | SY13 | 26324 | 33.50000 | 36.300000 | -2541260 | 4 | Damascus, Dimashq, Syria | SY | SY13 | 26324 | 33.50000 | 36.300000 | -2541260 | 4 | Damascus, Dimashq, Syria | SY | SY13 | 26324 | 33.50000 | 36.300000 | -2541260 | 2.025052e+13 | https://www.zerohedge.com/geopolitical/jordans-king-warned-us-against-assassinating-syrias-sharaa-trump-meeting |
GDELT Event Mentions Data Cleaning
In this chunk, we convert EventTimeDate and MentionTimeDate into datetime formats, GLOBALEVENTID into strings, and create an Article column by extracting the article name from the source URL:
mention_gdelt_df_cleaned = mention_gdelt_df |>
mutate(
GLOBALEVENTID = as.character(GLOBALEVENTID),
EventTimeDate = format(lubridate::ymd_hms(as.character(EventTimeDate)),'%Y-%m-%d- %H:%M:%S'),
MentionTimeDate = format(lubridate::ymd_hms(as.character(MentionTimeDate)),'%Y-%m-%d- %H:%M:%S'),
) |>
group_by(GLOBALEVENTID) |>
arrange(GLOBALEVENTID, desc(Confidence), desc(MentionType), MentionSourceName) |>
slice_head(n=1) |>
ungroup() |>
filter(MentionType == 1 & Confidence >= 60)
mention_gdelt_df_cleaned = mention_gdelt_df_cleaned |>
mutate(Article = gsub('\\/$||\\/([A-Za-z]|\\d)*$|\\/*\\d*$|\\/([A-Z]|\\d)*\\/$|\\/*-*\\d*(\\.[a-z]+$|\\/$|$)|\\/*\\d*\\/_.*$|\\/\\?.*$', "", MentionIdentifier)) |>
mutate(Article = gsub('-2025-\\d+.*$|-2025\\d+-.*$|\\/\\d*-*p5lyys.*$', "", Article)) |>
mutate(Article = gsub('\\/([a-z]|\\d|article)*_*-*([a-z]|\\d)*-([a-z]|\\d)*-([a-z]|\\d)*-([a-z]|\\d){4}-([a-z]|\\d)*$', "", Article)) |>
mutate(Article = gsub('(https://|http://).*\\..*\\/\\d*-*', '', Article)) |>
mutate(Article = str_replace_all(Article, '-', ' ')) |>
mutate(Article = str_replace(Article, '\\.', '')) |>
mutate(Article = na_if(Article, 'index?more=')) |>
mutate(Article = ifelse(str_count(Article)<=1, NA, Article)) |>
mutate(Article = ifelse(str_count(Article, '\\w+')<=1, NA, Article))
knit_table(head(mention_gdelt_df_cleaned)) |> scroll_box(
width = "100%", box_css = "border: 1px solid #FFFFFF;")
| GLOBALEVENTID | EventTimeDate | MentionTimeDate | MentionType | MentionSourceName | MentionIdentifier | SentenceID | Actor1CharOffset | Actor2CharOffset | ActionCharOffset | InRawText | Confidence | MentionDocLen | MentionDocTone | MentionDocTranslationInfo | Extras | Article |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1173957950 | 2024-05-06- 00:45:00 | 2025-05-06- 00:00:00 | 1 | wwno.org | https://www.wwno.org/local-regional-news/2025-05-05/new-orleans-got-388-million-in-federal-pandemic-aid-was-that-money-well-spent | 9 | -1 | 2207 | 2153 | 1 | 60 | 14002 | 0.4860804 | NA | NA | new orleans got 388 million in federal pandemic aid was that money well spent |
| 1174054437 | 2024-05-06- 13:15:00 | 2025-05-06- 00:00:00 | 1 | marketscreener.com | https://www.marketscreener.com/news/latest/Diana-Ross-Simone-Biles-shine-at-Met-Ball-celebration-of-Black-menswear-49831578/ | 8 | 1767 | -1 | 1796 | 1 | 60 | 2152 | 3.5714286 | NA | NA | Diana Ross Simone Biles shine at Met Ball celebration of Black menswear |
| 1174227361 | 2024-05-07- 07:15:00 | 2025-05-07- 00:00:00 | 1 | bangkokpost.com | https://www.bangkokpost.com/thailand/general/3019210/govt-ready-for-new-talks-on-peace-in-deep-south | 5 | -1 | 823 | 941 | 1 | 60 | 3054 | -4.4680851 | NA | NA | govt ready for new talks on peace in deep south |
| 1174261394 | 2024-05-07- 11:00:00 | 2025-05-07- 00:00:00 | 1 | bangkokpost.com | https://www.bangkokpost.com/thailand/general/3019210/govt-ready-for-new-talks-on-peace-in-deep-south | 5 | 823 | -1 | 978 | 1 | 60 | 3054 | -4.4680851 | NA | NA | govt ready for new talks on peace in deep south |
| 1174400771 | 2024-05-08- 00:15:00 | 2025-05-08- 00:00:00 | 1 | kenw.org | https://www.kenw.org/npr-news/2025-05-07/3-former-memphis-police-officers-found-not-guilty-in-the-death-of-tyre-nichols | 6 | 1055 | -1 | 1037 | 1 | 100 | 5634 | -9.5744681 | NA | NA | former memphis police officers found not guilty in the death of tyre nichols |
| 1174418451 | 2024-05-08- 02:00:00 | 2025-05-08- 00:00:00 | 1 | cbs12.com | https://cbs12.com/news/connect-to-congress/house-gop-continues-to-push-colleges-to-do-more-to-combat-antisemitism-on-campus-haverford-college-depaul-university-california-polytechnic-state-university-house-education-committee | 6 | 2151 | 2158 | 2133 | 1 | 100 | 4985 | -5.7397959 | NA | NA | house gop continues to push colleges to do more to combat antisemitism on campus haverford college depaul university california polytechnic state university house education committee |
Join the Google BigQuery Data with the GDELT Data
gsearch_gdelt <- left_join(
long_google_search_df,
gdelt_df_cleaned,
by=c('refresh_date'='Day'))
Stop Words
Stop words are words that should be excluded from text analysis, such as
“the”, “a”, and “and”. The tidytext package provides stop word lists in
multiple languages; in this analysis, stop words were removed in both
English and Spanish.
# getting stop words for english and spanish
multi_lang_stopwords <- bind_rows(
get_stopwords(language = "en"),
get_stopwords(language = "es")
)
Tokenization
In text mining, tokenization refers to breaking text down into smaller
units. The Google search data already consists of short phrases, so when
we tokenize, the unit of analysis becomes the individual word.
tokenized_gsearch_gdelt <- gsearch_gdelt |>
unnest_tokens(output = word, input = term) |>
anti_join(multi_lang_stopwords, by = "word") |>
relocate(search_id, word)
Frequency Analysis
After tokenizing the text, a frequency analysis is conducted to better
understand which terms appear most often, both at the word level and at
the full Google search level.
most_freq_tokens <- tokenized_gsearch_gdelt |>
count(word, sort = TRUE) |>
slice_max(order_by = n, n = 25) |>
mutate(color_group = factor(row_number() %% 3))
most_freq_searches <- gsearch_gdelt |>
count(term, sort = TRUE) |>
slice_max(order_by = n, n = 25) |>
mutate(color_group = factor(row_number() %% 3))
ggplot(most_freq_tokens, aes(x=reorder(word, -n), y=n, fill = color_group))+
geom_bar(stat="identity")+
scale_fill_manual(values = colors) +
theme_minimal()+
theme(
axis.text.x = element_text(angle = 90, hjust = 1, color = titles),
plot.title = element_text(face = "bold"),
text = element_text(color = titles)
) +
labs(y = "Count",
x = "Tokenized Terms (Words)",
title = "Word Frequency in Google Searches",
subtitle = "Searches are limited to NYC and LA and are taken from the past two weeks")+
guides(fill=FALSE)
ggplot(most_freq_searches, aes(x=reorder(term, -n), y=n, fill = color_group))+
geom_bar(stat="identity")+
scale_fill_manual(values = colors) +
theme_minimal()+
theme(
axis.text.x = element_text(angle = 90, hjust = 1, color = titles),
plot.title = element_text(face = "bold"),
text = element_text(color = titles)
) +
labs(y = "Count",
x = "Google Searches",
title = "Google Searches Frequency",
subtitle = "Searches are limited to NYC and LA and are taken from the past two weeks")+
guides(fill=FALSE)
tokenized_gsearch_gdelt |>
group_by(refresh_date) |>
count(word, sort = TRUE) |>
top_n(5) |>
ungroup() |>
mutate(word = reorder_within(word, n, refresh_date),
date = factor(refresh_date, levels=sort(unique(tokenized_gsearch_gdelt$refresh_date),decreasing=TRUE),ordered = TRUE))|>
ggplot(aes(x = word, y = n, fill = word)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ date, scales = "free_y") +
coord_flip() +
scale_x_reordered() +
scale_y_continuous(expand = c(0,0)) +
theme_minimal()+
labs(y = "Count",
x = "Unique words",
title = "Most frequently googled words in past two weeks",
subtitle = "Stop words removed from the list")
gsearch_gdelt |>
filter(term_type == "top_terms",
refresh_date >= "2025-05-03") |>
group_by(refresh_date) |>
count(term, sort = TRUE) |>
top_n(3) |>
ungroup() |>
mutate(word = reorder_within(term, n, refresh_date),
date = factor(refresh_date, levels=sort(unique(gsearch_gdelt$refresh_date),decreasing=TRUE),ordered = TRUE))|>
ggplot(aes(x = word, y = n, fill = word)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ date, scales = "free_y") +
coord_flip() +
scale_x_reordered() +
scale_y_continuous(expand = c(0,0)) +
theme_minimal()+
labs(y = "Count",
x = "Top Searched Terms",
title = "Top Google Searches in Past 8 Days in LA and NYC",
subtitle = "")
gsearch_gdelt |>
filter(term_type == "rising_terms",
refresh_date >= "2025-05-03") |>
group_by(refresh_date) |>
count(term, sort = TRUE) |>
top_n(3) |>
ungroup() |>
mutate(word = reorder_within(term, n, refresh_date),
date = factor(refresh_date, levels=sort(unique(gsearch_gdelt$refresh_date),decreasing=TRUE),ordered = TRUE))|>
ggplot(aes(x = word, y = n, fill = word)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ date, scales = "free_y") +
coord_flip() +
scale_x_reordered() +
scale_y_continuous(expand = c(0,0)) +
theme_minimal()+
labs(y = "Count",
x = "Top Searched Terms",
title = "Rising Google Searches in Past 8 Days in LA and NYC",
subtitle = "")
Word Stemming
Another technique that can be used in text mining is word stemming.
Stemming reduces each term, or token, to its root form (also known as the
stem). This helps normalize the terms, so that play, playing, played, and
plays are all reduced to the same root.
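As a quick illustration using the example words above, all four forms collapse to a single common stem (the exact stem string depends on the stemming algorithm):
# Demonstrate stemming on standalone words before applying it to our tokens
SnowballC::wordStem(c("play", "playing", "played", "plays"))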
wrd_stem <- tokenized_gsearch_gdelt %>%
mutate(word = SnowballC::wordStem(word))
Document Term Matrix
A document-term matrix represents the frequency of each term across the
collection of documents.
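To make the structure concrete, here is a minimal sketch on a made-up two-document corpus: rows are documents, columns are terms, and each cell counts how often a term appears in a document.
# Toy corpus (illustrative only)
toy_corpus <- tibble(
  doc  = c("d1", "d1", "d1", "d2", "d2"),
  word = c("pope", "leo", "pope", "madrid", "pope")
)
# Count term frequencies per document and cast to a document-term matrix
toy_dtm <- toy_corpus |>
  count(doc, word) |>
  cast_dtm(document = doc, term = word, value = n)
as.matrix(toy_dtm)  # rows = documents, columns = terms, cells = counts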
# creating document term matrix
dtm_tokenized_terms <- wrd_stem |>
count(refresh_date, word) |>
cast_dtm(document = refresh_date, term = word, value = n)
# Convert to reg matrix
mat <- as.matrix(dtm_tokenized_terms)
# Convert to sparse Matrix (dgCMatrix) so can use in lapply below
dtm_mat <- Matrix(mat, sparse = TRUE)
Latent Dirichlet Allocation (LDA)
LDA is a topic modeling method that can be used to identify central
topics and distributions across a corpus.
Corpus: a corpus is a collection of text that is used in natural language processing.
In LDA, the value k defines the number of topics to extract from the corpus. To choose k, we fit models across a range of values and compare fit metrics.
# developing list of potential k values to fit in LDA model
k_list <- seq(2, 20, by = 1)
# testing fit for k values in LDA models
fit_test_k_models <- lapply(k_list, function(k) {
FitLdaModel(dtm = dtm_mat, #using converted matrix
k = k,
iterations = 200,
burnin = 175,
alpha = 0.1,
beta = 0.05,
optimize_alpha = TRUE,
calc_likelihood = TRUE,
calc_coherence = TRUE,
calc_r2 = TRUE)
})
names(fit_test_k_models) <- paste0("k_", k_list)
# Extracting final log-likelihood value
log_likelihood_list <- sapply(fit_test_k_models, function(x) tail(x$log_likelihood, 1)[2])
log_likelihood <- as.numeric(log_likelihood_list)
# generating summary of model perf.
model_summary <- data.frame(
k = k_list,
coherence = sapply(fit_test_k_models, function(x) mean(x$coherence)), # topic coherence
r2 = sapply(fit_test_k_models, function(x) x$r2), # R-squared: how well the model reproduces the observed data
log_likelihood = log_likelihood # log probability of the observed data under the model
)
ggplot(model_summary, aes(x = k)) +
geom_line(aes(y = coherence), color = "blue") +
geom_line(aes(y = r2 * max(coherence)), color = "green", linetype = "dashed") +
geom_line(aes(y = log_likelihood / max(abs(log_likelihood)) * max(coherence)),
color = "red", linetype = "dotted") +
ylab("Normalized Metric (for comparison)") +
ggtitle("LDA Tuning: Coherence, R2, and Log-Likelihood")
# Fit an LDA model (topicmodels), where the number of topics is k = 15
lda_model_k15 <- LDA(dtm_tokenized_terms, k = 15, control = list(seed = 1234))
topics_lda_model_k15 <- tidy(lda_model_k15, matrix = "beta")
# Top terms per topic
top_words_k15 <- topics_lda_model_k15 |>
group_by(topic) |>
top_n(10, beta) |>
ungroup()
top_terms_lda_k15 <- topics_lda_model_k15 |>
group_by(topic) |>
slice_max(beta, n = 3, with_ties = FALSE) |>
ungroup() |>
arrange(topic, -beta)
top_terms_lda_k15 |>
mutate(term = reorder_within(term, beta, topic)) |>
group_by(topic, term) |>
arrange(desc(beta)) |>
ungroup() |>
ggplot(aes(beta, term, fill = as.factor(topic))) +
geom_col(show.legend = FALSE) +
scale_y_reordered() +
labs(title = "Top 3 terms in each LDA topic",
subtitle = "LDA model where k=15",
x = expression(beta), y = NULL) +
facet_wrap(~ topic, ncol = 4, scales = "free")
# Fit an LDA model (topicmodels), where the number of topics is k = 11
lda_model_k11 <- LDA(dtm_tokenized_terms, k = 11, control = list(seed = 1234))
topics_lda_model_k11 <- tidy(lda_model_k11, matrix = "beta")
# Top terms per topic
top_words_k11 <- topics_lda_model_k11 |>
group_by(topic) |>
top_n(10, beta) |>
ungroup()
top_terms_lda_k11 <- topics_lda_model_k11 |>
group_by(topic) |>
slice_max(beta, n = 3, with_ties = FALSE) |>
ungroup() |>
arrange(topic, -beta)
top_terms_lda_k11 |>
mutate(term = reorder_within(term, beta, topic)) |>
group_by(topic, term) |>
arrange(desc(beta)) |>
ungroup() |>
ggplot(aes(beta, term, fill = as.factor(topic))) +
geom_col(show.legend = FALSE) +
scale_y_reordered() +
labs(title = "Top 3 terms in each LDA topic",
subtitle = "LDA model where k=11",
x = expression(beta), y = NULL) +
facet_wrap(~ topic, ncol = 4, scales = "free")
Assess string similarity between the Google top rising search terms and the GDELT event article names extracted from the highest-confidence event mention source URLs within the analysis time frame.
Remove Stop Words
Tokenize both the top rising terms and event article names to allow for
the removal of common stop words. Upon removal, the token words are
joined back together. This process will assist in obtaining a more
accurate similarity score by removing redundant or commonly shared
words.
token_str_gsearch = rising_term_df |>
unnest_tokens(output = clean_term, input = term) |>
anti_join(multi_lang_stopwords, by = c("clean_term" = "word")) |>
group_by(search_id) |>
summarise(clean_term = paste0(clean_term, collapse = ' ')) |>
ungroup() |>
select(search_id, clean_term) |>
distinct() |>
left_join(rising_term_df) |>
select(clean_term, term, refresh_date) |>
distinct()
token_str_gdelt = mention_gdelt_df_cleaned |>
drop_na(Article) |>
unnest_tokens(output = clean_article, input = Article) |>
anti_join(multi_lang_stopwords, by = c("clean_article" = "word")) |>
group_by(GLOBALEVENTID) |>
summarise(clean_article = paste0(clean_article, collapse = ' ')) |>
ungroup() |>
select(GLOBALEVENTID, clean_article) |>
distinct() |>
left_join(mention_gdelt_df_cleaned) |>
select(clean_article, Article, MentionTimeDate) |>
mutate(MentionTimeDate= as.Date(MentionTimeDate)) |>
distinct()
String Similarity Metrics
Levenshtein Distance (edit distance-based
algorithm):
A string metric for measuring the difference between two sequences. The
Levenshtein distance between two words is the minimum number of
single-character edits (insertions, deletions or substitutions) required
to change one word into the other. The more edits required, the less
similar the two strings are.
Jaro–Winkler Distance:
A string metric measuring an edit distance between two sequences. The
distance takes into account the number of matching characters and the
order in which they appear, with a higher weight assigned to matching
characters at the beginning of the strings.
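A minimal illustration of how the two metrics behave on made-up string pairs (both RecordLinkage functions return a similarity between 0 and 1, where 1 means identical):
# Both functions are vectorized over pairs of strings
demo_pairs <- tibble(
  a = c("pope leo", "pope leo", "habeas corpus"),
  b = c("pope leo", "pope lion", "derek carr")
)
demo_pairs |>
  mutate(
    lev_sim = levenshteinSim(a, b),  # normalized edit-distance similarity
    jw_sim  = jarowinkler(a, b)      # prefix-weighted similarity
  )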
Calculate similarity scores
sim_df = inner_join(rising_term_df,
token_str_gdelt,
by=c('refresh_date'='MentionTimeDate')) |>
left_join(token_str_gsearch) |>
select(term, clean_term, clean_article) |>
distinct() |>
mutate(jw_score = jarowinkler(term, clean_article, r=0.4)) |>
mutate(lev_score = levenshteinSim(term, clean_article)) |>
arrange(desc(jw_score), desc(lev_score))
knit_table(head(sim_df))
| term | clean_term | clean_article | jw_score | lev_score |
|---|---|---|---|---|
| pope leo xiv | pope leo xiv | pope leo xiv pick name | 0.9090909 | 0.5454545 |
| pope leo xiii | pope leo xiii | pope leo xiv pick name | 0.8937063 | 0.5454545 |
| india pakistan | india pakistan | india pakistan brink wider conflict | 0.8800000 | 0.4000000 |
| pope leo | pope leo | pope leo predecessors | 0.8761905 | 0.3809524 |
| pope leo xiv | pope leo xiv | pope leo xiv chicago augustinian | 0.8750000 | 0.3750000 |
| pope leo | pope leo | pope leo xiv pick name | 0.8727273 | 0.3636364 |
Display Levenshtein Distance Metric Similarity Results
sim_df |>
filter(lev_score > 0.4) |>
group_by(clean_term) |>
tally() |>
arrange(desc(n)) |>
mutate(color_group = factor(row_number() %% 3)) |>
ggplot(aes(x=reorder(clean_term, n), y=n, fill = color_group)) +
geom_col(show.legend = FALSE) +
coord_flip() +
theme_classic() +
scale_fill_manual(values = colors) +
theme(
plot.title = element_text(face = "bold"),
text = element_text(color = titles)
) +
labs(
title = 'Levenshtein Distance: Rising Terms & Event Article',
subtitle = 'Similarity Score > 0.4',
y = 'Term Count',
x = 'Rising Term'
)
filt_sim_df = sim_df |>
filter(lev_score > 0.4)
sim_df |>
filter(term %in% filt_sim_df$term & clean_article %in% filt_sim_df$clean_article) |>
ggplot(aes(x=term, y=reorder(clean_article, lev_score), fill=lev_score)) +
scale_y_discrete(labels = function(x) str_wrap(str_replace_all(x, "clean_article" , " "),
width = 45)) +
geom_tile() +
scale_fill_gradient(low = "#aac4e9", high = "#1f2c8f")+
theme_classic() +
theme(axis.text.x=element_text(angle=50, hjust=1)) +
theme(
plot.title = element_text(face = "bold"),
text = element_text(color = titles)
) +
labs(
title = 'Levenshtein Distance: Rising Terms & Event Article',
subtitle = 'Similarity Score > 0.4',
y = 'GDELT Event Source Article',
x = 'Rising Term',
fill='Levenshtein Score'
)
Display Jaro–Winkler Distance Metric Similarity Results
sim_df |>
filter(jw_score > 0.85) |>
group_by(clean_term) |>
tally() |>
arrange(desc(n)) |>
mutate(color_group = factor(row_number() %% 3)) |>
ggplot(aes(x=reorder(clean_term, n), y=n, fill = color_group)) +
geom_col(show.legend = FALSE) +
coord_flip() +
theme_classic() +
scale_fill_manual(values = colors) +
theme(
plot.title = element_text(face = "bold"),
text = element_text(color = titles)
) +
labs(
title = 'Jaro-Winkler Distance: Rising Terms & Event Article',
subtitle = 'Similarity Score > 0.85',
y = 'Term Count',
x = 'Rising Term'
)
filt_sim_df = sim_df |>
filter(jw_score >= 0.85)
sim_df |>
filter(term %in% filt_sim_df$term & clean_article %in% filt_sim_df$clean_article) |>
ggplot(aes(x=term, y=reorder(clean_article, jw_score), fill=jw_score)) +
scale_y_discrete(
labels = function(x) str_wrap(str_replace_all(x, "clean_article" , " "),
width = 45)) +
geom_tile() +
scale_fill_gradient(low = "#aac4e9", high = "#1f2c8f")+
theme_classic() +
theme(
plot.title = element_text(face = "bold", hjust=0.5),
plot.subtitle = element_text(hjust=0.5),
text = element_text(color = titles)) +
theme(axis.text.x=element_text(angle=50, hjust=1)) +
labs(
title = 'Jaro-Winkler Distance: Rising Terms & Event Article',
subtitle = 'Similarity Score > 0.85',
y = 'GDELT Event Source Article',
x = 'Rising Term',
fill='Jaro-Winkler Score'
)
Data Join:
Due to the difference in size and complexity of the Google search and
GDELT data, joining the dataframes for analysis proved difficult. For
example, the Google search dataset provides a daily refresh of the top 25
searched terms from the previous day, whereas GDELT publishes a CSV file
every 15 minutes containing hundreds of global events. Additionally, the
GDELT location data was not an exact match to the Google search data:
some observations were recorded at the country, state, or city level,
while others had no geographic area at all. This made it difficult to
compare the data adequately by geographic area.
The string similarity analysis was our solution for comparing reported events with top search terms without implementing a join. Given more time, we would build out logic to extract the GDELT data for exactly the same time frame as the Google search data. This would involve querying the GDELT database directly, rather than downloading CSV files, and weighing the difference between joining on when an event occurred versus when it was mentioned.
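For example, GDELT 2.0 is also mirrored as a BigQuery public dataset, which would let us pull only the events from the same window as the search terms instead of downloading the 15-minute CSV exports. A rough sketch of that approach, assuming the public gdelt-bq.gdeltv2.events table with its published column names (SQLDATE is an integer in YYYYMMDD format):
gdelt_query <- "
SELECT GLOBALEVENTID, SQLDATE, EventCode, QuadClass,
       ActionGeo_FullName, NumMentions, AvgTone, SOURCEURL
FROM `gdelt-bq.gdeltv2.events`
WHERE SQLDATE BETWEEN 20250506 AND 20250511
  AND ActionGeo_CountryCode = 'US'
"
# SQLDATE reflects when the event occurred; DATEADDED could be used instead
# to align with when the event was first reported.
gdelt_events_tbl <- bq_project_query(project_id, gdelt_query)
gdelt_events_df  <- bq_table_download(gdelt_events_tbl)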
Latent Dirichlet Allocation (LDA) Modeling:
The results of the LDA modeling show that topic modeling for Google
searches is possible but would require additional tidying. For example,
certain terms appeared in multiple topic categories, indicating that
duplicate terms may have been fed into the model. LDA can be effective
when the corpus is large enough, but it may not be the best option for
short text. To improve this analysis, another method, such as
single-topic LDA with clustering, may be preferable. For more information
about this modeling approach, please see the stLDA-C GitHub page.
String Similarity:
The results show the Jaro-Winkler similarity scores to be much higher
than the Levenshtein scores. This is because the Jaro-Winkler method
considers only matching characters and any required transpositions, and
weights matches at the beginning of the strings more heavily. Because the
GDELT article names are significantly longer than the rising term
strings, the Jaro-Winkler similarity scores can be misleading.
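A small sketch of that length effect, using a term and article pair that appears in the similarity table above: the Jaro-Winkler score stays high because the shared prefix dominates, while the Levenshtein similarity is dragged down by every extra character in the longer article name.
short_term   <- "pope leo"
long_article <- "pope leo xiv pick name"
jarowinkler(short_term, long_article, r = 0.4)  # high: shared prefix dominates
levenshteinSim(short_term, long_article)        # much lower: length difference penalized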
Next Steps: