library(bigrquery)
library(DBI)
library(tidyverse)
library(tidytext)
library(lubridate)
library(topicmodels)
library(countrycode)
library(SnowballC)
library(Matrix)
library(RecordLinkage)
library(textmineR)
library(kableExtra)
This project uses several R packages to support data acquisition,
cleaning, analysis, and visualization. For example,
bigrquery and DBI are used to query data from
Google BigQuery, tidytext enables text mining and topic
modeling, and ggplot2 is used for data visualization.
In this project, we explore how search behavior in the United States reflects interest in major pop culture events. Using unsupervised topic modeling, we analyze Google’s top search terms to identify recurring themes and trends. By linking search activity with globally documented events, we aim to uncover how societal attention manifests in digital behavior.
We use the bigrquery package to access Google’s public
dataset of the most popular daily search terms. Make sure you have a
Google Cloud project set up with billing enabled, along with credentials
for the BigQuery API (here, a service account JSON key).
# Authenticate with Google Cloud (interactive)
bq_auth(path = "data607finalproject-6c666e5b6214.json")
project_id <- "data607finalproject"
query <- as.character(
"SELECT
'top_terms' as term_type,
term,
score,
refresh_date,
dma_id,
dma_name,
ARRAY_AGG(STRUCT(rank,week) ORDER BY week DESC LIMIT 1) x
FROM
`bigquery-public-data.google_trends.top_terms`
WHERE refresh_date >= '2025-05-06' and refresh_date <= '2025-05-11'
AND regexp_contains(dma_name, 'New York|Los Angeles')
GROUP BY ALL
union all
SELECT
'rising_terms' as term_type,
term as term_rising,
score as score_rising,
refresh_date,
dma_id,
dma_name,
ARRAY_AGG(STRUCT(rank,week) ORDER BY week DESC LIMIT 1) x
FROM
`bigquery-public-data.google_trends.top_rising_terms`
WHERE refresh_date >= '2025-05-06' and refresh_date <= '2025-05-11'
AND regexp_contains(dma_name, 'New York|Los Angeles')
GROUP BY ALL
ORDER BY
(select week from unnest(x)) desc,
(SELECT rank FROM UNNEST(x)),
refresh_date desc
"
)
# Run dry run query to estimate billing
job <- bq_perform_query_dry_run(
billing = project_id,
query = query,
use_legacy_sql = FALSE
)
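Before running the query for real, it can help to check the dry-run estimate, since BigQuery bills by bytes scanned. A minimal sketch, assuming the dry run returns the estimated bytes to be processed as a bq_bytes value:
# Inspect the dry-run estimate (assumed to be the bytes the query would scan)
estimated_gb <- as.numeric(job) / 1e9
cat("Estimated data processed:", round(estimated_gb, 3), "GB\n")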
google_search_data <- bq_project_query(project_id, query)
google_search_df <- bq_table_download(google_search_data)
knit_table(head(google_search_df))
| term_type | term | score | refresh_date | dma_id | dma_name | x |
|---|---|---|---|---|---|---|
| top_terms | sabu | 100 | 2025-05-11 | 501 | New York NY | 1, 20219 |
| top_terms | sabu | 100 | 2025-05-11 | 803 | Los Angeles CA | 1, 20219 |
| rising_terms | habeas corpus | 100 | 2025-05-11 | 501 | New York NY | 1, 20219 |
| rising_terms | habeas corpus | 100 | 2025-05-11 | 803 | Los Angeles CA | 1, 20219 |
| top_terms | barcelona vs real madrid | 100 | 2025-05-11 | 501 | New York NY | 2, 20219 |
| top_terms | barcelona vs real madrid | 100 | 2025-05-11 | 803 | Los Angeles CA | 2, 20219 |
Getting the latest file name:
# Get the name of the most recent GDELT CSV file
latest_file <- read_lines("http://data.gdeltproject.org/gdeltv2/lastupdate.txt")
latest_file
## [1] "58335 ceb398cb084a6c1665071e85971a2279 http://data.gdeltproject.org/gdeltv2/20250517023000.export.CSV.zip"
## [2] "71232 f41e2d9e275493a78fe4c0185a41a1b4 http://data.gdeltproject.org/gdeltv2/20250517023000.mentions.CSV.zip"
## [3] "3307005 f76e6bc23ff338aef37ecc69d1672c1e http://data.gdeltproject.org/gdeltv2/20250517023000.gkg.csv.zip"
Getting event mentions within the same timeframe as the Google search terms:
all_files = tibble(
time = (
seq(ymd_hms('20250506000000'),ymd_hms('20250512000000'), by = '1 day'))) |>
mutate(
time = format(time, format = '%Y%m%d%H%M%S'),
file_url = paste0('http://data.gdeltproject.org/gdeltv2/', time, '.mentions.CSV.zip'),
destfile = paste0("Data/GDELT/gdelt_", time, ".zip")
)
knit_table(all_files) |> scroll_box(width = "100%", box_css = "border: 1px solid #FFFFFF;")
| time | file_url | destfile |
|---|---|---|
| 20250506000000 | http://data.gdeltproject.org/gdeltv2/20250506000000.mentions.CSV.zip | Data/GDELT/gdelt_20250506000000.zip |
| 20250507000000 | http://data.gdeltproject.org/gdeltv2/20250507000000.mentions.CSV.zip | Data/GDELT/gdelt_20250507000000.zip |
| 20250508000000 | http://data.gdeltproject.org/gdeltv2/20250508000000.mentions.CSV.zip | Data/GDELT/gdelt_20250508000000.zip |
| 20250509000000 | http://data.gdeltproject.org/gdeltv2/20250509000000.mentions.CSV.zip | Data/GDELT/gdelt_20250509000000.zip |
| 20250510000000 | http://data.gdeltproject.org/gdeltv2/20250510000000.mentions.CSV.zip | Data/GDELT/gdelt_20250510000000.zip |
| 20250511000000 | http://data.gdeltproject.org/gdeltv2/20250511000000.mentions.CSV.zip | Data/GDELT/gdelt_20250511000000.zip |
| 20250512000000 | http://data.gdeltproject.org/gdeltv2/20250512000000.mentions.CSV.zip | Data/GDELT/gdelt_20250512000000.zip |
Download and Unzip:
# Constructing the full URL to download the ZIP file
export_url_line <- latest_file[1]
# Extract the URL from the line using strsplit
export_url <- strsplit(export_url_line, " ")[[1]][3]
# Download the export ZIP file to your working directory
download.file(export_url, destfile = "Data/GDELT/gdelt_latest.zip", mode = "wb")
# Unzip the file
unzip("Data/GDELT/gdelt_latest.zip", exdir='Data/GDELT')
Read CSV into R:
# Extract filename only from URL
csv_file <- paste0('Data/GDELT/', gsub(".zip", "", basename(export_url)))
# Read the CSV file with correct case-sensitive name
gdelt_df <- read_csv(csv_file, col_names = FALSE)
GDELT 2.0 – Structured Load
GDELT 2.0 is a massive global event database that captures media
coverage of events in real time from across the world. The export files
are updated every 15 minutes and are tab-delimited without column
headers, so we must assign names manually using the GDELT schema.
In this chunk, we load the most recent export file and apply column names for easier exploration:
gdelt_cols <- c(
"GLOBALEVENTID", "Day", "MonthYear", "Year", "FractionDate",
"Actor1Code", "Actor1Name", "Actor1CountryCode", "Actor1KnownGroupCode",
"Actor1EthnicCode", "Actor1Religion1Code", "Actor1Religion2Code",
"Actor1Type1Code", "Actor1Type2Code", "Actor1Type3Code",
"Actor2Code", "Actor2Name", "Actor2CountryCode", "Actor2KnownGroupCode",
"Actor2EthnicCode", "Actor2Religion1Code", "Actor2Religion2Code",
"Actor2Type1Code", "Actor2Type2Code", "Actor2Type3Code",
"IsRootEvent", "EventCode", "EventBaseCode", "EventRootCode",
"QuadClass", "GoldsteinScale", "NumMentions", "NumSources",
"NumArticles", "AvgTone", "Actor1Geo_Type", "Actor1Geo_FullName",
"Actor1Geo_CountryCode", "Actor1Geo_ADM1Code", "Actor1Geo_ADM2Code",
"Actor1Geo_Lat", "Actor1Geo_Long", "Actor1Geo_FeatureID", "Actor2Geo_Type",
"Actor2Geo_FullName", "Actor2Geo_CountryCode", "Actor2Geo_ADM1Code",
"Actor2Geo_ADM2Code", "Actor2Geo_Lat", "Actor2Geo_Long", "Actor2Geo_FeatureID",
"ActionGeo_Type", "ActionGeo_FullName", "ActionGeo_CountryCode",
"ActionGeo_ADM1Code", "Action2Geo_ADM2Code", "ActionGeo_Lat", "ActionGeo_Long",
"ActionGeo_FeatureID", "DATEADDED", "SOURCEURL"
)
# Read the file with tab delimiter and custom column names
gdelt_df <- read_delim(csv_file, delim = "\t", col_names = gdelt_cols, show_col_types = FALSE)
# Preview first rows
knit_table(head(gdelt_df)) |> scroll_box(width = "100%", box_css = "border: 1px solid #FFFFFF;")
| GLOBALEVENTID | Day | MonthYear | Year | FractionDate | Actor1Code | Actor1Name | Actor1CountryCode | Actor1KnownGroupCode | Actor1EthnicCode | Actor1Religion1Code | Actor1Religion2Code | Actor1Type1Code | Actor1Type2Code | Actor1Type3Code | Actor2Code | Actor2Name | Actor2CountryCode | Actor2KnownGroupCode | Actor2EthnicCode | Actor2Religion1Code | Actor2Religion2Code | Actor2Type1Code | Actor2Type2Code | Actor2Type3Code | IsRootEvent | EventCode | EventBaseCode | EventRootCode | QuadClass | GoldsteinScale | NumMentions | NumSources | NumArticles | AvgTone | Actor1Geo_Type | Actor1Geo_FullName | Actor1Geo_CountryCode | Actor1Geo_ADM1Code | Actor1Geo_ADM2Code | Actor1Geo_Lat | Actor1Geo_Long | Actor1Geo_FeatureID | Actor2Geo_Type | Actor2Geo_FullName | Actor2Geo_CountryCode | Actor2Geo_ADM1Code | Actor2Geo_ADM2Code | Actor2Geo_Lat | Actor2Geo_Long | Actor2Geo_FeatureID | ActionGeo_Type | ActionGeo_FullName | ActionGeo_CountryCode | ActionGeo_ADM1Code | Action2Geo_ADM2Code | ActionGeo_Lat | ActionGeo_Long | ActionGeo_FeatureID | DATEADDED | SOURCEURL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1244378511 | 20240517 | 202405 | 2024 | 2024.375 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | AFGINSTAL | TALIBAN | AFG | TAL | NA | NA | NA | INS | NA | NA | 0 | 042 | 042 | 04 | 1 | 1.9 | 3 | 1 | 3 | -0.831202 | 0 | NA | NA | NA | NA | NA | NA | NA | 4 | Kabul, Kabol, Afghanistan | AF | AF13 | 3580 | 34.5167 | 69.1833 | -3378435 | 4 | Kabul, Kabol, Afghanistan | AF | AF13 | 3580 | 34.5167 | 69.1833 | -3378435 | 2.025052e+13 | https://wset.com/news/local/trump-admins-suspension-of-federal-refugee-program-upends-afghan-community-in-virginia-may-2025 |
| 1244378512 | 20240517 | 202405 | 2024 | 2024.375 | AFG | AFGHAN | AFG | NA | NA | NA | NA | NA | NA | NA | AFGINSTAL | TALIBAN | AFG | TAL | NA | NA | NA | INS | NA | NA | 0 | 042 | 042 | 04 | 1 | 1.9 | 3 | 1 | 3 | -0.831202 | 4 | Kabul, Kabol, Afghanistan | AF | AF13 | 3580 | 34.5167 | 69.1833 | -3378435 | 4 | Kabul, Kabol, Afghanistan | AF | AF13 | 3580 | 34.5167 | 69.1833 | -3378435 | 4 | Kabul, Kabol, Afghanistan | AF | AF13 | 3580 | 34.5167 | 69.1833 | -3378435 | 2.025052e+13 | https://wset.com/news/local/trump-admins-suspension-of-federal-refugee-program-upends-afghan-community-in-virginia-may-2025 |
| 1244378513 | 20240517 | 202405 | 2024 | 2024.375 | AFG | AFGHANISTAN | AFG | NA | NA | NA | NA | NA | NA | NA | AFGINSTAL | TALIBAN | AFG | TAL | NA | NA | NA | INS | NA | NA | 0 | 042 | 042 | 04 | 1 | 1.9 | 2 | 1 | 2 | -0.831202 | 4 | Kabul, Kabol, Afghanistan | AF | AF13 | 3580 | 34.5167 | 69.1833 | -3378435 | 3 | Fredericksburg, Virginia, United States | US | USVA | VA163 | 37.8929 | -79.4378 | 1494946 | 4 | Kabul, Kabol, Afghanistan | AF | AF13 | 3580 | 34.5167 | 69.1833 | -3378435 | 2.025052e+13 | https://wset.com/news/local/trump-admins-suspension-of-federal-refugee-program-upends-afghan-community-in-virginia-may-2025 |
| 1244378514 | 20240517 | 202405 | 2024 | 2024.375 | AFG | AFGHANISTAN | AFG | NA | NA | NA | NA | NA | NA | NA | AFGINSTAL | TALIBAN | AFG | TAL | NA | NA | NA | INS | NA | NA | 0 | 042 | 042 | 04 | 1 | 1.9 | 2 | 1 | 2 | -0.831202 | 4 | Saigon, H? Chíinh, Vietnam, Republic Of | VM | VM20 | 74101 | 10.7500 | 106.6670 | -3730078 | 4 | Saigon, H? Chíinh, Vietnam, Republic Of | VM | VM20 | 74101 | 10.7500 | 106.6670 | -3730078 | 4 | Saigon, H? Chíinh, Vietnam, Republic Of | VM | VM20 | 74101 | 10.7500 | 106.6670 | -3730078 | 2.025052e+13 | https://wset.com/news/local/trump-admins-suspension-of-federal-refugee-program-upends-afghan-community-in-virginia-may-2025 |
| 1244378515 | 20240517 | 202405 | 2024 | 2024.375 | AFGINSTAL | TALIBAN | AFG | TAL | NA | NA | NA | INS | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0 | 043 | 043 | 04 | 1 | 2.8 | 3 | 1 | 3 | -0.831202 | 4 | Kabul, Kabol, Afghanistan | AF | AF13 | 3580 | 34.5167 | 69.1833 | -3378435 | 0 | NA | NA | NA | NA | NA | NA | NA | 4 | Kabul, Kabol, Afghanistan | AF | AF13 | 3580 | 34.5167 | 69.1833 | -3378435 | 2.025052e+13 | https://wset.com/news/local/trump-admins-suspension-of-federal-refugee-program-upends-afghan-community-in-virginia-may-2025 |
| 1244378516 | 20240517 | 202405 | 2024 | 2024.375 | AFGINSTAL | TALIBAN | AFG | TAL | NA | NA | NA | INS | NA | NA | AFG | AFGHAN | AFG | NA | NA | NA | NA | NA | NA | NA | 0 | 043 | 043 | 04 | 1 | 2.8 | 3 | 1 | 3 | -0.831202 | 4 | Kabul, Kabol, Afghanistan | AF | AF13 | 3580 | 34.5167 | 69.1833 | -3378435 | 4 | Kabul, Kabol, Afghanistan | AF | AF13 | 3580 | 34.5167 | 69.1833 | -3378435 | 4 | Kabul, Kabol, Afghanistan | AF | AF13 | 3580 | 34.5167 | 69.1833 | -3378435 | 2.025052e+13 | https://wset.com/news/local/trump-admins-suspension-of-federal-refugee-program-upends-afghan-community-in-virginia-may-2025 |
In the following code chunk, we repeat the same process as above to load the event mention export file and apply column names.
The Mentions table records each individual mention of an event in an article. It includes a Confidence field that records GDELT’s confidence in its extraction of that event from that particular article.
mention_gdelt_cols <- c(
"GLOBALEVENTID", "EventTimeDate", "MentionTimeDate", "MentionType",
"MentionSourceName", "MentionIdentifier",
"SentenceID", "Actor1CharOffset", "Actor2CharOffset", "ActionCharOffset",
"InRawText", "Confidence", "MentionDocLen",
"MentionDocTone", "MentionDocTranslationInfo", "Extras"
)
for (i in 1:nrow(all_files)){
download.file(
all_files$file_url[i],
destfile = all_files$destfile[i],
mode = "wb", quiet = TRUE)
# Unzip the file
unzip(all_files$destfile[i], exdir='Data/GDELT')
csv_file <- paste0('Data/GDELT/', gsub(".zip", "", basename(all_files$file_url[i])))
# Read the CSV file with correct case-sensitive name
if (i == 1) {
mention_gdelt_df <- read_delim(
csv_file, delim = "\t", col_names = mention_gdelt_cols,
col_types = list('c', 'c', 'c', 'd', 'c', 'c', 'd', 'd', 'd', 'd', 'd', 'd', 'd', 'd', '?', '?'))
}
else{
temp_df <- read_delim(
csv_file, delim = "\t", col_names = mention_gdelt_cols,
col_types = list('c', 'c', 'c', 'd', 'c', 'c', 'd', 'd', 'd', 'd', 'd', 'd', 'd', 'd', '?', '?'))
mention_gdelt_df = rbind(mention_gdelt_df, temp_df)
}
}
knit_table(head(mention_gdelt_df)) |> scroll_box(width = "100%", box_css = "border: 1px solid #FFFFFF;")
| GLOBALEVENTID | EventTimeDate | MentionTimeDate | MentionType | MentionSourceName | MentionIdentifier | SentenceID | Actor1CharOffset | Actor2CharOffset | ActionCharOffset | InRawText | Confidence | MentionDocLen | MentionDocTone | MentionDocTranslationInfo | Extras |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1173957950 | 20240506004500 | 20250506000000 | 1 | wwno.org | https://www.wwno.org/local-regional-news/2025-05-05/new-orleans-got-388-million-in-federal-pandemic-aid-was-that-money-well-spent | 9 | -1 | 2207 | 2153 | 1 | 60 | 14002 | 0.4860804 | NA | NA |
| 1174053899 | 20240506131500 | 20250506000000 | 1 | nydailynews.com | https://www.nydailynews.com/2025/05/05/met-gala-2025-adrienne-adams-brad-lander-nyc-politicians/ | 4 | -1 | 611 | 629 | 0 | 20 | 2332 | 3.7234043 | NA | NA |
| 1241978379 | 20250506000000 | 20250506000000 | 1 | bignewsnetwork.com | http://www.bignewsnetwork.com/news/278204974/canada-alberta-premier-says-separation-referendum-possible-in-2026 | 1 | 64 | 209 | 127 | 0 | 20 | 1114 | -2.6881720 | NA | NA |
| 1173957950 | 20240506004500 | 20250506000000 | 1 | wwno.org | https://www.wwno.org/local-regional-news/2025-05-05/new-orleans-got-388-million-in-federal-pandemic-aid-was-that-money-well-spent | 9 | 2207 | -1 | 2153 | 1 | 60 | 14002 | 0.4860804 | NA | NA |
| 1174054437 | 20240506131500 | 20250506000000 | 1 | marketscreener.com | https://www.marketscreener.com/news/latest/Diana-Ross-Simone-Biles-shine-at-Met-Ball-celebration-of-Black-menswear-49831578/ | 8 | 1767 | -1 | 1796 | 1 | 60 | 2152 | 3.5714286 | NA | NA |
| 1174033236 | 20240506111500 | 20250506000000 | 1 | wwno.org | https://www.wwno.org/local-regional-news/2025-05-05/new-orleans-got-388-million-in-federal-pandemic-aid-was-that-money-well-spent | 9 | 2205 | 2132 | 2151 | 0 | 40 | 14002 | 0.4860804 | NA | NA |
Load GDELT Reference Table Data
Define reference table data file sources
cameo_country_code_txt <- "https://www.gdeltproject.org/data/lookups/CAMEO.country.txt"
fips_country_code_txt <- 'https://www.gdeltproject.org/data/lookups/FIPS.country.txt'
cameo_code_type_txt <- 'https://www.gdeltproject.org/data/lookups/CAMEO.type.txt'
cameo_known_code_groups_txt <- 'https://www.gdeltproject.org/data/lookups/CAMEO.knowngroup.txt'
cameo_ethnic_code_txt <- 'https://www.gdeltproject.org/data/lookups/CAMEO.ethnic.txt'
cameo_religion_code_txt <- 'https://www.gdeltproject.org/data/lookups/CAMEO.religion.txt'
cameo_event_code_txt <- 'https://www.gdeltproject.org/data/lookups/CAMEO.eventcodes.txt'
cameo_gold_stein_scale_txt <- 'https://www.gdeltproject.org/data/lookups/CAMEO.goldsteinscale.txt'
adm1_codes_txt <- 'http://efele.net/maps/fips-10/data/fips-414.txt'
adm2_codes_txt <- 'https://download.geonames.org/export/dump/admin2Codes.txt'
Actor Data Reference Tables
# Actor Country Code column reference values
cameo_country_ref <- read.csv(
cameo_country_code_txt, sep="\t", header=T,
col.names = c('Country', 'Country_Desc'))
# Actor Known Code Group column reference values
cameo_known_code_groups_ref <- read.csv(
cameo_known_code_groups_txt, sep="\t", header=T,
col.names = c('KnownGroup', 'KnownGroup_Desc'))
# Actor Ethnic Code column reference values
cameo_ethnic_ref <- read.csv(cameo_ethnic_code_txt, sep="\t", header=T,
col.names = c('Ethnic', 'Ethnic_Desc'))
# Actor Religion Code (1 and 2) column reference values
cameo_religion_ref <- read.csv(cameo_religion_code_txt, sep="\t", header=T,
col.names = c('Religion', 'Religion_Desc'))
# Actor Type Code (1, 2, and 3) column reference values
cameo_type_ref <- read.csv(cameo_code_type_txt, sep="\t", header=T,
col.names = c('TypeCode', 'Type'))
Event Data Reference Tables
# Event Code (including base and root event codes) column reference values
cameo_event_ref <- read.csv(cameo_event_code_txt, sep="\t", header=T ,
colClasses = "character",
col.names = c('EventCode', 'Event'))
# Event Goldstein Scale column reference values
cameo_gold_stein_ref <- read.csv(cameo_gold_stein_scale_txt, sep="\t", header=T,
col.names = c('EventCode', 'GoldSteinScale'))
# Event quad class (primary classification) column reference values
quadclass_ref <- tribble(
~"QuadClass", ~"Quad",
1, 'Verbal Cooperation',
2, 'Material Cooperation',
3, 'Verbal Conflict',
4, 'Material Conflict'
)
Event Geography Data Reference Tables
# Event geographic resolution column reference values
geo_type_ref <- tribble(
~"Geo_Type", ~"Geo_Type_Desc",
1, 'COUNTRY',
2, 'USSTATE',
3, 'USCITY',
4, 'WORLDCITY',
5, 'WORLDSTATE'
)
# FIPS country column reference values
fips_country_ref <- read.csv(fips_country_code_txt, sep="\t", header=F,
col.names = c('GEO_Country', 'GEO_Country_Desc'))
# GEO ADM1 code column reference values
# create U.S. adm1 codes using 'US' followed by the state abbreviation
us_fips_adm1_ref <- tibble(
'Country' = 'US',
'division' = 'state',
fips_adm1 = state.name,
state_abb = state.abb
) |>
mutate(fips_adm1_code = paste0(Country, state_abb)) |>
select(fips_adm1_code, division, fips_adm1)
# read in global adm1 codes and join to custom U.S. codes
fips_adm1_ref <- read.csv(
adm1_codes_txt, sep="_", header=F, na.strings = "",
col.names = c('fips_adm1_code',
'first fips version',
'last fips version',
'division',
'designation_2',
'designation_3',
'designation4',
'fips_adm1',
'conventional name',
'former name')) |>
select('fips_adm1_code', 'division', 'fips_adm1') |>
rbind(us_fips_adm1_ref)
Unnest the nested rank and week values, replace missing score values with zero, and add a search_id column from the associated row number:
long_google_search_df <- google_search_df |>
unnest(x) |>
mutate(
score = replace_na(score, 0),
search_id = row_number())
knit_table(head(long_google_search_df))
| term_type | term | score | refresh_date | dma_id | dma_name | rank | week | search_id |
|---|---|---|---|---|---|---|---|---|
| top_terms | sabu | 100 | 2025-05-11 | 501 | New York NY | 1 | 2025-05-11 | 1 |
| top_terms | sabu | 100 | 2025-05-11 | 803 | Los Angeles CA | 1 | 2025-05-11 | 2 |
| rising_terms | habeas corpus | 100 | 2025-05-11 | 501 | New York NY | 1 | 2025-05-11 | 3 |
| rising_terms | habeas corpus | 100 | 2025-05-11 | 803 | Los Angeles CA | 1 | 2025-05-11 | 4 |
| top_terms | barcelona vs real madrid | 100 | 2025-05-11 | 501 | New York NY | 2 | 2025-05-11 | 5 |
| top_terms | barcelona vs real madrid | 100 | 2025-05-11 | 803 | Los Angeles CA | 2 | 2025-05-11 | 6 |
Subset refresh date into its own table for reference.
refresh_date_ref <- long_google_search_df |>
select(refresh_date) |>
distinct()
knit_table(head(refresh_date_ref))
| refresh_date |
|---|
| 2025-05-11 |
| 2025-05-10 |
| 2025-05-09 |
| 2025-05-08 |
| 2025-05-07 |
| 2025-05-06 |
Create Designated Market Area (DMA) reference table.
dma_ref_df <- long_google_search_df |>
select(dma_id, dma_name) |>
distinct()
knit_table(head(dma_ref_df))
| dma_id | dma_name |
|---|---|
| 501 | New York NY |
| 803 | Los Angeles CA |
Create final Google top term dataframe
top_term_df <- long_google_search_df |>
filter(term_type == "top_terms")
knit_table(head(top_term_df))
| term_type | term | score | refresh_date | dma_id | dma_name | rank | week | search_id |
|---|---|---|---|---|---|---|---|---|
| top_terms | sabu | 100 | 2025-05-11 | 501 | New York NY | 1 | 2025-05-11 | 1 |
| top_terms | sabu | 100 | 2025-05-11 | 803 | Los Angeles CA | 1 | 2025-05-11 | 2 |
| top_terms | barcelona vs real madrid | 100 | 2025-05-11 | 501 | New York NY | 2 | 2025-05-11 | 5 |
| top_terms | barcelona vs real madrid | 100 | 2025-05-11 | 803 | Los Angeles CA | 2 | 2025-05-11 | 6 |
| top_terms | barcelona - real madrid | 100 | 2025-05-11 | 501 | New York NY | 3 | 2025-05-11 | 9 |
| top_terms | barcelona - real madrid | 100 | 2025-05-11 | 803 | Los Angeles CA | 3 | 2025-05-11 | 10 |
Create final Google rising term dataframe
rising_term_df <- long_google_search_df |>
filter(term_type == "rising_terms")
knit_table(head(rising_term_df))
| term_type | term | score | refresh_date | dma_id | dma_name | rank | week | search_id |
|---|---|---|---|---|---|---|---|---|
| rising_terms | habeas corpus | 100 | 2025-05-11 | 501 | New York NY | 1 | 2025-05-11 | 3 |
| rising_terms | habeas corpus | 100 | 2025-05-11 | 803 | Los Angeles CA | 1 | 2025-05-11 | 4 |
| rising_terms | derek carr | 98 | 2025-05-11 | 501 | New York NY | 2 | 2025-05-11 | 7 |
| rising_terms | derek carr | 72 | 2025-05-11 | 803 | Los Angeles CA | 2 | 2025-05-11 | 8 |
| rising_terms | tripe | 12 | 2025-05-11 | 501 | New York NY | 3 | 2025-05-11 | 11 |
| rising_terms | tripe | 21 | 2025-05-11 | 803 | Los Angeles CA | 3 | 2025-05-11 | 12 |
Note: The GDELT data maintains a tidy format in its raw state, so no manual tidying tasks were needed.
GDELT Event Data Cleaning
In this chunk, we convert Day into a date data type, MonthYear into a year-month string, and GLOBALEVENTID into a string, then keep only events from the past two weeks:
gdelt_df_cleaned <- gdelt_df |>
mutate_at(vars(contains('Code')), ~as.factor(.)) |>
mutate_at(vars(contains('Geo_Type')), ~as.factor(.)) |>
mutate(
GLOBALEVENTID = as.character(GLOBALEVENTID),
Day = as.Date(as.character(Day), '%Y%m%d'),
MonthYear = format(Day, '%Y-%m'),
QuadClass = as.factor(QuadClass),
) |>
filter(Day >= today() - days(14) & Day <= today())
knit_table(head(gdelt_df_cleaned)) |> scroll_box(
width = "100%", box_css = "border: 1px solid #FFFFFF;")
| GLOBALEVENTID | Day | MonthYear | Year | FractionDate | Actor1Code | Actor1Name | Actor1CountryCode | Actor1KnownGroupCode | Actor1EthnicCode | Actor1Religion1Code | Actor1Religion2Code | Actor1Type1Code | Actor1Type2Code | Actor1Type3Code | Actor2Code | Actor2Name | Actor2CountryCode | Actor2KnownGroupCode | Actor2EthnicCode | Actor2Religion1Code | Actor2Religion2Code | Actor2Type1Code | Actor2Type2Code | Actor2Type3Code | IsRootEvent | EventCode | EventBaseCode | EventRootCode | QuadClass | GoldsteinScale | NumMentions | NumSources | NumArticles | AvgTone | Actor1Geo_Type | Actor1Geo_FullName | Actor1Geo_CountryCode | Actor1Geo_ADM1Code | Actor1Geo_ADM2Code | Actor1Geo_Lat | Actor1Geo_Long | Actor1Geo_FeatureID | Actor2Geo_Type | Actor2Geo_FullName | Actor2Geo_CountryCode | Actor2Geo_ADM1Code | Actor2Geo_ADM2Code | Actor2Geo_Lat | Actor2Geo_Long | Actor2Geo_FeatureID | ActionGeo_Type | ActionGeo_FullName | ActionGeo_CountryCode | ActionGeo_ADM1Code | Action2Geo_ADM2Code | ActionGeo_Lat | ActionGeo_Long | ActionGeo_FeatureID | DATEADDED | SOURCEURL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1244378528 | 2025-05-10 | 2025-05 | 2025 | 2025.356 | USA | UNITED STATES | USA | NA | NA | NA | NA | NA | NA | NA | USAGOV | UNITED STATES | USA | NA | NA | NA | NA | GOV | NA | NA | 1 | 043 | 043 | 04 | 1 | 2.8 | 1 | 1 | 1 | -7.917889 | 3 | Newark, New Jersey, United States | US | USNJ | NJ013 | 40.73570 | -74.172400 | 878762 | 3 | Newark, New Jersey, United States | US | USNJ | NJ013 | 40.73570 | -74.172400 | 878762 | 3 | Newark, New Jersey, United States | US | USNJ | NJ013 | 40.73570 | -74.172400 | 878762 | 2.025052e+13 | https://newrepublic.com/post/195380/donald-trump-justice-department-charge-member-congress-ice |
| 1244378529 | 2025-05-10 | 2025-05 | 2025 | 2025.356 | USA | UNITED STATES | USA | NA | NA | NA | NA | NA | NA | NA | USAGOV | UNITED STATES | USA | NA | NA | NA | NA | GOV | NA | NA | 1 | 043 | 043 | 04 | 1 | 2.8 | 1 | 1 | 1 | -7.917889 | 3 | Newark, New Jersey, United States | US | USNJ | NJ013 | 40.73570 | -74.172400 | 878762 | 1 | Jersey | JE | JE | NA | 49.21667 | -2.116667 | JE | 3 | Newark, New Jersey, United States | US | USNJ | NJ013 | 40.73570 | -74.172400 | 878762 | 2.025052e+13 | https://newrepublic.com/post/195380/donald-trump-justice-department-charge-member-congress-ice |
| 1244378530 | 2025-05-10 | 2025-05 | 2025 | 2025.356 | USAGOV | UNITED STATES | USA | NA | NA | NA | NA | GOV | NA | NA | USA | UNITED STATES | USA | NA | NA | NA | NA | NA | NA | NA | 1 | 042 | 042 | 04 | 1 | 1.9 | 1 | 1 | 1 | -7.917889 | 1 | Jersey | JE | JE | NA | 49.21667 | -2.116667 | JE | 3 | Newark, New Jersey, United States | US | USNJ | NJ013 | 40.73570 | -74.172400 | 878762 | 3 | Newark, New Jersey, United States | US | USNJ | NJ013 | 40.73570 | -74.172400 | 878762 | 2.025052e+13 | https://newrepublic.com/post/195380/donald-trump-justice-department-charge-member-congress-ice |
| 1244378531 | 2025-05-10 | 2025-05 | 2025 | 2025.356 | USAGOV | UNITED STATES | USA | NA | NA | NA | NA | GOV | NA | NA | USA | JERSEY | USA | NA | NA | NA | NA | NA | NA | NA | 1 | 112 | 112 | 11 | 3 | -2.0 | 1 | 1 | 1 | -7.917889 | 3 | Newark, New Jersey, United States | US | USNJ | NJ013 | 40.73570 | -74.172400 | 878762 | 3 | Newark, New Jersey, United States | US | USNJ | NJ013 | 40.73570 | -74.172400 | 878762 | 1 | Jersey | JE | JE | NA | 49.21667 | -2.116667 | JE | 2.025052e+13 | https://newrepublic.com/post/195380/donald-trump-justice-department-charge-member-congress-ice |
| 1244378532 | 2025-05-10 | 2025-05 | 2025 | 2025.356 | USAGOV | UNITED STATES | USA | NA | NA | NA | NA | GOV | NA | NA | USA | JERSEY | USA | NA | NA | NA | NA | NA | NA | NA | 1 | 112 | 112 | 11 | 3 | -2.0 | 1 | 1 | 1 | -7.917889 | 1 | Jersey | JE | JE | NA | 49.21667 | -2.116667 | JE | 1 | Jersey | JE | JE | NA | 49.21667 | -2.116667 | JE | 1 | Jersey | JE | JE | NA | 49.21667 | -2.116667 | JE | 2.025052e+13 | https://newrepublic.com/post/195380/donald-trump-justice-department-charge-member-congress-ice |
| 1244378533 | 2025-05-16 | 2025-05 | 2025 | 2025.373 | IRQSPY | IRAQ | IRQ | NA | NA | NA | NA | SPY | NA | NA | SYR | DAMASCUS | SYR | NA | NA | NA | NA | NA | NA | NA | 0 | 040 | 040 | 04 | 1 | 1.0 | 1 | 1 | 1 | -1.829925 | 4 | Damascus, Dimashq, Syria | SY | SY13 | 26324 | 33.50000 | 36.300000 | -2541260 | 4 | Damascus, Dimashq, Syria | SY | SY13 | 26324 | 33.50000 | 36.300000 | -2541260 | 4 | Damascus, Dimashq, Syria | SY | SY13 | 26324 | 33.50000 | 36.300000 | -2541260 | 2.025052e+13 | https://www.zerohedge.com/geopolitical/jordans-king-warned-us-against-assassinating-syrias-sharaa-trump-meeting |
GDELT Event Mentions Data Cleaning
In this chunk, we convert EventTimeDate and MentionTimeDate into datetime formats, GLOBALEVENTID into strings, and create an Article column by extracting the article name from the source URL:
mention_gdelt_df_cleaned = mention_gdelt_df |>
mutate(
GLOBALEVENTID = as.character(GLOBALEVENTID),
EventTimeDate = format(lubridate::ymd_hms(as.character(EventTimeDate)),'%Y-%m-%d- %H:%M:%S'),
MentionTimeDate = format(lubridate::ymd_hms(as.character(MentionTimeDate)),'%Y-%m-%d- %H:%M:%S'),
) |>
group_by(GLOBALEVENTID) |>
arrange(GLOBALEVENTID, desc(Confidence), desc(MentionType), MentionSourceName) |>
slice_head(n=1) |>
ungroup() |>
filter(MentionType == 1 & Confidence >= 60)
mention_gdelt_df_cleaned = mention_gdelt_df_cleaned |>
mutate(Article = gsub('\\/$||\\/([A-Za-z]|\\d)*$|\\/*\\d*$|\\/([A-Z]|\\d)*\\/$|\\/*-*\\d*(\\.[a-z]+$|\\/$|$)|\\/*\\d*\\/_.*$|\\/\\?.*$', "", MentionIdentifier)) |>
mutate(Article = gsub('-2025-\\d+.*$|-2025\\d+-.*$|\\/\\d*-*p5lyys.*$', "", Article)) |>
mutate(Article = gsub('\\/([a-z]|\\d|article)*_*-*([a-z]|\\d)*-([a-z]|\\d)*-([a-z]|\\d)*-([a-z]|\\d){4}-([a-z]|\\d)*$', "", Article)) |>
mutate(Article = gsub('(https://|http://).*\\..*\\/\\d*-*', '', Article)) |>
mutate(Article = str_replace_all(Article, '-', ' ')) |>
mutate(Article = str_replace(Article, '\\.', '')) |>
mutate(Article = na_if(Article, 'index?more=')) |>
mutate(Article = ifelse(str_count(Article)<=1, NA, Article)) |>
mutate(Article = ifelse(str_count(Article, '\\w+')<=1, NA, Article))
knit_table(head(mention_gdelt_df_cleaned)) |> scroll_box(
width = "100%", box_css = "border: 1px solid #FFFFFF;")
| GLOBALEVENTID | EventTimeDate | MentionTimeDate | MentionType | MentionSourceName | MentionIdentifier | SentenceID | Actor1CharOffset | Actor2CharOffset | ActionCharOffset | InRawText | Confidence | MentionDocLen | MentionDocTone | MentionDocTranslationInfo | Extras | Article |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1173957950 | 2024-05-06- 00:45:00 | 2025-05-06- 00:00:00 | 1 | wwno.org | https://www.wwno.org/local-regional-news/2025-05-05/new-orleans-got-388-million-in-federal-pandemic-aid-was-that-money-well-spent | 9 | -1 | 2207 | 2153 | 1 | 60 | 14002 | 0.4860804 | NA | NA | new orleans got 388 million in federal pandemic aid was that money well spent |
| 1174054437 | 2024-05-06- 13:15:00 | 2025-05-06- 00:00:00 | 1 | marketscreener.com | https://www.marketscreener.com/news/latest/Diana-Ross-Simone-Biles-shine-at-Met-Ball-celebration-of-Black-menswear-49831578/ | 8 | 1767 | -1 | 1796 | 1 | 60 | 2152 | 3.5714286 | NA | NA | Diana Ross Simone Biles shine at Met Ball celebration of Black menswear |
| 1174227361 | 2024-05-07- 07:15:00 | 2025-05-07- 00:00:00 | 1 | bangkokpost.com | https://www.bangkokpost.com/thailand/general/3019210/govt-ready-for-new-talks-on-peace-in-deep-south | 5 | -1 | 823 | 941 | 1 | 60 | 3054 | -4.4680851 | NA | NA | govt ready for new talks on peace in deep south |
| 1174261394 | 2024-05-07- 11:00:00 | 2025-05-07- 00:00:00 | 1 | bangkokpost.com | https://www.bangkokpost.com/thailand/general/3019210/govt-ready-for-new-talks-on-peace-in-deep-south | 5 | 823 | -1 | 978 | 1 | 60 | 3054 | -4.4680851 | NA | NA | govt ready for new talks on peace in deep south |
| 1174400771 | 2024-05-08- 00:15:00 | 2025-05-08- 00:00:00 | 1 | kenw.org | https://www.kenw.org/npr-news/2025-05-07/3-former-memphis-police-officers-found-not-guilty-in-the-death-of-tyre-nichols | 6 | 1055 | -1 | 1037 | 1 | 100 | 5634 | -9.5744681 | NA | NA | former memphis police officers found not guilty in the death of tyre nichols |
| 1174418451 | 2024-05-08- 02:00:00 | 2025-05-08- 00:00:00 | 1 | cbs12.com | https://cbs12.com/news/connect-to-congress/house-gop-continues-to-push-colleges-to-do-more-to-combat-antisemitism-on-campus-haverford-college-depaul-university-california-polytechnic-state-university-house-education-committee | 6 | 2151 | 2158 | 2133 | 1 | 100 | 4985 | -5.7397959 | NA | NA | house gop continues to push colleges to do more to combat antisemitism on campus haverford college depaul university california polytechnic state university house education committee |
Join the Google BigQuery Data with the GDELT Data
gsearch_gdelt <- left_join(
long_google_search_df,
gdelt_df_cleaned,
by=c('refresh_date'='Day'))
Stop Words
Stop words are words that should be excluded from text analysis, such as
“the”, “a”, and “and”. The tidytext package provides stop word lists in
multiple languages; in this analysis, stop words were removed in both
English and Spanish.
# getting stop words for english and spanish
multi_lang_stopwords <- bind_rows(
get_stopwords(language = "en"),
get_stopwords(language = "es")
)
Tokenization
In text mining, tokenization refers to breaking text down into smaller
units. The Google search data already consists of short phrases, so when
we tokenize, the unit of analysis becomes the individual word.
tokenized_gsearch_gdelt <- gsearch_gdelt |>
unnest_tokens(output = word, input = term) |>
anti_join(multi_lang_stopwords, by = "word") |>
relocate(search_id, word)
Frequency Analysis
After tokenizing the text, a frequency analysis is conducted to better
understand which terms appear most often, both at the word level and at
the full Google search level.
most_freq_tokens <- tokenized_gsearch_gdelt |>
count(word, sort = TRUE) |>
slice_max(order_by = n, n = 25) |>
mutate(color_group = factor(row_number() %% 3))
most_freq_searches <- gsearch_gdelt |>
count(term, sort = TRUE) |>
slice_max(order_by = n, n = 25) |>
mutate(color_group = factor(row_number() %% 3))
ggplot(most_freq_tokens, aes(x=reorder(word, -n), y=n, fill = color_group))+
geom_bar(stat="identity")+
scale_fill_manual(values = colors) +
theme_minimal()+
theme(
axis.text.x = element_text(angle = 90, hjust = 1, color = titles),
plot.title = element_text(face = "bold"),
text = element_text(color = titles)
) +
labs(y = "Count",
x = "Tokenized Terms (Words)",
title = "Word Frequency in Google Searches",
subtitle = "Searches are limited to NYC and LA and are taken from the past two weeks")+
guides(fill=FALSE)
ggplot(most_freq_searches, aes(x=reorder(term, -n), y=n, fill = color_group))+
geom_bar(stat="identity")+
scale_fill_manual(values = colors) +
theme_minimal()+
theme(
axis.text.x = element_text(angle = 90, hjust = 1, color = titles),
plot.title = element_text(face = "bold"),
text = element_text(color = titles)
) +
labs(y = "Count",
x = "Google Searches",
title = "Google Searches Frequency",
subtitle = "Searches are limited to NYC and LA and are taken from the past two weeks")+
guides(fill=FALSE)
tokenized_gsearch_gdelt |>
group_by(refresh_date) |>
count(word, sort = TRUE) |>
top_n(5) |>
ungroup() |>
mutate(word = reorder_within(word, n, refresh_date),
date = factor(refresh_date, levels=sort(unique(tokenized_gsearch_gdelt$refresh_date),decreasing=TRUE),ordered = TRUE))|>
ggplot(aes(x = word, y = n, fill = word)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ date, scales = "free_y") +
coord_flip() +
scale_x_reordered() +
scale_y_continuous(expand = c(0,0)) +
theme_minimal()+
labs(y = "Count",
x = "Unique words",
title = "Most frequently googled words in past two weeks",
subtitle = "Stop words removed from the list")
gsearch_gdelt |>
filter(term_type == "top_terms",
refresh_date >= "2025-05-03") |>
group_by(refresh_date) |>
count(term, sort = TRUE) |>
top_n(3) |>
ungroup() |>
mutate(word = reorder_within(term, n, refresh_date),
date = factor(refresh_date, levels=sort(unique(gsearch_gdelt$refresh_date),decreasing=TRUE),ordered = TRUE))|>
ggplot(aes(x = word, y = n, fill = word)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ date, scales = "free_y") +
coord_flip() +
scale_x_reordered() +
scale_y_continuous(expand = c(0,0)) +
theme_minimal()+
labs(y = "Count",
x = "Top Searched Terms",
title = "Top Google Searches in Past 8 Days in LA and NYC",
subtitle = "")
gsearch_gdelt |>
filter(term_type == "rising_terms",
refresh_date >= "2025-05-03") |>
group_by(refresh_date) |>
count(term, sort = TRUE) |>
top_n(3) |>
ungroup() |>
mutate(word = reorder_within(term, n, refresh_date),
date = factor(refresh_date, levels=sort(unique(gsearch_gdelt$refresh_date),decreasing=TRUE),ordered = TRUE))|>
ggplot(aes(x = word, y = n, fill = word)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ date, scales = "free_y") +
coord_flip() +
scale_x_reordered() +
scale_y_continuous(expand = c(0,0)) +
theme_minimal()+
labs(y = "Count",
x = "Top Searched Terms",
title = "Rising Google Searches in Past 8 Days in LA and NYC",
subtitle = "")
Word Stemming
Another technique that can be used in text mining is word stemming.
Stemming reduces each term, or token, to its root form (also known as the
stem). This helps normalize the terms, so that play, playing, played, and
plays are all reduced to the same root.
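As a quick illustration using the example words above, all four forms collapse to a single common stem (the exact stem string depends on the stemming algorithm):
# Demonstrate stemming on standalone words before applying it to our tokens
SnowballC::wordStem(c("play", "playing", "played", "plays"))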
wrd_stem <- tokenized_gsearch_gdelt %>%
mutate(word = SnowballC::wordStem(word))
Document Term Matrix
A document-term matrix represents the frequency of each term across the
collection of documents.
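To make the structure concrete, here is a minimal sketch on a made-up two-document corpus: rows are documents, columns are terms, and each cell counts how often a term appears in a document.
# Toy corpus (illustrative only)
toy_corpus <- tibble(
  doc  = c("d1", "d1", "d1", "d2", "d2"),
  word = c("pope", "leo", "pope", "madrid", "pope")
)
# Count term frequencies per document and cast to a document-term matrix
toy_dtm <- toy_corpus |>
  count(doc, word) |>
  cast_dtm(document = doc, term = word, value = n)
as.matrix(toy_dtm)  # rows = documents, columns = terms, cells = counts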
# creating document term matrix
dtm_tokenized_terms <- wrd_stem |>
count(refresh_date, word) |>
cast_dtm(document = refresh_date, term = word, value = n)
# Convert to reg matrix
mat <- as.matrix(dtm_tokenized_terms)
# Convert to sparse Matrix (dgCMatrix) so can use in lapply below
dtm_mat <- Matrix(mat, sparse = TRUE)
Latent Dirichlet Allocation (LDA)
LDA is a topic modeling method that can be used to identify central
topics and distributions across a corpus.
Corpus: a corpus is a collection of text that is used in natural language processing.
In LDA, the value k defines the number of topics to extract from the corpus. To choose k, we fit models across a range of values and compare fit metrics.
# developing list of potential k values to fit in LDA model
k_list <- seq(2, 20, by = 1)
# testing fit for k values in LDA models
fit_test_k_models <- lapply(k_list, function(k) {
FitLdaModel(dtm = dtm_mat, #using converted matrix
k = k,
iterations = 200,
burnin = 175,
alpha = 0.1,
beta = 0.05,
optimize_alpha = TRUE,
calc_likelihood = TRUE,
calc_coherence = TRUE,
calc_r2 = TRUE)
})
names(fit_test_k_models) <- paste0("k_", k_list)
# Extracting final log-likelihood value
log_likelihood_list <- sapply(fit_test_k_models, function(x) tail(x$log_likelihood, 1)[2])
log_likelihood <- as.numeric(log_likelihood_list)
# generating summary of model perf.
model_summary <- data.frame(
k = k_list,
coherence = sapply(fit_test_k_models, function(x) mean(x$coherence)), # topic coherence
r2 = sapply(fit_test_k_models, function(x) x$r2), # R-squared: how well the model reproduces the observed data
log_likelihood = log_likelihood # log probability of the observed data under the model
)
ggplot(model_summary, aes(x = k)) +
geom_line(aes(y = coherence), color = "blue") +
geom_line(aes(y = r2 * max(coherence)), color = "green", linetype = "dashed") +
geom_line(aes(y = log_likelihood / max(abs(log_likelihood)) * max(coherence)),
color = "red", linetype = "dotted") +
ylab("Normalized Metric (for comparison)") +
ggtitle("LDA Tuning: Coherence, R2, and Log-Likelihood")
# Fit an LDA model (topicmodels), where the number of topics is k = 15
lda_model_k15 <- LDA(dtm_tokenized_terms, k = 15, control = list(seed = 1234))
topics_lda_model_k15 <- tidy(lda_model_k15, matrix = "beta")
# Top terms per topic
top_words_k15 <- topics_lda_model_k15 |>
group_by(topic) |>
top_n(10, beta) |>
ungroup()
top_terms_lda_k15 <- topics_lda_model_k15 |>
group_by(topic) |>
slice_max(beta, n = 3, with_ties = FALSE) |>
ungroup() |>
arrange(topic, -beta)
top_terms_lda_k15 |>
mutate(term = reorder_within(term, beta, topic)) |>
group_by(topic, term) |>
arrange(desc(beta)) |>
ungroup() |>
ggplot(aes(beta, term, fill = as.factor(topic))) +
geom_col(show.legend = FALSE) +
scale_y_reordered() +
labs(title = "Top 3 terms in each LDA topic",
subtitle = "LDA model where k=15",
x = expression(beta), y = NULL) +
facet_wrap(~ topic, ncol = 4, scales = "free")
# Fit an LDA model (topicmodels), where the number of topics is k = 11
lda_model_k11 <- LDA(dtm_tokenized_terms, k = 11, control = list(seed = 1234))
topics_lda_model_k11 <- tidy(lda_model_k11, matrix = "beta")
# Top terms per topic
top_words_k11 <- topics_lda_model_k11 |>
group_by(topic) |>
top_n(10, beta) |>
ungroup()
top_terms_lda_k11 <- topics_lda_model_k11 |>
group_by(topic) |>
slice_max(beta, n = 3, with_ties = FALSE) |>
ungroup() |>
arrange(topic, -beta)
top_terms_lda_k11 |>
mutate(term = reorder_within(term, beta, topic)) |>
group_by(topic, term) |>
arrange(desc(beta)) |>
ungroup() |>
ggplot(aes(beta, term, fill = as.factor(topic))) +
geom_col(show.legend = FALSE) +
scale_y_reordered() +
labs(title = "Top 3 terms in each LDA topic",
subtitle = "LDA model where k=11",
x = expression(beta), y = NULL) +
facet_wrap(~ topic, ncol = 4, scales = "free")
Assess string similarity between the Google top rising search terms and the GDELT event article names extracted from the highest-confidence event mention source URLs within the analysis time frame.
Remove Stop Words
Tokenize both the top rising terms and event article names to allow for
the removal of common stop words. Upon removal, the token words are
joined back together. This process will assist in obtaining a more
accurate similarity score by removing redundant or commonly shared
words.
token_str_gsearch = rising_term_df |>
unnest_tokens(output = clean_term, input = term) |>
anti_join(multi_lang_stopwords, by = c("clean_term" = "word")) |>
group_by(search_id) |>
summarise(clean_term = paste0(clean_term, collapse = ' ')) |>
ungroup() |>
select(search_id, clean_term) |>
distinct() |>
left_join(rising_term_df) |>
select(clean_term, term, refresh_date) |>
distinct()
token_str_gdelt = mention_gdelt_df_cleaned |>
drop_na(Article) |>
unnest_tokens(output = clean_article, input = Article) |>
anti_join(multi_lang_stopwords, by = c("clean_article" = "word")) |>
group_by(GLOBALEVENTID) |>
summarise(clean_article = paste0(clean_article, collapse = ' ')) |>
ungroup() |>
select(GLOBALEVENTID, clean_article) |>
distinct() |>
left_join(mention_gdelt_df_cleaned) |>
select(clean_article, Article, MentionTimeDate) |>
mutate(MentionTimeDate= as.Date(MentionTimeDate)) |>
distinct()
String Similarity Metrics
Levenshtein Distance (edit distance-based
algorithm):
A string metric for measuring the difference between two sequences. The
Levenshtein distance between two words is the minimum number of
single-character edits (insertions, deletions or substitutions) required
to change one word into the other. The more edits required, the less
similar the two strings are.
Jaro–Winkler Distance:
A string metric measuring an edit distance between two sequences. The
distance takes into account the number of matching characters and the
order in which they appear, with a higher weight assigned to matching
characters at the beginning of the strings.
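A minimal illustration of how the two metrics behave on made-up string pairs (both RecordLinkage functions return a similarity between 0 and 1, where 1 means identical):
# Both functions are vectorized over pairs of strings
demo_pairs <- tibble(
  a = c("pope leo", "pope leo", "habeas corpus"),
  b = c("pope leo", "pope lion", "derek carr")
)
demo_pairs |>
  mutate(
    lev_sim = levenshteinSim(a, b),  # normalized edit-distance similarity
    jw_sim  = jarowinkler(a, b)      # prefix-weighted similarity
  )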
Calculate similarity scores
sim_df = inner_join(rising_term_df,
token_str_gdelt,
by=c('refresh_date'='MentionTimeDate')) |>
left_join(token_str_gsearch) |>
select(term, clean_term, clean_article) |>
distinct() |>
mutate(jw_score = jarowinkler(term, clean_article, r=0.4)) |>
mutate(lev_score = levenshteinSim(term, clean_article)) |>
arrange(desc(jw_score), desc(lev_score))
knit_table(head(sim_df))
| term | clean_term | clean_article | jw_score | lev_score |
|---|---|---|---|---|
| pope leo xiv | pope leo xiv | pope leo xiv pick name | 0.9090909 | 0.5454545 |
| pope leo xiii | pope leo xiii | pope leo xiv pick name | 0.8937063 | 0.5454545 |
| india pakistan | india pakistan | india pakistan brink wider conflict | 0.8800000 | 0.4000000 |
| pope leo | pope leo | pope leo predecessors | 0.8761905 | 0.3809524 |
| pope leo xiv | pope leo xiv | pope leo xiv chicago augustinian | 0.8750000 | 0.3750000 |
| pope leo | pope leo | pope leo xiv pick name | 0.8727273 | 0.3636364 |
Display Levenshtein Distance Metric Similarity Results
sim_df |>
filter(lev_score > 0.4) |>
group_by(clean_term) |>
tally() |>
arrange(desc(n)) |>
mutate(color_group = factor(row_number() %% 3)) |>
ggplot(aes(x=reorder(clean_term, n), y=n, fill = color_group)) +
geom_col(show.legend = FALSE) +
coord_flip() +
theme_classic() +
scale_fill_manual(values = colors) +
theme(
plot.title = element_text(face = "bold"),
text = element_text(color = titles)
) +
labs(
title = 'Levenshtein Distance: Rising Terms & Event Article',
subtitle = 'Similarity Score > 0.4',
y = 'Term Count',
x = 'Rising Term'
)
filt_sim_df = sim_df |>
filter(lev_score > 0.4)
sim_df |>
filter(term %in% filt_sim_df$term & clean_article %in% filt_sim_df$clean_article) |>
ggplot(aes(x=term, y=reorder(clean_article, lev_score), fill=lev_score)) +
scale_y_discrete(labels = function(x) str_wrap(str_replace_all(x, "clean_article" , " "),
width = 45)) +
geom_tile() +
scale_fill_gradient(low = "#aac4e9", high = "#1f2c8f")+
theme_classic() +
theme(axis.text.x=element_text(angle=50, hjust=1)) +
theme(
plot.title = element_text(face = "bold"),
text = element_text(color = titles)
) +
labs(
title = 'Levenshtein Distance: Rising Terms & Event Article',
subtitle = 'Similarity Score > 0.4',
y = 'GDELT Event Source Article',
x = 'Rising Term',
fill='Levenshtein Score'
)
Display Jaro–Winkler Distance Metric Similarity Results
sim_df |>
filter(jw_score > 0.85) |>
group_by(clean_term) |>
tally() |>
arrange(desc(n)) |>
mutate(color_group = factor(row_number() %% 3)) |>
ggplot(aes(x=reorder(clean_term, n), y=n, fill = color_group)) +
geom_col(show.legend = FALSE) +
coord_flip() +
theme_classic() +
scale_fill_manual(values = colors) +
theme(
plot.title = element_text(face = "bold"),
text = element_text(color = titles)
) +
labs(
title = 'Jaro-Winkler Distance: Rising Terms & Event Article',
subtitle = 'Similarity Score > 0.85',
y = 'Term Count',
x = 'Rising Term'
)
filt_sim_df = sim_df |>
filter(jw_score >= 0.85)
sim_df |>
filter(term %in% filt_sim_df$term & clean_article %in% filt_sim_df$clean_article) |>
ggplot(aes(x=term, y=reorder(clean_article, jw_score), fill=jw_score)) +
scale_y_discrete(
labels = function(x) str_wrap(str_replace_all(x, "clean_article" , " "),
width = 45)) +
geom_tile() +
scale_fill_gradient(low = "#aac4e9", high = "#1f2c8f")+
theme_classic() +
theme(
plot.title = element_text(face = "bold", hjust=0.5),
plot.subtitle = element_text(hjust=0.5),
text = element_text(color = titles)) +
theme(axis.text.x=element_text(angle=50, hjust=1)) +
labs(
title = 'Jaro-Winkler Distance: Rising Terms & Event Article',
subtitle = 'Similarity Score > 0.85',
y = 'GDELT Event Source Article',
x = 'Rising Term',
fill='Jaro-Winkler Score'
)
Data Join:
Due to the difference in size and complexity of the Google search and
GDELT data, joining the dataframes for analysis proved difficult. For
example, the Google search dataset provides a daily refresh of the top 25
searched terms from the previous day, whereas GDELT publishes a CSV file
every 15 minutes containing hundreds of global events. Additionally, the
GDELT location data was not an exact match to the Google search data:
some observations were recorded at the country, state, or city level,
while others had no geographic area at all. This made it difficult to
compare the data adequately by geographic area.
The string similarity analysis was our solution for comparing reported events with top search terms without implementing a join. Given more time, we would build out logic to extract the GDELT data for exactly the same time frame as the Google search data. This would involve querying the GDELT database directly, rather than downloading CSV files, and weighing the difference between joining on when an event occurred versus when it was mentioned.
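For example, GDELT 2.0 is also mirrored as a BigQuery public dataset, which would let us pull only the events from the same window as the search terms instead of downloading the 15-minute CSV exports. A rough sketch of that approach, assuming the public gdelt-bq.gdeltv2.events table with its published column names (SQLDATE is an integer in YYYYMMDD format):
gdelt_query <- "
SELECT GLOBALEVENTID, SQLDATE, EventCode, QuadClass,
       ActionGeo_FullName, NumMentions, AvgTone, SOURCEURL
FROM `gdelt-bq.gdeltv2.events`
WHERE SQLDATE BETWEEN 20250506 AND 20250511
  AND ActionGeo_CountryCode = 'US'
"
# SQLDATE reflects when the event occurred; DATEADDED could be used instead
# to align with when the event was first reported.
gdelt_events_tbl <- bq_project_query(project_id, gdelt_query)
gdelt_events_df  <- bq_table_download(gdelt_events_tbl)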
Latent Dirichlet Allocation (LDA) Modeling:
The results of the LDA modeling show that topic modeling for Google
searches is possible but would require additional tidying. For example,
certain terms appeared in multiple topic categories, indicating that
duplicate terms may have been fed into the model. LDA can be effective
when the corpus is large enough, but it may not be the best option for
short text. To improve this analysis, another method, such as
single-topic LDA with clustering, may be preferable. For more information
about this modeling approach, please see the stLDA-C GitHub page.
String Similarity:
The results show the Jaro-Winkler similarity scores to be much higher
than the Levenshtein scores. This is because the Jaro-Winkler method
considers only matching characters and any required transpositions, and
weights matches at the beginning of the strings more heavily. Because the
GDELT article names are significantly longer than the rising term
strings, the Jaro-Winkler similarity scores can be misleading.
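A small sketch of that length effect, using a term and article pair that appears in the similarity table above: the Jaro-Winkler score stays high because the shared prefix dominates, while the Levenshtein similarity is dragged down by every extra character in the longer article name.
short_term   <- "pope leo"
long_article <- "pope leo xiv pick name"
jarowinkler(short_term, long_article, r = 0.4)  # high: shared prefix dominates
levenshteinSim(short_term, long_article)        # much lower: length difference penalized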
Next Steps: