Lab 3 - Scraping and Collecting Data with R from Google Trends or NewsAPI and Conducting Exploratory Factor Analysis (EFA)

Introduction

To inform the timing of marketing spend, this study maps when American consumers begin preparing for Halloween by tracking web search interest in “Halloween costume.” By comparing average search interest by month, we identify when public curiosity begins to build, establishing a data-backed timeline for deploying promotional campaigns well ahead of the holiday rush.

Our source data was pulled into R with the gtrendsR package. Crucially, the dataset tracks relative popularity rather than raw search counts: it uses a 0-to-100 index on which a score of 100 marks the peak of search interest within the region over the requested timeframe.
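The idea of a relative index can be illustrated with a minimal sketch. The weekly counts below are hypothetical (Google does not publish raw volumes), and the rescaling is a simplification of Google's actual method, which also involves sampling and adjusting for total search activity. The helper name `to_index` is an assumption made for this example.

```r
# Hypothetical weekly search counts; not real Google data.
raw_small <- c(120, 360, 960, 2400, 240)
raw_large <- raw_small * 10  # same seasonal shape, ten times the volume

# Simplified rescaling: the busiest week becomes 100, all others are relative to it.
to_index <- function(x) round(100 * x / max(x))

index_small <- to_index(raw_small)
index_large <- to_index(raw_large)

print(index_small)
# Both series map to the same index, so the index alone cannot reveal volume.
print(identical(index_small, index_large))
```

Because both series produce the same index values, the score describes when interest rises and falls, not how many searches took place.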

packages <- c(
  "gtrendsR", "tidyverse", "lubridate", "openxlsx", "scales", "readxl"
)

installed <- rownames(installed.packages())
for (p in packages) {
  if (!(p %in% installed)) install.packages(p)
}

library(gtrendsR)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.1     ✔ readr     2.2.0
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.3     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(lubridate)
library(openxlsx)
library(scales)

## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor

keyword <- "Halloween costume"
cache_file <- "Halloween_costume_google_trends_data.xlsx"

if (file.exists(cache_file)) {
  interest <- readxl::read_excel(cache_file)
} else {
  trend_raw <- gtrends(
    keyword = keyword,
    geo = "US",
    time = "today+5-y",
    gprop = "web",
    onlyInterest = TRUE
  )

  # gtrends() returns a list; the weekly series is in 'interest_over_time'
  interest <- trend_raw$interest_over_time %>%
    as_tibble() %>%
    mutate(
      date = as.Date(date),
      # Google reports very low values as the string "<1"; recode those as 0.5,
      # working on character values throughout before converting to numeric
      hits = as.numeric(ifelse(as.character(hits) == "<1", "0.5", as.character(hits))),
      month = month(date, label = TRUE, abbr = FALSE),
      month_num = month(date),
      year = year(date)
    ) %>%
    # Filter for the last 3 years
    filter(date >= Sys.Date() - years(3)) %>%
    select(date, year, month_num, month, keyword, hits, geo, time)

  # Write to Excel cache
  openxlsx::write.xlsx(
    interest,
    file = cache_file,
    overwrite = TRUE
  )
}

head(interest)

## # A tibble: 6 × 8
##   date                 year month_num month keyword            hits geo   time  
##   <dttm>              <dbl>     <dbl> <chr> <chr>             <dbl> <chr> <chr> 
## 1 2023-06-25 00:00:00  2023         6 June  Halloween costume     2 US    today…
## 2 2023-07-02 00:00:00  2023         7 July  Halloween costume     3 US    today…
## 3 2023-07-09 00:00:00  2023         7 July  Halloween costume     3 US    today…
## 4 2023-07-16 00:00:00  2023         7 July  Halloween costume     3 US    today…
## 5 2023-07-23 00:00:00  2023         7 July  Halloween costume     4 US    today…
## 6 2023-07-30 00:00:00  2023         7 July  Halloween costume     5 US    today…

# Re-save the data used in this report (overwrites the cache file)
write.xlsx(
  interest,
  file = cache_file,
  overwrite = TRUE
)

# read_excel() returns date-times when the cache is used, so coerce to Date
# before plotting; scale_x_date() only accepts Date values
ggplot(interest, aes(x = as.Date(date), y = hits)) +
  geom_line(linewidth = 0.8) +
  geom_point(size = 1.6) +
  scale_x_date(date_breaks = "3 months", date_labels = "%b %Y") +
  labs(
    title = "Google Trends Interest for Halloween Costumes in the U.S.",
    subtitle = "Three-year web search trend",
    x = "Date",
    y = "Search interest, normalized 0–100"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

monthly_avg <- interest %>%
  group_by(month_num, month) %>%
  summarize(avg_interest = mean(hits, na.rm = TRUE), .groups = "drop") %>%
  arrange(month_num)

monthly_avg

## # A tibble: 12 × 3
##    month_num month     avg_interest
##        <dbl> <chr>            <dbl>
##  1         1 January           1   
##  2         2 February          1.5 
##  3         3 March             1.2 
##  4         4 April             1.58
##  5         5 May               1.62
##  6         6 June              2.14
##  7         7 July              4.08
##  8         8 August           10.3 
##  9         9 September        26   
## 10        10 October          61.5 
## 11        11 November          3.38
## 12        12 December          1

ggplot(monthly_avg, aes(x = reorder(month, month_num), y = avg_interest)) +
  # Change fill (bar color) and color (border color) here
  geom_col(fill = "steelblue", color = "white") + 
  labs(
    title = "Average Google Trends Interest by Month",
    subtitle = "Keyword: Halloween costume",
    x = "Month",
    y = "Average search interest"
  ) +
  theme_minimal(base_size = 12) + # Sets a clean baseline text size
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    panel.grid.major.x = element_blank(), # Removes vertical grid lines for a cleaner look
    plot.title = element_text(face = "bold")
  )

peak_month <- monthly_avg %>%
  filter(avg_interest == max(avg_interest, na.rm = TRUE))

# Print a short summary of the peak month to the console
cat(paste0(
  "📊 Peak Search Month Summary\n",
  "----------------------------\n",
  "Highest Traffic Month: ", peak_month$month, "\n",
  "Average Search Interest Score: ", round(peak_month$avg_interest, 2), "\n"
))

## 📊 Peak Search Month Summary
## ----------------------------
## Highest Traffic Month: October
## Average Search Interest Score: 61.54

Findings

The data show a highly concentrated, predictable seasonal spike for the phrase “Halloween costume.” From January through June, average search interest stays near the floor of the index (1.0 to 2.1). Momentum builds through July (4.1) and August (10.3) and accelerates in September (26.0). Interest peaks in October with a monthly average of 61.5. That figure is an average of weekly values, so the peak week itself scores higher than 61.5. Interest then drops off a cliff once the holiday has passed, falling to 3.4 in November.
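The timing described here can be checked with a short sketch built on the monthly averages reported above. The rule used to mark the start of the season (the first month above twice the January-to-June baseline) is an assumption chosen for illustration, not a standard threshold, and the shares are shares of the summed monthly averages rather than of actual search volume.

```r
# Monthly averages copied from the table in this report.
monthly_avg <- data.frame(
  month = month.name,
  avg_interest = c(1, 1.5, 1.2, 1.58, 1.62, 2.14, 4.08, 10.3, 26, 61.5, 3.38, 1)
)

# Assumption: January-June is the off-season baseline, and the season "starts"
# in the first month whose average exceeds twice that baseline.
baseline <- mean(monthly_avg$avg_interest[1:6])
onset_month <- monthly_avg$month[which(monthly_avg$avg_interest > 2 * baseline)[1]]

# Each month's share of the summed monthly averages.
monthly_avg$share <- monthly_avg$avg_interest / sum(monthly_avg$avg_interest)

# Ratio of each month's average to the previous month's (NA for January).
monthly_avg$ratio_vs_prev <- c(
  NA,
  monthly_avg$avg_interest[-1] / monthly_avg$avg_interest[-12]
)

print(onset_month)
print(monthly_avg)
```

Under this rule the season begins in July, and October alone accounts for a little over half of the summed monthly averages. A stricter or looser threshold would shift the onset month, so the rule should be treated as a tunable choice.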

Campaign Recommendations

To maximize efficiency, marketing budgets should be front-loaded. Brands should soft-launch awareness ads and website updates in mid-July to capture early-bird planners at a low cost. The heaviest ad spend should deploy in September, when average interest (26.0) is roughly two and a half times the August level (10.3). By October, the strategy should pivot to high-conversion retargeting and urgency messaging, scaling down spending in the days before October 31 to avoid wasteful, non-converting traffic. This is similar to the strategy used by Spirit Halloween.

Data Limitations

While useful, this data relies on Google’s normalized 0–100 index, meaning it shows the timing of market interest but hides the actual volume of searches. Additionally, looking at a single broad phrase obscures granular shifts like regional demand or specific trending costume categories. Finally, rising search interest in the late summer may simply reflect casual brainstorming or DIY research rather than active buying intent.
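One way to probe the regional and category gaps is to request the full result set from gtrendsR rather than interest over time only. The sketch below is not part of the original analysis: `fetch_breakdowns` is a hypothetical helper name, and it assumes the list returned by `gtrends()` includes the `interest_by_region` and `related_queries` elements when `onlyInterest = FALSE`.

```r
# Hypothetical helper: pulls regional and related-query breakdowns for a keyword.
fetch_breakdowns <- function(keyword, geo = "US", time = "today+5-y") {
  res <- gtrendsR::gtrends(
    keyword = keyword,
    geo = geo,
    time = time,
    gprop = "web",
    onlyInterest = FALSE  # request the breakdowns, not just the time series
  )
  list(
    by_region = res$interest_by_region,     # relative interest by state for geo = "US"
    related_queries = res$related_queries   # related search queries for the keyword
  )
}

# Example call (requires internet access and may be rate-limited by Google):
# breakdowns <- fetch_breakdowns("Halloween costume")
# head(breakdowns$by_region)
```

These breakdowns are also relative indexes, so they would show where and for which queries interest concentrates, but still not absolute search volume.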

References

Domenech, J., & Blazquez, D. (2021). Is Google Trends a quality data source? Applied Economics Letters, 29(1), 1–6. https://doi.org/10.1080/13504851.2021.2023088
Google News Initiative. (2025). Understanding Google Trends Data. Google Training Center. https://newsinitiative.withgoogle.com