knitr::include_graphics("~/Downloads/roswell1947.jpg")

Intro

My project deep dives into the historical and geographical patterns of UFO sightings using data from the National UFO Reporting Center (NUFORC). The dataset, ufo sightings, has over 80,000 recorded events spanning from 1906 to 2014. I wanted to do this topic because it looked cool, and I knew some of my map points would be fun to point out. In analyzing this data we are able to move beyond random peoples sightings and uncover some objective trends: where and when are people most likely to report a sighting, and what shapes are reported most often? The point of my study is to use mapping to explore peoples experience of the unknown, possibly revealing how environmental factors or cultural phenomena influence reporting patterns across time and place.

Cleaning

The dataset has 11 variables initially. The main ones are the city, state, country, and shape (all categorical variables), latitude and longitude (which are quantitative and needed for mapping part), and duration (seconds) (a key quantitative variable). For data cleaning, I did a few things. Mostly I filtered for specific plots when I wanted to show something, such as getting top 8 or 10 shapes, duration under a certain number, and year over a certain number, also mutated, like extracting only year from my datetime variable.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(leaflet)
ufos <- read_csv("~/Downloads/ufo_sightings_scrubbed.csv") |>
    filter(
    !is.na(datetime), #get nas out of my main variables
    !is.na(latitude),
    !is.na(longitude)
  )
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
## Rows: 80332 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (6): city, state, country, shape, duration (hours/min), comments
## dbl  (3): duration (seconds), latitude, longitude
## dttm (1): datetime
## date (1): date posted
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
scp <- ufos |>
  # Group by shape, count, and filter for top 10
  count(shape, sort = TRUE) |>
  top_n(10) |>
  ggplot(aes(x = n, y = reorder(shape, n))) +
  geom_col(fill = "darkviolet") +
  labs(
    title = "Top 10 Most Reported UFO Shapes",
    x = "Number of Sightings",
    y = "Shape"
  ) +
  theme_bw()
## Selecting by n
scp

dp <- ufos |>
  # filter crazy outliers (still some in there)
  filter(`duration (seconds)` < 7500) |>
  # hist
  ggplot(aes(x = `duration (seconds)`)) +
  geom_histogram(binwidth = 100, fill = "darkorange", color = "white") +
  labs(
    title = "Distribution of Short UFO Sighting Durations",
    x = "Duration (Seconds)",
    y = "Frequency (Count)"
  ) +
  theme_bw()
dp

stp <- ufos |>
  # getting only the year
  mutate(year = year(datetime)) |>
  # Group by year and count
  count(year) |>
  filter(year >= 1940) |> # Filter to post 40s because nothing significant before then
  # plot
  ggplot(aes(x = year, y = n)) +
  geom_line(color = "purple", linewidth = 1) +
  geom_point(color = "purple", size = 2) +
  labs(
    title = "Total Number of UFO Sightings Reported by Year",
    x = "Year",
    y = "Total Sightings Count"
  ) +
  theme_bw()
stp

top_shapes <- ufos |>
  count(shape) |>
  top_n(8, n) |>
  pull(shape)
ufot <- ufos |>
  # top 8 shapes only
  filter(shape %in% top_shapes) |>
  # get only year
  mutate(year = year(datetime)) |>
  # Group by year and shape, and count them
  group_by(year, shape) |>
  summarise(count = n(), .groups = 'drop') |>
  # Filter to post 40s because nothing significant before then
  filter(year >= 1940)
ufotp <- ggplot(ufot, aes(x = year, y = count, fill = shape)) +
  
  # stacking
  geom_area(alpha = 0.8, position = "stack") + 
  
  # Label
  labs(
    title = "Historical Trends of Most Reported UFO Shapes (1940-2015)",
    x = "Year of Sighting",
    y = "Number of Sightings",
    fill = "Reported Shape",
    caption = "Data Source: NUFORC (ufo_sightings_scrubbed.csv)"
  ) +
  
  # palette looked cool
  scale_fill_viridis_d(option = "plasma") + 
  
  # theme
  theme_minimal() + 
  
  # Add a vertical line for a major cultural event (1947 Roswell)
  geom_vline(xintercept = 1947, linetype = "dashed", color = "black", linewidth = 0.8) +
  
  # Annotation
  annotate("text", x = 1947, y = max(ufot$count), 
           label = "Roswell 1947", vjust = 1, hjust = -0.1, color = "black", size = 3.5)
ufotp

My stacked area chart which focused on the top eight most-reported shapes, visualizes the evolution of the UFO phenomenon from 1940 onward. It clearly shows two patterns. First, there is a massive increase in the total number of reports, peaking in the late 90s and early 00s. Second, the popularity of reported shapes has changed dramatically. Reports of disks and circles, which are the classic flying saucer shapes, dominate the early decades, mostly in the 1950s. And light is suddenly popular after the 1990s, becoming the single largest category reported, which could be reflecting the new technology of like drones, satellites, or flares, which are kinda hard to classify as objects. A key pattern of surprise is how clearly the “triangle” shape sees its own distinct, sudden rise in the 1990s, it coincides with military planes like the stealth bomber.In my project I would have liked to include a vertical axis showing the proportions of each shape (instead of raw counts) to more clearly show shifts in types reported, but the current chart format does a decent job showing each one.

ufo_leaflet_map <- leaflet(data = ufos) |>
  
  # map
  addProviderTiles(providers$CartoDB.Positron, group = "Light Map") |>
  addProviderTiles(providers$Esri.WorldImagery, group = "Satellite") |>

  # clusters
  addMarkers(
    lng = ~longitude,
    lat = ~latitude,
    
    # popup
    popup = ~paste(
      "<b>City:</b>", city, "<br>",
      "<b>State:</b>", state, "<br>",
      "<b>Shape:</b>", shape, "<br>",
      "<b>Duration:</b>", `duration (hours/min)`
    ),
    
    clusterOptions = markerClusterOptions()
  ) |>
  
  # layer options
  addLayersControl(
    baseGroups = c("Light Map", "Satellite"),
    options = layersControlOptions(collapsed = FALSE)
  )
ufo_leaflet_map

My Leaflet map is a spatial representation of the over 80,000 UFO sightings, showing where these phenomena are most frequently reported globally. The most obvious pattern is the concentration of reports within the United States, followed somewhat by pockets of activity in the United Kingdom, Canada, and Australia. This high concentration in the states suggests less about extraterrestrial activity and more about culture, such as the high visibility of the NUFORC reporting center, a lot of English speakers, and a cultural history full of UFO lore (Roswell). A weird surprise is the almost complete absence of reports from big countries like China and Russia, which points to a likely reporting bias tied to internet access, language barriers, and state media control rather than a lack of aerial sightings.