Have you ever seen a UFO? A short analysis of the UFO sightings dataset

Author: Márton Nagy

Course: Coding 3 at Central European University

Target grade: A

Dataset description

The UFO sightings dataset has been featured on Tidy Tuesday on the 25th week of 2023. The dataset comes from the National UFO Reporting Centre (NUFORC), and has been enriched with data from https://sunrise-sunset.org/.

The dataset contains three tables:

the ufo_sightings table contains data (like shape, duration, date and time, location) on every UFO sighting reported by NUFORC;
the placestable contains data on the sighting locations (like timezone, elevation, population, coordinates);
the day_parts_map table contains the beginning of certain day parts (like sunrise, twilight) for every coordinates and date - this table has been used by the Tidy Tuesday team to enrich the ufo_sightings table with the day_part variable, denoting which time of the day the sighting took place (thus this table is not used in my analysis).

The dataset contains 96,429 sightings (most of which are from the US) and 14,417 places.

The data can be imported into an R-session with the following code:

library(tidytuesdayR)
raw_data <- tidytuesdayR::tt_load('2023-06-20')

## ---- Compiling #TidyTuesday Information for 2023-06-20 ----
## --- There are 3 files available ---
## 
## 
## ── Downloading files ───────────────────────────────────────────────────────────
## 
##   1 of 3: "ufo_sightings.csv"
##   2 of 3: "places.csv"
##   3 of 3: "day_parts_map.csv"

library(data.table)
ufo_sightings <- data.table(raw_data$`ufo_sightings`)
places <- data.table(raw_data$`places`)
day_parts_map <- data.table(raw_data$`day_parts_map`)

rm(raw_data)

Data table exercises

Before doing some visualizations, I perform some analysis using data table operations. The goal of this is to get an idea about what is going on in the dataset. For the most simple tasks, only the code and the output is provided. For more complex tasks, I also provide a short explanation.

1. How many UFO sightings were there in total?

ufo_sightings[, .N]

## [1] 96429

2. Number of sightings per country (top 10)

First I count the number of rows, and group by country codes. Then, I order by the number of sightings descending, and select the top 10.

ufo_sightings[, .N, by = country_code][order(-N)][1:10]

##     country_code     N
##           <char> <int>
##  1:           US 88213
##  2:           CA  3514
##  3:           GB  1818
##  4:           AU   602
##  5:           IN   240
##  6:           MX   143
##  7:           NZ   123
##  8:           ZA   122
##  9:           NL    98
## 10:           DE    90

3. Average duration of sightings during the night

Filtering for night day par, and calculating a simple average, and rounding it to zero decimals.

ufo_sightings[day_part == 'night', .(avg_duration_sec = round(mean(duration_seconds)))]

##    avg_duration_sec
##               <num>
## 1:            12941

4. Average duration and number of sightings in the US, by states

Filtering for US, grouping by state, calculating average duration and number of observations, then ordering descending by average duration.

ufo_sightings[country_code == 'US', .(avg_duration_sec = round(mean(duration_seconds)),
                                      n_obs = .N), 
              by = state][order(-avg_duration_sec)]

##      state avg_duration_sec n_obs
##     <char>            <num> <int>
##  1:     WY            34201   236
##  2:     NM            17723  1045
##  3:     MN            17469  1434
##  4:     TN            15544  1518
##  5:     MT            14907   660
##  6:     OK            14533  1034
##  7:     SD            13821   244
##  8:     NV            13697  1123
##  9:     MA            13019  1861
## 10:     UT            12942  1051
## 11:     AL            12863   908
## 12:     AZ            12740  3448
## 13:     OR            12737  2521
## 14:     NE            12477   467
## 15:     AK            12373   421
## 16:     ME            12314   749
## 17:     NC            12179  2517
## 18:     NH            12125   795
## 19:     MI            11942  2295
## 20:     TX            11640  4170
## 21:     NJ            11565  1735
## 22:     AR            11137   742
## 23:     LA            11043   756
## 24:     FL            10950  5833
## 25:     GA            10913  1868
## 26:     CT            10852  1190
## 27:     WV            10595   520
## 28:     IN            10554  1667
## 29:     MO            10477  1802
## 30:     CO            10439  2193
## 31:     MD            10295  1221
## 32:     SC            10113  1528
## 33:     WA            10072  5055
## 34:     OH            10061  2954
## 35:     CA             9980 11472
## 36:     PA             9945  3181
## 37:     MS             9770   483
## 38:     KS             9303   808
## 39:     HI             9190   345
## 40:     KY             9082  1056
## 41:     VA             8994  1766
## 42:     ND             8836   164
## 43:     NY             8830  3854
## 44:     WI             8583  1683
## 45:     ID             8567   904
## 46:     VT             8017   284
## 47:     DE             7484   275
## 48:     IA             7311   856
## 49:     IL             7183  2995
## 50:     RI             6383   423
## 51:     DC             2097    98
## 52:     Fl              218     5
##      state avg_duration_sec n_obs

5. Most seen shape in Canada per states

Filtering for Canada and excluding missing or unknown shapes, the counting the number of observations by states and shapes. Then, I calculate the maximum number of observations by state, and order descending by this. This way, only the most frequent shape per state is shown.

ufo_sightings[country_code == 'CA' & !is.na(shape) & shape != 'unknown',
              .N, by = .(state, shape)][, .SD[which.max(N)], by = state][order(-N)]

##      state    shape     N
##     <char>   <char> <int>
##  1:     ON    light   296
##  2:     BC    light   189
##  3:     AB    light   102
##  4:     QC    light    38
##  5:     SK    light    32
##  6:     MB    light    32
##  7:     NS    light    28
##  8:     NB    light    14
##  9:     NL    light     3
## 10:     PE    light     2
## 11:     YT    light     2
## 12:     NU changing     1
## 13:     NT changing     1

6. Average duration and number of sightings in Europe by year and month (from 2020)

First, I have to merge the ufo_sightings and the places tables. The key is a compound one (as city names are only unique within each country and state). I filter the merged dataset to years above 2019 and I extract the continent from the timezone denomination using grepl. With this, I can calculate the average duration and the number of sightings per year-month pairs. I order the output from most recent to least recent.

(
  merge(ufo_sightings, places, by = c('city', 'state', 'country_code'))
  [year(reported_date_time) > 2019 & grepl('Europe',timezone),
    .(avg_duration_sec = round(mean(duration_seconds)), n_obs = .N),
    by = .(year(reported_date_time), month(reported_date_time))][order(-year, -month)]
)

##      year month avg_duration_sec n_obs
##     <int> <int>            <num> <int>
##  1:  2023     4              600     1
##  2:  2023     3              260     3
##  3:  2023     2              256     9
##  4:  2023     1              752     2
##  5:  2022    12              118     7
##  6:  2022    11               96     5
##  7:  2022    10            32822     8
##  8:  2022     9              300     7
##  9:  2022     8              670    13
## 10:  2022     7              264    13
## 11:  2022     6              344    11
## 12:  2022     5            34741     5
## 13:  2022     4              200     7
## 14:  2022     3              235     6
## 15:  2022     2             8748     5
## 16:  2022     1              130     5
## 17:  2021    12              180     4
## 18:  2021    11               70     2
## 19:  2021    10               10     1
## 20:  2021     9              360     1
## 21:  2021     8              135     2
## 22:  2021     7              120     1
## 23:  2021     6              450     2
## 24:  2021     3               54     4
## 25:  2021     2               52     3
## 26:  2020    12               95     2
## 27:  2020    11              210     2
## 28:  2020    10              226     4
## 29:  2020     9              755     4
## 30:  2020     8               64    11
## 31:  2020     7              350     5
## 32:  2020     6            77977     5
## 33:  2020     5              273    10
## 34:  2020     4            22189    12
## 35:  2020     3              600     8
## 36:  2020     2              521     3
## 37:  2020     1              434     5
##      year month avg_duration_sec n_obs

7. US cities with the highest number of sightings per capita (and total sightings)

I again have to join the two tables mentioned above. I filter for the US and non-zero population. I then calculate the sightings per capita by state and city. Note that because of spelling differences, I aggregate by the uppercase city name. I then order descending by sightings per capita and select the top 10.

ufo_sightings[, city_upper := toupper(city)]
unique(
  merge(ufo_sightings, places, by = c('city', 'state', 'country_code'))
  [country_code == 'US' & population != 0, .(sightings_pc = .N / population, .N), by = .(state, city_upper)][order(-sightings_pc)]
)[1:10]

##      state       city_upper sightings_pc     N
##     <char>           <char>        <num> <int>
##  1:     NC     HOLDEN BEACH   0.04595880    29
##  2:     VA         YORKTOWN   0.04102564     8
##  3:     NC OCEAN ISLE BEACH   0.03350084    20
##  4:     WA       TROUT LAKE   0.03231598    18
##  5:     FL          CAPTIVA   0.02915952    17
##  6:     KY       BOONEVILLE   0.02631579     2
##  7:     MD          BEL AIR   0.02146264    27
##  8:     TX       MONTGOMERY   0.01736973    14
##  9:     NE           ARTHUR   0.01709402     2
## 10:     VA         NEW KENT   0.01673640     4

8. Mean duration of sightings conditional on hour of the day

ufo_sightings[, .(avg_duration_sec = mean(duration_seconds)), by = hour(reported_date_time)][order(hour)]

##      hour avg_duration_sec
##     <int>            <num>
##  1:     0         9504.247
##  2:     1         8896.202
##  3:     2        10122.244
##  4:     3       191648.466
##  5:     4        11434.445
##  6:     5        12699.113
##  7:     6        15473.006
##  8:     7        17734.120
##  9:     8        17331.038
## 10:     9        14825.502
## 11:    10        12448.331
## 12:    11        12720.066
## 13:    12         9251.014
## 14:    13         9334.447
## 15:    14        14436.854
## 16:    15        10392.217
## 17:    16         9095.173
## 18:    17        10424.567
## 19:    18         9481.491
## 20:    19        10736.873
## 21:    20         9936.239
## 22:    21         7498.419
## 23:    22         8711.597
## 24:    23         8608.890
##      hour avg_duration_sec

9. Sightings (count and % of country total) by day part in the US and Canada

First I count the number of observations by country and day part. I then augment this by the percentage of country totals field.

(
  ufo_sightings[country_code %in% c('US', 'CA') & !is.na(day_part),
              .N, by = .(country_code, day_part)]
  [, pctg := round(100 * N / sum(N), 2), by = country_code]
  [order(day_part, -country_code)]
)

##     country_code          day_part     N  pctg
##           <char>            <char> <int> <num>
##  1:           US         afternoon 11543 13.32
##  2:           CA         afternoon   428 13.42
##  3:           US astronomical dawn  1596  1.84
##  4:           CA astronomical dawn    62  1.94
##  5:           US astronomical dusk  9604 11.08
##  6:           CA astronomical dusk   415 13.01
##  7:           US        civil dawn   617  0.71
##  8:           CA        civil dawn    20  0.63
##  9:           US        civil dusk  3230  3.73
## 10:           CA        civil dusk   111  3.48
## 11:           US           morning  6939  8.01
## 12:           CA           morning   243  7.62
## 13:           US     nautical dawn  1220  1.41
## 14:           CA     nautical dawn    48  1.51
## 15:           US     nautical dusk  7126  8.22
## 16:           CA     nautical dusk   275  8.62
## 17:           US             night 44777 51.67
## 18:           CA             night  1587 49.76

10. Number and average duration of sightings by elevation category

First I have to use the cut function to categorize elevation values. Then I merge the ufo_sightings and places tables, and count the number of observations and average duration of sightings by each category.

places[, elevation_cat := cut(elevation_m,
                              breaks = c(-100, 200, 500, 1500, Inf),
                              labels = c('plains', 'hill', 'mountain', 'high mountain'))]
(
  merge(ufo_sightings, places, by = c('city', 'state', 'country_code'))
  [!is.na(elevation_cat), .(n_obs = .N, avg_duration_sec = round(mean(duration_seconds))), by = elevation_cat]
)

##    elevation_cat n_obs avg_duration_sec
##           <fctr> <int>            <num>
## 1: high mountain  3500            15947
## 2:        plains 51259            10148
## 3:          hill 25639            11380
## 4:      mountain  8282            12744

Data visualization exercises

Now we can move on to some visualization exercises. I below present 8 visualizations and I also provide a brief explanation for the code, as well as some insights for each chart.

1. Number of sightings by states in the US

The first plot is map of US states, colored by the number of UFO sightings. As my dataset had state names as two-letter codes, whereas the map data had full names, I had to create a lookup table to connect these two. Having these, I could draw up a US map, colored by the number of sightings. I used a color brewer scale for the coloring.

The map tells us that most sightings happened in California. There were also many sightings in Florida, Texas and Washington. For the other states, the distribution of sightings is relatively uniform.

library(ggplot2)
library(maps)
library(dplyr)

## 
## Kapcsolódás csomaghoz: 'dplyr'

## The following objects are masked from 'package:data.table':
## 
##     between, first, last

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# Look-up table to match state names and codes
state_lookup <- data.frame(
  state_code = state.abb,
  state_name = tolower(state.name)
)

# Aggregate sightings by state and map to full names
state_sightings <- ufo_sightings[country_code == 'US', .N, by = state]
state_sightings <- state_sightings %>% left_join(state_lookup, by = c("state" = "state_code"))

# US states map data (uses full state names)
us_states <- map_data("state")

# Merge UFO data with map data
map_data <- us_states %>% left_join(state_sightings, by = c("region" = "state_name"))

# Plotting
ggplot(map_data, aes(long, lat, group = group, fill = N)) +
  geom_polygon(color = "black", size = 0.2) +
  scale_fill_distiller(name = 'No. of sightings', type = 'seq', palette = 'RdPu', direction = 1, na.value = 'grey90') +
  labs(title = "No. of UFO sightings by state in the US") +
  theme_void()

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

rm(us_states, map_data, state_lookup, state_sightings)

2. Distribution of sighting duration by elevation category

The second chart uncovers whether there is a difference in sighting duration distribution by elevation category. I used overlayed density plots with a logarithmic scale for the duration. I used the BuPu palette for coloring.

Generally speaking, the chart tells us that the lower the elevation, the flatter the distribution. However, the distributions look more or less similar with the two modes roughly at the same place.

ggplot(merge(ufo_sightings, places, by = c('city', 'state', 'country_code'))[!is.na(duration_seconds) & !is.na(elevation_cat)],
       aes(x = duration_seconds, color = elevation_cat, fill = elevation_cat)) +
  geom_density(alpha = 0.25) +
  scale_x_log10() +
  theme_classic() +
  scale_color_brewer(name = 'Elevation category', type = 'seq', palette = 'BuPu', aesthetics = c('color', 'fill')) +
  xlab('Duration in seconds (logarithmic scale)') +
  ylab('Density')+
  ggtitle('Distribution of sighting duration by elevation category') +
  theme(legend.position = 'top', plot.title = element_text(hjust = 0.5))

## Warning in scale_x_log10(): log-10 transformation introduced infinite values.

## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_density()`).

3. Distribution of sighting duration in the top 10 countries by number of sightings

For the third chart, I opted for combined violin and boxplots. The duration is again at a log scale. The key technical issue for this chart was to select the top 10 countries with the most sightings and only show the plots for these. I achieved this by merging the ufo_sightings dataset with a count by country table for the top 10 countries by observation count.

From the plot we can see that the distributions per country are roughly the same, with heavy outliers mainly present in the US and Canada.

ggplot(merge(ufo_sightings, ufo_sightings[, .N, by = country_code][order(-N)][1:10]),
       aes(y = duration_seconds, x = reorder(country_code, -N))) +
  geom_violin(fill = "skyblue", alpha = 0.5) +
  geom_boxplot(width = 0.1, fill = "blue", color = "black", outlier.size = 0.5, alpha = 0.7) +
  scale_y_log10() +
  theme_light() +
  ylab('Duration in seconds (logarithmic scale)') +
  xlab('Country (ordered by no. of sightings)') +
  ggtitle('Distribution of sighting duration by country', subtitle = '(Only TOP10 countries by no. of sightings shown)') +
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

## Warning in scale_y_log10(): log-10 transformation introduced infinite values.
## log-10 transformation introduced infinite values.

## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_ydensity()`).

## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

4. Average sighting duration and number of sightings by month (2013-2023)

In this plot I show two time series on one plot: the average duration and the number of sightings. For this, I created a helper table with the required aggregations, which I have augmented with a year-month column for axis labeling. I included the average duration as minutes to have the two axis at roughly the same scale.

We can see from the chart that the number of sightings is generally declining with time, while the average duration seems rather stationary with no trend.

monthly_stats <- ufo_sightings[,.(avg_duration = mean(duration_seconds, na.rm = TRUE), sightings = .N),
                               by = .(year = year(reported_date_time), month = month(reported_date_time))][order(year, month)][year > 2012]
monthly_stats[, year_month := sprintf("%d-%02d", year, month)]

ggplot(monthly_stats, aes(x = year_month)) +
  geom_bar(aes(y = sightings), stat = "identity", fill = "skyblue", alpha = 0.7) +
  geom_line(aes(y = avg_duration / 60), 
            color = "navy", size = 1.2, group = 1) +
  scale_y_continuous(name = "Number of sightings",
                     sec.axis = sec_axis(~ ., name = "Avg. duration in minutes",)) +
  scale_x_discrete(breaks = monthly_stats$year_month[seq(1, nrow(monthly_stats), by = 6)]) +
  labs(title = "No. of sightings and average sighting duration per year and month",
       subtitle = '(2013-2023)',
       x = "Year and month",
       y = "Number of sightings") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5),
        axis.title.y.right = element_text(color = "navy"),
        axis.text.x = element_text(angle = 45, hjust = 1))

rm(monthly_stats)

5. Hourly distribution of no. of sightings, by month

This plot is a faceted histogram by month. To create this, I augmented my dataset with month and hour variables.

We can see that regardless of the month, the number of sightings always peaks during nighttime. We can also see that the peaks tend get larger and larger (more number of sightings) during the summer months.

ufo_sightings[, hour := hour(reported_date_time)]
ufo_sightings[, month := month(reported_date_time)]

ggplot(ufo_sightings[, .N, by = .(month, hour)], aes(x = hour, y = N)) +
  geom_col(color = 'skyblue', fill = 'skyblue') +
  facet_wrap(~month) +
  theme_minimal() +
  labs(title = 'Hourly distribution of no. of sightings by month',
       x = 'Hour of the day',
       y = 'No. of sightings') +
  theme(plot.title = element_text(hjust = 0.5))

6. Marking places in the world where there was a UFO sighting

This plot is again a map, with an overlayed scatterplot. Each dot represents a city where at least one sighting took place, with the coloring indicating the the total number of sightings. The color scheme was created by color brewer.

We can see that indeed, most sightings took place in the US, though there are many observations in European countries as well. We can also see that practically no sightings took place in less developed countries, with India being an exception for this.

world <- map_data('world')
ggplot() +
  geom_map(data = world, map = world, aes(long, lat, map_id = region), color = 'black', fill = 'lightgrey') +
  geom_point(data = merge(ufo_sightings, places, by = c('city', 'state', 'country_code'))[, .N, by = .(latitude, longitude)],
             aes(y = latitude, x = longitude, fill = N, color = N), alpha = 0.2) +
  theme_void() +
  theme(legend.position = 'bottom',
        plot.title = element_text(hjust = 0.5)) +
  scale_fill_distiller(name = 'No. of sightings', type = 'seq', palette = 'Reds', direction = 1, aesthetics = c('color', 'fill')) +
  ggtitle('Geographical distribution of UFO sightings')

## Warning in geom_map(data = world, map = world, aes(long, lat, map_id = region),
## : Ignoring unknown aesthetics: x and y

rm(world)

7. Population vs. no. of sightings

The last two plots are very simple regression plots. The first one shows the pattern between the population and the number of sightings in a log-log set-up. I also figured that for aggregations by cities, a nice workaround for the different spellings is to aggregate by the coordinates instead of the city names.

As expected, the chart shows a clear positive pattern of association between the two variables. This means that generally, more sightings take place in more populous cities.

ggplot(merge(ufo_sightings, places, by = c('city', 'state', 'country_code'))[, .(pop = mean(population), .N), by = .(latitude, longitude)],
       aes(x = pop, y = N)) +
  geom_point(color = 'grey', alpha = 0.1, size = 1) +
  geom_smooth(method = 'lm') +
  scale_x_log10() +
  scale_y_log10() +
  labs(title = 'Log-log regression of number of sightings on population',
       y = 'Number of sightings (logarithmic scale)',
       x = 'Population (logarithmic scale)') +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

## Warning in scale_x_log10(): log-10 transformation introduced infinite values.
## log-10 transformation introduced infinite values.

## `geom_smooth()` using formula = 'y ~ x'

## Warning: Removed 39 rows containing non-finite outside the scale range
## (`stat_smooth()`).

8. Elevation vs. number of sightings in the US

The second regression plot shows the relationship between the elevation and the number of sightings in the US with a loess regression. The number of sightings is presented on a logarithmic scale.

The plot suggests that there is practically no clear relationship between these two variables.

ggplot(merge(ufo_sightings, places, by = c('city', 'state', 'country_code'))[country_code == 'US', .(elev = mean(elevation_m), .N), by = .(country_code, latitude, longitude)],
       aes(x = elev, y = N)) +
  geom_point(color = 'grey', alpha = 0.1, size = 1) +
  geom_smooth() +
  scale_y_log10() +
  labs(title = 'Loess regression of log number of sightings on elevation in the US',
       y = 'Number of sightings (logarithmic scale)',
       x = 'Elevation (meter)') +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

## Warning: Removed 26 rows containing non-finite outside the scale range
## (`stat_smooth()`).

## Warning: Removed 26 rows containing missing values or values outside the scale range
## (`geom_point()`).

Conclusion

Above, I have presented a short, introductory analysis on the UFO sightings dataset. I have used data table operations and data visualization techniques to extract some insights from the data. The most important of these were:

The vast majority of UFO sightings takes place in the US;
The number of sightings declined in the past decade, while the average duration of sightings stayed relatively the same;
There is a clear, positive relationship between the number of sightings and the population of the city, but it seems that elevation is not a major determinant of the number of sightings;
Most of the sightings take place during the nighttime and during summer months.