Author: Márton Nagy
Course: Coding 3 at Central European University
Target grade: A
The UFO sightings dataset has been featured on Tidy Tuesday on the 25th week of 2023. The dataset comes from the National UFO Reporting Centre (NUFORC), and has been enriched with data from https://sunrise-sunset.org/.
The dataset contains three tables:
the ufo_sightings table contains data (like shape,
duration, date and time, location) on every UFO sighting reported by
NUFORC;
the placestable contains data on the sighting
locations (like timezone, elevation, population, coordinates);
the day_parts_map table contains the beginning of
certain day parts (like sunrise, twilight) for every coordinates and
date - this table has been used by the Tidy Tuesday team to enrich the
ufo_sightings table with the day_part
variable, denoting which time of the day the sighting took place (thus
this table is not used in my analysis).
The dataset contains 96,429 sightings (most of which are from the US) and 14,417 places.
The data can be imported into an R-session with the following code:
library(tidytuesdayR)
raw_data <- tidytuesdayR::tt_load('2023-06-20')
## ---- Compiling #TidyTuesday Information for 2023-06-20 ----
## --- There are 3 files available ---
##
##
## ── Downloading files ───────────────────────────────────────────────────────────
##
## 1 of 3: "ufo_sightings.csv"
## 2 of 3: "places.csv"
## 3 of 3: "day_parts_map.csv"
library(data.table)
ufo_sightings <- data.table(raw_data$`ufo_sightings`)
places <- data.table(raw_data$`places`)
day_parts_map <- data.table(raw_data$`day_parts_map`)
rm(raw_data)
Before doing some visualizations, I perform some analysis using data table operations. The goal of this is to get an idea about what is going on in the dataset. For the most simple tasks, only the code and the output is provided. For more complex tasks, I also provide a short explanation.
ufo_sightings[, .N]
## [1] 96429
First I count the number of rows, and group by country codes. Then, I order by the number of sightings descending, and select the top 10.
ufo_sightings[, .N, by = country_code][order(-N)][1:10]
## country_code N
## <char> <int>
## 1: US 88213
## 2: CA 3514
## 3: GB 1818
## 4: AU 602
## 5: IN 240
## 6: MX 143
## 7: NZ 123
## 8: ZA 122
## 9: NL 98
## 10: DE 90
Filtering for night day par, and calculating a simple average, and rounding it to zero decimals.
ufo_sightings[day_part == 'night', .(avg_duration_sec = round(mean(duration_seconds)))]
## avg_duration_sec
## <num>
## 1: 12941
Filtering for US, grouping by state, calculating average duration and number of observations, then ordering descending by average duration.
ufo_sightings[country_code == 'US', .(avg_duration_sec = round(mean(duration_seconds)),
n_obs = .N),
by = state][order(-avg_duration_sec)]
## state avg_duration_sec n_obs
## <char> <num> <int>
## 1: WY 34201 236
## 2: NM 17723 1045
## 3: MN 17469 1434
## 4: TN 15544 1518
## 5: MT 14907 660
## 6: OK 14533 1034
## 7: SD 13821 244
## 8: NV 13697 1123
## 9: MA 13019 1861
## 10: UT 12942 1051
## 11: AL 12863 908
## 12: AZ 12740 3448
## 13: OR 12737 2521
## 14: NE 12477 467
## 15: AK 12373 421
## 16: ME 12314 749
## 17: NC 12179 2517
## 18: NH 12125 795
## 19: MI 11942 2295
## 20: TX 11640 4170
## 21: NJ 11565 1735
## 22: AR 11137 742
## 23: LA 11043 756
## 24: FL 10950 5833
## 25: GA 10913 1868
## 26: CT 10852 1190
## 27: WV 10595 520
## 28: IN 10554 1667
## 29: MO 10477 1802
## 30: CO 10439 2193
## 31: MD 10295 1221
## 32: SC 10113 1528
## 33: WA 10072 5055
## 34: OH 10061 2954
## 35: CA 9980 11472
## 36: PA 9945 3181
## 37: MS 9770 483
## 38: KS 9303 808
## 39: HI 9190 345
## 40: KY 9082 1056
## 41: VA 8994 1766
## 42: ND 8836 164
## 43: NY 8830 3854
## 44: WI 8583 1683
## 45: ID 8567 904
## 46: VT 8017 284
## 47: DE 7484 275
## 48: IA 7311 856
## 49: IL 7183 2995
## 50: RI 6383 423
## 51: DC 2097 98
## 52: Fl 218 5
## state avg_duration_sec n_obs
Filtering for Canada and excluding missing or unknown shapes, the counting the number of observations by states and shapes. Then, I calculate the maximum number of observations by state, and order descending by this. This way, only the most frequent shape per state is shown.
ufo_sightings[country_code == 'CA' & !is.na(shape) & shape != 'unknown',
.N, by = .(state, shape)][, .SD[which.max(N)], by = state][order(-N)]
## state shape N
## <char> <char> <int>
## 1: ON light 296
## 2: BC light 189
## 3: AB light 102
## 4: QC light 38
## 5: SK light 32
## 6: MB light 32
## 7: NS light 28
## 8: NB light 14
## 9: NL light 3
## 10: PE light 2
## 11: YT light 2
## 12: NU changing 1
## 13: NT changing 1
First, I have to merge the ufo_sightings and the
places tables. The key is a compound one (as city names are
only unique within each country and state). I filter the merged dataset
to years above 2019 and I extract the continent from the timezone
denomination using grepl. With this, I can calculate the
average duration and the number of sightings per year-month pairs. I
order the output from most recent to least recent.
(
merge(ufo_sightings, places, by = c('city', 'state', 'country_code'))
[year(reported_date_time) > 2019 & grepl('Europe',timezone),
.(avg_duration_sec = round(mean(duration_seconds)), n_obs = .N),
by = .(year(reported_date_time), month(reported_date_time))][order(-year, -month)]
)
## year month avg_duration_sec n_obs
## <int> <int> <num> <int>
## 1: 2023 4 600 1
## 2: 2023 3 260 3
## 3: 2023 2 256 9
## 4: 2023 1 752 2
## 5: 2022 12 118 7
## 6: 2022 11 96 5
## 7: 2022 10 32822 8
## 8: 2022 9 300 7
## 9: 2022 8 670 13
## 10: 2022 7 264 13
## 11: 2022 6 344 11
## 12: 2022 5 34741 5
## 13: 2022 4 200 7
## 14: 2022 3 235 6
## 15: 2022 2 8748 5
## 16: 2022 1 130 5
## 17: 2021 12 180 4
## 18: 2021 11 70 2
## 19: 2021 10 10 1
## 20: 2021 9 360 1
## 21: 2021 8 135 2
## 22: 2021 7 120 1
## 23: 2021 6 450 2
## 24: 2021 3 54 4
## 25: 2021 2 52 3
## 26: 2020 12 95 2
## 27: 2020 11 210 2
## 28: 2020 10 226 4
## 29: 2020 9 755 4
## 30: 2020 8 64 11
## 31: 2020 7 350 5
## 32: 2020 6 77977 5
## 33: 2020 5 273 10
## 34: 2020 4 22189 12
## 35: 2020 3 600 8
## 36: 2020 2 521 3
## 37: 2020 1 434 5
## year month avg_duration_sec n_obs
I again have to join the two tables mentioned above. I filter for the US and non-zero population. I then calculate the sightings per capita by state and city. Note that because of spelling differences, I aggregate by the uppercase city name. I then order descending by sightings per capita and select the top 10.
ufo_sightings[, city_upper := toupper(city)]
unique(
merge(ufo_sightings, places, by = c('city', 'state', 'country_code'))
[country_code == 'US' & population != 0, .(sightings_pc = .N / population, .N), by = .(state, city_upper)][order(-sightings_pc)]
)[1:10]
## state city_upper sightings_pc N
## <char> <char> <num> <int>
## 1: NC HOLDEN BEACH 0.04595880 29
## 2: VA YORKTOWN 0.04102564 8
## 3: NC OCEAN ISLE BEACH 0.03350084 20
## 4: WA TROUT LAKE 0.03231598 18
## 5: FL CAPTIVA 0.02915952 17
## 6: KY BOONEVILLE 0.02631579 2
## 7: MD BEL AIR 0.02146264 27
## 8: TX MONTGOMERY 0.01736973 14
## 9: NE ARTHUR 0.01709402 2
## 10: VA NEW KENT 0.01673640 4
ufo_sightings[, .(avg_duration_sec = mean(duration_seconds)), by = hour(reported_date_time)][order(hour)]
## hour avg_duration_sec
## <int> <num>
## 1: 0 9504.247
## 2: 1 8896.202
## 3: 2 10122.244
## 4: 3 191648.466
## 5: 4 11434.445
## 6: 5 12699.113
## 7: 6 15473.006
## 8: 7 17734.120
## 9: 8 17331.038
## 10: 9 14825.502
## 11: 10 12448.331
## 12: 11 12720.066
## 13: 12 9251.014
## 14: 13 9334.447
## 15: 14 14436.854
## 16: 15 10392.217
## 17: 16 9095.173
## 18: 17 10424.567
## 19: 18 9481.491
## 20: 19 10736.873
## 21: 20 9936.239
## 22: 21 7498.419
## 23: 22 8711.597
## 24: 23 8608.890
## hour avg_duration_sec
First I count the number of observations by country and day part. I then augment this by the percentage of country totals field.
(
ufo_sightings[country_code %in% c('US', 'CA') & !is.na(day_part),
.N, by = .(country_code, day_part)]
[, pctg := round(100 * N / sum(N), 2), by = country_code]
[order(day_part, -country_code)]
)
## country_code day_part N pctg
## <char> <char> <int> <num>
## 1: US afternoon 11543 13.32
## 2: CA afternoon 428 13.42
## 3: US astronomical dawn 1596 1.84
## 4: CA astronomical dawn 62 1.94
## 5: US astronomical dusk 9604 11.08
## 6: CA astronomical dusk 415 13.01
## 7: US civil dawn 617 0.71
## 8: CA civil dawn 20 0.63
## 9: US civil dusk 3230 3.73
## 10: CA civil dusk 111 3.48
## 11: US morning 6939 8.01
## 12: CA morning 243 7.62
## 13: US nautical dawn 1220 1.41
## 14: CA nautical dawn 48 1.51
## 15: US nautical dusk 7126 8.22
## 16: CA nautical dusk 275 8.62
## 17: US night 44777 51.67
## 18: CA night 1587 49.76
First I have to use the cut function to categorize
elevation values. Then I merge the ufo_sightings and
places tables, and count the number of observations and
average duration of sightings by each category.
places[, elevation_cat := cut(elevation_m,
breaks = c(-100, 200, 500, 1500, Inf),
labels = c('plains', 'hill', 'mountain', 'high mountain'))]
(
merge(ufo_sightings, places, by = c('city', 'state', 'country_code'))
[!is.na(elevation_cat), .(n_obs = .N, avg_duration_sec = round(mean(duration_seconds))), by = elevation_cat]
)
## elevation_cat n_obs avg_duration_sec
## <fctr> <int> <num>
## 1: high mountain 3500 15947
## 2: plains 51259 10148
## 3: hill 25639 11380
## 4: mountain 8282 12744
Now we can move on to some visualization exercises. I below present 8 visualizations and I also provide a brief explanation for the code, as well as some insights for each chart.
The first plot is map of US states, colored by the number of UFO sightings. As my dataset had state names as two-letter codes, whereas the map data had full names, I had to create a lookup table to connect these two. Having these, I could draw up a US map, colored by the number of sightings. I used a color brewer scale for the coloring.
The map tells us that most sightings happened in California. There were also many sightings in Florida, Texas and Washington. For the other states, the distribution of sightings is relatively uniform.
library(ggplot2)
library(maps)
library(dplyr)
##
## Kapcsolódás csomaghoz: 'dplyr'
## The following objects are masked from 'package:data.table':
##
## between, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Look-up table to match state names and codes
state_lookup <- data.frame(
state_code = state.abb,
state_name = tolower(state.name)
)
# Aggregate sightings by state and map to full names
state_sightings <- ufo_sightings[country_code == 'US', .N, by = state]
state_sightings <- state_sightings %>% left_join(state_lookup, by = c("state" = "state_code"))
# US states map data (uses full state names)
us_states <- map_data("state")
# Merge UFO data with map data
map_data <- us_states %>% left_join(state_sightings, by = c("region" = "state_name"))
# Plotting
ggplot(map_data, aes(long, lat, group = group, fill = N)) +
geom_polygon(color = "black", size = 0.2) +
scale_fill_distiller(name = 'No. of sightings', type = 'seq', palette = 'RdPu', direction = 1, na.value = 'grey90') +
labs(title = "No. of UFO sightings by state in the US") +
theme_void()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
rm(us_states, map_data, state_lookup, state_sightings)
The second chart uncovers whether there is a difference in sighting
duration distribution by elevation category. I used overlayed density
plots with a logarithmic scale for the duration. I used the
BuPu palette for coloring.
Generally speaking, the chart tells us that the lower the elevation, the flatter the distribution. However, the distributions look more or less similar with the two modes roughly at the same place.
ggplot(merge(ufo_sightings, places, by = c('city', 'state', 'country_code'))[!is.na(duration_seconds) & !is.na(elevation_cat)],
aes(x = duration_seconds, color = elevation_cat, fill = elevation_cat)) +
geom_density(alpha = 0.25) +
scale_x_log10() +
theme_classic() +
scale_color_brewer(name = 'Elevation category', type = 'seq', palette = 'BuPu', aesthetics = c('color', 'fill')) +
xlab('Duration in seconds (logarithmic scale)') +
ylab('Density')+
ggtitle('Distribution of sighting duration by elevation category') +
theme(legend.position = 'top', plot.title = element_text(hjust = 0.5))
## Warning in scale_x_log10(): log-10 transformation introduced infinite values.
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_density()`).
For the third chart, I opted for combined violin and boxplots. The duration is again at a log scale. The key technical issue for this chart was to select the top 10 countries with the most sightings and only show the plots for these. I achieved this by merging the ufo_sightings dataset with a count by country table for the top 10 countries by observation count.
From the plot we can see that the distributions per country are roughly the same, with heavy outliers mainly present in the US and Canada.
ggplot(merge(ufo_sightings, ufo_sightings[, .N, by = country_code][order(-N)][1:10]),
aes(y = duration_seconds, x = reorder(country_code, -N))) +
geom_violin(fill = "skyblue", alpha = 0.5) +
geom_boxplot(width = 0.1, fill = "blue", color = "black", outlier.size = 0.5, alpha = 0.7) +
scale_y_log10() +
theme_light() +
ylab('Duration in seconds (logarithmic scale)') +
xlab('Country (ordered by no. of sightings)') +
ggtitle('Distribution of sighting duration by country', subtitle = '(Only TOP10 countries by no. of sightings shown)') +
theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
## Warning in scale_y_log10(): log-10 transformation introduced infinite values.
## log-10 transformation introduced infinite values.
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
In this plot I show two time series on one plot: the average duration and the number of sightings. For this, I created a helper table with the required aggregations, which I have augmented with a year-month column for axis labeling. I included the average duration as minutes to have the two axis at roughly the same scale.
We can see from the chart that the number of sightings is generally declining with time, while the average duration seems rather stationary with no trend.
monthly_stats <- ufo_sightings[,.(avg_duration = mean(duration_seconds, na.rm = TRUE), sightings = .N),
by = .(year = year(reported_date_time), month = month(reported_date_time))][order(year, month)][year > 2012]
monthly_stats[, year_month := sprintf("%d-%02d", year, month)]
ggplot(monthly_stats, aes(x = year_month)) +
geom_bar(aes(y = sightings), stat = "identity", fill = "skyblue", alpha = 0.7) +
geom_line(aes(y = avg_duration / 60),
color = "navy", size = 1.2, group = 1) +
scale_y_continuous(name = "Number of sightings",
sec.axis = sec_axis(~ ., name = "Avg. duration in minutes",)) +
scale_x_discrete(breaks = monthly_stats$year_month[seq(1, nrow(monthly_stats), by = 6)]) +
labs(title = "No. of sightings and average sighting duration per year and month",
subtitle = '(2013-2023)',
x = "Year and month",
y = "Number of sightings") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
axis.title.y.right = element_text(color = "navy"),
axis.text.x = element_text(angle = 45, hjust = 1))
rm(monthly_stats)
This plot is a faceted histogram by month. To create this, I augmented my dataset with month and hour variables.
We can see that regardless of the month, the number of sightings always peaks during nighttime. We can also see that the peaks tend get larger and larger (more number of sightings) during the summer months.
ufo_sightings[, hour := hour(reported_date_time)]
ufo_sightings[, month := month(reported_date_time)]
ggplot(ufo_sightings[, .N, by = .(month, hour)], aes(x = hour, y = N)) +
geom_col(color = 'skyblue', fill = 'skyblue') +
facet_wrap(~month) +
theme_minimal() +
labs(title = 'Hourly distribution of no. of sightings by month',
x = 'Hour of the day',
y = 'No. of sightings') +
theme(plot.title = element_text(hjust = 0.5))
This plot is again a map, with an overlayed scatterplot. Each dot represents a city where at least one sighting took place, with the coloring indicating the the total number of sightings. The color scheme was created by color brewer.
We can see that indeed, most sightings took place in the US, though there are many observations in European countries as well. We can also see that practically no sightings took place in less developed countries, with India being an exception for this.
world <- map_data('world')
ggplot() +
geom_map(data = world, map = world, aes(long, lat, map_id = region), color = 'black', fill = 'lightgrey') +
geom_point(data = merge(ufo_sightings, places, by = c('city', 'state', 'country_code'))[, .N, by = .(latitude, longitude)],
aes(y = latitude, x = longitude, fill = N, color = N), alpha = 0.2) +
theme_void() +
theme(legend.position = 'bottom',
plot.title = element_text(hjust = 0.5)) +
scale_fill_distiller(name = 'No. of sightings', type = 'seq', palette = 'Reds', direction = 1, aesthetics = c('color', 'fill')) +
ggtitle('Geographical distribution of UFO sightings')
## Warning in geom_map(data = world, map = world, aes(long, lat, map_id = region),
## : Ignoring unknown aesthetics: x and y
rm(world)
The last two plots are very simple regression plots. The first one shows the pattern between the population and the number of sightings in a log-log set-up. I also figured that for aggregations by cities, a nice workaround for the different spellings is to aggregate by the coordinates instead of the city names.
As expected, the chart shows a clear positive pattern of association between the two variables. This means that generally, more sightings take place in more populous cities.
ggplot(merge(ufo_sightings, places, by = c('city', 'state', 'country_code'))[, .(pop = mean(population), .N), by = .(latitude, longitude)],
aes(x = pop, y = N)) +
geom_point(color = 'grey', alpha = 0.1, size = 1) +
geom_smooth(method = 'lm') +
scale_x_log10() +
scale_y_log10() +
labs(title = 'Log-log regression of number of sightings on population',
y = 'Number of sightings (logarithmic scale)',
x = 'Population (logarithmic scale)') +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))
## Warning in scale_x_log10(): log-10 transformation introduced infinite values.
## log-10 transformation introduced infinite values.
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 39 rows containing non-finite outside the scale range
## (`stat_smooth()`).
The second regression plot shows the relationship between the elevation and the number of sightings in the US with a loess regression. The number of sightings is presented on a logarithmic scale.
The plot suggests that there is practically no clear relationship between these two variables.
ggplot(merge(ufo_sightings, places, by = c('city', 'state', 'country_code'))[country_code == 'US', .(elev = mean(elevation_m), .N), by = .(country_code, latitude, longitude)],
aes(x = elev, y = N)) +
geom_point(color = 'grey', alpha = 0.1, size = 1) +
geom_smooth() +
scale_y_log10() +
labs(title = 'Loess regression of log number of sightings on elevation in the US',
y = 'Number of sightings (logarithmic scale)',
x = 'Elevation (meter)') +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## Warning: Removed 26 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 26 rows containing missing values or values outside the scale range
## (`geom_point()`).
Above, I have presented a short, introductory analysis on the UFO sightings dataset. I have used data table operations and data visualization techniques to extract some insights from the data. The most important of these were:
The vast majority of UFO sightings takes place in the US;
The number of sightings declined in the past decade, while the average duration of sightings stayed relatively the same;
There is a clear, positive relationship between the number of sightings and the population of the city, but it seems that elevation is not a major determinant of the number of sightings;
Most of the sightings take place during the nighttime and during summer months.