The data set I have chosen to work with for the final project is the UFO Sightings CSV file derived from the National UFO Reporting Center (NUFORC). From my understanding, the data was collected through individual reporting: people who encountered a UFO personally uploaded a report to the website, where it was stored alongside every other report. I found this data set while searching for data sets for project 2, and the idea was interesting enough to dedicate an entire project to it. For this final project, I will be analyzing the following variables: date_time, city_area, state, ufo_shape, encounter_length, latitude, and longitude. City area, state, and UFO shape are categorical variables, while date time, encounter length, latitude, and longitude are quantitative. It is worth mentioning that date time holds the date of each report, while latitude and longitude are the coordinates of each report. Specifically, I will be looking at how UFO sightings change over time and at which American states have the most sightings.
Background Research
A UFO is defined as an unidentified flying object: any aerial object in the sky which cannot be explained or understood by the viewer (Britannica). While UFOs are not inherently linked to alien life, mainstream media has tended to treat the two as one and the same. In actuality, a UFO can be any flying object not understood by the viewer. Because of this, there is an extremely high chance of bias in UFO reports. A report could describe anything from an unfamiliar bird to an aircraft to something truly unidentifiable, so mixing these kinds of reports together can produce misleading results.
Load the libraries
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyr)
library(leaflet)
Warning: package 'leaflet' was built under R version 4.5.3
library(ggplot2)
library(plotly)
Warning: package 'plotly' was built under R version 4.5.3
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
library(lubridate)
library(viridis)
Warning: package 'viridis' was built under R version 4.5.3
Rows: 80332 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): date_time, city_area, state, country, ufo_shape, described_encounte...
dbl (3): encounter_length, latitude, longitude
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
I will now clean the data set
I will first remove all unnecessary variables and NA values found across the data set
# Remove unnecessary variables
clean_data <- data |>
  select(-country, -described_encounter_length, -description, -date_documented) |>
  # Remove any NA values
  filter(!is.na(date_time)) |>
  filter(!is.na(state)) |>
  filter(!is.na(ufo_shape)) |>
  filter(!is.na(latitude))
sum(is.na(clean_data)) # Check for any remaining NA values
[1] 0
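As a hedged alternative to the chained filter() calls, tidyr::drop_na() can drop rows that are missing a value in any of the named columns in one step. The sketch below runs on a small made-up tibble (the values are invented and are not the NUFORC data); only the column names are borrowed from this report.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for the UFO data set
toy <- tibble(
  date_time = c("10/10/1999 20:30", NA, "1/5/2004 21:00", "7/4/2010 22:15"),
  state     = c("tx", "ca", NA, "wa"),
  ufo_shape = c("light", "disk", "circle", NA),
  latitude  = c(29.9, 34.1, 40.7, 47.6)
)

# Keep only the rows that are complete in the listed columns
toy_clean <- toy |>
  drop_na(date_time, state, ufo_shape, latitude)

sum(is.na(toy_clean)) # 0 when no missing values remain
nrow(toy_clean)       # number of rows that survived
```

Only the first toy row is complete, so it is the only one kept.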
I will now filter the date_time variable to only include the year the sighting was reported
Based on online research, the easiest method to get solely the year is to use the string extract function. I found the specific code in The Epidemiologist R Handbook, chapter 10 (Characters and strings), subchapter 10.8 (Regex and special characters).
clean_data <- clean_data |>
  mutate(year = as.integer(str_extract(date_time, "\\d{4}"))) # Extract only the 4 digit value for year
unique(clean_data$year) # Check to make sure only years were extracted
clean_data <- clean_data |>
  filter(year >= 1990) # Filter out all years earlier than 1990
unique(clean_data$year) # Check all the remaining years in the data set
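The regular expression "\\d{4}" returns the first run of four digits in each string, which is the year only as long as no other four-digit run comes first. As a cross-check, the sketch below compares it with parsing the full timestamp using lubridate. The sample strings are invented, and the month/day/year hour:minute layout is an assumption about how date_time is stored.

```r
library(stringr)
library(lubridate)

# Hypothetical timestamps; the m/d/Y H:M layout is an assumption
stamps <- c("10/10/1949 20:30", "6/1/1995 02:15", "12/31/2013 23:59")

# Regex approach from the text: first run of four digits
year_regex <- as.integer(str_extract(stamps, "\\d{4}"))

# Parsing approach: convert to a date-time, then pull out the year
year_parsed <- as.integer(year(mdy_hm(stamps)))

year_regex
year_parsed
identical(year_regex, year_parsed) # TRUE when both approaches agree
```

If the two vectors ever disagree on the real data, the rows where they differ are worth inspecting by hand.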
With an upper limit of 1,455 seconds, I will filter the data set to exclude any values greater than 1,455 or less than 1.
clean_data <- clean_data |>
  filter(encounter_length <= 1455) |>
  filter(encounter_length >= 1)
summary(clean_data$encounter_length) # Check to make sure filtering worked
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.0 20.0 120.0 258.5 300.0 1440.0
I will now filter the states to only include those in the United States
unique(clean_data$state) # Check all listed values in states variable
america_states <- c(
  "ga", "pa", "tx", "tn", "il", "ny", "ar", "mo", "sc", "oh",
  "az", "ca", "nv", "wa", "nc", "ks", "ne", "fl", "or", "wi",
  "ky", "ia", "va", "mi", "id", "nm", "nj", "in", "wv", "mn",
  "ok", "co", "ct", "ri", "al", "vt", "la", "nh", "me", "ms",
  "ma", "hi", "ut", "md", "wy", "mt", "ak", "sd", "de", "nd"
) # Filter only for US States
clean_data <- clean_data |>
  filter(state %in% america_states) # Filter only for states inside the america_states collection
unique(clean_data$state)
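A hand-typed vector of 50 codes is easy to get wrong, so one hedged alternative is R's built-in state.abb constant, which ships with base R and holds the 50 state abbreviations in upper case. The name america_states_builtin below is made up for this sketch.

```r
# state.abb comes with base R (datasets package); lower-case it to match the data
america_states_builtin <- tolower(state.abb)

length(america_states_builtin)       # number of codes in the vector
"dc" %in% america_states_builtin     # the District of Columbia is not part of state.abb
```

Whether to add "dc" or territories by hand is a judgment call that depends on the question being asked.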
Warning: not plotting observations with leverage one:
9837, 22730
Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
The adjusted R-squared here is 0.01212, which is lower than 0.01233, albeit only slightly. This means it contributes only slightly to the regression model.
As every variable is significant to the regression model, this is the final result.
main_model <- lm(encounter_length ~ state + ufo_shape + year, data = clean_data)
summary(main_model)
Warning: not plotting observations with leverage one:
9837, 22730
Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
Analysis of Linear Regression Model
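The model equation for the linear regression is y = b0 + b1x1 + b2x2 + b3x3, where y is encounter length, b0 is the intercept (-1389.35), b1 is the coefficient for state (it varies by state), b2 is the coefficient for UFO shape (it varies by shape), and b3 is the coefficient for year (0.9149). Because state and UFO shape are categorical, b1x1 and b2x2 each stand for a set of indicator terms, one per state or shape other than the reference level.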
The model equation gives two takeaways. First, for each one-year increase, the predicted encounter length increases by 0.9149 seconds, holding state and UFO shape constant. Second, each state and UFO shape has its own coefficient, so each one shifts the predicted encounter length by a different amount.
The p-value was < 2.2e-16, meaning the association between encounter length and the state, UFO shape, and year of the report is statistically significant. Significance alone does not measure how strong that association is.
The adjusted R-squared is 0.01233. This means the regression model explains only about 1.233% of the variance in encounter length. This number is very low, meaning state, UFO shape, and year together predict encounter length poorly. Another way of interpreting this is that many factors outside this data set have a bigger impact on the encounter length.
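To make the pieces of this analysis concrete, the sketch below fits the same model formula to simulated data and pulls out the coefficient names, the adjusted R-squared, and the overall p-value. Every value in sim is invented for illustration, so the numbers it prints will not match the results reported above.

```r
set.seed(1)

# Simulated stand-in data; column names mirror the report, values are invented
n <- 300
sim <- data.frame(
  state     = sample(c("ca", "fl", "wa"), n, replace = TRUE),
  ufo_shape = sample(c("light", "disk"), n, replace = TRUE),
  year      = sample(1990:2014, n, replace = TRUE)
)
sim$encounter_length <- 100 + 1 * (sim$year - 1990) + rnorm(n, sd = 300)

fit <- lm(encounter_length ~ state + ufo_shape + year, data = sim)
s <- summary(fit)

# One coefficient per non-reference level of each categorical variable, plus the year slope
names(coef(fit))

# Adjusted R-squared: share of variance explained, penalized for the number of terms
s$adj.r.squared

# Overall F-test p-value: do the predictors jointly beat an intercept-only model?
f <- s$fstatistic
pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
```

The reference levels ("ca" and "disk" here, the first alphabetically) have no coefficient of their own; they are absorbed into the intercept, which is why the intercept by itself is hard to interpret.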
The diagnostic plots:
Residuals vs. Fitted - The points are heavily bunched together in 4 specific locations, with only a couple outside these zones. The red line appears to be constant for the first half, but it begins to decrease towards the latter half of the graph.
Q-Q Residuals - Much of the graph does not follow the reference line, but rather floats above it. The line of data displays a staircase shape, where it rises and levels off many times.
Scale-Location - The line shows a steady increase throughout the entire graph. The data points appear to cluster in a number of different locations.
Residuals vs. Leverage - The majority of the data points sit at the 0.0 marker, with a couple of outliers appearing around the 0.2 mark, 2 at the 0.4 mark, and 1 at the 1.0 mark. The line dips in the beginning but gradually rises for the latter half of the graph.
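The "not plotting observations with leverage one" warning has one plausible cause worth checking: a category level that appears in only a single row. The model then fits that row exactly, which gives it a leverage (hat value) of 1. The sketch below reproduces this on invented data, where the made-up state "zz" appears once; whether this is what happened for rows 9837 and 22730 is an assumption that would need to be confirmed on the real data.

```r
# Invented data: state "zz" appears exactly once
toy <- data.frame(
  state            = c("ca", "ca", "ca", "fl", "fl", "fl", "zz"),
  year             = c(1995, 2001, 2010, 1998, 2005, 2012, 2003),
  encounter_length = c(60, 120, 300, 30, 600, 90, 45)
)

fit <- lm(encounter_length ~ state + year, data = toy)

# Leverage (hat value) of each observation; the lone "zz" row is fitted exactly
h <- hatvalues(fit)
round(h, 3)

# Rows that plot(fit) would skip with the "leverage one" warning
which(h > 1 - 1e-8)
```

On the real data, counting rows per state and per UFO shape would show whether any level occurs only once.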
Now I will create the two visualizations.
I will start with a hex graph visualization. My goal is to test for a relationship between encounter length and the year the report was made. I found the hex graph by going through the pop-up menu that appears when typing in geom_.
ggplot(clean_data, aes(x = year, y = encounter_length)) + # Set the x and y
  theme_minimal() +
  geom_hex() + # Make a hex graph. I found this just by trying out the different options under geom.
  scale_fill_viridis() +
  labs(
    title = "Year vs. Encounter Length (Seconds) for UFO Reports In the U.S.",
    caption = "The National UFO Reporting Center (NUFORC)",
    x = "Year",
    y = "Encounter Length (Seconds)",
    fill = "Number of Reports"
  )
ggplotly() # Add interactivity
Analysis of Visualization
The visualization I chose to make is a hex graph depicting the relationship between the year and the encounter length in seconds for UFO sightings in the United States. From the graph, it is clear the number of reports increases over the years: there are no counts above 300 in the early 1990s, while multiple years in the mid 2010s reach roughly 1,200 reports. One plausible explanation is the expansion of technology. As the years go by, the creation and wide distribution of technology give more people the means to report their findings. I don’t see any surprises in this graph. My hypothesis going in was that the number of reports would trend steadily upward over the years, and the graph corroborates this. There wasn’t anything I couldn’t include in this graph. However, I did wish to make a 3D model, but I did not have enough quantitative variables to achieve the desired result.
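Each hexagon's fill is the number of reports in one combination of year and encounter length, so the overall yearly trend can be easier to confirm with a plain tally of reports per year. The sketch below shows one way to do that with dplyr; the toy tibble and the reports column name are made up for illustration.

```r
library(dplyr)

# Invented years standing in for clean_data$year
toy <- tibble(year = c(1990, 1990, 1995, 1995, 1995, 2010, 2010, 2010, 2010, 2013))

# Number of reports in each year, ordered by year
reports_per_year <- toy |>
  count(year, name = "reports")

reports_per_year
```

Applied to the cleaned data, the same tally could be plotted as a line or bar chart to show the trend without the encounter length dimension.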
Now I will create the second visualization: a density plot analyzing the top 5 states with the most reports. First, I must figure out which states have the most reports.
clean_data |>
  group_by(state) |> # Group by the state
  count(state) |> # Count the number of times the state is in the data
  arrange(desc(n)) # Arrange the count in descending order
# A tibble: 50 × 2
# Groups: state [50]
state n
<chr> <int>
1 ca 7644
2 fl 3405
3 wa 3356
4 tx 2828
5 ny 2457
6 az 2106
7 il 2103
8 pa 2011
9 oh 1911
10 mi 1587
# ℹ 40 more rows
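Rather than reading the top five off the table by eye, the state codes can also be pulled out programmatically. The sketch below is one hedged way to do it with dplyr; the toy tibble and the name top_states are invented for illustration.

```r
library(dplyr)

# Invented state codes standing in for clean_data$state
toy <- tibble(state = c(rep("ca", 6), rep("fl", 5), rep("wa", 4),
                        rep("tx", 3), rep("ny", 2), "az"))

top_states <- toy |>
  count(state, sort = TRUE) |> # Tally and order from most to fewest reports
  slice_head(n = 5) |>         # Keep the five largest
  pull(state)                  # Return just the state codes

top_states
```

A vector like top_states can then go straight into filter(state %in% top_states), so the plot stays correct if the cleaning steps change the ranking.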
Now I will filter for the top 5 states and then make the density plot. I chose to facet wrap the states, and found this in The Epidemiologist R Handbook, chapter 30 (ggplot basics), subchapter 30.6 (Facets / Small-multiples).
filter_five_states <- clean_data |>
  filter(state %in% c("ca", "fl", "wa", "tx", "ny")) # Filter for the top 5 states found in the previous code
ggplot(filter_five_states, aes(x = year, fill = state)) +
  geom_density() + # Make a density chart
  theme_minimal() +
  facet_wrap(~state) + # Separate the states so they have their own respective graph
  scale_fill_brewer(palette = "Accent") + # Change the palette colours
  labs(
    fill = "State",
    title = "Yearly Reporting Distribution of UFO Sightings",
    x = "Year",
    y = "Density",
    caption = "The National UFO Reporting Center (NUFORC)"
  )
ggplotly() # Add interactivity
Analysis of Visualization
The graph I chose to make is a density chart, facet wrapped so each state gets its own clear visualization. The y axis shows the density, ranging from 0 to 1, while the x axis shows the years. When analyzing the graph, the one surprise was the number of dips in reporting. I expected the percentage of reports to increase continuously over the years with little to no dips. This expectation was based on technological advancements as well as attention-seeking crazes on social media. I imagined that with the development of social media, people would be more inclined to report their findings in an effort to gain some form of internet popularity. However, the graph shows the density of reports dropping in some years. The biggest example is the graph for Washington, where it appears the mid 2000s saw a major drop in reporting. Another surprise is the major spike towards the end for Florida. While I do not know what might have caused this, it is definitely unusual and something to take into account should further investigations be conducted. In regards to what I couldn’t get to work, I first tried to make a map visualization depicting the encounter length of reports across the entire United States. However, I had a hard time making the information and the graph legible.
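One point that may help when reading the dips: a density curve is scaled so the area under it equals 1, so its height reflects the share of a state's reports falling near each year rather than a raw count. A dip therefore means a smaller share, not necessarily fewer reports than in an earlier year of another state. The sketch below checks the area on invented years using base R's density(), which to my understanding is the same kind of kernel estimate geom_density() draws for each panel.

```r
set.seed(42)

# Invented report years for a single state
years <- sample(1990:2014, 500, replace = TRUE)

# Kernel density estimate of the years
d <- density(years)

# Approximate the area under the curve with a Riemann sum over the evaluation grid
area <- sum(d$y) * (d$x[2] - d$x[1])
area # close to 1
```

To compare raw volume between states, a count-based chart such as a histogram would be the more direct choice.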
Sources
Hibberd, J. (2026, May 8). Trump UFO Files Released: The 5 Strangest Photos, Videos. The Hollywood Reporter. https://www.hollywoodreporter.com/news/general-news/trump-ufo-files-revelations-1236590123/#respond
The Epidemiologist R Handbook. (2024). Epirhandbook.com. https://www.epirhandbook.com/en/
TWO−LETTER STATE AND TERRITORY ABBREVIATIONS. (2022). Faa.gov. https://www.faa.gov/air_traffic/publications/atpubs/cnt_html/appendix_a.html
Shostak, S. (2026, May 8). unidentified flying object. Encyclopedia Britannica. https://www.britannica.com/topic/unidentified-flying-object