Final Project

Author

Oliver Kronen

FBI composite image of a UFO sighting

Introduction

The data set I have chosen to work with for the final project is the UFO Sightings csv file derived from The National UFO Reporting Center (NUFORC). From my understanding, the methodology for collecting this data appears to be individual reporting, where those who encountered a UFO personally uploaded a report to the website, which was then stored alongside every other report. I chose this data set as I found it while searching for data sets for project 2. I found the idea to be interesting enough to dedicate an entire project to it. For this final project, I will be analyzing the following variables; date_time, city_area, state, ufo_shape, encounter_length, latitude, and longitude. City area, state, and ufo shape are all categorical variables while date time, encounter length, latitude, and longitude are all quantitative. It is worth mentioning date time deals with the date while latitude and longitude are coordinates for the individual reports. For this project, I will be specifically looking for how UFO sightings change over time as well as analyzing the American states which hold the most sightings.

Background Research

A UFO is defined as an unidentified flying object, pertaining to any aerial object in the sky which cannot be explained or understood by the viewer (Britannica). While UFO’s are not inherently correlated to alien life, mainstream media has made a point to emphasize the two as being one and the same. In actuality, a UFO can be any flying object not understood by the viewer. Because of this, there is an extremely high chance for bias when making UFO reports. From unknown birds to aircraft to something truly unidentifiable, these different forms of reports can lead to many misleading results.

Load the libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyr)
library(leaflet)
Warning: package 'leaflet' was built under R version 4.5.3
library(ggplot2)
library(plotly)
Warning: package 'plotly' was built under R version 4.5.3

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout
library(lubridate)
library(viridis)
Warning: package 'viridis' was built under R version 4.5.3
Loading required package: viridisLite

Set the working directory

setwd("C:/Users/MyPC/Downloads/Data 110")
data <- read_csv("ufo_sightings.csv")
Rows: 80332 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): date_time, city_area, state, country, ufo_shape, described_encounte...
dbl (3): encounter_length, latitude, longitude

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

I will now clean the data set

I will first remove all unnecessary variables and NA values found across the data set

# Remove unecessary variables
clean_data <- data |>
  select(-country, -described_encounter_length, -description, -date_documented) |>
# Remove any NA values
  filter(!is.na(date_time)) |>
  filter(!is.na(state)) |>
  filter(!is.na(ufo_shape)) |>
  filter(!is.na(latitude))
sum(is.na(clean_data)) # Check for any remaining NA values
[1] 0

I will now filter the date_time variable to only include the year the sighting was reported

Based on online research, the easiest method to get solely the year is to use the string extract function. I found the specific code in the Epidemiologist R handbook; chapter 10 characters and strings; sub chapter 10.8 Regex and special characters.

clean_data <- clean_data |>
  mutate(year = as.integer(str_extract(date_time, "\\d{4}"))) # Extract only the 4 digit value for year
unique(clean_data$year) # Check to make sure only years were extracted
 [1] 1949 1956 1960 1961 1965 1966 1968 1970 1971 1972 1973 1974 1975 1976 1977
[16] 1978 1979 1980 1984 1986 1988 1989 1990 1991 1992 1993 1994 1996 1997 1998
[31] 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013
[46] 1987 1950 1952 1954 1955 1957 1959 1967 1969 1981 1982 1983 1985 1995 1936
[61] 1943 1953 1962 1964 1958 1963 1947 2014 1951 1910 1944 1948 1945 1925 1946
[76] 1931 1933 1920 1930 1939 1941 1942 1937 1929 1934

Next I will filter out every year prior to 1990

clean_data <- clean_data |>
  filter(year >= 1990) # Filter out all years earlier than 1990
unique(clean_data$year) # Check all the remaining years in the data set
 [1] 1990 1991 1992 1993 1994 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005
[16] 2006 2007 2008 2009 2010 2011 2012 2013 1995 2014

Next I will check the encounter_length to ensure there are no outliers

summary(clean_data$encounter_length)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
       0       30      180     6382      600 82800000 

There appears to be a massive outlier in the data.

For the sake of the assignment, I will calculate the upper limit and filter out any encounters greater than the upper limit.

IQR = Q3 - Q1 = 600 - 30 = 570

Upper Limit = Q3 + 1.5 * IQR = 600 + 1.5 * 570 = 1,455 seconds

With an upper limit of 1,455 seconds, I will filter the data set to exclude any number greater than 1,455 and any values less than 1

clean_data <- clean_data |>
  filter(encounter_length <= 1455) |>
  filter(encounter_length >= 1)
summary(clean_data$encounter_length) # Check to make sure filtering worked
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    1.0    20.0   120.0   258.5   300.0  1440.0 

I will now filter the states to only include those in the United States

unique(clean_data$state) # Check all listed values in states variable
 [1] "ga" "pa" "tx" "tn" "il" "ny" "ar" "mo" "sc" "oh" "az" "ca" "nv" "wa" "nc"
[16] "ks" "ne" "fl" "or" "bc" "wi" "ky" "ia" "va" "mi" "id" "nm" "nj" "mb" "in"
[31] "on" "wv" "mn" "ok" "co" "ct" "ri" "al" "nb" "vt" "la" "nh" "pr" "me" "ms"
[46] "ma" "hi" "ut" "md" "nf" "sk" "wy" "mt" "ak" "sd" "pq" "ab" "ns" "dc" "nt"
[61] "qc" "sa" "de" "yk" "nd" "pe" "yt"
america_states <- c("ga","pa","tx","tn","il","ny","ar","mo","sc","oh","az","ca",
                    "nv","wa","nc","ks","ne","fl","or","wi","ky","ia","va","mi",
                    "id","nm","nj","in","wv","mn","ok","co","ct","ri","al","vt",
                    "la","nh","me","ms","ma","hi","ut","md","wy","mt","ak","sd",
                    "de","nd") # Filter only for US States
clean_data <- clean_data |>
  filter(state %in% america_states) # Filter only for states inside the america_states collection
unique(clean_data$state)
 [1] "ga" "pa" "tx" "tn" "il" "ny" "ar" "mo" "sc" "oh" "az" "ca" "nv" "wa" "nc"
[16] "ks" "ne" "fl" "or" "wi" "ky" "ia" "va" "mi" "id" "nm" "nj" "in" "wv" "mn"
[31] "ok" "co" "ct" "ri" "al" "vt" "la" "nh" "me" "ms" "ma" "hi" "ut" "md" "wy"
[46] "mt" "ak" "sd" "de" "nd"

Now I will do the multiple linear regression model

main_model <- lm(encounter_length ~ 
                   state +
                   ufo_shape +
                   year,
                 data = clean_data)
summary(main_model) # Summarise the main model

Call:
lm(formula = encounter_length ~ state + ufo_shape + year, data = clean_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-456.53 -225.83 -134.67   65.02 1230.68 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)        -1389.3453   526.5193  -2.639 0.008324 ** 
stateal              -58.2260    24.7905  -2.349 0.018842 *  
statear              -91.1422    24.9014  -3.660 0.000252 ***
stateaz               14.0014    21.4606   0.652 0.514131    
stateca              -52.3896    20.6154  -2.541 0.011047 *  
stateco              -77.2442    22.2582  -3.470 0.000520 ***
statect              -60.8933    23.5165  -2.589 0.009617 ** 
statede              -45.2040    33.6654  -1.343 0.179360    
statefl              -63.8873    21.0187  -3.040 0.002370 ** 
statega              -74.6372    22.5634  -3.308 0.000941 ***
statehi              -63.0696    28.0473  -2.249 0.024536 *  
stateia              -65.3156    24.3479  -2.683 0.007308 ** 
stateid              -39.7374    25.4259  -1.563 0.118090    
stateil              -20.1451    21.4598  -0.939 0.347871    
statein              -70.5995    22.4849  -3.140 0.001691 ** 
stateks              -65.6070    24.7396  -2.652 0.008006 ** 
stateky              -54.5414    23.8073  -2.291 0.021970 *  
statela              -64.0031    25.3883  -2.521 0.011706 *  
statema              -70.6239    22.5422  -3.133 0.001731 ** 
statemd              -68.5857    23.5878  -2.908 0.003643 ** 
stateme              -15.9830    25.1483  -0.636 0.525073    
statemi              -64.4503    21.8286  -2.953 0.003153 ** 
statemn              -83.6442    23.1678  -3.610 0.000306 ***
statemo              -67.6543    22.2456  -3.041 0.002357 ** 
statems              -65.1316    27.6364  -2.357 0.018440 *  
statemt              -55.5232    26.1351  -2.124 0.033635 *  
statenc              -71.6695    21.9254  -3.269 0.001081 ** 
statend              -87.8962    38.8573  -2.262 0.023700 *  
statene              -55.8678    27.3269  -2.044 0.040916 *  
statenh              -56.4081    25.5898  -2.204 0.027506 *  
statenj              -63.2229    22.2810  -2.838 0.004548 ** 
statenm              -57.3054    24.3544  -2.353 0.018627 *  
statenv              -37.1029    23.6951  -1.566 0.117391    
stateny              -62.4400    21.2948  -2.932 0.003367 ** 
stateoh              -48.4539    21.5752  -2.246 0.024720 *  
stateok              -73.8053    24.1865  -3.052 0.002278 ** 
stateor              -68.7648    21.9429  -3.134 0.001726 ** 
statepa              -64.9443    21.5136  -3.019 0.002539 ** 
stateri              -63.2308    29.2383  -2.163 0.030576 *  
statesc              -54.9783    23.0808  -2.382 0.017223 *  
statesd              -93.1074    33.0926  -2.814 0.004902 ** 
statetn              -71.0020    22.8254  -3.111 0.001868 ** 
statetx              -77.1537    21.1671  -3.645 0.000268 ***
stateut              -52.1907    24.1839  -2.158 0.030926 *  
stateva              -88.6317    22.4805  -3.943 8.07e-05 ***
statevt              -79.6006    28.3545  -2.807 0.004997 ** 
statewa              -71.1381    21.0292  -3.383 0.000718 ***
statewi              -42.5349    22.5729  -1.884 0.059525 .  
statewv              -85.3441    26.1775  -3.260 0.001114 ** 
statewy              -99.1895    32.8900  -3.016 0.002564 ** 
ufo_shapechevron    -192.6857    14.9230 -12.912  < 2e-16 ***
ufo_shapecigar      -159.3453    12.6616 -12.585  < 2e-16 ***
ufo_shapecircle     -117.9298    10.1630 -11.604  < 2e-16 ***
ufo_shapecone       -124.3197    24.5491  -5.064 4.12e-07 ***
ufo_shapecrescent   -371.2295   321.7539  -1.154 0.248600    
ufo_shapecross       -87.4039    26.5726  -3.289 0.001005 ** 
ufo_shapecylinder   -120.4151    13.9653  -8.622  < 2e-16 ***
ufo_shapedelta       -10.4960   160.9595  -0.065 0.948008    
ufo_shapediamond     -91.5524    14.6815  -6.236 4.52e-10 ***
ufo_shapedisk       -105.8401    10.9265  -9.687  < 2e-16 ***
ufo_shapedome       -369.7784   321.4583  -1.150 0.250019    
ufo_shapeegg        -131.2411    16.8047  -7.810 5.83e-15 ***
ufo_shapefireball   -179.0715    10.2038 -17.550  < 2e-16 ***
ufo_shapeflare      -355.2480   321.3916  -1.105 0.269016    
ufo_shapeflash      -211.5616    13.6930 -15.450  < 2e-16 ***
ufo_shapeformation  -101.8383    11.8203  -8.616  < 2e-16 ***
ufo_shapehexagon    -132.6933   321.4560  -0.413 0.679763    
ufo_shapelight      -115.9462     9.6141 -12.060  < 2e-16 ***
ufo_shapeother      -128.9214    10.5167 -12.259  < 2e-16 ***
ufo_shapeoval       -131.5514    11.1246 -11.825  < 2e-16 ***
ufo_shapepyramid    -233.9934   321.5709  -0.728 0.466827    
ufo_shaperectangle  -160.6569    13.8968 -11.561  < 2e-16 ***
ufo_shaperound        74.9017   227.3824   0.329 0.741848    
ufo_shapesphere     -125.7420    10.5181 -11.955  < 2e-16 ***
ufo_shapeteardrop   -159.5628    16.4122  -9.722  < 2e-16 ***
ufo_shapetriangle   -132.5151    10.0553 -13.179  < 2e-16 ***
ufo_shapeunknown    -115.2631    10.5015 -10.976  < 2e-16 ***
year                   0.9149     0.2621   3.490 0.000483 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 321.2 on 55304 degrees of freedom
Multiple R-squared:  0.0137,    Adjusted R-squared:  0.01233 
F-statistic: 9.977 on 77 and 55304 DF,  p-value: < 2.2e-16
plot(main_model) # Plot the regression analysis
Warning: not plotting observations with leverage one:
  9837, 22730

Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

The adjusted r squared is 0.01233

Because there are so many factors in the regression model, I will analyze the removal of each variable individually.

First, I will remove state

main_model <- lm(encounter_length ~
                   ufo_shape +
                   year,
                 data = clean_data)
summary(main_model)

Call:
lm(formula = encounter_length ~ ufo_shape + year, data = clean_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-447.91 -228.04 -139.09   55.86 1226.12 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)        -1253.4091   523.3711  -2.395  0.01663 *  
ufo_shapechevron    -194.7944    14.9407 -13.038  < 2e-16 ***
ufo_shapecigar      -161.3228    12.6738 -12.729  < 2e-16 ***
ufo_shapecircle     -118.1490    10.1741 -11.613  < 2e-16 ***
ufo_shapecone       -124.5962    24.5784  -5.069 4.01e-07 ***
ufo_shapecrescent   -370.5521   321.9049  -1.151  0.24969    
ufo_shapecross       -87.1378    26.6002  -3.276  0.00105 ** 
ufo_shapecylinder   -122.3681    13.9785  -8.754  < 2e-16 ***
ufo_shapedelta       -30.5748   161.1680  -0.190  0.84954    
ufo_shapediamond     -92.7210    14.6989  -6.308 2.85e-10 ***
ufo_shapedisk       -107.1213    10.9385  -9.793  < 2e-16 ***
ufo_shapedome       -377.7339   321.9070  -1.173  0.24063    
ufo_shapeegg        -132.4472    16.8229  -7.873 3.52e-15 ***
ufo_shapefireball   -179.7616    10.2124 -17.602  < 2e-16 ***
ufo_shapeflare      -350.5521   321.9049  -1.089  0.27616    
ufo_shapeflash      -212.7919    13.7077 -15.524  < 2e-16 ***
ufo_shapeformation  -100.7413    11.8321  -8.514  < 2e-16 ***
ufo_shapehexagon    -140.5521   321.9049  -0.437  0.66238    
ufo_shapelight      -116.2231     9.6254 -12.075  < 2e-16 ***
ufo_shapeother      -129.9767    10.5287 -12.345  < 2e-16 ***
ufo_shapeoval       -132.8764    11.1362 -11.932  < 2e-16 ***
ufo_shapepyramid    -260.5521   321.9049  -0.809  0.41828    
ufo_shaperectangle  -162.1881    13.9149 -11.656  < 2e-16 ***
ufo_shaperound        73.1752   227.7243   0.321  0.74796    
ufo_shapesphere     -126.3735    10.5303 -12.001  < 2e-16 ***
ufo_shapeteardrop   -161.2195    16.4330  -9.811  < 2e-16 ***
ufo_shapetriangle   -133.8959    10.0650 -13.303  < 2e-16 ***
ufo_shapeunknown    -117.0326    10.5124 -11.133  < 2e-16 ***
year                   0.8182     0.2608   3.137  0.00171 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 321.8 on 55353 degrees of freedom
Multiple R-squared:  0.009573,  Adjusted R-squared:  0.009072 
F-statistic: 19.11 on 28 and 55353 DF,  p-value: < 2.2e-16
plot(main_model)
Warning: not plotting observations with leverage one:
  9837, 13306, 22730

Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

With an adjusted r squared of 0.009072, it is less than 0.01233, meaning state was significant to the regression model.

I will now remove ufo shape.

main_model <- lm(encounter_length ~ 
                   state +
                   year,
                 data = clean_data)
summary(main_model)

Call:
lm(formula = encounter_length ~ state + year, data = clean_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-339.18 -232.03 -132.82   55.87 1203.33 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1225.7693   523.3235  -2.342 0.019170 *  
stateal       -60.4034    24.8947  -2.426 0.015255 *  
statear       -90.2796    25.0047  -3.611 0.000306 ***
stateaz        16.1454    21.5489   0.749 0.453712    
stateca       -51.2234    20.6998  -2.475 0.013342 *  
stateco       -77.8969    22.3501  -3.485 0.000492 ***
statect       -64.1654    23.6108  -2.718 0.006577 ** 
statede       -44.1912    33.8063  -1.307 0.191155    
statefl       -65.1053    21.1041  -3.085 0.002037 ** 
statega       -74.6348    22.6561  -3.294 0.000987 ***
statehi       -66.0387    28.1673  -2.345 0.019055 *  
stateia       -64.3795    24.4495  -2.633 0.008462 ** 
stateid       -37.3197    25.5335  -1.462 0.143857    
stateil       -20.3400    21.5484  -0.944 0.345215    
statein       -71.7133    22.5789  -3.176 0.001493 ** 
stateks       -65.5809    24.8397  -2.640 0.008289 ** 
stateky       -53.5959    23.9033  -2.242 0.024953 *  
statela       -63.1245    25.4923  -2.476 0.013281 *  
statema       -70.5100    22.6357  -3.115 0.001840 ** 
statemd       -67.3649    23.6860  -2.844 0.004456 ** 
stateme       -16.0089    25.2535  -0.634 0.526131    
statemi       -65.2121    21.9190  -2.975 0.002930 ** 
statemn       -84.4583    23.2594  -3.631 0.000282 ***
statemo       -68.1092    22.3354  -3.049 0.002294 ** 
statems       -65.5059    27.7525  -2.360 0.018261 *  
statemt       -57.3961    26.2447  -2.187 0.028749 *  
statenc       -71.2913    22.0156  -3.238 0.001204 ** 
statend       -87.1734    39.0199  -2.234 0.025482 *  
statene       -54.7232    27.4392  -1.994 0.046119 *  
statenh       -53.9260    25.6862  -2.099 0.035785 *  
statenj       -63.7440    22.3734  -2.849 0.004386 ** 
statenm       -56.2469    24.4569  -2.300 0.021461 *  
statenv       -36.3735    23.7923  -1.529 0.126321    
stateny       -62.1522    21.3823  -2.907 0.003654 ** 
stateoh       -49.1533    21.6629  -2.269 0.023272 *  
stateok       -73.7225    24.2856  -3.036 0.002401 ** 
stateor       -67.5860    22.0352  -3.067 0.002162 ** 
statepa       -65.4747    21.6002  -3.031 0.002437 ** 
stateri       -64.4375    29.3538  -2.195 0.028153 *  
statesc       -53.7607    23.1783  -2.319 0.020375 *  
statesd       -91.4875    33.2310  -2.753 0.005906 ** 
statetn       -71.1231    22.9176  -3.103 0.001914 ** 
statetx       -76.5846    21.2530  -3.603 0.000314 ***
stateut       -52.6893    24.2857  -2.170 0.030044 *  
stateva       -88.0283    22.5728  -3.900 9.64e-05 ***
statevt       -78.8963    28.4758  -2.771 0.005596 ** 
statewa       -72.2048    21.1169  -3.419 0.000628 ***
statewi       -42.2990    22.6645  -1.866 0.062003 .  
statewv       -83.3426    26.2862  -3.171 0.001522 ** 
statewy      -100.7096    33.0310  -3.049 0.002298 ** 
year            0.7699     0.2605   2.955 0.003127 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 322.7 on 55331 degrees of freedom
Multiple R-squared:  0.00441,   Adjusted R-squared:  0.00351 
F-statistic: 4.901 on 50 and 55331 DF,  p-value: < 2.2e-16
plot(main_model)

The adjusted r squared is 0.00351, less than 0.01233. This means ufo shape is highly significant to the regression model.

I will now remove year.

main_model <- lm(encounter_length ~ 
                   state +
                   ufo_shape,
                 data = clean_data)
summary(main_model)

Call:
lm(formula = encounter_length ~ state + ufo_shape, data = clean_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-454.52 -226.40 -133.69   64.39 1234.65 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)         446.689     22.249  20.077  < 2e-16 ***
stateal             -58.304     24.793  -2.352 0.018694 *  
statear             -92.179     24.902  -3.702 0.000214 ***
stateaz              12.833     21.460   0.598 0.549859    
stateca             -53.211     20.616  -2.581 0.009853 ** 
stateco             -78.137     22.259  -3.510 0.000448 ***
statect             -60.063     23.518  -2.554 0.010653 *  
statede             -45.234     33.669  -1.344 0.179115    
statefl             -63.536     21.021  -3.023 0.002507 ** 
statega             -74.669     22.566  -3.309 0.000937 ***
statehi             -63.121     28.050  -2.250 0.024435 *  
stateia             -65.450     24.350  -2.688 0.007193 ** 
stateid             -39.679     25.428  -1.560 0.118665    
stateil             -20.734     21.461  -0.966 0.333984    
statein             -70.978     22.487  -3.156 0.001598 ** 
stateks             -66.365     24.741  -2.682 0.007312 ** 
stateky             -54.805     23.810  -2.302 0.021350 *  
statela             -64.639     25.390  -2.546 0.010905 *  
statema             -70.437     22.544  -3.124 0.001783 ** 
statemd             -68.489     23.590  -2.903 0.003694 ** 
stateme             -15.907     25.151  -0.632 0.527080    
statemi             -65.106     21.830  -2.982 0.002861 ** 
statemn             -83.783     23.170  -3.616 0.000299 ***
statemo             -68.251     22.247  -3.068 0.002157 ** 
statems             -65.932     27.638  -2.386 0.017057 *  
statemt             -55.645     26.138  -2.129 0.033264 *  
statenc             -71.184     21.927  -3.246 0.001170 ** 
statend             -89.096     38.860  -2.293 0.021865 *  
statene             -56.790     27.328  -2.078 0.037710 *  
statenh             -55.984     25.592  -2.188 0.028706 *  
statenj             -63.540     22.283  -2.851 0.004353 ** 
statenm             -57.976     24.356  -2.380 0.017299 *  
statenv             -38.448     23.694  -1.623 0.104666    
stateny             -63.020     21.296  -2.959 0.003086 ** 
stateoh             -48.691     21.577  -2.257 0.024039 *  
stateok             -74.111     24.189  -3.064 0.002186 ** 
stateor             -69.705     21.943  -3.177 0.001491 ** 
statepa             -64.892     21.516  -3.016 0.002562 ** 
stateri             -62.877     29.241  -2.150 0.031535 *  
statesc             -53.917     23.081  -2.336 0.019496 *  
statesd             -93.255     33.096  -2.818 0.004839 ** 
statetn             -71.297     22.828  -3.123 0.001789 ** 
statetx             -78.020     21.168  -3.686 0.000228 ***
stateut             -52.565     24.186  -2.173 0.029757 *  
stateva             -88.528     22.483  -3.938 8.24e-05 ***
statevt             -78.044     28.354  -2.753 0.005916 ** 
statewa             -72.278     21.029  -3.437 0.000588 ***
statewi             -42.886     22.575  -1.900 0.057475 .  
statewv             -85.264     26.180  -3.257 0.001127 ** 
statewy            -100.665     32.891  -3.061 0.002210 ** 
ufo_shapechevron   -193.436     14.923 -12.962  < 2e-16 ***
ufo_shapecigar     -160.166     12.661 -12.651  < 2e-16 ***
ufo_shapecircle    -116.896     10.160 -11.506  < 2e-16 ***
ufo_shapecone      -124.253     24.552  -5.061 4.19e-07 ***
ufo_shapecrescent  -380.705    321.775  -1.183 0.236758    
ufo_shapecross      -86.739     26.575  -3.264 0.001099 ** 
ufo_shapecylinder  -120.451     13.967  -8.624  < 2e-16 ***
ufo_shapedelta      -20.572    160.950  -0.128 0.898293    
ufo_shapediamond    -91.520     14.683  -6.233 4.61e-10 ***
ufo_shapedisk      -107.076     10.922  -9.804  < 2e-16 ***
ufo_shapedome      -379.797    321.478  -1.181 0.237445    
ufo_shapeegg       -132.229     16.804  -7.869 3.64e-15 ***
ufo_shapefireball  -177.807     10.198 -17.435  < 2e-16 ***
ufo_shapeflare     -363.479    321.415  -1.131 0.258115    
ufo_shapeflash     -211.150     13.694 -15.419  < 2e-16 ***
ufo_shapeformation -101.346     11.821  -8.574  < 2e-16 ***
ufo_shapehexagon   -141.797    321.478  -0.441 0.659158    
ufo_shapelight     -115.523      9.614 -12.016  < 2e-16 ***
ufo_shapeother     -129.490     10.516 -12.313  < 2e-16 ***
ufo_shapeoval      -131.294     11.125 -11.801  < 2e-16 ***
ufo_shapepyramid   -242.906    321.593  -0.755 0.450060    
ufo_shaperectangle -160.617     13.898 -11.557  < 2e-16 ***
ufo_shaperound       64.862    227.387   0.285 0.775454    
ufo_shapesphere    -125.041     10.517 -11.889  < 2e-16 ***
ufo_shapeteardrop  -159.499     16.414  -9.717  < 2e-16 ***
ufo_shapetriangle  -133.062     10.055 -13.233  < 2e-16 ***
ufo_shapeunknown   -114.976     10.502 -10.948  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 321.3 on 55305 degrees of freedom
Multiple R-squared:  0.01348,   Adjusted R-squared:  0.01213 
F-statistic: 9.946 on 76 and 55305 DF,  p-value: < 2.2e-16
plot(main_model)
Warning: not plotting observations with leverage one:
  9837, 22730

Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

With an adjusted r squared of 0.01212, it is less than 0.01233, albeit only a little. This means it is only slightly significant to the regression model.

As every variable is significant to the regression model, this is the final result.

main_model <- lm(encounter_length ~ 
                   state +
                   ufo_shape +
                   year,
                 data = clean_data)
summary(main_model)

Call:
lm(formula = encounter_length ~ state + ufo_shape + year, data = clean_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-456.53 -225.83 -134.67   65.02 1230.68 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)        -1389.3453   526.5193  -2.639 0.008324 ** 
stateal              -58.2260    24.7905  -2.349 0.018842 *  
statear              -91.1422    24.9014  -3.660 0.000252 ***
stateaz               14.0014    21.4606   0.652 0.514131    
stateca              -52.3896    20.6154  -2.541 0.011047 *  
stateco              -77.2442    22.2582  -3.470 0.000520 ***
statect              -60.8933    23.5165  -2.589 0.009617 ** 
statede              -45.2040    33.6654  -1.343 0.179360    
statefl              -63.8873    21.0187  -3.040 0.002370 ** 
statega              -74.6372    22.5634  -3.308 0.000941 ***
statehi              -63.0696    28.0473  -2.249 0.024536 *  
stateia              -65.3156    24.3479  -2.683 0.007308 ** 
stateid              -39.7374    25.4259  -1.563 0.118090    
stateil              -20.1451    21.4598  -0.939 0.347871    
statein              -70.5995    22.4849  -3.140 0.001691 ** 
stateks              -65.6070    24.7396  -2.652 0.008006 ** 
stateky              -54.5414    23.8073  -2.291 0.021970 *  
statela              -64.0031    25.3883  -2.521 0.011706 *  
statema              -70.6239    22.5422  -3.133 0.001731 ** 
statemd              -68.5857    23.5878  -2.908 0.003643 ** 
stateme              -15.9830    25.1483  -0.636 0.525073    
statemi              -64.4503    21.8286  -2.953 0.003153 ** 
statemn              -83.6442    23.1678  -3.610 0.000306 ***
statemo              -67.6543    22.2456  -3.041 0.002357 ** 
statems              -65.1316    27.6364  -2.357 0.018440 *  
statemt              -55.5232    26.1351  -2.124 0.033635 *  
statenc              -71.6695    21.9254  -3.269 0.001081 ** 
statend              -87.8962    38.8573  -2.262 0.023700 *  
statene              -55.8678    27.3269  -2.044 0.040916 *  
statenh              -56.4081    25.5898  -2.204 0.027506 *  
statenj              -63.2229    22.2810  -2.838 0.004548 ** 
statenm              -57.3054    24.3544  -2.353 0.018627 *  
statenv              -37.1029    23.6951  -1.566 0.117391    
stateny              -62.4400    21.2948  -2.932 0.003367 ** 
stateoh              -48.4539    21.5752  -2.246 0.024720 *  
stateok              -73.8053    24.1865  -3.052 0.002278 ** 
stateor              -68.7648    21.9429  -3.134 0.001726 ** 
statepa              -64.9443    21.5136  -3.019 0.002539 ** 
stateri              -63.2308    29.2383  -2.163 0.030576 *  
statesc              -54.9783    23.0808  -2.382 0.017223 *  
statesd              -93.1074    33.0926  -2.814 0.004902 ** 
statetn              -71.0020    22.8254  -3.111 0.001868 ** 
statetx              -77.1537    21.1671  -3.645 0.000268 ***
stateut              -52.1907    24.1839  -2.158 0.030926 *  
stateva              -88.6317    22.4805  -3.943 8.07e-05 ***
statevt              -79.6006    28.3545  -2.807 0.004997 ** 
statewa              -71.1381    21.0292  -3.383 0.000718 ***
statewi              -42.5349    22.5729  -1.884 0.059525 .  
statewv              -85.3441    26.1775  -3.260 0.001114 ** 
statewy              -99.1895    32.8900  -3.016 0.002564 ** 
ufo_shapechevron    -192.6857    14.9230 -12.912  < 2e-16 ***
ufo_shapecigar      -159.3453    12.6616 -12.585  < 2e-16 ***
ufo_shapecircle     -117.9298    10.1630 -11.604  < 2e-16 ***
ufo_shapecone       -124.3197    24.5491  -5.064 4.12e-07 ***
ufo_shapecrescent   -371.2295   321.7539  -1.154 0.248600    
ufo_shapecross       -87.4039    26.5726  -3.289 0.001005 ** 
ufo_shapecylinder   -120.4151    13.9653  -8.622  < 2e-16 ***
ufo_shapedelta       -10.4960   160.9595  -0.065 0.948008    
ufo_shapediamond     -91.5524    14.6815  -6.236 4.52e-10 ***
ufo_shapedisk       -105.8401    10.9265  -9.687  < 2e-16 ***
ufo_shapedome       -369.7784   321.4583  -1.150 0.250019    
ufo_shapeegg        -131.2411    16.8047  -7.810 5.83e-15 ***
ufo_shapefireball   -179.0715    10.2038 -17.550  < 2e-16 ***
ufo_shapeflare      -355.2480   321.3916  -1.105 0.269016    
ufo_shapeflash      -211.5616    13.6930 -15.450  < 2e-16 ***
ufo_shapeformation  -101.8383    11.8203  -8.616  < 2e-16 ***
ufo_shapehexagon    -132.6933   321.4560  -0.413 0.679763    
ufo_shapelight      -115.9462     9.6141 -12.060  < 2e-16 ***
ufo_shapeother      -128.9214    10.5167 -12.259  < 2e-16 ***
ufo_shapeoval       -131.5514    11.1246 -11.825  < 2e-16 ***
ufo_shapepyramid    -233.9934   321.5709  -0.728 0.466827    
ufo_shaperectangle  -160.6569    13.8968 -11.561  < 2e-16 ***
ufo_shaperound        74.9017   227.3824   0.329 0.741848    
ufo_shapesphere     -125.7420    10.5181 -11.955  < 2e-16 ***
ufo_shapeteardrop   -159.5628    16.4122  -9.722  < 2e-16 ***
ufo_shapetriangle   -132.5151    10.0553 -13.179  < 2e-16 ***
ufo_shapeunknown    -115.2631    10.5015 -10.976  < 2e-16 ***
year                   0.9149     0.2621   3.490 0.000483 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 321.2 on 55304 degrees of freedom
Multiple R-squared:  0.0137,    Adjusted R-squared:  0.01233 
F-statistic: 9.977 on 77 and 55304 DF,  p-value: < 2.2e-16
plot(main_model)
Warning: not plotting observations with leverage one:
  9837, 22730

Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

Analysis of Linear Regression Model

The model equation for the linear regression is y = b0 + b1x1 + b2x2 + b3x3, where y is encounter length, b0 is the intercept (-1389.35), b1 is the state (varies by each state), b2 is the ufo shape(varies by each shape), and b3 is the year (0.9149).

The model equation gives two significant takeaways. First, for each singular increase in year, the encounter length increases by 0.9149 seconds. Second, because each state and ufo shape have different values, each one is significantly important in its own manner.

The P value was equal to < 2.2e-16, meaning there is a strong association between encounter length and the state, ufo shape, and year of the report.

The adjusted r squared is 0.01233. This means the regression model explains only 1.233% of variance in the data. This number is very low, and means there is little relationship between the encounter length and the state, ufo shape, and year. Another way of interpreting this is there is a multitude of other factors outside this data set which have a bigger impact on the encounter length.

The diagnostic plots:

Residuals vs. Fitted - The points are heavily bunched together in 4 specific locations with only a couple outside these zones. The red line appears to be constant for the beginning half, however it begins to decrease towards the latter half of the graph.

Q-Q Residuals - Much of the graph does not follow the reference line, but rather floats above it. The line of data displays a staircase shape where it rises and moves a multitude of times.

Scale-Location - The line shows a steady increase throughout the entire graph. The data points appear to cluster in a number of different locations.

Residuals vs. Leverage - Majority of the data points sit at the 0.0 marker with a couple outliers appearing around the 0.2 mark, 2 appearing at the 0.4 mark, and 1 at the 1.0 mark. The line dips in the beginning but gradually rises for the latter half of the graph.

Now I will create the two visualizations.

I will start with a hex graph visualization. My goal is to test for the relationship between encounter length and the year the reporting was made. I found the hex graph by going through the pop up menu when typing in geom_

ggplot(clean_data, aes(x = year, y = encounter_length)) + # Set the x and y 
  theme_minimal() +
  geom_hex() + # Make a hex graph. I found this just by trying out the different options under geom.
  scale_fill_viridis() +
  labs(title = "Year vs. Encounter Length (Seconds) for UFO Reports In the U.S.",
       caption = "The National UFO Reporting Center (NUFORC)",
       x = "Year",
       y = "Encounter Length (Seconds)",
       fill = "Number of Reports"
       )

ggplotly() # Add interactivity

Analysis of Visualization

The visualization I chose to make is a hex graph depicting the relationship between the year and the encounter length in seconds for UFO sightings in the United States. From the graph, it is clear the number of reports increases throughout the years, with there being no sightings above 300 in the early 1990’s to the mid 2010’s having multiple years with roughly 1200 reportings. This can be understood through the expansion of technology. As the years go by, the creation and wide distribution of technology allows for more people to have the means to report their findings. From this graph, I don’t see any surprises. My hypothesis going into the graph was there would be a constant upwards trend in the amount of reports versus the years and the graph corroborates this notion. In regards to stuff I couldn’t include, there wasn’t anything for this graph. However, I did wish to make a 3d model, however I did not have enough quantitative variables to achieve the desired result.

Now I will create the second visualization; a density plot analyzing the top 5 states with the most reports. First, I must figure out which states have the most reporting.

clean_data |>
  group_by(state) |> # Group by the staters
  count(state) |> # Count the amount of times the state is in the data
  arrange(desc(n)) # Arrange the count in descending order
# A tibble: 50 × 2
# Groups:   state [50]
   state     n
   <chr> <int>
 1 ca     7644
 2 fl     3405
 3 wa     3356
 4 tx     2828
 5 ny     2457
 6 az     2106
 7 il     2103
 8 pa     2011
 9 oh     1911
10 mi     1587
# ℹ 40 more rows

Now I will filter for the top 5 states and then make the density plot. I chose to facet wrap the states, and found this in the epidemiologist r handbook in chapter 30 ggplot basics, sub chapter 30.6 Facets / Small-multiples

filter_five_states <- clean_data |>
  filter(state %in% c("ca", "fl", "wa", "tx", "ny")) # Filter for the top 5 states found in the previous code

ggplot(filter_five_states, aes(x = year, fill = state)) +
  geom_density() + # Make a density chart
  theme_minimal() +
  facet_wrap(~state) + # Seperate the states so they have their own repsective graph
  scale_fill_brewer(palette = "Accent") + # Change the palette colours
  labs(fill = "State",
       title = "Yearly Reporting Distribution of UFO Sightings",
       x = "Year",
       y = "Density",
       caption = "The National UFO Reporting Center (NUFORC)")

ggplotly() # Add interactivity

Analysis of Visualization

The graph I chose to make was a density chart facet wrapped so each state got their own clear visualization. The y axis shows the density ranging from 0 to1 while the x is divided into the years. When analyzing the graph, the one surprise I have was the amount of dips in terms of reporting. I expected for the percentage of reportings to continuously increase over the years with little to no dips. This is based on the belief of technological advancements as well as social media attention seeking crazes. I imagine with the development of social media, people would be more inclined to report their findings in an effort to gain some form of internet popularity, however, the graph displays the reported dropping in density for some years. The biggest example of this was the graph for Washington where it appear the mid 2000’s saw a major drop in terms of reporting. Another shock from the graph is the major spike towards the end for Florida. While I do not know what might have caused this, it is definitely unusual and something to take into account should further investigations be conducted. In regards to what I couldn’t get to work, I first tried to make a map visualization for this depicting the encounter length of reports across the entire United States. However, I had a hard time making the information and the graph look legible.

Sources

  • Hibberd, J. (2026, May 8). Trump UFO Files Released: The 5 Strangest Photos, Videos. The Hollywood Reporter. https://www.hollywoodreporter.com/news/general-news/trump-ufo-files-revelations-1236590123/#respond

  • The Epidemiologist R Handbook. (2024). Epirhandbook.com. https://www.epirhandbook.com/en/

  • TWO−LETTER STATE AND TERRITORY ABBREVIATIONS. (2022). Faa.gov. https://www.faa.gov/air_traffic/publications/atpubs/cnt_html/appendix_a.html

  • Shostak, S. (2026, May 8). unidentified flying objectEncyclopedia Britannica. https://www.britannica.com/topic/unidentified-flying-object