Introduction

My research question is: Is there an association between the state and the shape of the UFO sighted? To answer this question I will use the "UFO Sightings" data set. I found this data set on Kaggle, but the data was originally collected by NUFORC, the National UFO Reporting Center, a non-profit corporation. The data set has 88,875 observations and 11 variables. The variables of importance are duration_(seconds), state, country, shape, and datetime. Duration_(seconds) gives the length of time the UFO was visible, in seconds. State is the state the observation was reported from, and country is the country the sighting came from. Shape indicates the shape of the UFO sighted, for example circle, sphere, or triangle. The datetime variable gives the month, day, year, and time of the observation; I will use it to create a year_only variable, which drops the month, day, and time.

The dataset can be accessed here: https://www.kaggle.com/datasets/NUFORC/ufo-sightings

library(tidyverse)
## Warning: package 'purrr' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(dplyr)
library(ggridges)
## Warning: package 'ggridges' was built under R version 4.4.2
library(leaflet)
## Warning: package 'leaflet' was built under R version 4.4.3
library(sf)
## Warning: package 'sf' was built under R version 4.4.3
## Linking to GEOS 3.13.0, GDAL 3.10.1, PROJ 9.5.1; sf_use_s2() is TRUE
setwd("~/Data 101")
UFO <- read_csv("complete.csv")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
## Rows: 88875 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): datetime, city, state, country, shape, duration (hours/min), comme...
## dbl  (1): duration (seconds)
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(UFO)
## # A tibble: 6 × 11
##   datetime city  state country shape `duration (seconds)` `duration (hours/min)`
##   <chr>    <chr> <chr> <chr>   <chr>                <dbl> <chr>                 
## 1 10/10/1… san … tx    us      cyli…                 2700 45 minutes            
## 2 10/10/1… lack… tx    <NA>    light                 7200 1-2 hrs               
## 3 10/10/1… ches… <NA>  gb      circ…                   20 20 seconds            
## 4 10/10/1… edna  tx    us      circ…                   20 1/2 hour              
## 5 10/10/1… kane… hi    us      light                  900 15 minutes            
## 6 10/10/1… bris… tn    us      sphe…                  300 5 minutes             
## # ℹ 4 more variables: comments <chr>, `date posted` <chr>, latitude <chr>,
## #   longitude <chr>
dim(UFO)
## [1] 88875    11

Cleaning data

Replacing the spaces in the column names with underscores.

names(UFO) <- gsub(" ","_",names(UFO))
head(UFO)
## # A tibble: 6 × 11
##   datetime city  state country shape `duration_(seconds)` `duration_(hours/min)`
##   <chr>    <chr> <chr> <chr>   <chr>                <dbl> <chr>                 
## 1 10/10/1… san … tx    us      cyli…                 2700 45 minutes            
## 2 10/10/1… lack… tx    <NA>    light                 7200 1-2 hrs               
## 3 10/10/1… ches… <NA>  gb      circ…                   20 20 seconds            
## 4 10/10/1… edna  tx    us      circ…                   20 1/2 hour              
## 5 10/10/1… kane… hi    us      light                  900 15 minutes            
## 6 10/10/1… bris… tn    us      sphe…                  300 5 minutes             
## # ℹ 4 more variables: comments <chr>, date_posted <chr>, latitude <chr>,
## #   longitude <chr>

I made two attempts at fixing the date; only the one below is kept here. Note the warning, though: dmy_hm() fails on 52,733 rows because the datetime column is actually in month/day/year order, so any date whose day of the month is greater than 12 cannot be read as day/month. Dates with a day of 12 or less do parse, but with the day and month silently swapped; the preview below only looks right because those rows all have a day and month of 10. A sketch of a month/day/year parse follows the preview.

UFO$datetime <- dmy_hm(UFO$datetime)
## Warning: 52733 failed to parse.
UFO$year_only <- year(UFO$datetime)

head(UFO)
## # A tibble: 6 × 12
##   datetime            city              state country shape `duration_(seconds)`
##   <dttm>              <chr>             <chr> <chr>   <chr>                <dbl>
## 1 1949-10-10 20:30:00 san marcos        tx    us      cyli…                 2700
## 2 1949-10-10 21:00:00 lackland afb      tx    <NA>    light                 7200
## 3 1955-10-10 17:00:00 chester (uk/engl… <NA>  gb      circ…                   20
## 4 1956-10-10 21:00:00 edna              tx    us      circ…                   20
## 5 1960-10-10 20:00:00 kaneohe           hi    us      light                  900
## 6 1961-10-10 19:00:00 bristol           tn    us      sphe…                  300
## # ℹ 6 more variables: `duration_(hours/min)` <chr>, comments <chr>,
## #   date_posted <chr>, latitude <chr>, longitude <chr>, year_only <dbl>
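
Since the column is month/day/year, here is a minimal sketch of parsing it with lubridate's mdy_hm() instead. It assumes it runs on the raw character column straight from read_csv() (i.e. in place of the dmy_hm() call above) and should parse far more of the rows; the rest of this report was produced with the dmy_hm() version.

# Re-read so datetime is still character, then parse it as month/day/year
UFO_mdy <- read_csv("complete.csv")
names(UFO_mdy) <- gsub(" ", "_", names(UFO_mdy))
UFO_mdy$datetime  <- mdy_hm(UFO_mdy$datetime)   # month/day/year hour:minute
UFO_mdy$year_only <- year(UFO_mdy$datetime)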

Chi-Square Test for Association/Independence

Checking to see which variables are numeric/character/integer.

str(UFO)
## spc_tbl_ [88,875 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ datetime            : POSIXct[1:88875], format: "1949-10-10 20:30:00" "1949-10-10 21:00:00" ...
##  $ city                : chr [1:88875] "san marcos" "lackland afb" "chester (uk/england)" "edna" ...
##  $ state               : chr [1:88875] "tx" "tx" NA "tx" ...
##  $ country             : chr [1:88875] "us" NA "gb" "us" ...
##  $ shape               : chr [1:88875] "cylinder" "light" "circle" "circle" ...
##  $ duration_(seconds)  : num [1:88875] 2700 7200 20 20 900 300 180 1200 180 120 ...
##  $ duration_(hours/min): chr [1:88875] "45 minutes" "1-2 hrs" "20 seconds" "1/2 hour" ...
##  $ comments            : chr [1:88875] "This event took place in early fall around 1949-50. It occurred after a Boy Scout meeting in the Baptist Church"| __truncated__ "1949 Lackland AFB&#44 TX.  Lights racing across the sky &amp; making 90 degree turns on a dime." "Green/Orange circular disc over Chester&#44 England" "My older brother and twin sister were leaving the only Edna theater at about 9 PM&#44...we had our bikes and I "| __truncated__ ...
##  $ date_posted         : chr [1:88875] "4/27/2004" "12/16/2005" "1/21/2008" "1/17/2004" ...
##  $ latitude            : chr [1:88875] "29.8830556" "29.38421" "53.2" "28.9783333" ...
##  $ longitude           : chr [1:88875] "-97.9411111" "-98.581082" "-2.916667" "-96.6458333" ...
##  $ year_only           : num [1:88875] 1949 1949 1955 1956 1960 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   datetime = col_character(),
##   ..   city = col_character(),
##   ..   state = col_character(),
##   ..   country = col_character(),
##   ..   shape = col_character(),
##   ..   `duration (seconds)` = col_double(),
##   ..   `duration (hours/min)` = col_character(),
##   ..   comments = col_character(),
##   ..   `date posted` = col_character(),
##   ..   latitude = col_character(),
##   ..   longitude = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Removing the NAs from my desired variables

tabledata1 <- UFO |>
  select(c("state", "shape")) |>
  filter(!is.na(state), !is.na(shape))
head(tabledata1)
## # A tibble: 6 × 2
##   state shape   
##   <chr> <chr>   
## 1 tx    cylinder
## 2 tx    light   
## 3 tx    circle  
## 4 hi    light   
## 5 tn    sphere  
## 6 ct    disk
table1 <- table(tabledata1$state, tabledata1$shape)
head(table1)
##     
##      changed changing chevron cigar circle cone crescent cross cylinder delta
##   ab       0        4       2    11     34    4        0     1        6     0
##   ak       0        6       0     8     36    2        0     0        6     0
##   al       0       16       9    16     75    1        0     3       14     0
##   ar       0       19      12    28     72    1        0     2       17     0
##   az       0       84      34    58    265   15        0     8       46     0
##   bc       0       15       9    15     72    1        0     3       16     1
##     
##      diamond disk dome egg fireball flare flash formation hexagon light other
##   ab      10   24    0   4       14     0     7        17       0    74    30
##   ak       6   33    0   1       20     0    13         7       0   106    24
##   al      16   52    0  12       49     0    15        16       0   149    48
##   ar       9   48    0   6       30     0    15        13       0   157    44
##   az      35  154    0  25      195     0    46       131       0   663   236
##   bc      12   68    0   9       45     0    22        25       0   223    94
##     
##      oval pyramid rectangle round sphere teardrop triangle unknown
##   ab   16       0         7     0     24        3       37      38
##   ak   24       0         5     0     29        1       29      23
##   al   38       0        12     0     57        7       88      51
##   ar   26       0        10     0     34        8       98      62
##   az  124       0        41     0    184       27      273     201
##   bc   33       0         8     0     43        5       73      65

Now I am removing the shapes whose columns in the table above are all zero; filtering those shapes out of the data drops their columns from the new table. (A sketch of doing this programmatically follows the output below.)

tabledata2 <- UFO |>
  select(c("state", "shape")) |>
  filter(!shape %in% c("changed", "crescent", "delta", "dome", "flare", "hexagon", "pyramid", "round"))
tablethingy2 <- table(tabledata2$state, tabledata2$shape)
head(tablethingy2)
##     
##      changing chevron cigar circle cone cross cylinder diamond disk egg
##   ab        4       2    11     34    4     1        6      10   24   4
##   ak        6       0     8     36    2     0        6       6   33   1
##   al       16       9    16     75    1     3       14      16   52  12
##   ar       19      12    28     72    1     2       17       9   48   6
##   az       84      34    58    265   15     8       46      35  154  25
##   bc       15       9    15     72    1     3       16      12   68   9
##     
##      fireball flash formation light other oval rectangle sphere teardrop
##   ab       14     7        17    74    30   16         7     24        3
##   ak       20    13         7   106    24   24         5     29        1
##   al       49    15        16   149    48   38        12     57        7
##   ar       30    15        13   157    44   26        10     34        8
##   az      195    46       131   663   236  124        41    184       27
##   bc       45    22        25   223    94   33         8     43        5
##     
##      triangle unknown
##   ab       37      38
##   ak       29      23
##   al       88      51
##   ar       98      62
##   az      273     201
##   bc       73      65
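
For reference, a minimal sketch of dropping the all-zero columns programmatically instead of listing the shapes by hand; it indexes table1 from above like a matrix.

# Keep only the shape columns whose total count across all states is nonzero
table1_nonzero <- table1[, colSums(table1) > 0]
colnames(table1_nonzero)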

Performing the actual test

chisqrUFO <- chisq.test(tablethingy2)
## Warning in chisq.test(tablethingy2): Chi-squared approximation may be incorrect
chisqrUFO
## 
##  Pearson's Chi-squared test
## 
## data:  tablethingy2
## X-squared = 2295.7, df = 1340, p-value < 2.2e-16

\(H_0\): There is no association between state and the shape of the UFO reported

\(H_a\): There is at least some association between the state and the shape of the UFO reported.

Because the p-value is below 2.2e-16, far smaller than any conventional significance level, we reject the null hypothesis: there is an association between the state and the shape of the UFO reported. The "Chi-squared approximation may be incorrect" warning comes from some cells of the table having small expected counts, so the p-value is approximate, but it is so extreme that the conclusion is unlikely to change; one way to double-check is a simulated p-value, sketched below.
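
A minimal sketch of that check, using chisq.test()'s built-in Monte Carlo option on tablethingy2 from above (it can take a little while to run):

# Simulated p-value avoids leaning on the chi-squared approximation;
# B is the number of simulated tables
chisq.test(tablethingy2, simulate.p.value = TRUE, B = 2000)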

Graphing

Filtering the data for my graph by removing the shapes that were never reported in the United States.

graphFiltered <- UFO |>
  select(c("state", "shape", "country", "year_only")) |>
  filter(!shape %in% c("changed", "crescent", "delta", "dome", "flare", "hexagon", "pyramid", "round"), country == "us")

Creating a bar graph

Finding the states with the highest counts, which I will then graph.

statecounts <- count(graphFiltered, state)

statecounts |>
  arrange(desc(n))
## # A tibble: 52 × 2
##    state     n
##    <chr> <int>
##  1 ca     9573
##  2 wa     4290
##  3 fl     4155
##  4 tx     3741
##  5 ny     3234
##  6 il     2698
##  7 az     2617
##  8 pa     2518
##  9 oh     2464
## 10 mi     1980
## # ℹ 42 more rows

There was definitely a more efficient way to do this, but oh well.
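
For example, a minimal sketch of getting the same top states in one dplyr pipeline (using the graphFiltered data from above):

# Count sightings per state and keep the four largest counts
top_states <- graphFiltered |>
  count(state, sort = TRUE) |>
  slice_max(n, n = 4)
top_states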

Filtering for the states with the highest counts.

UFOGraphData <- graphFiltered |>
  filter(shape %in% c("circle", "light", "triangle", "sphere", "disk", "fireball"),
         year_only %in% c(1980, 1990, 2000, 2010),
         state %in% c("ca", "wa", "fl", "tx"))
head(UFOGraphData)
## # A tibble: 6 × 4
##   state shape  country year_only
##   <chr> <chr>  <chr>       <dbl>
## 1 tx    sphere us           1980
## 2 ca    disk   us           2010
## 3 ca    light  us           2000
## 4 ca    light  us           2000
## 5 fl    sphere us           2010
## 6 ca    circle us           2010

Checking to see how many years are included. If there aren't that many I will facet wrap by year.

unique(UFOGraphData$year_only)
## [1] 1980 2010 2000 1990

There weren’t that many years, so I facet wrapped by year.

bargraph <- ggplot(UFOGraphData, aes(state, fill = shape)) +
  geom_bar() + 
  facet_wrap(~year_only) +
  scale_fill_manual(values = c("#1aa21c", "#b12020", "#561aa2", "#b12099", "#1a94a2", "#bf7d0a"),
                    labels = c("circle", "disk", "fireball", "light", "sphere", "triangle"), 
                    name = "Shape of UFO Reported") +
  labs(title = "Counts for each shape reported in the four states with the most sightings")
bargraph

Creating a leaflet

Filtering for the data that will go into the leaflet: US sightings from 1980, 1990, 2000, and 2010, keeping the columns needed for the coordinates and the popups.

leafletdata <- UFO |>
  select("country", "state", "shape", "duration_(seconds)", "latitude",
         "longitude", "year_only", "duration_(hours/min)") |>
  filter(country == "us", year_only %in% c(1980, 1990, 2000, 2010))
head(leafletdata)
## # A tibble: 6 × 8
##   country state shape    `duration_(seconds)` latitude   longitude   year_only
##   <chr>   <chr> <chr>                   <dbl> <chr>      <chr>           <dbl>
## 1 us      tx    sphere                    180 29.7630556 -95.3630556      1980
## 2 us      tx    unknown                   300 32.7833333 -96.8000000      1980
## 3 us      nh    light                     300 42.9955556 -71.4552778      1980
## 4 us      ga    triangle                  120 31.7058333 -83.6533333      1990
## 5 us      mo    oval                      180 37.7241667 -89.8611111      2000
## 6 us      fl    other                      10 26.5250000 -80.0666667      2000
## # ℹ 1 more variable: `duration_(hours/min)` <chr>

Setting the latitude and longitude to numeric so leaflet can use them as coordinates (duration_(seconds) was already read in as numeric); a sketch is below.
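
The conversion itself did not make it into the output above; a minimal sketch of how it could be done on leafletdata from the previous chunk (malformed values become NA with a coercion warning):

# Coordinates were read in as character, so coerce them to numeric for leaflet
leafletdata$latitude  <- as.numeric(leafletdata$latitude)
leafletdata$longitude <- as.numeric(leafletdata$longitude)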

Creating the popup for the leaflet

popupUFO <- paste0(
      "<b>Year: </b>", leafletdata$year_only, "<br>",
      "<b>Duration: </b>", leafletdata$`duration_(hours/min)`, "<br>",
      "<b>State: </b>", leafletdata$state, "<br>",
       "<b>Shape: </b>", leafletdata$shape, "<br>"
    )

Graphing the actual leaflet.

One thing to watch out for: setView() takes the longitude first and the latitude second, and longitudes in the western hemisphere are negative. With the two values swapped (as I originally had them) the map spawns over the Arctic Ocean; with them in the right order, as below, it opens over the United States.

leaflet() |>
  setView(lng = -106.5348, lat = 38.7946, zoom = 5) |>  # western-hemisphere longitude is negative
  addProviderTiles("Esri.NatGeoWorldMap") |>
  addCircles(
   data = leafletdata,
   color = "black",
   fillColor = "white",
   fillOpacity = 0.25,
   popup = popupUFO
  )
## Assuming "longitude" and "latitude" are longitude and latitude, respectively
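
A small variant that passes the coordinate columns to addCircles() explicitly, which avoids the "Assuming ..." message above (same leafletdata and popupUFO as before):

# Explicit lng/lat, so leaflet does not have to guess the coordinate columns
leaflet(leafletdata) |>
  setView(lng = -106.5348, lat = 38.7946, zoom = 5) |>
  addProviderTiles("Esri.NatGeoWorldMap") |>
  addCircles(
    lng = ~longitude, lat = ~latitude,
    color = "black", fillColor = "white", fillOpacity = 0.25,
    popup = popupUFO
  )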

Conclusion

To answer my research question I performed a chi-square test with the variables "state" and "shape." In my analysis I found that the shape of the UFO is associated with the state it was reported from: depending on the state, the most commonly reported shape differs. From my first graph I found that in the states with the highest counts of reports, the UFO is most often reported simply as a light rather than a specific shape. In my second graph, the leaflet, I found that the distribution of UFO sightings closely resembles a map of the USA at night, or a population density map of the USA. In my opinion this suggests that most UFO sightings are actually due to something humans are doing: a reflection of light, a drone, a plane, etc. Because of this, in the future I could try to find a dataset that maps the activity of planes and helicopters to see if there is any correlation. An alternative would be a dataset that tracks electricity use.

Works Cited

UFO Sightings. 13 Nov. 2019, www.kaggle.com/datasets/NUFORC/ufo-sightings.

shout out to my cat for being present during the completion of both my projects

Thank you professor! Have a good summer <3