My research question is: Is there an association between the state and the shape of the UFO sighted? To answer this question I will use the “UFO Sightings” data set. I found this data set on Kaggle, but the data was sourced from NUFORC. NUFORC stands for “National UFO Reporting Center,” a non-profit corporation. The data set has 88875 observations and 12 variables. The variables of importance include duration_(seconds), state, country, shape, and datetime. Duration_(seconds) provides the length of time the UFO was visible in seconds. State is the state the observation was reported from. Country includes what country the sighting was from. Shape indicates what shape the UFO sighted was, for example, circle, sphere, or square. The datetime variable gives the month, day, year, and time of the observation. This variable will be used to create the year_only variable, which will exclude the month, day, and time.
The dataset can be accessed here: https://www.kaggle.com/datasets/NUFORC/ufo-sightings
library(tidyverse)
## Warning: package 'purrr' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(dplyr)
library(ggridges)
## Warning: package 'ggridges' was built under R version 4.4.2
library(leaflet)
## Warning: package 'leaflet' was built under R version 4.4.3
library(sf)
## Warning: package 'sf' was built under R version 4.4.3
## Linking to GEOS 3.13.0, GDAL 3.10.1, PROJ 9.5.1; sf_use_s2() is TRUE
setwd("~/Data 101")
UFO <- read_csv("complete.csv")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 88875 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): datetime, city, state, country, shape, duration (hours/min), comme...
## dbl (1): duration (seconds)
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(UFO)
## # A tibble: 6 × 11
## datetime city state country shape `duration (seconds)` `duration (hours/min)`
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 10/10/1… san … tx us cyli… 2700 45 minutes
## 2 10/10/1… lack… tx <NA> light 7200 1-2 hrs
## 3 10/10/1… ches… <NA> gb circ… 20 20 seconds
## 4 10/10/1… edna tx us circ… 20 1/2 hour
## 5 10/10/1… kane… hi us light 900 15 minutes
## 6 10/10/1… bris… tn us sphe… 300 5 minutes
## # ℹ 4 more variables: comments <chr>, `date posted` <chr>, latitude <chr>,
## # longitude <chr>
dim(UFO)
## [1] 88875 11
replacing spaces with underscores
names(UFO) <- gsub(" ","_",names(UFO))
head(UFO)
## # A tibble: 6 × 11
## datetime city state country shape `duration_(seconds)` `duration_(hours/min)`
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 10/10/1… san … tx us cyli… 2700 45 minutes
## 2 10/10/1… lack… tx <NA> light 7200 1-2 hrs
## 3 10/10/1… ches… <NA> gb circ… 20 20 seconds
## 4 10/10/1… edna tx us circ… 20 1/2 hour
## 5 10/10/1… kane… hi us light 900 15 minutes
## 6 10/10/1… bris… tn us sphe… 300 5 minutes
## # ℹ 4 more variables: comments <chr>, date_posted <chr>, latitude <chr>,
## # longitude <chr>
Both attempts to fix to the date, the second one works the first one doesn’t, but I still want the code so I’m keeping it in.
UFO$datetime <- dmy_hm(UFO$datetime)
## Warning: 52733 failed to parse.
UFO$year_only <- year(UFO$datetime)
head(UFO)
## # A tibble: 6 × 12
## datetime city state country shape `duration_(seconds)`
## <dttm> <chr> <chr> <chr> <chr> <dbl>
## 1 1949-10-10 20:30:00 san marcos tx us cyli… 2700
## 2 1949-10-10 21:00:00 lackland afb tx <NA> light 7200
## 3 1955-10-10 17:00:00 chester (uk/engl… <NA> gb circ… 20
## 4 1956-10-10 21:00:00 edna tx us circ… 20
## 5 1960-10-10 20:00:00 kaneohe hi us light 900
## 6 1961-10-10 19:00:00 bristol tn us sphe… 300
## # ℹ 6 more variables: `duration_(hours/min)` <chr>, comments <chr>,
## # date_posted <chr>, latitude <chr>, longitude <chr>, year_only <dbl>
Checking to see which variables are numeric/character/integer.
str(UFO)
## spc_tbl_ [88,875 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ datetime : POSIXct[1:88875], format: "1949-10-10 20:30:00" "1949-10-10 21:00:00" ...
## $ city : chr [1:88875] "san marcos" "lackland afb" "chester (uk/england)" "edna" ...
## $ state : chr [1:88875] "tx" "tx" NA "tx" ...
## $ country : chr [1:88875] "us" NA "gb" "us" ...
## $ shape : chr [1:88875] "cylinder" "light" "circle" "circle" ...
## $ duration_(seconds) : num [1:88875] 2700 7200 20 20 900 300 180 1200 180 120 ...
## $ duration_(hours/min): chr [1:88875] "45 minutes" "1-2 hrs" "20 seconds" "1/2 hour" ...
## $ comments : chr [1:88875] "This event took place in early fall around 1949-50. It occurred after a Boy Scout meeting in the Baptist Church"| __truncated__ "1949 Lackland AFB, TX. Lights racing across the sky & making 90 degree turns on a dime." "Green/Orange circular disc over Chester, England" "My older brother and twin sister were leaving the only Edna theater at about 9 PM,...we had our bikes and I "| __truncated__ ...
## $ date_posted : chr [1:88875] "4/27/2004" "12/16/2005" "1/21/2008" "1/17/2004" ...
## $ latitude : chr [1:88875] "29.8830556" "29.38421" "53.2" "28.9783333" ...
## $ longitude : chr [1:88875] "-97.9411111" "-98.581082" "-2.916667" "-96.6458333" ...
## $ year_only : num [1:88875] 1949 1949 1955 1956 1960 ...
## - attr(*, "spec")=
## .. cols(
## .. datetime = col_character(),
## .. city = col_character(),
## .. state = col_character(),
## .. country = col_character(),
## .. shape = col_character(),
## .. `duration (seconds)` = col_double(),
## .. `duration (hours/min)` = col_character(),
## .. comments = col_character(),
## .. `date posted` = col_character(),
## .. latitude = col_character(),
## .. longitude = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
Removing the NAs from my desired variables
tabledata1 <- UFO |>
select(c("state", "shape")) |>
filter(!is.na(state), !is.na(shape))
head(tabledata1)
## # A tibble: 6 × 2
## state shape
## <chr> <chr>
## 1 tx cylinder
## 2 tx light
## 3 tx circle
## 4 hi light
## 5 tn sphere
## 6 ct disk
table1 <- table(tabledata1$state, tabledata1$shape)
head(table1)
##
## changed changing chevron cigar circle cone crescent cross cylinder delta
## ab 0 4 2 11 34 4 0 1 6 0
## ak 0 6 0 8 36 2 0 0 6 0
## al 0 16 9 16 75 1 0 3 14 0
## ar 0 19 12 28 72 1 0 2 17 0
## az 0 84 34 58 265 15 0 8 46 0
## bc 0 15 9 15 72 1 0 3 16 1
##
## diamond disk dome egg fireball flare flash formation hexagon light other
## ab 10 24 0 4 14 0 7 17 0 74 30
## ak 6 33 0 1 20 0 13 7 0 106 24
## al 16 52 0 12 49 0 15 16 0 149 48
## ar 9 48 0 6 30 0 15 13 0 157 44
## az 35 154 0 25 195 0 46 131 0 663 236
## bc 12 68 0 9 45 0 22 25 0 223 94
##
## oval pyramid rectangle round sphere teardrop triangle unknown
## ab 16 0 7 0 24 3 37 38
## ak 24 0 5 0 29 1 29 23
## al 38 0 12 0 57 7 88 51
## ar 26 0 10 0 34 8 98 62
## az 124 0 41 0 184 27 273 201
## bc 33 0 8 0 43 5 73 65
Now I am removing all of the columns that are just zero.
tabledata2 <- UFO |>
select(c("state", "shape")) |>
filter(!shape %in% c("changed", "crescent", "delta", "dome", "flare", "hexagon", "pyramid", "round"))
tablethingy2 <- table(tabledata2$state, tabledata2$shape)
head(tablethingy2)
##
## changing chevron cigar circle cone cross cylinder diamond disk egg
## ab 4 2 11 34 4 1 6 10 24 4
## ak 6 0 8 36 2 0 6 6 33 1
## al 16 9 16 75 1 3 14 16 52 12
## ar 19 12 28 72 1 2 17 9 48 6
## az 84 34 58 265 15 8 46 35 154 25
## bc 15 9 15 72 1 3 16 12 68 9
##
## fireball flash formation light other oval rectangle sphere teardrop
## ab 14 7 17 74 30 16 7 24 3
## ak 20 13 7 106 24 24 5 29 1
## al 49 15 16 149 48 38 12 57 7
## ar 30 15 13 157 44 26 10 34 8
## az 195 46 131 663 236 124 41 184 27
## bc 45 22 25 223 94 33 8 43 5
##
## triangle unknown
## ab 37 38
## ak 29 23
## al 88 51
## ar 98 62
## az 273 201
## bc 73 65
Performing the actual test
chisqrUFO <- chisq.test(tablethingy2)
## Warning in chisq.test(tablethingy2): Chi-squared approximation may be incorrect
chisqrUFO
##
## Pearson's Chi-squared test
##
## data: tablethingy2
## X-squared = 2295.7, df = 1340, p-value < 2.2e-16
\(H_0\): There is no association between state and the shape of the UFO reported
\(H_a\): There is at least some association between the state and the shape of the UFO reported.
Due to the pvalue being very low–around 0.00000000000000022–we reject the null. There is an association between the state and the shape of the UFO reported.
Filtered the data for my graph, but removing the shapes that were not reported in the united states at all.
graphFiltered <- UFO |>
select(c("state", "shape", "country", "year_only")) |>
filter(!shape %in% c("changed", "crescent", "delta", "dome", "flare", "hexagon", "pyramid", "round"), country == "us")
finding the states with the highest counts, which I will then graph.
statecounts <- count(graphFiltered, state)
statecounts |>
arrange(desc(n))
## # A tibble: 52 × 2
## state n
## <chr> <int>
## 1 ca 9573
## 2 wa 4290
## 3 fl 4155
## 4 tx 3741
## 5 ny 3234
## 6 il 2698
## 7 az 2617
## 8 pa 2518
## 9 oh 2464
## 10 mi 1980
## # ℹ 42 more rows
There was definitely a more efficient way to do this, but oh well.
filtering for the states with the highest counts
UFOGraphData <- graphFiltered |>
filter(shape %in% c("circle", "light", "triangle", "sphere", "disk", "fireball"), year_only %in% c("1980", "1990", "2000", "2010"), state %in% c("ca", "wa", "fl", "tx"))
head(UFOGraphData)
## # A tibble: 6 × 4
## state shape country year_only
## <chr> <chr> <chr> <dbl>
## 1 tx sphere us 1980
## 2 ca disk us 2010
## 3 ca light us 2000
## 4 ca light us 2000
## 5 fl sphere us 2010
## 6 ca circle us 2010
Checking to see how many years and included. If there aren’t that many I will facet wrap by year.
unique(UFOGraphData$year_only)
## [1] 1980 2010 2000 1990
There weren’t that many years, so I facet wrapped by year.
bargraph <- ggplot(UFOGraphData, aes(state, fill = shape)) +
geom_bar() +
facet_wrap(~year_only) +
scale_fill_manual(values = c("#1aa21c", "#b12020", "#561aa2", "#b12099", "#1a94a2", "#bf7d0a"),
labels = c("circle", "disk", "fireball", "light", "sphere", "triangle"),
name = "Shape of UFO Reported") +
labs(title = "Counts for each shape reported in the four most populated states")
bargraph
Filtering for the data that
leafletdata <- UFO |>
select("country", "state", "shape", "duration_(seconds)", "latitude", "longitude", "year_only", "duration_(hours/min)") |>
filter(country == "us", year_only %in% c("1980", "1990", "2000", "2010"))
head(leafletdata)
## # A tibble: 6 × 8
## country state shape `duration_(seconds)` latitude longitude year_only
## <chr> <chr> <chr> <dbl> <chr> <chr> <dbl>
## 1 us tx sphere 180 29.7630556 -95.3630556 1980
## 2 us tx unknown 300 32.7833333 -96.8000000 1980
## 3 us nh light 300 42.9955556 -71.4552778 1980
## 4 us ga triangle 120 31.7058333 -83.6533333 1990
## 5 us mo oval 180 37.7241667 -89.8611111 2000
## 6 us fl other 10 26.5250000 -80.0666667 2000
## # ℹ 1 more variable: `duration_(hours/min)` <chr>
Setting the latitude, longitude, and the duration in seconds to numeric.
Creating the popup for the leaflet
popupUFO <- paste0(
"<b>Year: </b>", leafletdata$year_only, "<br>",
"<b>Duration: </b>", leafletdata$`duration_(hours/min)`, "<br>",
"<b>State: </b>", leafletdata$state, "<br>",
"<b>Shape: </b>", leafletdata$shape, "<br>"
)
For some reason the leaflet won’t change where the graph first spawns(?), as in the setView just isn’t working for some reason. So, when you first run the chunk you will be in the arctic ocean and will need to move over to the United States.
leaflet() |>
setView(lng = 38.7946, lat = 106.5348, zoom = 5) |>
addProviderTiles("Esri.NatGeoWorldMap") |>
addCircles(
data = leafletdata,
color = "black",
fillColor = "white",
fillOpacity = 0.25,
popup = popupUFO
)
## Assuming "longitude" and "latitude" are longitude and latitude, respectively
To answer my research question I performed a Chi Square test with the variables “state” and “shape.” In my analysis I found that the shape of the UFO is associated with which state it was reported from. This means that depending on the state, the most commonly reported shape of UFO will be different. I found from my first graph that in the states with the highest counts of reports, the UFO is most often reported as light, and not a specifc shape. In my second graph, a leaflet, I found that the the distribution of the UFO sightings is identical to a map of the USA at night, or a population density map of the USA. In my opinion it’s indicating that most UFO sightings are actually due to something humans are doing: a reflection of light, a drone, a plane, etc. Because of this, in the future, I could try to find a dataset that can be used to map the activities of planes and helicopters to see if there is any correlation. An alternative would be a dataset that tracks use of electricity.
UFO Sightings. 13 Nov. 2019, www.kaggle.com/datasets/NUFORC/ufo-sightings.
shout out to my cat for being present during the completion of both
my projects
Thank you professor! Have a good summer <3