The data compilers suggested the following questions: What areas of the country are most likely to have UFO sightings? Are there any trends in UFO sightings over time? Do they tend to be clustered or seasonal? Do clusters of UFO sightings correlate with landmarks, such as airports or government research centers? What are the most common UFO descriptions?
We decided to look at the data and formulate our own questions, many of which coincide with what the compilers laid out. I will explore two questions: 1) are sightings clustered in time; and 2) are sightings concentrated in certain countries, and in particular areas within those countries?
Acknowledgement
This dataset was scraped, geolocated, and time standardized from NUFORC data by Sigmund Axel https://github.com/planetsig/ufo-reports. We accessed it from Kaggle https://www.kaggle.com/NUFORC/ufo-sightings?select=scrubbed.csv.
# load tidyverse to read in data
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.0
## ✓ tidyr 1.1.0 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
ufo_sightings <- read_csv("scrubbed.csv")
## Parsed with column specification:
## cols(
## datetime = col_character(),
## city = col_character(),
## state = col_character(),
## country = col_character(),
## shape = col_character(),
## `duration (seconds)` = col_double(),
## `duration (hours/min)` = col_character(),
## comments = col_character(),
## `date posted` = col_character(),
## latitude = col_double(),
## longitude = col_double()
## )
## Warning: 4 parsing failures.
## row col expected actual file
## 27823 duration (seconds) no trailing characters ` 'scrubbed.csv'
## 35693 duration (seconds) no trailing characters ` 'scrubbed.csv'
## 43783 latitude no trailing characters q.200088 'scrubbed.csv'
## 58592 duration (seconds) no trailing characters ` 'scrubbed.csv'
Examine the rows that triggered the parsing failures:
errors <- ufo_sightings[c(27823, 35693, 43783, 58592), ]
errors
## # A tibble: 4 x 11
## datetime city state country shape `duration (seco… `duration (hour… comments
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 2/2/200… bouse az us <NA> NA each a few seco… Driving…
## 2 4/10/20… sant… ca us <NA> NA eight seconds 2 red l…
## 3 5/22/19… mesc… nm <NA> rect… 180 two hours Huge re…
## 4 7/21/20… ibag… <NA> <NA> circ… NA 1/2 segundo Viajaba…
## # … with 3 more variables: `date posted` <chr>, latitude <dbl>, longitude <dbl>
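Rather than copying row numbers from the warning by hand, the failures can also be retrieved programmatically with readr's problems() function. The following is a minimal sketch on a toy CSV string; the names toy_csv, toy, and toy_problems, and the toy values, are made up for illustration and are not part of the original analysis.

```r
library(readr)

# Toy CSV standing in for scrubbed.csv (hypothetical values): the second
# data row has a stray backtick after the number, as in the warning above.
toy_csv <- "city,duration\nbouse,2\nsanta cruz,8`\n"

toy <- read_csv(
  toy_csv,
  col_types = cols(city = col_character(), duration = col_double())
)

# problems() returns one row per parsing failure that readr recorded
toy_problems <- problems(toy)
toy_problems

# The value that failed to parse is stored as NA
toy$duration
```

On the real data, the equivalent call would be problems(ufo_sightings). How rows are numbered in its output can differ between readr versions, so it is worth checking the result against the data before indexing with it.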
These rows triggered parsing failures because, in three cases, the duration (seconds) value contained a stray trailing character (a backtick) and, in one case, the latitude was malformed (q.200088). readr stored each of these values as NA.
We should drop the row with the unparsed latitude before mapping the data points. If we need to work with the duration, we should decide whether to estimate a value for "each a few seconds" or simply delete that entry.
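One way to drop rows without usable coordinates before mapping is to filter on missing values. This is only a sketch on a toy tibble: toy_sightings, mappable, and the coordinate values are invented for illustration.

```r
library(dplyr)

# Toy stand-in for ufo_sightings; coordinates are invented placeholders
toy_sightings <- tibble(
  city      = c("city_a", "city_b", "city_c"),
  latitude  = c(33.9, NA, 4.4),
  longitude = c(-114.0, -105.8, -75.2)
)

# Keep only rows where both coordinates are present
mappable <- toy_sightings %>%
  filter(!is.na(latitude), !is.na(longitude))

mappable
```

Filtering into a new object, rather than overwriting the original, keeps the dropped row available if the latitude is repaired later.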
We can repair the duration in seconds for two of the three remaining rows because the value is written out in the duration (hours/min) column ("eight seconds" and "1/2 segundo").
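# fill in duration (seconds) from the text in the duration (hours/min) column
ufo_sightings[35693, "duration (seconds)"] <- 8    # "eight seconds"
ufo_sightings[58592, "duration (seconds)"] <- 0.5  # "1/2 segundo"
# confirm the repaired rows
ufo_sightings[c(35693, 58592), ]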
## # A tibble: 2 x 11
## datetime city state country shape `duration (seco… `duration (hour… comments
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 4/10/20… sant… ca us <NA> 8 eight seconds 2 red l…
## 2 7/21/20… ibag… <NA> <NA> circ… 0.5 1/2 segundo Viajaba…
## # … with 3 more variables: `date posted` <chr>, latitude <dbl>, longitude <dbl>
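An alternative to typing the values in by hand would be to read the duration column as text and let readr's parse_number() strip the stray character. The sketch below assumes the raw values are a number followed by a backtick, which the warning suggests but does not confirm; raw_duration and clean_duration are hypothetical names with made-up values.

```r
library(readr)

# Hypothetical raw strings, assuming the column had been read as character
raw_duration <- c("2`", "8`", "0.5`", "180")

# parse_number() extracts the first number and ignores surrounding characters
clean_duration <- parse_number(raw_duration)
clean_duration
```

On the real file this would mean passing col_types = cols(`duration (seconds)` = col_character()) to read_csv() and then converting the column, and any recovered value should still be checked against the duration (hours/min) text.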