Context
This dataset contains over 80,000 reports of UFO sightings over the last century.
Content
This dataset includes 80,332 rows and 11 columns. The columns represent the date/time of the sighting, the location (city, state, and country) of the sighting, the shape of the object, the duration of the sighting (in seconds and as free-text hours/minutes), comments describing the sighting, the date posted, and the latitude and longitude of the sighting. The reports date back to the 20th century, and some older data might be obscured.
The reports come from the National UFO Reporting Center’s (NUFORC’s) website. Further information on NUFORC and up-to-date datasets are available here: http://www.nuforc.org/.
Questions
We decided to look at the data and formulate our own questions, many of which coincide with what the compilers laid out. I will be exploring two questions: 1) Do certain shapes tend to have a longer/shorter duration? and 2) Do certain months tend to have a longer/shorter duration?
Acknowledgement
This dataset was scraped, geolocated, and time standardized from NUFORC data by Sigmond Axel (https://github.com/planetsig/ufo-reports). We accessed it from Kaggle (https://www.kaggle.com/NUFORC/ufo-sightings?select=scrubbed.csv).
# load tidyverse to read in data
library("tidyverse")
## ── Attaching packages ─────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.1
## ✓ tidyr 1.1.1 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
ufo_sightings <- read_csv("scrubbed.csv")
## Parsed with column specification:
## cols(
## datetime = col_character(),
## city = col_character(),
## state = col_character(),
## country = col_character(),
## shape = col_character(),
## `duration (seconds)` = col_double(),
## `duration (hours/min)` = col_character(),
## comments = col_character(),
## `date posted` = col_character(),
## latitude = col_double(),
## longitude = col_double()
## )
## Warning: 4 parsing failures.
## row col expected actual file
## 27823 duration (seconds) no trailing characters ` 'scrubbed.csv'
## 35693 duration (seconds) no trailing characters ` 'scrubbed.csv'
## 43783 latitude no trailing characters q.200088 'scrubbed.csv'
## 58592 duration (seconds) no trailing characters ` 'scrubbed.csv'
Examine the parsing errors
errors <- ufo_sightings[c(27823, 35693, 43783, 58592), ]
errors
## # A tibble: 4 x 11
## datetime city state country shape `duration (seco… `duration (hour… comments
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 2/2/200… bouse az us <NA> NA each a few seco… Driving…
## 2 4/10/20… sant… ca us <NA> NA eight seconds 2 red l…
## 3 5/22/19… mesc… nm <NA> rect… 180 two hours Huge re…
## 4 7/21/20… ibag… <NA> <NA> circ… NA 1/2 segundo Viajaba…
## # … with 3 more variables: `date posted` <chr>, latitude <dbl>, longitude <dbl>
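The flagged rows can also be looked up without typing the row numbers by hand. The following is a minimal sketch, assuming readr's `problems()` is available for the data frame returned by `read_csv()`; the three-row file and its column names are hypothetical stand-ins for scrubbed.csv.

```r
library(readr)

# Hypothetical three-row file with one malformed duration value ("20`").
csv_path <- tempfile(fileext = ".csv")
writeLines(c("city,duration", "bouse,20`", "edna,20", "napa,1200"), csv_path)

# Force the column to double so the stray backtick counts as a parsing failure.
toy <- suppressWarnings(
  read_csv(csv_path,
           col_types = cols(city = col_character(), duration = col_double()))
)

# problems() returns one row per failure, including the column and the
# offending text.
parse_issues <- problems(toy)
parse_issues

# Values that failed to parse are stored as NA, so they can be found directly.
toy[is.na(toy$duration), ]
```

Filtering on `is.na()` has the advantage of not depending on how a particular readr version numbers the rows in its problem report.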
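These rows were flagged because a value could not be parsed as a number (a stray backtick in duration (seconds), or a malformed latitude), so read_csv stored NA in its place. Two of the rows are also missing shape, and shape and duration (seconds) are the two columns I’m looking into. I can repair one row (58592) by adding data into the duration (seconds) column, using what is written in the duration (hours/min) column (1/2 segundo, or half a second). The two rows that are missing shape info (27823 and 35693) I will have to delete later. The remaining row (43783) has a latitude problem, which doesn’t concern my data at this point, so I will leave it alone.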
ufo_sightings[58592, 6] <- .5
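Code to create new “year,” “month,” “hour,” and “minute” columns
# Assumed setup, not shown in the original: load lubridate and parse datetime
# (assumed to be month/day/year hour:minute text) into a date-time column
library(lubridate)
newufo <- mutate(ufo_sightings, datetime = mdy_hm(datetime))
newufo <- mutate(newufo, year = year(datetime))
newufo <- mutate(newufo, month = month(datetime, label = TRUE))
newufo <- mutate(newufo, hour = hour(datetime))
newufo <- mutate(newufo, minute = minute(datetime))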
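The date-part functions used here come from lubridate and expect a parsed date-time rather than text. Below is a minimal sketch on hypothetical sample strings, assuming the raw datetime values are in month/day/year hour:minute form; it also counts values that fail to parse, which is worth checking before relying on the derived columns.

```r
library(lubridate)

# Hypothetical sample strings; the last one is deliberately unparseable.
raw_datetime <- c("10/10/1949 20:30", "9/9/2013 23:00", "not a date")

# mdy_hm() returns POSIXct values; strings it cannot parse become NA.
parsed <- suppressWarnings(mdy_hm(raw_datetime))

date_parts <- data.frame(
  raw    = raw_datetime,
  year   = year(parsed),
  month  = month(parsed),   # add label = TRUE for "Jan", "Feb", ...
  hour   = hour(parsed),
  minute = minute(parsed)
)
date_parts

# Number of values that failed to parse.
n_failed <- sum(is.na(parsed))
n_failed
```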
Remove the two errors that are missing shape data.
Since I couldn’t make up shapes for these rows, I deleted them entirely so that I did not end up with NAs once I started working with the data.
newufo <- newufo[-c(27823,35693), ]
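Deleting by row number only works while the row order stays exactly as it was when the numbers were looked up. Here is a minimal sketch of a position-independent alternative, assuming the goal is to drop rows whose duration could not be parsed (the toy tibble is hypothetical). Note that it keeps rows with a missing shape as long as they have a usable duration.

```r
library(dplyr)

# Hypothetical miniature of the sightings table.
toy <- tibble(
  shape = c("light", NA, "circle", NA),
  `duration (seconds)` = c(600, NA, 20, 180)
)

# Keep only rows with a usable duration: row 2 is dropped, while row 4
# (missing shape, valid duration) is kept.
toy_clean <- filter(toy, !is.na(`duration (seconds)`))
toy_clean
```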
Look at the structure, the first few rows, and last few rows of the data.
str(newufo)
## tibble [80,330 × 15] (S3: tbl_df/tbl/data.frame)
## $ datetime : POSIXct[1:80330], format: "1949-10-10 20:30:00" "1949-10-10 21:00:00" ...
## $ city : chr [1:80330] "san marcos" "lackland afb" "chester (uk/england)" "edna" ...
## $ state : chr [1:80330] "tx" "tx" NA "tx" ...
## $ country : chr [1:80330] "us" NA "gb" "us" ...
## $ shape : chr [1:80330] "cylinder" "light" "circle" "circle" ...
## $ duration (seconds) : num [1:80330] 2700 7200 20 20 900 300 180 1200 180 120 ...
## $ duration (hours/min): chr [1:80330] "45 minutes" "1-2 hrs" "20 seconds" "1/2 hour" ...
## $ comments : chr [1:80330] "This event took place in early fall around 1949-50. It occurred after a Boy Scout meeting in the Baptist Church"| __truncated__ "1949 Lackland AFB, TX. Lights racing across the sky & making 90 degree turns on a dime." "Green/Orange circular disc over Chester, England" "My older brother and twin sister were leaving the only Edna theater at about 9 PM,...we had our bikes and I "| __truncated__ ...
## $ date posted : chr [1:80330] "4/27/2004" "12/16/2005" "1/21/2008" "1/17/2004" ...
## $ latitude : num [1:80330] 29.9 29.4 53.2 29 21.4 ...
## $ longitude : num [1:80330] -97.94 -98.58 -2.92 -96.65 -157.8 ...
## $ year : num [1:80330] 1949 1949 1955 1956 1960 ...
## $ month : Ord.factor w/ 12 levels "Jan"<"Feb"<"Mar"<..: 10 10 10 10 10 10 10 10 10 10 ...
## $ hour : int [1:80330] 20 21 17 21 20 19 21 23 20 21 ...
## $ minute : int [1:80330] 30 0 0 0 0 0 0 45 0 0 ...
## - attr(*, "problems")= tibble [4 × 5] (S3: tbl_df/tbl/data.frame)
## ..$ row : int [1:4] 27823 35693 43783 58592
## ..$ col : chr [1:4] "duration (seconds)" "duration (seconds)" "latitude" "duration (seconds)"
## ..$ expected: chr [1:4] "no trailing characters" "no trailing characters" "no trailing characters" "no trailing characters"
## ..$ actual : chr [1:4] "`" "`" "q.200088" "`"
## ..$ file : chr [1:4] "'scrubbed.csv'" "'scrubbed.csv'" "'scrubbed.csv'" "'scrubbed.csv'"
head(newufo)
## # A tibble: 6 x 15
## datetime city state country shape `duration (seco…
## <dttm> <chr> <chr> <chr> <chr> <dbl>
## 1 1949-10-10 20:30:00 san … tx us cyli… 2700
## 2 1949-10-10 21:00:00 lack… tx <NA> light 7200
## 3 1955-10-10 17:00:00 ches… <NA> gb circ… 20
## 4 1956-10-10 21:00:00 edna tx us circ… 20
## 5 1960-10-10 20:00:00 kane… hi us light 900
## 6 1961-10-10 19:00:00 bris… tn us sphe… 300
## # … with 9 more variables: `duration (hours/min)` <chr>, comments <chr>, `date
## # posted` <chr>, latitude <dbl>, longitude <dbl>, year <dbl>, month <ord>,
## # hour <int>, minute <int>
tail(newufo)
## # A tibble: 6 x 15
## datetime city state country shape `duration (seco…
## <dttm> <chr> <chr> <chr> <chr> <dbl>
## 1 2013-09-09 21:00:00 wood… ga us sphe… 20
## 2 2013-09-09 21:15:00 nash… tn us light 600
## 3 2013-09-09 22:00:00 boise id us circ… 1200
## 4 2013-09-09 22:00:00 napa ca us other 1200
## 5 2013-09-09 22:20:00 vien… va us circ… 5
## 6 2013-09-09 23:00:00 edmo… ok us cigar 1020
## # … with 9 more variables: `duration (hours/min)` <chr>, comments <chr>, `date
## # posted` <chr>, latitude <dbl>, longitude <dbl>, year <dbl>, month <ord>,
## # hour <int>, minute <int>
Create two new columns called “duration_hours” and “duration_mins.”
I created two new columns to show duration in hours and in minutes, using the existing column “duration (seconds),” because the column called “duration (hours/min)” was not uniform in its descriptions and contained lots of text. The “duration (seconds)” column was all numeric, so I was easily able to convert it into hours and minutes columns simply by dividing the seconds by 3600 and 60, respectively.
newufo <- mutate(newufo, duration_hours = `duration (seconds)` / 3600)
newufo <- mutate(newufo, duration_mins = `duration (seconds)` / 60)
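As a quick worked check of the conversion, the sketch below applies the same arithmetic to the first three durations shown in the str() output (2700, 7200, and 20 seconds). The tibble name is hypothetical.

```r
library(dplyr)

# First three `duration (seconds)` values from the str() output.
toy <- tibble(`duration (seconds)` = c(2700, 7200, 20))

# Same arithmetic as in the report: seconds / 60 and seconds / 3600.
toy <- mutate(toy,
  duration_mins  = `duration (seconds)` / 60,
  duration_hours = `duration (seconds)` / 3600
)
toy
```

The first row comes out as 45 minutes (0.75 hours), which agrees with the "45 minutes" text in the duration (hours/min) column. Very short sightings are tiny fractions of an hour, so rounding the hours column to two decimal places can reduce a sighting of a few seconds to 0.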
Fixing my own mistake
While trying to figure out how to round my two new columns to 2 decimal places, I created a couple of new columns by accident, which I removed here.
newufo$digits <- NULL
newufo$`round(digits = 2)` <- NULL
Round the duration_hours and duration_mins columns to 2 decimal places.
I did this because the data contained a lot of decimals, making it harder to read and messier to look at.
newufo <- mutate(newufo, across(c(duration_hours, duration_mins), ~ round(.x, 2)))
Look at the average duration in minutes by shape.
I created a dataframe to look at shape and my new column, duration_mins. I took the means of duration and grouped by shape.
df_duration_shape <- newufo %>%
  group_by(shape) %>%
  summarise(duration_mins = mean(duration_mins))
df_duration_shape
## # A tibble: 30 x 2
## shape duration_mins
## <chr> <dbl>
## 1 changed 60
## 2 changing 34.7
## 3 chevron 7.67
## 4 cigar 32.0
## 5 circle 79.5
## 6 cone 1380.
## 7 crescent 315.
## 8 cross 12.5
## 9 cylinder 57.1
## 10 delta 38.5
## # … with 20 more rows
Look at the average duration in descending order
Looking at it in this order, sightings with shape recorded as “NA” had the second-longest duration on average, which I find interesting. Perhaps the shape was described in the “comments” column instead. For this project, we have not yet learned how to deal with large amounts of string data, so I was unable to investigate further, although I am curious to see the comments for these sightings.
Besides that, it is interesting to see that cone-shaped sightings had a far longer duration on average than all of the other shapes: about 1,380 minutes, roughly 1,000 minutes more than the next named shape (sphere, at about 363 minutes). This would be interesting to look into further.
df_duration_shape_desc <- df_duration_shape %>%
arrange(desc(duration_mins))
df_duration_shape_desc
## # A tibble: 30 x 2
## shape duration_mins
## <chr> <dbl>
## 1 cone 1380.
## 2 <NA> 783.
## 3 sphere 363.
## 4 other 344.
## 5 crescent 315.
## 6 light 220.
## 7 unknown 92.4
## 8 flash 88.8
## 9 circle 79.5
## 10 fireball 67.1
## # … with 20 more rows
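A mean can be pulled far upward by a handful of very long reports, so one way to start looking into the cone and NA results is to put the count, median, and maximum next to the mean for each shape. The following is a minimal sketch on hypothetical toy values, not the real data; on the real table the same summarise() call could be applied to newufo.

```r
library(dplyr)

# Hypothetical durations: one extreme "cone" report among short ones.
toy <- tibble(
  shape = c("cone", "cone", "cone", "light", "light", "light"),
  duration_mins = c(1, 2, 6000, 3, 5, 10)
)

shape_summary <- toy %>%
  group_by(shape) %>%
  summarise(
    n = n(),                              # number of sightings
    mean_mins = mean(duration_mins),      # sensitive to extreme values
    median_mins = median(duration_mins),  # robust to extreme values
    max_mins = max(duration_mins)         # longest single report
  ) %>%
  arrange(desc(mean_mins))
shape_summary
```

In the toy data the "cone" mean is driven almost entirely by one report while its median stays small. Whether that is what happens in the real data would need to be checked.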
Create a barplot for this data
You can clearly see that cone has a much longer average duration than all of the other shapes, followed by NA. I am very curious to know whether the shape was simply left out of the shape column through a data-entry error, or whether there is another explanation for why the people who reported sightings with such long durations on average were not able to report the shape.
shape_duration_plot <- ggplot(df_duration_shape, aes(shape, duration_mins)) +
  geom_col(fill = "chartreuse3", color = "white") +
  labs(y = "Duration (Mins)", x = "Shape", title = "Average Duration in Minutes of UFO Sightings by Shape") +
  theme(plot.title = element_text(size = 11)) +
  coord_flip()
shape_duration_plot
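With thirty shapes in alphabetical order, the longest bars are scattered through the plot. Here is a minimal sketch of sorting the bars by value with `reorder()`, using five of the averages from the descending table as sample data (the NA shape is left out of the sample for simplicity).

```r
library(ggplot2)

# Five averages copied from the descending table above.
toy <- data.frame(
  shape = c("cone", "sphere", "other", "crescent", "light"),
  duration_mins = c(1380, 363, 344, 315, 220)
)

# reorder() sorts the shape levels by duration, smallest first; after
# coord_flip() the largest bar ends up at the top.
ordered_plot <- ggplot(toy, aes(reorder(shape, duration_mins), duration_mins)) +
  geom_col(fill = "chartreuse3", color = "white") +
  labs(x = "Shape", y = "Duration (Mins)") +
  coord_flip()
ordered_plot
```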

Look at the same data, but now with duration in hours instead of minutes
I chose to do this because I wanted to see if the data was easier to digest and comprehend with a different scale of duration.
df_hours_shape <- newufo %>%
  group_by(shape) %>%
  summarise(duration_hours = mean(duration_hours))
df_hours_shape
## # A tibble: 30 x 2
## shape duration_hours
## <chr> <dbl>
## 1 changed 1
## 2 changing 0.577
## 3 chevron 0.127
## 4 cigar 0.532
## 5 circle 1.32
## 6 cone 23.0
## 7 crescent 5.25
## 8 cross 0.208
## 9 cylinder 0.951
## 10 delta 0.641
## # … with 20 more rows
Create a barplot looking at shape and duration in hours.
You can see that the scale of Duration (Hours) is a bit easier to understand, since it goes by increments of 5 hours instead of 500 minutes. I think having more tick marks makes it easier to read each bar’s value off the plot.
shape_hours_plot <- ggplot(df_hours_shape, aes(shape, duration_hours)) +
  geom_col(fill = "chartreuse3", color = "white") +
  labs(y = "Duration (Hours)", x = "Shape", title = "Average Duration in Hours of UFO Sightings by Shape") +
  theme(plot.title = element_text(size = 11)) +
  coord_flip()
shape_hours_plot
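The number of tick marks is a property of the axis scale, so it can also be changed without switching units. Below is a minimal sketch, assuming a break every 120 minutes is wanted; the three sample averages are copied from the tables above, and the break spacing is an arbitrary choice.

```r
library(ggplot2)

# Three averages copied from the shape tables above.
toy <- data.frame(
  shape = c("cone", "sphere", "crescent"),
  duration_mins = c(1380, 363, 315)
)

# One break every 120 minutes (2 hours), up to just past the longest bar.
minute_breaks <- seq(0, 1440, by = 120)

ticks_plot <- ggplot(toy, aes(shape, duration_mins)) +
  geom_col(fill = "chartreuse3", color = "white") +
  scale_y_continuous(breaks = minute_breaks) +
  labs(x = "Shape", y = "Duration (Mins)") +
  coord_flip()
ticks_plot
```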

Look at the total duration in hours by month
I created a dataframe to look at month and my new column, duration_hours. I took the sum of duration and grouped by month. Since I was taking the sum instead of the mean, I chose to use duration in hours, since I knew the totals would be quite large.
df_duration_month <- newufo %>%
  group_by(month) %>%
  summarise(duration_hours = sum(duration_hours))
df_duration_month
## # A tibble: 12 x 2
## month duration_hours
## <ord> <dbl>
## 1 Jan 6972.
## 2 Feb 6163.
## 3 Mar 11476.
## 4 Apr 17680.
## 5 May 4205.
## 6 Jun 42929.
## 7 Jul 4529.
## 8 Aug 37568.
## 9 Sep 22080.
## 10 Oct 35495.
## 11 Nov 3041.
## 12 Dec 9043.
Create a barplot looking at this data
You can see that June, August, and October have the greatest total duration in hours of all the months. Perhaps this could be explained by those months having a greater number of sightings, which would be interesting to investigate.
month_hours_plot <- ggplot(df_duration_month, aes(month, duration_hours)) +
  geom_col(fill = "chartreuse3", color = "white") +
  labs(y = "Duration (Hours)", x = "Month", title = "Total Duration in Hours of UFO Sightings per Month") +
  theme(plot.title = element_text(size = 12))
month_hours_plot
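Whether the months with the largest totals also have the most sightings can be checked by counting rows per month alongside the sum. Here is a minimal sketch on hypothetical toy values; the same summarise() call could be pointed at newufo.

```r
library(dplyr)

# Hypothetical sightings: June has three reports, one of them very long.
toy <- tibble(
  month = factor(c("Jun", "Jun", "Jun", "Nov"), levels = month.abb, ordered = TRUE),
  duration_hours = c(0.5, 1, 100, 0.25)
)

month_summary <- toy %>%
  group_by(month) %>%
  summarise(
    sightings = n(),                     # number of reports in the month
    total_hours = sum(duration_hours),   # what the bar plot shows
    longest_hours = max(duration_hours)  # longest single report
  )
month_summary
```

Putting the longest single report next to the total shows at a glance whether a month's total comes from many sightings or from one extreme value.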

Now look at the average duration in hours by month
I was curious to see if there would be noticeable differences in the sums and the averages of the durations each month, so I decided to look at the means as well.
df_duration_month2 <- newufo %>%
  group_by(month) %>%
  summarise(duration_hours = mean(duration_hours))
df_duration_month2
## # A tibble: 12 x 2
## month duration_hours
## <ord> <dbl>
## 1 Jan 1.23
## 2 Feb 1.32
## 3 Mar 2.11
## 4 Apr 3.20
## 5 May 0.795
## 6 Jun 5.28
## 7 Jul 0.475
## 8 Aug 4.35
## 9 Sep 2.91
## 10 Oct 4.79
## 11 Nov 0.451
## 12 Dec 1.60
Create a barplot looking at the average duration of sightings per month.
As you can see, the plots are very similar in appearance, with a couple of months ranking slightly lower or higher than they did in the previous plot: August is higher than October in total, but lower than October on average. It would be interesting to look into why these graphs are so similar as well as what caused the few shifts.
month_hours_plot2 <- ggplot(df_duration_month2, aes(month, duration_hours)) +
  geom_col(fill = "chartreuse3", color = "white") +
  labs(y = "Duration (Hours)", x = "Month", title = "Average Duration in Hours of UFO Sightings per Month") +
  theme(plot.title = element_text(size = 12))
month_hours_plot2
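One way to see how the two plots relate is that a month's total equals its number of sightings times its mean, so dividing the printed totals by the printed means gives an approximate sighting count per month. The sketch below uses only the values shown in the two tables above; because those values are rounded, the implied counts are estimates rather than exact row counts.

```r
# Totals and means copied from the two monthly tables above.
month_tbl <- data.frame(
  month = month.abb,
  total_hours = c(6972, 6163, 11476, 17680, 4205, 42929,
                  4529, 37568, 22080, 35495, 3041, 9043),
  mean_hours = c(1.23, 1.32, 2.11, 3.20, 0.795, 5.28,
                 0.475, 4.35, 2.91, 4.79, 0.451, 1.60)
)

# total = n * mean, so n is approximately total / mean.
month_tbl$implied_sightings <- round(month_tbl$total_hours / month_tbl$mean_hours)

# Sort by implied count to compare against the ranking by total duration.
month_tbl[order(-month_tbl$implied_sightings), ]
```

In this estimate July comes out with the largest implied count even though its total duration is among the smallest. That suggests the totals may be driven more by a few very long reports than by the number of sightings, but it should be confirmed with an actual count on the data.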
