Data101Project2

Context

This dataset contains over 80,000 reports of UFO sightings over the last century.

Content

This dataset includes 80,332 rows and 11 columns. The columns represent the date/time of the sighting, the location (city, state and country) of the sighting, the shape of the object, the duration (in hours/minutes/seconds) of the sighting, comments describing the sighting, the date posted, and the latitude and longitude of the sighting. The reports date back to the 20th century; some older data might be obscured.

The reports come from the National UFO Reporting Center’s (NUFORC’s) website. Further information on NUFORC and up-to-date datasets are available here: http://www.nuforc.org/.

Questions

We decided to look at the data and formulate our own questions, many of which coincide with what the compilers laid out. I will be exploring two questions: 1) Do certain shapes tend to have a longer/shorter duration? and 2) Do certain months tend to have a longer/shorter duration?

Acknowledgement

This dataset was scraped, geolocated, and time standardized from NUFORC data by Sigmond Axel (https://github.com/planetsig/ufo-reports). We accessed it from Kaggle: https://www.kaggle.com/NUFORC/ufo-sightings?select=scrubbed.csv.

# load tidyverse to read in data

library("tidyverse")

## ── Attaching packages ─────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.1
## ✓ tidyr   1.1.1     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0

## ── Conflicts ────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

ufo_sightings <- read_csv("scrubbed.csv")

## Parsed with column specification:
## cols(
##   datetime = col_character(),
##   city = col_character(),
##   state = col_character(),
##   country = col_character(),
##   shape = col_character(),
##   `duration (seconds)` = col_double(),
##   `duration (hours/min)` = col_character(),
##   comments = col_character(),
##   `date posted` = col_character(),
##   latitude = col_double(),
##   longitude = col_double()
## )

## Warning: 4 parsing failures.
##   row                col               expected   actual           file
## 27823 duration (seconds) no trailing characters `        'scrubbed.csv'
## 35693 duration (seconds) no trailing characters `        'scrubbed.csv'
## 43783 latitude           no trailing characters q.200088 'scrubbed.csv'
## 58592 duration (seconds) no trailing characters `        'scrubbed.csv'

Examine the parsing errors

errors <- ufo_sightings[c(27823, 35693, 43783, 58592), ]
errors

## # A tibble: 4 x 11
##   datetime city  state country shape `duration (seco… `duration (hour… comments
##   <chr>    <chr> <chr> <chr>   <chr>            <dbl> <chr>            <chr>   
## 1 2/2/200… bouse az    us      <NA>                NA each a few seco… Driving…
## 2 4/10/20… sant… ca    us      <NA>                NA eight seconds    2 red l…
## 3 5/22/19… mesc… nm    <NA>    rect…              180 two hours        Huge re…
## 4 7/21/20… ibag… <NA>  <NA>    circ…               NA 1/2 segundo      Viajaba…
## # … with 3 more variables: `date posted` <chr>, latitude <dbl>, longitude <dbl>

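These rows failed to parse because of stray characters in numeric columns: a backtick in duration (seconds) for three rows and a “q” in latitude for the other one, so readr stored NA in those cells. Two of the rows are also missing shape, and shape and duration (seconds) are the two columns I’m looking into. I can repair row 58592 by entering the value written in its duration (hours/min) column (1/2 segundo) into the duration (seconds) column. The two rows that are missing shape info (27823 and 35693) I will have to delete later. The remaining failure (row 43783) is in the latitude column, which doesn’t concern my analysis at this point, so I will leave it alone.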

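# "1/2 segundo" is half a second; refer to the column by name rather than position
ufo_sightings[58592, "duration (seconds)"] <- 0.5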
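The backtick in the `actual` column of the warning suggests that the affected cells hold a number followed by a stray character. As a minimal base R sketch, assuming the column has been read as character and using made-up raw values (the real cell contents are not shown in this report), such characters could be stripped before converting:

```r
# Hypothetical raw values; the real cells in scrubbed.csv are not shown above.
raw_duration <- c("2700", "0.5`", "8`")

# Drop every character that is not a digit or a decimal point, then convert.
# Note: this would also drop a minus sign, which is fine for durations.
clean_duration <- as.numeric(gsub("[^0-9.]", "", raw_duration))
clean_duration
```

This is only one possible repair; it would not help rows such as “1/2 segundo”, where the number has to be read from free text.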

Code to change datetime format

To more easily work with the dates and times in our data, we will convert the datetime column from “character” format to “datetime” format using the lubridate package.

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

# convert the "datetime" column from character format into date-time (POSIXct) format
newufo <- mutate(ufo_sightings, datetime = mdy_hm(datetime))

To see patterns by date and by time, we create separate columns for years, months, hours and minutes (of sighting). I will be examining the “month” column later.

Code to create new “year”, “month”, “hour” and “minute” columns

newufo <- mutate(
  newufo,
  year = year(datetime),
  month = month(datetime, label = TRUE),
  hour = hour(datetime),
  minute = minute(datetime)
)
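To illustrate what these lubridate helpers return, here is a small sketch on a single example string. The string is an assumption written in the month/day/year hour:minute order that `mdy_hm()` expects; it is not copied from the raw file.

```r
library(lubridate)

# Assumed example string in month/day/year hour:minute order.
x <- mdy_hm("10/10/1949 20:30")

year(x)    # 1949
month(x)   # 10 (label = TRUE returns an ordered factor of month names instead)
hour(x)    # 20
minute(x)  # 30
```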

Remove the two rows that are missing shape data.

Since I couldn’t make up shapes for these rows, and both also lack a numeric duration, I deleted them entirely so that I did not end up with NA results once I started summarising the data.

newufo <- newufo[-c(27823,35693), ]

Look at the structure, the first few rows, and last few rows of the data.

str(newufo)

## tibble [80,330 × 15] (S3: tbl_df/tbl/data.frame)
##  $ datetime            : POSIXct[1:80330], format: "1949-10-10 20:30:00" "1949-10-10 21:00:00" ...
##  $ city                : chr [1:80330] "san marcos" "lackland afb" "chester (uk/england)" "edna" ...
##  $ state               : chr [1:80330] "tx" "tx" NA "tx" ...
##  $ country             : chr [1:80330] "us" NA "gb" "us" ...
##  $ shape               : chr [1:80330] "cylinder" "light" "circle" "circle" ...
##  $ duration (seconds)  : num [1:80330] 2700 7200 20 20 900 300 180 1200 180 120 ...
##  $ duration (hours/min): chr [1:80330] "45 minutes" "1-2 hrs" "20 seconds" "1/2 hour" ...
##  $ comments            : chr [1:80330] "This event took place in early fall around 1949-50. It occurred after a Boy Scout meeting in the Baptist Church"| __truncated__ "1949 Lackland AFB&#44 TX.  Lights racing across the sky &amp; making 90 degree turns on a dime." "Green/Orange circular disc over Chester&#44 England" "My older brother and twin sister were leaving the only Edna theater at about 9 PM&#44...we had our bikes and I "| __truncated__ ...
##  $ date posted         : chr [1:80330] "4/27/2004" "12/16/2005" "1/21/2008" "1/17/2004" ...
##  $ latitude            : num [1:80330] 29.9 29.4 53.2 29 21.4 ...
##  $ longitude           : num [1:80330] -97.94 -98.58 -2.92 -96.65 -157.8 ...
##  $ year                : num [1:80330] 1949 1949 1955 1956 1960 ...
##  $ month               : Ord.factor w/ 12 levels "Jan"<"Feb"<"Mar"<..: 10 10 10 10 10 10 10 10 10 10 ...
##  $ hour                : int [1:80330] 20 21 17 21 20 19 21 23 20 21 ...
##  $ minute              : int [1:80330] 30 0 0 0 0 0 0 45 0 0 ...
##  - attr(*, "problems")= tibble [4 × 5] (S3: tbl_df/tbl/data.frame)
##   ..$ row     : int [1:4] 27823 35693 43783 58592
##   ..$ col     : chr [1:4] "duration (seconds)" "duration (seconds)" "latitude" "duration (seconds)"
##   ..$ expected: chr [1:4] "no trailing characters" "no trailing characters" "no trailing characters" "no trailing characters"
##   ..$ actual  : chr [1:4] "`" "`" "q.200088" "`"
##   ..$ file    : chr [1:4] "'scrubbed.csv'" "'scrubbed.csv'" "'scrubbed.csv'" "'scrubbed.csv'"

head(newufo)

## # A tibble: 6 x 15
##   datetime            city  state country shape `duration (seco…
##   <dttm>              <chr> <chr> <chr>   <chr>            <dbl>
## 1 1949-10-10 20:30:00 san … tx    us      cyli…             2700
## 2 1949-10-10 21:00:00 lack… tx    <NA>    light             7200
## 3 1955-10-10 17:00:00 ches… <NA>  gb      circ…               20
## 4 1956-10-10 21:00:00 edna  tx    us      circ…               20
## 5 1960-10-10 20:00:00 kane… hi    us      light              900
## 6 1961-10-10 19:00:00 bris… tn    us      sphe…              300
## # … with 9 more variables: `duration (hours/min)` <chr>, comments <chr>, `date
## #   posted` <chr>, latitude <dbl>, longitude <dbl>, year <dbl>, month <ord>,
## #   hour <int>, minute <int>

tail(newufo)

## # A tibble: 6 x 15
##   datetime            city  state country shape `duration (seco…
##   <dttm>              <chr> <chr> <chr>   <chr>            <dbl>
## 1 2013-09-09 21:00:00 wood… ga    us      sphe…               20
## 2 2013-09-09 21:15:00 nash… tn    us      light              600
## 3 2013-09-09 22:00:00 boise id    us      circ…             1200
## 4 2013-09-09 22:00:00 napa  ca    us      other             1200
## 5 2013-09-09 22:20:00 vien… va    us      circ…                5
## 6 2013-09-09 23:00:00 edmo… ok    us      cigar             1020
## # … with 9 more variables: `duration (hours/min)` <chr>, comments <chr>, `date
## #   posted` <chr>, latitude <dbl>, longitude <dbl>, year <dbl>, month <ord>,
## #   hour <int>, minute <int>

Create two new columns called “duration_hours” and “duration_mins.”

I created two new columns to show duration in hours and in minutes, using the existing column “duration (seconds),” because the column called “duration (hours/min)” was not uniform in its descriptions and contained lots of text. The “duration (seconds)” column was all numeric, so I was easily able to convert it into hours and minutes columns simply by dividing the seconds by 3600 and 60, respectively.

newufo <- mutate(
  newufo,
  duration_hours = `duration (seconds)` / 3600,
  duration_mins = `duration (seconds)` / 60
)

Fixing my own mistake

While trying to figure out how to round my two new columns to 2 decimal places, I created a couple new columns by accident, which I removed here.

newufo$digits <- NULL
newufo$`round(digits = 2)` <- NULL

Round the duration_hours and duration_mins columns to 2 decimal places.

I did this because the data contained a lot of decimals, making it harder to read and messier to look at.

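# select the columns by name so the code does not depend on column positions
newufo <- mutate(newufo, across(c(duration_hours, duration_mins), ~ round(.x, 2)))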
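As a quick worked check of the conversion and rounding, here is a sketch on three durations: 2700 and 20 seconds appear in the output above, and 0.5 seconds is the repaired row.

```r
seconds <- c(2700, 20, 0.5)

mins  <- seconds / 60     # divide by 60 for minutes
hours <- seconds / 3600   # divide by 3600 for hours

round(mins, 2)    # 45, 0.33, 0.01
round(hours, 2)   # 0.75, 0.01, 0
```

Very short sightings round to 0 hours, so rounding before summarising changes the sums and means slightly. An alternative, if that matters, would be to keep full precision in the columns and round only when printing.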

Look at the average duration in minutes by shape.

I created a dataframe to look at shape and my new column, duration_mins. I took the means of duration and grouped by shape.

df_duration_shape <- newufo %>%
  select(duration_mins, shape) %>%
  group_by(shape) %>%
  summarise_all(mean)

df_duration_shape

## # A tibble: 30 x 2
##    shape    duration_mins
##    <chr>            <dbl>
##  1 changed          60   
##  2 changing         34.7 
##  3 chevron           7.67
##  4 cigar            32.0 
##  5 circle           79.5 
##  6 cone           1380.  
##  7 crescent        315.  
##  8 cross            12.5 
##  9 cylinder         57.1 
## 10 delta            38.5 
## # … with 20 more rows

Look at the average duration in descending order

In this order, sightings with shape recorded as NA have the second-longest average duration, which I find interesting. Perhaps the shape was described in the “comments” column instead. For this project we have not yet learned how to work with large amounts of string data, so I was unable to investigate further, although I am curious to read the comments for these sightings.

Besides that, cone-shaped sightings had a far longer average duration than every other shape: about 1,000 minutes more than sphere, the next-highest named shape (1,380 vs. 363 minutes). This would be interesting to look into further.

df_duration_shape_desc <- df_duration_shape %>%
  arrange(desc(duration_mins))
df_duration_shape_desc

## # A tibble: 30 x 2
##    shape    duration_mins
##    <chr>            <dbl>
##  1 cone            1380. 
##  2 <NA>             783. 
##  3 sphere           363. 
##  4 other            344. 
##  5 crescent         315. 
##  6 light            220. 
##  7 unknown           92.4
##  8 flash             88.8
##  9 circle            79.5
## 10 fireball          67.1
## # … with 20 more rows
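One way to look into a very large average is to check whether a few very long reports drive it, by putting the count and the median next to the mean. The sketch below uses made-up durations purely for illustration; the column names `n`, `mean_mins` and `median_mins` are hypothetical, and no claim is made about what the real data would show.

```r
library(dplyr)

# Made-up durations (minutes) for illustration only; not taken from the dataset.
toy <- tibble(
  shape         = c("cone", "cone", "cone", "disk", "disk"),
  duration_mins = c(1, 2, 4000, 5, 7)
)

shape_summary <- toy %>%
  group_by(shape) %>%
  summarise(
    n           = n(),                    # number of sightings per shape
    mean_mins   = mean(duration_mins),    # sensitive to extreme values
    median_mins = median(duration_mins)   # robust to extreme values
  ) %>%
  arrange(desc(mean_mins))
shape_summary
```

In this toy example a single 4,000-minute report gives “cone” the highest mean even though its median is only 2 minutes.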

Create a barplot for this data

You can clearly see that cone has a much longer average duration than all of the other shapes, followed by NA. I am curious whether the shape was simply left out of the shape column because of a data-entry error, or whether there is some reason why the people who reported these sightings, which lasted quite long on average, could not report a shape.

shape_duration_plot <- ggplot(df_duration_shape, aes(shape, duration_mins)) +
  geom_col(fill = "chartreuse3", color = "white") +
  labs(y = "Duration (Mins)", x = "Shape") +
  ggtitle("Average Duration in Minutes of UFO Sightings by Shape") +
  theme(plot.title = element_text(size = 11)) +
  coord_flip()
shape_duration_plot
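A first step toward reading the comments for sightings with no recorded shape could be a simple filter, which needs no string processing. The sketch below runs on made-up rows; only the column names `shape` and `comments` follow the dataset.

```r
library(dplyr)

# Made-up rows for illustration; only the column names follow the dataset.
toy <- tibble(
  shape    = c("light", NA, "cone", NA),
  comments = c("bright dot", "no clear outline", "pointed object", "too dark to tell")
)

# Keep only the sightings with no recorded shape and show their comments.
na_shape <- toy %>%
  filter(is.na(shape)) %>%
  select(shape, comments)
na_shape
```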

Look at the same data, but now with duration in hours instead of minutes

I chose to do this because I wanted to see if the data was easier to digest and comprehend with a different scale of duration.

df_hours_shape <- newufo %>%
  select(duration_hours, shape) %>%
  group_by(shape) %>%
  summarise_all(mean)
df_hours_shape

## # A tibble: 30 x 2
##    shape    duration_hours
##    <chr>             <dbl>
##  1 changed           1    
##  2 changing          0.577
##  3 chevron           0.127
##  4 cigar             0.532
##  5 circle            1.32 
##  6 cone             23.0  
##  7 crescent          5.25 
##  8 cross             0.208
##  9 cylinder          0.951
## 10 delta             0.641
## # … with 20 more rows

Create a barplot looking at shape and duration in hours.

The Duration (Hours) scale is a bit easier to understand, since its axis goes up in increments of 5 hours instead of 500 minutes. I think having more tick marks makes the plot easier to read.

shape_hours_plot <- ggplot(df_hours_shape, aes(shape, duration_hours)) +
  geom_col(fill = "chartreuse3", color = "white") +
  labs(y = "Duration (Hours)", x = "Shape") +
  ggtitle("Average Duration in Hours of UFO Sightings by Shape") +
  theme(plot.title = element_text(size = 11)) +
  coord_flip()
shape_hours_plot

Look at the sum duration in hours by month

I created a dataframe to look at month and my new column, duration_hours. I took the sum of duration and grouped by month. Since I was taking the sum instead of the mean, I chose to use duration in hours, since I knew the totals would be quite large.

df_duration_month <- newufo %>%
  select(duration_hours, month) %>%
  group_by(month) %>%
  summarise_all(sum)
df_duration_month

## # A tibble: 12 x 2
##    month duration_hours
##    <ord>          <dbl>
##  1 Jan            6972.
##  2 Feb            6163.
##  3 Mar           11476.
##  4 Apr           17680.
##  5 May            4205.
##  6 Jun           42929.
##  7 Jul            4529.
##  8 Aug           37568.
##  9 Sep           22080.
## 10 Oct           35495.
## 11 Nov            3041.
## 12 Dec            9043.

Create a barplot looking at this data

You can see that June, August and October have the greatest total duration in hours of all the months. Perhaps this could be explained by those months having a greater number of sightings; this would be interesting to investigate.

month_hours_plot <- ggplot(df_duration_month, aes(month, duration_hours)) +
  geom_col(fill = "chartreuse3", color = "white") +
  labs(y = "Duration (Hours)", x = "Month") +
  ggtitle("Total Duration in Hours of UFO Sightings per Month") +
  theme(plot.title = element_text(size = 12))
month_hours_plot
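Whether some months simply have more sightings could be checked by counting rows per month. The sketch below uses made-up months to show the idea; applied to the real data, the same `count()` call would run on `newufo`.

```r
library(dplyr)

# Made-up months for illustration only.
toy <- tibble(month = c("Jun", "Jun", "Jun", "Jul", "Aug", "Aug"))

# count() returns one row per month, with the number of sightings in column n;
# sort = TRUE puts the busiest month first.
sightings_per_month <- toy %>% count(month, sort = TRUE)
sightings_per_month
```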

Now look at the average duration in hours by month

I was curious to see if there would be noticeable differences in the sums and the averages of the durations each month, so I decided to look at the means as well.

df_duration_month2 <- newufo %>%
  select(duration_hours, month) %>%
  group_by(month) %>%
  summarise_all(mean)
df_duration_month2

## # A tibble: 12 x 2
##    month duration_hours
##    <ord>          <dbl>
##  1 Jan            1.23 
##  2 Feb            1.32 
##  3 Mar            2.11 
##  4 Apr            3.20 
##  5 May            0.795
##  6 Jun            5.28 
##  7 Jul            0.475
##  8 Aug            4.35 
##  9 Sep            2.91 
## 10 Oct            4.79 
## 11 Nov            0.451
## 12 Dec            1.60

Create a barplot looking at the average duration of sightings per month.

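As you can see, the plots are very similar in appearance, with a couple of months ranking slightly lower or higher than in the previous plot: August, for example, is higher than October in total but lower than October on average. Some similarity is expected, because each monthly mean is just the monthly total divided by that month’s number of sightings, so the shifts should come from months with different numbers of sightings. It would be interesting to confirm this by comparing the sighting counts per month.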

month_hours_plot2 <- ggplot(df_duration_month2, aes(month, duration_hours)) +
  geom_col(fill = "chartreuse3", color = "white") +
  labs(y = "Duration (Hours)", x = "Month") +
  ggtitle("Average Duration in Hours of UFO Sightings per Month") +
  theme(plot.title = element_text(size = 12))
month_hours_plot2
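The relationship between the total and average plots can be sketched by computing the total, the number of sightings and the mean in a single summary. The data below are made up, and the column names `total_hours`, `sightings` and `mean_hours` are hypothetical.

```r
library(dplyr)

# Made-up durations (hours) for illustration only.
toy <- tibble(
  month          = c("Jun", "Jun", "Jul", "Jul", "Jul", "Jul"),
  duration_hours = c(10, 2, 1, 1, 1, 1)
)

by_month <- toy %>%
  group_by(month) %>%
  summarise(
    total_hours = sum(duration_hours),
    sightings   = n(),
    mean_hours  = mean(duration_hours)   # equals total_hours / sightings
  )
by_month
```

Having all three columns side by side makes it possible to see whether a month ranks high in total because of many sightings or because of long ones.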