The existence of extraterrestrial beings remains a mystery to mankind. Claims of UFO sightings have been recorded throughout the world, and discovering evidence that a UFO truly does exist will almost certainly lead to the discovery of alien life. The purpose of this project is to perform exploratory analysis and provide data visualization for the UFO Sightings data set available on Kaggle.com. Using RStudio, this project draws on roughly 80,000 reports of UFO sightings (80,332 observations) over the last century to show the most common places, times of day/month/year, and object shapes for reported UFO sightings worldwide.
The project looks at variables such as the country, state, city, date and time of a sighting, as well as the shape of the UFO. Through basic descriptive statistics, such as the mode, and data visualization in the form of bar charts, this project aims to provide a comprehensive analysis of reported UFO sightings. It uses exploratory analysis to identify where and when UFO sightings are most commonly reported, as well as the most commonly reported shape.
By describing where and when UFO sightings most often occur, along with the most commonly reported shape, the results from this project can be used to help finally unveil the mystery of alien life. If you dare to go hunting :)
Below is a brief overview of the packages used throughout this project:
library("tidyr") # used to tidy data
library("tibble") # used to create tibbles
library("DT") # used for displaying R data objects (matrices or data frames) as tables on HTML pages
library("dplyr") # used for data manipulation, works well with tidyr
library("ggplot2") # visualization package
library("readr") # used to more easily read in original data set from csv format
The source data can be found at Kaggle.com.
scrubbed <- read_csv("~/R programs/Intro R Final Project/scrubbed.csv/scrubbed.csv")
## Parsed with column specification:
## cols(
## datetime = col_character(),
## city = col_character(),
## state = col_character(),
## country = col_character(),
## shape = col_character(),
## `duration (seconds)` = col_integer(),
## `duration (hours/min)` = col_character(),
## comments = col_character(),
## `date posted` = col_character(),
## latitude = col_double(),
## longitude = col_double()
## )
as_tibble(scrubbed)
## # A tibble: 80,332 x 11
## datetime city state country shape
## <chr> <chr> <chr> <chr> <chr>
## 1 10/10/1949 20:30 san marcos tx us cylinder
## 2 10/10/1949 21:00 lackland afb tx <NA> light
## 3 10/10/1955 17:00 chester (uk/england) <NA> gb circle
## 4 10/10/1956 21:00 edna tx us circle
## 5 10/10/1960 20:00 kaneohe hi us light
## 6 10/10/1961 19:00 bristol tn us sphere
## 7 10/10/1965 21:00 penarth (uk/wales) <NA> gb circle
## 8 10/10/1965 23:45 norwalk ct us disk
## 9 10/10/1966 20:00 pell city al us disk
## 10 10/10/1966 21:00 live oak fl us disk
## # ... with 80,322 more rows, and 6 more variables: `duration
## # (seconds)` <int>, `duration (hours/min)` <chr>, comments <chr>, `date
## # posted` <chr>, latitude <dbl>, longitude <dbl>
The original data set contains a total of 80,332 observations and 11 variables: datetime, city, state, country, shape, duration (seconds), duration (hours/min), comments, date posted, latitude, and longitude.
The original data set, scrubbed.csv, is read into a data table, scrubbed, using the read_csv() function. The result is piped (%>%) into separate(), which splits the datetime variable into Date and Time, giving 12 variables. I then cleaned this table by selecting Date, Time, city, state, country, shape, latitude, and longitude into a new data table named reported_sightings, which is piped into a second separate() call that splits the Date variable into Month, Day and Year. The result is a data table with 10 variables of interest. The tibble view of reported_sightings (sightings_tib) presents the same table in a simpler fashion.
#Reading in csv:
scrubbed <- read_csv("~/R programs/Intro R Final Project/scrubbed.csv/scrubbed.csv") %>%
#Separating datetime variable into Date and Time by (space):
separate(datetime, into = c("Date", "Time"), sep = " ")
#Creating reported_sightings data table to be used for analysis:
reported_sightings <- select(scrubbed, Date, Time, city, state, country, shape, latitude, longitude) %>%
#Separating the Date variable into Month, Day and Year by /:
separate(Date, into = c("Month", "Day", "Year"), sep = "/")
View(reported_sightings)
#Displaying reported_sightings as a tibble:
library(tibble)
as_tibble(reported_sightings)
## # A tibble: 80,332 x 10
## Month Day Year Time city state country shape
## * <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 10 10 1949 20:30 san marcos tx us cylinder
## 2 10 10 1949 21:00 lackland afb tx <NA> light
## 3 10 10 1955 17:00 chester (uk/england) <NA> gb circle
## 4 10 10 1956 21:00 edna tx us circle
## 5 10 10 1960 20:00 kaneohe hi us light
## 6 10 10 1961 19:00 bristol tn us sphere
## 7 10 10 1965 21:00 penarth (uk/wales) <NA> gb circle
## 8 10 10 1965 23:45 norwalk ct us disk
## 9 10 10 1966 20:00 pell city al us disk
## 10 10 10 1966 21:00 live oak fl us disk
## # ... with 80,322 more rows, and 2 more variables: latitude <dbl>,
## # longitude <dbl>
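The two separate() steps described above can be tried on a small stand-in table before running them on the full file. The sketch below is illustrative only: the data frame toy is hypothetical (its values are copied from the preview of the data), and the convert = TRUE argument is an optional addition that the original pipeline does not use. It asks separate() to turn the new Month, Day and Year columns into integers instead of leaving them as character strings.

```r
library(tidyr)

# Hypothetical three-row stand-in for scrubbed.csv (values copied from the preview)
toy <- data.frame(
  datetime = c("10/10/1949 20:30", "10/10/1955 17:00", "10/10/1966 21:00"),
  city     = c("san marcos", "chester (uk/england)", "live oak"),
  stringsAsFactors = FALSE
)

# Step 1: split datetime on the space into Date and Time
toy_split <- separate(toy, datetime, into = c("Date", "Time"), sep = " ")

# Step 2: split Date on "/"; convert = TRUE (an assumption, not in the original
# code) stores Month, Day and Year as integers rather than character strings
toy_split <- separate(toy_split, Date, into = c("Month", "Day", "Year"),
                      sep = "/", convert = TRUE)

str(toy_split)
```

With integer columns, later sorting by Month or Year follows calendar order rather than alphabetical order.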
For the exploratory analysis of the reported_sightings data table, this study created a data table for each variable of interest (Month, Day, Year, Time, city, state, country, shape, latitude and longitude). Each of these was then tabulated and its mode found, for both the categorical variables and the two numerical ones (latitude and longitude). The summary shows that we are dealing mainly with categorical variables. For these, we are interested in the most commonly occurring character value, that is, the mode of each variable. The mode is the descriptive statistic that identifies where and when the most UFO sightings occur, along with the most often reported shape.
summary(reported_sightings)
## Month Day Year
## Length:80332 Length:80332 Length:80332
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Time city state
## Length:80332 Length:80332 Length:80332
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## country shape latitude longitude
## Length:80332 Length:80332 Min. :-82.86 Min. :-176.66
## Class :character Class :character 1st Qu.: 34.13 1st Qu.:-112.07
## Mode :character Mode :character Median : 39.41 Median : -87.90
## Mean : 38.12 Mean : -86.77
## 3rd Qu.: 42.79 3rd Qu.: -78.75
## Max. : 72.70 Max. : 178.44
## NA's :1
month <- select(reported_sightings, Month)
month <- table(as.vector(month))
names(month)[month == max(month)] #July
## [1] "7"
day <- select(reported_sightings, Day)
day <- table(as.vector(day))
names(day)[day == max(day)] #15th most common with frequency of 5968
## [1] "15"
year <- select(reported_sightings, Year)
year <- table(as.vector(year))
names(year)[year == max(year)] #2012
## [1] "2012"
time <- select(reported_sightings, Time)
time <- table(as.vector(time))
names(time)[time == max(time)] #22:00 or 10:00 PM
## [1] "22:00"
city <- select(reported_sightings, city)
city <- table(as.vector(city))
names(city)[city == max(city)] #Seattle
## [1] "seattle"
state <- select(reported_sightings, state)
state <- table(as.vector(state))
names(state)[state == max(state)] #California
## [1] "ca"
country <- select(reported_sightings, country)
country <- table(as.vector(country))
names(country)[country == max(country)] #US
## [1] "us"
shape1 <- select(reported_sightings, shape)
shape1 <- table(as.vector(shape1))
names(shape1)[shape1 == max(shape1)] #light
## [1] "light"
latitude <- select(reported_sightings, latitude)
latitude <- table(as.vector(latitude))
names(latitude)[latitude == max(latitude)] #47.6063889
## [1] "47.6063889"
longitude <- select(reported_sightings, longitude)
longitude <- table(as.vector(longitude))
names(longitude)[longitude == max(longitude)] #-122.3308333
## [1] "-122.3308333"
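The same three-line pattern is repeated above for every variable. As a minimal sketch, the pattern could be wrapped in a small helper; the name stat_mode is hypothetical and not part of the original project, and the input vectors below are made up for illustration.

```r
# Hypothetical helper: returns every value tied for the highest frequency
stat_mode <- function(x) {
  counts <- table(x)                      # table() drops NA values by default
  names(counts)[counts == max(counts)]    # same expression used in the report
}

# Made-up examples
stat_mode(c("light", "disk", "light", "circle"))
stat_mode(c("6", "7", "7", "6", "12"))    # a tie returns both values
```

Assuming reported_sightings is loaded, a call such as lapply(reported_sightings[c("Month", "Day", "shape")], stat_mode) would apply it to several columns at once. Because ties return more than one value, lapply() is a safer choice here than sapply().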
This study focuses on the United States because it was the country with the most reported sightings. After filtering the reported_sightings data table for the character value us in the variable country, I created a new data table, US, that contains 65,114 observations (the number of reported sightings in the US over the past century). Next, I filtered the US data table for the value of the state variable with the most reported sightings. This was the character value ca, or California.
# Using filter() function to create new data table only including observations in the
# country, US
US <- filter(reported_sightings, country == "us")
# Using filter() function to create new data table only including observations in the state,
# ca
State <- filter(US, state == "ca")
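One detail worth keeping in mind with these filters: filter() keeps only rows where the condition is TRUE, so rows whose country is missing (such as the lackland afb, tx row in the preview) are left out of US. The sketch below illustrates this on a hypothetical four-row table modeled on the preview; it is not a claim about how many rows are affected in the full data set.

```r
library(dplyr)

# Hypothetical toy rows modeled on the data preview (country is NA in row 2)
toy <- data.frame(
  city    = c("san marcos", "lackland afb", "chester (uk/england)", "edna"),
  state   = c("tx", "tx", NA, "tx"),
  country = c("us", NA, "gb", "us"),
  stringsAsFactors = FALSE
)

# NA == "us" evaluates to NA, and filter() drops rows whose condition is NA
us_rows <- filter(toy, country == "us")
nrow(us_rows)

# Counting the missing values shows how many rows the filter cannot classify
sum(is.na(toy$country))
```

Running sum(is.na(reported_sightings$country)) on the real table would show how many sightings fall outside every country filter.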
# Using group_by() and summarize() functions to find the city in California with the highest
#count
City <- group_by(State,city)
( sumCity <- summarize(City,count=n()) )
## # A tibble: 1,203 x 2
## city count
## <chr> <int>
## 1 acampo 1
## 2 acton 5
## 3 acton (approx.) 1
## 4 adelanto 4
## 5 agoura hills 6
## 6 agua dulce 2
## 7 aguanga 1
## 8 ahwahnee 2
## 9 alameda 14
## 10 alamo 1
## # ... with 1,193 more rows
# Using as_tibble(), arrange(), and desc() functions to find the city in California with
# the highest count
as_tibble(sumCity) %>%
arrange(desc(count))
## # A tibble: 1,203 x 2
## city count
## <chr> <int>
## 1 los angeles 352
## 2 san diego 336
## 3 sacramento 201
## 4 san francisco 186
## 5 san jose 186
## 6 fresno 107
## 7 long beach 79
## 8 bakersfield 78
## 9 burbank 77
## 10 modesto 77
## # ... with 1,193 more rows
# The city with the highest count was Los Angeles, 352
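The group_by(), summarize(), arrange() and desc() sequence used here can also be written as a single call to dplyr's count(). The sketch below is an alternative under stated assumptions, not the method used in this project: the data frame toy is made up, and name = "count" is set only so the column matches the naming above.

```r
library(dplyr)

# Made-up city column standing in for the State data table
toy <- data.frame(
  city = c("los angeles", "san diego", "los angeles",
           "fresno", "san diego", "los angeles"),
  stringsAsFactors = FALSE
)

# count() groups by city, tallies the rows, and sort = TRUE orders by frequency
city_counts <- count(toy, city, sort = TRUE, name = "count")
city_counts
```

On the real data the equivalent call would be count(State, city, sort = TRUE, name = "count").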
# Using a filter() function to create new data table only including observations in the city,
# Los Angeles
LA <- filter(State, city == "los angeles")
# Using group_by() and summarize() functions to find the Time that most UFO sightings are
# reported in Los Angeles
Time <- group_by(LA,Time)
( sumTime <- summarize(Time,count=n()) )
## # A tibble: 151 x 2
## Time count
## <chr> <int>
## 1 00:00 4
## 2 00:03 1
## 3 00:04 1
## 4 00:06 1
## 5 00:10 2
## 6 00:22 1
## 7 00:27 1
## 8 00:30 1
## 9 01:00 6
## 10 01:08 1
## # ... with 141 more rows
# Using as_tibble(), arrange(), and desc(), functions to find the Time with the highest count
# in Los Angeles
as_tibble(sumTime) %>%
arrange(desc(count))
## # A tibble: 151 x 2
## Time count
## <chr> <int>
## 1 21:00 20
## 2 22:00 15
## 3 23:00 14
## 4 18:00 9
## 5 20:30 9
## 6 22:30 9
## 7 17:00 8
## 8 20:00 8
## 9 13:00 7
## 10 15:00 7
## # ... with 141 more rows
# The Time with the highest count was 21:00, with 20 sightings; 22:00, our worldwide mode,
# was a close second with a count of 15
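Because Time is recorded to the minute, Los Angeles alone has 151 distinct values, and the exact-time mode tends to favor round times such as 21:00. One possible refinement, offered only as a sketch, is to group sightings by hour. The vector times below is made up, and the code assumes every value uses the zero-padded "HH:MM" format seen in the output above.

```r
# Made-up times in the same "HH:MM" format as the Time column
times <- c("21:00", "21:15", "22:00", "20:30", "21:45", "22:30")

# Keep only the hour part, then count sightings per hour
hours <- as.integer(substr(times, 1, 2))
hour_counts <- sort(table(hours), decreasing = TRUE)

hour_counts
names(hour_counts)[1]   # the busiest hour in this toy vector
```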
# Using group_by() and summarize() functions to find the Month that most UFO sightings are
# reported in Los Angeles
Month <- group_by(LA,Month)
( sumMonth <- summarize(Month,count=n()) )
## # A tibble: 12 x 2
## Month count
## <chr> <int>
## 1 1 31
## 2 10 30
## 3 11 29
## 4 12 34
## 5 2 19
## 6 3 30
## 7 4 28
## 8 5 22
## 9 6 35
## 10 7 35
## 11 8 32
## 12 9 27
# Using as_tibble(), arrange(), and desc(), functions to find the Month with the highest count
# in Los Angeles
as_tibble(sumMonth) %>%
arrange(desc(count))
## # A tibble: 12 x 2
## Month count
## <chr> <int>
## 1 6 35
## 2 7 35
## 3 12 34
## 4 8 32
## 5 1 31
## 6 10 30
## 7 3 30
## 8 11 29
## 9 4 28
## 10 9 27
## 11 5 22
## 12 2 19
# There are two months tied for the highest count, June and July, at 35
# This is consistent with our worldwide mode of July, with Summer being the hottest
# season for UFO sightings (no pun intended)
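Since Month is stored as a character column, the grouped summary lists the months alphabetically (1, 10, 11, 12, 2, ...) rather than in calendar order. The sketch below shows one way to restore calendar order by converting Month to an integer. The counts are copied from the Los Angeles summary above; the names sumMonth_toy and by_calendar are hypothetical.

```r
# Monthly counts for Los Angeles, copied from the summary above
sumMonth_toy <- data.frame(
  Month = c("1", "10", "11", "12", "2", "3", "4", "5", "6", "7", "8", "9"),
  count = c(31L, 30L, 29L, 34L, 19L, 30L, 28L, 22L, 35L, 35L, 32L, 27L),
  stringsAsFactors = FALSE
)

# Convert to integer so that ordering is numeric, then sort by calendar month
sumMonth_toy$Month <- as.integer(sumMonth_toy$Month)
by_calendar <- sumMonth_toy[order(sumMonth_toy$Month), ]

# month.abb is a built-in vector of abbreviated English month names
by_calendar$label <- month.abb[by_calendar$Month]
by_calendar
```

The counts sum to 352, which matches the number of Los Angeles sightings reported earlier.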
# Using group_by() and summarize() functions to find the Day that most UFO sightings are
# reported in Los Angeles
Day <- group_by(LA,Day)
( sumDay <- summarize(Day,count=n()) )
## # A tibble: 31 x 2
## Day count
## <chr> <int>
## 1 1 19
## 2 10 9
## 3 11 16
## 4 12 17
## 5 13 18
## 6 14 12
## 7 15 21
## 8 16 8
## 9 17 11
## 10 18 12
## # ... with 21 more rows
# Using as_tibble(), arrange(), and desc(), functions to find the Day with the highest count
# in Los Angeles
as_tibble(sumDay) %>%
arrange(desc(count))
## # A tibble: 31 x 2
## Day count
## <chr> <int>
## 1 15 21
## 2 1 19
## 3 13 18
## 4 12 17
## 5 11 16
## 6 28 16
## 7 2 13
## 8 23 13
## 9 14 12
## 10 18 12
## # ... with 21 more rows
# The Day of the month with the highest count in Los Angeles is the 15th, with 21
# This is consistent with our worldwide mode
# Using group_by() and summarize() functions to find the shape of most UFOs when sightings are
# reported in Los Angeles
Shape <- group_by(LA,shape)
( sumShape <- summarize(Shape,count=n()) )
## # A tibble: 21 x 2
## shape count
## <chr> <int>
## 1 changing 13
## 2 chevron 5
## 3 cigar 16
## 4 circle 36
## 5 cone 3
## 6 cylinder 1
## 7 diamond 4
## 8 disk 29
## 9 egg 5
## 10 fireball 29
## # ... with 11 more rows
# Using as_tibble(), arrange(), and desc(), functions to find the shapes with the highest count
# in Los Angeles
as_tibble(sumShape) %>%
arrange(desc(count))
## # A tibble: 21 x 2
## shape count
## <chr> <int>
## 1 light 63
## 2 circle 36
## 3 triangle 30
## 4 disk 29
## 5 fireball 29
## 6 unknown 26
## 7 sphere 25
## 8 other 21
## 9 cigar 16
## 10 oval 16
## # ... with 11 more rows
# The shapes with the highest count in Los Angeles are light (63), circle (36) and triangle (30)
library(ggplot2)
# Using ggplot2 package and geom_bar() function to create a bar chart that identifies
# The country with the most reported UFO sightings using reported_sightings
ggplot(data = reported_sightings, aes(x = country)) +
geom_bar()
# The US has the most reported sightings out of the list of 5 countries, by far
# Using ggplot2 package and geom_bar() function to create a bar chart that identifies
# the state in the US with the most reported UFO sightings
ggplot(data = US, aes (x=state)) +
geom_bar()
# Most common in California; Florida and Washington essentially tied for second
# Using filter() function to create data table, top_cities, with the 10 top cities in
# California and their respective counts
top_cities <- filter(sumCity, count >= 77)
# Using ggplot2 package and geom_point() function to create a point diagram that identifies
# the city in California with the most reported sightings
ggplot(data = top_cities, aes(x=city, y = count)) +
geom_point()
# The city in California with the most reported UFO sightings is Los Angeles
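By default ggplot2 places character categories on the axis in alphabetical order, which can make the ranking hard to read. As a hedged alternative to the point diagram, the sketch below orders the cities by count and draws horizontal bars. The counts are copied from the top-ten table above; the names top_cities_toy and p, and the choice of geom_col() with coord_flip(), are illustrative assumptions rather than the project's own plot.

```r
library(ggplot2)

# Top ten California cities and counts, copied from the sorted summary above
top_cities_toy <- data.frame(
  city  = c("los angeles", "san diego", "sacramento", "san francisco", "san jose",
            "fresno", "long beach", "bakersfield", "burbank", "modesto"),
  count = c(352L, 336L, 201L, 186L, 186L, 107L, 79L, 78L, 77L, 77L),
  stringsAsFactors = FALSE
)

# reorder() sorts the city axis by count; coord_flip() makes the bars horizontal
p <- ggplot(top_cities_toy, aes(x = reorder(city, count), y = count)) +
  geom_col() +
  coord_flip() +
  labs(x = "City", y = "Reported sightings")
```

Typing p at the console draws the chart, with Los Angeles at the top.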
# Using filter() function to create data table, top_time, with the 10 top times of the day
# in Los Angeles for reports of UFO sightings
top_time <- filter(sumTime, count >= 7)
# Using ggplot2 package and geom_point() function to create a point diagram that identifies
# the time of day in Los Angeles, CA with the most reported sightings
ggplot(data = top_time, aes(x=Time, y = count)) +
geom_point()
# The time of day in Los Angeles, CA when the most UFO sightings are reported is 21:00
# Using filter() function to create data table, top_month, with the top months of
# the year for UFO sightings in LA (count >= 19 keeps all 12 months, since the lowest count is 19)
top_month <- filter(sumMonth, count >= 19)
# Using ggplot2 package and geom_point() function to create a point diagram that identifies
# the month of the year, in LA, with the most reported sightings
ggplot(data = top_month, aes(x=Month, y = count)) +
geom_point()
# The months of the year in Los Angeles, CA when the most UFO sightings are reported are
# June and July
# Using filter() function to create data table, top_day, with the 10 top days of the month,
# in LA, with the most reported sightings
top_day <- filter(sumDay, count >= 12)
# Using ggplot2 package and geom_point() function to create a point diagram that identifies
# the day of the month, in LA, with the most reported sightings
ggplot(data = top_day, aes(x=Day, y = count)) +
geom_point()
# The day of the month in LA when the most UFO sightings are reported is the 15th
# Using filter() function to create data table, top_shape, with the 10 top shapes of UFOs
# reported to be seen in LA
top_shape <- filter(sumShape, count >= 16)
# Using ggplot2 package and geom_point() function to create a point diagram that identifies
# the most commonly reported shape of UFOs In LA sightings: light, circle and triangle
ggplot(data = top_shape, aes(x=shape, y = count)) +
geom_point()
Based on this early exploratory analysis and data visualization, the project has uncovered insights about reported UFO sightings in the United States. According to the data, the United States is the country with the most reported UFO sightings. The state with the most reported sightings is California, and the city within California with the most is Los Angeles. In Los Angeles, the most commonly reported times are 9:00 PM and 10:00 PM, the most common months are June and July, and the most common day of the month is the 15th. The most commonly reported shapes of UFOs sighted in Los Angeles are light, circle, and triangle.