Introduction:

The existence of extraterrestrial beings remains a mystery to mankind. Claims of UFO sightings have been recorded throughout the world, and discovering evidence that a UFO truly exists could lead to the discovery of alien life. The purpose of this project is to perform exploratory analysis and provide data visualization for the UFO Sightings data set found at Kaggle.com. Using RStudio, this project draws on more than 80,000 reports of UFO sightings over the last century to show the most common places, times (of day, month, and year), and object shapes for reported UFO sightings worldwide.

The project looks at variables such as the country, state, city, date, and time of a sighting, as well as the shape of the UFO. Using basic descriptive statistics, such as the mode, and data visualization in the form of bar charts and point plots, it identifies where and when UFO sightings are most often reported and which shape is reported most often.

By describing where and when UFO sightings most often occur, along with the most commonly reported shape, the results from this project can be used to help finally unveil the mystery of alien life. If you dare to go hunting :)

Packages Required:

Below is a brief overview of the packages used throughout this project:

library("tidyr") # used to tidy data
library("tibble") # used to create tibbles
library("DT") # used for displaying R data objects (matrices or data frames) as tables on HTML pages
library("dplyr") # used for data manipulation, works well with tidyr
library("ggplot2") # visualization package
library("readr") # used to more easily read in original data set from csv format

Data Preparation:

Source Data:

The source data can be found at Kaggle.com.

scrubbed <- read_csv("~/R programs/Intro R Final Project/scrubbed.csv/scrubbed.csv")
## Parsed with column specification:
## cols(
##   datetime = col_character(),
##   city = col_character(),
##   state = col_character(),
##   country = col_character(),
##   shape = col_character(),
##   `duration (seconds)` = col_integer(),
##   `duration (hours/min)` = col_character(),
##   comments = col_character(),
##   `date posted` = col_character(),
##   latitude = col_double(),
##   longitude = col_double()
## )
as_tibble(scrubbed)
## # A tibble: 80,332 x 11
##            datetime                 city state country    shape
##               <chr>                <chr> <chr>   <chr>    <chr>
##  1 10/10/1949 20:30           san marcos    tx      us cylinder
##  2 10/10/1949 21:00         lackland afb    tx    <NA>    light
##  3 10/10/1955 17:00 chester (uk/england)  <NA>      gb   circle
##  4 10/10/1956 21:00                 edna    tx      us   circle
##  5 10/10/1960 20:00              kaneohe    hi      us    light
##  6 10/10/1961 19:00              bristol    tn      us   sphere
##  7 10/10/1965 21:00   penarth (uk/wales)  <NA>      gb   circle
##  8 10/10/1965 23:45              norwalk    ct      us     disk
##  9 10/10/1966 20:00            pell city    al      us     disk
## 10 10/10/1966 21:00             live oak    fl      us     disk
## # ... with 80,322 more rows, and 6 more variables: `duration
## #   (seconds)` <int>, `duration (hours/min)` <chr>, comments <chr>, `date
## #   posted` <chr>, latitude <dbl>, longitude <dbl>

The original data set contains a total of 80,332 observations and 11 variables. The 11 variables are listed below:

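Code Book:

datetime (character)
city (character)
state (character)
country (character)
shape (character)
duration (seconds) (integer)
duration (hours/min) (character)
comments (character)
date posted (character)
latitude (double)
longitude (double)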

Data Cleaning:

The original data set, scrubbed.csv, is read into a data table, scrubbed, using the read_csv() function. In the same pipeline, separate() splits the datetime variable into Date and Time, which brings the table to 12 variables. From this table I selected Date, Time, city, state, country, shape, latitude, and longitude into a new data table named reported_sightings, and then separated the Date variable into Month, Day, and Year. The result is a data table with 10 variables of interest. Finally, the tibble sightings_tib represents the reported_sightings data table in a more compact form.

#Reading in csv:
scrubbed <- read_csv("~/R programs/Intro R Final Project/scrubbed.csv/scrubbed.csv") %>% 
#Separating datetime variable into Date and Time by (space):
separate(datetime, into = c("Date", "Time"), sep = " ")
#Creating reported_sightings data table to be used for analysis:
reported_sightings <- select(scrubbed, Date, Time, city, state, country, shape, latitude, longitude) %>%
#Separating the Date variable into Month, Day and Year by /:
separate(Date, into = c("Month", "Day", "Year"), sep = "/")
View(reported_sightings)
#Storing reported_sightings as the tibble sightings_tib and printing it:
sightings_tib <- as_tibble(reported_sightings)
sightings_tib
## # A tibble: 80,332 x 10
##    Month   Day  Year  Time                 city state country    shape
##  * <chr> <chr> <chr> <chr>                <chr> <chr>   <chr>    <chr>
##  1    10    10  1949 20:30           san marcos    tx      us cylinder
##  2    10    10  1949 21:00         lackland afb    tx    <NA>    light
##  3    10    10  1955 17:00 chester (uk/england)  <NA>      gb   circle
##  4    10    10  1956 21:00                 edna    tx      us   circle
##  5    10    10  1960 20:00              kaneohe    hi      us    light
##  6    10    10  1961 19:00              bristol    tn      us   sphere
##  7    10    10  1965 21:00   penarth (uk/wales)  <NA>      gb   circle
##  8    10    10  1965 23:45              norwalk    ct      us     disk
##  9    10    10  1966 20:00            pell city    al      us     disk
## 10    10    10  1966 21:00             live oak    fl      us     disk
## # ... with 80,322 more rows, and 2 more variables: latitude <dbl>,
## #   longitude <dbl>
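As an illustration of the two separate() calls above, here is a minimal sketch on a toy tibble whose values are copied from the first rows shown earlier. The names toy, step1, and step2 are hypothetical. The convert = TRUE argument is an optional addition that the original analysis does not use; it asks separate() to turn the new Month, Day, and Year columns into integers instead of leaving them as character strings.

```r
library(tidyr)
library(tibble)

# Toy stand-in for scrubbed.csv (hypothetical object; values copied from the output above)
toy <- tibble(
  datetime = c("10/10/1949 20:30", "10/10/1955 17:00"),
  city     = c("san marcos", "chester (uk/england)")
)

# Step 1: split datetime at the space into Date and Time
step1 <- separate(toy, datetime, into = c("Date", "Time"), sep = " ")

# Step 2: split Date at "/" into Month, Day, and Year.
# convert = TRUE (an assumption, not in the original code) converts the pieces to integers.
step2 <- separate(step1, Date, into = c("Month", "Day", "Year"),
                  sep = "/", convert = TRUE)

step2
```

With integer months, sorting and plotting would follow calendar order (1, 2, ..., 12) rather than the character order (1, 10, 11, 12, 2, ...) that appears in the grouped output of this report.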

Exploratory Analysis:

Descriptive Statistics:

For an exploratory analysis of the reported_sightings data table, this study created data tables for the individual variables of interest (Month, Day, Year, Time, city, state, country, shape, latitude and longitude). Then, vectors were created for each of these data tables and each variable’s mode was found, for the categorical variables as well as the two numerical ones (latitude and longitude). The summary shows that we are dealing mainly with categorical variables. For these categorical variables, we are interested in the most commonly occurring character value, or the mode of each variable. The mode is the descriptive statistic that identifies where and when the most UFO sightings occur, along with the most often reported shape.

summary(reported_sightings)
##     Month               Day                Year          
##  Length:80332       Length:80332       Length:80332      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##      Time               city              state          
##  Length:80332       Length:80332       Length:80332      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##    country             shape              latitude        longitude      
##  Length:80332       Length:80332       Min.   :-82.86   Min.   :-176.66  
##  Class :character   Class :character   1st Qu.: 34.13   1st Qu.:-112.07  
##  Mode  :character   Mode  :character   Median : 39.41   Median : -87.90  
##                                        Mean   : 38.12   Mean   : -86.77  
##                                        3rd Qu.: 42.79   3rd Qu.: -78.75  
##                                        Max.   : 72.70   Max.   : 178.44  
##                                        NA's   :1
month <- select(reported_sightings, Month)
month <- table(as.vector(month))
names(month)[month == max(month)] #July
## [1] "7"
day <- select(reported_sightings, Day)
day <- table(as.vector(day))
names(day)[day == max(day)] #15th most common with frequency of 5968
## [1] "15"
year <- select(reported_sightings, Year)
year <- table(as.vector(year))
names(year)[year == max(year)] #2012
## [1] "2012"
time <- select(reported_sightings, Time)
time <- table(as.vector(time))
names(time)[time == max(time)] #22:00 or 10:00 PM
## [1] "22:00"
city <- select(reported_sightings, city)
city <- table(as.vector(city))
names(city)[city == max(city)] #Seattle
## [1] "seattle"
state <- select(reported_sightings, state)
state <- table(as.vector(state))
names(state)[state == max(state)] #California
## [1] "ca"
country <- select(reported_sightings, country)
country <- table(as.vector(country))
names(country)[country == max(country)] #US
## [1] "us"
shape1 <- select(reported_sightings, shape)
shape1 <- table(as.vector(shape1))
names(shape1)[shape1 == max(shape1)] #light
## [1] "light"
latitude <- select(reported_sightings, latitude)
latitude <- table(as.vector(latitude))
names(latitude)[latitude == max(latitude)] #47.6063889
## [1] "47.6063889"
longitude <- select(reported_sightings, longitude)
longitude <- table(as.vector(longitude))
names(longitude)[longitude == max(longitude)] #-122.3308333
## [1] "-122.3308333"
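The select(), table(), names() pattern repeated above could be wrapped in a small helper. The following is a minimal sketch, not the code used to produce the results in this report; the function name stat_mode is hypothetical. It returns every value tied for the highest count, and missing values are ignored because table() drops NA by default.

```r
# Hypothetical helper: returns the most frequent value(s) of a vector as character strings
stat_mode <- function(x) {
  counts <- table(x)                    # frequency of each distinct value (NA dropped)
  names(counts)[counts == max(counts)]  # all values tied for the maximum count
}

# Toy examples (made-up vectors, not the real data)
stat_mode(c("light", "circle", "light", "disk"))
stat_mode(c("6", "7", "7", "6", "12"))   # a tie returns both values
```

Under these assumptions, a call such as stat_mode(reported_sightings$shape) would play the same role as the three-line shape1 computation above.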

This study focuses on the United States because it is the country with the most reported sightings. After filtering the reported_sightings data table for the value us in the country variable, I created a new data table, US, that contains 65,114 observations (the number of reported sightings in the US over the past century). Next, I filtered the US data table for the state with the most reported sightings. This was the character value ca, or California.

# Using filter() function to create new data table only including observations in the
#country, US
US <- filter(reported_sightings, country == "us")
# Using filter() function to create new data table only including observations in the state,
# ca
State <- filter(US, state == "ca")
# Using group_by() and summarize() functions to find the city in California with the highest
#count
City  <- group_by(State,city)
  ( sumCity <- summarize(City,count=n()) )
## # A tibble: 1,203 x 2
##               city count
##              <chr> <int>
##  1          acampo     1
##  2           acton     5
##  3 acton (approx.)     1
##  4        adelanto     4
##  5    agoura hills     6
##  6      agua dulce     2
##  7         aguanga     1
##  8        ahwahnee     2
##  9         alameda    14
## 10           alamo     1
## # ... with 1,193 more rows
# Using as_tibble(), arrange(), and desc() functions to find the city in California with
# the highest count
as_tibble(sumCity) %>%
  arrange(desc(count))
## # A tibble: 1,203 x 2
##             city count
##            <chr> <int>
##  1   los angeles   352
##  2     san diego   336
##  3    sacramento   201
##  4 san francisco   186
##  5      san jose   186
##  6        fresno   107
##  7    long beach    79
##  8   bakersfield    78
##  9       burbank    77
## 10       modesto    77
## # ... with 1,193 more rows
# The city with the highest count was Los Angeles, 352
# Using a filter() function to create new data table only including observations in the city,
# Los Angeles
LA <- filter(State, city == "los angeles")
# Using group_by() and summarize() functions to find the Time that most UFO sightings are
# reported in Los Angeles
Time <- group_by(LA,Time)
  ( sumTime <- summarize(Time,count=n()) )
## # A tibble: 151 x 2
##     Time count
##    <chr> <int>
##  1 00:00     4
##  2 00:03     1
##  3 00:04     1
##  4 00:06     1
##  5 00:10     2
##  6 00:22     1
##  7 00:27     1
##  8 00:30     1
##  9 01:00     6
## 10 01:08     1
## # ... with 141 more rows
# Using as_tibble(), arrange(), and desc(), functions to find the Time with the highest count
# in Los Angeles
as_tibble(sumTime) %>%
  arrange(desc(count))
## # A tibble: 151 x 2
##     Time count
##    <chr> <int>
##  1 21:00    20
##  2 22:00    15
##  3 23:00    14
##  4 18:00     9
##  5 20:30     9
##  6 22:30     9
##  7 17:00     8
##  8 20:00     8
##  9 13:00     7
## 10 15:00     7
## # ... with 141 more rows
# The Time with the highest count was 21:00, with 20 sightings; 22:00, our worldwide mode,
# was a close second with a count of 15

# Using group_by() and summarize() functions to find the Month that most UFO sightings are
# reported in Los Angeles
Month <- group_by(LA,Month)
( sumMonth <- summarize(Month,count=n()) )
## # A tibble: 12 x 2
##    Month count
##    <chr> <int>
##  1     1    31
##  2    10    30
##  3    11    29
##  4    12    34
##  5     2    19
##  6     3    30
##  7     4    28
##  8     5    22
##  9     6    35
## 10     7    35
## 11     8    32
## 12     9    27
# Using as_tibble(), arrange(), and desc(), functions to find the Month with the highest count
# in Los Angeles
as_tibble(sumMonth) %>%
  arrange(desc(count))
## # A tibble: 12 x 2
##    Month count
##    <chr> <int>
##  1     6    35
##  2     7    35
##  3    12    34
##  4     8    32
##  5     1    31
##  6    10    30
##  7     3    30
##  8    11    29
##  9     4    28
## 10     9    27
## 11     5    22
## 12     2    19
# There are two months tied for the highest count, June and July, at 35
# This is consistent with our worldwide mode of July, with Summer being the hottest
# season for UFO sightings (no pun intended)

# Using group_by() and summarize() functions to find the Day that most UFO sightings are
# reported in Los Angeles
Day <- group_by(LA,Day)
( sumDay <- summarize(Day,count=n()) )
## # A tibble: 31 x 2
##      Day count
##    <chr> <int>
##  1     1    19
##  2    10     9
##  3    11    16
##  4    12    17
##  5    13    18
##  6    14    12
##  7    15    21
##  8    16     8
##  9    17    11
## 10    18    12
## # ... with 21 more rows
# Using as_tibble(), arrange(), and desc() functions to find the Day with the highest count
# in Los Angeles
as_tibble(sumDay) %>%
  arrange(desc(count))
## # A tibble: 31 x 2
##      Day count
##    <chr> <int>
##  1    15    21
##  2     1    19
##  3    13    18
##  4    12    17
##  5    11    16
##  6    28    16
##  7     2    13
##  8    23    13
##  9    14    12
## 10    18    12
## # ... with 21 more rows
# The Day of the month with the highest count in Los Angeles is the 15th, with 21
# This is consistent with our worldwide mode

# Using group_by() and summarize() functions to find the shape of most UFOs when sightings are
# reported in Los Angeles
Shape <- group_by(LA,shape)
( sumShape <- summarize(Shape,count=n()) )
## # A tibble: 21 x 2
##       shape count
##       <chr> <int>
##  1 changing    13
##  2  chevron     5
##  3    cigar    16
##  4   circle    36
##  5     cone     3
##  6 cylinder     1
##  7  diamond     4
##  8     disk    29
##  9      egg     5
## 10 fireball    29
## # ... with 11 more rows
# Using as_tibble(), arrange(), and desc(), functions to find the shapes with the highest count
# in Los Angeles
as_tibble(sumShape) %>%
  arrange(desc(count))
## # A tibble: 21 x 2
##       shape count
##       <chr> <int>
##  1    light    63
##  2   circle    36
##  3 triangle    30
##  4     disk    29
##  5 fireball    29
##  6  unknown    26
##  7   sphere    25
##  8    other    21
##  9    cigar    16
## 10     oval    16
## # ... with 11 more rows
# The shapes with the highest count in Los Angeles are light (63), circle (36) and triangle (30)
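The filter(), group_by(), summarize(), arrange() sequence used throughout this section can also be written as one pipeline with dplyr's count(). The sketch below runs on a made-up toy tibble (the object names toy and ca_counts are hypothetical) and is offered as an alternative formulation, not as the code behind the results above.

```r
library(dplyr)
library(tibble)

# Toy stand-in for reported_sightings (made-up rows, same column names)
toy <- tibble(
  country = c("us", "us", "us", "gb", "us"),
  state   = c("ca", "ca", "wa", NA, "ca"),
  city    = c("los angeles", "san diego", "seattle",
              "chester (uk/england)", "los angeles")
)

# Keep California rows, then count sightings per city, largest count first
ca_counts <- toy %>%
  filter(country == "us", state == "ca") %>%
  count(city, sort = TRUE, name = "count")

ca_counts
```

Swapping city for Time, Month, Day, or shape in the count() call would give the other frequency tables in the same way.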

Data Visualization

library(ggplot2)
# Using ggplot2 package and geom_bar() function to create a bar chart that identifies
# The country with the most reported UFO sightings using reported_sightings
ggplot(data = reported_sightings, aes(x = country)) +
  geom_bar() 

# The US has the most reported sightings out of the list of 5 countries, by far
# Using ggplot2 package and geom_bar() function to create a bar chart that identifies
# the state in the US with the most reported UFO sightings
ggplot(data = US, aes (x=state)) +
  geom_bar() 

# Most common in California; Florida and Washington essentially tied for second
# Using filter() function to create data table, top_cities, with the 10 top cities in
# California and their respective counts
top_cities <- filter(sumCity, count >= 77)
# Using ggplot2 package and geom_point() function to create a point diagram that identifies
# the city in California with the most reported sightings
ggplot(data = top_cities, aes(x=city, y = count)) +
  geom_point()
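The top-10 tables in this section are selected with a hand-picked threshold (for example, count >= 77). A possible alternative, assuming dplyr 1.0.0 or later where slice_max() is available, is to ask for the top rows directly. The sketch below uses a few counts copied from the sorted sumCity output plus its lowest-count first row; the object names are hypothetical.

```r
library(dplyr)
library(tibble)

# A few rows copied from the sumCity output above (hypothetical object name)
city_counts <- tibble(
  city  = c("acampo", "los angeles", "san diego", "sacramento",
            "san francisco", "san jose"),
  count = c(1L, 352L, 336L, 201L, 186L, 186L)
)

# Top 3 cities by count, returned in descending order
top3 <- slice_max(city_counts, count, n = 3)

# Ties are kept by default, so n = 4 returns 5 rows here (two cities share 186)
top4 <- slice_max(city_counts, count, n = 4)

top3
```

Because ties are kept by default, the result can contain more than n rows, which is the same behavior a threshold filter has when several cities share the cutoff count.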

# The city in California with the most reported UFO sightings is Los Angeles
# Using filter() function to create data table, top_time, with the 10 top times of the day
# in Los Angeles for reports of UFO sightings
top_time <- filter(sumTime, count >= 7)
# Using ggplot2 package and geom_point() function to create a point diagram that identifies
# the time of day in Los Angeles, CA with the most reported sightings
ggplot(data = top_time, aes(x=Time, y = count)) +
  geom_point()

# The time of day in Los Angeles, CA when the most UFO sightings are reported is 21:00
# Using filter() function to create data table, top_month, with the months of
# the year for UFO sightings in LA (a threshold of 19 keeps all 12 months)
top_month <- filter(sumMonth, count >= 19)
# Using ggplot2 package and geom_point() function to create a point diagram that identifies
# the month of the year, in LA, with the most reported sightings
ggplot(data = top_month, aes(x=Month, y = count)) +
  geom_point()

# The months of the year in Los Angeles, CA when the most UFO sightings are reported are
# June and July
# Using filter() function to create data table, top_day, with the 10 top days of the month,
# in LA, with the most reported sightings
top_day <- filter(sumDay, count >= 12)
# Using ggplot2 package and geom_point() function to create a point diagram that identifies
# the day of the month, in LA, with the most reported sightings
ggplot(data = top_day, aes(x=Day, y = count)) +
  geom_point()

# The day of the month in LA when the most UFO sightings are reported is the 15th
# Using filter() function to create data table, top_shape, with the 10 top shapes of UFOs
# reported to be seen in LA
top_shape <- filter(sumShape, count >= 16)
# Using ggplot2 package and geom_point() function to create a point diagram that identifies
# the most commonly reported shape of UFOs In LA sightings: light, circle and triangle
ggplot(data = top_shape, aes(x=shape, y = count)) +
  geom_point()
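Because Month is stored as a character string, the month plot above orders its axis as 1, 10, 11, 12, 2, and so on. The following sketch shows one way to draw the same counts as a bar chart in calendar order. The counts are copied from the sumMonth output; the object names la_months and p, the axis labels, and the choice of geom_col() are assumptions for illustration, not part of the original analysis.

```r
library(ggplot2)

# Monthly counts for Los Angeles, copied from the sumMonth output above
la_months <- data.frame(
  Month = c("1", "10", "11", "12", "2", "3", "4", "5", "6", "7", "8", "9"),
  count = c(31, 30, 29, 34, 19, 30, 28, 22, 35, 35, 32, 27)
)

# Convert the character month to an ordered factor labelled Jan ... Dec
la_months$Month <- factor(as.integer(la_months$Month),
                          levels = 1:12, labels = month.abb)

# geom_col() draws bars whose heights are the precomputed counts
p <- ggplot(la_months, aes(x = Month, y = count)) +
  geom_col() +
  labs(x = "Month", y = "Reported sightings",
       title = "UFO sightings reported in Los Angeles by month")

print(p)
```

geom_col() is used here instead of geom_bar() because the counts are already summarized; geom_bar() counts rows itself, as in the country and state charts above.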

Summary:

Based on early exploratory analysis and data visualization, this project has uncovered patterns in reported UFO sightings. According to the data, the United States is the country with the most reported UFO sightings. The state with the most reported sightings is California, and the city with the most within California is Los Angeles. In Los Angeles, sightings are most often reported at 9:00 PM, with 10:00 PM second, in June and July, and on the 15th day of the month. The most commonly reported shapes of UFOs sighted in Los Angeles are light, circle, and triangle.