Introduction:

The existence of extraterrestrial beings remains a mystery to mankind. Claims of UFO sightings have been recorded throughout the world and discovering evidence that a UFO truly does exist will almost certainly lead to the discovery of alien life. The purpose of this project is to perform exploratory analysis and provide data visualization for the UFO Sightings data set that can be found at Kaggle.com. Through RStudio, this project utilizes 80,000 reports of UFO sightings over the last century to show the most common places, times of day/month/year, and object shapes for reported UFO sightings worldwide.

The project will look at variables such as country, state, city, date and time of a sighting as well as the shape of the UFO. Through the use of basic descriptive statistics, such as the mode, as well as data visualization in the form of bar charts, this project aims to provide a comprehensive analysis of reported UFO sightings. This project uses exploratory analysis to identify where and when the most common UFO sightings occured, as well as the most commonly reported shape, over the past century.

By describing where and when reported UFO sightings most often occured, the results from this project will identify when and where the most alien activity has occured throughout our recent history. These insights can be used in proceeding predictive analysis with the hopes of unveiling the ancient mistery of alien life. If you dare go hunting :)

Packages Required:

Below is a brief overview of the packages used throughout this project:

library("tidyverse") # Used to load packages dplyr, tidyr, tibble, readr, readxl and #ggplot2
library("DT") # Used to make data tables
library("knitr") # knit function
library("maps") #Used for data visualization
library("mapdata") #Used for data visualization
library("devtools")
library("stringr")

Packages:

tidyverse: Used to load packages dplyr, tidyr, tibble, readr, readxl and ggplot2
DT: Used to make data tables
knitr: knit function
maps: Used for data visualization
mapdata: Used for data visualization
devtools: Used to configure package behavior

Note: The tidyverse package is a package of packages and automatically loads other packages used in this project.

dplyr: Used for data manipulation
tidyr: Used to tidy data
tibble: Used to represent data as a tibble
readr: Used to read in .csv files
readxl: Used to read in excel files
ggplot2: Used for data visualization
stringr: Used for common string operations

Data Preparation:

Source Data:

The source data can be found at Kaggle.com.

scrubbed <- read_csv("~/R programs/Intro R Final Project/scrubbed.csv/scrubbed.csv")
as_tibble(scrubbed)
## # A tibble: 80,332 x 11
##            datetime                 city state country    shape
##               <chr>                <chr> <chr>   <chr>    <chr>
##  1 10/10/1949 20:30           san marcos    tx      us cylinder
##  2 10/10/1949 21:00         lackland afb    tx    <NA>    light
##  3 10/10/1955 17:00 chester (uk/england)  <NA>      gb   circle
##  4 10/10/1956 21:00                 edna    tx      us   circle
##  5 10/10/1960 20:00              kaneohe    hi      us    light
##  6 10/10/1961 19:00              bristol    tn      us   sphere
##  7 10/10/1965 21:00   penarth (uk/wales)  <NA>      gb   circle
##  8 10/10/1965 23:45              norwalk    ct      us     disk
##  9 10/10/1966 20:00            pell city    al      us     disk
## 10 10/10/1966 21:00             live oak    fl      us     disk
## # ... with 80,322 more rows, and 6 more variables: `duration
## #   (seconds)` <int>, `duration (hours/min)` <chr>, comments <chr>, `date
## #   posted` <chr>, latitude <dbl>, longitude <dbl>

The original data set contains a total of 80,332 observations and 11 variables. The first 20 rows of the data set and columns for the original 11 variables can be found below:

datatable(head(scrubbed, 20))

Data Cleaning:

The original data set, scrubbed.csv, is read in to a data table, scrubbed, using the read_csv() function. Next, following a pipe function, the datetime variable is separated into Date and Time. I decided to clean the original data set, now with 12 variables, and create a new data table named reported_sightings including Date, Time, city, state, country, shape, latitude, and longitude. Then, following a pipe function, I separated the Date variable into Month, Day and Year. Now, I have a data table with 10 variables of interest. The tibble, sightings_tib, was created to represent the reported_sightings data table in a simpler fashion.

#Reading in csv:
scrubbed <- read_csv("~/R programs/Intro R Final Project/scrubbed.csv/scrubbed.csv") %>% 
#Separating datetime variable into Date and Time by (space):
separate(datetime, into = c("Date", "Time"), sep = " ")
#Creating reported_sightings data table to used for analysis:
reported_sightings <- select(scrubbed, Date, Time, city, state, country, shape, latitude, longitude) %>%
#Separating the Date variable into Month, Day and Year by /:
separate(Date, into = c("Month", "Day", "Year"), sep = "/")
View(reported_sightings)
#Store
as_tibble(reported_sightings)
## # A tibble: 80,332 x 10
##    Month   Day  Year  Time                 city state country    shape
##  * <chr> <chr> <chr> <chr>                <chr> <chr>   <chr>    <chr>
##  1    10    10  1949 20:30           san marcos    tx      us cylinder
##  2    10    10  1949 21:00         lackland afb    tx    <NA>    light
##  3    10    10  1955 17:00 chester (uk/england)  <NA>      gb   circle
##  4    10    10  1956 21:00                 edna    tx      us   circle
##  5    10    10  1960 20:00              kaneohe    hi      us    light
##  6    10    10  1961 19:00              bristol    tn      us   sphere
##  7    10    10  1965 21:00   penarth (uk/wales)  <NA>      gb   circle
##  8    10    10  1965 23:45              norwalk    ct      us     disk
##  9    10    10  1966 20:00            pell city    al      us     disk
## 10    10    10  1966 21:00             live oak    fl      us     disk
## # ... with 80,322 more rows, and 2 more variables: latitude <dbl>,
## #   longitude <dbl>
datatable(head(reported_sightings, 20))

Data Dictionary:

var_name <- colnames(reported_sightings)
var_des <- c("Month of the year sighting was reported", "Day of the month sighting was reported", "Year sighting was reported", "Time of the day sighting was reported", "City in which sighting was reported", "State in which sighting was reported", "Country in which sighting was reported", "Shape that was reported", "Latitudinal coordinate where sighting was reported", "Longitudinal coordinate where sighting was reported")
var_type <- sapply(reported_sightings, class)
data_dictionary <- as.data.frame(cbind(var_name, var_type, var_des))
kable(data_dictionary, caption = "Variable Dictionary:")
Variable Dictionary:
var_name var_type var_des
Month Month character Month of the year sighting was reported
Day Day character Day of the month sighting was reported
Year Year character Year sighting was reported
Time Time character Time of the day sighting was reported
city city character City in which sighting was reported
state state character State in which sighting was reported
country country character Country in which sighting was reported
shape shape character Shape that was reported
latitude latitude numeric Latitudinal coordinate where sighting was reported
longitude longitude numeric Longitudinal coordinate where sighting was reported

Exploratory Analysis:

Descriptive Statistics:

International:

For an exploratory analysis of the reported_sightings data table, this study created data tables for the individual variables of interest (Month, Day, Year, Time, city, state, country, shape, latitude and longitude). Then, vectors were created for each of these data tables and each variable’s respective mode was found for the categorical variables (all excluding latitude and longitude). The summary shows that we are dealing with mainly categorical variables. For these categorical variables, we are interested in the most commonly occuring character value, or the mode for each variable. The mode is the descriptive statistic that will identify where and when the most UFO sightings occured along with the most often reported shape.

summary(reported_sightings)
##     Month               Day                Year          
##  Length:80332       Length:80332       Length:80332      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##      Time               city              state          
##  Length:80332       Length:80332       Length:80332      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##    country             shape              latitude        longitude      
##  Length:80332       Length:80332       Min.   :-82.86   Min.   :-176.66  
##  Class :character   Class :character   1st Qu.: 34.13   1st Qu.:-112.07  
##  Mode  :character   Mode  :character   Median : 39.41   Median : -87.90  
##                                        Mean   : 38.12   Mean   : -86.77  
##                                        3rd Qu.: 42.79   3rd Qu.: -78.75  
##                                        Max.   : 72.70   Max.   : 178.44  
##                                        NA's   :1
month <- select(reported_sightings, Month)
month <- table(as.vector(month))
names(month)[month == max(month)] #July
## [1] "7"

Mode - Month: July

day <- select(reported_sightings, Day)
day <- table(as.vector(day))
names(day)[day == max(day)] #15th most common with frequency of 5968
## [1] "15"

Mode - Day: 15th

year <- select(reported_sightings, Year)
year <- table(as.vector(year))
names(year)[year == max(year)] #2012
## [1] "2012"

Mode - Year: 2012

time <- select(reported_sightings, Time)
time <- table(as.vector(time))
names(time)[time == max(time)] #22:00 or 10:00 PM
## [1] "22:00"

Mode - Time: 22:00

city <- select(reported_sightings, city)
city <- table(as.vector(city))
names(city)[city == max(city)] #Seattle
## [1] "seattle"

Mode - City: Seattle

state <- select(reported_sightings, state)
state <- table(as.vector(state))
names(state)[state == max(state)] #California
## [1] "ca"

Mode - Country: US

country <- select(reported_sightings, country)
country <- table(as.vector(country))
names(country)[country == max(country)] #US
## [1] "us"

Mode - Shape: light

shape1 <- select(reported_sightings, shape)
shape1 <- table(as.vector(shape1))
names(shape1)[shape1 == max(shape1)] #light
## [1] "light"

Based on descriptive statistics found in the above exploratory analysis, important insights into where and when the most UFO sightings were reported, as well as their most common shape were determined on a global scale:

  Month: July
  Day: 15th day of the month
  Year: 2012
  Time: 22:00 or 10:00PM
  City: Seattle, Washington, USA
  Country: United States
  Shape: Light

United States:

This study decided to focus on the United States because this was the country with the most reported sightings. After filtering the reported_sightings data table for us character values in the variable country, I created a new data table, US, that contains 65,114 observations (the number of reported sighitings in the US over the past century). Using the US data table, this project ran further exploratory analysis to determine where, when and in what shape most UFO sightings are reported in the US. I filtered the US data table for the state variable with the most reported sightings, ca or California (count 8912) and then created a new data table named State.

The number of reported UFO sightings in California: 8912

Found in the tibble below

# Using filter() function to create new data table only including observations in the
# country, US
US <- filter(reported_sightings, country == "us")
# Using filter() function to create new data table only including observations in the      # state, ca and group_by() and summarize() functions to find the state in the US with #highest count
State <- filter(US, state == "ca")
  ( sumState <- summarize(State,count=n()) ) # count 8912 for ca
## # A tibble: 1 x 1
##   count
##   <int>
## 1  8912

Los Angeles, CA:

To further focus the exploratory analysis, this study determined the city in California with the most reported UFO sightings, los angeles or Los Angeles.

Now that we knew California was the state with the most reported sightings, this study explored for the city in California where the most UFO sightings were reported. Now we can determine an individual city in CA and finally figure out exactly “where” the most UFO sightings were reported in the US.

The number of reported UFO sightings in Los Angeles, California: 352

Found in the tibble below

# Using group_by() and summarize() functions to find the city in California with the #highest
#count
City  <- group_by(State,city)
  ( sumCity <- summarize(City,count=n()) )
## # A tibble: 1,203 x 2
##               city count
##              <chr> <int>
##  1          acampo     1
##  2           acton     5
##  3 acton (approx.)     1
##  4        adelanto     4
##  5    agoura hills     6
##  6      agua dulce     2
##  7         aguanga     1
##  8        ahwahnee     2
##  9         alameda    14
## 10           alamo     1
## # ... with 1,193 more rows
# Using as_tibble(), arrange(), and desc() functions to find the city in California with
# the highest count
as_tibble(sumCity) %>%
 arrange(desc(count))
## # A tibble: 1,203 x 2
##             city count
##            <chr> <int>
##  1   los angeles   352
##  2     san diego   336
##  3    sacramento   201
##  4 san francisco   186
##  5      san jose   186
##  6        fresno   107
##  7    long beach    79
##  8   bakersfield    78
##  9       burbank    77
## 10       modesto    77
## # ... with 1,193 more rows
# The city with the highest count was Los Angeles, 352

Now that the “where” aspect of our exploratory analysis is complete, this study pivots to determining what time of the day, month, and year, in Los Angeles, that the most UFO sightings were reported along with their most commonly reported shapes.

With a data table, LA, now the analysis focuses on only the observations from this city.

Note: This study ignores the year in Los Angeles, CA with the most reported UFO sightings because the variable is not very useful in any future predictive analysis that tis exploratory analysis hopes to inspire.

The time of the day with the most reported sightings in LA: 21:00 or 9:00PM

Found in the tibble below

# Using a filter() function to create new data table only including observations in the city,
# Los Angeles
LA <- filter(State, city == "los angeles")
# Using group_by() and summarize() functions to find the Time that most UFO sightings are
# reported in Los Angeles
Time <- group_by(LA,Time)
  ( sumTime <- summarize(Time,count=n()) )
## # A tibble: 151 x 2
##     Time count
##    <chr> <int>
##  1 00:00     4
##  2 00:03     1
##  3 00:04     1
##  4 00:06     1
##  5 00:10     2
##  6 00:22     1
##  7 00:27     1
##  8 00:30     1
##  9 01:00     6
## 10 01:08     1
## # ... with 141 more rows
# Using as_tibble(), arrange(), and desc(), functions to find the Time with the highest #count
# in Los angeles
as_tibble(sumTime) %>%
  arrange(desc(count))
## # A tibble: 151 x 2
##     Time count
##    <chr> <int>
##  1 21:00    20
##  2 22:00    15
##  3 23:00    14
##  4 18:00     9
##  5 20:30     9
##  6 22:30     9
##  7 17:00     8
##  8 20:00     8
##  9 13:00     7
## 10 15:00     7
## # ... with 141 more rows
# The Time with the highest count was 21:00, 20 and 22:00, our worldwide mode, at a close
# second with a count of 15

The time of the month with the most reported sightings in LA: June and July

Found in the tibble below

# Using group_by() and summarize() functions to find the Month that most UFO sightings are
# reported in Los Angeles
Month <- group_by(LA,Month)
( sumMonth <- summarize(Month,count=n()) )
## # A tibble: 12 x 2
##    Month count
##    <chr> <int>
##  1     1    31
##  2    10    30
##  3    11    29
##  4    12    34
##  5     2    19
##  6     3    30
##  7     4    28
##  8     5    22
##  9     6    35
## 10     7    35
## 11     8    32
## 12     9    27
# Using as_tibble(), arrange(), and desc(), functions to find the Month with the highest #count
# in Los angeles
as_tibble(sumMonth) %>%
  arrange(desc(count))
## # A tibble: 12 x 2
##    Month count
##    <chr> <int>
##  1     6    35
##  2     7    35
##  3    12    34
##  4     8    32
##  5     1    31
##  6    10    30
##  7     3    30
##  8    11    29
##  9     4    28
## 10     9    27
## 11     5    22
## 12     2    19
# There are two months tied for the highest count, June and July, at 35
# This is consistent with our woldwide mode of July, with Summer being the hottest
# season for UFO sightings (no pun intended)

The day of the month with the most reported sightings in LA: 15th

Found in the tibble below

# Using group_by() and summarize() functions to find the Day that most UFO sightings are
# reported in Los Angeles
Day <- group_by(LA,Day)
( sumDay <- summarize(Day,count=n()) )
## # A tibble: 31 x 2
##      Day count
##    <chr> <int>
##  1     1    19
##  2    10     9
##  3    11    16
##  4    12    17
##  5    13    18
##  6    14    12
##  7    15    21
##  8    16     8
##  9    17    11
## 10    18    12
## # ... with 21 more rows
# Using as_tibble(), arrange(), and desc(), functions to find the Time with the highest #count
# in Los angeles
as_tibble(sumDay) %>%
  arrange(desc(count))
## # A tibble: 31 x 2
##      Day count
##    <chr> <int>
##  1    15    21
##  2     1    19
##  3    13    18
##  4    12    17
##  5    11    16
##  6    28    16
##  7     2    13
##  8    23    13
##  9    14    12
## 10    18    12
## # ... with 21 more rows
# The Day of the month with the higest count in Los Angeles is the 15th, with 21
# This is consistent with our worldwide mode

The most commonly reported shape in LA: light, followed by circle and triangle

Found in the tibble below

# Using group_by() and summarize() functions to find the shape of most UFOs when sightings # are
# reported in Los Angeles
Shape <- group_by(LA,shape)
( sumShape <- summarize(Shape,count=n()) )
## # A tibble: 21 x 2
##       shape count
##       <chr> <int>
##  1 changing    13
##  2  chevron     5
##  3    cigar    16
##  4   circle    36
##  5     cone     3
##  6 cylinder     1
##  7  diamond     4
##  8     disk    29
##  9      egg     5
## 10 fireball    29
## # ... with 11 more rows
# Using as_tibble(), arrange(), and desc(), functions to find the shapes with the highest #count
# in Los angeles
as_tibble(sumShape) %>%
  arrange(desc(count))
## # A tibble: 21 x 2
##       shape count
##       <chr> <int>
##  1    light    63
##  2   circle    36
##  3 triangle    30
##  4     disk    29
##  5 fireball    29
##  6  unknown    26
##  7   sphere    25
##  8    other    21
##  9    cigar    16
## 10     oval    16
## # ... with 11 more rows
# The shapes with the highest count in Los Angeles are light(63), circle(36) and #triangle(30)

Based on descriptive statistics found in the above exploratory analysis, important insights into where and when the most UFO sightings were reported, as well as their most common shape, were determined for Los Angeles, California, USA:

Time: 21:00 or 9:00PM (Second highest count was 22:00, our worldwide mode).
Month: June and July are tied at a count of 35 (Our worldwide mode was July)
Day: The 15th of the month (This is the same as our worldwide mode)
Shape: light (count = 63), circle, triangle (Our worldwide mode is also light)

Data Visualization

International:

In the graph below, data visualization for the country with the most reported sightings is displayed in the form of a bar chart. It is clear that the US is the country that had the most reported sightings amongst our original data set.

# Using ggplot2 package and geom_bar() function to create a bar chart that identifies
# The country with the most reported UFO sightings using reported_sightings
countries <- select(reported_sightings, country)
countries_1 <- na.omit(countries)
ggplot(data = countries_1, aes(x = country)) +
  geom_bar() +
  ggtitle("Country vs. Count")

# The US has the most reported sightings out of the list of 5 countries, by far

United States:

Now, our visualization focuses solely on the US. Below, a bar chart displaying the US states and their counts along with a map of the US shaded based on their counts, n, prove that California is the state in the US in which most UFO sightings were reported.

Note: The latitude and longitude values for observations in the US were used to develop the map visualization, with states shaded based on their respective count of reported sightings, n.

# Using ggplot2 package and geom_bar() function to create a bar chart that identifies
# the state in the US with the most reported UFO sightings
ggplot(data = US, aes (x=state)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = -45)) +
  ggtitle("State vs. Count")

# Most common in California; Florida and Washington essentially tied for second
# loading readxl package
library(readxl)
us <- filter(reported_sightings, country == "us")
# Using map_data() function to get US state data table with longitude and latitude and #state name and creating a new data table, states
states <- map_data("state")
# creating new data table, count_states, from the us data table and using group_by() #function to group by state and the tally() function to create counts
count_states <- us %>% group_by(state) %>%
  tally()


capFirst <- function(s) {
  paste(toupper(substring(s, 1, 1)), substring(s, 2), sep = "")
}

states$region <- capFirst(states$region)
colnames(states) <- c("long", "lat", "group", "order", "State", "subregion")

State_Names <- read_excel("C:\\Users\\morga\\OneDrive\\Documents\\SAS data sets\\State Names.xlsx")
View(State_Names)
a <- merge(count_states, State_Names, by="state", all.x=T)
View(a)
# Creating final data table to create map of US states by merging the states data table #with the a data table
state_Map <- merge(states, a, by="State", all.x = T)
# ordering state_Map data set by order
state_Map <- state_Map[order(state_Map$order),]
# Using ggplot to create map of US states shaded based on count of latitude and longitude
ggplot(state_Map, aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(fill = n)) +
  geom_path() +
  scale_fill_gradientn(colours=rev("rainbow"(10)),na.value="grey90") +
  coord_map() +
  ggtitle("Reported UFO Sightings: USA")

Los Angeles, CA:

Next, this study used a bar chart to visualize the cities in California with the top 10 counts of reported UFO sightings. The evidence is clear; Los Angeles was the city with the most reported UFO sightings in California.

# Using filter() function to create data table, top_cities, with the 10 top cities in
# California and their respective counts
top_cities <- filter(sumCity, count >= 77)
# Using ggplot2 package and geom_point() function to create a point diagram that identifies
# the city in California with the most reported sightings
ggplot(data = top_cities, aes(x=city, y = count)) +
  geom_point() +
  theme(axis.text.x = element_text(angle = -45, hjust = -0.002)) +
  ggtitle("City vs. Count")

# The city in California with the most reported UFO sightings is Los Angeles

Below is a bar chart displaying the times of the day, in Los Angeles, with the top 10 counts for reported UFO sightings.

Time of day in LA with most reported sightings: 21:00 or 9:00PM

# Using filter() function to create data table, top_time, with the 10 top times of the day
# in Los Angeles for reports of UFO sightings
top_time <- filter(sumTime, count >= 7)
# Using ggplot2 package and geom_point() function to create a point diagram that identifies
# the time of day in Los Angeles, CA with the most reported sightings
ggplot(data = top_time, aes(x=Time, y = count)) +
  geom_point() +
  ggtitle("Time vs. Count")

# The time of day in Los Angeles, CA when the most UFO sightings are reported is 21:00

Below is a bar chart displaying the months of the year, in Los Angeles, with the top 10 counts for reported UFO sightings.

Month of the year in LA with most reported sightings: June and July (tied)

# Using filter() function to create data table, top_month, with the 10 top months of
# the year for UFO sightings in LA
top_month <- filter(sumMonth, count >= 19)
# Using ggplot2 package and geom_point() function to create a point diagram that identifies
# the month of the year, in LA, with the most reported sightings
ggplot(data = top_month, aes(x=Month, y = count)) +
  geom_point() +
  ggtitle("Month vs. Count")

# The months of the year in Los Angeles, CA when the most UFO sightings are reported are
# June and July

Below is a bar chart displaying the days of the month, in Los Angeles, with the top 10 counts for reported UFO sightings.

Day of the month in LA with most reported sightings: 15th

# Using filter() function to create data table, top_day, with the 10 top days of the month,
# in LA, with the most reported sightings
top_day <- filter(sumDay, count >= 12)
# Using ggplot2 package and geom_point() function to create a point diagram that identifies
# the day of the month, in LA, with the most reported sightings
ggplot(data = top_day, aes(x=Day, y = count)) +
  geom_point() +
  ggtitle("Day vs. Count")

# The day of the month in LA when the most UFO sightings are reported is the 15th

Below is a bar chart displaying the shapes of UFOs, in Los Angeles, with the top 10 counts for reported UFO sightings.

Shape in LA with the most reported sightings: light, followed by circle and triangle

# Using filter() function to create data table, top_shape, with the 10 top shapes of UFOs
# reported to be seen in LA
top_shape <- filter(sumShape, count >= 16)
# Using ggplot2 package and geom_point() function to create a point diagram that identifies
# the most commonly reported shape of UFOs In LA sightings: light, circle and triangle
ggplot(data = top_shape, aes(x=shape, y = count)) +
  geom_point() +
  ggtitle("Shape vs. Count")

Summary:

Based on exploratory analysis and data visualization, this project has uncovered insights about reported UFO sightings throughout the world, in the US, in the US state of California, and in the city of Los Angeles, California. According to the data, the United States was the country with the most reported UFO sightings over the past century. The US state with the most reported sightings was California, and the city within CA was Los Angeles. In Los Angeles, CA, most UFO sightings occured between 9:00 and 10:00 PM, in June and July and on the 15th day of the month. Also, the most commonly reported shapes of UFOs sighted in LA were light, triangle, and circle.

Overall, based on this study’s exploratory analysis, if you want to see a UFO you should probably scan the sky in Los Angeles, CA, on the 15th day of June or July, between 21:00 and 22:00 in the evening. Also, keep your eyes peeled for UFOs that light the sky in the shape of a triangle or circle.

Finally, I hope this data set and proceeding analysis can be used for future predictive analysis that could predict where and when the next UFO sighting will most likely occur. Let the hunting begin :)