The Data

This dataset consists of reports of Unidentified Flying Object (UFO) sightings across the world. It contains over 80,000 sightings from the last century that were reported to the National UFO Reporting Center (NUFORC). The dataset is freely available and can be downloaded upon registration.

Let us now load the libraries that are used in the course of the analysis.

if (!require(ggplot2))  install.packages("ggplot2");
if (!require(dplyr))    install.packages("dplyr");
# dates
if (!require(lubridate)) install.packages("lubridate");
# stats
if (!require(mclust)) install.packages("mclust");
# maps
if (!require(maps))    install.packages("maps");
if (!require(mapdata)) install.packages("mapdata");

library(ggplot2);
library(dplyr);
library(lubridate);
library(mclust);
library(maps);
library(mapdata);

Data Description

The original dataset contains the following fields:

  • datetime - Date and time of the UFO sighting.
  • city - City where the UFO was sighted.
  • state - The State based on the city where it was sighted.
  • country - The country where it was sighted, based on the state.
  • shape - Shape of the UFO sighted.
  • duration(seconds) - Length of the sighting in seconds.
  • duration(hours/min) - Length of the sighting as free text (e.g. “45 minutes”).
  • comments - Comments by the person who sighted the UFO.
  • date posted - Date when the sighting was reported to the NUFORC.
  • latitude - Latitude of the sighting.
  • longitude - Longitude of the sighting.

After a preliminary analysis of the dataset, we observe that some records are incomplete: they have missing values in some columns. Also, some of the fields in this dataset are not useful for our analysis. Hence we clean the data a bit to make the analysis easier. The following code shows how the data is cleaned.

# Load the raw CSV; the column names are sanitised by read.csv (e.g. "duration..seconds.")
dataset <- read.csv(file="../data/scrubbed.csv", header=TRUE);

dataset_clear <- dataset %>%
  dplyr::select(datetime,
                date.posted,
                city, 
                state, 
                country, 
                latitude, 
                longitude, 
                shape, 
                duration = duration..seconds.);

dataset_clear$datetime    <- mdy_hm(dataset_clear$datetime);
dataset_clear$date.posted <- mdy(dataset_clear$date.posted);
dataset_clear$latitude    <- as.numeric(as.character(dataset_clear$latitude));
dataset_clear$longitude   <- as.numeric(as.character(dataset_clear$longitude));
dataset_clear$duration    <- as.numeric(as.character(dataset_clear$duration));
dataset_clear$country     <- as.factor(dataset_clear$country);

dataset_clear <- na.omit(dataset_clear);
dataset_usa <- filter(dataset_clear, country=="us" & !(state %in% c("ak", "hi", "pr")));
head(dataset);
##           datetime                 city state country    shape
## 1 10/10/1949 20:30           san marcos    tx      us cylinder
## 2 10/10/1949 21:00         lackland afb    tx            light
## 3 10/10/1955 17:00 chester (uk/england)            gb   circle
## 4 10/10/1956 21:00                 edna    tx      us   circle
## 5 10/10/1960 20:00              kaneohe    hi      us    light
## 6 10/10/1961 19:00              bristol    tn      us   sphere
##   duration..seconds. duration..hours.min.
## 1               2700           45 minutes
## 2               7200              1-2 hrs
## 3                 20           20 seconds
## 4                 20             1/2 hour
## 5                900           15 minutes
## 6                300            5 minutes
##                                                                                                                                                     comments
## 1                    This event took place in early fall around 1949-50. It occurred after a Boy Scout meeting in the Baptist Church. The Baptist Church sit
## 2                                                            1949 Lackland AFB&#44 TX.  Lights racing across the sky &amp; making 90 degree turns on a dime.
## 3                                                                                                        Green/Orange circular disc over Chester&#44 England
## 4                 My older brother and twin sister were leaving the only Edna theater at about 9 PM&#44...we had our bikes and I took a different route home
## 5 AS a Marine 1st Lt. flying an FJ4B fighter/attack aircraft on a solo night exercise&#44 I was at 50&#44000&#39 in a &quot;clean&quot; aircraft (no ordinan
## 6                 My father is now 89 my brother 52 the girl with us now 51 myself 49 and the other fellow which worked with my father if he&#39s still livi
##   date.posted   latitude   longitude
## 1   4/27/2004 29.8830556  -97.941111
## 2  12/16/2005   29.38421  -98.581082
## 3   1/21/2008       53.2   -2.916667
## 4   1/17/2004 28.9783333  -96.645833
## 5   1/22/2004 21.4180556 -157.803611
## 6   4/27/2007 36.5950000  -82.188889

As you can see above, the raw dataset contains a “comments” field. A proper text analysis would take over the entire study, which is why we preferred to remove it and focus on the other information. The free-text duration field (duration..hours.min. in the output above) was also removed, as it is redundant with the duration in seconds. The following is a more accurate description of the dataset that we use for the analysis:

  • datetime - Date and time of the UFO sighting.
  • city - City where the UFO was sighted.
  • state - The State based on the city where it was sighted.
  • country - The country where it was sighted, based on the state.
  • shape - Shape of the UFO sighted.
  • duration - Length of the sighting in seconds.
  • date posted - Date when the sighting was reported to the NUFORC.
  • latitude - Latitude of the sighting.
  • longitude - Longitude of the sighting.

head(dataset_clear)
##              datetime date.posted                 city state country
## 1 1949-10-10 20:30:00  2004-04-27           san marcos    tx      us
## 2 1949-10-10 21:00:00  2005-12-16         lackland afb    tx        
## 3 1955-10-10 17:00:00  2008-01-21 chester (uk/england)            gb
## 4 1956-10-10 21:00:00  2004-01-17                 edna    tx      us
## 5 1960-10-10 20:00:00  2004-01-22              kaneohe    hi      us
## 6 1961-10-10 19:00:00  2007-04-27              bristol    tn      us
##   latitude   longitude    shape duration
## 1 29.88306  -97.941111 cylinder     2700
## 2 29.38421  -98.581082    light     7200
## 3 53.20000   -2.916667   circle       20
## 4 28.97833  -96.645833   circle       20
## 5 21.41806 -157.803611    light      900
## 6 36.59500  -82.188889   sphere      300
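The effect of the cleaning step can be illustrated on a tiny example. This is only a sketch: the data frame `raw` and its values are made up for illustration, and base R's `as.POSIXct` is used in place of `mdy_hm` so that the snippet runs without any package. It shows that values which cannot be parsed become `NA`, and that `na.omit` then drops the whole row.

```r
# Toy data frame standing in for the raw CSV (values are made up for illustration).
raw <- data.frame(
  datetime = c("10/10/1949 20:30", "10/10/1955 17:00", "not a date"),
  latitude = c("29.8830556", "53.2", "33q.2"),
  stringsAsFactors = FALSE
)

clean <- raw
# Parse the timestamps; strings that do not match the format become NA.
clean$datetime <- as.POSIXct(raw$datetime, format = "%m/%d/%Y %H:%M", tz = "UTC")
# Malformed numbers become NA (the coercion warning is silenced here).
clean$latitude <- suppressWarnings(as.numeric(raw$latitude))

# Drop every row that contains at least one NA.
clean <- na.omit(clean)
print(nrow(clean))
```

Only the two well-formed rows survive, which is also why `dataset_clear` ends up with fewer rows than the raw file.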

Hypothesis

The dataset has some interesting information, as seen above. With this in mind, we would like to ask a set of questions that an effective analysis could answer, in order to extract interesting information and draw conclusions.

  • Considering the dataset globally, how are the sightings spread across the world, and which country has the most sightings?
  • What kinds of shapes are reported, and how are they related to each country?
  • Since we have both the date of the sighting and the date of the report, is there a link between them? How about the location and the time at which sightings occur?
  • Do aliens exist?

Which country has the most UFO Sightings?

The very first analysis is to find out which country has had the most sightings in the last century according to our dataset.

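# Rename the factor levels; this relies on their alphabetical order: "", "au", "ca", "de", "gb", "us"
levels(dataset_clear$country) <- c("Rest of the world", "Australia", "Canada", "Germany", "Great Britain", "US");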
ggplot(dataset_clear, aes(x=reorder(country, country, FUN=length), fill=country)) +
  stat_count() + 
  theme_bw() + xlab("Country") + ylab("Number of appearances") + ggtitle("UFO sightings by country");

Conclusion

According to the bar chart above, it is pretty clear that the United States has the most sightings recorded in the last century. We think that since the NUFORC is an organization from the United States, it was mostly American people who were aware that they could report what they saw. This means that we can focus our analysis of this data more on the USA.

How dense are the sightings across the world?

We have already established that there are more sightings in the USA than in any other part of the world. Let us now plot a world map to see how dense the sightings are in each country, and whether they are concentrated in one part of a country or spread everywhere.

# Some points labelled "Rest of the world" fall inside other countries.
# Should we keep them? They look rather biased.
ggplot(dataset_clear, aes(x=longitude, y=latitude, colour=country)) + 
  borders("world", colour="black", fill="gray50") +
  geom_point(shape=18) +
  theme_bw() + xlab("Geospatial longitude") + ylab("Geospatial latitude") + ggtitle("Map of UFO sightings across the world");

Conclusion

The map also supports our assumption that sightings are denser in the USA than in any other part of the world. Next, let's look at how they are split across the states.

How are the sightings split across the states of the USA?

Having established that the USA has the most sightings and the densest sightings, let us take a look at the number of sightings in each state.

ggplot(dataset_usa, aes(x=reorder(state, state, FUN=length), fill=state)) + 
  stat_count() +
  theme_bw() + theme(axis.text.x = element_text(angle=50, size=8, hjust=1)) + 
  xlab("State") + ylab("Number of appearances") + ggtitle("UFO sightings by states in the USA");

ggplot(dataset_usa, aes(x=longitude, y=latitude, colour=state)) + 
  geom_point(size=1,stroke=0, shape=18) +
  borders("state", colour="black", fill="gray50") +
  geom_point(size=0.5,stroke=0, shape=19) +
  #stat_count() +
  theme_bw() + xlab("Geospatial longitude") + ylab("Geospatial latitude") + ggtitle("Map of UFO sightings across the USA");

Conclusion

It can be observed in the bar chart that the state named “ca”, which is California, has the highest number of UFO sightings in the last century.

How dense are the sightings in the USA relative to each state's population?

To check with numbers our claim that California has the densest and most sightings in the USA, here are the top 10 states ranked by sightings per inhabitant (population figures for 2015, from Wikipedia).

dataset_state <- dataset_usa %>%
  group_by(state) %>%
  summarize(count=n()) %>%
  arrange(state);
# 2015 populations, in the same alphabetical order as the 49 state codes (al, ar, az, ..., wy)
dataset_state$pop <- c(4858979, 2978204, 6828065, 39114818, 5456574, 3590886, 672228, 945934, 20271272, 10214860, 3123899, 1654930, 12859995, 6619680, 2911641, 4425092, 4670724, 6794422, 6006401, 1329328, 9922576, 5489594, 6083672, 2992333, 1032949, 10042802, 756927, 1896190, 1330608, 8958013, 2085109, 2890845, 19795791, 11614373, 3911338, 4028977, 12802503, 1056298, 4896146, 858469, 6600299, 27469114, 2995919, 8382993, 626042, 7170351, 5771337, 1844128, 586107);

dataset_state %>% mutate(density=count / pop) %>% arrange(-density) %>% head(10);
## # A tibble: 10 x 4
##    state  count     pop  density
##    <fctr> <int>   <dbl>    <dbl>
##  1 wa      3966 7170351 0.000553
##  2 mt       478 1032949 0.000463
##  3 or      1747 4028977 0.000434
##  4 me       558 1329328 0.000420
##  5 vt       260  626042 0.000415
##  6 nh       486 1330608 0.000365
##  7 az      2413 6828065 0.000353
##  8 nm       720 2085109 0.000345
##  9 id       521 1654930 0.000315
## 10 wy       175  586107 0.000299

Conclusion

From the numbers above, we can see that we were wrong: relative to its population, the state with the most sightings is Washington. We can observe a pattern: the first 6 states are all located in the north of the country, and three of them (Washington, Montana and Oregon) are clustered in the Pacific Northwest. One speculative explanation would be that Washington was among the first states in the USA to legalize the recreational use of marijuana.
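The population vector in the code above only works because its values are listed in exactly the same order as the sorted state codes. A more defensive way to attach populations, sketched here in base R, is to look them up by name. The counts and populations below are the ones shown in the table for three states; the object names `counts` and `pop` are our own and are not part of the original analysis.

```r
# Sighting counts for three states, taken from the table above (deliberately unsorted).
counts <- data.frame(state = c("or", "wa", "mt"),
                     count = c(1747, 3966, 478),
                     stringsAsFactors = FALSE)

# Named vector: each population is tied to its state code, so order no longer matters.
pop <- c(wa = 7170351, mt = 1032949, or = 4028977)

# Look up the population by state code, then compute sightings per inhabitant.
counts$pop     <- unname(pop[counts$state])
counts$density <- counts$count / counts$pop

# Sort from the highest to the lowest density.
counts <- counts[order(-counts$density), ]
print(counts)
```

With a named lookup, a missing or misspelled state code yields `NA` instead of silently shifting every population by one position.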

What is the most commonly reported shape?

Now that we have analysed by country and state, let us take a look at the most common shapes that appear across the world.

ggplot(dataset_clear, aes(x=reorder(shape, shape, FUN=length), fill=shape)) + 
  geom_bar(show.legend=FALSE) +
  coord_flip() +
  # coord_flip() swaps the displayed axes, but the labels still follow the aesthetics
  theme_bw() + xlab("Shape") + ylab("Number of appearances") + ggtitle("UFO shapes seen worldwide");

Conclusion

From the bar chart, it is clear that the most commonly reported shape is a generic one called “light”, which witnesses often describe as a bright halo.

What are the most frequent shapes per country?

We now have an idea of which shapes appear the most overall. Since the USA has reported far more UFO sightings than the other countries, we also want to look at what happens country by country.

dataset_clear %>%
  group_by(country, shape) %>%
  summarize(count=n()) %>%
  filter(count == max(count)) %>%
  arrange(-count);
## # A tibble: 6 x 3
## # Groups: country [6]
##   country           shape  count
##   <fctr>            <fctr> <int>
## 1 US                light  13473
## 2 Rest of the world light   1937
## 3 Canada            light    655
## 4 Great Britain     light    361
## 5 Australia         light    119
## 6 Germany           light     20

We can clearly see that the light shape is the most frequent overall, and also for each country individually.

Is there a correlation between the time (of the day/season) and the shape?

Next, we want to figure out whether there is a link between the time of appearance and the shape reported.

First, let us check whether there is any correlation between the number of appearances and the hour of the day. We first extract the hour of the day, then we look at the month and finally at the year.

ggplot(dataset_clear, aes(x=hour(datetime))) + 
  geom_histogram(bins=24) +
  theme_bw() + xlab("Hour of the day") + ylab("Number of appearances") + ggtitle("UFO sightings during the day");

From the histogram above, it is quite evident that most of the appearances happen when there is little or no light (evening and night). Interestingly, there are also reports of UFOs during the daytime.

Next, let us take a look at the relationship between the shapes and the time of the day. This could tell us why the light shape is the most often reported one.

# Count the sightings for each (hour, shape) pair
shapes_daytime <- 
  dataset_clear %>% 
  group_by(hour=hour(datetime), shape) %>% 
  summarize(count=n());
ggplot(shapes_daytime, aes(x=hour, y=shape, size=count)) + 
  geom_point() + 
  theme_bw() + xlab("Hour of the day") + ylab("Shape") + ggtitle("UFO sightings per shape during the day");

With the graph above, we can see that all shapes are more predominant during the night. The light shape tends to appear a lot during the night and evening, but less than other standard shapes (circle, sphere) during the day. Now, we run a chi-square test to check whether the shape depends on the time of the day.

We run the chi-square independence test, assuming that:

  • Each sample observation is independent.
  • Each case contributes to one entry of the contingency table.

chisq.test(dataset_clear$shape, hour(dataset_clear$datetime), simulate.p.value=TRUE);
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  dataset_clear$shape and hour(dataset_clear$datetime)
## X-squared = 6127.1, df = NA, p-value = 0.0004998

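From these values, we can see that the p-value is less than 0.05 (p < 0.05), so we reject the hypothesis that shape and hour are independent: the reported shape of a UFO does depend on the time of the day. Note that 0.0004998 is 1/2001, the smallest p-value that 2000 simulated replicates can produce.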
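To make the test less of a black box, here is a minimal sketch of the chi-square statistic computed by hand on a made-up 2x2 table and compared with `chisq.test`. The table `observed` is invented for illustration and has nothing to do with the real counts; the last line reproduces the lower bound on the simulated p-value, assuming the default of 2000 replicates.

```r
# Made-up 2x2 contingency table: shape (rows) by time of day (columns).
observed <- matrix(c(30, 10, 20, 40), nrow = 2,
                   dimnames = list(shape = c("light", "circle"),
                                   time  = c("night", "day")))

# Expected counts under independence: row total * column total / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson's statistic: sum over cells of (observed - expected)^2 / expected.
statistic <- sum((observed - expected)^2 / expected)

# Same statistic from the built-in test (continuity correction disabled).
builtin <- chisq.test(observed, correct = FALSE)
print(c(by_hand = statistic, builtin = unname(builtin$statistic)))

# Smallest p-value reachable with B simulated replicates: 1 / (B + 1).
min_p <- 1 / (2000 + 1)
print(min_p)
```

The larger the gap between observed and expected counts, the larger the statistic, and the stronger the evidence against independence.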

Have the appearances increased over the years?

After studying the UFO sightings at fixed times of the day, we want to analyse how they evolve over the years. Here, we compare how many sightings occurred and how many were reported each year.

# Appearances per year
appearances_year <- 
  dataset_clear %>% group_by(year=year(datetime)) %>% 
  summarize(count=n());
# Reports per year
reports_year <- 
  dataset_clear %>% group_by(year=year(date.posted)) %>% 
  summarize(count=n());
ggplot(appearances_year, aes(x=year, y=count)) + 
  geom_line(size=1, colour="red") + 
  geom_line(data=reports_year, aes(y=count), size=1, colour="green") + 
  geom_smooth(method="lm") +
  theme_bw() + xlab("Year") + ylab("Count (red=appearances, green=reports)") + ggtitle("Comparison of UFO appearances and UFO reports each year");

The plot shows a clear increase in the number of sightings and reports over the years in this worldwide dataset. The fitted regression line also shows that the growth in UFO sightings is far from linear.

How big is the time difference between the sighting date and the report date?

Having seen the increase in the number of sightings and reports, we would like to analyse the delay between a sighting and its report. Because of the time difference across countries and the fact that the reports were made to the NUFORC, which is on the west coast of the US, some reports appear to predate the sightings themselves. We treat these as outliers and remove them.

# Remove outliers (entries that were reported before being seen)
report_time <- 
  dataset_clear %>% 
  filter(as_datetime(date.posted) > datetime);
# Note: this overwrites the sighting duration with the reporting delay, in days
report_time$duration <- as.numeric(difftime(report_time$date.posted, report_time$datetime, units="days"));

ggplot(report_time, aes(x=country, y=duration)) + 
  geom_boxplot() + 
  theme_bw() + xlab("Country") + ylab("Reporting delay (days)") + ggtitle("Delay between UFO sighting and report per country (showing outliers)");

As we can see from the boxplot, there are many outliers, so summary statistics would not be meaningful. We therefore remove outliers based on the delay. We think that UFO sightings reported more than 1 month after being seen are mostly errors or second-hand reports (family, etc.) and are not important for our study.

# Remove outliers (reporting delay of 30 days or more)
report_time_filtered <-
  report_time %>%
  filter(duration < 30);

ggplot(report_time_filtered, aes(x=country, y=duration)) + 
  geom_boxplot() + 
  theme_bw() + xlab("Country") + ylab("Reporting delay (days)") + ggtitle("Delay between UFO sighting and report per country");

summary(report_time_filtered$duration);
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##  0.000694  4.208333  9.256944 11.168260 17.062500 29.999306
sd(report_time_filtered$duration);
## [1] 8.07716

Now the boxplot is readable and we can compute summary statistics. The mean delay is around 11.2 days (median 9.3 days, standard deviation 8.1 days): this is the average time between the moment a UFO was sighted and the moment the sighting was reported.

Is there a state where UFOs appear longer than in others?

Concentrating on the USA, we would like to know how long sightings last on average in each state. Here as well, some outliers were removed. Some sightings were reported as lasting a day or more, due to errors or to several different sightings being merged, so every such entry has been removed.
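As a small worked example of how such a delay is obtained, the sketch below computes the delay for a single, made-up sighting in base R. The dates are invented for illustration; the report date carries no time of day, so it is treated as midnight UTC, which we assume matches how the parsed `date.posted` column behaves.

```r
# Made-up sighting time and report date (illustrative values only).
seen   <- as.POSIXct("2004-04-20 12:00", tz = "UTC")
posted <- as.Date("2004-04-27")

# A date has no time of day, so it is interpreted as midnight UTC.
posted_time <- as.POSIXct(format(posted), tz = "UTC")

# Delay between sighting and report, expressed in days.
delay_days <- as.numeric(difftime(posted_time, seen, units = "days"))
print(delay_days)
```

Because the report time is always midnight, a sighting reported on the same day it happened gets a negative delay, which is one more reason some reports seem to predate their sightings.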

# Remove other outliers (entries for which the appearances lasts more than 1 day)
# day_to_seconds = 24 * 60 * 60 = 86400
durations_state <- 
  dataset_usa %>% 
  filter(duration < 86400) %>% 
  group_by(state) %>% 
  summarize(mean=mean(duration));
ggplot(durations_state, aes(x=state, y=mean)) + 
  geom_point() + 
  theme_bw() + theme(axis.text.x = element_text(angle=50, size=8, hjust=1)) +
  xlab("State") + ylab("Mean appearance duration") + ggtitle("Mean UFO sighting duration for each state in the USA");

The state with the longest mean sighting duration is New Mexico. Area 51 itself is in Nevada, but New Mexico is home to Roswell, so one could jokingly say that it is normal for this state to have the longest durations.

Do sightings at the same time appear at the same place?

# EM (expectation-maximization) clustering with G = 5 mixture components
spatial_clusters <- Mclust(subset(dataset_usa, select=c("longitude","latitude")), G=5);
summary(spatial_clusters);
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm 
## ----------------------------------------------------
## 
## Mclust VVV (ellipsoidal, varying volume, shape, and orientation) model with 5 components:
## 
##  log.likelihood     n df       BIC       ICL
##       -417875.8 64506 29 -836072.8 -843411.9
## 
## Clustering table:
##     1     2     3     4     5 
## 22821  9597 12004 15898  4186
usa_clustered <- 
  dataset_usa %>%
  mutate(uncertainty    = spatial_clusters$uncertainty,
         classification = factor(spatial_clusters$classification));

ggplot(usa_clustered, aes(x=longitude, y=latitude, size=uncertainty, colour=classification)) +
  geom_point(size=0.5, stroke=0) +
  guides(size="none", colour="none") + 
  stat_ellipse(level=0.5, type="t") +
  theme_bw() + xlab("Geospatial longitude") + ylab("Geospatial latitude") + ggtitle("Clustering of UFO sightings in the USA");

The clusters on the map are hard to interpret, and the numbers do not help much: the Bayesian Information Criterion (BIC) is very large in magnitude, but a single BIC value says little on its own. It is meant for comparing candidate models (in mclust, a larger BIC is better), and we only fitted one Gaussian mixture model with 5 components. Thus, we cannot draw conclusions about the clusters from the previous plot.
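The BIC reported in the summary can be reproduced from the other values printed there, which helps to see what it measures. The sketch below uses mclust's convention, BIC = 2 * log-likelihood - df * log(n), and plugs in the rounded numbers from the output above; the variable names are our own.

```r
# Values copied from the Mclust summary above (log-likelihood is rounded to one decimal).
log_likelihood <- -417875.8
n_obs          <- 64506   # number of sightings in dataset_usa
n_params       <- 29      # free parameters of the 5-component VVV model

# mclust convention: larger BIC is better; the second term penalises model complexity.
bic <- 2 * log_likelihood - n_params * log(n_obs)
print(round(bic, 1))
```

The value is dominated by the log-likelihood, which grows in magnitude with the number of observations, so its size mostly reflects that there are 64,506 points. It becomes informative only when compared with the BIC of models that use a different number of components or a different covariance structure.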

Conclusion

To conclude, our findings show that the country with the most sightings is the USA, and within it the state of California. However, relative to population, sightings are more frequent in the state of Washington than in California. Both California and Washington have legalised the recreational use of marijuana, which could be one speculative explanation for the many sightings on the west coast. The shape called “light” was the most reported one in every country, and sightings occur mostly at night or when there is very little sunlight. Sightings and reports have increased over time, although they have been dropping in recent years.

Interestingly, New Mexico has the longest mean sighting duration. Area 51 itself is in Nevada, but New Mexico is home to Roswell. Could this have an impact on the question of the existence of aliens? Maybe, because the USA could have been testing aircraft at very odd hours and at high altitudes. Still, we cannot conclude from this analysis whether aliens exist or not, as we only have information about sightings and not about any interaction with these UFOs.

This could easily be one of the most interesting data analyses we have done, since it shows that the Western part of the world, and mostly developed nations, are the ones that have predominantly reported sightings of these UFOs. Could these UFOs be technological tests by these developed countries? Could they be the vehicles of aliens sent to gauge the strength of the developed nations, or even to conquer them? Only time and more data will give us the answer to these questions, and that is for another data analysis experiment.