This dataset consists of reports of Unidentified Flying Object (UFO) sightings from across the world. It contains over 80,000 sightings from the last century, as reported to the National UFO Reporting Center (NUFORC). The dataset is freely available here, and the data can be downloaded upon registration.
Let us now load the libraries that are used in the course of the analysis.
if (!require(ggplot2)) install.packages("ggplot2");
if (!require(dplyr)) install.packages("dplyr");
# dates
if (!require(lubridate)) install.packages("lubridate");
# stats
if (!require(mclust)) install.packages("mclust");
# maps
if (!require(maps)) install.packages("maps");
if (!require(mapdata)) install.packages("mapdata");
library(ggplot2);
library(dplyr);
library(lubridate);
library(mclust);
library(maps);
library(mapdata);
The original dataset contains the following fields: datetime, city, state, country, shape, duration (seconds), duration (hours/min), comments, date posted, latitude and longitude.
After a preliminary analysis of the dataset, we observe that some records are incomplete, in the sense that some of their columns are missing. Also, some of the information in this dataset is not useful for our analysis. Hence we clean the data a bit so that it is easier to analyse. The following code shows how the data is cleaned.
dataset <- read.csv(file="../data/scrubbed.csv", header=T);
dataset_clear <- dataset %>%
dplyr::select(datetime,
date.posted,
city,
state,
country,
latitude,
longitude,
shape,
duration = duration..seconds.);
dataset_clear$datetime <- mdy_hm(dataset_clear$datetime);
dataset_clear$date.posted <- mdy(dataset_clear$date.posted);
dataset_clear$latitude <- as.numeric(as.character(dataset_clear$latitude));
dataset_clear$longitude <- as.numeric(as.character(dataset_clear$longitude));
dataset_clear$duration <- as.numeric(as.character(dataset_clear$duration));
dataset_clear$country <- as.factor(dataset_clear$country);
dataset_clear <- na.omit(dataset_clear);
dataset_usa <- filter(dataset_clear, country=="us" & !(state %in% c("ak", "hi", "pr")));
head(dataset);
## datetime city state country shape
## 1 10/10/1949 20:30 san marcos tx us cylinder
## 2 10/10/1949 21:00 lackland afb tx light
## 3 10/10/1955 17:00 chester (uk/england) gb circle
## 4 10/10/1956 21:00 edna tx us circle
## 5 10/10/1960 20:00 kaneohe hi us light
## 6 10/10/1961 19:00 bristol tn us sphere
## duration..seconds. duration..hours.min.
## 1 2700 45 minutes
## 2 7200 1-2 hrs
## 3 20 20 seconds
## 4 20 1/2 hour
## 5 900 15 minutes
## 6 300 5 minutes
## comments
## 1 This event took place in early fall around 1949-50. It occurred after a Boy Scout meeting in the Baptist Church. The Baptist Church sit
## 2 1949 Lackland AFB, TX. Lights racing across the sky & making 90 degree turns on a dime.
## 3 Green/Orange circular disc over Chester, England
## 4 My older brother and twin sister were leaving the only Edna theater at about 9 PM,...we had our bikes and I took a different route home
## 5 AS a Marine 1st Lt. flying an FJ4B fighter/attack aircraft on a solo night exercise, I was at 50,000' in a "clean" aircraft (no ordinan
## 6 My father is now 89 my brother 52 the girl with us now 51 myself 49 and the other fellow which worked with my father if he's still livi
## date.posted latitude longitude
## 1 4/27/2004 29.8830556 -97.941111
## 2 12/16/2005 29.38421 -98.581082
## 3 1/21/2008 53.2 -2.916667
## 4 1/17/2004 28.9783333 -96.645833
## 5 1/22/2004 21.4180556 -157.803611
## 6 4/27/2007 36.5950000 -82.188889
As the output of head(dataset) shows, the field “comments” is present in this dataset. A scientific text analysis would take over the entire study, which is why we preferred to remove it and focus on the other information. The duration (hours/min) field was also removed, as it is redundant with the duration in seconds. The cleaned dataset that we use for the analysis looks as follows:
head(dataset_clear)
## datetime date.posted city state country
## 1 1949-10-10 20:30:00 2004-04-27 san marcos tx us
## 2 1949-10-10 21:00:00 2005-12-16 lackland afb tx
## 3 1955-10-10 17:00:00 2008-01-21 chester (uk/england) gb
## 4 1956-10-10 21:00:00 2004-01-17 edna tx us
## 5 1960-10-10 20:00:00 2004-01-22 kaneohe hi us
## 6 1961-10-10 19:00:00 2007-04-27 bristol tn us
## latitude longitude shape duration
## 1 29.88306 -97.941111 cylinder 2700
## 2 29.38421 -98.581082 light 7200
## 3 53.20000 -2.916667 circle 20
## 4 28.97833 -96.645833 circle 20
## 5 21.41806 -157.803611 light 900
## 6 36.59500 -82.188889 sphere 300
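The cleaning step relies on as.numeric() turning unparseable strings into NA, which na.omit() then silently drops. A minimal sketch of how one might count what is lost, using made-up raw values (the malformed strings below are illustrative assumptions, not rows quoted from the dataset):

```r
# Made-up raw values for illustration; "33q.2" and "8`" stand in for typos
raw <- data.frame(
  latitude = c("29.8830556", "33q.2", "53.2"),
  duration = c("2700", "20", "8`"),
  stringsAsFactors = FALSE
)

parsed <- raw
# as.numeric() returns NA (with a warning) for strings that are not numbers
parsed$latitude <- suppressWarnings(as.numeric(raw$latitude))
parsed$duration <- suppressWarnings(as.numeric(raw$duration))

# Count the NAs introduced per column before dropping anything
na_per_column <- colSums(is.na(parsed))

# na.omit() removes every row that has at least one NA
cleaned <- na.omit(parsed)
rows_dropped <- nrow(parsed) - nrow(cleaned)
print(na_per_column)
print(rows_dropped)
```

Running the same two checks on the real data frame before and after na.omit() would show how many reports the cleaning discards.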
The cleaned dataset contains some interesting information. With this in mind, we would like to ask a set of questions that an effective analysis could answer, in order to extract interesting information and draw conclusions.
The very first analysis we do is to find out which country has had the highest number of sightings in the last century according to our dataset.
levels(dataset_clear$country) <- c("Rest of the world", "Australia", "Canada", "Germany", "Great Britain", "US");
ggplot(dataset_clear, aes(x=reorder(country, country, FUN=length), fill=country)) +
stat_count() +
theme_bw() + xlab("Country") + ylab("Number of appearances") + ggtitle("UFO sightings by country");
According to the bar chart above, it is pretty clear that the United States has the highest number of sightings recorded in the last century. We think that since the NUFORC is an organization from the United States, mostly American people were aware that they could report what they saw. This means that we can focus our analysis of this data on the USA.
We have already established that there are more sightings in the USA than in any other part of the world. Let us now plot a world map to see how dense the sightings are in each country, and whether they are concentrated in one part of a country or spread everywhere.
# Some points from "Rest of the world" show up in other countries
# Do we keep them? It seems rather biased
ggplot(dataset_clear, aes(x=longitude, y=latitude, colour=country)) +
borders("world", colour="black", fill="gray50") +
geom_point(shape=18) +
theme_bw() + xlab("Geospatial longitude") + ylab("Geospatial latitude") + ggtitle("Map of UFO sightings across the world");
The map also supports our assumption that sightings are denser in the USA than in any other part of the world.
Having established that the USA has the most sightings and the densest sightings, let us take a look at the number of sightings in each state.
ggplot(dataset_usa, aes(x=reorder(state, state, FUN=length), fill=state)) +
stat_count() +
theme_bw() + theme(axis.text.x = element_text(angle=50, size=8, hjust=1)) +
xlab("State") + ylab("Number of appearances") + ggtitle("UFO sightings by states in the USA");
ggplot(dataset_usa, aes(x=longitude, y=latitude, colour=state)) +
geom_point(size=1,stroke=0, shape=18) +
borders("state", colour="black", fill="gray50") +
geom_point(size=0.5,stroke=0, shape=19) +
#stat_count() +
theme_bw() + xlab("Geospatial longitude") + ylab("Geospatial latitude") + ggtitle("Map of UFO sightings across the USA");
The bar chart shows that the state “ca”, which is California, has had the highest number of UFO sightings in the last century.
To validate with numbers our argument that California has the densest as well as the most sightings in the USA, here are the top 10 states by sightings relative to their population (2015 population figures from Wikipedia).
dataset_state <- dataset_usa %>%
group_by(state) %>%
summarize(count=n()) %>%
arrange(state);
dataset_state$pop <- c(4858979, 2978204, 6828065, 39114818, 5456574, 3590886, 672228, 945934, 20271272, 10214860, 3123899, 1654930, 12859995, 6619680, 2911641, 4425092, 4670724, 6794422, 6006401, 1329328, 9922576, 5489594, 6083672, 2992333, 1032949, 10042802, 756927, 1896190, 1330608, 8958013, 2085109, 2890845, 19795791, 11614373, 3911338, 4028977, 12802503, 1056298, 4896146, 858469, 6600299, 27469114, 2995919, 8382993, 626042, 7170351, 5771337, 1844128, 586107);
dataset_state %>% mutate(density=count / pop) %>% arrange(-density) %>% head(10);
## # A tibble: 10 x 4
## state count pop density
## <fctr> <int> <dbl> <dbl>
## 1 wa 3966 7170351 0.000553
## 2 mt 478 1032949 0.000463
## 3 or 1747 4028977 0.000434
## 4 me 558 1329328 0.000420
## 5 vt 260 626042 0.000415
## 6 nh 486 1330608 0.000365
## 7 az 2413 6828065 0.000353
## 8 nm 720 2085109 0.000345
## 9 id 521 1654930 0.000315
## 10 wy 175 586107 0.000299
From the numbers above, we can see that we were wrong: California has the most sightings, but relative to its population the densest state is Washington. We can observe a pattern: the first 6 states are all located in the north of the country, with three of them (Washington, Montana and Oregon) clustered in the Northwest and the other three (Maine, Vermont and New Hampshire) in New England. One speculative explanation is that Washington was, together with Colorado, one of the first states in the USA to legalize recreational use of marijuana.
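The population vector in the code above is matched to states purely by position, which breaks silently if a state is missing or the sort order changes. As a hedged alternative (not the method used in this analysis), populations can be keyed by state code. The sketch below reuses the counts and populations of the first three rows of the table; the per_100k column name is our own choice:

```r
# Counts and populations copied from the first three rows of the table above
counts <- data.frame(
  state = c("wa", "mt", "or"),
  count = c(3966, 478, 1747),
  stringsAsFactors = FALSE
)

# Named vector: each population is tied to its state code, not to a position
pop <- c(or = 4028977, wa = 7170351, mt = 1032949)

# Look up the population by name, so the order of `pop` does not matter
counts$pop <- unname(pop[counts$state])
stopifnot(!anyNA(counts$pop))  # fails loudly if a state has no population entry

# Sightings per 100,000 residents are easier to read than raw ratios
counts$per_100k <- counts$count / counts$pop * 1e5
counts <- counts[order(-counts$per_100k), ]
print(counts)
```

Even though the populations are deliberately listed in a different order than the counts, the ranking comes out as Washington, Montana, Oregon, as in the table.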
Now that we have analysed by country and state, let us take a look at the most common shapes that appear across the world.
ggplot(dataset_clear, aes(x=reorder(shape, shape, FUN=length), fill=shape)) +
geom_bar(show.legend=FALSE) +
coord_flip() +
# coord_flip() swaps the drawn axes, but xlab() still labels the x aesthetic (shape)
theme_bw() + xlab("Shape") + ylab("Number of appearances") + ggtitle("UFO shapes seen worldwide");
From the bar chart, it is clear that the most commonly reported shape is a generic one called “light”, which witnesses often describe as a bright halo.
Now that we have an insight into the shapes that appear the most, and since the USA has reported far more UFO sightings than the other countries, we want to look at what happens country by country.
dataset_clear %>%
group_by(country, shape) %>%
summarize(count=n()) %>%
filter(count == max(count)) %>%
arrange(-count);
## # A tibble: 6 x 3
## # Groups: country [6]
## country shape count
## <fctr> <fctr> <int>
## 1 US light 13473
## 2 Rest of the world light 1937
## 3 Canada light 655
## 4 Great Britain light 361
## 5 Australia light 119
## 6 Germany light 20
We can clearly see that the light shape is the most frequent overall, and also for each country individually.
Next, we want to figure out if there is any existing link between the time of appearance and the shape reported.
First, let us analyse whether there is any correlation between the number of appearances and the hour of the day. We first extract the hour of the day, then we relate by month and finally by year.
ggplot(dataset_clear, aes(x=hour(datetime))) +
geom_histogram(bins=24) +
theme_bw() + xlab("Hour of the day") + ylab("Number of appearances") + ggtitle("UFO sightings during the day");
From the histogram above, it is quite evident that most of the appearances happen when there is little or no light (evening and night). Interestingly, there are also reports of UFOs during the daytime.
Next, let us take a look at the relationship between the shapes and the time of day. This could tell us why light is the most often reported shape.
shapes_daytime <-
dataset_clear %>%
group_by(hour=hour(datetime), shape, duration) %>%
summarize(count=n());
ggplot(shapes_daytime, aes(x=hour, y=shape, size=count)) +
geom_point() +
theme_bw() + xlab("Hour of the day") + ylab("Shape") + ggtitle("UFO sightings durations per shape during the day");
With the graph above, we can see that most shapes are reported more often during the night. The light shape appears a lot during the night and evening, but during the day it is reported less often than other standard shapes (circle, sphere). Now, we run a chi-square test to check whether there is any association between the time of day and the reported shape.
We run the chi-square test of independence, assuming that:
+ Each sample observation is independent
+ Each case contributes to exactly one entry of the contingency table
chisq.test(dataset_clear$shape, hour(dataset_clear$datetime), simulate.p.value=T);
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: dataset_clear$shape and hour(dataset_clear$datetime)
## X-squared = 6127.1, df = NA, p-value = 0.0004998
The p-value is below 0.05 (p ≈ 0.0005, which is 1/2001, the smallest value that 2,000 simulated replicates can produce), so we reject the hypothesis of independence: the reported shape of the UFO does depend on the time of day.
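With tens of thousands of reports, even a weak association yields a tiny p-value, so it can help to look at an effect size as well. The sketch below uses a made-up 2x2 table (the counts are illustrative assumptions, not values from the UFO data) to show how the chi-square statistic is built from expected counts and how Cramér's V rescales it to the range 0 to 1:

```r
# Toy contingency table (made-up counts, not taken from the UFO data)
counts <- matrix(c(30, 10,
                   20, 40),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(shape = c("light", "circle"),
                                 period = c("night", "day")))

# Pearson chi-square test without continuity correction
test <- chisq.test(counts, correct = FALSE)

# The same statistic by hand: expected counts under independence
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)
chi2_manual <- sum((counts - expected)^2 / expected)

# Cramér's V: sqrt(chi2 / (n * (min(rows, cols) - 1)))
n <- sum(counts)
cramers_v <- sqrt(chi2_manual / (n * (min(dim(counts)) - 1)))

print(unname(test$statistic))
print(cramers_v)
```

Applied to the shape-by-hour table, the same formula would indicate whether the dependence found above is strong or merely detectable.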
After studying when sightings occur within a day, we now analyse them over the years. Here, we compare how many UFOs were sighted and how many were reported each year.
# Appearances per year
appearances_year <-
dataset_clear %>% group_by(year=year(datetime)) %>%
summarize(count=n());
# Reports per year
reports_year <-
dataset_clear %>% group_by(year=year(date.posted)) %>%
summarize(count=n());
ggplot(appearances_year, aes(x=year, y=count)) +
geom_line(size=1, colour="red") +
geom_line(data=reports_year, aes(y=count), size=1, colour="green") +
geom_smooth(method="lm") +
theme_bw() + xlab("Year") + ylab("Count (red=appearances, green=reports)") + ggtitle("Comparison of UFO appearances and UFO reports each year");
It can be clearly seen that both the number of sightings and the number of reports have increased over time in the worldwide data. The linear regression line fits poorly, which shows that the growth in UFO sightings is far from linear.
Seeing how both sightings and reports have increased, we would like to analyse the delay between a sighting and its report. Because of the time differences across countries, and because reports were made to the NUFORC, which is located on the US west coast, some reports appear to predate the sightings themselves. We treat these as outliers and remove them.
# Remove outliers (entries that have been reported before being seen)
report_time <-
dataset_clear %>%
filter(date.posted > datetime);
report_time$duration <- as.numeric(difftime(report_time$date.posted, report_time$datetime, units = c("days")), units="days");
ggplot(report_time, aes(x=country, y=duration)) +
geom_boxplot() +
theme_bw() + xlab("Country") + ylab("Reporting delay (days)") + ggtitle("Delay between UFO sighting and report per country (showing outliers)");
As we can see from the boxplot, there are so many outliers that the summary statistics are not meaningful. We therefore remove outliers based on the reporting delay (stored in the duration column of report_time). We assume that UFO sightings reported more than 1 month after being seen are mostly errors or second-hand reports (from family, etc.) and are not important for our study.
# Remove outliers (reporting delay of 30 days or more)
report_time_filtered <-
report_time %>%
filter(duration < 30);
ggplot(report_time_filtered, aes(x=country, y=duration)) +
geom_boxplot() +
theme_bw() + xlab("Country") + ylab("Reporting delay (days)") + ggtitle("Delay between UFO sighting and report per country");
summary(report_time_filtered$duration);
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000694 4.208333 9.256944 11.168260 17.062500 29.999306
sd(report_time_filtered$duration);
## [1] 8.07716
Now the boxplot shows coherent data, and we are able to compute some summary statistics. Among reports filed within 30 days, the mean delay between the time a UFO was sighted and the time it was reported is around 11.2 days (median about 9.3 days, standard deviation about 8.1 days).
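The delay computation and the two outlier filters can be illustrated in base R. This is a minimal sketch with three invented reports (the dates are assumptions chosen to hit each case), in which both columns are converted to date-times in the same time zone before subtracting:

```r
# Three invented reports: a normal one, one posted "before" the sighting,
# and one posted years later
sighted <- as.POSIXct(c("2004-04-01 21:00", "2004-04-20 22:30", "1999-07-04 23:00"),
                      tz = "UTC")
posted <- as.POSIXct(c("2004-04-27", "2004-04-19", "2004-01-17"), tz = "UTC")

# Delay in days; negative values mean the report predates the sighting
delay_days <- as.numeric(difftime(posted, sighted, units = "days"))

# Keep only plausible delays: strictly positive and shorter than 30 days
keep <- delay_days > 0 & delay_days < 30
kept_delays <- delay_days[keep]
print(delay_days)
print(kept_delays)
```

Only the first report survives both filters: it was posted 25 days and 3 hours after the sighting, i.e. 25.125 days.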
Concentrating on the USA, we would like to understand how long sightings last on average in each state. Here as well, some outliers were removed: some sightings were reported as lasting a day or more, probably due to errors or to several distinct sightings being merged, so every such entry has been removed.
# Remove other outliers (entries for which the appearance lasts 1 day or more)
# day_to_seconds = 24 * 60 * 60 = 86400
durations_state <-
dataset_usa %>%
filter(duration < 86400) %>%
group_by(state) %>%
summarize(mean=mean(duration));
ggplot(durations_state, aes(x=state, y=mean)) +
geom_point() +
theme_bw() + theme(axis.text.x = element_text(angle=50, size=8, hjust=1)) +
xlab("State") + ylab("Mean appearance duration") + ggtitle("Mean UFO sighting duration for each state in the USA");
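The state with the longest mean sighting duration is New Mexico. It is tempting to link this to secret military test sites, but note that Area 51 itself is in Nevada, not New Mexico; New Mexico is home to Roswell and the White Sands test range, so this explanation remains speculative.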
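Sighting durations are heavily right-skewed, so a per-state mean can be dominated by a handful of long reports. As a rough illustration (not part of the original analysis), here are the mean and the median of the six durations shown in the head(dataset_clear) output:

```r
# Durations in seconds, copied from the first six rows of head(dataset_clear)
durations <- c(2700, 7200, 20, 20, 900, 300)

mean_duration <- mean(durations)      # pulled upwards by the 7200 s report
median_duration <- median(durations)  # robust to a few very long reports

print(mean_duration)
print(median_duration)
```

The mean is roughly 1857 seconds while the median is 600 seconds, so ranking states by median duration might give a different picture than the plot of means.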
#EM (expectation maximization clustering)
spatial_clusters <- Mclust(subset(dataset_usa, select=c("longitude","latitude")), 5);
summary(spatial_clusters);
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VVV (ellipsoidal, varying volume, shape, and orientation) model with 5 components:
##
## log.likelihood n df BIC ICL
## -417875.8 64506 29 -836072.8 -843411.9
##
## Clustering table:
## 1 2 3 4 5
## 22821 9597 12004 15898 4186
usa_clustered <-
dataset_usa %>%
mutate(uncertainty = spatial_clusters$uncertainty,
classification = factor(spatial_clusters$classification));
ggplot(usa_clustered, aes(x=longitude, y=latitude, size=uncertainty, colour=classification)) +
geom_point(size=0.5, stroke=0) +
guides(size="none", colour="none") +
stat_ellipse(level=0.5, type="t") +
theme_bw() + xlab("Geospatial longitude") + ylab("Geospatial latitude") + ggtitle("Clustering of UFO sightings in the USA");
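The map clusters should be read with caution. The Bayesian Information Criterion (BIC) reported above is only meaningful when compared with the BIC of other candidate models (for example, other numbers of components); a single value does not tell us how well the Gaussian mixture model fits. Since we fitted only one model, with 5 components chosen by hand, we cannot draw conclusions about the clusters from the previous plot.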
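To see where the reported BIC comes from, it can be recomputed from the log-likelihood, the sample size and the number of free parameters printed in the summary. The sketch below assumes the mclust sign convention, BIC = 2 * logLik - df * log(n), under which larger values are better; the parameter count is for an unconstrained (VVV) mixture in 2 dimensions:

```r
# Values copied from the summary(spatial_clusters) output above
log_lik <- -417875.8
n <- 64506

# Free parameters of a VVV mixture with G components in d dimensions:
# (G - 1) mixing proportions + G * d means + G * d * (d + 1) / 2 covariance terms
G <- 5
d <- 2
n_params <- (G - 1) + G * d + G * d * (d + 1) / 2

# mclust convention: larger BIC is better
bic <- 2 * log_lik - n_params * log(n)
print(n_params)
print(bic)
```

The result agrees with the df of 29 and the BIC of about -836072.8 in the summary. To judge the fit, one would repeat the fit for several numbers of components (Mclust() accepts a range such as G = 1:9) and compare the resulting BIC values.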
To conclude, our findings show that the country with the most sightings is the USA, and within it the state of California. An interesting point, however, is that relative to population, sightings are denser in the state of Washington than in California. Both California and Washington have legalised the recreational use of marijuana, which could, speculatively, explain why there are so many sightings on the west coast. The shape called “light” was the most reported one in every country in the dataset, and sightings occur mostly at night or when there is very little sunlight. Sightings and reports have increased over time, although they have been dropping in recent years.
Interestingly, New Mexico has the longest mean sighting duration. It is home to Roswell and to military test ranges, although Area 51 itself is in Nevada. Could this have an impact on the question of the existence of aliens? Maybe, but the sightings could also have been test flights carried out by the USA at very odd hours and at high altitudes. In any case, we cannot conclude from this analysis whether aliens exist or not, as we only have information about sightings and not about any interaction with these UFOs.
This is easily one of the most interesting data analyses we have done, since it shows that Western, mostly developed nations are the ones that have predominantly reported these sightings (although this partly reflects the fact that the NUFORC is a US organization). Could these UFOs be technological tests by these developed countries? Could they be the vehicles of aliens sent to gauge the strength of the developed nations, or even to conquer them? Only time and more data will give us the answer to these questions, and that is for another data analysis experiment.