UFO sightings have had the world in a chokehold since 1947, when the
first widely publicized UFO sighting occurred. A businessman named Kenneth
Arnold reported seeing a group of nine high-speed objects near Mount
Rainier in the state of Washington while flying his personal plane.
“Arnold estimated the speed of the crescent-shaped objects as several
thousand miles per hour and said they moved ‘like saucers skipping on
water.’ In the newspaper report that followed, it was mistakenly stated
that the objects were saucer-shaped, hence the term flying
saucer.”
Source: https://www.history.com/topics/paranormal/history-of-ufos
In this project, I’ll be working with a dataset of UFO
sightings dating back to 1949. My final visualization will include the date and
time of each sighting, the state and city where it took place,
the shape of the UFO, and the duration in seconds. This was
a very messy dataset, and cleaning it up took a lot of time. I made
sure to remove N/As, capitalize the values in the variables I’m using, filter for a
certain country and time range, recode abbreviations
to their full names, coerce variables to numeric, and mutate to
create new columns.
I decided to use this particular dataset because, ever since I was
little, I’ve found the concept of UFOs so interesting. There were
actually a couple of times when my sister and I noticed strange lights in
the sky in Olney, Maryland: three oval-shaped lights that would
constantly circle each other in the clouds, visible only at
night. I thought of that while looking at this dataset and
realized that it would be perfect to work on.
Source: Multiple Data Conversions from NUFORC Data by Sigmond Axel
First things first, let’s load the libraries I’ll be working with today. The tidyverse library is the standard meta-package that bundles many different packages. The leaflet library is what I’ll use to plot a map with points representing little UFOs.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(leaflet)
I’m going to start by setting my working directory and using the read_csv() function to bring in my dataset, which I named “ufo”. Then I use the head() function to get a general sense of what the dataset looks like.
setwd("/Users/aashkanavale/Desktop/Montgomery College/MC Spring '24/DATA110/PROJECTS/Project 2 - UFO Sightings/ufo sightings")
ufo <- read_csv("complete.csv")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 88875 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): datetime, city, state, country, shape, duration (hours/min), comme...
## dbl (1): duration (seconds)
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(ufo)
## # A tibble: 6 × 11
## datetime city state country shape `duration (seconds)` `duration (hours/min)`
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 10/10/1… san … tx us cyli… 2700 45 minutes
## 2 10/10/1… lack… tx <NA> light 7200 1-2 hrs
## 3 10/10/1… ches… <NA> gb circ… 20 20 seconds
## 4 10/10/1… edna tx us circ… 20 1/2 hour
## 5 10/10/1… kane… hi us light 900 15 minutes
## 6 10/10/1… bris… tn us sphe… 300 5 minutes
## # ℹ 4 more variables: comments <chr>, `date posted` <chr>, latitude <chr>,
## # longitude <chr>
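Since read_csv() warned about parsing issues, a quick way to see what tripped it up (a minimal sketch; the exact rows and columns flagged depend on the file) is to call problems() on the data frame, as the warning itself suggests:
# Inspect the values that failed to parse (sketch)
problems(ufo)
nrow(problems(ufo))   # how many values were affected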
Because I don’t want any N/As in my dataset, I’m going to first check for N/As and then remove them.
anyNA(ufo$datetime)
## [1] FALSE
anyNA(ufo$city)
## [1] TRUE
anyNA(ufo$state)
## [1] TRUE
anyNA(ufo$country)
## [1] TRUE
anyNA(ufo$shape)
## [1] TRUE
anyNA(ufo$`duration (seconds)`)
## [1] TRUE
anyNA(ufo$`duration (hours/min)`)
## [1] TRUE
anyNA(ufo$comments)
## [1] TRUE
anyNA(ufo$`date posted`)
## [1] FALSE
anyNA(ufo$latitude)
## [1] FALSE
anyNA(ufo$longitude)
## [1] FALSE
As you can see, 7 of the variables (city, state, country, shape, both duration columns, and comments) include N/As. To remove them, I’m going to use filter() together with !is.na() to drop every row that has an N/A in any of those 7 variables.
ufo2 <- ufo |>
filter(!is.na(city),
!is.na(state),
!is.na(country),
!is.na(shape),
!is.na(`duration (seconds)`),
!is.na(`duration (hours/min)`),
!is.na(comments))
head(ufo2)
## # A tibble: 6 × 11
## datetime city state country shape `duration (seconds)` `duration (hours/min)`
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 10/10/1… san … tx us cyli… 2700 45 minutes
## 2 10/10/1… edna tx us circ… 20 1/2 hour
## 3 10/10/1… kane… hi us light 900 15 minutes
## 4 10/10/1… bris… tn us sphe… 300 5 minutes
## 5 10/10/1… norw… ct us disk 1200 20 minutes
## 6 10/10/1… pell… al us disk 180 3 minutes
## # ℹ 4 more variables: comments <chr>, `date posted` <chr>, latitude <chr>,
## # longitude <chr>
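As a side note, the same N/A removal can be written more compactly with tidyr’s drop_na(), which is loaded with the tidyverse. This is just a sketch of an equivalent call, not part of my workflow above:
# Keep only rows with no N/A in the listed columns (sketch)
ufo2_alt <- ufo |>
  drop_na(city, state, country, shape,
          `duration (seconds)`, `duration (hours/min)`, comments)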
Now that the N/As are removed, let’s begin cleaning up. There is quite a lot to clean: the city, state, and country names are all in lowercase, the state and country names are abbreviated, and the duration (hours/min), latitude, and longitude variables are stored as characters rather than numbers, which will make them difficult to work with. Let’s clean all of this up, starting with the cities.
I’m going to use the str_to_title() function from the stringr package. This function converts to title case, where only the first letter of each word is capitalized.
ufo2$city <- str_to_title(ufo2$city)
head(ufo2)
## # A tibble: 6 × 11
## datetime city state country shape `duration (seconds)` `duration (hours/min)`
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 10/10/1… San … tx us cyli… 2700 45 minutes
## 2 10/10/1… Edna tx us circ… 20 1/2 hour
## 3 10/10/1… Kane… hi us light 900 15 minutes
## 4 10/10/1… Bris… tn us sphe… 300 5 minutes
## 5 10/10/1… Norw… ct us disk 1200 20 minutes
## 6 10/10/1… Pell… al us disk 180 3 minutes
## # ℹ 4 more variables: comments <chr>, `date posted` <chr>, latitude <chr>,
## # longitude <chr>
Much better! I’m going to repeat this process for state, country, and shape as well.
ufo2$state <- str_to_title(ufo2$state)
ufo2$country <- str_to_title(ufo2$country)
ufo2$shape <- str_to_title(ufo2$shape)
head(ufo2)
## # A tibble: 6 × 11
## datetime city state country shape `duration (seconds)` `duration (hours/min)`
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 10/10/1… San … Tx Us Cyli… 2700 45 minutes
## 2 10/10/1… Edna Tx Us Circ… 20 1/2 hour
## 3 10/10/1… Kane… Hi Us Light 900 15 minutes
## 4 10/10/1… Bris… Tn Us Sphe… 300 5 minutes
## 5 10/10/1… Norw… Ct Us Disk 1200 20 minutes
## 6 10/10/1… Pell… Al Us Disk 180 3 minutes
## # ℹ 4 more variables: comments <chr>, `date posted` <chr>, latitude <chr>,
## # longitude <chr>
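As a side note, the same title-casing can be done in one step with across(); this is a sketch of an equivalent call to the four separate assignments above:
# Title-case several columns at once (sketch)
ufo2_alt <- ufo2 |>
  mutate(across(c(city, state, country, shape), str_to_title))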
Great. Capitalization is fixed, but the states and countries are still abbreviated, which doesn’t make them easy to read. I’m using the unique() function to find out which distinct values appear in the country and state columns.
unique(ufo2$country)
## [1] "Us" "Ca" "Au" "Gb"
unique(ufo2$state)
## [1] "Tx" "Hi" "Tn" "Ct" "Al" "Fl" "Ca" "Nc" "Ny" "Ky" "Mi" "Ma" "Ks" "Sc" "Wa"
## [16] "Co" "Nh" "Wi" "Me" "Ga" "Pa" "Il" "Ar" "On" "Mo" "Oh" "In" "Az" "Mn" "Nv"
## [31] "Nf" "Ne" "Or" "Bc" "Ia" "Va" "Id" "Nm" "Nj" "Mb" "Wv" "Ok" "Ak" "Ri" "Nb"
## [46] "Vt" "La" "Nd" "Pr" "Ms" "Ut" "Md" "Ab" "Mt" "Sk" "Wy" "Sd" "De" "Pq" "Nt"
## [61] "Qc" "Sa" "Ns" "Pe" "Yk" "Yt" "Dc"
After some external research and a couple of educated guesses, I came to realize that “Us” is the United States, “Ca” is Canada, “Au” is Australia, and “Gb” is Great Britain. There was also “De” for Germany but I believe that it was removed because it had too many N/As.
Many of the “state” values are hard to interpret because some of these countries don’t have states, so an abbreviation like “Wa” can also stand for Western Australia. To keep things unambiguous, I’m going to stick with the country we all know: the United States. I’m going to filter for just the United States and then pull out all of the unique states for the US.
ufo2 <- ufo2 |>
filter(country == "Us")
unique(ufo2$state)
## [1] "Tx" "Hi" "Tn" "Ct" "Al" "Fl" "Ca" "Nc" "Ny" "Ky" "Mi" "Ma" "Ks" "Sc" "Wa"
## [16] "Co" "Nh" "Wi" "Me" "Ga" "Pa" "Il" "Ar" "Mo" "Oh" "In" "Az" "Mn" "Nv" "Ne"
## [31] "Or" "Ia" "Va" "Id" "Nm" "Nj" "Wv" "Ok" "Ak" "Ri" "Vt" "La" "Nd" "Pr" "Ms"
## [46] "Ut" "Md" "Mt" "Wy" "Sd" "De" "Dc"
These now look like familiar US state abbreviations. However, upon closer inspection, there are 2 extras beyond the 50 states: “Pr”, which stands for Puerto Rico, and “Dc”, which stands for Washington, DC.
Now I’m going to rename “Us” to “United States” and the state abbreviations to their full, spelled-out names. I’m doing so by using the mutate() function and, within it, the recode() function to rename the specific values.
ufo2 <- mutate(ufo2, country = recode(country,
'Us' = "United States"))
ufo2 <- mutate(ufo2, state = recode(state,
'Tx' = "Texas",
'Hi' = "Hawaii",
'Tn' = "Tennessee",
'Ct' = "Connecticut",
'Al' = "Alabama",
'Fl' = "Florida",
'Ca' = "California",
'Nc' = "North Carolina",
'Ny' = "New York",
'Ky' = "Kentucky",
'Mi' = "Michigan",
'Ma' = "Massachusetts",
'Ks' = "Kansas",
'Sc' = "South Carolina",
'Wa' = "Washington",
'Co' = "Colorado",
'Nh' = "New Hampshire",
'Wi' = "Wisconsin",
'Me' = "Maine",
'Ga' = "Georgia",
'Pa' = "Pennsylvania",
'Il' = "Illinois",
'Ar' = "Arkansas",
'Mo' = "Missouri",
'Oh' = "Ohio",
'In' = "Indiana",
'Az' = "Arizona",
'Mn' = "Minnesota",
'Nv' = "Nevada",
'Ne' = "Nebraska",
'Or' = "Oregon",
'Ia' = "Iowa",
'Va' = "Virginia",
'Id' = "Idaho",
'Nm' = "New Mexico",
'Nj' = "New Jersey",
'Wv' = "West Virginia",
'Ok' = "Oklahoma",
'Ak' = "Alaska",
'Ri' = "Rhode Island",
'Vt' = "Vermont",
'La' = "Louisiana",
'Nd' = "North Dakota",
'Pr' = "Puerto Rico",
'Ms' = "Mississippi",
'Ut' = "Utah",
'Md' = "Maryland",
'Mt' = "Montana",
'Wy' = "Wyoming",
'Sd' = "South Dakota",
'De' = "Delaware",
'Dc' = "Washington DC"))
head(ufo2)
## # A tibble: 6 × 11
## datetime city state country shape `duration (seconds)` `duration (hours/min)`
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 10/10/1… San … Texas United… Cyli… 2700 45 minutes
## 2 10/10/1… Edna Texas United… Circ… 20 1/2 hour
## 3 10/10/1… Kane… Hawa… United… Light 900 15 minutes
## 4 10/10/1… Bris… Tenn… United… Sphe… 300 5 minutes
## 5 10/10/1… Norw… Conn… United… Disk 1200 20 minutes
## 6 10/10/1… Pell… Alab… United… Disk 180 3 minutes
## # ℹ 4 more variables: comments <chr>, `date posted` <chr>, latitude <chr>,
## # longitude <chr>
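As a side note, most of this manual recoding could be shortened with a lookup based on base R’s built-in state.abb and state.name vectors. This is only a sketch: note that state.abb is uppercase (“TX”) while my values are title case (“Tx”), and Puerto Rico and Washington DC are not in the built-in vectors, so they would still need explicit handling:
# Build a named lookup from title-cased abbreviations to full state names (sketch)
state_lookup <- setNames(state.name, str_to_title(state.abb))
ufo2_alt <- ufo2 |>
  mutate(state = coalesce(unname(state_lookup[state]), state))   # keep "Pr", "Dc" as-is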
After a grueling 10 minutes, it looks much better. Now that that problem is solved, let’s work on the numeric columns. First, I’m going to check the classes of all of the variables that should be numeric.
class(ufo2$`duration (seconds)`)
## [1] "numeric"
class(ufo2$`duration (hours/min)`)
## [1] "character"
class(ufo2$latitude)
## [1] "character"
class(ufo2$longitude)
## [1] "character"
Looks like one of the duration columns and both coordinate columns are stored as characters. Let’s coerce the latitudes and longitudes first, since that’s the easier fix.
ufo2$latitude <- as.numeric(ufo2$latitude)
ufo2$longitude <- as.numeric(ufo2$longitude)
class(ufo2$latitude)
## [1] "numeric"
class(ufo2$longitude)
## [1] "numeric"
The duration in hours/minutes variable is extremely messy and difficult to work with. You can see for yourself in the printout below.
ufo2
## # A tibble: 66,411 × 11
## datetime city state country shape `duration (seconds)`
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 10/10/1949 20:30 San Marcos Texas United… Cyli… 2700
## 2 10/10/1956 21:00 Edna Texas United… Circ… 20
## 3 10/10/1960 20:00 Kaneohe Hawaii United… Light 900
## 4 10/10/1961 19:00 Bristol Tennessee United… Sphe… 300
## 5 10/10/1965 23:45 Norwalk Connecticut United… Disk 1200
## 6 10/10/1966 20:00 Pell City Alabama United… Disk 180
## 7 10/10/1966 21:00 Live Oak Florida United… Disk 120
## 8 10/10/1968 13:00 Hawthorne California United… Circ… 300
## 9 10/10/1968 19:00 Brevard North Carolina United… Fire… 180
## 10 10/10/1970 16:00 Bellmore New York United… Disk 1800
## # ℹ 66,401 more rows
## # ℹ 5 more variables: `duration (hours/min)` <chr>, comments <chr>,
## # `date posted` <chr>, latitude <dbl>, longitude <dbl>
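To get a taste of just how inconsistent those values are, a quick peek at a handful of the distinct entries works (a sketch):
# Sample of the free-text duration values (sketch); entries range from
# "45 minutes" to "1/2 hour" to "1-2 hrs" with no consistent format
head(unique(ufo2$`duration (hours/min)`), 20)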
So, I decided to use mutate() to create a new column, saved in ufo3, that converts the duration in seconds into minutes (rounded to two decimal places).
ufo3 <- ufo2 |>
mutate(`duration (minutes)` = round(`duration (seconds)` / 60, 2))
head(ufo2)
## # A tibble: 6 × 11
## datetime city state country shape `duration (seconds)` `duration (hours/min)`
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 10/10/1… San … Texas United… Cyli… 2700 45 minutes
## 2 10/10/1… Edna Texas United… Circ… 20 1/2 hour
## 3 10/10/1… Kane… Hawa… United… Light 900 15 minutes
## 4 10/10/1… Bris… Tenn… United… Sphe… 300 5 minutes
## 5 10/10/1… Norw… Conn… United… Disk 1200 20 minutes
## 6 10/10/1… Pell… Alab… United… Disk 180 3 minutes
## # ℹ 4 more variables: comments <chr>, `date posted` <chr>, latitude <dbl>,
## # longitude <dbl>
Perfect. One last problem, however: there are just too many observations. With 66,000+ observations the map would turn into a blob of dots and you wouldn’t be able to make anything out. So, I’m going to filter for sightings whose duration is above 120 seconds, or in other words, 2 minutes.
ufo4 <- ufo3 |>
filter(`duration (seconds)` > 120)
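A quick check of how many sightings survive this cut (a sketch):
nrow(ufo4)   # number of sightings lasting more than 120 seconds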
Now that our dataset is all cleaned up and ready to use, let’s analyze two of the variables with a linear regression.
A linear regression analysis is a method to model the relationship between two or more quantitative variables. My regression will show the relationship between the duration of a sighting and how common its shape is. However, this dataset doesn’t have another numeric variable besides the duration. So let’s create one!
Here, I grouped by shape and used n() inside summarize() to count how many times each shape occurred, keeping the other columns alongside the count. Then I printed the updated dataset, ufo4.
ufo4 <- ufo4 |>
group_by(shape) |>
summarize(count = n(), datetime, city, state, country, shape, `duration (seconds)`, `duration (hours/min)`, comments, `date posted`, latitude, longitude, `duration (minutes)`)
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
## always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `summarise()` has grouped output by 'shape'. You can override using the
## `.groups` argument.
ufo4
## # A tibble: 34,132 × 13
## # Groups: shape [26]
## shape count datetime city state country `duration (seconds)`
## <chr> <int> <chr> <chr> <chr> <chr> <dbl>
## 1 Changed 1 6/24/1996 00:30 Aurora Colo… United… 3600
## 2 Changing 1160 10/10/1998 02:30 Hollywood Cali… United… 300
## 3 Changing 1160 10/10/1999 00:01 Martinez Cali… United… 3600
## 4 Changing 1160 10/10/2001 21:30 Fresno Cali… United… 900
## 5 Changing 1160 10/10/2007 01:00 Stockbrid… Geor… United… 3600
## 6 Changing 1160 10/10/2007 04:00 Denver Colo… United… 2700
## 7 Changing 1160 10/10/2009 23:00 Michigan … Indi… United… 900
## 8 Changing 1160 10/10/2010 23:00 Miami Flor… United… 1200
## 9 Changing 1160 10/10/2012 20:00 Moundville Alab… United… 240
## 10 Changing 1160 10/10/2012 21:00 Austin Texas United… 1200
## # ℹ 34,122 more rows
## # ℹ 6 more variables: `duration (hours/min)` <chr>, comments <chr>,
## # `date posted` <chr>, latitude <dbl>, longitude <dbl>,
## # `duration (minutes)` <dbl>
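As a side note, the deprecation warning above can be avoided: dplyr’s add_count() attaches a per-shape count as a new column without collapsing or regrouping any rows, which is what I’m after here. A sketch of an equivalent call:
# Attach the number of sightings per shape without summarising away rows (sketch)
ufo4_alt <- ufo3 |>
  filter(`duration (seconds)` > 120) |>
  add_count(shape, name = "count")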
To actually plot the regression, I piped ufo4 into ggplot() and added the geom_point() function. I set the x-axis to the duration in minutes and the y-axis to the shape count, then set the x and y limits, each being the min and max of its variable. After that I labelled my graph and used theme_minimal() to change the basic R theme.
regression <- ufo4 |>
ggplot() +
geom_point(aes(x = `duration (minutes)`, y = count)) +
xlim(0, 1104600) +
ylim(1, 14014) +
labs(title = "UFO Sightings Duration vs. Shape Count",
x = "Duration (in minutes)",
y = "Shape Count",
caption = "Source:\nMultiple Data Conversions from NUFORC Data by Sigmond Axel") +
theme_minimal(base_size = 12)
Here I’m just adding on colors to the points and plotting the actual regression line.
regression <- regression +
geom_point(aes(x = `duration (minutes)`, y = count), col = "black") +
geom_smooth(aes(x = `duration (minutes)`, y = count),
method = 'lm', formula = y~x, color = "#8B8000", se = FALSE, linetype = "dotdash", size = 1) +
ggtitle("UFO Sightings Duration vs. Shape Count") +
theme(text = element_text(family = "serif"))
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
regression
At first glance the fitted line suggests a positive relationship between the duration and the shape count, meaning that longer sightings tend to involve shapes that are more common. Let’s check how strong that relationship actually is.
cor(ufo4$`duration (minutes)`, ufo4$count)
## [1] 0.01533615
fit1 <- lm(count ~ `duration (minutes)`, data = ufo4)
summary(fit1)
##
## Call:
## lm(formula = count ~ `duration (minutes)`, data = ufo4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3380 -1163 -978 57 4149
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.379e+03 1.282e+01 263.524 < 2e-16 ***
## `duration (minutes)` 3.929e-03 1.387e-03 2.834 0.00461 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2369 on 34130 degrees of freedom
## Multiple R-squared: 0.0002352, Adjusted R-squared: 0.0002059
## F-statistic: 8.029 on 1 and 34130 DF, p-value: 0.004606
“cor()” stands for “correlation”. This value is always between -1 and 1, and the correlation coefficient tells us how strong or weak the linear relationship is. Values close to positive or negative 1 indicate a strong correlation (the sign matches the slope of the line, which is positive here), values around positive or negative 0.5 indicate a moderate correlation, and values close to zero indicate little to no correlation.
Because my value is 0.01533615, it is barely above 0 and nowhere near 0.5, meaning the positive correlation is extremely weak, though technically still present.
For a linear regression, the model follows the equation y = mx + b. The equation for my model is: count = 0.003929 × (duration in minutes) + 3379.015813.
How do we interpret the equation? For each additional minute of duration, the predicted shape count increases by 0.003929, starting from an intercept of about 3379.
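As a quick worked example (a sketch, using a hypothetical 60-minute sighting), the fitted model can be plugged into predict():
# Predicted shape count for a 60-minute sighting (sketch)
predict(fit1, newdata = tibble(`duration (minutes)` = 60))
# by hand: 0.003929 * 60 + 3379.015813, roughly 3379.25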
To check whether the results are significant, we must look at the p-value. The typical significance level is 0.05, or 5%. My p-value is 0.004606, well below that threshold and close to 0, so the slope is statistically significant: there is a real positive relationship between the duration in minutes and the shape count, even though the correlation itself is very weak.
Here is the code for the popup box that will appear when clicking on the points of my final map. I called it ufopop and used the paste0() function to show the date and time, the state and city, the shape, and the duration of each UFO sighting.
ufopop <- paste0(
"<b> Date & Time: </b>", ufo3$datetime, "<br>",
"<b> State: </b>", ufo3$state, "<br>",
"<b> City: </b>", ufo3$city, "<br>",
"<b> Shape: </b>", ufo3$shape, "<br>",
"<b> Duration (in seconds): </b>", ufo3$`duration (seconds)`, "<br>")
Here is the actual plot! I call leaflet() and set the center latitude and longitude; a simple Google search gave me the coordinates for the center of the United States. Since the whole country has to be in view, I set the zoom to 4. I thought a dark map would suit the theme of UFOs, since external research suggests that most UFO sightings occur at night, so “CartoDB.DarkMatter” is the base map I decided to use. Then, to plot the actual points, I pass in the dataset and set the radius of the circles to the duration in seconds, divided by 6000 because otherwise the circles would fill up the whole WORLD map. I set the outline color to a dark yellow, the fill color to a simple white, and the fill opacity to 0.05, and then set the popup to ufopop, the one I created earlier.
plot2 <- leaflet() |>
setView(lng = -98.5795, lat = 39.8283, zoom = 4) |>
addProviderTiles("CartoDB.DarkMatter") |>
addCircles(data = ufo3,
radius = ufo3$`duration (seconds)`/6000,
color = "#8B8000",
fillColor = "white",
fillOpacity = 0.05,
popup = ufopop)
## Assuming "longitude" and "latitude" are longitude and latitude, respectively
plot2
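Finally, as a side note, if the map still feels crowded even after filtering, leaflet can cluster nearby points. This is a sketch of a variant using addCircleMarkers() with markerClusterOptions(), not part of my final map:
# Cluster nearby sightings so dense areas collapse into counts (sketch)
plot2_clustered <- leaflet(data = ufo3) |>
  setView(lng = -98.5795, lat = 39.8283, zoom = 4) |>
  addProviderTiles("CartoDB.DarkMatter") |>
  addCircleMarkers(clusterOptions = markerClusterOptions(),
                   color = "#8B8000",
                   popup = ufopop)
plot2_clustered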