Introduction

UFO sightings have had the world in a chokehold since 1947, when the first widely reported UFO sighting occurred. A businessman named Kenneth Arnold reported seeing a group of nine high-speed objects near Mount Rainier in the state of Washington while flying his personal plane. “Arnold estimated the speed of the crescent-shaped objects as several thousand miles per hour and said they moved ‘like saucers skipping on water.’ In the newspaper report that followed, it was mistakenly stated that the objects were saucer-shaped, hence the term flying saucer.”

Source: https://www.history.com/topics/paranormal/history-of-ufos

In this project, I’ll be working with a dataset of UFO sightings dating back to 1949. My final visualization will include the date and time of each sighting, the state and city where it took place, the shape of the UFO, and the duration in seconds. This was a very messy dataset, and it took a lot of time to clean up: I removed N/As, capitalized the variables I’m using, filtered for a single country, filtered for sightings above a minimum duration, recoded abbreviations to their full names, coerced variables to numeric, and mutated to create new columns.
I decided to use this particular dataset because ever since I was little, I’ve found the concept of UFOs fascinating. There were actually a couple of times my sister and I noticed strange lights in the sky in Olney, Maryland: three oval-shaped lights that would constantly circle each other in the clouds, and we could only see them at night. I thought of that while looking at this dataset and realized it would be perfect to work on.

Source: Multiple Data Conversions from NUFORC Data by Sigmond Axel

Setting Up & Cleaning Up

First things first, let’s load the libraries I’ll be working with today. The tidyverse library is the standard collection that bundles many different packages. The leaflet library is what I will be using to plot a map with points representing little UFOs.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(leaflet)

I’m going to start by setting my working directory and using the read_csv() function to bring in my dataset, which I’ve simply named “ufo”. Then I used the head() function to get a general sense of what the dataset looks like.

setwd("/Users/aashkanavale/Desktop/Montgomery College/MC Spring '24/DATA110/PROJECTS/Project 2 - UFO Sightings/ufo sightings")
ufo <- read_csv("complete.csv")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
## Rows: 88875 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): datetime, city, state, country, shape, duration (hours/min), comme...
## dbl  (1): duration (seconds)
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(ufo)
## # A tibble: 6 × 11
##   datetime city  state country shape `duration (seconds)` `duration (hours/min)`
##   <chr>    <chr> <chr> <chr>   <chr>                <dbl> <chr>                 
## 1 10/10/1… san … tx    us      cyli…                 2700 45 minutes            
## 2 10/10/1… lack… tx    <NA>    light                 7200 1-2 hrs               
## 3 10/10/1… ches… <NA>  gb      circ…                   20 20 seconds            
## 4 10/10/1… edna  tx    us      circ…                   20 1/2 hour              
## 5 10/10/1… kane… hi    us      light                  900 15 minutes            
## 6 10/10/1… bris… tn    us      sphe…                  300 5 minutes             
## # ℹ 4 more variables: comments <chr>, `date posted` <chr>, latitude <chr>,
## #   longitude <chr>
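
Before moving on, a side note: the parsing warning and the column-type message above can both be handled explicitly. Here is a minimal sketch, assuming the same complete.csv file, that quiets the message and inspects the rows that triggered the warning.

ufo <- read_csv("complete.csv", show_col_types = FALSE)  # quiets the column specification message
problems(ufo)  # lists the row and column of each parsing issue
spec(ufo)      # prints the full column specification that read_csv() guessed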

Removing N/As

Because I don’t want any N/As in my dataset, I’m going to first check for N/As and then remove them.

anyNA(ufo$datetime)
## [1] FALSE
anyNA(ufo$city)
## [1] TRUE
anyNA(ufo$state)
## [1] TRUE
anyNA(ufo$country)
## [1] TRUE
anyNA(ufo$shape)
## [1] TRUE
anyNA(ufo$`duration (seconds)`)
## [1] TRUE
anyNA(ufo$`duration (hours/min)`)
## [1] TRUE
anyNA(ufo$comments)
## [1] TRUE
anyNA(ufo$`date posted`)
## [1] FALSE
anyNA(ufo$latitude)
## [1] FALSE
anyNA(ufo$longitude)
## [1] FALSE
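
As a quick aside, the same check can be done in one pass instead of eleven separate anyNA() calls; a small sketch using base R:

colSums(is.na(ufo))  # number of N/As in each column; 0 means the column is complete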

As you can see (excluding the datetime column), the next 7 variables all include N/As. To remove them, I’m going to use filter() and the !is.na() function to drop the rows with N/As in those 7 variables.

ufo2 <- ufo |>
  filter(!is.na(city),
         !is.na(state),
         !is.na(country),
         !is.na(shape),
         !is.na(`duration (seconds)`),
         !is.na(`duration (hours/min)`),
         !is.na(comments))
head(ufo2)
## # A tibble: 6 × 11
##   datetime city  state country shape `duration (seconds)` `duration (hours/min)`
##   <chr>    <chr> <chr> <chr>   <chr>                <dbl> <chr>                 
## 1 10/10/1… san … tx    us      cyli…                 2700 45 minutes            
## 2 10/10/1… edna  tx    us      circ…                   20 1/2 hour              
## 3 10/10/1… kane… hi    us      light                  900 15 minutes            
## 4 10/10/1… bris… tn    us      sphe…                  300 5 minutes             
## 5 10/10/1… norw… ct    us      disk                  1200 20 minutes            
## 6 10/10/1… pell… al    us      disk                   180 3  minutes            
## # ℹ 4 more variables: comments <chr>, `date posted` <chr>, latitude <chr>,
## #   longitude <chr>
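
For reference, the same row-dropping can be written more compactly with tidyr’s drop_na(), which loads with the tidyverse; this sketch is equivalent to the filter() call above.

ufo2 <- ufo |>
  drop_na(city, state, country, shape,
          `duration (seconds)`, `duration (hours/min)`, comments)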

Now that the N/As are removed, let’s begin cleaning up. There is quite a lot to clean. The city, state, and country names are all lowercase, and the state and country names are abbreviated. The duration (hours/min), latitude, and longitude variables aren’t stored as numeric, which will make them difficult to work with. Let’s clean all of this up, starting with the cities.

Renaming Values

I’m going to use the str_to_title() function from the stringr package. This function converts to title case, where only the first letter of each word is capitalized.

ufo2$city <- str_to_title(ufo2$city)
head(ufo2)
## # A tibble: 6 × 11
##   datetime city  state country shape `duration (seconds)` `duration (hours/min)`
##   <chr>    <chr> <chr> <chr>   <chr>                <dbl> <chr>                 
## 1 10/10/1… San … tx    us      cyli…                 2700 45 minutes            
## 2 10/10/1… Edna  tx    us      circ…                   20 1/2 hour              
## 3 10/10/1… Kane… hi    us      light                  900 15 minutes            
## 4 10/10/1… Bris… tn    us      sphe…                  300 5 minutes             
## 5 10/10/1… Norw… ct    us      disk                  1200 20 minutes            
## 6 10/10/1… Pell… al    us      disk                   180 3  minutes            
## # ℹ 4 more variables: comments <chr>, `date posted` <chr>, latitude <chr>,
## #   longitude <chr>

Much better! I’m going to repeat this process for state, country, and shape as well.

ufo2$state <- str_to_title(ufo2$state)
ufo2$country <- str_to_title(ufo2$country)
ufo2$shape <- str_to_title(ufo2$shape)
head(ufo2)
## # A tibble: 6 × 11
##   datetime city  state country shape `duration (seconds)` `duration (hours/min)`
##   <chr>    <chr> <chr> <chr>   <chr>                <dbl> <chr>                 
## 1 10/10/1… San … Tx    Us      Cyli…                 2700 45 minutes            
## 2 10/10/1… Edna  Tx    Us      Circ…                   20 1/2 hour              
## 3 10/10/1… Kane… Hi    Us      Light                  900 15 minutes            
## 4 10/10/1… Bris… Tn    Us      Sphe…                  300 5 minutes             
## 5 10/10/1… Norw… Ct    Us      Disk                  1200 20 minutes            
## 6 10/10/1… Pell… Al    Us      Disk                   180 3  minutes            
## # ℹ 4 more variables: comments <chr>, `date posted` <chr>, latitude <chr>,
## #   longitude <chr>
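
If you would rather not repeat the assignment four times, the same capitalization can be done in a single mutate() with across(); a small sketch:

ufo2 <- ufo2 |>
  mutate(across(c(city, state, country, shape), str_to_title))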

Great. Even with the capitalization fixed, all of the states and countries are still abbreviated, which doesn’t make them easy to understand at a glance. I’m using the unique() function to find out which values appear in the country and state columns.

unique(ufo2$country)
## [1] "Us" "Ca" "Au" "Gb"
unique(ufo2$state)
##  [1] "Tx" "Hi" "Tn" "Ct" "Al" "Fl" "Ca" "Nc" "Ny" "Ky" "Mi" "Ma" "Ks" "Sc" "Wa"
## [16] "Co" "Nh" "Wi" "Me" "Ga" "Pa" "Il" "Ar" "On" "Mo" "Oh" "In" "Az" "Mn" "Nv"
## [31] "Nf" "Ne" "Or" "Bc" "Ia" "Va" "Id" "Nm" "Nj" "Mb" "Wv" "Ok" "Ak" "Ri" "Nb"
## [46] "Vt" "La" "Nd" "Pr" "Ms" "Ut" "Md" "Ab" "Mt" "Sk" "Wy" "Sd" "De" "Pq" "Nt"
## [61] "Qc" "Sa" "Ns" "Pe" "Yk" "Yt" "Dc"

After some external research and a couple of educated guesses, I came to realize that “Us” is the United States, “Ca” is Canada, “Au” is Australia, and “Gb” is Great Britain. There was also a “De” for Germany, but I believe those rows were dropped because they had too many N/As.

There are also a lot of state codes we can’t make sense of, because some of these countries don’t have states; abbreviations like “Wa” can stand for Western Australia. Since these aren’t actual US states, I’m going to stick with the country we all know: the United States. I’m going to filter for just the United States and then pull out all of the unique state values.

ufo2 <- ufo2 |>
  filter(country == "Us")

unique(ufo2$state)
##  [1] "Tx" "Hi" "Tn" "Ct" "Al" "Fl" "Ca" "Nc" "Ny" "Ky" "Mi" "Ma" "Ks" "Sc" "Wa"
## [16] "Co" "Nh" "Wi" "Me" "Ga" "Pa" "Il" "Ar" "Mo" "Oh" "In" "Az" "Mn" "Nv" "Ne"
## [31] "Or" "Ia" "Va" "Id" "Nm" "Nj" "Wv" "Ok" "Ak" "Ri" "Vt" "La" "Nd" "Pr" "Ms"
## [46] "Ut" "Md" "Mt" "Wy" "Sd" "De" "Dc"

The US state abbreviations are all still there. However, upon closer inspection, there are 2 extra abbreviations: “Pr”, which stands for Puerto Rico, and “Dc”, which stands for Washington DC.

Now I’m going to rename “Us” to “United States” and the state abbreviations to their full, spelled-out names. I’m doing so by using the mutate() function first and, within that, using the recode() function to rename the specific values.

ufo2 <- mutate(ufo2, country = recode(country,
                                    'Us' = "United States"))
ufo2 <- mutate(ufo2, state = recode(state,
                                  'Tx' = "Texas",
                                  'Hi' = "Hawaii",
                                  'Tn' = "Tennessee",
                                  'Ct' = "Connecticut",
                                  'Al' = "Alabama",
                                  'Fl' = "Florida",
                                  'Ca' = "California",
                                  'Nc' = "North Carolina",
                                  'Ny' = "New York",
                                  'Ky' = "Kentucky",
                                  'Mi' = "Michigan",
                                  'Ma' = "Massachusetts",
                                  'Ks' = "Kansas",
                                  'Sc' = "South Carolina",
                                  'Wa' = "Washington",
                                  'Co' = "Colorado",
                                  'Nh' = "New Hampshire",
                                  'Wi' = "Wisconsin",
                                  'Me' = "Maine",
                                  'Ga' = "Georgia",
                                  'Pa' = "Pennsylvania",
                                  'Il' = "Illinois",
                                  'Ar' = "Arkansas",
                                  'Mo' = "Missouri",
                                  'Oh' = "Ohio",
                                  'In' = "Indiana",
                                  'Az' = "Arizona",
                                  'Mn' = "Minnesota",
                                  'Nv' = "Nevada",
                                  'Ne' = "Nebraska",
                                  'Or' = "Oregon",
                                  'Ia' = "Iowa",
                                  'Va' = "Virginia",
                                  'Id' = "Idaho",
                                  'Nm' = "New Mexico",
                                  'Nj' = "New Jersey",
                                  'Wv' = "West Virginia",
                                  'Ok' = "Oklahoma",
                                  'Ak' = "Alaska",
                                  'Ri' = "Rhode Island",
                                  'Vt' = "Vermont",
                                  'La' = "Louisiana",
                                  'Nd' = "North Dakota",
                                  'Pr' = "Puerto Rico",
                                  'Ms' = "Mississippi",
                                  'Ut' = "Utah",
                                  'Md' = "Maryland",
                                  'Mt' = "Montana",
                                  'Wy' = "Wyoming",
                                  'Sd' = "South Dakota",
                                  'De' = "Delaware",
                                  'Dc' = "Washington DC"))
head(ufo2)
## # A tibble: 6 × 11
##   datetime city  state country shape `duration (seconds)` `duration (hours/min)`
##   <chr>    <chr> <chr> <chr>   <chr>                <dbl> <chr>                 
## 1 10/10/1… San … Texas United… Cyli…                 2700 45 minutes            
## 2 10/10/1… Edna  Texas United… Circ…                   20 1/2 hour              
## 3 10/10/1… Kane… Hawa… United… Light                  900 15 minutes            
## 4 10/10/1… Bris… Tenn… United… Sphe…                  300 5 minutes             
## 5 10/10/1… Norw… Conn… United… Disk                  1200 20 minutes            
## 6 10/10/1… Pell… Alab… United… Disk                   180 3  minutes            
## # ℹ 4 more variables: comments <chr>, `date posted` <chr>, latitude <chr>,
## #   longitude <chr>

After a grueling 10 minutes of typing out that recode() call, it looks much better.
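
In hindsight, those 10 minutes could have been shortened: R ships with the built-in state.abb and state.name vectors, so the lookup can be generated instead of typed out. Here is a sketch, assuming the title-case abbreviations in this dataset (Puerto Rico and Washington DC still have to be added by hand):

state_lookup <- setNames(state.name, str_to_title(state.abb))  # "Tx" -> "Texas", and so on
state_lookup <- c(state_lookup, Pr = "Puerto Rico", Dc = "Washington DC")
ufo2 <- ufo2 |>
  mutate(state = recode(state, !!!state_lookup))

Either way, now that that problem is solved, let’s work on the numeric columns.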

Fixing Time & Numeric Values

Let’s start with the numbers. First, I’m going to check the classes of all of my numeric variables.

class(ufo2$`duration (seconds)`)
## [1] "numeric"
class(ufo2$`duration (hours/min)`)
## [1] "character"
class(ufo2$latitude)
## [1] "character"
class(ufo2$longitude)
## [1] "character"

Looks like one of the duration columns and both coordinate columns are stored as characters. Let’s coerce the latitudes and longitudes first, since that’s easier.

ufo2$latitude <- as.numeric(ufo2$latitude)
ufo2$longitude <- as.numeric(ufo2$longitude)
class(ufo2$latitude)
## [1] "numeric"
class(ufo2$longitude)
## [1] "numeric"

The duration in hours/minutes variable is extremely messy and difficult to work with. I’ll let you see for yourself in the printout below.

ufo2
## # A tibble: 66,411 × 11
##    datetime         city       state          country shape `duration (seconds)`
##    <chr>            <chr>      <chr>          <chr>   <chr>                <dbl>
##  1 10/10/1949 20:30 San Marcos Texas          United… Cyli…                 2700
##  2 10/10/1956 21:00 Edna       Texas          United… Circ…                   20
##  3 10/10/1960 20:00 Kaneohe    Hawaii         United… Light                  900
##  4 10/10/1961 19:00 Bristol    Tennessee      United… Sphe…                  300
##  5 10/10/1965 23:45 Norwalk    Connecticut    United… Disk                  1200
##  6 10/10/1966 20:00 Pell City  Alabama        United… Disk                   180
##  7 10/10/1966 21:00 Live Oak   Florida        United… Disk                   120
##  8 10/10/1968 13:00 Hawthorne  California     United… Circ…                  300
##  9 10/10/1968 19:00 Brevard    North Carolina United… Fire…                  180
## 10 10/10/1970 16:00 Bellmore   New York       United… Disk                  1800
## # ℹ 66,401 more rows
## # ℹ 5 more variables: `duration (hours/min)` <chr>, comments <chr>,
## #   `date posted` <chr>, latitude <dbl>, longitude <dbl>

So, I decided to mutate a new column that converts the duration in seconds into minutes.

ufo3 <- ufo2 |>
  mutate(`duration (minutes)` = round(`duration (seconds)` / 60, 2))
head(ufo2)
## # A tibble: 6 × 11
##   datetime city  state country shape `duration (seconds)` `duration (hours/min)`
##   <chr>    <chr> <chr> <chr>   <chr>                <dbl> <chr>                 
## 1 10/10/1… San … Texas United… Cyli…                 2700 45 minutes            
## 2 10/10/1… Edna  Texas United… Circ…                   20 1/2 hour              
## 3 10/10/1… Kane… Hawa… United… Light                  900 15 minutes            
## 4 10/10/1… Bris… Tenn… United… Sphe…                  300 5 minutes             
## 5 10/10/1… Norw… Conn… United… Disk                  1200 20 minutes            
## 6 10/10/1… Pell… Alab… United… Disk                   180 3  minutes            
## # ℹ 4 more variables: comments <chr>, `date posted` <chr>, latitude <dbl>,
## #   longitude <dbl>

Perfect. One last problem, however: there are just too many observations. 66,000+ observations would make the map look like a solid mass of dots, and you wouldn’t be able to make anything out. So I’m going to filter for durations above 120 seconds, or in other words, 2 minutes.

ufo4 <- ufo3 |>
  filter(`duration (seconds)` > 120) 

Now that our dataset is all cleaned up and ready to use, let’s analyze the variables being used in a linear regression analysis.

Linear Regression Analysis

A linear regression analysis is a method for modeling the relationship between two or more quantitative variables. My regression will show the relationship between the duration of a sighting and how often its shape was reported. However, this dataset doesn’t have a second numeric variable besides the duration. So let’s create one!

Here, I grouped by shape and used n() inside summarize() to count how many times each shape occurred, listing the other columns so they are kept alongside the count. Then I printed this latest version of the ufo dataset.

ufo4 <- ufo4 |>
  group_by(shape) |>
  summarize(count = n(), datetime, city, state, country, shape, `duration (seconds)`, `duration (hours/min)`, comments, `date posted`, latitude, longitude, `duration (minutes)`)
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `summarise()` has grouped output by 'shape'. You can override using the
## `.groups` argument.
ufo4
## # A tibble: 34,132 × 13
## # Groups:   shape [26]
##    shape    count datetime         city       state country `duration (seconds)`
##    <chr>    <int> <chr>            <chr>      <chr> <chr>                  <dbl>
##  1 Changed      1 6/24/1996 00:30  Aurora     Colo… United…                 3600
##  2 Changing  1160 10/10/1998 02:30 Hollywood  Cali… United…                  300
##  3 Changing  1160 10/10/1999 00:01 Martinez   Cali… United…                 3600
##  4 Changing  1160 10/10/2001 21:30 Fresno     Cali… United…                  900
##  5 Changing  1160 10/10/2007 01:00 Stockbrid… Geor… United…                 3600
##  6 Changing  1160 10/10/2007 04:00 Denver     Colo… United…                 2700
##  7 Changing  1160 10/10/2009 23:00 Michigan … Indi… United…                  900
##  8 Changing  1160 10/10/2010 23:00 Miami      Flor… United…                 1200
##  9 Changing  1160 10/10/2012 20:00 Moundville Alab… United…                  240
## 10 Changing  1160 10/10/2012 21:00 Austin     Texas United…                 1200
## # ℹ 34,122 more rows
## # ℹ 6 more variables: `duration (hours/min)` <chr>, comments <chr>,
## #   `date posted` <chr>, latitude <dbl>, longitude <dbl>,
## #   `duration (minutes)` <dbl>
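
The deprecation warning above comes from summarize() returning more than one row per group. A cleaner equivalent, sketched here, would keep every column and simply attach the per-shape count with add_count() in place of the summarize() call:

ufo4 <- ufo4 |>
  add_count(shape, name = "count")  # adds a "count" column with the number of sightings per shape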

To actually plot the regression analysis, I called upon ufo4 and attached ggplot() and the geom_point() function to it. I set the x-axis to the duration in minutes and the y-axis to the shape count. Then I set the x and y limits to cover the full range of each variable. After that I labelled my graph and used theme_minimal() to replace the basic R theme.

regression <- ufo4 |>
  ggplot() +
  geom_point(aes(x = `duration (minutes)`, y = count)) +
  xlim(0, 1104600) +
  ylim(1, 14014) +
  labs(title = "UFO Sightings Duration vs. Shape Count",
       x = "Duration (in minutes)",
       y = "Shape Count",
       caption = "Source:\nMultiple Data Conversions from NUFORC Data by Sigmond Axel") +
  theme_minimal(base_size = 12)

Here I’m just adding on colors to the points and plotting the actual regression line.

regression <- regression +
  geom_point(aes(x = `duration (minutes)`, y = count), col = "black") +
  geom_smooth(aes(x = `duration (minutes)`, y = count), 
              method = 'lm', formula = y~x, color = "#8B8000", se = FALSE, linetype = "dotdash", linewidth = 1) +
  ggtitle("UFO Sightings Duration vs. Shape Count") +
  theme(text = element_text(family = "serif")) 
regression

Looking at the plot, the fitted line has a positive slope, which would suggest that the longer a sighting lasts, the more common its reported shape tends to be. As the numbers below show, however, that relationship is very weak.

cor(ufo4$`duration (minutes)`, ufo4$count)
## [1] 0.01533615
fit1 <- lm(count ~ `duration (minutes)`, data = ufo4)
summary(fit1)
## 
## Call:
## lm(formula = count ~ `duration (minutes)`, data = ufo4)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -3380  -1163   -978     57   4149 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          3.379e+03  1.282e+01 263.524  < 2e-16 ***
## `duration (minutes)` 3.929e-03  1.387e-03   2.834  0.00461 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2369 on 34130 degrees of freedom
## Multiple R-squared:  0.0002352,  Adjusted R-squared:  0.0002059 
## F-statistic: 8.029 on 1 and 34130 DF,  p-value: 0.004606

“cor()” stands for “correlation”. This value is always between -1 and 1. The correlation coefficient tells us how strong or weak the correlation is: values near positive or negative 1 indicate a strong correlation (the sign is determined by the slope of the line, which in this case is positive), values around positive or negative 0.5 indicate a moderate correlation, and values close to zero indicate little to no correlation.

Because my value is 0.01533615, it is barely above 0 and nowhere near 0.5, meaning the correlation is positive but extremely weak.

For a linear regression, the equation of a line (y = mx + b) is used. The equation for my model is: count = 0.003929 × duration (minutes) + 3379.015813

How do we interpret the equation? For every additional minute of duration, the predicted shape count increases by 0.003929.
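
To sanity-check that equation against the model object, the coefficients can be pulled out with coef() and used directly. A small sketch; the 30-minute sighting is just a hypothetical example value:

coef(fit1)                          # the intercept and slope used in the equation above
coef(fit1)[1] + coef(fit1)[2] * 30  # predicted shape count for a 30-minute sighting
predict(fit1, newdata = tibble(`duration (minutes)` = 30))  # the same prediction via predict()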

To check whether the results are significant, we must look at the p-value. The typical significance level is 0.05, or 5%. My p-value is 0.004606, which is well below that threshold and quite close to 0, so the slope is statistically significant. In practical terms, it tells us there is a weak yet still positive relationship between the duration in minutes and the shape count.

Graphing

Final Plot

Here is the code for the popup box that will appear when clicking on the points of my final graph. I called it ufopop and used the paste0() function to show the date and time, the state and city, the shape, and the duration of each UFO sighting.

ufopop <- paste0(
  "<b> Date & Time: </b>", ufo3$datetime, "<br>",
  "<b> State: </b>", ufo3$state, "<br>",
  "<b> City: </b>", ufo3$city, "<br>",
  "<b> Shape: </b>", ufo3$shape, "<br>",
  "<b> Duration (in seconds): </b>", ufo3$`duration (seconds)`, "<br>")

Here is the actual plot! I call the leaflet() function and set the latitude and longitude; a quick Google search gave me the coordinates for the center of the United States. Since the whole country has to be in view, I set the zoom to 4. I thought a dark map would suit the theme of UFOs, since external research shows that most UFO sightings occur at night, and “CartoDB.DarkMatter” is the name of the basemap I decided to use. To plot the actual points, I pass in the dataset and set the radius of the points to the duration in seconds divided by 6000, because the points would fill up the whole WORLD map otherwise. I set the color to a dark yellow, the fill color to a simple white, the opacity to 0.05, and the popup to ufopop, the one I created earlier.

plot2 <- leaflet() |>
 setView(lng = -98.5795, lat = 39.8283, zoom = 4) |>
 addProviderTiles("CartoDB.DarkMatter") |>
  addCircles(data = ufo3,
             radius = ufo3$`duration (seconds)`/6000,
             color = "#8B8000",
             fillColor = "white",
             fillOpacity = 0.05,
             popup = ufopop)
## Assuming "longitude" and "latitude" are longitude and latitude, respectively
plot2
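
One small refinement worth noting: leaflet prints the “Assuming...” message because it has to guess which columns hold the coordinates. Naming them explicitly with formula notation, as in this sketch of the same map, silences the guess:

plot2 <- leaflet() |>
  setView(lng = -98.5795, lat = 39.8283, zoom = 4) |>
  addProviderTiles("CartoDB.DarkMatter") |>
  addCircles(data = ufo3,
             lng = ~longitude, lat = ~latitude,
             radius = ~`duration (seconds)`/6000,
             color = "#8B8000",
             fillColor = "white",
             fillOpacity = 0.05,
             popup = ufopop)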

Conclusion

Cleaning Up

I’ve learnt many new things while cleaning up my dataset: how to recode values under a variable so they say something else, how to capitalize all of the values under a variable, and new tricks with mutating, filtering, grouping, and more. The cleaning was a little bit difficult, but I was able to push through and complete it. I can’t wait to carry this knowledge forward to the final project.

Visualization Representation

My visualization shows a point for every UFO sighting over 2 minutes long in the United States since 1949. Looking at the map, it definitely looks like the East Coast has had the most UFO sightings in all of America, while the middle of the US and the Midwest have far fewer sightings than the rest of the country.

Reflection

If I’m being completely honest, this dataset was absolute trash. The variables weren’t very clear, and upon further inspection, a lot of them don’t really mean much. It was only after I had cleaned everything up and begun the linear regression analysis that I realized I was screwed. However, I had come too far to restart, so I kept pushing. I’m not very happy with my final visualization, either. I know I could’ve done much better in terms of the visuals, but this dataset really was extremely difficult to work with, which is why I’m going to start researching a dataset for the final project now and come up with a list of my favorites that are actually usable. This is definitely my most disappointing visualization yet, but I’m not going to dwell on it. I’ve realized that it’s important to devote a good amount of time to researching datasets and finding one that is both interesting and usable.