This vignette takes a quick look at two useful data exploration plot types provided in the ggExtra package, using a UFO sightings dataset. We aren’t going to worry about style or labels; these are just quick plots for exploring your data before further analysis and before presenting findings to others.
To have some fun, I picked a UFO dataset I found on Kaggle.
ufo <- read.csv(url("https://raw.githubusercontent.com/rachel-greenlee/data607/master/supports/ufo.csv"))
Let’s load the needed libraries.
library(ggplot2)
library(dplyr)
And here is a glimpse at the data.
glimpse(ufo)
#> Rows: 160,666
#> Columns: 11
#> $ datetime <chr> "10/10/1949 20:30", "10/10/1949 21:00", "10/10...
#> $ city <chr> "san marcos", "lackland afb", "chester (uk/eng...
#> $ state <chr> "tx", "tx", "", "tx", "hi", "tn", "", "ct", "a...
#> $ country <chr> "us", "", "gb", "us", "us", "us", "gb", "us", ...
#> $ shape <chr> "cylinder", "light", "circle", "circle", "ligh...
#> $ duration..seconds. <chr> "2700", "7200", "20", "20", "900", "300", "180...
#> $ duration..hours.min. <chr> "45 minutes", "1-2 hrs", "20 seconds", "1/2 ho...
#> $ comments <chr> "This event took place in early fall around 19...
#> $ date.posted <chr> "4/27/2004", "12/16/2005", "1/21/2008", "1/17/...
#> $ latitude <chr> "29.8830556", "29.38421", "53.2", "28.9783333"...
#> $ longitude <chr> "-97.9411111", "-98.581082", "-2.916667", "-96...
We need to clean up the dataframe so we have good clean variables to plot.
#set datetime as a datetime variable
ufo$datetime <- as.POSIXct(ufo$datetime, format="%m/%d/%Y %H:%M", tz=Sys.timezone())
#set seconds as numeric
ufo$duration..seconds. <- as.numeric(ufo$duration..seconds.)
#> Warning: NAs introduced by coercion
#convert from seconds to minutes for readability
ufo$duration_mins <- ufo$duration..seconds. / 60
#subset for just the variables we are interested in
ufo <- ufo[c("datetime", "state", "country", "shape", "duration_mins")]
#make a year-only variable; datetime is already POSIXct, so format it directly (as.Date() converts via UTC by default, which can shift dates near midnight)
ufo$year <- as.numeric(format(ufo$datetime, "%Y"))
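The coercion warning above means that some duration strings could not be read as numbers. Here is a minimal sketch of how one might check what was lost, using a small made-up vector (`raw_secs` and its values are hypothetical and not taken from the UFO file):

```r
# Hypothetical toy input: three clean values and two that are not valid numbers
raw_secs <- c("2700", "7200", "20", "n/a", "")

# Coerce to numeric, silencing the expected coercion warning
secs <- suppressWarnings(as.numeric(raw_secs))

# How many values became NA
n_lost <- sum(is.na(secs))

# Which original strings failed to parse
bad_values <- raw_secs[is.na(secs)]
```

On the real data, the same idea would need to be applied to the original character column before it is overwritten.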
Take one more glimpse at the data now that it’s cleaned up.
glimpse(ufo)
#> Rows: 160,666
#> Columns: 6
#> $ datetime <dttm> 1949-10-10 20:30:00, 1949-10-10 21:00:00, 1955-10-10...
#> $ state <chr> "tx", "tx", "", "tx", "hi", "tn", "", "ct", "al", "fl...
#> $ country <chr> "us", "", "gb", "us", "us", "us", "gb", "us", "us", "...
#> $ shape <chr> "cylinder", "light", "circle", "circle", "light", "sp...
#> $ duration_mins <dbl> 45.0000000, 120.0000000, 0.3333333, 0.3333333, 15.000...
#> $ year <dbl> 1949, 1949, 1955, 1956, 1960, 1961, 1965, 1965, 1966,...
library(ggExtra)
Here we create the base of the chart, store it in “g”, and then use the ggMarginal function from the ggExtra package to create a scatterplot that has histograms along each axis. You can also swap out type = “histogram” for “boxplot”.
#it is common to store the base of your graph in something like "g" and then add the more advanced graph features to that base
g <- ggplot(ufo, aes(x = year, y = duration_mins, color = country)) +
geom_count()
ggMarginal(g, type = "histogram", fill = "transparent")
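For reference, here is a small sketch of a few other ggMarginal options, run on made-up data rather than the UFO dataframe (`toy`, `base_plot`, and the column names are hypothetical). It assumes the `type`, `margins`, and `groupColour` arguments behave as described in the ggExtra documentation:

```r
library(ggplot2)
library(ggExtra)

# Hypothetical toy data standing in for the UFO dataframe
set.seed(1)
toy <- data.frame(
  x = rnorm(200),
  y = rnorm(200),
  grp = sample(c("a", "b"), 200, replace = TRUE)
)

base_plot <- ggplot(toy, aes(x = x, y = y, color = grp)) +
  geom_point()

# Density curves instead of histograms, drawn separately for each colour group
m_density <- ggMarginal(base_plot, type = "density", groupColour = TRUE)

# A marginal histogram on the x axis only
m_x_only <- ggMarginal(base_plot, type = "histogram", margins = "x")
```

Printing `m_density` or `m_x_only` draws the corresponding plot.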
We see above that we have some intense outliers with regard to duration. We can limit the y axis to 1 hour (60 minutes) if we want to zoom in. Let’s try the boxplot type too.
#adding ylim to greatly shrink the y axis so we can zoom in on the majority of the data points without changing our dataframe (points outside the limits are dropped from the plot, with a warning)
g <- ggplot(ufo, aes(x = year, y = duration_mins, color = country)) +
geom_count() +
ylim(0, 60)
ggMarginal(g, type = "boxplot", fill = "transparent")
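One thing to keep in mind with this zoom: in ggplot2, `ylim()` sets scale limits, so values outside the range are treated as missing, whereas `coord_cartesian()` only changes the visible window. The sketch below contrasts the two on made-up data (`toy` is hypothetical). How ggMarginal's marginal plots respond to `coord_cartesian()` is not checked here, so treat this as a plain ggplot2 illustration:

```r
library(ggplot2)

# Hypothetical toy data: 95 ordinary values and 5 extreme outliers
set.seed(1)
toy <- data.frame(
  x = 1:100,
  y = c(runif(95, 0, 60), runif(5, 500, 1000))
)

# ylim() sets scale limits: values outside 0-60 become NA and are not drawn
p_limits <- ggplot(toy, aes(x, y)) +
  geom_point() +
  ylim(0, 60)

# coord_cartesian() zooms the view only; every row stays in the plot data
p_zoom <- ggplot(toy, aes(x, y)) +
  geom_point() +
  coord_cartesian(ylim = c(0, 60))

# Number of rows that fall outside the chosen limits
n_outside <- sum(toy$y > 60)
```

Checking `n_outside` before zooming is a quick way to see how much of the data the narrower axis hides.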
Another great feature in the ggExtra package is the ability to plot super quick frequency plots without having to manipulate your dataframe first.
#when you have a lot of labels they can be hard to read on the x axis; coord_flip will swap the axes
plotCount(table(ufo$shape)) +
coord_flip()
plotCount(table(ufo$country)) +
coord_flip()
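The table() call passed to plotCount can also be inspected directly when you want the numbers behind the bars. A minimal base R sketch on a made-up vector (`shapes` is hypothetical, not the real column):

```r
# Hypothetical toy vector standing in for ufo$shape
shapes <- c("light", "circle", "light", "disk", "circle", "light")

# Frequency table, most common category first
counts <- sort(table(shapes), decreasing = TRUE)

# The same table as proportions, rounded to two decimals
shares <- round(prop.table(counts), 2)
```

Here `counts` holds the raw frequencies and `shares` the fraction of all observations in each category.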
Rachel’s choice in UFO data and the ggExtra package caught my eye. Interesting data and further visualization skills - yes please :) Let’s explore!
Based on Rachel’s final plot we observe that the clear majority of activity was focused in the US. Where exactly was it concentrated though?
We explore this question via the plotCount() function for a frequency plot and the removeGrid() function to see the plot without grid lines:
#subset our data to focus only on US data
ufo_us <- ufo[ which(ufo$country=='us'), ]
#plot the count per state
p <- plotCount(table(ufo_us$state)) +
coord_flip()
#remove grid lines
p + removeGrid()
From the above plot, we see that the most active states are California, Washington, Florida, Texas, and Nevada.
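Reading the top states off the plot works well here, but they could also be computed. A minimal sketch on made-up data (`toy`, `top_states`, and the choice of the top 2 are hypothetical, for illustration only):

```r
# Hypothetical toy dataframe standing in for ufo_us
toy <- data.frame(
  state = c("ca", "ca", "ca", "wa", "wa", "tx", "ny"),
  duration_mins = c(5, 10, 1, 3, 20, 2, 7)
)

# Names of the two most frequent states, most frequent first
top_states <- names(sort(table(toy$state), decreasing = TRUE))[1:2]

# Keep only the rows for those states
toy_top <- toy[toy$state %in% top_states, ]
```

The same pattern with `[1:5]` on the real table would give the five most active states without typing them by hand.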
We can thus subset our data once again to focus exclusively on these states when we revisit the marginal plot (50 colors / variables would have been a bit much):
#subset our data to focus on the most active states
ufo_us2 <- ufo[ufo$state %in% c("ca", "wa", "fl", "tx", "nv"), ]
At this point, we’re ready to re-plot …
We plot the duration of a sighting vs. the year of the sighting, this time color-coded by the state the activity was reported under. We then revisit the ggMarginal() function by plotting a histogram without fill on the axes:
g2 <- ggplot(ufo_us2, aes(x = year, y = duration_mins, color = state)) +
geom_count() +
ylim(0, 60)
ggMarginal(g2, type = "histogram", fill = "transparent")
It’s interesting to note that the first recorded report (in our dataset) is from 1925 in Texas and that the count increased dramatically as we neared and then passed the year 2000.
Are people seeking more attention, or has UFO activity really spiked?
…
In addition to the ggMarginal plot from above, I thought it might be interesting to explore different variables and a different style of plot with the US data.
We plot the datetime of a sighting vs. the shape of the reported UFO. We then revisit the ggMarginal() function (again), this time as a densigram:
g2 <- ggplot(ufo_us2, aes(x = shape, y = datetime, color = state)) +
geom_count()
ggMarginal(g2 + rotateTextX(), type = "densigram", fill = "transparent")
With regard to datetime, it tells us the same thing that we observed before: the number of reports increased near and after the year 2000. It is interesting to observe the difference in densities based on shapes in this era though.
It’s also interesting to note the variance of reported shapes of these UFOs as time progressed. It seems the most common reports are “light”, “formation”, “fireball”, and “circle”.
All in all, this was a very interesting dataset and library to explore, and I’m thankful that Rachel selected it in the first place and gave me the opportunity to extend the vignette. Hope this was as interesting for you to peruse as it was for me!