Introduction

This vignette takes a quick look at two data exploration plot types provided by the ggExtra package, using a UFO sightings dataset. We aren’t going to worry about style or labels; these are just quick plots for exploring your data before further analysis and before presenting findings to others.

Setup

To have some fun, I picked a UFO dataset I found on Kaggle.

ufo <- read.csv(url("https://raw.githubusercontent.com/rachel-greenlee/data607/master/supports/ufo.csv"))

Let’s load the needed libraries.

library(ggplot2)
library(dplyr)

And here is a glimpse at the data.

glimpse(ufo)
#> Rows: 160,666
#> Columns: 11
#> $ datetime             <chr> "10/10/1949 20:30", "10/10/1949 21:00", "10/10...
#> $ city                 <chr> "san marcos", "lackland afb", "chester (uk/eng...
#> $ state                <chr> "tx", "tx", "", "tx", "hi", "tn", "", "ct", "a...
#> $ country              <chr> "us", "", "gb", "us", "us", "us", "gb", "us", ...
#> $ shape                <chr> "cylinder", "light", "circle", "circle", "ligh...
#> $ duration..seconds.   <chr> "2700", "7200", "20", "20", "900", "300", "180...
#> $ duration..hours.min. <chr> "45 minutes", "1-2 hrs", "20 seconds", "1/2 ho...
#> $ comments             <chr> "This event took place in early fall around 19...
#> $ date.posted          <chr> "4/27/2004", "12/16/2005", "1/21/2008", "1/17/...
#> $ latitude             <chr> "29.8830556", "29.38421", "53.2", "28.9783333"...
#> $ longitude            <chr> "-97.9411111", "-98.581082", "-2.916667", "-96...

We need to clean up the dataframe so we have good clean variables to plot.

#set datetime as a datetime variable
ufo$datetime <- as.POSIXct(ufo$datetime, format="%m/%d/%Y %H:%M", tz=Sys.timezone())

#set seconds as numeric
ufo$duration..seconds. <- as.numeric(ufo$duration..seconds.)
#> Warning: NAs introduced by coercion

#convert from seconds to minutes for readability 
ufo$duration_mins <- ufo$duration..seconds. / 60

#subset for just the variables we are interested in
ufo <- ufo[c("datetime", "state", "country", "shape", "duration_mins")]

#make a year-only variable; datetime is already POSIXct, so format() can pull out the year directly
ufo$year <- as.numeric(format(ufo$datetime, "%Y"))

Take one more glimpse at the data now that it’s cleaned up.

glimpse(ufo)
#> Rows: 160,666
#> Columns: 6
#> $ datetime      <dttm> 1949-10-10 20:30:00, 1949-10-10 21:00:00, 1955-10-10...
#> $ state         <chr> "tx", "tx", "", "tx", "hi", "tn", "", "ct", "al", "fl...
#> $ country       <chr> "us", "", "gb", "us", "us", "us", "gb", "us", "us", "...
#> $ shape         <chr> "cylinder", "light", "circle", "circle", "light", "sp...
#> $ duration_mins <dbl> 45.0000000, 120.0000000, 0.3333333, 0.3333333, 15.000...
#> $ year          <dbl> 1949, 1949, 1955, 1956, 1960, 1961, 1965, 1965, 1966,...
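
The “NAs introduced by coercion” warning above means some raw values could not be converted. Here is a minimal sketch of how you might count them before plotting. The toy vectors below are made up for illustration (they are not rows from the UFO file), so the snippet runs on its own.

```r
# Toy vectors standing in for the raw character columns (invented values)
raw_seconds  <- c("2700", "7200", "20", "2`", "")
raw_datetime <- c("10/10/1949 20:30", "10/10/1949 21:00",
                  "bad value", "1/1/2000 00:15", "4/27/2004 13:05")

# Same conversions as in the cleanup step; unparseable entries become NA
secs <- suppressWarnings(as.numeric(raw_seconds))
dt   <- as.POSIXct(raw_datetime, format = "%m/%d/%Y %H:%M", tz = "UTC")

# Count how many values were lost in each conversion
na_counts <- c(seconds = sum(is.na(secs)), datetime = sum(is.na(dt)))
na_counts
```

On the real data, `colSums(is.na(ufo))` would give the same kind of per-column count.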

Plot histograms or boxplots on scatterplot axes

library(ggExtra)

Here we create the base of the chart and store it in “g”, then use the ggMarginal function from the ggExtra package to add histograms along each axis of the scatterplot. You can also swap out type = “histogram” for “boxplot”.

#it is common to store the base of your graph in something like "g" and then add the more advanced graph features to that base
g <- ggplot(ufo, aes(x = year, y = duration_mins, color = country)) +
  geom_count() 

ggMarginal(g, type = "histogram", fill = "transparent")

We see above that there are some extreme outliers in duration. To zoom in, we can limit the y axis to 1 hour (60 minutes). Let’s try the boxplot too.

#adding ylim restricts the y axis so we can zoom in on the majority of the data points without changing our dataframe (ggplot2 drops the out-of-range points from the plot with a warning)
g <- ggplot(ufo, aes(x = year, y = duration_mins, color = country)) +
  geom_count() +
  ylim(0, 60)

ggMarginal(g, type = "boxplot", fill = "transparent")
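
Beyond histograms and boxplots, ggMarginal takes a few other arguments that are handy for quick exploration. The sketch below uses a small made-up data frame (`toy` is invented here and only mimics the shape of the ufo columns) to show a density curve on a single margin, split by the color grouping. Treat it as one possible variation rather than the recommended settings.

```r
library(ggplot2)
library(ggExtra)

# Toy data invented for illustration; not taken from the UFO file
set.seed(42)
toy <- data.frame(
  year = sample(1990:2010, 200, replace = TRUE),
  duration_mins = rexp(200, rate = 1 / 10),
  country = sample(c("us", "gb"), 200, replace = TRUE)
)

p <- ggplot(toy, aes(x = year, y = duration_mins, color = country)) +
  geom_point()

# Density curve on the x margin only, with one curve per country
m <- ggMarginal(p, type = "density", margins = "x", groupColour = TRUE)
m
```

The same call should work with the “g” object built above in place of `p`.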

Super quick frequency plots

Another great feature in the ggExtra package is the ability to plot super quick frequency plots without having to manipulate your dataframe first.

#when you have a lot of labels they can be hard to read on the x axis; coord_flip() swaps the axes
plotCount(table(ufo$shape)) +
  coord_flip()

plotCount(table(ufo$country)) +
  coord_flip()
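
To see what plotCount saves you, here is a rough sketch of the manual route in plain ggplot2: tabulate the values, turn the table into a data frame, and draw the bars yourself. The `shapes` vector is a made-up stand-in for ufo$shape, and the sorting step is an optional extra that is not part of the original plots.

```r
library(ggplot2)

# Made-up stand-in for ufo$shape
shapes <- c("light", "circle", "circle", "light", "sphere", "light")

# Tabulate and convert to a data frame with columns `shape` and `Freq`
counts <- as.data.frame(table(shape = shapes))

# Optional: order the bars by frequency instead of alphabetically
counts$shape <- reorder(counts$shape, counts$Freq)

p <- ggplot(counts, aes(x = shape, y = Freq)) +
  geom_col() +
  coord_flip()
p
```

For a first look, the one-line plotCount call is usually enough; the manual version is mainly useful once you want control over ordering or labels.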