Week 2 Lab

Fun times with ggplot

For those of you who want an extra resource when building your graphs, look at the ggplot2 cheatsheet. It’s extremely handy and a great reference for your own graphing.

ggplot2 cheatsheet

Hi, your friendly GE here, and we’re welcoming you to lab 2! There’s been a lot covered so far, so we’re going to go over some of what you learned before, plus add some new things you can do with the tidyverse package you worked with for the first problemset.

First things first, let’s load up some packages. We’re going to be working mostly with dplyr and ggplot2 today. We covered (a bit) last week how dplyr works, but ggplot is going to give us a ton of new tools for data visualization to work with.

#as before, we are going to load pacman to help us man(age) our pac(kages). If you still need this library, you can install it with install.packages("pacman")

library(pacman)

#p_load will load the following packages. Where are dplyr and ggplot?? Well, they are all included in the tidyverse package, so we don't need to load them specifically (though we can.) You may have a lot of code appear here: that's because R is installing these packages for you.

p_load(tidyverse, lubridate, ggridges, ggthemes, broom)

LESSON 1: Quick overview of dplyr functions and data work

First, let’s do a quick review of how dplyr and the tidyverse work with a bit of data manipulation. To really learn how to manipulate data, we need a dataset. Bundled with ggplot2 (and so loaded with the tidyverse) is a good dataset for learning called mpg. We’ll work with that. However, before we manipulate anything, we might want to learn a bit about what the dataset contains.

How do we go about getting information from a package? We can access the documentation using the ? operator. Let’s do that on the mpg dataset.

?mpg

We have another way of ‘previewing’ data that we addressed last class: the head() function. Let’s look at the first and last three rows of our df.

#first three
(head(mpg,3))
## # A tibble: 3 x 11
##   manufacturer model displ  year   cyl trans  drv     cty   hwy fl    class
##   <chr>        <chr> <dbl> <int> <int> <chr>  <chr> <int> <int> <chr> <chr>
## 1 audi         a4      1.8  1999     4 auto(… f        18    29 p     comp…
## 2 audi         a4      1.8  1999     4 manua… f        21    29 p     comp…
## 3 audi         a4      2    2008     4 manua… f        20    31 p     comp…
#last three
(tail(mpg,3))
## # A tibble: 3 x 11
##   manufacturer model  displ  year   cyl trans drv     cty   hwy fl    class
##   <chr>        <chr>  <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 volkswagen   passat   2.8  1999     6 auto… f        16    26 p     mids…
## 2 volkswagen   passat   2.8  1999     6 manu… f        18    26 p     mids…
## 3 volkswagen   passat   3.6  2008     6 auto… f        17    26 p     mids…

One thing to notice is that the data appears to be alphabetically sorted by manufacturer.

If all we want is the names of our variables, we can get those by using the function names()

names(mpg)
##  [1] "manufacturer" "model"        "displ"        "year"        
##  [5] "cyl"          "trans"        "drv"          "cty"         
##  [9] "hwy"          "fl"           "class"
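If you want the dimensions and the column types in one compact view, a small sketch like the one below may help. It assumes dplyr and ggplot2 are installed. glimpse() is a dplyr function that was not used earlier in this lab, so treat it as an optional extra.

```r
library(dplyr)    # glimpse()
library(ggplot2)  # the mpg dataset

# number of rows and columns
mpg_dim <- dim(mpg)
mpg_dim

# one line per column: name, type, and the first few values
glimpse(mpg)
```

This is handy for wide datasets where head() cuts off columns.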

Ok, we’ve had a good look at the data now and I think we’re ready to get started on the analysis. Let’s set a goal of what question we want to answer that doesn’t involve a regression.

Maybe we are interested in getting the data for the median number of cylinders for each different drive type. Let’s try to get there, but learn each tool on the way.

In the tidyverse, we get access to a handy tool in the form of the pipe, or this guy: %>% which lets us transform/manipulate data or other objects in steps.

Let’s start with a basic example: let’s find the mean cylinder count of the cars in our dataset.

#with pipes
mpg %>% summarise(mean_cyl = mean(cyl))
## # A tibble: 1 x 1
##   mean_cyl
##      <dbl>
## 1     5.89

We’re passing in our dataset and telling R to create a summary statistic: the mean number of cylinders. We could do this without the pipe like this:

#no pipes
mean(mpg$cyl)
## [1] 5.888889

but piping the data allows our function to live in a world where the dataset is just part of its environment - meaning it is better at figuring out what you’re telling it

(e.g., notice how I had to tell R in the second block that cyl came from the mpg dataset, whereas in the first block R assumed it came from the mpg dataset.)
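To make the pipe a little less magical, here is a minimal sketch of what it does: the thing on the left becomes the first argument of the function on the right. The object names unpiped and piped are made up for this illustration.

```r
library(dplyr)
library(ggplot2)  # the mpg dataset

# without the pipe: the data frame is the first argument
unpiped <- summarise(mpg, mean_cyl = mean(cyl))

# with the pipe: the left-hand side is passed in as that first argument
piped <- mpg %>% summarise(mean_cyl = mean(cyl))

# both calls should produce the same one-row tibble
all.equal(unpiped, piped)
```

The payoff shows up once you chain several steps, because the piped version reads top to bottom instead of inside out.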

Now let’s try to do something closer to our original question.

What we need to do here is perform a series of transformations on our dataframe to produce summary statistics on each drv type. You could do this by creating three separate datasets, but that seems clunky. R has a really clean way of doing this using two commands in sequence.

The nicest way I know of to do this analysis is to use group_by followed by summarise. These commands allow us to perform various summary operations on groups of data defined by a variable we choose.

Let’s see how this works.

#think of this as your steps: pass in data -> group data by drive type -> summarise by finding median for each group (in our case, drive type).

mpg %>% group_by(drv) %>% 
  summarise(med_cyl = median(cyl))
## # A tibble: 3 x 2
##   drv   med_cyl
##   <chr>   <dbl>
## 1 4           6
## 2 f           4
## 3 r           8

We can see that we have a median number of cylinders for each drive type, with rear wheel drive having the largest number.

Maybe we don’t care about 4 wheel drive, and we simply want to look at front wheel drive cars. How would we do that?
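summarise can compute several statistics per group in one call. The sketch below is one way that might look. The column names n_cars, med_cyl and mean_cty are made up for this example, and n() simply counts the rows in each group.

```r
library(dplyr)
library(ggplot2)  # the mpg dataset

drv_summary <- mpg %>%
  group_by(drv) %>%
  summarise(
    n_cars   = n(),          # how many cars fall in each drive type
    med_cyl  = median(cyl),  # median cylinder count per drive type
    mean_cty = mean(cty)     # mean city mpg per drive type
  )

drv_summary
```

A quick sanity check is that the group counts add back up to the number of rows in mpg.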

We can use the filter() command. Filter commands use what is called a ‘logical’ expression: something that is either true or false. For example, 10 > 6 returns TRUE, 10 < 6 returns FALSE, and 10 == 10 (two equals signs to check equality) returns TRUE.

Let’s filter to front wheel drive cars (drv == 'f') and then summarise to find the mean city miles per gallon.
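Here is a quick base R sketch of logical expressions, including what happens when you compare a whole column at once (which is what filter() relies on). The vector drv_example is made up for this illustration.

```r
# single comparisons return a single TRUE or FALSE
10 > 6     # TRUE
10 < 6     # FALSE
10 == 10   # TRUE

# comparing a vector returns one TRUE/FALSE per element
drv_example <- c("f", "4", "r", "f")   # made-up values for illustration
is_front <- drv_example == "f"
is_front   # TRUE FALSE FALSE TRUE

# keep only the elements where the test is TRUE
drv_example[is_front]
```

filter() does the same thing row by row: it keeps the rows where the expression evaluates to TRUE.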

#Following our same logic above, but using a filter command...

mpg %>% filter(drv == 'f') %>% summarise(mean = mean(cty))
## # A tibble: 1 x 1
##    mean
##   <dbl>
## 1  20.0

Maybe we want to see the city gas mileage, but we care about everything BUT front wheel drive. How would we do that? We can use “not equals” or !=

#using 'not equals'

mpg %>% filter(drv != 'f') %>% group_by(drv) %>% summarise(median = median(cty, na.rm = TRUE))
## # A tibble: 2 x 2
##   drv   median
##   <chr>  <int>
## 1 4         14
## 2 r         15

We can utilize pipes with regression (using the function lm) as well, doing your filtering, data cleaning and analysis in a single pipeline.

#note: lm() expects a formula as its first argument, so the pipe can't drop the data there. Setting data = . tells lm() where the piped data should go

#let's find all of the variables (so not intercepts) in this regression that we'd typically classify as statistically significant
mpg %>% filter(drv == 'f') %>% lm(hwy ~ class + displ, data = .) %>% tidy() %>% filter(p.value < .05) %>% filter(term != "(Intercept)")
## # A tibble: 2 x 5
##   term         estimate std.error statistic    p.value
##   <chr>           <dbl>     <dbl>     <dbl>      <dbl>
## 1 classminivan    -3.63     1.30      -2.80 0.00618   
## 2 displ           -2.84     0.587     -4.84 0.00000471

Neat! Definitely useful for quickly doing analysis
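If the one-liner above feels dense, it may help to see the same steps written out one at a time. This is just a sketch of the same analysis. The intermediate names front_wheel, fit, coefs and significant are made up here.

```r
library(dplyr)
library(ggplot2)  # the mpg dataset
library(broom)    # tidy()

# step 1: keep only front wheel drive cars
front_wheel <- filter(mpg, drv == "f")

# step 2: fit the regression on that subset
fit <- lm(hwy ~ class + displ, data = front_wheel)

# step 3: turn the coefficient table into a data frame
coefs <- tidy(fit)

# step 4: drop the intercept and keep terms with p-values below 0.05
significant <- coefs %>%
  filter(term != "(Intercept)", p.value < 0.05)

significant
```

Saving the intermediate objects is often easier to debug. The piped version is nicer once you trust each step.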

LESSON 2: Graphing with GGplot

Before we get into ggplot, let’s talk about plot(), which is R’s base plotting function. It does a fine job of plotting things, and is quite powerful, but ggplot is more user friendly and is definitely a resume booster if you can use it well. Many companies, researchers and journalists use ggplot to produce graphs for wide audiences. The best part? It’s totally free to use.

The fastest way to make a simple plot without learning the full ggplot syntax is the function qplot, which stands for “quick plot”. It behaves a lot like plot(), but it actually comes from ggplot2. (Recent ggplot2 releases mark qplot as deprecated, so you may see a warning.)

#ggplot2's "qplot" makes quick plots
qplot(x =displ, y = cty, data = mpg)

But we can do better than that with ggplot2! First, though, we need to understand how to communicate with ggplot so it can do what we want it to.

There are multiple ways to graph with ggplot, but the basic components we need are the data, an aesthetic mapping (what data goes where), and a geom (how should I draw the data). Together they take on the following generic format:

ggplot(data = <DATASET>) + <GEOM_FUNCTION>(mapping = aes(<AESTHETIC MAPPINGS>))

Let’s build a plot in steps

Try this:

ggplot(data = mpg)

We get a blank box. That’s because ggplot has some data, but no indication as to what goes on which axis, nor any idea of how to draw that data.

Let’s add the mapping.

#mappings are fairly straightforward: you tell R what goes on which axis and then R will figure out the rest.
ggplot(data = mpg, aes( x = displ, y = hwy))

Ok, now we have axes but nothing on the graph. We’ve told R what our axes are, but we haven’t told R how or what to draw. We can do that using a geom.

Let’s recreate the scatter plot with geom_point. We can do this by simply doing a + which adds a layer to our graph and then the function geom_point() and since R already has an X and Y mapping, it can figure out the rest.

ggplot(data = mpg, aes(x = displ, y = hwy)) + geom_point()

Cool, that looks a lot like the plot we made with qplot a second ago (this time with hwy instead of cty on the y axis), and it took a lot more work. What’s the big deal?

Let’s make this graph a little prettier using themes and colors.

Maybe we’re interested in visualizing different classes of cars. We can color our points by class and make it pretty with the theme_minimal() layer.

#We can add professional themes to our graphs by simply adding them as a layer! We can also map our aesthetics to a third dimension, color, by passing it a variable name
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class)) + theme_minimal()
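One thing that trips people up is the difference between mapping a color to a variable and setting a single color for every point. The sketch below shows both. The names p_mapped and p_fixed and the color "steelblue" are just choices made for this example.

```r
library(ggplot2)

# color INSIDE aes(): ggplot picks one color per class and draws a legend
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

# color OUTSIDE aes(): every point gets the same fixed color, no legend
p_fixed <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "steelblue")

p_mapped
p_fixed
```

If you write color = "steelblue" inside aes() by mistake, ggplot treats the text as a one-level variable, so the points will probably not come out in the color you asked for.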

Maybe you prefer to map your dots to size for a more color-blind friendly graph (although the viridis package provides palettes that are explicitly designed to be friendly to color-blind readers).

# can also size by variables, and use fivethirtyeight's theme! R will yell at me for using size on a discrete variable, but it's just cranky.
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, size = class)) + theme_fivethirtyeight()
## Warning: Using size for a discrete variable is not advised.

We can also split a dataset up into a bunch of mini-graphs and plot them in a group using facet_wrap()

#facet wrapping gives you grids
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = drv)) + 
  facet_wrap(~ class, nrow = 2)

We can also make NON-dot graphs by using a new geom. New geoms offer new ways to differentiate data; for instance, line graphs can have multiple types of line.

We’ll use geom_smooth(), which fits a smooth line and draws a confidence band around said line.

ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv, color = drv))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Maybe you just can’t choose what kind of graph you want. That’s OK! You can do a bunch all at once by just adding more layers.

#can combine different types of geoms
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  geom_smooth(mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
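Notice that the same aes() was typed twice above. A common shortcut, sketched below, is to put the shared mapping in ggplot() so every layer inherits it, and then add layer-specific mappings only where needed. The name layered_plot is made up for this example.

```r
library(ggplot2)

layered_plot <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = drv)) +  # color applies to the points only
  geom_smooth()                             # inherits x and y, fits one line overall

layered_plot
```

Because color is mapped only in geom_point(), the smooth line is fit to all cars together rather than once per drive type.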

Boxplots are also fairly easy to create!

#boxplots
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot() +
  coord_flip()

Now that we have some basics down, we can make some plots with cool, relatively new datasets!

Tidy Tuesday is an excellent resource for R and GGplot users, both beginners and advanced. Every Tuesday, an easily accessible dataset is posted online for you to play around with to improve your skills.

You know what’s cool about this? We don’t even need to download the data to a csv format. We can just grab it directly from the repo in one line, that THEY wrote FOR you. You guys have the tidy tuesday files for this lab in a nice clean format posted to your course lab page, but I want to show you how to do this if you didn’t want to download them to your computer.

I am literally copying these lines from:

bob_ross

ufo

and video_games

Let’s read this in…

#borrowed from the ufo repo
ufo_sightings <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-06-25/ufo_sightings.csv")
## Parsed with column specification:
## cols(
##   date_time = col_character(),
##   city_area = col_character(),
##   state = col_character(),
##   country = col_character(),
##   ufo_shape = col_character(),
##   encounter_length = col_double(),
##   described_encounter_length = col_character(),
##   description = col_character(),
##   date_documented = col_character(),
##   latitude = col_double(),
##   longitude = col_double()
## )
#borrowed from the video_games repo
video_games <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-07-30/video_games.csv")
## Parsed with column specification:
## cols(
##   number = col_double(),
##   game = col_character(),
##   release_date = col_character(),
##   price = col_double(),
##   owners = col_character(),
##   developer = col_character(),
##   publisher = col_character(),
##   average_playtime = col_double(),
##   median_playtime = col_double(),
##   metascore = col_double()
## )
#borrowed from the bob_ross repo
bob_ross <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-08-06/bob-ross.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   EPISODE = col_character(),
##   TITLE = col_character()
## )
## See spec(...) for full column specifications.

However, the issue is that we need to use both of the tools we learned about today to:

- Get these datasets into a format we can work with
- Visualize them in a way that’s meaningful

Let’s start with the bob ross one. This dataset is MUCH nicer in the canvas version, but I will show you how that data was produced. Just to give you an idea of what the data looks like:

# Explore the dataset a bit first
head(bob_ross, 4)
## # A tibble: 4 x 69
##   EPISODE TITLE APPLE_FRAME AURORA_BOREALIS  BARN BEACH  BOAT BRIDGE
##   <chr>   <chr>       <dbl>           <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 S01E01  "\"A…           0               0     0     0     0      0
## 2 S01E02  "\"M…           0               0     0     0     0      0
## 3 S01E03  "\"E…           0               0     0     0     0      0
## 4 S01E04  "\"W…           0               0     0     0     0      0
## # … with 61 more variables: BUILDING <dbl>, BUSHES <dbl>, CABIN <dbl>,
## #   CACTUS <dbl>, CIRCLE_FRAME <dbl>, CIRRUS <dbl>, CLIFF <dbl>,
## #   CLOUDS <dbl>, CONIFER <dbl>, CUMULUS <dbl>, DECIDUOUS <dbl>,
## #   DIANE_ANDRE <dbl>, DOCK <dbl>, DOUBLE_OVAL_FRAME <dbl>, FARM <dbl>,
## #   FENCE <dbl>, FIRE <dbl>, FLORIDA_FRAME <dbl>, FLOWERS <dbl>,
## #   FOG <dbl>, FRAMED <dbl>, GRASS <dbl>, GUEST <dbl>,
## #   HALF_CIRCLE_FRAME <dbl>, HALF_OVAL_FRAME <dbl>, HILLS <dbl>,
## #   LAKE <dbl>, LAKES <dbl>, LIGHTHOUSE <dbl>, MILL <dbl>, MOON <dbl>,
## #   MOUNTAIN <dbl>, MOUNTAINS <dbl>, NIGHT <dbl>, OCEAN <dbl>,
## #   OVAL_FRAME <dbl>, PALM_TREES <dbl>, PATH <dbl>, PERSON <dbl>,
## #   PORTRAIT <dbl>, RECTANGLE_3D_FRAME <dbl>, RECTANGULAR_FRAME <dbl>,
## #   RIVER <dbl>, ROCKS <dbl>, SEASHELL_FRAME <dbl>, SNOW <dbl>,
## #   SNOWY_MOUNTAIN <dbl>, SPLIT_FRAME <dbl>, STEVE_ROSS <dbl>,
## #   STRUCTURE <dbl>, SUN <dbl>, TOMB_FRAME <dbl>, TREE <dbl>, TREES <dbl>,
## #   TRIPLE_FRAME <dbl>, WATERFALL <dbl>, WAVES <dbl>, WINDMILL <dbl>,
## #   WINDOW_FRAME <dbl>, WINTER <dbl>, WOOD_FRAMED <dbl>
names(bob_ross)
##  [1] "EPISODE"            "TITLE"              "APPLE_FRAME"       
##  [4] "AURORA_BOREALIS"    "BARN"               "BEACH"             
##  [7] "BOAT"               "BRIDGE"             "BUILDING"          
## [10] "BUSHES"             "CABIN"              "CACTUS"            
## [13] "CIRCLE_FRAME"       "CIRRUS"             "CLIFF"             
## [16] "CLOUDS"             "CONIFER"            "CUMULUS"           
## [19] "DECIDUOUS"          "DIANE_ANDRE"        "DOCK"              
## [22] "DOUBLE_OVAL_FRAME"  "FARM"               "FENCE"             
## [25] "FIRE"               "FLORIDA_FRAME"      "FLOWERS"           
## [28] "FOG"                "FRAMED"             "GRASS"             
## [31] "GUEST"              "HALF_CIRCLE_FRAME"  "HALF_OVAL_FRAME"   
## [34] "HILLS"              "LAKE"               "LAKES"             
## [37] "LIGHTHOUSE"         "MILL"               "MOON"              
## [40] "MOUNTAIN"           "MOUNTAINS"          "NIGHT"             
## [43] "OCEAN"              "OVAL_FRAME"         "PALM_TREES"        
## [46] "PATH"               "PERSON"             "PORTRAIT"          
## [49] "RECTANGLE_3D_FRAME" "RECTANGULAR_FRAME"  "RIVER"             
## [52] "ROCKS"              "SEASHELL_FRAME"     "SNOW"              
## [55] "SNOWY_MOUNTAIN"     "SPLIT_FRAME"        "STEVE_ROSS"        
## [58] "STRUCTURE"          "SUN"                "TOMB_FRAME"        
## [61] "TREE"               "TREES"              "TRIPLE_FRAME"      
## [64] "WATERFALL"          "WAVES"              "WINDMILL"          
## [67] "WINDOW_FRAME"       "WINTER"             "WOOD_FRAMED"

Looks like we have a ton of dummy variables for whether or not items were present in a painting. To get to what you had in your course lab page, this is what needed to be done:

#first, copy this very gross-looking code to make Fivethirtyeight's dataset a little easier to work with:

bob_ross_clean <- bob_ross %>% janitor::clean_names() %>%
  gather(element, present, -episode, -title) %>%
  filter(present == 1) %>%
  mutate(title = str_to_title(str_remove_all(title, '"')),
         element = str_to_title(str_replace_all(element, "_", " "))) %>%
  dplyr::select(-present) %>%
  separate(episode, into = c("season", "episode"), sep = "E") %>% 
  mutate(season = str_extract(season, "[:digit:]+")) %>% 
  mutate_at(vars(season, episode), as.integer) %>%
  arrange(season, episode)

Lets see what this turned our dataframe into:

head(bob_ross_clean)
## # A tibble: 6 x 4
##   season episode title               element  
##    <int>   <int> <chr>               <chr>    
## 1      1       1 A Walk In The Woods Bushes   
## 2      1       1 A Walk In The Woods Deciduous
## 3      1       1 A Walk In The Woods Grass    
## 4      1       1 A Walk In The Woods River    
## 5      1       1 A Walk In The Woods Tree     
## 6      1       1 A Walk In The Woods Trees
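The key move in that cleaning code is reshaping from wide (one dummy column per element) to long (one row per painting-element pair). Here is a toy sketch of just that step. The data frame paintings is made up, and it uses pivot_longer(), the newer tidyr replacement for gather(), rather than the exact call above.

```r
library(dplyr)
library(tidyr)

# made-up wide data: one row per painting, one 0/1 column per element
paintings <- data.frame(
  title = c("A", "B"),
  tree  = c(1, 0),
  lake  = c(1, 1)
)

long <- paintings %>%
  pivot_longer(cols = -title,           # reshape every column except title
               names_to = "element",    # old column names go here
               values_to = "present") %>%  # old 0/1 values go here
  filter(present == 1) %>%              # keep only elements that appear
  select(-present)                      # the flag is no longer needed

long
```

Painting A keeps two rows (tree and lake) and painting B keeps one (lake), which is the same shape as bob_ross_clean.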

Wow. That’s nice. Let’s build a dataset that COUNTS what elements were in each Bob Ross painting!

# Top 10 items featured in Bob Ross paintings
counts <- bob_ross_clean %>% count(element, sort = TRUE) %>% 
  arrange(desc(n))

head(counts, 10)
## # A tibble: 10 x 2
##    element       n
##    <chr>     <int>
##  1 Tree        361
##  2 Trees       337
##  3 Deciduous   227
##  4 Conifer     212
##  5 Clouds      179
##  6 Mountain    160
##  7 Lake        143
##  8 Grass       142
##  9 River       126
## 10 Bushes      120

Let’s plot the top 18 items in Bob Ross paintings. This time, we’re going to assign our ggplot object to a name, the imaginatively named plot1.

plot1 <- counts %>% head(18) %>%
  ggplot(aes(element, n)) + geom_col() + coord_flip()

Nothing happens. Well, something happened, but we need to call the plot’s name to show it.

#you can call object names by typing them out
plot1

Well that’s pretty cool. Lots of trees. And different TYPES of trees. In honor of the man, let’s make this more colorful and add some better axis labels. R has a TON of different colors in its palette: you can go here to see all of them (or run colors() in the console to list their names).

#make it pretty by building an object with the deepskyblue color label
my_happy_little_palette = 'deepskyblue'

plot1 <- counts %>% head(15) %>%
  ggplot(aes(element, n)) + geom_col(fill = my_happy_little_palette) + coord_flip() +
  theme_minimal() + 
  labs(title = "Most Popular Items in Bob Ross Paintings",
       subtitle = "How many beautiful little trees?",
       # labs() follows the aesthetics, not the flipped axes: x is still element, y is still n
       x = "Item",
       y = "Number")

#display our new improved plot
plot1
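You may notice the bars come out in alphabetical order rather than by size. One possible fix, sketched below, is to reorder the element variable by its count inside aes(). The small data frame counts_demo just retypes the top five rows printed earlier so the example runs on its own, and sorted_plot is a made-up name.

```r
library(ggplot2)

# top five counts copied from the output above
counts_demo <- data.frame(
  element = c("Tree", "Trees", "Deciduous", "Conifer", "Clouds"),
  n       = c(361, 337, 227, 212, 179)
)

# reorder() turns element into a factor whose levels are sorted by n
sorted_levels <- levels(reorder(counts_demo$element, counts_demo$n))

sorted_plot <- ggplot(counts_demo, aes(x = reorder(element, n), y = n)) +
  geom_col(fill = "deepskyblue") +
  coord_flip() +
  theme_minimal() +
  labs(x = "Item", y = "Number")

sorted_plot
```

After coord_flip(), the largest count ends up at the top of the chart.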

Ok, that was fun. Now let’s play around with the video games data we loaded (if you were coding along, it should be in a df called video_games).

#look at it first
head(video_games, 10)
## # A tibble: 10 x 10
##    number game  release_date price owners developer publisher
##     <dbl> <chr> <chr>        <dbl> <chr>  <chr>     <chr>    
##  1      1 Half… Nov 16, 2004  9.99 10,00… Valve     Valve    
##  2      3 Coun… Nov 1, 2004   9.99 10,00… Valve     Valve    
##  3     21 Coun… Mar 1, 2004   9.99 10,00… Valve     Valve    
##  4     47 Half… Nov 1, 2004   4.99 5,000… Valve     Valve    
##  5     36 Half… Jun 1, 2004   9.99 2,000… Valve     Valve    
##  6     52 CS2D  Dec 24, 2004 NA    1,000… Unreal S… Unreal S…
##  7      2 Unre… Mar 16, 2004 15.0  500,0… Epic Gam… Epic Gam…
##  8      4 DOOM… Aug 3, 2004   4.99 500,0… id Softw… id Softw…
##  9     14 Beyo… Apr 27, 2004  5.99 500,0… Larian S… Larian S…
## 10     40 Hitm… Apr 20, 2004  8.99 500,0… Io-Inter… Io-Inter…
## # … with 3 more variables: average_playtime <dbl>, median_playtime <dbl>,
## #   metascore <dbl>
names(video_games)
##  [1] "number"           "game"             "release_date"    
##  [4] "price"            "owners"           "developer"       
##  [7] "publisher"        "average_playtime" "median_playtime" 
## [10] "metascore"

Neat. Only issue I see here is that the price variable appears to be potentially incomplete (those NA values mean it was missing). Let’s create a scatterplot of playtime and ratings

#make a scatterplot of average playtime and ratings
video_games %>% ggplot(aes(x = average_playtime, y = metascore)) + geom_point()
## Warning: Removed 23840 rows containing missing values (geom_point).

Oof, that’s really hard to read. Luckily, ggplot lets us “censor” our points by providing cutoffs for our x or y axis. Points outside the cutoffs are dropped, which is why the warning count goes up. Here’s how to do that:

#lets also add a mapping from owner count (these are discrete buckets) to colors and some new labels
video_games  %>% ggplot(aes(x = average_playtime, y = metascore, color = owners)) + 
  geom_point() +
  xlim(100, 6000) + theme_minimal() + 
  labs(x = "Average Playtime",
       y = "Metascore",
       title = "Average Playtime and Metascore by Ownership")
## Warning: Removed 26482 rows containing missing values (geom_point).

Cool! We can also zoom in on that big chunk of data if we want to check for any obscured trends.

video_games  %>% ggplot(aes(x = average_playtime, y = metascore, color = owners)) + 
  geom_point() +
  xlim(100, 1000) + ylim(60,90) + theme_minimal() + 
  labs(x = "Average Playtime",
       y = "Metascore",
       title = "Average Playtime and Metascore by Ownership")
## Warning: Removed 26528 rows containing missing values (geom_point).

Kinda cool, huh?
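Keep in mind that xlim() and ylim() throw away the points outside the limits. If you only want to zoom the view without dropping data, coord_cartesian() is an alternative worth knowing. The sketch below uses a tiny made-up data frame called demo to show both.

```r
library(ggplot2)

# made-up data with one point far out on the x axis
demo <- data.frame(x = c(1, 5, 50, 500), y = c(2, 4, 6, 8))

# xlim(): the point at x = 500 is removed (expect a warning when drawn)
p_censor <- ggplot(demo, aes(x = x, y = y)) +
  geom_point() +
  xlim(0, 100)

# coord_cartesian(): all points are kept, the view is simply zoomed
p_zoom <- ggplot(demo, aes(x = x, y = y)) +
  geom_point() +
  coord_cartesian(xlim = c(0, 100))

p_censor
p_zoom
```

For a plain scatterplot the two look the same. The difference matters for things like geom_smooth(), where dropped points change the fitted line.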

Let’s use the tidyverse to find out what it would cost to buy every game in each publisher’s library (the code below calls this revenue). Can you figure out what each part of the code below does?

game_publishers <- video_games %>% 
  group_by(publisher) %>% summarize(revenue = sum(price, na.rm = T)) %>% ungroup() %>%
  filter(publisher != "") %>% 
  arrange(desc(revenue))
head(game_publishers)
## # A tibble: 6 x 2
##   publisher                  revenue
##   <chr>                        <dbl>
## 1 Big Fish Games               2530.
## 2 KOEI TECMO GAMES CO., LTD.   2397.
## 3 Ubisoft                      2337.
## 4 Slitherine Ltd.              2135.
## 5 MAGIX Software GmbH          1826.
## 6 BANDAI NAMCO Entertainment   1583.

Let’s do a histogram this time. What is the Y axis on a histogram, by the way?

#histogram
ggplot(data = game_publishers) + geom_histogram(aes(x = revenue))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#we can also use the tidyverse functions right inside of the ggplot function
ggplot(data = game_publishers %>% filter(revenue > 40)) + geom_histogram(aes(x = revenue)) +
  theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
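That stat_bin() message is ggplot nudging you to choose the bins yourself. Below is a small sketch using the mpg data from earlier so it runs on its own. The binwidth of 2 is an arbitrary choice for illustration, and layer_data() is used here just to peek at the counts behind the bars.

```r
library(ggplot2)

hwy_hist <- ggplot(data = mpg, mapping = aes(x = hwy)) +
  geom_histogram(binwidth = 2) +   # each bar covers 2 mpg
  labs(x = "Highway mpg", y = "Number of cars")

hwy_hist

# the y axis is a count: the bar heights add up to the number of rows
bin_counts <- layer_data(hwy_hist)$count
sum(bin_counts)
```

Try a few binwidth values: too wide hides the shape, too narrow makes it noisy.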

Cool, moving on to some UFO sightings. Let’s see what we can learn from this dataset. First, let’s poke at it a bit.

#check out the dataset
head(ufo_sightings, 5)
## # A tibble: 5 x 11
##   date_time city_area state country ufo_shape encounter_length
##   <chr>     <chr>     <chr> <chr>   <chr>                <dbl>
## 1 10/10/19… san marc… tx    us      cylinder              2700
## 2 10/10/19… lackland… tx    <NA>    light                 7200
## 3 10/10/19… chester … <NA>  gb      circle                  20
## 4 10/10/19… edna      tx    us      circle                  20
## 5 10/10/19… kaneohe   hi    us      light                  900
## # … with 5 more variables: described_encounter_length <chr>,
## #   description <chr>, date_documented <chr>, latitude <dbl>,
## #   longitude <dbl>
names(ufo_sightings)
##  [1] "date_time"                  "city_area"                 
##  [3] "state"                      "country"                   
##  [5] "ufo_shape"                  "encounter_length"          
##  [7] "described_encounter_length" "description"               
##  [9] "date_documented"            "latitude"                  
## [11] "longitude"

Let’s take a closer look at this date column:

#what type of thing is this object anyhow? we can use class()
class(ufo_sightings$date_time)
## [1] "character"

Okay, it’s a character. But wouldn’t it be nice if we could use these dates/times in our plots? Like for a time series? We can.

Let’s convert it using the lubridate package, and get rid of any NA values for country. Lubridate will let us transform a string like this into a different kind of object that lets us access all sorts of different slices of information.

ufo <- ufo_sightings %>% 
  mutate(date_time = parse_date_time(date_time, 'mdy_HM')) %>%
  filter(!is.na(country))
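To see what the conversion buys us, here is a minimal sketch on a single made-up timestamp. The string in example_stamp is hypothetical (month/day/year then hour:minute), not a row from the dataset.

```r
library(lubridate)

example_stamp <- "6/25/2019 21:30"   # hypothetical timestamp for illustration

# "mdy HM" says: month, day, year, then hour and minute
parsed <- parse_date_time(example_stamp, "mdy HM")

class(parsed)    # no longer a character
month(parsed)    # 6
hour(parsed)     # 21
year(parsed)     # 2019
```

Once the column is a date-time, helpers like month(), hour() and year() pull out whichever slice you want to plot.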

Let’s look at what month these UFO sightings are happening in and break it down by country.

ufo %>% ggplot(aes(x = month(date_time), y = country, fill = country)) + geom_density_ridges() +
  theme_minimal()
## Picking joint bandwidth of 0.685

It looks like aliens prefer the late summer! (Think about when summer is in AU.)

I wonder what time of day these sightings are happening?

ufo %>% ggplot(aes(x = hour(date_time), y = country, fill = country)) + geom_density_ridges() +
  theme_minimal()
## Picking joint bandwidth of 1.78

Mostly at night! That makes sense!

Now let’s look at total ufo sightings per year

ufo_total <- ufo %>% group_by(year(date_time)) %>% summarize(total = n())

names(ufo_total) <- c("year", "total")

ggplot(aes(x = year, y = total), data = ufo_total) + geom_line() + 
  labs(x = "Year",
       y = "UFO Sightings",
       title = "Total Recorded UFO Sightings") +
  theme_linedraw()

Awesome! We seem to be getting a big spike in alien sightings sometime in the 90s and mid 2000s.
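As a small aside, you can skip the names() step by naming the grouping variable right inside group_by(). The sketch below uses a made-up three-row data frame called sightings_demo so it runs without the download.

```r
library(dplyr)
library(lubridate)

# made-up sightings for illustration
sightings_demo <- data.frame(
  date_time = ymd_hm(c("1995-07-04 22:00", "1995-08-12 21:15", "2004-06-30 23:45"))
)

yearly <- sightings_demo %>%
  group_by(year = year(date_time)) %>%  # name the new grouping column directly
  summarise(total = n())                # count sightings per year

yearly
names(yearly)
```

The result already has the columns year and total, so it can go straight into ggplot.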

And there you have it: you have a ton of tools to use with ggplot now. When you come back next week, we’ll do a homework help session and teach you some cool approaches to solve those problems.
