The Grammar of Data Science

What is Data Science

Three Core Steps:

  • Collect the data
  • Tidy and analyze the data
  • Compose thoughts and communicate results of analysis
  • May need to repeat this process or go back to previous steps.
  • These steps can be broken down further. This lecture focuses on the analysis steps.

Four Main Analysis Steps:

  • Tidy the data (before analysis)
  • Transform the data
  • Visualize the data
    • Shows things you may not expect
    • Allows you to refine your data questions.
    • Doesn’t scale well since a human has to look at every visualization.
  • Model the data
    • Scales better (easier to throw more PCs at a problem than brains)
    • Drawback: Models make assumptions.

This lecture will look at dplyr (Transform) and ggvis (Visualize)

Tidy Data

Data that is tidy is: * Easy to transform, visualize, and model * Variables are stored in a consistent way – always as columns * Tidyr provides tools to tidy messy data (incl. gather, spread, and separate) * You can find more information about this package on Google

dplyr Package

This package tries to tackle three main bottlenecks in data manipulation: * Cognative - Think about what data manipulation you should be doing Describe it - in the way a PC can understand Do It - Computational * These are the venn diagram slide.

Thinking

dplyr constrains what you can do to manipulate the data.

Which of these five functions should you use?

Functions:

  • Filter - Keep rows matching criteria
  • Select - Pick columns by name
  • Arrange - Allows you to reorder the rows.
  • Mutate - adds new variables
  • Summarize - Reduce variables to values

It is also important to make note of the group_by operator. These satisfy most of your needs.

Test with nycflights13

This contains four data frames:

  • flights - every commercial flight departing NYC in 2013
  • weather - hourly weather data
  • planes - plane metadata
  • airports - airport metadata
flights
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
filter(flights, dest == "IAH") # Take flights and filter on destination airport "IAH"
## # A tibble: 7,198 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      623            627        -4      933            932
##  4  2013     1     1      728            732        -4     1041           1038
##  5  2013     1     1      739            739         0     1104           1038
##  6  2013     1     1      908            908         0     1228           1219
##  7  2013     1     1     1028           1026         2     1350           1339
##  8  2013     1     1     1044           1045        -1     1352           1351
##  9  2013     1     1     1114            900       134     1447           1222
## 10  2013     1     1     1205           1200         5     1503           1505
## # ℹ 7,188 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
select(flights, year: day, carrier, tailnum) # Select all variables in list after colon. 
## # A tibble: 336,776 × 5
##     year month   day carrier tailnum
##    <int> <int> <int> <chr>   <chr>  
##  1  2013     1     1 UA      N14228 
##  2  2013     1     1 UA      N24211 
##  3  2013     1     1 AA      N619AA 
##  4  2013     1     1 B6      N804JB 
##  5  2013     1     1 DL      N668DN 
##  6  2013     1     1 UA      N39463 
##  7  2013     1     1 B6      N516JB 
##  8  2013     1     1 EV      N829AS 
##  9  2013     1     1 B6      N593JB 
## 10  2013     1     1 AA      N3ALAA 
## # ℹ 336,766 more rows
select(flights, -(year:day)) # can also filter on negative values (include everything except these)
## # A tibble: 336,776 × 16
##    dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
##       <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
##  1      517            515         2      830            819        11 UA     
##  2      533            529         4      850            830        20 UA     
##  3      542            540         2      923            850        33 AA     
##  4      544            545        -1     1004           1022       -18 B6     
##  5      554            600        -6      812            837       -25 DL     
##  6      554            558        -4      740            728        12 UA     
##  7      555            600        -5      913            854        19 B6     
##  8      557            600        -3      709            723       -14 EV     
##  9      557            600        -3      838            846        -8 B6     
## 10      558            600        -2      753            745         8 AA     
## # ℹ 336,766 more rows
## # ℹ 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
select(flights, starts_with("arr")) #can use other functions such as text functions
## # A tibble: 336,776 × 2
##    arr_time arr_delay
##       <int>     <dbl>
##  1      830        11
##  2      850        20
##  3      923        33
##  4     1004       -18
##  5      812       -25
##  6      740        12
##  7      913        19
##  8      709       -14
##  9      838        -8
## 10      753         8
## # ℹ 336,766 more rows
arrange(flights, desc(arr_delay)) # order by arrival delay descending
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     9      641            900      1301     1242           1530
##  2  2013     6    15     1432           1935      1137     1607           2120
##  3  2013     1    10     1121           1635      1126     1239           1810
##  4  2013     9    20     1139           1845      1014     1457           2210
##  5  2013     7    22      845           1600      1005     1044           1815
##  6  2013     4    10     1100           1900       960     1342           2211
##  7  2013     3    17     2321            810       911      135           1020
##  8  2013     7    22     2257            759       898      121           1026
##  9  2013    12     5      756           1700       896     1058           2020
## 10  2013     5     3     1133           2055       878     1250           2215
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
mutate(flights, speed = distance / air_time * 60) # This adds a new calculated column
## # A tibble: 336,776 × 20
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 336,766 more rows
## # ℹ 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>, speed <dbl>
#summarize() - need to group by first to use this
by_day <- group_by(flights, year, month, day)
summarize(by_day, delay = mean(dep_delay, na.rm = TRUE)) # Avg of departure delays
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.
## # A tibble: 365 × 4
## # Groups:   year, month [12]
##     year month   day delay
##    <int> <int> <int> <dbl>
##  1  2013     1     1 11.5 
##  2  2013     1     2 13.9 
##  3  2013     1     3 11.0 
##  4  2013     1     4  8.95
##  5  2013     1     5  5.73
##  6  2013     1     6  7.15
##  7  2013     1     7  5.42
##  8  2013     1     8  2.55
##  9  2013     1     9  2.28
## 10  2013     1    10  2.84
## # ℹ 355 more rows

Pipelines

These allow you to string along functions rather than nest them. This allows the code to be more readable.

varX %>% (then) f(y) -> f(x,y)

Below is an example of the code nested then using pipelines

hourly_delay <- filter(
  summarize(
    group_by(
      filter(
        flights,
        !is.na(dep_delay)
      ),
      time_hour, hour
    ),
    delay = mean(dep_delay),
    n = n()
  ),
  n > 10
)
## `summarise()` has grouped output by 'time_hour'. You can override using the
## `.groups` argument.
hourly_delay1 <- flights %>%
  filter(!is.na(dep_delay)) %>%
  group_by(time_hour, hour) %>%
  summarize(
    delay = mean(dep_delay),
    n = n()) %>%
  filter(n > 10)
## `summarise()` has grouped output by 'time_hour'. You can override using the
## `.groups` argument.

dplyr can work with remote data sources, not just local data.

  • Supports most databases Can write ordinary r code like above, but will send sql query to a database. Can keep you from tripping up over the differences

Learn More about dplyr

To learn more, you can check open the vignettes: browseVignettes(package = “dplyr”) To translate from plyr to dplyr: http://jimhester.gethub.io/plyrToDplyr *Common Q&A: http://stackoverflow.com/questions/tagged/dplyr?sort=frequent

*datatable (DT) is an alternative package to dplyr

ggVis

A package that is similar to ggplot2 but adds additional features.

A synthesis of ideas:

  • grammar of graphics (ggplot2)
  • Reactivity and Interactivity (Shiny - uses reactive programming)
  • Data pipeline (dplyr) integration
  • Of the web (vega.js) - renders in web browser

Check out The Grammar of Graphics by Leland Wilkinson:

  • Describes how graphics can be computed and presented in a declaritive way
  • ggplot2 is one implmementation
  • ggvis is similar, but there are some programming changes

Creating the both table from the nycflights13 dataset.

# summarize daily flights
daily <- flights %>%
  filter(origin == "EWR") %>%
  group_by(year, month, day) %>%
  summarise(
    delay = mean(dep_delay, na.rm = TRUE),
    cancelled = mean(is.na(dep_delay))
  )
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.
# summarize daily weather from hourly weather data

daily_weather <- weather %>%
  filter(origin == "EWR") %>%
  group_by(year, month, day) %>%
  summarise(
    temp = mean(temp, na.rm = TRUE),
    wind = mean(wind_speed, na.rm = TRUE),
    precip = sum(precip, na.rm = TRUE)
  )
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.
# Join the tables together
both <- daily %>%
  inner_join(daily_weather) %>%
  ungroup() %>%
  mutate(date = as.Date(ISOdate(year, month, day)))
## Joining with `by = join_by(year, month, day)`

Now we can use ggvis.

Example: A scatter plot with smoothing

both %>%
  ggvis(x = ~temp, y= ~delay) %>%
  layer_points() %>%
  layer_smooths()
#You can typically drop the x and y

#both %>%
  #ggvis(~temp, ~delay) %>%
  #layer_points() %>% #Each layer inherits previous properties
  #layer_smooths()

Example: A scatter plot with a fill color applied to differentiate precipitation

both %>%
  ggvis(~temp, ~delay, fill = ~precip) %>%
  layer_points()

Example A histogram of flight delays

both %>%
  ggvis(~delay) %>%
  layer_histograms()
## Guessing width = 5 # range / 21

Example GGvis makes an educated guess at an appropriate display of delays when no graph layer specified.

both %>% ggvis(~delay)
## Guessing layer_histograms()
## Guessing width = 5 # range / 21

Scaled and Unscaled Values

dat <- data.frame(x = c(1,2,3), y = c(10,20,30), f = c("red","green","black"))
  • There is a difference when using a scaled value (=) vs an unscaled/raw value (:=)
dat %>%
  ggvis(x = ~x, y = ~y, fill = ~f) %>%
  layer_points()

In this scaled value example, we get points with the labels “red”, “green”, and “black”, but they’re actually blue orange, and green.

Now let’s see the unscaled values:

dat %>% 
  ggvis(x = ~x, y := ~y, fill := ~f) %>%
  layer_points()

In this example fill := ~f gave point colors of red, green, and black. y :=~y gave pixel locations at 10, 20, and 30 down from the top.

If we drop the colon off of := ~y, we get a linear set of points .

Capturing Expressions with ~

  • The tilde (~) means capture the expression for later evaluation in the context of the data.
    • No tilde means evaluate it now.

Data Pipeline and Functional Interface

  • Each ggvis function takes a visualization object as an input and returns a modified visualization as an output.
p <- ggvis(mtcars, x = ~wt, y = ~mpg) #Create a ggvis object with mtcars data
p <- layer_points(p) # Take object p from before and layer on points. 
p <- layer_smooths(p) # Take object p from before and layer on smoothing lines.
p # print output

This can also be done with the pipe operator

#form 1- no pipes
p <- ggvis(mtcars, x = ~wt, y = ~mpg) 
p <- layer_points(p)
p <- layer_smooths(p, span = 0.5) 
p
# form 2 - composition of functions

layer_smooths(
  layer_points(
    ggvis(mtcars, ~wt, ~mpg)), 
  span = 0.5)
# form 3 with pipes
mtcars %>%
  ggvis(x = ~wt, y = ~mpg) %>%
  layer_points() %>%
  layer_smooths(span = 0.5)

Reactivity and Interactivity

  • Used often in shiny.
  • In regular programming, when you call a function, it just happens. The function takes a value and returns one (it doesn’t change for new values)
  • In functional reactive programming, a reactive can use a value from another reactive
    • This creates a dependency graph of reactives. The reactives persist.
    • When the value of an ancestor node changes, the future ones change.

Reactive Histogram

both %>%
  ggvis(~delay) %>%
  layer_histograms(width = input_slider(1, 10, value = 5))
## Warning: Can't output dynamic/interactive ggvis plots in a knitr document.
## Generating a static (non-dynamic, non-interactive) version of the plot.

Reactive Properties

both %>%
  ggvis( ~delay, ~precip) %>%
  layer_points(opacity := input_slider(0,1))
## Warning: Can't output dynamic/interactive ggvis plots in a knitr document.
## Generating a static (non-dynamic, non-interactive) version of the plot.

Reactive Data Sources

  • Can be used to link plots together and allow brushing
  • Need to use Shiny

Future

  • Subvisualizations
  • Zooming and panning
  • ggplot2 feature parity

More Examples of GGVIS

Example Scatter Plot in ggvis vs ggplot2

# Interactive Scatter Plot using ggvis
scatter_plot <- iris %>%
  ggvis(~Sepal.Length, ~Sepal.Width, fill = ~Species) %>%
  layer_points(size := 100) %>%
  add_tooltip(function(df) df$Species) %>%
  hide_legend("fill") %>%
  layer_text(x = ~mean(range(Sepal.Length)), y = ~max(Sepal.Width) + 0.5,
             text := "Scatter Plot - ggvis", fontSize := 15, baseline := "bottom") %>%
  set_options(width = 400, height = 300)

# Display the plot
print(scatter_plot)
# equivalent ggplot2

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:ggvis':
## 
##     resolution
# Scatter plot using ggplot2
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point(size = 3) +
  labs(title = "Scatter Plot - ggplot2", x = "Sepal Length", y = "Sepal Width") +
  theme_minimal()

Example Barchart with additional customization

# Load necessary libraries
library(ggvis)

# Interactive Bar Chart using ggvis
bar_chart <- iris %>%
  ggvis(x = ~Species, y = ~Sepal.Length, fill = ~Species) %>%
  layer_bars() %>%
  add_tooltip(function(df) df$Species) %>%
  scale_nominal("fill", range = c("blue", "red", "green"))

# Display the plot
print(bar_chart)

Example Boxplot in GGvis

# Load necessary libraries
library(ggvis)

# Define custom colors for species
colors <- c("darkblue", "darkred", "darkgreen")

# Box Plot using ggvis on iris dataset with color scales
box_plot_ggvis <- iris %>%
  ggvis(x = ~Species, y = ~Sepal.Length, fill = ~Species) %>%
  layer_boxplots(fillOpacity := 0.7, strokeWidth := 0.5, stroke := "black") %>%
  scale_nominal("fill", range = colors) %>%
  add_tooltip(function(df) df$Species)

# Display the plot
print(box_plot_ggvis)
## GGPLOT

# Load necessary libraries
library(ggplot2)

# Box Plot using ggplot2 on iris dataset
box_plot_ggplot <- ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_boxplot() +
  labs(title = "Box Plot of Sepal Length by Species",
       x = "Species",
       y = "Sepal Length") +
  theme_minimal()

# Display the plot
print(box_plot_ggplot)