1 Data visualizations

1.1 Scatter plot

library(tidyverse)
  • Simple scatter plot between two variables:
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

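  • Map a third variable to color, size, or shape to distinguish groups. Mapping size to an unordered categorical variable such as class triggers a ggplot2 warning, and the default shape palette has only six shapes, so one of the seven classes is left unplotted: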
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, size = class))

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, shape = class))
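
The default shape palette covers only six groups, while class has seven levels. One possible workaround, sketched here, is to supply the shapes manually with scale_shape_manual(). The shape codes 0 to 6 are an arbitrary choice for illustration, not part of the original example:

```r
library(ggplot2)

# Assumed shape codes 0-6: one point shape per level of class,
# so that no class is dropped from the plot.
p_shape <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class)) +
  scale_shape_manual(values = 0:6)

p_shape
```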

1.2 Facet

  • Facet a scatter plot by one categorical variable. nrow sets the number of rows in the panel layout (ncol sets the number of columns):
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)

  • Facet by two categorical variables: drv defines the rows and cyl defines the columns:
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_grid(drv ~ cyl)

1.3 Histogram chart

  • Plot histogram with one variable:
hist(mpg$cyl, col = rgb(0,0,1,1/4))

  • Overlay histograms of two variables on the same plot. The two variables should share the same scale; add = TRUE draws the second histogram on top of the first:
hist(mpg$cty, col = 'skyblue', xlim = c(0,50), ylim = c(0,100))
hist(mpg$hwy, col = scales::alpha('red',.5), add = TRUE)

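  • Overlay histograms grouped by a categorical variable. Here, after_stat(count) refers to the number of points in each bin; after_stat(density) can be used instead to compare the shapes of the distributions. The older ..count.. notation is deprecated as of ggplot2 3.4.0. Since geom_histogram() stacks groups by default, position = "identity" is needed for the histograms to overlap:
ggplot(mpg, aes(cty, fill = drv)) + 
  geom_histogram(alpha = 0.5, bins = 10, position = "identity", aes(y = after_stat(count)))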
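
To compare the shape of each group's distribution rather than raw counts, the y aesthetic can be mapped to density instead. A minimal sketch, assuming ggplot2 3.3.0 or later (where after_stat() is available); position = "identity" is a choice made here so that the groups overlap instead of stacking:

```r
library(ggplot2)

# Density-scaled histogram: within each drv group the bar areas sum to 1,
# so groups with different numbers of cars can be compared by shape.
p_density <- ggplot(mpg, aes(x = cty, fill = drv)) +
  geom_histogram(
    aes(y = after_stat(density)),
    alpha = 0.5, bins = 10, position = "identity"
  )

p_density
```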

1.4 Bar plot

  • Simple bar chart with categorical variable:
ggplot(data = mpg) + 
  geom_bar(mapping = aes(x = class))

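  • Bar chart of the proportion of observations in each category. group = 1 makes the proportions relative to the whole dataset; without it, every bar would have height 1. after_stat(prop) replaces the ..prop.. notation, which is deprecated as of ggplot2 3.4.0:
ggplot(data = mpg) + 
  geom_bar(mapping = aes(x = class, y = after_stat(prop), group = 1))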

  • Stacked bar chart:
ggplot(data = mpg) + 
  geom_bar(mapping = aes(x = class, fill = drv))

  • Stacked bar chart scaled to proportions within each category:
ggplot(data = mpg) + 
  geom_bar(mapping = aes(x = class, fill = drv), position = "fill")

  • Grouped bar chart of counts, with the bars for each drv value placed side by side instead of stacked:
ggplot(data = mpg) + 
  geom_bar(mapping = aes(x = class, fill = drv), position = "dodge")

1.5 Box plot

  • Vertical boxplot grouped by a categorical variable:
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot()

  • Horizontal boxplot grouped by a categorical variable:
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot() +
  coord_flip()

  • Boxplot with nested grouping by two categorical variables:
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot(mapping = aes(fill = drv))

  • Overlay jittered points on the boxplot. geom_jitter() adds a small amount of random variation to each point's position to reduce overplotting:
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot() +
  geom_jitter()

  • Faceted boxplot with nested grouping by two categorical variables:
ggplot(data = mpg, mapping = aes(x = drv, y = hwy)) +
  geom_boxplot() +
  facet_wrap(~ class, nrow = 3)

2 Data Transformation

library(nycflights13)

2.1 Filter

  • Simple filter:
filter(flights, month == 1, day == 1)
  • Filter with a set of values:
filter(flights, month %in% c(11, 12))
  • More complex filters: negation, missing values (NA), and ranges with between():
filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, is.na(dep_time))
filter(flights, between(distance, 700, 1000))
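
Note that filter() keeps only rows where the condition is TRUE, so rows where the condition evaluates to NA are dropped along with the FALSE ones. The sketch below shows one way to keep missing values explicitly; the 120-minute threshold and the name kept are illustrative only:

```r
library(dplyr)
library(nycflights13)

# Keep flights that departed more than 120 minutes late,
# plus flights whose departure delay is missing (NA).
kept <- filter(flights, is.na(dep_delay) | dep_delay > 120)

kept
```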

2.2 Sort

  • Simple sort:
arrange(flights, year, month, day)
  • Sort descending:
arrange(flights, desc(dep_delay))
  • Combination of ascending and descending:
arrange(flights, desc(dep_delay), arr_delay)
  • Sort with missing values (NA) shown first:
arrange(flights, desc(is.na(dep_time)))

2.3 Select

  • Get list of column names:
colnames(flights)
##  [1] "year"           "month"          "day"            "dep_time"      
##  [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
##  [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
## [13] "origin"         "dest"           "air_time"       "distance"      
## [17] "hour"           "minute"         "time_hour"
  • Select columns by names
select(flights, c(year, month, day))
  • Select range of columns by names
select(flights, c(year:day))
  • Select columns which are not in set of columns
select(flights, -c(year,day))
  • The same kinds of selection by column position instead of name:
select(flights, c(1, 3))
select(flights, c(1:3))
select(flights, -c(1, 3))
select(flights, c(1:3, 5:7))
  • Select columns with helper functions:
select(flights, contains('time'))

Some helper functions:

  • starts_with("text"): matches names that begin with "text".
  • ends_with("text"): matches names that end with "text".
  • contains("text"): matches names that contain "text".
  • matches("regex"): matches names against a regular expression.
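
A short sketch of these helpers applied to flights. The patterns and the result names (dep_cols, delay_cols, dep_arr_cols) are chosen here for illustration:

```r
library(dplyr)
library(nycflights13)

# Names beginning with "dep": dep_time, dep_delay
dep_cols <- select(flights, starts_with("dep"))

# Names ending with "delay": dep_delay, arr_delay
delay_cols <- select(flights, ends_with("delay"))

# Regular expression: names beginning with "dep_" or "arr_"
dep_arr_cols <- select(flights, matches("^(dep|arr)_"))

names(dep_arr_cols)
```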

2.4 Add new variables

  • Adds new columns at the end of your dataset:
mutate(flights,
  gain = dep_delay - arr_delay,
  speed = distance / air_time * 60
)
  • If you only want to keep the new variables:
transmute(flights,
  gain = dep_delay - arr_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)

Functions for creating new variables:

  • Arithmetic operators: +, -, *, /, ^
  • Modular arithmetic: %/% (integer division) and %% (remainder)
  • Logs: log(), log2(), log10()
  • Logical comparisons: <, <=, >, >=, !=, and ==
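
As a sketch of how these operators combine: dep_time in flights is stored as an integer in HHMM format, so integer division and remainder by 100 split it into hours and minutes. The new column names below are illustrative:

```r
library(dplyr)
library(nycflights13)

new_vars <- transmute(flights,
  dep_time,
  dep_hour     = dep_time %/% 100,   # integer division: hour part of HHMM
  dep_minute   = dep_time %% 100,    # remainder: minute part of HHMM
  log_distance = log2(distance),     # log transform
  arrived_late = arr_delay > 0       # logical comparison
)

new_vars
```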

2.5 Group by

  • Simple group by
flights %>% 
  group_by(dest) %>% 
  summarise(dist = mean(distance, na.rm = TRUE))
  • Calculate multiple new variables on group by
flights %>% 
  group_by(dest) %>% 
  summarise(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  )
  • Group by multiple variables
flights %>% 
  group_by(year, month, day) %>% 
  summarise(mean = mean(dep_delay, na.rm = TRUE))

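Useful summary functions for summarise():

  • Measures of center: mean(x), median(x)
  • Measures of spread: sd(x), IQR(x), mad(x)
  • Measures of rank: min(x), quantile(x, 0.25), max(x)
  • Measures of position: first(x), nth(x, 2), last(x)
  • Counts: n()
  • Counts and proportions of logical values: sum(x > 10), mean(y == 0)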
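
Several of these functions can be combined in a single summarise() call. A minimal sketch, with the grouping variable, the 60-minute threshold, and the output column names chosen for illustration:

```r
library(dplyr)
library(nycflights13)

delay_summary <- flights %>%
  group_by(origin) %>%
  summarise(
    n_flights    = n(),                                        # rows per group
    median_delay = median(dep_delay, na.rm = TRUE),            # center
    sd_delay     = sd(dep_delay, na.rm = TRUE),                # spread
    q75_delay    = quantile(dep_delay, 0.75, na.rm = TRUE),    # rank
    first_dep    = first(dep_time),                            # position
    n_late       = sum(dep_delay > 60, na.rm = TRUE),          # count of TRUE
    prop_late    = mean(dep_delay > 60, na.rm = TRUE)          # proportion of TRUE
  )

delay_summary
```

sum() of a logical vector counts the TRUE values and mean() gives their proportion, which is why the last two lines work without an explicit filter.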