1 Data visualizations

1.1 Scatter plot

library(tidyverse)

Simple scatter plot between two variables:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

Map a third variable to color, size, or shape to distinguish groups:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, size = class))

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, shape = class))
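
One caveat with the shape mapping: ggplot2's default shape palette supplies six shapes, while class in mpg has seven levels, so one class is left unplotted with a warning. The sketch below is one possible workaround, not part of the original cheat sheet: it supplies the shapes by hand with scale_shape_manual(). The shape codes 1 to 7 are an arbitrary, illustrative choice.

```r
library(ggplot2)

# mpg$class has seven levels; the default shape palette covers only six.
n_classes <- length(unique(mpg$class))

# Assumption: shape codes 1..n_classes are an arbitrary illustrative choice.
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class)) +
  scale_shape_manual(values = seq_len(n_classes))

p
```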

1.2 Facet

Facet a scatter plot by one categorical variable. Use nrow to set the number of rows of panels:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)

Facet by two categorical variables, one on the rows (drv) and one on the columns (cyl):

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_grid(drv ~ cyl)

1.3 Histogram chart

Plot histogram with one variable:

hist(mpg$cyl, col = rgb(0,0,1,1/4))

Overlay histograms of two variables on the same plot. Both variables should be on the same scale, so set shared axis limits with xlim and ylim:

hist(mpg$cty, col = 'skyblue', xlim = c(0,50), ylim = c(0,100))
hist(mpg$hwy, col = scales::alpha('red',.5), add = TRUE)

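Overlay histograms grouped by a categorical variable. Here ‘..count..’ refers to the number of observations in each bin; use ‘..density..’ instead to plot densities, which makes groups of different sizes comparable. Setting position = "identity" draws the groups on top of one another; the default stacks them.

ggplot(mpg, aes(cty, fill = drv)) + 
  geom_histogram(alpha = 0.5, bins = 10, position = "identity", aes(y = ..count..))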
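
To illustrate the density option mentioned above, the following sketch plots densities instead of counts so that drive types with different numbers of cars can be compared. It assumes ggplot2 3.3.0 or later, where after_stat(density) is the current spelling of ‘..density..’. The position = "identity" argument is included so the groups overlap rather than stack.

```r
library(ggplot2)

# Densities are computed per group, so each group's bars cover a total area of 1.
p <- ggplot(mpg, aes(cty, fill = drv)) +
  geom_histogram(
    aes(y = after_stat(density)),  # assumes ggplot2 >= 3.3.0
    alpha = 0.5,
    bins = 10,
    position = "identity"          # overlay groups instead of stacking them
  )

p
```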

1.4 Bar plot

Simple bar chart of a categorical variable:

ggplot(data = mpg) + 
  geom_bar(mapping = aes(x = class))

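Bar chart showing the proportion of each category. Setting group = 1 makes the proportions relative to the whole dataset; without it, every bar would have a proportion of 1: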

ggplot(data = mpg) + 
  geom_bar(mapping = aes(x = class, y = ..prop.., group = 1))

Stacked bar chart:

ggplot(data = mpg) + 
  geom_bar(mapping = aes(x = class, fill = drv))

Stacked bar chart showing proportions within each category:

ggplot(data = mpg) + 
  geom_bar(mapping = aes(x = class, fill = drv), position = "fill")

Bar chart showing counts per category, with the bars for each group placed side by side instead of stacked:

ggplot(data = mpg) + 
  geom_bar(mapping = aes(x = class, fill = drv), position = "dodge")

1.5 Box plot

Vertical boxplot grouped by a categorical variable:

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot()

Horizontal boxplot grouped by a categorical variable:

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot() +
  coord_flip()

Boxplot with nested grouping by two categorical variables:

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot(mapping = aes(fill = drv))

Add jittered points on top of the box plot. geom_jitter() adds a small amount of random variation to each point's position to reduce overplotting:

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot() +
  geom_jitter()

Faceted boxplot with nested grouping by two categorical variables:

ggplot(data = mpg, mapping = aes(x = drv, y = hwy)) +
  geom_boxplot() +
  facet_wrap(~ class, nrow = 3)

2 Data Transformation

library(nycflights13)

2.1 Filter

Simple filter:

filter(flights, month == 1, day == 1)

Filter with a set of values:

filter(flights, month %in% c(11, 12))

More complicated filters: negation, missing values (NA), and ranges with between():

filter(flights, !(arr_delay > 120 | dep_delay > 120))

filter(flights, is.na(dep_time))

filter(flights, between(distance, 700, 1000))
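
The negated filter above can also be written as two positive conditions, and it is worth knowing how missing values behave: filter() keeps only rows where the condition is TRUE, so rows where it evaluates to NA are dropped. The sketch below shows both points on a small made-up table (the delays tibble and its values are hypothetical, not taken from flights).

```r
library(dplyr)

# Hypothetical toy data; flight 4 has a missing arrival delay.
delays <- tibble(
  flight    = 1:4,
  arr_delay = c(10, 150, 30, NA),
  dep_delay = c(5, 20, 200, 0)
)

# Negated OR and the equivalent pair of conditions select the same rows.
not_late_a <- filter(delays, !(arr_delay > 120 | dep_delay > 120))
not_late_b <- filter(delays, arr_delay <= 120, dep_delay <= 120)

# Rows with NA in the condition are dropped unless asked for explicitly.
keep_missing <- filter(delays, is.na(arr_delay) | arr_delay <= 120)

not_late_a
not_late_b
keep_missing
```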

2.2 Sort

Simple sort:

arrange(flights, year, month, day)

Sort descending:

arrange(flights, desc(dep_delay))

Combination of ascending and descending:

arrange(flights, desc(dep_delay), arr_delay)

Sort with missing values (NA) first:

arrange(flights, desc(is.na(dep_time)))

2.3 Select

Get list of column names:

colnames(flights)

##  [1] "year"           "month"          "day"            "dep_time"      
##  [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
##  [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
## [13] "origin"         "dest"           "air_time"       "distance"      
## [17] "hour"           "minute"         "time_hour"

Select columns by name:

select(flights, c(year, month, day))

Select a range of columns by name:

select(flights, c(year:day))

Select all columns except a given set:

select(flights, -c(year,day))

The same selections using column indexes instead of names:

select(flights, c(1, 3))

select(flights, c(1:3))

select(flights, -c(1, 3))

select(flights, c(1:3, 5:7))

Select columns with helper functions:

select(flights, contains('time'))

Some helper functions:

starts_with("text"): matches names that begin with "text".
ends_with("text"): matches names that end with "text".
contains("text"): matches names that contain "text".
matches("regex"): matches names that match a regular expression.
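
As a quick illustration of the helpers listed above, the sketch below applies three of them to a one-row table whose column names are borrowed from flights. The table itself and its values are made up for the example.

```r
library(dplyr)

# Hypothetical one-row table using a few column names from flights.
df <- tibble(
  dep_time = 517, sched_dep_time = 515, arr_time = 830,
  dep_delay = 2, arr_delay = 11, carrier = "UA"
)

# Names beginning with "dep" (sched_dep_time does not qualify).
dep_cols <- names(select(df, starts_with("dep")))

# Names ending with "delay".
delay_cols <- names(select(df, ends_with("delay")))

# Regular expression: exactly dep_time or arr_time.
time_cols <- names(select(df, matches("^(dep|arr)_time$")))

dep_cols    # "dep_time"  "dep_delay"
delay_cols  # "dep_delay" "arr_delay"
time_cols   # "dep_time"  "arr_time"
```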

2.4 Add new variables

Adds new columns at the end of your dataset:

mutate(flights,
  gain = dep_delay - arr_delay,
  speed = distance / air_time * 60
)

If you only want to keep the new variables:

transmute(flights,
  gain = dep_delay - arr_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)

Functions for creating new variables:

Arithmetic operators: +, -, *, /, ^
Modular arithmetic: %/% (integer division) and %% (remainder)
Logs: log(), log2(), log10()
Logical comparisons: <, <=, >, >=, !=, and ==
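
Modular arithmetic is handy for splitting clock times stored as numbers, such as dep_time in flights (517 means 5:17). The sketch below uses a small made-up table rather than the full dataset; the times tibble is hypothetical.

```r
library(dplyr)

# Hypothetical departure times in HHMM form.
times <- tibble(dep_time = c(517, 1545, 2359))

split_times <- transmute(times,
  dep_time,
  hour   = dep_time %/% 100,  # integer division keeps the hours
  minute = dep_time %% 100    # remainder keeps the minutes
)

split_times
```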

2.5 Group by

Simple group by:

flights %>% 
  group_by(dest) %>% 
  summarise(dist = mean(distance, na.rm = TRUE))

Calculate several summaries per group:

flights %>% 
  group_by(dest) %>% 
  summarise(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  )

Group by multiple variables:

flights %>% 
  group_by(year, month, day) %>% 
  summarise(mean = mean(dep_delay, na.rm = TRUE))

Useful summary functions to use inside summarise():

mean(x), median(x)
sd(x), IQR(x), mad(x)
min(x), quantile(x, 0.25), max(x)
first(x), nth(x, 2), last(x)
n()
sum(x > 10), mean(y == 0)
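
The last line of the list relies on logical values being treated as 0 and 1: sum() of a condition counts the rows where it holds, and mean() gives the proportion. A minimal sketch on made-up data (the delays tibble is hypothetical):

```r
library(dplyr)

# Hypothetical departure delays for two carriers.
delays <- tibble(
  carrier   = c("AA", "AA", "AA", "UA", "UA"),
  dep_delay = c(0, 15, 30, 0, 5)
)

by_carrier <- delays %>%
  group_by(carrier) %>%
  summarise(
    n            = n(),                  # rows per group
    n_late       = sum(dep_delay > 10),  # count of delays over 10 minutes
    prop_on_time = mean(dep_delay == 0)  # share of flights with no delay
  )

by_carrier
```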

2A Programming cheat sheets - R

Student ID: 13484528

Vu Viet Truong (Vincent)

28/09/2019
