library(tidyverse)

Dataset

The dataset I use is the Border Crossing Entry Data from https://www.kaggle.com/datasets. To reduce the size of the data, I kept only the records from 2002 to 2019.

readr–read_csv

read_csv from readr (part of the tidyverse) imports CSV files into a tibble and is generally faster than base R's read.csv, especially for large data sets.

data <- read_csv('https://raw.githubusercontent.com/shirley-wong/Data-607/master/Border_Crossing_Entry_Data_2002-2019.csv')

## Parsed with column specification:
## cols(
##   `Port Name` = col_character(),
##   State = col_character(),
##   `Port Code` = col_double(),
##   Border = col_character(),
##   Date = col_character(),
##   Measure = col_character(),
##   Value = col_double(),
##   Location = col_character()
## )

data
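read_csv guesses each column's type and prints the specification shown above. As a minimal sketch, the types can also be declared up front with the col_types argument, which keeps the import reproducible and silences that message. The inline CSV text and the names csv_text and sample_data are made up for illustration, and wrapping the string in I() assumes readr 2.0 or later.

```r
library(readr)

# Hypothetical two-row sample standing in for the real file
csv_text <- "Port Name,State,Value\nAlpha,Maine,10\nBravo,Texas,25\n"

sample_data <- read_csv(
  I(csv_text),  # I() marks the string as literal data (assumes readr >= 2.0)
  col_types = cols(
    `Port Name` = col_character(),
    State = col_character(),
    Value = col_double()
  )
)

sample_data
```

Declaring the types this way also makes read_csv warn about parsing problems if the file ever contains a value that does not fit the declared type.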

Pipe operator %>%

The tidyverse incorporates the pipe operator %>% from the magrittr package. The pipe helps you write code that is easier to read and understand. Rather than embedding the input of a function among its arguments (e.g. function(input, argument1, argument2…)), the pipe separates the input from the function call (e.g. input %>% function(argument1, argument2…)) and passes it in as the first argument. Pipes can also be chained, which means the output of the preceding piece of code is used as the input of the following piece (e.g. input %>% function1(argument1, argument2…) %>% function2(argument1, argument2…)).

data_mod <- data
data_mod$Date <- data_mod$Date %>% as.Date('%m/%d/%Y')
data_mod
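To make the equivalence concrete, here is a small sketch comparing a nested call with the same computation written as a pipe chain. The vector values is made-up input, not part of the dataset.

```r
library(magrittr)

values <- c(1, 4, 9)  # made-up input

# Nested call: has to be read from the inside out
nested <- round(sqrt(sum(values)), 2)

# Piped call: reads left to right, one step at a time
piped <- values %>% sum() %>% sqrt() %>% round(2)

identical(nested, piped)  # TRUE
```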

dplyr–mutate

The mutate function from dplyr (part of the tidyverse) adds new variables and preserves existing ones; assigning to an existing name overwrites that variable. Here I use mutate together with %>% to repeat the Date conversion above.

data_mod2 <- data %>%
  mutate(Date = as.Date(Date, '%m/%d/%Y'))

data_mod2
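The difference between adding and overwriting a column can be sketched on a toy table. The tibble toy and the column Value_k are hypothetical names used only for this illustration.

```r
library(dplyr)

toy <- tibble(Value = c(10, 20, 30))  # made-up data

# A new name adds a column and keeps Value untouched
added <- toy %>% mutate(Value_k = Value / 1000)

# An existing name replaces that column in place
replaced <- toy %>% mutate(Value = Value / 1000)

names(added)     # "Value" "Value_k"
names(replaced)  # "Value"
```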

stringr–str_replace

The str_replace function from stringr (part of the tidyverse) replaces the first match of a pattern in a string with a replacement. In this example I use a backreference in a regular expression to keep only the text inside the parentheses of the Location column, i.e. the numeric coordinates.

data_mod3 <- data_mod2 %>%
  mutate(Location = str_replace(Location, '.+\\((.+)\\).*', '\\1'))

data_mod3
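To see what the pattern does, it helps to run it on a single string. This is a sketch on a made-up value; I am assuming the Location entries have the shape "POINT (x y)", which is what the pattern above expects. The capture group (.+) grabs everything between the parentheses, and '\\1' in the replacement refers back to that group.

```r
library(stringr)

# Hypothetical Location value, assumed to look like "POINT (x y)"
loc <- 'POINT (-115.48433 32.67524)'

# '.+\\(' matches the text up to the opening parenthesis,
# '(.+)' captures the coordinates, and '\\).*' matches the rest
coords <- str_replace(loc, '.+\\((.+)\\).*', '\\1')

coords  # "-115.48433 32.67524"
```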

tidyr–separate

The separate function from tidyr (part of the tidyverse) splits one column into multiple columns at a delimiter. In this example, I demonstrate it by splitting the Date column into three columns (Year, Month, and Date, where the new Date holds the day of the month) and the Location column into two columns, Latitude and Longitude.

data_mod4 <- data_mod3 %>%
  separate(Date,c('Year','Month','Date')) %>%
  # Assumes the first value is the latitude; swap the names if Location lists longitude first
  separate(Location, c('Latitude','Longitude'), sep = ' ')

data_mod4
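By default separate leaves the new columns as character. As a minimal sketch, the convert = TRUE argument asks separate to turn the pieces into numbers where possible. The one-row tibble and the column names Day, First and Second are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Made-up row in the same shape as the cleaned data
toy <- tibble(Date = '2019-03-01', Location = '-115.48 32.67')

out <- toy %>%
  # Default sep splits on any run of non-alphanumeric characters, here "-"
  separate(Date, c('Year', 'Month', 'Day'), convert = TRUE) %>%
  separate(Location, c('First', 'Second'), sep = ' ', convert = TRUE)

out
```

With numeric columns, later steps such as filtering on Year or plotting the coordinates do not need an extra as.numeric call.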

dplyr–select

The select function from dplyr keeps only the variables you name; prefix a name with a minus sign (-) to drop that variable instead. In this example I drop the Port Code column, which duplicates the information in the Port Name column.

data_mod5 <- data_mod4 %>%
  select(-`Port Code`)

data_mod5

dplyr–filter

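The filter function from dplyr keeps the rows (cases) for which all conditions are true. In this example I keep only the cases from 2019. Note that Year is a character column after separate, so R coerces 2019 to '2019' for the comparison.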

data_mod6 <- data_mod5 %>%
  filter(Year == 2019)

data_mod6
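filter also accepts several conditions separated by commas, and a row is kept only when all of them hold. Here is a sketch on made-up data; the tibble toy and its values are hypothetical.

```r
library(dplyr)

toy <- tibble(
  Year  = c('2018', '2019', '2019'),  # character, as separate() leaves it
  State = c('Maine', 'Maine', 'Texas'),
  Value = c(5, 0, 12)
)

# Comma-separated conditions are combined with AND
kept <- toy %>%
  filter(Year == '2019', Value > 0)

kept$State  # "Texas"
```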

dplyr–group_by, summarise

The group_by and summarise functions from dplyr are often used together. group_by takes an existing table and converts it into a grouped table where operations are performed “by group”. summarise then collapses each group into a single row of summary values, such as a column sum or mean.

data_mod7 <- data_mod6 %>%
  group_by(`State`) %>%
  summarise(Ttl_Value = sum(Value))

data_mod7
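The collapse from many rows to one row per group is easiest to see on a toy table. This is a sketch with made-up values; the extra column N, built with n(), is my own addition to show that summarise can compute several summaries at once.

```r
library(dplyr)

toy <- tibble(
  State = c('Maine', 'Maine', 'Texas'),  # made-up data
  Value = c(5, 7, 20)
)

totals <- toy %>%
  group_by(State) %>%
  summarise(
    Ttl_Value = sum(Value),  # one total per State
    N = n()                  # number of rows in each group
  )

totals
```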

dplyr–arrange & desc

The arrange function sorts the rows of a table by one or more variables, in ascending order by default. The desc function transforms a vector so that it sorts in descending order. Combining the two functions allows us to arrange a table in descending order.

data_mod8 <- data_mod7 %>%
  arrange(desc(Ttl_Value))

data_mod8
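arrange can take more than one sort key, which is useful for breaking ties. A minimal sketch on made-up totals, sorting by Ttl_Value in descending order and then by State alphabetically:

```r
library(dplyr)

toy <- tibble(
  State = c('Maine', 'Texas', 'Ohio'),  # made-up data
  Ttl_Value = c(12, 20, 12)
)

# Largest total first; ties (Maine and Ohio) fall back to State order
sorted <- toy %>%
  arrange(desc(Ttl_Value), State)

sorted$State  # "Texas" "Maine" "Ohio"
```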

ggplot2 & fct_reorder

The fct_reorder function from forcats (part of the tidyverse) reorders the levels of a factor by the values of another variable, which is a handy way to control the order of bars in a ggplot2 chart.

data_mod6 %>% 
  group_by(State) %>%
  summarise(Sum_Value = sum(Value)) %>%
  ggplot(aes(x=fct_reorder(State, Sum_Value), y=Sum_Value,fill=Sum_Value,label = Sum_Value))+
  geom_col()+
  ylim(0,40000000)+
  coord_flip()+
  geom_text(hjust = -0.1, size = 3)+
  labs(
    title='Border Crossing Activity Count by State',
    subtitle = 'Year 2019')+
  xlab('State')+
  ylab('Count')
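What fct_reorder changes is the order of the factor levels, which ggplot2 then uses to place the bars. A small sketch outside of ggplot2, on made-up values, shows the effect:

```r
library(forcats)

state <- factor(c('Maine', 'Ohio', 'Texas'))  # made-up data
total <- c(30, 10, 20)

levels(state)      # alphabetical: "Maine" "Ohio" "Texas"

# Reorder the levels by the matching totals, smallest first
reordered <- fct_reorder(state, total)

levels(reordered)  # "Ohio" "Texas" "Maine"
```

The values themselves stay in place; only the level order changes, and with coord_flip the last level ends up at the top of the chart.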

data_mod5 %>%
  group_by(Measure,Year) %>%
  summarise(Sum_Value = sum(Value)) %>% 
  mutate(Year = as.numeric(Year)) %>%
  ggplot() +
  geom_line(aes(x= Year, y = Sum_Value, colour = Measure))+
  scale_x_continuous(breaks = 2002:2019)+  # Year is numeric, so use a continuous scale with one tick per year
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  labs(
    title = 'Total Border Crossing Activity Count 2002 - 2019',
    subtitle = 'by Measure')+
  ylab('Count')+
  geom_vline(xintercept = 2018, linetype = 'dashed', color = 'steelblue', size = 1)

Data607 Tidyverse Part 1

Sin Ying Wong

12/1/2019