library(tidyverse)
The dataset I used is the Border Crossing Entry Data from https://www.kaggle.com/datasets. To reduce the size of the data, I selected only the records from 2002 to 2019.
read_csv from readr (a sub-package of tidyverse) imports csv files faster than the default R function read.csv, especially for large data sets.
data <- read_csv('https://raw.githubusercontent.com/shirley-wong/Data-607/master/Border_Crossing_Entry_Data_2002-2019.csv')
## Parsed with column specification:
## cols(
## `Port Name` = col_character(),
## State = col_character(),
## `Port Code` = col_double(),
## Border = col_character(),
## Date = col_character(),
## Measure = col_character(),
## Value = col_double(),
## Location = col_character()
## )
data
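The column specification printed above can also be passed to read_csv up front through the col_types argument, which silences the message and stops a column from being guessed differently later. Here is a minimal sketch on a throwaway two-column file; the file contents and the name toy are made up for illustration and are not part of the border crossing data.

```r
library(readr)

# Write a tiny illustrative csv to a temporary file (made-up values)
tmp <- tempfile(fileext = ".csv")
writeLines(c("Port Name,Value", "Alpha,10", "Beta,25"), tmp)

# Declare the column types explicitly instead of letting readr guess them
toy <- read_csv(
  tmp,
  col_types = cols(
    `Port Name` = col_character(),
    Value = col_double()
  )
)
toy
```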
tidyverse incorporates the pipe operator %>% from the magrittr package. The pipe %>% helps write code that is easier to read and understand. Rather than embedding the ‘input’ of a function within its arguments (e.g. function(input = data, argument1, argument2…)), the pipe %>% separates the ‘input’ from the function (e.g. input %>% function(argument1, argument2…)). It also enables chaining, which means the output of the preceding piece of code is used as the input of the following piece of code (e.g. input %>% function1(argument1, argument2…) %>% function2(argument1, argument2…)).
data_mod <- data
data_mod$Date <- data_mod$Date %>% as.Date(format = '%m/%d/%Y')
data_mod
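To make the chaining idea concrete, here is a small sketch that compares a nested call with the equivalent piped call. The vector x and the names nested and piped are made up for illustration.

```r
library(magrittr)  # provides %>% (also attached by tidyverse)

x <- c(1, 4, 9)

# Nested form: has to be read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped form: reads left to right, one step at a time
piped <- x %>% sqrt() %>% mean() %>% round(1)

# Both forms compute the same value
identical(nested, piped)
```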
The mutate function from dplyr (a sub-package of tidyverse) adds new variables and preserves existing ones; assigning to an existing name overwrites that variable. Here I will use mutate together with %>% to demonstrate the operation above again.
data_mod2 <- data %>%
mutate(Date = as.Date(Date, '%m/%d/%Y'))
data_mod2
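The sketch below shows both uses of mutate on a tiny made-up table: overwriting an existing column (Date) and adding a new one. The tibble toy and the column Value_k are hypothetical and only serve the illustration.

```r
library(dplyr)

# Tiny illustrative table (made-up values)
toy <- tibble(
  Date = c("03/01/2019", "12/01/2018"),
  Value = c(10, 25)
)

toy2 <- toy %>%
  mutate(
    Date = as.Date(Date, format = "%m/%d/%Y"),  # overwrites the character column
    Value_k = Value / 1000                      # adds a new column
  )
toy2
```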
The str_replace function from stringr (a sub-package of tidyverse) replaces the first match of a pattern in a string with other content. In this example I will use a backreference in a regular expression to retain only the numerical part inside the parentheses in column Location.
data_mod3 <- data_mod2 %>%
mutate(Location = str_replace(Location, '.+\\((.+)\\).*', '\\1'))
data_mod3
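To see what the backreference does, the same pattern can be tried on a couple of standalone strings. The strings below are made up and assume that Location values look like a word followed by coordinates in parentheses.

```r
library(stringr)

# Illustrative strings in the assumed "POINT (x y)" shape
loc <- c("POINT (-115.48 32.68)", "POINT (-67.94 47.16)")

# \\( and \\) match literal parentheses; (.+) captures what is between them;
# '\\1' replaces the whole match with that captured group
inner <- str_replace(loc, '.+\\((.+)\\).*', '\\1')
inner
```

The whole string is matched by the pattern, so the replacement leaves only the captured text.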
The separate function from tidyr (a sub-package of tidyverse) splits one column into multiple columns at a defined delimiter. In this example, I will demonstrate this function by splitting the Date column into three columns (Year, Month and Date) and the Location column into two columns (Latitude and Longitude).
data_mod4 <- data_mod3 %>%
separate(Date,c('Year','Month','Date')) %>%
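# Note: WKT POINT text is conventionally ordered x y (longitude, then latitude);
# check against the data that these column names match the values.
separate(Location, c('Latitude','Longitude'), sep = ' ')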
data_mod4
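Here is a small sketch of separate on a made-up table. By default it splits at any run of non-alphanumeric characters, which is why the Date column needs no sep argument, while an explicit sep = " " is used for the space-delimited column. The tibble toy and the column names Day, First and Second are hypothetical.

```r
library(tidyr)
library(tibble)

# Tiny illustrative table (made-up values)
toy <- tibble(
  Date = as.Date(c("2019-03-01", "2018-12-01")),
  Location = c("-115.48 32.68", "-67.94 47.16")
)

toy_split <- toy %>%
  separate(Date, c("Year", "Month", "Day")) %>%          # splits at "-"
  separate(Location, c("First", "Second"), sep = " ")    # splits at the space
toy_split
```

The new columns are character vectors; passing convert = TRUE to separate asks it to convert them to a more specific type where possible.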
The select function from dplyr keeps only the variables that are mentioned, or, with a minus sign ‘-’, drops the variables that are mentioned. In this example I will drop the Port Code column, which duplicates the information in the Port Name column.
data_mod5 <- data_mod4 %>%
select(-`Port Code`)
data_mod5
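The two styles of select can be compared on a small made-up table: naming the columns to keep, or naming the column to drop. The tibble toy and its values are illustrative only.

```r
library(dplyr)

# Tiny illustrative table (made-up values)
toy <- tibble(
  `Port Name` = c("Alpha", "Beta"),
  `Port Code` = c(101, 202),
  Value = c(10, 25)
)

kept <- toy %>% select(`Port Name`, Value)   # keep by name
dropped <- toy %>% select(-`Port Code`)      # drop by name

# Both leave the same columns
identical(names(kept), names(dropped))
```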
The filter function from dplyr keeps the rows/cases where the conditions are true. In this example I will keep only the cases from year 2019.
data_mod6 <- data_mod5 %>%
filter(Year == 2019)
data_mod6
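A short sketch of filter with more than one condition follows; conditions separated by commas must all be true for a row to be kept. The tibble toy is made up, and Year is stored as text here because separate produces character columns by default.

```r
library(dplyr)

# Tiny illustrative table (made-up values)
toy <- tibble(
  Year = c("2018", "2019", "2019"),
  State = c("Texas", "Texas", "Maine"),
  Value = c(5, 10, 25)
)

# Keep rows from 2019 whose Value is above 8
recent <- toy %>%
  filter(Year == "2019", Value > 8)
recent
```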
The group_by and summarise functions from dplyr are often used together. group_by takes an existing table and converts it into a grouped table where operations are performed “by group”. summarise reduces each group to a single row, creating one or more summary variables such as a column sum or mean.
data_mod7 <- data_mod6 %>%
group_by(`State`) %>%
summarise(Ttl_Value = sum(Value))
data_mod7
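The grouping step is easier to follow on a table small enough to check by hand. In this sketch the tibble toy and the extra summary column Rows are made up for illustration; n() counts the rows in each group.

```r
library(dplyr)

# Tiny illustrative table (made-up values)
toy <- tibble(
  State = c("Texas", "Texas", "Maine"),
  Value = c(5, 10, 25)
)

totals <- toy %>%
  group_by(State) %>%
  summarise(
    Ttl_Value = sum(Value),  # one total per state
    Rows = n()               # number of rows that went into it
  )
totals
```

Three input rows collapse into two output rows, one per state.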
The arrange function sorts the rows of a table by the given variables in ascending order. The desc function transforms a vector so that it sorts in descending order. Combining these two functions allows us to arrange a table in descending order.
data_mod8 <- data_mod7 %>%
arrange(desc(Ttl_Value))
data_mod8
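As a quick comparison, the sketch below sorts the same made-up table both ways. The tibble toy and the names asc and dsc are illustrative only.

```r
library(dplyr)

# Tiny illustrative table (made-up values)
toy <- tibble(
  State = c("Maine", "Texas", "Ohio"),
  Ttl_Value = c(25, 15, 40)
)

asc <- toy %>% arrange(Ttl_Value)        # smallest total first
dsc <- toy %>% arrange(desc(Ttl_Value))  # largest total first

asc$State
dsc$State
```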
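The fct_reorder function from forcats (a sub-package of tidyverse) reorders the levels of a factor by the values of another variable, which is a handy way to control the order of bars in ggplot. Here the states are ordered by Sum_Value.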
data_mod6 %>%
group_by(State) %>%
summarise(Sum_Value = sum(Value)) %>%
ggplot(aes(x=fct_reorder(State, Sum_Value), y=Sum_Value,fill=Sum_Value,label = Sum_Value))+
geom_col()+
ylim(0,40000000)+
coord_flip()+
geom_text(hjust = -0.1, size = 3)+
labs(
title='Border Crossing Activity Count by State',
subtitle = 'Year 2019')+
xlab('State')+
ylab('Count')
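What fct_reorder changes is the order of the factor levels, and ggplot draws discrete axes in level order. The sketch below shows the difference outside of a plot; the vectors state and total are made up for illustration.

```r
library(forcats)

# Illustrative values (made up)
state <- c("Maine", "Texas", "Ohio")
total <- c(25, 15, 40)

# Default factor levels are alphabetical
default_levels <- levels(factor(state))

# fct_reorder orders the levels by the second argument, ascending by default
reordered_levels <- levels(fct_reorder(state, total))

default_levels
reordered_levels
```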
data_mod5 %>%
group_by(Measure,Year) %>%
summarise(Sum_Value = sum(Value)) %>%
mutate(Year = as.numeric(Year)) %>%
ggplot() +
geom_line(aes(x= Year, y = Sum_Value, colour = Measure))+
scale_x_continuous(breaks = 2002:2019)+
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
labs(
title = 'Total Border Crossing Activity Count 2002 - 2019',
subtitle = 'by Measure')+
ylab('Count')+
geom_vline(xintercept = 2018, linetype = 'dashed', color = 'steelblue', size = 1)