Data 110 - Project 1: US Border Crossings

##-----------SECTION ONE: LOAD DATA & LIBRARIES

## set working director and read borders data set
setwd("C:/Users/kris.sutton/OneDrive - Government of The District of Columbia/Desktop/Personal/Data 110/R Studio/Datasets")
borders <- read.csv("border_crossing.csv")

library(lubridate)

## Warning: package 'lubridate' was built under R version 4.0.5

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(readr)

## Warning: package 'readr' was built under R version 4.0.5

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.0.5

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.4     v dplyr   1.0.6
## v tibble  3.1.2     v stringr 1.4.0
## v tidyr   1.1.3     v forcats 0.5.1
## v purrr   0.3.4

## Warning: package 'tibble' was built under R version 4.0.5

## Warning: package 'tidyr' was built under R version 4.0.5

## Warning: package 'purrr' was built under R version 4.0.5

## Warning: package 'dplyr' was built under R version 4.0.5

## Warning: package 'stringr' was built under R version 4.0.5

## Warning: package 'forcats' was built under R version 4.0.5

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x lubridate::as.difftime() masks base::as.difftime()
## x lubridate::date()        masks base::date()
## x dplyr::filter()          masks stats::filter()
## x lubridate::intersect()   masks base::intersect()
## x dplyr::lag()             masks stats::lag()
## x lubridate::setdiff()     masks base::setdiff()
## x lubridate::union()       masks base::union()

library(dplyr)
library(dplyr)
library(streamgraph)
library(alluvial)

## Warning: package 'alluvial' was built under R version 4.0.5

library(ggplot2)
library(ggalluvial)

## Warning: package 'ggalluvial' was built under R version 4.0.5

library(tibble)

borders <- as.tibble(borders)

## Warning: `as.tibble()` was deprecated in tibble 2.0.0.
## Please use `as_tibble()` instead.
## The signature and semantics have changed, see `?as_tibble`.

head(borders)

## # A tibble: 6 x 7
##   Port.Name    State Port.Code Border       Date        Measure            Value
##   <chr>        <chr>     <int> <chr>        <chr>       <chr>              <int>
## 1 Alcan        AK         3104 US-Canada B~ 2/1/2020 0~ Personal Vehicle ~  1414
## 2 Alcan        AK         3104 US-Canada B~ 2/1/2020 0~ Personal Vehicles    763
## 3 Alcan        AK         3104 US-Canada B~ 2/1/2020 0~ Truck Containers ~   412
## 4 Alcan        AK         3104 US-Canada B~ 2/1/2020 0~ Truck Containers ~   122
## 5 Alcan        AK         3104 US-Canada B~ 2/1/2020 0~ Trucks               545
## 6 Alexandria ~ NY          708 US-Canada B~ 2/1/2020 0~ Bus Passengers      1174

## ---------------SECTION TWO:  CLEANING DATA
## ---------------Subsection One:  Reformat Date Variable

## reformat Date column from chr to date
borders$Date <- 
  as.Date(borders$Date, "%m/%d/%y")


borders

## # A tibble: 355,511 x 7
##    Port.Name    State Port.Code Border       Date       Measure            Value
##    <chr>        <chr>     <int> <chr>        <date>     <chr>              <int>
##  1 Alcan        AK         3104 US-Canada B~ 2020-02-01 Personal Vehicle ~  1414
##  2 Alcan        AK         3104 US-Canada B~ 2020-02-01 Personal Vehicles    763
##  3 Alcan        AK         3104 US-Canada B~ 2020-02-01 Truck Containers ~   412
##  4 Alcan        AK         3104 US-Canada B~ 2020-02-01 Truck Containers ~   122
##  5 Alcan        AK         3104 US-Canada B~ 2020-02-01 Trucks               545
##  6 Alexandria ~ NY          708 US-Canada B~ 2020-02-01 Bus Passengers      1174
##  7 Alexandria ~ NY          708 US-Canada B~ 2020-02-01 Buses                 36
##  8 Alexandria ~ NY          708 US-Canada B~ 2020-02-01 Personal Vehicle ~ 68630
##  9 Alexandria ~ NY          708 US-Canada B~ 2020-02-01 Personal Vehicles  31696
## 10 Alexandria ~ NY          708 US-Canada B~ 2020-02-01 Truck Containers ~  1875
## # ... with 355,501 more rows

## ---------------Subsection Two: Change Column Names

##  rename measure and value columns
borders2 <- borders %>% 
  rename(
    Vehicle_Type = Measure,
    Headcount = Value
    )

borders2

## # A tibble: 355,511 x 7
##    Port.Name   State Port.Code Border     Date       Vehicle_Type      Headcount
##    <chr>       <chr>     <int> <chr>      <date>     <chr>                 <int>
##  1 Alcan       AK         3104 US-Canada~ 2020-02-01 Personal Vehicle~      1414
##  2 Alcan       AK         3104 US-Canada~ 2020-02-01 Personal Vehicles       763
##  3 Alcan       AK         3104 US-Canada~ 2020-02-01 Truck Containers~       412
##  4 Alcan       AK         3104 US-Canada~ 2020-02-01 Truck Containers~       122
##  5 Alcan       AK         3104 US-Canada~ 2020-02-01 Trucks                  545
##  6 Alexandria~ NY          708 US-Canada~ 2020-02-01 Bus Passengers         1174
##  7 Alexandria~ NY          708 US-Canada~ 2020-02-01 Buses                    36
##  8 Alexandria~ NY          708 US-Canada~ 2020-02-01 Personal Vehicle~     68630
##  9 Alexandria~ NY          708 US-Canada~ 2020-02-01 Personal Vehicles     31696
## 10 Alexandria~ NY          708 US-Canada~ 2020-02-01 Truck Containers~      1875
## # ... with 355,501 more rows

## ---------------Subsection Three: Remove Unused Variables

##  remove port name and code

borders3 <- select(borders2, State, Border, Date, Vehicle_Type, Headcount)

head(borders3)

## # A tibble: 6 x 5
##   State Border           Date       Vehicle_Type                Headcount
##   <chr> <chr>            <date>     <chr>                           <int>
## 1 AK    US-Canada Border 2020-02-01 Personal Vehicle Passengers      1414
## 2 AK    US-Canada Border 2020-02-01 Personal Vehicles                 763
## 3 AK    US-Canada Border 2020-02-01 Truck Containers Empty            412
## 4 AK    US-Canada Border 2020-02-01 Truck Containers Full             122
## 5 AK    US-Canada Border 2020-02-01 Trucks                            545
## 6 NY    US-Canada Border 2020-02-01 Bus Passengers                   1174

## ---------------Subsection Four:  Split up Date
## splitting up date so I can filter for just one yer

borders3$year <- year(ymd(borders3$Date))
borders3$month <- month(ymd(borders3$Date)) 
borders3$day <- day(ymd(borders3$Date))
borders3 <- select(borders3, State, Border, Headcount, Vehicle_Type, year)

borders3

## # A tibble: 355,511 x 5
##    State Border           Headcount Vehicle_Type                 year
##    <chr> <chr>                <int> <chr>                       <dbl>
##  1 AK    US-Canada Border      1414 Personal Vehicle Passengers  2020
##  2 AK    US-Canada Border       763 Personal Vehicles            2020
##  3 AK    US-Canada Border       412 Truck Containers Empty       2020
##  4 AK    US-Canada Border       122 Truck Containers Full        2020
##  5 AK    US-Canada Border       545 Trucks                       2020
##  6 NY    US-Canada Border      1174 Bus Passengers               2020
##  7 NY    US-Canada Border        36 Buses                        2020
##  8 NY    US-Canada Border     68630 Personal Vehicle Passengers  2020
##  9 NY    US-Canada Border     31696 Personal Vehicles            2020
## 10 NY    US-Canada Border      1875 Truck Containers Empty       2020
## # ... with 355,501 more rows

## ---------------Subsection Five: Filtering

## filter out Canadian Border and all years but 2019
borders4 <- borders3 %>%
  filter(Border == "US-Mexico Border" & year == 2019)

borders4

## # A tibble: 14,400 x 5
##    State Border           Headcount Vehicle_Type                 year
##    <chr> <chr>                <int> <chr>                       <dbl>
##  1 TX    US-Mexico Border         0 Train Passengers             2019
##  2 TX    US-Mexico Border         0 Rail Containers Full         2019
##  3 TX    US-Mexico Border     44416 Pedestrians                  2019
##  4 AZ    US-Mexico Border    507449 Personal Vehicle Passengers  2019
##  5 NM    US-Mexico Border     31225 Personal Vehicles            2019
##  6 TX    US-Mexico Border      1541 Buses                        2019
##  7 TX    US-Mexico Border       226 Buses                        2019
##  8 AZ    US-Mexico Border         0 Rail Containers Full         2019
##  9 TX    US-Mexico Border        24 Buses                        2019
## 10 TX    US-Mexico Border    126036 Personal Vehicles            2019
## # ... with 14,390 more rows

## ---------------Subsection Six:  Merging Redundant Data

## combining redundant vehicle types
borders4$Vehicle_Type <- gsub("Rail Containers Full", "Rail",  borders4$Vehicle_Type)
borders4$Vehicle_Type <- gsub("Rail Containers Empty", "Rail",  borders4$Vehicle_Type)
borders4$Vehicle_Type <- gsub("Train Passengers", "Rail",  borders4$Vehicle_Type)
borders4$Vehicle_Type <- gsub("Trains", "Rail",  borders4$Vehicle_Type)
borders4$Vehicle_Type <- gsub("Bus Passengers", "Buses",  borders4$Vehicle_Type)
borders4$Vehicle_Type <- gsub("Truck Containers Full", "Trucks",  borders4$Vehicle_Type)
borders4$Vehicle_Type <- gsub("Truck Containers Empty", "Trucks",  borders4$Vehicle_Type)
borders4$Vehicle_Type <- gsub("Personal Vehicle Passenger", "Vehicle Passenger",  borders4$Vehicle_Type)

borders4

## # A tibble: 14,400 x 5
##    State Border           Headcount Vehicle_Type        year
##    <chr> <chr>                <int> <chr>              <dbl>
##  1 TX    US-Mexico Border         0 Rail                2019
##  2 TX    US-Mexico Border         0 Rail                2019
##  3 TX    US-Mexico Border     44416 Pedestrians         2019
##  4 AZ    US-Mexico Border    507449 Vehicle Passengers  2019
##  5 NM    US-Mexico Border     31225 Personal Vehicles   2019
##  6 TX    US-Mexico Border      1541 Buses               2019
##  7 TX    US-Mexico Border       226 Buses               2019
##  8 AZ    US-Mexico Border         0 Rail                2019
##  9 TX    US-Mexico Border        24 Buses               2019
## 10 TX    US-Mexico Border    126036 Personal Vehicles   2019
## # ... with 14,390 more rows

## ---------------Subsection Seven:  Removing More Columns

## Since I filtered I don't need border or year columns so removing
borders5 <- select(borders4, State, Headcount, Vehicle_Type)

borders5

## # A tibble: 14,400 x 3
##    State Headcount Vehicle_Type      
##    <chr>     <int> <chr>             
##  1 TX            0 Rail              
##  2 TX            0 Rail              
##  3 TX        44416 Pedestrians       
##  4 AZ       507449 Vehicle Passengers
##  5 NM        31225 Personal Vehicles 
##  6 TX         1541 Buses             
##  7 TX          226 Buses             
##  8 AZ            0 Rail              
##  9 TX           24 Buses             
## 10 TX       126036 Personal Vehicles 
## # ... with 14,390 more rows

## --------------------SECTION THREE:  Plot the Chart

ggplot(borders5, aes(x = reorder(State, Headcount), y = Headcount/10000, fill = Vehicle_Type)) + 
  ## make columns a little smaller
  geom_col(width = 0.75) +
  ## use RColorbrewer pallete
  scale_fill_brewer(palette = "Spectral") +
  ## Labels
  xlab("State") +
  ylab("Number of Persons Crossing (in 10,000)") +
  ggtitle("Bar Graph of Mexico Border Crossings in 2019")

## creating a summary table to get a better sense of the numbers graphed

borders6 <- borders5%>% 
  group_by(State, Vehicle_Type) %>%
  summarise(Headcount = sum(Headcount))

## `summarise()` has grouped output by 'State'. You can override using the `.groups` argument.

borders6

## # A tibble: 24 x 3
## # Groups:   State [4]
##    State Vehicle_Type       Headcount
##    <chr> <chr>                  <int>
##  1 AZ    Buses                 247489
##  2 AZ    Pedestrians         31086720
##  3 AZ    Personal Vehicles   36415863
##  4 AZ    Rail                  126450
##  5 AZ    Trucks               2570883
##  6 AZ    Vehicle Passengers  93853007
##  7 CA    Buses                4225796
##  8 CA    Pedestrians         63119899
##  9 CA    Personal Vehicles   97718020
## 10 CA    Rail                   66265
## # ... with 14 more rows

The data set “borders” contains incoming US border crossing data from January 1996 to February 2020. The source of the data is the Bureau of Transportation Statistics, and the data set’s owner is Divyansh Agrawal. The seven variables included are:

Port.Name - Port Name State - State in US crossing occurred Port.Code - Port Code Border - Mexico or Canada border Date - Date of crossing Measure - Vehicle used to cross Value - Number of persons

For the purpose of this project we will focus on the variables: State, Border, Date, Measure, and Value. Except for Value, which is an integer, all others are in character format. The ultimate goal is to use these variables to show which states had the most border crossings on the Mexican border, and by what method in 2019.

Using a geom stacked bar chart method, I assigned the x-axis the US states on the Mexican border, and the y-axis the number of border crossings in the 10,000. The graph shows the state with the most border crossing in 2019 is Texas, followed by California, Arizona, and New Mexico. In the three largest states the preferred method for crossing is by being a passenger in a vehicle, followed by driving a personal vehicle, pedestrian, trucks, buses, and lastly by train or rail. This pattern holds true for all of the states.

Other topics that present further analysis are which months saw the most crossings, and comparisons to the Canadian border data. The first topic could help predict when there will be future increased crossings, and the second could challenge popular beliefs over which US border sees the most crossings each year.

Data 110 - Project 1: US Border Crossings

Kris Sutton

6/17/2021