##-----------SECTION ONE: LOAD DATA & LIBRARIES
## set working director and read borders data set
setwd("C:/Users/kris.sutton/OneDrive - Government of The District of Columbia/Desktop/Personal/Data 110/R Studio/Datasets")
borders <- read.csv("border_crossing.csv")
library(lubridate)
## Warning: package 'lubridate' was built under R version 4.0.5
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(readr)
## Warning: package 'readr' was built under R version 4.0.5
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.0.5
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.4 v dplyr 1.0.6
## v tibble 3.1.2 v stringr 1.4.0
## v tidyr 1.1.3 v forcats 0.5.1
## v purrr 0.3.4
## Warning: package 'tibble' was built under R version 4.0.5
## Warning: package 'tidyr' was built under R version 4.0.5
## Warning: package 'purrr' was built under R version 4.0.5
## Warning: package 'dplyr' was built under R version 4.0.5
## Warning: package 'stringr' was built under R version 4.0.5
## Warning: package 'forcats' was built under R version 4.0.5
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x lubridate::as.difftime() masks base::as.difftime()
## x lubridate::date() masks base::date()
## x dplyr::filter() masks stats::filter()
## x lubridate::intersect() masks base::intersect()
## x dplyr::lag() masks stats::lag()
## x lubridate::setdiff() masks base::setdiff()
## x lubridate::union() masks base::union()
library(dplyr)
library(dplyr)
library(streamgraph)
library(alluvial)
## Warning: package 'alluvial' was built under R version 4.0.5
library(ggplot2)
library(ggalluvial)
## Warning: package 'ggalluvial' was built under R version 4.0.5
library(tibble)
borders <- as.tibble(borders)
## Warning: `as.tibble()` was deprecated in tibble 2.0.0.
## Please use `as_tibble()` instead.
## The signature and semantics have changed, see `?as_tibble`.
head(borders)
## # A tibble: 6 x 7
## Port.Name State Port.Code Border Date Measure Value
## <chr> <chr> <int> <chr> <chr> <chr> <int>
## 1 Alcan AK 3104 US-Canada B~ 2/1/2020 0~ Personal Vehicle ~ 1414
## 2 Alcan AK 3104 US-Canada B~ 2/1/2020 0~ Personal Vehicles 763
## 3 Alcan AK 3104 US-Canada B~ 2/1/2020 0~ Truck Containers ~ 412
## 4 Alcan AK 3104 US-Canada B~ 2/1/2020 0~ Truck Containers ~ 122
## 5 Alcan AK 3104 US-Canada B~ 2/1/2020 0~ Trucks 545
## 6 Alexandria ~ NY 708 US-Canada B~ 2/1/2020 0~ Bus Passengers 1174
## ---------------SECTION TWO: CLEANING DATA
## ---------------Subsection One: Reformat Date Variable
## reformat Date column from chr to date
borders$Date <-
as.Date(borders$Date, "%m/%d/%y")
borders
## # A tibble: 355,511 x 7
## Port.Name State Port.Code Border Date Measure Value
## <chr> <chr> <int> <chr> <date> <chr> <int>
## 1 Alcan AK 3104 US-Canada B~ 2020-02-01 Personal Vehicle ~ 1414
## 2 Alcan AK 3104 US-Canada B~ 2020-02-01 Personal Vehicles 763
## 3 Alcan AK 3104 US-Canada B~ 2020-02-01 Truck Containers ~ 412
## 4 Alcan AK 3104 US-Canada B~ 2020-02-01 Truck Containers ~ 122
## 5 Alcan AK 3104 US-Canada B~ 2020-02-01 Trucks 545
## 6 Alexandria ~ NY 708 US-Canada B~ 2020-02-01 Bus Passengers 1174
## 7 Alexandria ~ NY 708 US-Canada B~ 2020-02-01 Buses 36
## 8 Alexandria ~ NY 708 US-Canada B~ 2020-02-01 Personal Vehicle ~ 68630
## 9 Alexandria ~ NY 708 US-Canada B~ 2020-02-01 Personal Vehicles 31696
## 10 Alexandria ~ NY 708 US-Canada B~ 2020-02-01 Truck Containers ~ 1875
## # ... with 355,501 more rows
## ---------------Subsection Two: Change Column Names
## rename measure and value columns
borders2 <- borders %>%
rename(
Vehicle_Type = Measure,
Headcount = Value
)
borders2
## # A tibble: 355,511 x 7
## Port.Name State Port.Code Border Date Vehicle_Type Headcount
## <chr> <chr> <int> <chr> <date> <chr> <int>
## 1 Alcan AK 3104 US-Canada~ 2020-02-01 Personal Vehicle~ 1414
## 2 Alcan AK 3104 US-Canada~ 2020-02-01 Personal Vehicles 763
## 3 Alcan AK 3104 US-Canada~ 2020-02-01 Truck Containers~ 412
## 4 Alcan AK 3104 US-Canada~ 2020-02-01 Truck Containers~ 122
## 5 Alcan AK 3104 US-Canada~ 2020-02-01 Trucks 545
## 6 Alexandria~ NY 708 US-Canada~ 2020-02-01 Bus Passengers 1174
## 7 Alexandria~ NY 708 US-Canada~ 2020-02-01 Buses 36
## 8 Alexandria~ NY 708 US-Canada~ 2020-02-01 Personal Vehicle~ 68630
## 9 Alexandria~ NY 708 US-Canada~ 2020-02-01 Personal Vehicles 31696
## 10 Alexandria~ NY 708 US-Canada~ 2020-02-01 Truck Containers~ 1875
## # ... with 355,501 more rows
## ---------------Subsection Three: Remove Unused Variables
## remove port name and code
borders3 <- select(borders2, State, Border, Date, Vehicle_Type, Headcount)
head(borders3)
## # A tibble: 6 x 5
## State Border Date Vehicle_Type Headcount
## <chr> <chr> <date> <chr> <int>
## 1 AK US-Canada Border 2020-02-01 Personal Vehicle Passengers 1414
## 2 AK US-Canada Border 2020-02-01 Personal Vehicles 763
## 3 AK US-Canada Border 2020-02-01 Truck Containers Empty 412
## 4 AK US-Canada Border 2020-02-01 Truck Containers Full 122
## 5 AK US-Canada Border 2020-02-01 Trucks 545
## 6 NY US-Canada Border 2020-02-01 Bus Passengers 1174
## ---------------Subsection Four: Split up Date
## splitting up date so I can filter for just one yer
borders3$year <- year(ymd(borders3$Date))
borders3$month <- month(ymd(borders3$Date))
borders3$day <- day(ymd(borders3$Date))
borders3 <- select(borders3, State, Border, Headcount, Vehicle_Type, year)
borders3
## # A tibble: 355,511 x 5
## State Border Headcount Vehicle_Type year
## <chr> <chr> <int> <chr> <dbl>
## 1 AK US-Canada Border 1414 Personal Vehicle Passengers 2020
## 2 AK US-Canada Border 763 Personal Vehicles 2020
## 3 AK US-Canada Border 412 Truck Containers Empty 2020
## 4 AK US-Canada Border 122 Truck Containers Full 2020
## 5 AK US-Canada Border 545 Trucks 2020
## 6 NY US-Canada Border 1174 Bus Passengers 2020
## 7 NY US-Canada Border 36 Buses 2020
## 8 NY US-Canada Border 68630 Personal Vehicle Passengers 2020
## 9 NY US-Canada Border 31696 Personal Vehicles 2020
## 10 NY US-Canada Border 1875 Truck Containers Empty 2020
## # ... with 355,501 more rows
## ---------------Subsection Five: Filtering
## filter out Canadian Border and all years but 2019
borders4 <- borders3 %>%
filter(Border == "US-Mexico Border" & year == 2019)
borders4
## # A tibble: 14,400 x 5
## State Border Headcount Vehicle_Type year
## <chr> <chr> <int> <chr> <dbl>
## 1 TX US-Mexico Border 0 Train Passengers 2019
## 2 TX US-Mexico Border 0 Rail Containers Full 2019
## 3 TX US-Mexico Border 44416 Pedestrians 2019
## 4 AZ US-Mexico Border 507449 Personal Vehicle Passengers 2019
## 5 NM US-Mexico Border 31225 Personal Vehicles 2019
## 6 TX US-Mexico Border 1541 Buses 2019
## 7 TX US-Mexico Border 226 Buses 2019
## 8 AZ US-Mexico Border 0 Rail Containers Full 2019
## 9 TX US-Mexico Border 24 Buses 2019
## 10 TX US-Mexico Border 126036 Personal Vehicles 2019
## # ... with 14,390 more rows
## ---------------Subsection Six: Merging Redundant Data
## combining redundant vehicle types
borders4$Vehicle_Type <- gsub("Rail Containers Full", "Rail", borders4$Vehicle_Type)
borders4$Vehicle_Type <- gsub("Rail Containers Empty", "Rail", borders4$Vehicle_Type)
borders4$Vehicle_Type <- gsub("Train Passengers", "Rail", borders4$Vehicle_Type)
borders4$Vehicle_Type <- gsub("Trains", "Rail", borders4$Vehicle_Type)
borders4$Vehicle_Type <- gsub("Bus Passengers", "Buses", borders4$Vehicle_Type)
borders4$Vehicle_Type <- gsub("Truck Containers Full", "Trucks", borders4$Vehicle_Type)
borders4$Vehicle_Type <- gsub("Truck Containers Empty", "Trucks", borders4$Vehicle_Type)
borders4$Vehicle_Type <- gsub("Personal Vehicle Passenger", "Vehicle Passenger", borders4$Vehicle_Type)
borders4
## # A tibble: 14,400 x 5
## State Border Headcount Vehicle_Type year
## <chr> <chr> <int> <chr> <dbl>
## 1 TX US-Mexico Border 0 Rail 2019
## 2 TX US-Mexico Border 0 Rail 2019
## 3 TX US-Mexico Border 44416 Pedestrians 2019
## 4 AZ US-Mexico Border 507449 Vehicle Passengers 2019
## 5 NM US-Mexico Border 31225 Personal Vehicles 2019
## 6 TX US-Mexico Border 1541 Buses 2019
## 7 TX US-Mexico Border 226 Buses 2019
## 8 AZ US-Mexico Border 0 Rail 2019
## 9 TX US-Mexico Border 24 Buses 2019
## 10 TX US-Mexico Border 126036 Personal Vehicles 2019
## # ... with 14,390 more rows
## ---------------Subsection Seven: Removing More Columns
## Since I filtered I don't need border or year columns so removing
borders5 <- select(borders4, State, Headcount, Vehicle_Type)
borders5
## # A tibble: 14,400 x 3
## State Headcount Vehicle_Type
## <chr> <int> <chr>
## 1 TX 0 Rail
## 2 TX 0 Rail
## 3 TX 44416 Pedestrians
## 4 AZ 507449 Vehicle Passengers
## 5 NM 31225 Personal Vehicles
## 6 TX 1541 Buses
## 7 TX 226 Buses
## 8 AZ 0 Rail
## 9 TX 24 Buses
## 10 TX 126036 Personal Vehicles
## # ... with 14,390 more rows
## --------------------SECTION THREE: Plot the Chart
ggplot(borders5, aes(x = reorder(State, Headcount), y = Headcount/10000, fill = Vehicle_Type)) +
## make columns a little smaller
geom_col(width = 0.75) +
## use RColorbrewer pallete
scale_fill_brewer(palette = "Spectral") +
## Labels
xlab("State") +
ylab("Number of Persons Crossing (in 10,000)") +
ggtitle("Bar Graph of Mexico Border Crossings in 2019")
## creating a summary table to get a better sense of the numbers graphed
borders6 <- borders5%>%
group_by(State, Vehicle_Type) %>%
summarise(Headcount = sum(Headcount))
## `summarise()` has grouped output by 'State'. You can override using the `.groups` argument.
borders6
## # A tibble: 24 x 3
## # Groups: State [4]
## State Vehicle_Type Headcount
## <chr> <chr> <int>
## 1 AZ Buses 247489
## 2 AZ Pedestrians 31086720
## 3 AZ Personal Vehicles 36415863
## 4 AZ Rail 126450
## 5 AZ Trucks 2570883
## 6 AZ Vehicle Passengers 93853007
## 7 CA Buses 4225796
## 8 CA Pedestrians 63119899
## 9 CA Personal Vehicles 97718020
## 10 CA Rail 66265
## # ... with 14 more rows
The data set “borders” contains incoming US border crossing data from January 1996 to February 2020. The source of the data is the Bureau of Transportation Statistics, and the data set’s owner is Divyansh Agrawal. The seven variables included are:
Port.Name - Port Name State - State in US crossing occurred Port.Code - Port Code Border - Mexico or Canada border Date - Date of crossing Measure - Vehicle used to cross Value - Number of persons
For the purpose of this project we will focus on the variables: State, Border, Date, Measure, and Value. Except for Value, which is an integer, all others are in character format. The ultimate goal is to use these variables to show which states had the most border crossings on the Mexican border, and by what method in 2019.
Using a geom stacked bar chart method, I assigned the x-axis the US states on the Mexican border, and the y-axis the number of border crossings in the 10,000. The graph shows the state with the most border crossing in 2019 is Texas, followed by California, Arizona, and New Mexico. In the three largest states the preferred method for crossing is by being a passenger in a vehicle, followed by driving a personal vehicle, pedestrian, trucks, buses, and lastly by train or rail. This pattern holds true for all of the states.
Other topics that present further analysis are which months saw the most crossings, and comparisons to the Canadian border data. The first topic could help predict when there will be future increased crossings, and the second could challenge popular beliefs over which US border sees the most crossings each year.