When we think about US borders, we often think about walls, immigration, and defense, but sometimes forget how porous these boundaries are. Trade and travel happens every day across the Mexican and Canadian borders. Consumer goods are shipped into the country on trucks, people travel by bus on on foot to vacation in another country, and workers drive across the perform duties on both sides of the divide.
For this second project I chose to re-use the Kaggle dataset I found on border crossings for our first project. Here is a refresher on the data, taken from the Kaggle.com description:
“The Bureau of Transportation Statistics (BTS) Border Crossing Data provide summary statistics for inbound crossings at the U.S.-Canada and the U.S.-Mexico border at the port level. Data are available for trucks, trains, containers, buses, personal vehicles, passengers, and pedestrians. Border crossing data are collected at ports of entry by U.S. Customs and Border Protection (CBP). The data reflect the number of vehicles, containers, passengers or pedestrians entering the United States. CBP does not collect comparable data on outbound crossings. Users seeking data on outbound counts may therefore want to review data from individual bridge operators, border state governments, or the Mexican and Canadian governments.”
My goal in this second project is to dive deeper in the numbers and explore border crossings beyond their role in immigration.
Step 1: Import dataset to R and examine its basic properties
#read in data
library(tidyverse)
## ── Attaching packages ────────────────────────────────── tidyverse 1.3.0 ──
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.3
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 1.0.0 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
bordercrossings <-read.csv("Border_Crossing_Entry_Data_From_Kaggle.csv")
head(bordercrossings)
## Port.Name State Port.Code Border Date
## 1 Calexico East California 2507 US-Mexico Border 03/01/2019 12:00:00 AM
## 2 Van Buren Maine 108 US-Canada Border 03/01/2019 12:00:00 AM
## 3 Otay Mesa California 2506 US-Mexico Border 03/01/2019 12:00:00 AM
## 4 Nogales Arizona 2604 US-Mexico Border 03/01/2019 12:00:00 AM
## 5 Trout River New York 715 US-Canada Border 03/01/2019 12:00:00 AM
## 6 Madawaska Maine 109 US-Canada Border 03/01/2019 12:00:00 AM
## Measure Value Location
## 1 Trucks 34447 POINT (-115.48433000000001 32.67524)
## 2 Rail Containers Full 428 POINT (-67.94271 47.16207)
## 3 Trucks 81217 POINT (-117.05333 32.57333)
## 4 Trains 62 POINT (-110.93361 31.340279999999996)
## 5 Personal Vehicle Passengers 16377 POINT (-73.44253 44.990010000000005)
## 6 Trucks 179 POINT (-68.3271 47.35446)
#examine the structure of the data
str(bordercrossings)
## 'data.frame': 346733 obs. of 8 variables:
## $ Port.Name: Factor w/ 116 levels "Alcan","Alexandria Bay",..: 19 108 73 65 106 57 74 84 80 22 ...
## $ State : Factor w/ 15 levels "Alaska","Arizona",..: 3 5 3 2 10 5 11 13 11 10 ...
## $ Port.Code: int 2507 108 2506 2604 715 109 3401 2309 3403 712 ...
## $ Border : Factor w/ 2 levels "US-Canada Border",..: 2 1 2 2 1 1 1 2 1 1 ...
## $ Date : Factor w/ 279 levels "01/01/1996 12:00:00 AM",..: 72 72 72 72 72 72 72 72 72 72 ...
## $ Measure : Factor w/ 12 levels "Bus Passengers",..: 12 7 12 9 4 12 1 10 6 12 ...
## $ Value : int 34447 428 81217 62 16377 179 1054 1808 6685 24759 ...
## $ Location : Factor w/ 224 levels "POINT (-100.05 49)",..: 75 142 88 54 162 143 198 205 17 163 ...
summary(bordercrossings)
## Port.Name State Port.Code
## Eastport : 5541 North Dakota: 57071 Min. : 101
## Buffalo-Niagara Falls : 3348 Washington : 44677 1st Qu.:2304
## Calais : 3348 Montana : 38154 Median :3013
## Calexico East : 3348 Maine : 38136 Mean :2456
## Champlain-Rouses Point: 3348 Texas : 35754 3rd Qu.:3402
## El Paso : 3348 Minnesota : 23135 Max. :4105
## (Other) :324452 (Other) :109806
## Border Date
## US-Canada Border:266187 05/01/2010 12:00:00 AM: 1356
## US-Mexico Border: 80546 06/01/2010 12:00:00 AM: 1356
## 07/01/2010 12:00:00 AM: 1356
## 08/01/2010 12:00:00 AM: 1356
## 09/01/2010 12:00:00 AM: 1356
## 10/01/2010 12:00:00 AM: 1356
## (Other) :338597
## Measure Value
## Personal Vehicles : 30219 Min. : 0
## Personal Vehicle Passengers: 30196 1st Qu.: 0
## Trucks : 29856 Median : 90
## Truck Containers Empty : 29757 Mean : 28188
## Truck Containers Full : 29694 3rd Qu.: 2483
## Buses : 28822 Max. :4447374
## (Other) :168189
## Location
## POINT (-83.04 42.32) : 3069
## POINT (-102.55 49) : 3060
## POINT (-106.45 31.76): 3060
## POINT (-110.94 31.33): 3060
## POINT (-111.96 49) : 3060
## POINT (-115.39 32.67): 3060
## (Other) :328364
This dataset is huge, with dimensions of 346733 x8, unsurprising since it represents monthly totals for every category of land border crossing at each port since January 1, 1996.
Step 3-Dive deeper into the data with preliminary plots
At this point I still have a dataframe with 346733 rows, an unweidly amount to work with.
I decided to filter in various ways and construct a few preliminary plots to get an idea of where to explore more.
Prelim Plot 1-Each type of crossings each year
plot1crossings <- bordercrossingsclean %>%
select(measure, date, value)
plot1crossings$date <- as.factor(substring(bordercrossingsclean$date,1,4))
plot1crossings <- plot1crossings %>%
group_by(date, measure) %>%
summarise(sum=sum(value))
head(plot1crossings)
## # A tibble: 6 x 3
## # Groups: date [1]
## date measure sum
## <fct> <chr> <dbl>
## 1 1996 Bus Passengers 5813778
## 2 1996 Buses 292789
## 3 1996 Pedestrians 34717351
## 4 1996 Personal Vehicle Passengers 272593220
## 5 1996 Personal Vehicles 101960373
## 6 1996 Rail Containers Empty 268134
library(ggplot2)
plot1crossings %>%
ggplot(aes(x=date, y=sum)) +
geom_point(aes(color=measure)) +
ggtitle("Types of Border Crossings") +
xlab ("Year") +
theme(axis.text.x = element_text(angle = 90)) +
scale_y_continuous(name="Crossings",trans = 'log2')

plot1crossings %>%
ggplot(aes(x=date, y=sum)) +
geom_point(aes(color=measure)) +
ggtitle("Types of Border Crossings") +
xlab ("Year") +
theme(axis.text.x = element_text(angle = 90)) +
scale_y_continuous(name="Crossings",trans = 'log2')

Prelim Plot 2- Total crossings over time for top 5 states (top 5 being total across all time)
plot2crossings <- bordercrossingsclean %>%
select (state, value)
plot2crossings <- plot2crossings %>%
group_by(state) %>%
summarise(sum=sum(value))
head(plot2crossings)
## # A tibble: 6 x 2
## state sum
## <chr> <dbl>
## 1 Alaska 13920348
## 2 Arizona 888075092
## 3 California 2499521716
## 4 Idaho 20998428
## 5 Maine 231249782
## 6 Michigan 753167571
Top 5 states: Texas, California, Arizona, New York, Michigan
plot2crossingstop5states <- bordercrossingsclean %>%
select(state, date, value)
plot2crossingstop5states$date <- as.factor(substring(plot2crossingstop5states$date,1,4))
plot2crossingstop5states <- plot2crossingstop5states %>%
filter(state %in% c("Texas", "California", "Arizona", "New York", "Michigan")) %>%
group_by(state, date) %>%
summarise(sum=sum(value))
head(plot2crossingstop5states)
## # A tibble: 6 x 3
## # Groups: state [1]
## state date sum
## <chr> <fct> <dbl>
## 1 Arizona 1996 38038999
## 2 Arizona 1997 40510361
## 3 Arizona 1998 41441325
## 4 Arizona 1999 44309727
## 5 Arizona 2000 46455388
## 6 Arizona 2001 43736679
plot2crossingstop5states %>%
ggplot(aes(x=date, y=sum)) +
geom_point(aes(color=state)) +
ggtitle("Aggregated Border Crossings by Year") +
xlab ("Year") +
theme(axis.text.x = element_text(angle = 90)) +
scale_y_continuous(name="Crossings", trans = 'log10')

Step 4- Making data more workable
Because there are only 3 months of data for 2019, yearly totals give artifically low numbers for this year. Instead, I will calculate monthly averages by dividing each total by the number of months represented in the year it was taken. For every year except 2019 the montly total will be n/12, but for 2019 it will be 2/3. I will split the dataset into two and then join them back again to perform this operation.
For 2019:
# Extracting 2019 data from the dataset using row numbers
threemonthdata <- bordercrossingsclean %>%
slice(1:2364)
#Creating new monthly average column based on 3 months in 2019
threemonthdata <- threemonthdata %>%
mutate(monthly.average = value/3)
For all other years:
# Extracting all other years' data from the dataset using row numbers
twelvemonthdata <- bordercrossingsclean %>%
slice(2365:346733)
#Creating new monthly average column based on 12 months in all other years
twelvemonthdata <- twelvemonthdata %>%
mutate(monthly.average = value/12)
Prelim Plot 3- Personal Vehicle Passengers Over Time and at Different Ports of Entry
plot3crossings <- monthlyavgbordercrossings %>%
select(portname, measure, monthly.average, date) %>%
filter(measure == "Personal Vehicle Passengers")
plot3crossings$date <- as.factor(substring(plot3crossings$date,1,4))
Top 5 ports (found above: El Paso, San Ysidro, Hidalgo, Buffalo-Niagara Falls, Laredo)
plot3crossingstoplot <- plot3crossings %>%
filter(portname %in% c("El Paso", "San Ysidro", "Hidalgo", "Laredo", "Buffalo-Niagara Falls")) %>%
group_by(portname, date) %>%
summarise(sum=sum(monthly.average))
plot3crossingstoplot %>%
ggplot(aes(x=date, y=sum)) +
geom_point(aes(color=portname)) +
ggtitle("Monthly Average Crossings Each Year") +
xlab ("Year") +
theme(axis.text.x = element_text(angle = 90)) +
scale_y_continuous(name="Crossings", trans = 'log2')
## Warning: Transformation introduced infinite values in continuous y-axis

Now let’s try this as a heatmap
plot3crossingstoplot %>%
ggplot(aes(date, portname, fill = sum)) +
geom_tile() +
scale_fill_gradientn(colors = hcl.colors(120)) +
ggtitle("Monthly Average Crossings Each Year") +
xlab ("Year") +
theme(axis.text.x = element_text(angle = 90)) +
ylab ("Port of Entry")

Step 5-Going further with the data: GIS visualization
#call back necessary packages
library("tmap")
library("tmaptools")
library("sf")
## Linking to GEOS 3.7.2, GDAL 2.4.2, PROJ 5.2.0
library("leaflet")
library("leaflet.extras")
# Call in shapefile to create map
setwd("~/Desktop/Datsets_for_DATA110/border shapefiles")
bordershapefile <- "border_x.shp"
usborders <- read_shape(file= bordershapefile, as.sf = TRUE)
## Warning: This function is deprecated and has been migrated to github.com/
## mtennekes/oldtmaptools
str(usborders)
## Classes 'sf' and 'data.frame': 171 obs. of 21 variables:
## $ PortCode : Factor w/ 113 levels "0101","0103",..: 113 112 111 25 25 25 25 22 19 20 ...
## $ PortName : Factor w/ 171 levels "Alburg","Alburg Springs",..: 148 129 45 103 105 104 22 4 114 95 ...
## $ State : Factor w/ 15 levels "AK","AZ","CA",..: 6 6 6 12 12 12 12 12 12 12 ...
## $ CP_Name : Factor w/ 112 levels "Alcan","Alexandria Bay",..: 94 77 27 15 15 15 15 2 68 57 ...
## $ UNIQUEID : Factor w/ 171 levels "0101","0103",..: 170 169 171 59 58 57 56 45 42 43 ...
## $ Year : num 2013 2013 0 0 0 ...
## $ Trucks : num 39777 731165 0 0 0 ...
## $ Loaded_Tru: num 35635 565265 0 0 0 ...
## $ Empty_Truc: num 7854 158460 0 0 0 ...
## $ Trains : num 352 3607 0 0 0 ...
## $ Loaded_Rai: num 15521 243828 0 0 0 ...
## $ Empty_Rai : num 16361 181946 0 0 0 ...
## $ Train_Pass: num 715 9044 0 0 0 ...
## $ Buses : num 2594 1837 0 0 0 ...
## $ Bus_Passen: num 41531 74114 0 0 0 ...
## $ Personal_V: num 1003967 2037430 0 0 0 ...
## $ PV_Passeng: num 1856527 4018331 0 0 0 ...
## $ Pedestrian: num 0 0 0 0 0 ...
## $ CLASS : int 1 1 2 2 2 2 1 1 1 1 ...
## $ STFIPS : Factor w/ 15 levels "02","04","06",..: 6 6 6 11 11 11 11 11 11 11 ...
## $ geometry :sfc_POINT of length 171; first list element: 'XY' num -84.4 46.5
## - attr(*, "sf_column")= chr "geometry"
## - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA NA ...
## ..- attr(*, "names")= chr "PortCode" "PortName" "State" "CP_Name" ...
This shapefiled contained a lot of unecessary information that I thought would be cumbersome later on, so I once again used the dplyr select feature to get only the Port Code, Port Name, and State. The “Geometry” column, which I assume allows for mapping, remained.
usborders2 <- dplyr::select(usborders, PortCode:State)
str(usborders2)
## Classes 'sf' and 'data.frame': 171 obs. of 4 variables:
## $ PortCode: Factor w/ 113 levels "0101","0103",..: 113 112 111 25 25 25 25 22 19 20 ...
## $ PortName: Factor w/ 171 levels "Alburg","Alburg Springs",..: 148 129 45 103 105 104 22 4 114 95 ...
## $ State : Factor w/ 15 levels "AK","AZ","CA",..: 6 6 6 12 12 12 12 12 12 12 ...
## $ geometry:sfc_POINT of length 171; first list element: 'XY' num -84.4 46.5
## - attr(*, "sf_column")= chr "geometry"
## - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA
## ..- attr(*, "names")= chr "PortCode" "PortName" "State"
Finally, I used the Quick Thematic Map command to display the shapefile, just to check that everything looked as it should.
qtm(usborders2)

Step 5- Selecting the data I want to use for my final visualization
I ultimately want to focus on the US-Canda border, specifically ports where there have been substantial increases or decreases in crossings over the years. For now, I will create a test map using the original dataset.
thisyearbcs <- bordercrossingsclean %>%
select(portname, state, portcode, value) %>%
group_by(portname,state, portcode) %>%
summarise(sum=sum(value))
head(thisyearbcs)
## # A tibble: 6 x 4
## # Groups: portname, state [6]
## portname state portcode sum
## <chr> <chr> <chr> <dbl>
## 1 Alcan Alaska 3104 4225089
## 2 Alexandria Bay New York 0708 62067849
## 3 Algonac Michigan 3814 63825
## 4 Ambrose North Dakota 3410 210157
## 5 Anacortes Washington 3010 1596779
## 6 Andrade California 2502 72889447
str(usborders2$portcode)
## Factor w/ 113 levels "0101","0103",..: 113 112 111 25 25 25 25 22 19 20 ...
str(thisyearbcs$portcode)
## chr [1:117] "3104" "0708" "3814" "3410" "3010" "2502" "3413" "0112" "3424" ...
usborders3 <- usborders2 %>%
group_by(portcode)
usborders3
## Simple feature collection with 171 features and 3 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: -141.0014 ymin: 25.88342 xmax: -66.98008 ymax: 64.08552
## epsg (SRID): 4326
## proj4string: +proj=longlat +datum=WGS84 +no_defs
## # A tibble: 171 x 4
## # Groups: portcode [113]
## portcode portname state geometry
## <fct> <fct> <fct> <POINT [°]>
## 1 3803 Sault Ste. Marie MI (-84.36072 46.50844)
## 2 3802 Port Huron MI (-82.4236 42.99852)
## 3 3801 Detroit-Ambassador Bridge MI (-83.07406 42.31182)
## 4 0901 Niagara Falls-Lewiston Bridge NY (-79.04446 43.15305)
## 5 0901 Niagara Falls-Whirlpool Bridge NY (-79.05834 43.10924)
## 6 0901 Niagara Falls-Rainbow Bridge NY (-79.0677 43.09018)
## 7 0901 Buffalo-Peace Bridge NY (-78.90595 42.90694)
## 8 0708 Alexandria Bay NY (-75.98359 44.34723)
## 9 0701 Ogdensburg NY (-75.45775 44.73309)
## 10 0704 Massena NY (-74.7395 44.9906)
## # … with 161 more rows
Issue above- there appear to be multiple names for a port using the same code in this shapefile. Needs more investigation. Causing 171 entries for this dataset instead of 117
#Did this before, hence character return
#usborders2$portcode <- as.character(usborders2$portcode)
usborders2 <- usborders2[order(usborders2$portcode),]
thisyearbcs <- thisyearbcs[order(thisyearbcs$portname),]
identical(usborders2$portcode,thisyearbcs$portname)
## [1] FALSE