Project 3: Border Crossings in Depth

When we think about US borders, we often think about walls, immigration, and defense, but sometimes forget how porous these boundaries are. Trade and travel happens every day across the Mexican and Canadian borders. Consumer goods are shipped into the country on trucks, people travel by bus on on foot to vacation in another country, and workers drive across the perform duties on both sides of the divide.

For this second project I chose to re-use the Kaggle dataset I found on border crossings for our first project. Here is a refresher on the data, taken from the Kaggle.com description:

“The Bureau of Transportation Statistics (BTS) Border Crossing Data provide summary statistics for inbound crossings at the U.S.-Canada and the U.S.-Mexico border at the port level. Data are available for trucks, trains, containers, buses, personal vehicles, passengers, and pedestrians. Border crossing data are collected at ports of entry by U.S. Customs and Border Protection (CBP). The data reflect the number of vehicles, containers, passengers or pedestrians entering the United States. CBP does not collect comparable data on outbound crossings. Users seeking data on outbound counts may therefore want to review data from individual bridge operators, border state governments, or the Mexican and Canadian governments.”

My goal in this second project is to dive deeper in the numbers and explore border crossings beyond their role in immigration.

Step 1: Import dataset to R and examine its basic properties

#read in data
library(tidyverse)

## ── Attaching packages ────────────────────────────────── tidyverse 1.3.0 ──

## ✔ ggplot2 3.2.1     ✔ purrr   0.3.3
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0

## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

bordercrossings <-read.csv("Border_Crossing_Entry_Data_From_Kaggle.csv")

head(bordercrossings)

##       Port.Name      State Port.Code           Border                   Date
## 1 Calexico East California      2507 US-Mexico Border 03/01/2019 12:00:00 AM
## 2     Van Buren      Maine       108 US-Canada Border 03/01/2019 12:00:00 AM
## 3     Otay Mesa California      2506 US-Mexico Border 03/01/2019 12:00:00 AM
## 4       Nogales    Arizona      2604 US-Mexico Border 03/01/2019 12:00:00 AM
## 5   Trout River   New York       715 US-Canada Border 03/01/2019 12:00:00 AM
## 6     Madawaska      Maine       109 US-Canada Border 03/01/2019 12:00:00 AM
##                       Measure Value                              Location
## 1                      Trucks 34447  POINT (-115.48433000000001 32.67524)
## 2        Rail Containers Full   428            POINT (-67.94271 47.16207)
## 3                      Trucks 81217           POINT (-117.05333 32.57333)
## 4                      Trains    62 POINT (-110.93361 31.340279999999996)
## 5 Personal Vehicle Passengers 16377  POINT (-73.44253 44.990010000000005)
## 6                      Trucks   179             POINT (-68.3271 47.35446)

#examine the structure of the data
str(bordercrossings)

## 'data.frame':    346733 obs. of  8 variables:
##  $ Port.Name: Factor w/ 116 levels "Alcan","Alexandria Bay",..: 19 108 73 65 106 57 74 84 80 22 ...
##  $ State    : Factor w/ 15 levels "Alaska","Arizona",..: 3 5 3 2 10 5 11 13 11 10 ...
##  $ Port.Code: int  2507 108 2506 2604 715 109 3401 2309 3403 712 ...
##  $ Border   : Factor w/ 2 levels "US-Canada Border",..: 2 1 2 2 1 1 1 2 1 1 ...
##  $ Date     : Factor w/ 279 levels "01/01/1996 12:00:00 AM",..: 72 72 72 72 72 72 72 72 72 72 ...
##  $ Measure  : Factor w/ 12 levels "Bus Passengers",..: 12 7 12 9 4 12 1 10 6 12 ...
##  $ Value    : int  34447 428 81217 62 16377 179 1054 1808 6685 24759 ...
##  $ Location : Factor w/ 224 levels "POINT (-100.05 49)",..: 75 142 88 54 162 143 198 205 17 163 ...

summary(bordercrossings)

##                   Port.Name               State          Port.Code   
##  Eastport              :  5541   North Dakota: 57071   Min.   : 101  
##  Buffalo-Niagara Falls :  3348   Washington  : 44677   1st Qu.:2304  
##  Calais                :  3348   Montana     : 38154   Median :3013  
##  Calexico East         :  3348   Maine       : 38136   Mean   :2456  
##  Champlain-Rouses Point:  3348   Texas       : 35754   3rd Qu.:3402  
##  El Paso               :  3348   Minnesota   : 23135   Max.   :4105  
##  (Other)               :324452   (Other)     :109806                 
##               Border                           Date       
##  US-Canada Border:266187   05/01/2010 12:00:00 AM:  1356  
##  US-Mexico Border: 80546   06/01/2010 12:00:00 AM:  1356  
##                            07/01/2010 12:00:00 AM:  1356  
##                            08/01/2010 12:00:00 AM:  1356  
##                            09/01/2010 12:00:00 AM:  1356  
##                            10/01/2010 12:00:00 AM:  1356  
##                            (Other)               :338597  
##                         Measure           Value        
##  Personal Vehicles          : 30219   Min.   :      0  
##  Personal Vehicle Passengers: 30196   1st Qu.:      0  
##  Trucks                     : 29856   Median :     90  
##  Truck Containers Empty     : 29757   Mean   :  28188  
##  Truck Containers Full      : 29694   3rd Qu.:   2483  
##  Buses                      : 28822   Max.   :4447374  
##  (Other)                    :168189                    
##                   Location     
##  POINT (-83.04 42.32) :  3069  
##  POINT (-102.55 49)   :  3060  
##  POINT (-106.45 31.76):  3060  
##  POINT (-110.94 31.33):  3060  
##  POINT (-111.96 49)   :  3060  
##  POINT (-115.39 32.67):  3060  
##  (Other)              :328364

This dataset is huge, with dimensions of 346733 x8, unsurprising since it represents monthly totals for every category of land border crossing at each port since January 1, 1996.

Step 2- Clean and reformat data

Fortunately this data was very clean to begin with because it was from a government source. It did not have duplicate values, errors or blanks. To be sure of this I imported it to OpenRefine and sorted each category, selecting for “blanks” and “errors” as the first entries, but found none.

While in OpenRefine I also chose to edit the dates slightly, because I was unhappy with their format in the original Excel .csv file and unable to format them correctly in R.

The workaround code I would have used in R is below, but it proved inadequate because it failed to present the first 2 digits for the “year” value. This meant 2002 would appear as “0002” and 1998 would appear as “0098.” Although I was able to read these, I knew R would later be confused and attribute a higher value to “0098” than “0002”, which was not my intention.

#transform dates
#bordercrossings$date <- as.POSIXct(bordercrossings$date, format = "%m/%d/%Y %H:%M")
#select first seven characters of date
# bordercrossings$date <- as.factor(substring(bordercrossings$date,1,7))

In OpenRefine, I converted values in the date column to date format using the “transform…” feature, then used the GREL command value.slice(0,7) to select the first 7 characters of each date, effectively putting them into year-month format.

I imported the new, OpenRefine-edited file to R and named it “bordercrossingsclean”.

bordercrossingsclean <-read_csv("Border_Crossing_Entry_Data-OpenRefine.csv")

## Parsed with column specification:
## cols(
##   `Port Name` = col_character(),
##   State = col_character(),
##   `Port Code` = col_character(),
##   Border = col_character(),
##   Date = col_character(),
##   Measure = col_character(),
##   Value = col_double(),
##   Location = col_character()
## )

I then conducted a bit more reformatting in R. This included creating a uniform heading style, using lowercase column titles without spaces between the words.

names(bordercrossingsclean) <- tolower(names(bordercrossingsclean))
names(bordercrossingsclean) <- gsub(" ","",names(bordercrossingsclean)) 
head(bordercrossingsclean)

## # A tibble: 6 x 8
##   portname   state   portcode border    date  measure      value location       
##   <chr>      <chr>   <chr>    <chr>     <chr> <chr>        <dbl> <chr>          
## 1 Calexico … Califo… 2507     US-Mexic… 2019… Trucks       34447 POINT (-115.48…
## 2 Van Buren  Maine   0108     US-Canad… 2019… Rail Contai…   428 POINT (-67.942…
## 3 Otay Mesa  Califo… 2506     US-Mexic… 2019… Trucks       81217 POINT (-117.05…
## 4 Nogales    Arizona 2604     US-Mexic… 2019… Trains          62 POINT (-110.93…
## 5 Trout Riv… New Yo… 0715     US-Canad… 2019… Personal Ve… 16377 POINT (-73.442…
## 6 Madawaska  Maine   0109     US-Canad… 2019… Trucks         179 POINT (-68.327…

I also used the dplyr ‘select’ feature to get rid of the “location” column, which listed longitude and latitute coordinates I found hard to understand. This information was also irrelevant for my uses because the port of entry for each crossing was already documented.

#select all columns between portname and value, inclusive
bordercrossingsclean <- dplyr::select(bordercrossingsclean, portname:value)

Step 3-Dive deeper into the data with preliminary plots

At this point I still have a dataframe with 346733 rows, an unweidly amount to work with.

I decided to filter in various ways and construct a few preliminary plots to get an idea of where to explore more.

Prelim Plot 1-Each type of crossings each year

plot1crossings <- bordercrossingsclean %>%
  select(measure, date, value)

plot1crossings$date <- as.factor(substring(bordercrossingsclean$date,1,4))

plot1crossings <- plot1crossings %>% 
  group_by(date, measure) %>%
    summarise(sum=sum(value))

head(plot1crossings)

## # A tibble: 6 x 3
## # Groups:   date [1]
##   date  measure                           sum
##   <fct> <chr>                           <dbl>
## 1 1996  Bus Passengers                5813778
## 2 1996  Buses                          292789
## 3 1996  Pedestrians                  34717351
## 4 1996  Personal Vehicle Passengers 272593220
## 5 1996  Personal Vehicles           101960373
## 6 1996  Rail Containers Empty          268134

library(ggplot2)
plot1crossings %>% 
  ggplot(aes(x=date, y=sum)) +
  geom_point(aes(color=measure)) +
  ggtitle("Types of Border Crossings") +
  xlab ("Year") +
  theme(axis.text.x = element_text(angle = 90)) +
  scale_y_continuous(name="Crossings",trans = 'log2')

plot1crossings %>% 
  ggplot(aes(x=date, y=sum)) +
  geom_point(aes(color=measure)) +
  ggtitle("Types of Border Crossings") +
  xlab ("Year") +
  theme(axis.text.x = element_text(angle = 90)) +
  scale_y_continuous(name="Crossings",trans = 'log2')

Prelim Plot 2- Total crossings over time for top 5 states (top 5 being total across all time)

plot2crossings <- bordercrossingsclean %>% 
  select (state, value)

plot2crossings <- plot2crossings %>%
  group_by(state) %>%
  summarise(sum=sum(value))

head(plot2crossings)

## # A tibble: 6 x 2
##   state             sum
##   <chr>           <dbl>
## 1 Alaska       13920348
## 2 Arizona     888075092
## 3 California 2499521716
## 4 Idaho        20998428
## 5 Maine       231249782
## 6 Michigan    753167571

Top 5 states: Texas, California, Arizona, New York, Michigan

plot2crossingstop5states <- bordercrossingsclean %>%
  select(state, date, value)

plot2crossingstop5states$date <- as.factor(substring(plot2crossingstop5states$date,1,4))

  plot2crossingstop5states <- plot2crossingstop5states %>%
  filter(state %in% c("Texas", "California", "Arizona", "New York", "Michigan")) %>%
  group_by(state, date) %>%
  summarise(sum=sum(value))
head(plot2crossingstop5states)

## # A tibble: 6 x 3
## # Groups:   state [1]
##   state   date       sum
##   <chr>   <fct>    <dbl>
## 1 Arizona 1996  38038999
## 2 Arizona 1997  40510361
## 3 Arizona 1998  41441325
## 4 Arizona 1999  44309727
## 5 Arizona 2000  46455388
## 6 Arizona 2001  43736679

plot2crossingstop5states %>% 
  ggplot(aes(x=date, y=sum)) +
  geom_point(aes(color=state)) +
  ggtitle("Aggregated Border Crossings by Year") +
  xlab ("Year") +
  theme(axis.text.x = element_text(angle = 90)) +
  scale_y_continuous(name="Crossings", trans = 'log10')

Step 4- Making data more workable

Because there are only 3 months of data for 2019, yearly totals give artifically low numbers for this year. Instead, I will calculate monthly averages by dividing each total by the number of months represented in the year it was taken. For every year except 2019 the montly total will be n/12, but for 2019 it will be 2/3. I will split the dataset into two and then join them back again to perform this operation.

For 2019:

# Extracting 2019 data from the dataset using row numbers
threemonthdata <- bordercrossingsclean %>%
  slice(1:2364)

#Creating new monthly average column based on 3 months in 2019
threemonthdata <- threemonthdata %>%
  mutate(monthly.average = value/3)

For all other years:

# Extracting all other years' data from the dataset using row numbers
twelvemonthdata <- bordercrossingsclean %>%
  slice(2365:346733)

#Creating new monthly average column based on 12 months in all other years
twelvemonthdata <- twelvemonthdata %>%
  mutate(monthly.average = value/12)

Now we have to create and reformat one cohesive dataset again:

# Combining the two datasets
monthlyavgbordercrossings <- bind_rows(threemonthdata, twelvemonthdata)

# Reformatting column headers
names(monthlyavgbordercrossings) <- tolower(names(monthlyavgbordercrossings))
names(monthlyavgbordercrossings) <- gsub(" ","",names(monthlyavgbordercrossings))

# Selecting relevant information (columns). I did not select the original values because I only wanted the monthly averages
monthlyavgbordercrossings <- monthlyavgbordercrossings %>%
  select(portname, state, portcode, border, date, measure, monthly.average)

Prelim Plot 3- Personal Vehicle Passengers Over Time and at Different Ports of Entry

plot3crossings <- monthlyavgbordercrossings %>% 
  select(portname, measure, monthly.average, date) %>%
  filter(measure == "Personal Vehicle Passengers")

plot3crossings$date <- as.factor(substring(plot3crossings$date,1,4))

Top 5 ports (found above: El Paso, San Ysidro, Hidalgo, Buffalo-Niagara Falls, Laredo)

plot3crossingstoplot <- plot3crossings %>%
  filter(portname %in% c("El Paso", "San Ysidro", "Hidalgo", "Laredo", "Buffalo-Niagara Falls")) %>%
    group_by(portname, date) %>%
  summarise(sum=sum(monthly.average))

plot3crossingstoplot %>% 
  ggplot(aes(x=date, y=sum)) +
  geom_point(aes(color=portname)) + 
  ggtitle("Monthly Average Crossings Each Year") +
  xlab ("Year") +
  theme(axis.text.x = element_text(angle = 90)) +
  scale_y_continuous(name="Crossings", trans = 'log2')

## Warning: Transformation introduced infinite values in continuous y-axis

Now let’s try this as a heatmap

plot3crossingstoplot %>% 
  ggplot(aes(date, portname, fill = sum)) +
  geom_tile() +
  scale_fill_gradientn(colors = hcl.colors(120)) +
  ggtitle("Monthly Average Crossings Each Year") +
  xlab ("Year") +
  theme(axis.text.x = element_text(angle = 90)) +
  ylab ("Port of Entry")

Step 5-Going further with the data: GIS visualization

#call back necessary packages
library("tmap")
library("tmaptools")
library("sf")

## Linking to GEOS 3.7.2, GDAL 2.4.2, PROJ 5.2.0

library("leaflet")
library("leaflet.extras")

I found a shapefile here with corresponding ports of entry: https://geo.nyu.edu/catalog/stanford-yv333xj1559

# Call in shapefile to create map
setwd("~/Desktop/Datsets_for_DATA110/border shapefiles")
bordershapefile <- "border_x.shp"

usborders <- read_shape(file= bordershapefile, as.sf = TRUE)

## Warning: This function is deprecated and has been migrated to github.com/
## mtennekes/oldtmaptools

str(usborders)

## Classes 'sf' and 'data.frame':   171 obs. of  21 variables:
##  $ PortCode  : Factor w/ 113 levels "0101","0103",..: 113 112 111 25 25 25 25 22 19 20 ...
##  $ PortName  : Factor w/ 171 levels "Alburg","Alburg Springs",..: 148 129 45 103 105 104 22 4 114 95 ...
##  $ State     : Factor w/ 15 levels "AK","AZ","CA",..: 6 6 6 12 12 12 12 12 12 12 ...
##  $ CP_Name   : Factor w/ 112 levels "Alcan","Alexandria Bay",..: 94 77 27 15 15 15 15 2 68 57 ...
##  $ UNIQUEID  : Factor w/ 171 levels "0101","0103",..: 170 169 171 59 58 57 56 45 42 43 ...
##  $ Year      : num  2013 2013 0 0 0 ...
##  $ Trucks    : num  39777 731165 0 0 0 ...
##  $ Loaded_Tru: num  35635 565265 0 0 0 ...
##  $ Empty_Truc: num  7854 158460 0 0 0 ...
##  $ Trains    : num  352 3607 0 0 0 ...
##  $ Loaded_Rai: num  15521 243828 0 0 0 ...
##  $ Empty_Rai : num  16361 181946 0 0 0 ...
##  $ Train_Pass: num  715 9044 0 0 0 ...
##  $ Buses     : num  2594 1837 0 0 0 ...
##  $ Bus_Passen: num  41531 74114 0 0 0 ...
##  $ Personal_V: num  1003967 2037430 0 0 0 ...
##  $ PV_Passeng: num  1856527 4018331 0 0 0 ...
##  $ Pedestrian: num  0 0 0 0 0 ...
##  $ CLASS     : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ STFIPS    : Factor w/ 15 levels "02","04","06",..: 6 6 6 11 11 11 11 11 11 11 ...
##  $ geometry  :sfc_POINT of length 171; first list element:  'XY' num  -84.4 46.5
##  - attr(*, "sf_column")= chr "geometry"
##  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA NA ...
##   ..- attr(*, "names")= chr  "PortCode" "PortName" "State" "CP_Name" ...

This shapefiled contained a lot of unecessary information that I thought would be cumbersome later on, so I once again used the dplyr select feature to get only the Port Code, Port Name, and State. The “Geometry” column, which I assume allows for mapping, remained.

usborders2 <- dplyr::select(usborders, PortCode:State)
str(usborders2)

## Classes 'sf' and 'data.frame':   171 obs. of  4 variables:
##  $ PortCode: Factor w/ 113 levels "0101","0103",..: 113 112 111 25 25 25 25 22 19 20 ...
##  $ PortName: Factor w/ 171 levels "Alburg","Alburg Springs",..: 148 129 45 103 105 104 22 4 114 95 ...
##  $ State   : Factor w/ 15 levels "AK","AZ","CA",..: 6 6 6 12 12 12 12 12 12 12 ...
##  $ geometry:sfc_POINT of length 171; first list element:  'XY' num  -84.4 46.5
##  - attr(*, "sf_column")= chr "geometry"
##  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA
##   ..- attr(*, "names")= chr  "PortCode" "PortName" "State"

I also reformatted the column names to match the format used for my bordercrossings dataframe.

names(usborders2) <- tolower(names(usborders2))
names(usborders2) <- gsub(" ","",names(usborders2))

Finally, I used the Quick Thematic Map command to display the shapefile, just to check that everything looked as it should.

qtm(usborders2)

Step 5- Selecting the data I want to use for my final visualization

I ultimately want to focus on the US-Canda border, specifically ports where there have been substantial increases or decreases in crossings over the years. For now, I will create a test map using the original dataset.

thisyearbcs <- bordercrossingsclean %>% 
  select(portname, state, portcode, value) %>%
  group_by(portname,state, portcode) %>%
  summarise(sum=sum(value))
head(thisyearbcs)

## # A tibble: 6 x 4
## # Groups:   portname, state [6]
##   portname       state        portcode      sum
##   <chr>          <chr>        <chr>       <dbl>
## 1 Alcan          Alaska       3104      4225089
## 2 Alexandria Bay New York     0708     62067849
## 3 Algonac        Michigan     3814        63825
## 4 Ambrose        North Dakota 3410       210157
## 5 Anacortes      Washington   3010      1596779
## 6 Andrade        California   2502     72889447

str(usborders2$portcode)

##  Factor w/ 113 levels "0101","0103",..: 113 112 111 25 25 25 25 22 19 20 ...

str(thisyearbcs$portcode)

##  chr [1:117] "3104" "0708" "3814" "3410" "3010" "2502" "3413" "0112" "3424" ...

usborders3 <- usborders2 %>%
  group_by(portcode)
usborders3

## Simple feature collection with 171 features and 3 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -141.0014 ymin: 25.88342 xmax: -66.98008 ymax: 64.08552
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
## # A tibble: 171 x 4
## # Groups:   portcode [113]
##    portcode portname                       state             geometry
##    <fct>    <fct>                          <fct>          <POINT [°]>
##  1 3803     Sault Ste. Marie               MI    (-84.36072 46.50844)
##  2 3802     Port Huron                     MI     (-82.4236 42.99852)
##  3 3801     Detroit-Ambassador Bridge      MI    (-83.07406 42.31182)
##  4 0901     Niagara Falls-Lewiston Bridge  NY    (-79.04446 43.15305)
##  5 0901     Niagara Falls-Whirlpool Bridge NY    (-79.05834 43.10924)
##  6 0901     Niagara Falls-Rainbow Bridge   NY     (-79.0677 43.09018)
##  7 0901     Buffalo-Peace Bridge           NY    (-78.90595 42.90694)
##  8 0708     Alexandria Bay                 NY    (-75.98359 44.34723)
##  9 0701     Ogdensburg                     NY    (-75.45775 44.73309)
## 10 0704     Massena                        NY      (-74.7395 44.9906)
## # … with 161 more rows

Issue above- there appear to be multiple names for a port using the same code in this shapefile. Needs more investigation. Causing 171 entries for this dataset instead of 117

#Did this before, hence character return
#usborders2$portcode <- as.character(usborders2$portcode)

usborders2 <- usborders2[order(usborders2$portcode),]
thisyearbcs <- thisyearbcs[order(thisyearbcs$portname),]

identical(usborders2$portcode,thisyearbcs$portname)

## [1] FALSE