Introduction

Background

Research Question: Is there a difference in the average number of border crossings between the U.S.–Canada border and the U.S.–Mexico border, and what factors influence the number of crossings?

Border crossings play an important role in transportation, tourism, and international trade between the United States and its neighboring countries. The Border Crossing/Entry Data dataset records the number of people, vehicles, buses, trains, trucks, and other transportation methods entering the United States through official ports of entry. According to the Bureau of Transportation Statistics (BTS), the dataset provides monthly counts of border crossings at U.S. ports of entry, making it useful for studying long-term transportation and travel patterns.

The United States shares international land borders with both Canada and Mexico, which are among the busiest borders in the world. Travelers and commercial goods must pass through official ports of entry, where they are inspected before entering the country. These inspections are managed by U.S. Customs and Border Protection (CBP), helping ensure the security of travelers while also supporting legal trade and travel. Border crossing data can also be used to study transportation trends, economic activity, and changes in travel behavior over time.

Data Source

The data used in this project were obtained from the U.S. Bureau of Transportation Statistics (BTS). The dataset is observational, since the information was collected from actual border crossings rather than through a controlled experiment. Government agencies recorded the number of crossings at official ports of entry each month.

Because the data are observational, they can identify relationships between variables but cannot establish cause-and-effect conclusions. Potential sources of bias include reporting errors, seasonal travel patterns, unusual events that affect border traffic, and differences in traffic volume among ports of entry.

Dataset Description

Each observation represents a monthly count of a specific type of border crossing at a U.S. port of entry. The variables most relevant to this project include Border, State, Measure, Value, Latitude, and Longitude.

This project will investigate whether there is a difference in border crossing activity between the U.S.–Canada and U.S.–Mexico borders and determine which factors are associated with higher numbers of crossings.

Research Questions

  • Is there a difference in average border crossings between the Canadian and Mexican borders?
  • Which states experience the greatest number of border crossings?
  • Do different transportation methods have different crossing volumes?
  • Can border type and transportation method help explain crossing counts?

Exploratory Data Analysis (EDA)

Load Libraries

library(tidyverse)
library(tidymodels)

Load the data

setwd("~/Documents/DATA SCIENCE/MATH 217/DATA FINAL PROJECT")
border <- read_csv("Border_Crossing_Entry_Data_20260611.csv")
head(border)
## # A tibble: 6 × 10
##   `Port Name`   State `Port Code` Border Date  Measure  Value Latitude Longitude
##   <chr>         <chr>       <dbl> <chr>  <chr> <chr>    <dbl>    <dbl>     <dbl>
## 1 Sweetgrass    Mont…        3310 US-Ca… Apr-… Trains      32     49.0    -112. 
## 2 Highgate Spr… Verm…         212 US-Ca… Apr-… Trains      14     45.0     -73.1
## 3 Champlain Ro… New …         712 US-Ca… Apr-… Bus Pa…  10490     45.0     -73.5
## 4 Progreso      Texas        2309 US-Me… Apr-… Person… 106939     26.1     -98.0
## 5 Neche         Nort…        3404 US-Ca… Apr-… Person…   1364     49.0     -97.6
## 6 Porthill      Idaho        3308 US-Ca… Apr-… Person…   5047     49      -116. 
## # ℹ 1 more variable: Point <chr>
summary(border)
##      Port Name            State          Port Code          Border      
##  Length   :274380   Length   :274380   Min.   : 101   Length   :274380  
##  N.unique :   117   N.unique :    14   1st Qu.:2304   N.unique :     2  
##  N.blank  :     0   N.blank  :     0   Median :3012   N.blank  :     0  
##  Min.nchar:     4   Min.nchar:     5   Mean   :2448   Min.nchar:    16  
##  Max.nchar:    22   Max.nchar:    12   3rd Qu.:3401   Max.nchar:    16  
##                     NAs      :     4   Max.   :3814                     
##                                                                         
##         Date             Measure           Value            Latitude    
##  Length   :274380   Length   :274380   Min.   :      0   Min.   :25.95  
##  N.unique :   364   N.unique :     8   1st Qu.:      0   1st Qu.:42.62  
##  N.blank  :     0   N.blank  :     0   Median :    233   Median :48.12  
##  Min.nchar:     6   Min.nchar:     5   Mean   :  42022   Mean   :43.91  
##  Max.nchar:     6   Max.nchar:    27   3rd Qu.:   5649   3rd Qu.:49.00  
##                                        Max.   :4447374   Max.   :62.62  
##                                                          NAs    :4      
##    Longitude             Point       
##  Min.   :-141.00   Length   :274380  
##  1st Qu.:-114.73   N.unique :   116  
##  Median :-101.63   N.blank  :     0  
##  Mean   : -99.81   Min.nchar:    24  
##  3rd Qu.: -89.58   Max.nchar:    42  
##  Max.   : -66.98   NAs      :     4  
##  NAs    :4

Clean Data

border_clean <- border |>
  filter(!is.na(Value), !is.na(Border))
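# Value and Border contain no missing values, so this filter drops no rows.
# State, Latitude, Longitude, and Point each have 4 NAs (see summary above).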
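
As a quick check on the cleaning step, missing values can be counted column by column before filtering. The sketch below uses a small made-up data frame (`toy` is a hypothetical stand-in, not the BTS file) and base R only.

```r
# Toy stand-in for the border data; all values are made up for illustration.
toy <- data.frame(
  State    = c("Texas", NA, "Maine", "Idaho"),
  Border   = c("US-Mexico Border", "US-Canada Border",
               "US-Canada Border", "US-Canada Border"),
  Value    = c(120, 0, 35, 8),
  Latitude = c(26.1, NA, 45.0, 49.0)
)

# Count missing values in every column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep only rows where the columns used later (Value, Border) are present.
toy_clean <- toy[!is.na(toy$Value) & !is.na(toy$Border), ]
```

In this toy example the filter keeps all four rows, because the missing entries sit in columns the filter does not check.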

Distribution of Border Crossing Counts

ggplot(border, aes(x = Value)) +
  geom_histogram(bins = 30, fill = "lightblue") +
  labs(
    title = "Distribution of Border Crossing Counts",
    x = "Number of Crossings",
    y = "Frequency"
  )

The histogram shows a strongly right-skewed distribution. Most observations have relatively low crossing counts, while a few ports have extremely high values. This indicates the presence of outliers and suggests that methods that do not assume normality may be more appropriate.

Distribution of Border Crossing Counts (Log-Transformed)

ggplot(border, aes(x = log(Value))) +
  geom_histogram(bins = 30, fill = "lightblue") +
  labs(
    title = "Distribution of Log-Transformed Border Crossing Counts",
    x = "Log(Number of Crossings)",
    y = "Frequency"
  )
## Warning: Removed 70882 rows containing non-finite outside the scale range
## (`stat_bin()`).
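
The warning above arises because `log(0)` is `-Inf`, so every observation with a count of zero is dropped from the plot. A minimal sketch with made-up counts shows the effect and one common alternative, `log1p()`, which computes log(1 + x) and matches the `log(Value + 1)` transformation used later in the regression.

```r
# Made-up monthly counts, including zeros as in the real data.
counts <- c(0, 0, 14, 233, 5649, 106939)

log_counts   <- log(counts)    # log(0) is -Inf, which ggplot2 removes
log1p_counts <- log1p(counts)  # log(1 + x) maps zero counts to 0

# Number of values that would be dropped from a plot of log(counts).
n_dropped <- sum(!is.finite(log_counts))
print(n_dropped)
```

Whether to drop zeros or shift them is a judgment call; the two choices produce different pictures of the lower tail.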

Border Crossings by Border Type

ggplot(border,
       aes(x = Border,
           y = log(Value))) +
  geom_boxplot(fill="lightblue") +
  labs(
    title="Border Crossings by Border Type",
    x="Border",
    y="Log(Crossings)"
  )
## Warning: Removed 70882 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

ggplot(border,
       aes(x = Border,
           y = Value)) +
  geom_boxplot(fill="lightblue") +
  labs(
    title="Border Crossings by Border Type",
    x="Border",
    y="Crossings"
  )

Both boxplots show that the two border types have highly right-skewed distributions with many outliers, even on the log scale. Because the normality assumption behind a two-sample t-test is doubtful for data this skewed, a nonparametric test is more appropriate.

Total Border Crossings by State

state_total <- border |>
  group_by(State) |>
  summarize(total = sum(Value))

ggplot(state_total,
       aes(x = reorder(State,-total),
           y = total)) +
  geom_col(fill="lightblue") +
  labs(
    title="Total Border Crossings by State",
    x="State",
    y="Total Crossings"
  ) +
  theme(axis.text.x=element_text(angle=45,hjust=1))

States located along the U.S.–Mexico border generally have higher crossing totals than many states along the Canadian border, although substantial variation exists within both groups.

Top 10 Ports of Entry

top_ports <- border |>
  group_by(`Port Name`) |>
  summarize(Total = sum(Value)) |>
  arrange(desc(Total)) |>
  slice(1:10)
ggplot(top_ports,
       aes(x=reorder(`Port Name`,Total),
           y=Total)) +
  geom_col(fill="lightblue") +
  coord_flip() +
  labs(
    title="Top 10 Border Ports by Total Crossings",
    x="Port of Entry",
    y="Total Crossings"
  ) +
  theme_minimal()

The top ports account for a large proportion of total border crossings, indicating that border traffic is concentrated at a relatively small number of locations.
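
The degree of concentration can be quantified as the share of all crossings handled by the top ports. The sketch below uses a made-up table (`toy_ports`, `top_n`, and the numbers are hypothetical) to show one way of computing that share in base R.

```r
# Made-up crossing counts for four hypothetical ports.
toy_ports <- data.frame(
  port  = c("A", "A", "B", "C", "D", "D"),
  value = c(500, 300, 100, 50, 30, 20)
)

# Total crossings per port, sorted from largest to smallest.
port_totals <- aggregate(value ~ port, data = toy_ports, FUN = sum)
port_totals <- port_totals[order(-port_totals$value), ]

# Share of all crossings handled by the top_n ports.
top_n <- 2
top_share <- sum(head(port_totals$value, top_n)) / sum(port_totals$value)
print(top_share)
```

Applied to the real data with `top_n <- 10`, the same ratio would put a number on the claim that traffic is concentrated at a few locations.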

Inferential Statistics

Wilcoxon Rank-Sum Test

Since the distributions are highly skewed and contain many outliers, the Wilcoxon rank-sum test is used instead of a two-sample t-test. The test is run on a random sample of 200 observations from the cleaned data.

Hypotheses

H₀: The distribution of border crossing counts is the same for the U.S.–Canada and U.S.–Mexico borders.

Hₐ: The distributions are different.

set.seed(1234)

border_sample <- border_clean |>
  slice_sample(n = 200)  # random sample of 200 rows

wilcox.test(Value ~ Border,
            data = border_sample)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Value by Border
## W = 2173, p-value = 0.0003208
## alternative hypothesis: true location shift is not equal to 0

The p-value is approximately 0.0003, well below 0.05, providing strong evidence against the null hypothesis. We conclude that, in this random sample of 200 observations, the distribution of border crossing counts differs significantly between the U.S.–Canada and U.S.–Mexico borders.
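
A significant rank-sum test indicates that the two distributions differ, but not which border tends to have larger counts. One way to show the direction is to report group medians next to the test. The sketch below uses simulated data, not the BTS file; the group sizes and rates are made up for illustration.

```r
set.seed(1)

# Simulated right-skewed counts for two hypothetical groups.
toy <- data.frame(
  Border = rep(c("US-Canada Border", "US-Mexico Border"), each = 30),
  Value  = c(rexp(30, rate = 1 / 100), rexp(30, rate = 1 / 1000))
)

# Median count per group gives the direction of the difference.
medians <- tapply(toy$Value, toy$Border, median)
print(medians)

# Rank-sum test on the same data.
result <- wilcox.test(Value ~ Border, data = toy)
print(result$p.value)
```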

Chi-Square Test of Independence

border_table <- table(border_sample$Border,
                      border_sample$Measure)

chisq.test(border_table)$expected
## Warning in chisq.test(border_table): Chi-squared approximation may be incorrect
##                   
##                    Bus Passengers Buses Pedestrians Personal Vehicle Passengers
##   US-Canada Border          12.56 17.27       17.27                       21.98
##   US-Mexico Border           3.44  4.73        4.73                        6.02
##                   
##                    Personal Vehicles Train Passengers Trains Trucks
##   US-Canada Border            24.335           14.915 22.765 25.905
##   US-Mexico Border             6.665            4.085  6.235  7.095
chisq.test(border_table)
## Warning in chisq.test(border_table): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  border_table
## X-squared = 7.2227, df = 7, p-value = 0.4061

Hypotheses

H₀: Border type and transportation method are independent.

Hₐ: Border type and transportation method are associated.

ggplot(border_sample,
       aes(x=Measure,
           fill=Border)) +
  geom_bar(position="dodge") +
  coord_flip() +
  labs(
    title="Transportation Methods by Border Type",
    x="Transportation Method",
    y="Count"
  ) +
  theme_minimal()

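The Chi-Square Test produced a p-value of approximately 0.41. Since the p-value is larger than 0.05, we fail to reject the null hypothesis: this sample of 200 observations does not provide evidence that transportation method and border type are associated. Several expected counts for the U.S.–Mexico border are below 5, which is why R warns that the chi-squared approximation may be incorrect, so this result should be interpreted with caution.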
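
When some expected counts fall below 5, one option is to replace the chi-squared approximation with a Monte Carlo p-value through the `simulate.p.value` argument of `chisq.test()`. The sketch below applies this to a made-up 2 × 4 table (`toy_table` and its counts are hypothetical); it is an illustration of the option, not a rerun of the analysis above.

```r
set.seed(1234)

# Made-up contingency table with a sparse second row.
toy_table <- matrix(
  c(12, 3, 18, 4, 20, 9, 25, 2),
  nrow = 2,
  dimnames = list(
    Border  = c("US-Canada Border", "US-Mexico Border"),
    Measure = c("Buses", "Trains", "Trucks", "Pedestrians")
  )
)

# Expected counts under independence; count the cells below 5.
expected    <- suppressWarnings(chisq.test(toy_table))$expected
small_cells <- sum(expected < 5)
print(small_cells)

# Monte Carlo p-value based on B simulated tables with the same margins.
sim_test <- chisq.test(toy_table, simulate.p.value = TRUE, B = 2000)
print(sim_test$p.value)
```

Drawing a larger sample than 200 rows would be another way to raise the expected counts.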

Multiple Linear Regression

border_clean <- border |>
  filter(!is.na(Value), !is.na(Border)) |>
  mutate(
    Border = as.factor(Border),
    State = as.factor(State),
    Measure = as.factor(Measure),
    log_value = log(Value + 1)
  )
log_model <- lm(
  log_value ~ Border +
    State +
    Measure +
    Latitude +
    Longitude,
  data = border_clean
)

summary(log_model)
## 
## Call:
## lm(formula = log_value ~ Border + State + Measure + Latitude + 
##     Longitude, data = border_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.290  -1.375   0.067   1.631   9.959 
## 
## Coefficients: (1 not defined because of singularities)
##                                     Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                        -2.930667   0.358947   -8.165 3.24e-16 ***
## BorderUS-Mexico Border              3.036082   0.163956   18.518  < 2e-16 ***
## StateArizona                       -1.885825   0.035531  -53.075  < 2e-16 ***
## StateCalifornia                    -1.048971   0.042930  -24.434  < 2e-16 ***
## StateIdaho                          2.246930   0.074183   30.289  < 2e-16 ***
## StateMaine                          6.305122   0.154520   40.805  < 2e-16 ***
## StateMichigan                       7.595808   0.128301   59.203  < 2e-16 ***
## StateMinnesota                      4.077507   0.100232   40.681  < 2e-16 ***
## StateMontana                        0.685872   0.074509    9.205  < 2e-16 ***
## StateNew Mexico                    -1.680303   0.042848  -39.216  < 2e-16 ***
## StateNew York                       7.354700   0.140658   52.288  < 2e-16 ***
## StateNorth Dakota                   1.948456   0.088146   22.105  < 2e-16 ***
## StateTexas                                NA         NA       NA       NA    
## StateVermont                        6.588846   0.147366   44.711  < 2e-16 ***
## StateWashington                     1.045115   0.064265   16.263  < 2e-16 ***
## MeasureBuses                       -2.013415   0.020054 -100.400  < 2e-16 ***
## MeasurePedestrians                 -0.193571   0.019898   -9.728  < 2e-16 ***
## MeasurePersonal Vehicle Passengers  5.051381   0.019143  263.877  < 2e-16 ***
## MeasurePersonal Vehicles            4.399653   0.019140  229.866  < 2e-16 ***
## MeasureTrain Passengers            -2.868861   0.020502 -139.928  < 2e-16 ***
## MeasureTrains                      -3.093397   0.020388 -151.727  < 2e-16 ***
## MeasureTrucks                       1.924900   0.019280   99.837  < 2e-16 ***
## Latitude                           -0.148139   0.005654  -26.200  < 2e-16 ***
## Longitude                          -0.110278   0.002460  -44.829  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.547 on 274353 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.6345, Adjusted R-squared:  0.6345 
## F-statistic: 2.165e+04 on 22 and 274353 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(log_model)

par(mfrow=c(1,1))

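The response variable was log-transformed, using log(Value + 1) so that zero counts are retained, to reduce skewness and improve the model assumptions. The regression model evaluates how border type, state, transportation method, and geographic location explain variation in border crossing counts. The model explains about 63% of the variation in the log-transformed counts (adjusted R-squared = 0.6345). The coefficient for Texas is reported as NA because each state lies on exactly one border, so the State indicators fully determine Border and one term cannot be estimated separately.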
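
The "1 not defined because of singularities" note in the output can be reproduced on a small scale. In the sketch below (simulated data; the states, sample sizes, and response are made up), every state belongs to exactly one border, so the border indicator is a linear combination of the state indicators and `lm()` returns NA for one coefficient.

```r
set.seed(42)

# Each hypothetical state sits on exactly one border.
toy <- data.frame(
  Border = rep(c("Canada", "Mexico"), each = 40),
  State  = rep(c("Maine", "Montana", "Arizona", "Texas"), each = 20)
)
toy$y <- rnorm(80)  # simulated response, unrelated to the predictors

fit <- lm(y ~ Border + State, data = toy)

# One coefficient is NA because Border is determined by State.
print(coef(fit))
n_aliased <- sum(is.na(coef(fit)))
```

Which term is reported as NA depends on the order of the columns in the design matrix, so the aliased term in this toy fit need not match the one in the full model.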

Conclusion

The analyses indicate differences in border crossing activity between the U.S.–Canada and U.S.–Mexico borders. The Wilcoxon Rank-Sum Test, run on a random sample of 200 observations, found a significant difference in crossing counts between the two borders (p ≈ 0.0003). The Chi-Square Test on the same sample did not find a statistically significant association between transportation method and border type (p ≈ 0.41), although small expected counts make that result less reliable. The regression model further suggests that border type, state, transportation method, and geographic coordinates are all associated with crossing counts. Because the data are observational, these results describe associations rather than causes. Overall, the results suggest that border crossing activity depends on multiple geographic and transportation-related factors.

Bibliography

Bureau of Transportation Statistics. (n.d.). Border crossing/entry data. U.S. Department of Transportation. https://www.bts.gov/explore-topics-and-geography/geography/border-crossingentry-data

U.S. Customs and Border Protection. (n.d.). About CBP. https://www.cbp.gov/about

U.S. Department of Homeland Security. (n.d.). Department of Homeland Security. https://www.dhs.gov