Research Question: Is there a difference in the average number of border crossings between the U.S.–Canada border and the U.S.–Mexico border, and what factors influence the number of crossings?
Border crossings play an important role in transportation, tourism, and international trade between the United States and its neighboring countries. The Border Crossing/Entry Data dataset records the number of people, vehicles, buses, trains, trucks, and other transportation methods entering the United States through official ports of entry. According to the Bureau of Transportation Statistics (BTS), the dataset provides monthly counts of border crossings at U.S. ports of entry, making it useful for studying long-term transportation and travel patterns.
The United States shares international land borders with both Canada and Mexico, which are among the busiest borders in the world. Travelers and commercial goods must pass through official ports of entry, where they are inspected before entering the country. These inspections are managed by U.S. Customs and Border Protection (CBP), helping ensure the security of travelers while also supporting legal trade and travel. Border crossing data can also be used to study transportation trends, economic activity, and changes in travel behavior over time.
The data used in this project were obtained from the U.S. Bureau of Transportation Statistics (BTS). The dataset is observational, since the information was collected from actual border crossings rather than through a controlled experiment. Government agencies recorded the number of crossings at official ports of entry each month.
Because the data are observational, they can identify relationships between variables but cannot establish cause-and-effect conclusions. Potential sources of bias include reporting errors, seasonal travel patterns, unusual events that affect border traffic, and differences in traffic volume among ports of entry.
Each observation represents a monthly count of a specific type of border crossing at a U.S. port of entry. The variables most relevant to this project include Border, State, Measure, Value, Latitude, and Longitude.
This project will investigate whether there is a difference in border crossing activity between the U.S.–Canada and U.S.–Mexico borders and determine which factors are associated with higher numbers of crossings.
library(tidyverse)
library(tidymodels)
setwd("~/Documents/DATA SCIENCE/MATH 217/DATA FINAL PROJECT")
border <- read_csv("Border_Crossing_Entry_Data_20260611.csv")
head(border)
## # A tibble: 6 × 10
## `Port Name` State `Port Code` Border Date Measure Value Latitude Longitude
## <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 Sweetgrass Mont… 3310 US-Ca… Apr-… Trains 32 49.0 -112.
## 2 Highgate Spr… Verm… 212 US-Ca… Apr-… Trains 14 45.0 -73.1
## 3 Champlain Ro… New … 712 US-Ca… Apr-… Bus Pa… 10490 45.0 -73.5
## 4 Progreso Texas 2309 US-Me… Apr-… Person… 106939 26.1 -98.0
## 5 Neche Nort… 3404 US-Ca… Apr-… Person… 1364 49.0 -97.6
## 6 Porthill Idaho 3308 US-Ca… Apr-… Person… 5047 49 -116.
## # ℹ 1 more variable: Point <chr>
summary(border)
## Port Name State Port Code Border
## Length :274380 Length :274380 Min. : 101 Length :274380
## N.unique : 117 N.unique : 14 1st Qu.:2304 N.unique : 2
## N.blank : 0 N.blank : 0 Median :3012 N.blank : 0
## Min.nchar: 4 Min.nchar: 5 Mean :2448 Min.nchar: 16
## Max.nchar: 22 Max.nchar: 12 3rd Qu.:3401 Max.nchar: 16
## NAs : 4 Max. :3814
##
## Date Measure Value Latitude
## Length :274380 Length :274380 Min. : 0 Min. :25.95
## N.unique : 364 N.unique : 8 1st Qu.: 0 1st Qu.:42.62
## N.blank : 0 N.blank : 0 Median : 233 Median :48.12
## Min.nchar: 6 Min.nchar: 5 Mean : 42022 Mean :43.91
## Max.nchar: 6 Max.nchar: 27 3rd Qu.: 5649 3rd Qu.:49.00
## Max. :4447374 Max. :62.62
## NAs :4
## Longitude Point
## Min. :-141.00 Length :274380
## 1st Qu.:-114.73 N.unique : 116
## Median :-101.63 N.blank : 0
## Mean : -99.81 Min.nchar: 24
## 3rd Qu.: -89.58 Max.nchar: 42
## Max. : -66.98 NAs : 4
## NAs :4
border_clean <- border |>
filter(!is.na(Value), !is.na(Border))
# Value and Border contain no NA values, so this filter removes no rows
ggplot(border, aes(x = Value)) +
geom_histogram(bins = 30, fill = "lightblue") +
labs(
title = "Distribution of Border Crossing Counts",
x = "Number of Crossings",
y = "Frequency"
)
The histogram shows a strongly right-skewed distribution: the median monthly count is 233, while the mean is 42,022 and the maximum is 4,447,374. Most observations have relatively low crossing counts, while a few ports have extremely high values. This indicates the presence of extreme outliers and suggests that methods that do not assume normality may be more appropriate.
ggplot(border, aes(x = log(Value))) +
geom_histogram(bins = 30, fill = "lightblue") +
labs(
title = "Distribution of Log-Transformed Border Crossing Counts",
x = "Log(Number of Crossings)",
y = "Frequency"
)
## Warning: Removed 70882 rows containing non-finite outside the scale range
## (`stat_bin()`).
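The warning above arises because log(0) evaluates to -Inf, so every row with a Value of 0 is dropped from the plot. A minimal sketch of one common workaround, log1p(), which computes log(1 + x) and keeps zeros in the data. The counts vector holds made-up values for illustration and is not taken from the dataset.

```r
# Toy counts (illustrative values, not taken from the dataset)
counts <- c(0, 0, 14, 233, 5649)

log_raw  <- log(counts)    # log(0) is -Inf, which ggplot2 drops as non-finite
log_plus <- log1p(counts)  # log1p(x) computes log(1 + x), so zeros map to 0

sum(!is.finite(log_raw))   # number of values that would be dropped
sum(!is.finite(log_plus))  # number dropped after the log1p transform
```

Using the same transform in the plots and in the regression would keep the two views of the data comparable.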
ggplot(border,
aes(x = Border,
y = log(Value))) +
geom_boxplot(fill="lightblue") +
labs(
title="Border Crossings by Border Type",
x="Border",
y="Log(Crossings)"
)
## Warning: Removed 70882 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
ggplot(border,
aes(x = Border,
y = Value)) +
geom_boxplot(fill="lightblue") +
labs(
title="Border Crossings by Border Type",
x="Border",
y="Crossings"
)
The boxplot shows that both borders have strongly right-skewed distributions with many high outliers. Because the crossing counts are far from normally distributed, a rank-based nonparametric test is used in place of a two-sample t-test.
state_total <- border |>
group_by(State) |>
summarize(total = sum(Value))
ggplot(state_total,
aes(x = reorder(State,-total),
y = total)) +
geom_col(fill="lightblue") +
labs(
title="Total Border Crossings by State",
x="State",
y="Total Crossings"
) +
theme(axis.text.x=element_text(angle=45,hjust=1))
States located along the U.S.–Mexico border generally have higher crossing totals than many states along the Canadian border, although substantial variation exists within both groups.
top_ports <- border |>
group_by(`Port Name`) |>
summarize(Total = sum(Value)) |>
arrange(desc(Total)) |>
slice(1:10)
ggplot(top_ports,
aes(x=reorder(`Port Name`,Total),
y=Total)) +
geom_col(fill="lightblue") +
coord_flip() +
labs(
title="Top 10 Border Ports by Total Crossings",
x="Port of Entry",
y="Total Crossings"
) +
theme_minimal()
The top ports account for a large proportion of total border crossings, indicating that border traffic is concentrated at a relatively small number of locations.
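One way to put a number on this concentration is to compute the share of all crossings handled by the largest ports. The sketch below uses base R and a small made-up vector of port totals (the port labels and values are hypothetical). With the real data, the same steps could be applied to the per-port sums of Value.

```r
# Hypothetical per-port totals (illustrative only, not from the dataset)
port_totals <- c(PortA = 500, PortB = 300, PortC = 100, PortD = 60, PortE = 40)

n_top <- 2  # number of top ports to include (assumed value for this toy example)

# Sort from largest to smallest and keep the n_top largest ports
top_totals <- head(sort(port_totals, decreasing = TRUE), n_top)

# Share of all crossings accounted for by the top ports
top_share <- sum(top_totals) / sum(port_totals)
top_share
```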
Wilcoxon Rank-Sum Test
Since the distributions are highly skewed and contain many outliers, the Wilcoxon rank-sum test is used instead of a two-sample t-test. The test is run on a random sample of 200 observations drawn from the cleaned data.
H₀: The distribution of border crossing counts is the same for the U.S.–Canada and U.S.–Mexico borders.
Hₐ: The distributions are different.
set.seed(1234)
border_sample <- border_clean |>
slice_sample(n = 200) # random sample of 200 monthly records
wilcox.test(Value ~ Border,
data = border_sample)
##
## Wilcoxon rank sum test with continuity correction
##
## data: Value by Border
## W = 2173, p-value = 0.0003208
## alternative hypothesis: true location shift is not equal to 0
The p-value is 0.0003208 (W = 2173), which is well below 0.05 and provides strong evidence against the null hypothesis. Based on the random sample of 200 observations, we conclude that the distribution of border crossing counts differs between the U.S.–Canada and U.S.–Mexico borders.
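A p-value alone does not say how large the difference is. As a hedged sketch, the W statistic reported by wilcox.test() can be rescaled into an estimate of the probability that a randomly chosen observation from the first group exceeds one from the second, by dividing W by the product of the two group sizes. The two vectors below are made-up values, not project data.

```r
# Made-up counts for two groups (illustrative only)
group_a <- c(10, 20, 30, 40)
group_b <- c(5, 15, 25)

# Wilcoxon rank-sum test; W counts the (a, b) pairs in which a > b
# (ties, absent here, would each count as 0.5)
w_test <- wilcox.test(group_a, group_b)

# Rescale W to the proportion of pairs in which group_a is larger
prob_superiority <- unname(w_test$statistic) / (length(group_a) * length(group_b))
prob_superiority
```

A value near 0.5 would suggest the two groups overlap heavily, while values near 0 or 1 would suggest that one group tends to have larger counts.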
border_table <- table(border_sample$Border,
border_sample$Measure)
chisq.test(border_table)$expected
## Warning in chisq.test(border_table): Chi-squared approximation may be incorrect
##
## Bus Passengers Buses Pedestrians Personal Vehicle Passengers
## US-Canada Border 12.56 17.27 17.27 21.98
## US-Mexico Border 3.44 4.73 4.73 6.02
##
## Personal Vehicles Train Passengers Trains Trucks
## US-Canada Border 24.335 14.915 22.765 25.905
## US-Mexico Border 6.665 4.085 6.235 7.095
chisq.test(border_table)
## Warning in chisq.test(border_table): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: border_table
## X-squared = 7.2227, df = 7, p-value = 0.4061
Hypotheses
H₀: Border type and transportation method are independent.
Hₐ: Border type and transportation method are associated.
ggplot(border_sample,
aes(x=Measure,
fill=Border)) +
geom_bar(position="dodge") +
coord_flip() +
labs(
title="Transportation Methods by Border Type",
x="Transportation Method",
y="Count"
) +
theme_minimal()
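The chi-square test produced a p-value of 0.4061 (X-squared = 7.2227, df = 7). Since the p-value is larger than 0.05, we fail to reject the null hypothesis: in the sample of 200 observations, there is no statistically significant evidence that transportation method and border type are associated. Several expected counts in the U.S.–Mexico row are below 5, which is why R warns that the chi-squared approximation may be incorrect, so this result should be interpreted with caution.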
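When expected counts fall below 5, as in the U.S.–Mexico row of the expected-count table above, chisq.test() can estimate the p-value by Monte Carlo simulation instead of relying on the chi-squared approximation. A minimal sketch with a made-up 2 × 4 table (the counts are hypothetical and the column labels are borrowed only for readability); simulate.p.value and B are standard arguments of chisq.test().

```r
# Made-up contingency table (illustrative only, not the project's sample)
toy_table <- matrix(
  c(12, 17, 20, 22,
     3,  5,  4,  8),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    Border  = c("US-Canada Border", "US-Mexico Border"),
    Measure = c("Buses", "Pedestrians", "Trains", "Trucks")
  )
)

# Expected counts under independence; warnings are suppressed only for this check
expected_counts <- suppressWarnings(chisq.test(toy_table))$expected
any(expected_counts < 5)  # TRUE here, so the approximation is questionable

# Monte Carlo p-value based on B simulated tables with the same margins
set.seed(1234)
sim_test <- chisq.test(toy_table, simulate.p.value = TRUE, B = 10000)
sim_test$p.value
```

Drawing a larger sample than 200 rows would be another way to raise the expected counts, at the cost of making even very small differences statistically significant.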
border_clean <- border |>
filter(!is.na(Value), !is.na(Border)) |>
mutate(
Border = as.factor(Border),
State = as.factor(State),
Measure = as.factor(Measure),
log_value = log(Value + 1)
)
log_model <- lm(
log_value ~ Border +
State +
Measure +
Latitude +
Longitude,
data = border_clean
)
summary(log_model)
##
## Call:
## lm(formula = log_value ~ Border + State + Measure + Latitude +
## Longitude, data = border_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.290 -1.375 0.067 1.631 9.959
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.930667 0.358947 -8.165 3.24e-16 ***
## BorderUS-Mexico Border 3.036082 0.163956 18.518 < 2e-16 ***
## StateArizona -1.885825 0.035531 -53.075 < 2e-16 ***
## StateCalifornia -1.048971 0.042930 -24.434 < 2e-16 ***
## StateIdaho 2.246930 0.074183 30.289 < 2e-16 ***
## StateMaine 6.305122 0.154520 40.805 < 2e-16 ***
## StateMichigan 7.595808 0.128301 59.203 < 2e-16 ***
## StateMinnesota 4.077507 0.100232 40.681 < 2e-16 ***
## StateMontana 0.685872 0.074509 9.205 < 2e-16 ***
## StateNew Mexico -1.680303 0.042848 -39.216 < 2e-16 ***
## StateNew York 7.354700 0.140658 52.288 < 2e-16 ***
## StateNorth Dakota 1.948456 0.088146 22.105 < 2e-16 ***
## StateTexas NA NA NA NA
## StateVermont 6.588846 0.147366 44.711 < 2e-16 ***
## StateWashington 1.045115 0.064265 16.263 < 2e-16 ***
## MeasureBuses -2.013415 0.020054 -100.400 < 2e-16 ***
## MeasurePedestrians -0.193571 0.019898 -9.728 < 2e-16 ***
## MeasurePersonal Vehicle Passengers 5.051381 0.019143 263.877 < 2e-16 ***
## MeasurePersonal Vehicles 4.399653 0.019140 229.866 < 2e-16 ***
## MeasureTrain Passengers -2.868861 0.020502 -139.928 < 2e-16 ***
## MeasureTrains -3.093397 0.020388 -151.727 < 2e-16 ***
## MeasureTrucks 1.924900 0.019280 99.837 < 2e-16 ***
## Latitude -0.148139 0.005654 -26.200 < 2e-16 ***
## Longitude -0.110278 0.002460 -44.829 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.547 on 274353 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.6345, Adjusted R-squared: 0.6345
## F-statistic: 2.165e+04 on 22 and 274353 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(log_model)
par(mfrow=c(1,1))
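The response variable was transformed as log(Value + 1) to reduce skewness and to keep zero counts in the model. The regression model evaluates how border, state, transportation method, and geographic location explain variation in border crossing counts. All estimated coefficients are statistically significant (p < 0.001), and the model explains about 63% of the variation in the log-transformed counts (R-squared = 0.6345).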
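The NA row for StateTexas and the note "1 not defined because of singularities" in the output above are consistent with State being nested within Border: every state lies on exactly one border, so once Border and the other state indicators are in the model, one state indicator is a linear combination of the rest and lm() cannot estimate it separately. A toy sketch of this behavior, using hypothetical group labels and made-up responses:

```r
# Toy data: each state belongs to exactly one border (hypothetical labels and values)
toy <- data.frame(
  border = factor(c("A", "A", "A", "A", "B", "B", "B", "B")),
  state  = factor(c("s1", "s1", "s2", "s2", "s3", "s3", "s4", "s4")),
  y      = c(1, 2, 3, 5, 8, 9, 12, 14)
)

fit <- lm(y ~ border + state, data = toy)

# The indicator for border B equals the sum of the indicators for s3 and s4,
# so the design matrix is rank deficient and one coefficient is returned as NA
coef(fit)
sum(is.na(coef(fit)))
```

The fitted values are unaffected by this aliasing. In the project model, it suggests that the Border coefficient is best read together with the state coefficients rather than on its own.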
The analyses indicate differences in border crossing activity between the U.S.–Canada and U.S.–Mexico borders. The Wilcoxon rank-sum test found a significant difference in crossing counts between the two borders (p = 0.0003208), while the chi-square test did not find a significant association between transportation method and border type in the sample of 200 observations (p = 0.4061). The regression model further suggests that border, state, transportation method, and geographic coordinates are all associated with crossing counts, together explaining about 63% of the variation in the log-transformed counts. Because the data are observational, these results describe associations rather than causes.
Bureau of Transportation Statistics. (n.d.). Border crossing/entry data. U.S. Department of Transportation. https://www.bts.gov/explore-topics-and-geography/geography/border-crossingentry-data
U.S. Customs and Border Protection. (n.d.). About CBP. https://www.cbp.gov/about
U.S. Department of Homeland Security. (n.d.). Department of Homeland Security. https://www.dhs.gov