Data 101 Project 2

By: Ameer Adegun

Introduction

Research Question: Is there a difference in the average number of border crossings between passenger vehicles and trucks at U.S. border ports?

To answer this question, I used the Border Crossing Entry Data dataset from Data.gov (https://catalog.data.gov/dataset/border-crossing-entry-data-683ae). This dataset tracks the number of crossings at different U.S. border ports over time.

Each record represents a count of crossings for a specific port and date. The key variables used in this analysis are measure, which identifies the type of crossing (for example, trucks), and value, which gives the number of crossings. Together, these variables allow me to compare the average number of crossings between the two groups.

Data Analysis

Before analyzing the data, I first clean it by standardizing the column names and removing missing values. I also filter the dataset to keep only the two categories I am going to use: passenger vehicles (labeled "Personal Vehicles" in the dataset) and trucks.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

setwd("C:/Users/SwagD/Desktop/Data 101")

data <- read_csv("Border_Crossing_Entry_Data.csv")

## Rows: 273391 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Port Name, State, Port Code, Border, Date, Measure, Point
## dbl (3): Value, Latitude, Longitude
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Clean column names
names(data) <- tolower(gsub("[(). \\-]", "_", names(data)))

# Filter relevant data
clean_data <- data %>%
  filter(measure %in% c("Personal Vehicles", "Trucks"),
         !is.na(value))
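The cleaning step above can be checked on a small example. The sketch below is a base R illustration, not the report's actual run: the column names are copied from the readr column specification, while the toy data frame and its values are made up for demonstration, and base subsetting stands in for filter().

```r
# Column names as listed in the readr column specification
raw_names <- c("Port Name", "State", "Port Code", "Border", "Date",
               "Measure", "Value", "Latitude", "Longitude", "Point")

# Same transformation as in the report: special characters become "_",
# then everything is lowercased
clean_names <- tolower(gsub("[(). \\-]", "_", raw_names))
print(clean_names)

# Made-up records (hypothetical values, for illustration only)
toy <- data.frame(
  measure = c("Personal Vehicles", "Trucks", "Buses", "Trucks"),
  value   = c(120, 15, 4, NA)
)

# Keep only the two vehicle types and drop rows with a missing count
keep <- toy$measure %in% c("Personal Vehicles", "Trucks") & !is.na(toy$value)
kept <- toy[keep, ]
print(kept)
```

In this toy example the "Buses" row and the row with a missing value are both removed, which is the behavior the filter in the report relies on.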

Now that the dataset is clean, I can explore it using summary statistics.

clean_data %>%
  group_by(measure) %>%
  summarize(
    mean_crossings = mean(value),
    median_crossings = median(value),
    max_crossings = max(value),
    min_crossings = min(value)
  )

## # A tibble: 2 × 5
##   measure           mean_crossings median_crossings max_crossings min_crossings
##   <chr>                      <dbl>            <dbl>         <dbl>         <dbl>
## 1 Personal Vehicles         80548.            8530        1744349             0
## 2 Trucks                     8993.             864.        267884             0

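For both vehicle types the mean is roughly ten times the median, which suggests the counts are strongly right-skewed. For that reason, the boxplot of the distribution of crossings uses a log scale on the y-axis.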

ggplot(clean_data, aes(x = measure, y = value, fill = measure)) +
  geom_boxplot(outlier.alpha = 0.2) +
  scale_y_log10(labels = scales::comma) +
  labs(
    title = "Distribution of Border Crossings by Vehicle Type",
    x     = "Vehicle Type",
    y     = "Number of Crossings"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

## Warning in scale_y_log10(labels = scales::comma): log-10 transformation
## introduced infinite values.

## Warning: Removed 2738 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
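Both warnings come from the log scale: log10(0) is -Inf, and non-finite values are dropped before the boxplot is drawn. Since the summary table shows a minimum of 0 for both groups, the 2738 removed rows are most likely zero-count records, although that is an assumption I have not verified here. A minimal sketch with made-up numbers:

```r
# Made-up crossing counts, including two zero-count records
toy_values <- c(0, 10, 100, 0, 1000)

# log10(0) is -Inf, which a log-scaled axis cannot display
log_values <- log10(toy_values)

# Count how many values would be dropped from a log-scaled plot
n_dropped <- sum(!is.finite(log_values))
print(n_dropped)

# On the real data, the equivalent check would be:
# sum(clean_data$value == 0)
```

The zero-count records are only excluded from the plot. They remain in clean_data, so they still contribute to the summary statistics and the t-test.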

Statistical Analysis

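To test whether the difference in average crossings is statistically significant, I perform a two-sample t-test. By default, t.test() runs Welch's version of the test, which does not assume the two groups have equal variances.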

H₀: μ₁ = μ₂
Hₐ: μ₁ ≠ μ₂

Where:
μ₁ = mean number of passenger vehicle crossings
μ₂ = mean number of truck crossings

t_test_result <- t.test(value ~ measure, data = clean_data)
t_test_result

## 
##  Welch Two Sample t-test
## 
## data:  value by measure
## t = 73.778, df = 40405, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Personal Vehicles and group Trucks is not equal to 0
## 95 percent confidence interval:
##  69653.66 73455.56
## sample estimates:
## mean in group Personal Vehicles            mean in group Trucks 
##                       80548.063                        8993.453

The significance level for this test is α = 0.05.

The p-value reported by the test is less than 2.2e-16, which is far below 0.05, so I reject the null hypothesis. There is a statistically significant difference in average crossings between the two vehicle types. Had the p-value been greater than or equal to 0.05, I would have failed to reject the null hypothesis.

The 95% confidence interval gives a range of plausible values for the difference in means (passenger vehicles minus trucks): 69,653.66 to 73,455.56. Because this interval does not contain 0, it agrees with the decision to reject the null hypothesis at the 0.05 level.
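The decision rule can also be read directly from the object that t.test() returns. The sketch below uses made-up counts rather than the border crossing data, and the names toy, alpha, reject_null, and ci_excludes_zero are my own for illustration.

```r
# Made-up counts for two groups (not taken from the real dataset)
toy <- data.frame(
  measure = rep(c("Personal Vehicles", "Trucks"), each = 5),
  value   = c(120, 150, 90, 200, 170, 15, 22, 9, 30, 18)
)

# Welch two-sample t-test, the default behavior of t.test()
result <- t.test(value ~ measure, data = toy)

# Decision based on the p-value
alpha <- 0.05
reject_null <- result$p.value < alpha

# Decision based on whether the 95% confidence interval excludes 0
ci <- result$conf.int
ci_excludes_zero <- ci[1] > 0 || ci[2] < 0

print(reject_null)
print(ci_excludes_zero)
```

For a two-sided test at α = 0.05, the p-value check and the 95% confidence interval check always lead to the same decision, so either one can be reported.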

Conclusion

This analysis set out to determine whether there is a difference in average border crossings between passenger vehicles and trucks. The t-test provides strong statistical evidence that there is: the average count was about 80,548 for passenger vehicles compared with about 8,993 for trucks (p < 2.2e-16), and the 95% confidence interval for the difference runs from about 69,654 to 73,456. Understanding this difference is important for identifying traffic patterns and improving how the border operates. Because passenger vehicles cross far more often than trucks, resources can be allocated with that in mind. Future research could look at differences across regions and seasons and include more vehicle types. More advanced methods could also be used to gain a better understanding of border activity.