Across the world and still lingering into the modern age, Human trafficking is a pervasive crime that ensnares people of every age, gender, and background in forced labor, commercial sex, and other formats alike to modern slavery. Today, it’s estimated that 27.6 million individuals suffer under trafficking conditions. Perpetrators aren’t limited to organized criminal networks, in fact some world powers and governments secretly obtain these victims including children into exploitative labor or other illicit activities. Given the global scale and severity of this injustice, it demands urgent attention and collective action to stem its spread and protect vulnerable populations.
The first data source, a .CSV file, was downloaded from the United Nations Office on Drugs and Crime (UNDOC) website: https://dataunodc.un.org/dp-trafficking-persons This data was collected by using the Questionnaire for the Global Report on Trafficking in Persons (GLOTIP), hence the name of the original file being “data_glotip”.
The second data source, via web scraping and using the initial source as an API, was found from the US Human Trafficking Hotline: https://humantraffickinghotline.org/en/statistics
The .CSV was downloaded from the United Nations Office on Drugs and Crime (UNDOC) website and read into R.
The API for the US Human Trafficking Hotline is imported with a rvest approach, because the website/source doesn’t expose those year-by-year numbers via a neat JSON endpoint. So therefore, I scraped out the data using rvest by reading the website into R, then grabbing each year’s panel and extracting.
After importing both the UNODC CSV and the scraped Hotline data, I tidied each dataset into a common “long” format and harmonized their key fields. For the UNODC table, I filtered to only U.S. rows, “Detected trafficking victims,” and total Sex/Age dimensions, then renamed columns for clarity. For the Hotline data, I will first removed the redundant “Unit of measurement” column (all entries were “Counts” anyways), renamed Iso3_code to Abbreviation (which is basically a 3-4 letter code for a country) and txtVALUE to Count, and converted “<5” into a numeric midpoint value of 2 before parsing each text string to a number. I will then removed the placeholder “NA” panel (the 2007–2014 aggregate) and filter out any remaining NA years, producing two clean, year-indexed time series ready for combined analysis.
The research questions; 1. What is the difference in counts between the UNODC vs. the US Human Trafficking Hotline 2. What are some trends over time for gender and/or age? 3. What are the results of t-tests that compare males to females and ages 17 and less to ages 18 and older? would be answered using summary statistics as well as visualizing the data to answer these questions. For the final research question: “How can you use a linear regression line to determine an average number of people in human trafficking (across all countries and in the United States) for each of the following: Males 17 or younger, males 18 or older, females 17 or younger, and females 18 or older” I’ll be using linear regression as a analytical method to predict.
github <- "https://raw.githubusercontent.com/GullitNa/DATA607FINAL/main/data_glotip.csv"
data_glotip <- read.csv(
github,
skip = 2,
stringsAsFactors = FALSE,
check.names = FALSE
)
head(data_glotip)
## Iso3_code Country Region Subregion
## 1 ABW Aruba Americas Latin America and the Caribbean
## 2 AFG Afghanistan Asia Southern Asia
## 3 AFG Afghanistan Asia Southern Asia
## 4 AFG Afghanistan Asia Southern Asia
## 5 AFG Afghanistan Asia Southern Asia
## 6 AFG Afghanistan Asia Southern Asia
## Indicator Dimension
## 1 Detected trafficking victims by country of repatriation
## 2 Detected trafficking victims by country of repatriation
## 3 Detected trafficking victims by country of repatriation
## 4 Offences of trafficking in persons Total
## 5 Persons brought into formal contact Total
## 6 Detected trafficking victims by form of exploitation
## Category Sex Age Year Unit of measurement txtVALUE Source
## 1 Ukraine Total Total 2010 Counts <5 GLOTIP
## 2 Abroad Total Total 2003 Counts <5 GLOTIP
## 3 Abroad Total Total 2008 Counts <5 GLOTIP
## 4 Total Total Total 2013 Counts 103 GLOTIP
## 5 Total Total Total 2013 Counts 167 GLOTIP
## 6 Sexual exploitation Total Total 2017 Counts <5 GLOTIP
unique(data_glotip$`Unit of measurement`)
## [1] "Counts"
Only has Counts so I’ll remove this column in Data Transformation.
url <- "https://humantraffickinghotline.org/en/statistics"
page <- read_html(url)
panels <- page %>% html_nodes("section.js-tabs-panel")
ids <- panels %>% html_attr("id")
years <- as.integer(str_remove(ids, "^year\\-"))
## Warning: NAs introduced by coercion
counts <- lapply(panels, function(p) {
p %>%
html_nodes(".text-h1.font-black") %>%
html_text(trim = TRUE) %>%
str_remove_all(",") %>%
as.integer()
})
yearly_stats <- data.frame(
Year = years,
Signals_Received = sapply(counts, `[`, 1),
Victim_Signals = sapply(counts, `[`, 2),
Cases_Identified = sapply(counts, `[`, 3),
Victims = sapply(counts, `[`, 4),
row.names = NULL
)
yearly_stats
## Year Signals_Received Victim_Signals Cases_Identified Victims
## 1 2023 30162 7380 9619 16999
## 2 2022 37331 9647 9014 15299
## 3 2021 47325 12846 10353 16708
## 4 2020 51611 13946 10525 16984
## 5 2019 49669 10676 11368 22163
## 6 2018 41736 8172 10703 21714
## 7 2017 34017 5485 8596 21305
## 8 2016 31986 5148 7528 16747
## 9 2015 26840 4469 5551 12008
## 10 NA 82225 2874 17634 37073
Remove the “NA” year (2007-2014)
yearly_stats <- yearly_stats %>%
filter(!is.na(Year))
For the .CSV file, I’ll begin by checking if the “Unit of measurement” column confirms my suspicion of only having “Counts”. If so, then I’ll remove it as it is a repetitive column.
unique(data_glotip$`Unit of measurement`)
## [1] "Counts"
Confirmed that the column literally only has “Counts”.
glotip <- data_glotip %>%
rename(
Abbreviation = Iso3_code, # rename country‐code
Count = txtVALUE # rename the value column
) %>%
select(
-`Unit of measurement` # drop the one‐only “Counts” column
) %>%
mutate(
Count = parse_number(Count, na = "<5") # turn "<5" into NA, "1,234" → 1234
)
# peek
head(glotip)
## Abbreviation Country Region Subregion
## 1 ABW Aruba Americas Latin America and the Caribbean
## 2 AFG Afghanistan Asia Southern Asia
## 3 AFG Afghanistan Asia Southern Asia
## 4 AFG Afghanistan Asia Southern Asia
## 5 AFG Afghanistan Asia Southern Asia
## 6 AFG Afghanistan Asia Southern Asia
## Indicator Dimension
## 1 Detected trafficking victims by country of repatriation
## 2 Detected trafficking victims by country of repatriation
## 3 Detected trafficking victims by country of repatriation
## 4 Offences of trafficking in persons Total
## 5 Persons brought into formal contact Total
## 6 Detected trafficking victims by form of exploitation
## Category Sex Age Year Count Source
## 1 Ukraine Total Total 2010 NA GLOTIP
## 2 Abroad Total Total 2003 NA GLOTIP
## 3 Abroad Total Total 2008 NA GLOTIP
## 4 Total Total Total 2013 103 GLOTIP
## 5 Total Total Total 2013 167 GLOTIP
## 6 Sexual exploitation Total Total 2017 NA GLOTIP
Change the “NA” values in Counts to be the midpoint (2) on average instead of just throwing away the data entirely in analysis
glotip <- glotip %>%
mutate(
Count = replace_na(Count, 2)
)
us_victims <- glotip %>%
filter(
Country == "United States of America",
Indicator == "Detected trafficking victims",
Dimension == "Total",
Sex == "Total",
Age == "Total"
) %>%
group_by(Year) %>%
summarize(
UNODC_Victims_US = sum(Count),
.groups = "drop"
)
# 3) Clean up the scraped hotline stats and drop that NA‐year row
yearly_stats_clean <- yearly_stats %>%
filter(!is.na(Year)) %>%
rename(Hotline_Victims = Victims)
# 4) Stitch them together on Year
combined <- yearly_stats_clean %>%
inner_join(us_victims, by = "Year")
# peek
combined
## Year Signals_Received Victim_Signals Cases_Identified Hotline_Victims
## 1 2022 37331 9647 9014 15299
## 2 2021 47325 12846 10353 16708
## 3 2020 51611 13946 10525 16984
## 4 2019 49669 10676 11368 22163
## 5 2018 41736 8172 10703 21714
## 6 2017 34017 5485 8596 21305
## 7 2016 31986 5148 7528 16747
## 8 2015 26840 4469 5551 12008
## UNODC_Victims_US
## 1 16390
## 2 10070
## 3 9854
## 4 8375
## 5 8913
## 6 8003
## 7 5582
## 8 3885
Now you truly have two datasets in one analysis—compare the hotline’s identified victims against the UNODC’s official counts:
ggplot(combined, aes(x = Year)) +
geom_line(aes(y = Hotline_Victims, color = "Hotline")) +
geom_line(aes(y = UNODC_Victims_US, color = "UNODC")) +
labs(
y = "Number of Victims",
color = "Source",
title = "U.S. Trafficking Victims: Hotline vs. UNODC"
) +
scale_color_manual(values = c("Hotline" = "steelblue", "UNODC" = "firebrick"))
The Hotline’s total victims curve sits well above the UNODC’s. This is possible because the Hotline is picking up substantially more “signals” or potential victim reports than what shows up in the UNODC’s official national totals—often by a factor of two or more. UNODC data comes from country reports, law enforcement and government agencies, aggregated and vetted before publication. There may be “definition” differences between different countries and cultures.
Hotline data is “real-time” outreach: it includes any contact (calls, texts, chats) that meet the trafficking-victim criteria, even if those never end up as formally “identified” cases in government statistics.
What does this counting gap mean? Under-reporting in official channels: many victims never come to the attention of law enforcement but do make it to the Hotline.
df_totals <- glotip %>%
filter(
Indicator == "Detected trafficking victims",
Dimension == "Total"
)
gender_trend <- df_totals %>%
filter(Age == "Total") %>%
group_by(Year, Sex) %>%
summarize(Victims = sum(Count, na.rm = TRUE), .groups = "drop")
# FOR PLOTTING ONLY
gender_plot_df <- gender_trend %>%
filter(Sex %in% c("Female", "Male"))
ggplot(gender_plot_df, aes(x = Year, y = Victims, color = Sex)) +
geom_line(size = 1) +
geom_point() +
scale_x_continuous(
breaks = unique(gender_plot_df$Year),
labels = function(x) sprintf("%02d", x %% 100) #better year clarity
) +
labs(
title = "Global Trend of Trafficking Victims by Sex",
x = "Year",
y = "Total Victims"
) +
theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
gender_summary <- gender_trend %>%
group_by(Sex) %>%
summarize(
Total_Victims = sum(Victims, na.rm = TRUE),
.groups = "drop"
)
gender_summary
## # A tibble: 4 × 2
## Sex Total_Victims
## <chr> <dbl>
## 1 Female 335046
## 2 Male 165173
## 3 Other 2382
## 4 Total 565985
In total counts and consistently every year, there are more female victims of human trafficking than males.
age_trend <- df_totals %>%
filter(Sex == "Total") %>%
group_by(Year, Age) %>%
summarize(Victims = sum(Count, na.rm = TRUE), .groups = "drop")
# PLOTTING ONLY
age_plot_df <- age_trend %>%
filter(Age %in% c("0 to 17 years", "18 years or over"))
ggplot(age_plot_df, aes(x = Year, y = Victims, color = Age)) +
geom_line(size = 1) +
geom_point() +
scale_x_continuous(
breaks = unique(age_plot_df$Year),
labels = function(x) sprintf("%02d", x %% 100)
) +
labs(
title = "Global Trend of Trafficking Victims by Age Group",
x = "Year",
y = "Total Victims",
color = "Age Group"
) +
theme_minimal()
age_summary <- age_trend %>%
group_by(Age) %>%
summarize(Total_Victims = sum(Victims, na.rm = TRUE), .groups = "drop")
age_summary
## # A tibble: 3 × 2
## Age Total_Victims
## <chr> <dbl>
## 1 0 to 17 years 158647
## 2 18 years or over 355004
## 3 Total 565985
According to the graph and summary statistics, the victims of human trafficking tend to come from the age group “18 or over” more so than those who come from the age group “0 to 17 years” or 17 or younger.
country_totals <- glotip %>%
group_by(Country, Sex, Age) %>%
summarize(Victims = sum(Count, na.rm = TRUE), .groups = "drop") %>%
mutate(Country = recode(Country,
`United States of America` = "USA",
`Russian Federation` = "Russia",
`Antigua and Barbuda` = "Antigua and Barbuda",
`Bolivia (Plurinational State of)` = "Bolivia",
`Brunei Darussalam` = "Brunei",
`Cabo Verde` = "Cape Verde",
`China, Hong Kong Special Administrative Region` = "Hong Kong",
`China, Macao Special Administrative Region` = "Macau",
`Congo` = "Republic of the Congo",
`Curaçao` = "Curacao",
`Czechia` = "Czech Republic",
`Côte d’Ivoire` = "Ivory Coast",
`Eswatini` = "Swaziland",
`Holy See` = "Vatican City",
`Iran (Islamic Republic of)` = "Iran",
`Lao People's Democratic Republic` = "Laos",
`Micronesia (Federated States of)` = "Micronesia",
`Netherlands (Kingdom of the)` = "Netherlands",
`Republic of Korea` = "South Korea",
`Republic of Moldova` = "Moldova",
`Saint Kitts and Nevis` = "Saint Kitts and Nevis",
`Saint Vincent and the Grenadines` = "Saint Vincent and the Grenadines",
`Syrian Arab Republic` = "Syria",
`Trinidad and Tobago` = "Trinidad and Tobago",
`Türkiye` = "Turkey",
`United Kingdom of Great Britain and Northern Ireland` = "UK",
`United Republic of Tanzania` = "Tanzania",
`Venezuela (Bolivarian Republic of)` = "Venezuela",
.default = Country
)
)
world_outline <- map_data("world") %>%
rename(Country = region)
plot_choro <- function(sex, age, title){
df_map <- country_totals %>%
filter(Sex == sex, Age == age)
map_df <- left_join(world_outline, df_map, by = "Country")
ggplot(map_df, aes(long, lat, group = group, fill = Victims)) +
geom_polygon(color = "gray80", size = 0.1) +
coord_quickmap() +
scale_fill_viridis_c(na.value = "white", option = "plasma") +
labs(
title = title,
fill = "Victims"
) +
theme_void() +
theme(
plot.title = element_text(hjust = 0.5, size = 14)
)
}
p1 <- plot_choro("Male", "0 to 17 years", "Number of Male Victims 17 or Younger by Country")
p2 <- plot_choro("Female", "0 to 17 years", "Number of Female Victims 17 or Younger by Country")
p3 <- plot_choro("Male", "18 years or over", "Number of Male Victims 18 or Older by Country")
p4 <- plot_choro("Female", "18 years or over", "Number of Female Victims 18 or Older by Country")
p1
p2
p3
p4
world_outline <- map_data("world") %>% rename(Country = region)
joined <- left_join(world_outline, country_totals, by="Country")
## Warning in left_join(world_outline, country_totals, by = "Country"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 421 of `x` matches multiple rows in `y`.
## ℹ Row 61 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
Check what is mismatched by the maps_data country names/spelling
unmatched <- anti_join(country_totals, world_outline, by="Country")
unique(unmatched$Country)
## [1] "Antigua and Barbuda" "Hong Kong"
## [3] "Macau" "Republic of the Congo"
## [5] "Cura\xe7ao" "C\xf4te d\x92Ivoire"
## [7] "Vatican City" "Saint Kitts and Nevis"
## [9] "Saint Vincent and the Grenadines" "Sark"
## [11] "Trinidad and Tobago" "Tuvalu"
## [13] "T\xfcrkiye" "Viet Nam"
Manually recode
country_totals_clean <- country_totals %>%
mutate(
Country = recode(Country,
`Antigua and Barbuda` = "Antigua and Barbuda",
`Bolivia (Plurinational State of)` = "Bolivia",
`Brunei Darussalam` = "Brunei",
`Cabo Verde` = "Cape Verde",
`China, Hong Kong Special Administrative Region` = "Hong Kong",
`China, Macao Special Administrative Region` = "Macau",
`Congo` = "Republic of the Congo",
`Curaçao` = "Curacao",
`Czechia` = "Czech Republic",
`Côte d’Ivoire` = "Ivory Coast",
`Eswatini` = "Swaziland",
`Holy See` = "Vatican City",
`Iran (Islamic Republic of)` = "Iran",
`Lao People's Democratic Republic` = "Laos",
`Micronesia (Federated States of)` = "Micronesia",
`Netherlands (Kingdom of the)` = "Netherlands",
`Republic of Korea` = "South Korea",
`Republic of Moldova` = "Moldova",
`Saint Kitts and Nevis` = "Saint Kitts and Nevis",
`Saint Vincent and the Grenadines` = "Saint Vincent and the Grenadines",
`Syrian Arab Republic` = "Syria",
`Trinidad and Tobago` = "Trinidad and Tobago",
`Türkiye` = "Turkey",
`United Kingdom of Great Britain and Northern Ireland` = "UK",
`United Republic of Tanzania` = "Tanzania",
`Venezuela (Bolivarian Republic of)` = "Venezuela",
.default = Country
)
)
First t-test will be on the topic of these hypothesis: Null Hypothesis: “Males and females are victims of human trafficking in equal numbers.” Alternative Hypothesis: “Males and females are victims of human trafficking in different amounts.”
sex_totals <- glotip %>%
filter(Dimension == "Total",
Age == "Total",
Sex %in% c("Male","Female")) %>%
group_by(Country, Year, Sex) %>%
summarize(Victims = sum(Count, na.rm=TRUE), .groups="drop") %>%
pivot_wider(names_from = Sex,
values_from = Victims) %>%
filter(!is.na(Male), !is.na(Female))
t_sex <- t.test(sex_totals$Male,
sex_totals$Female,
paired = TRUE,
alternative = "two.sided")
t_sex
##
## Paired t-test
##
## data: sex_totals$Male and sex_totals$Female
## t = -4.3628, df = 1433, p-value = 1.376e-05
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -137.62305 -52.25142
## sample estimates:
## mean difference
## -94.93724
Null hypothesis is rejected because its p-value is < 0.001. The mean paired difference (Male and Female) is –94.94 (95 % CI [–137.6, –52.3]) on average there are about 95 more female victims than male victims per country‐year.
The second t-test will be on the topic of these hypothesis: Null Hypothesis: “Age groups 17 or younger and 18 or older are victims of human trafficking in equal numbers.” Alternative Hypothesis: “Age groups 17 or younger and 18 or older are victims of human trafficking in different amounts.”
age_totals <- glotip %>%
filter(Dimension == "Total",
Sex == "Total",
Age %in% c("0 to 17 years","18 years or over")) %>%
group_by(Country, Year, Age) %>%
summarize(Victims = sum(Count, na.rm=TRUE), .groups="drop") %>%
pivot_wider(names_from = Age,
values_from = Victims) %>%
filter(!is.na(`0 to 17 years`),
!is.na(`18 years or over`))
t_age <- t.test(age_totals$`0 to 17 years`,
age_totals$`18 years or over`,
paired = TRUE,
alternative = "two.sided")
t_age
##
## Paired t-test
##
## data: age_totals$`0 to 17 years` and age_totals$`18 years or over`
## t = -6.8074, df = 1674, p-value = 1.38e-11
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -204.3603 -112.9388
## sample estimates:
## mean difference
## -158.6496
Null hypothesis is rejected because its p-value is < 0.001. The mean paired difference (0 to 17 years and 18 years or over) is –158.65 (95 % CI [–204.4, –112.9]) on average there are roughly 159 more adult victims than children, per country‐year. ## Linear Regression Linear regression is a machine learning tool used for predictions, so it was an obvious choice to use this tool when it came to the research question of predicting the following scenarios: 1. Males 17 years or less in the year 2025. 2. Males 18 years or more in the year 2023. 3. Females 17 years or less in the year 2025. 4. Females 18 years or more in the year 2023.
global_trend <- glotip %>%
filter(Dimension == "Total",
Sex %in% c("Male","Female"),
Age %in% c("0 to 17 years","18 years or over")) %>%
group_by(Sex, Age, Year) %>%
summarize(Victims = sum(Count, na.rm = TRUE), .groups = "drop")
mod_m_u17 <- lm(Victims ~ Year,
data = filter(global_trend, Sex=="Male", Age=="0 to 17 years"))
mod_m_18p <- lm(Victims ~ Year,
data = filter(global_trend, Sex=="Male", Age=="18 years or over"))
mod_f_u17 <- lm(Victims ~ Year,
data = filter(global_trend, Sex=="Female", Age=="0 to 17 years"))
mod_f_18p <- lm(Victims ~ Year,
data = filter(global_trend, Sex=="Female", Age=="18 years or over"))
preds <- tibble(
Sex = c("Male", "Male", "Female", "Female"),
Age = c("0 to 17 years", "18 years or over",
"0 to 17 years", "18 years or over"),
Year = c(2025, 2023, 2025, 2023),
Predicted_Victims = c(
predict(mod_m_u17, newdata = data.frame(Year=2025)),
predict(mod_m_18p, newdata = data.frame(Year=2023)),
predict(mod_f_u17, newdata = data.frame(Year=2025)),
predict(mod_f_18p, newdata = data.frame(Year=2023))
)
)
preds
## # A tibble: 4 × 4
## Sex Age Year Predicted_Victims
## <chr> <chr> <dbl> <dbl>
## 1 Male 0 to 17 years 2025 12638.
## 2 Male 18 years or over 2023 27220.
## 3 Female 0 to 17 years 2025 13345.
## 4 Female 18 years or over 2023 43781.
Put in the form of a table, the predictions Male, 0–17, in 2025 = estimated 12,638 victims According to the model, if current year-to-year trends continue, there will be about 12.6 thousand male victims under 18 in 2025.
Male, 18+ in 2023 = estimated 27,220 victims For adult males in 2023, the model predicts 27.2 thousand victims, which shows the growing trend especially amongst those 18 or over as the victims of human trafficking when compared to 17 or under.
Female, 0–17 in 2025 = estimated 13,345 victims Predicted count for female victims under 18 in 2025 is slightly higher than the male counterpart of the same age at 13.3 thousand.
Female, 18+ in 2023 = estimated 43,781 victims Finally the model predicts that adult females in 2023 are projected at about 43.8 thousand victims significantly higher than adult males. This major disparity highlights this gender and age group’s vulnerability to human trafficking, especially when compared to the 3 other subgroups.
Throughout this project, I first imported the UNODC global trafficking dataset directly from their .CSV download and then wrestled with the Human Trafficking Hotline data by writing a custom rvest web scrape (since it wasn’t available as a clean API endpoint). Once both sources were in R, I cleaned and reshaped the tables by renaming columns, handling “<5” counts by imputing the midpoint, and dropping redundant columns/data so that I had counts by country, year, sex/gender, and age group. I then visualized global trends with line plots (by gender and by age), ran paired t-tests to confirm that females and adults consistently experience higher victim counts, and finally built four complex maps showing male and female victims 17 or under and 18+ across every country. The closing takeaway from this project is that women aged 18 or over are by far the most vulnerable group to human trafficking worldwide, especially when compared to the other 3 subgroups within the dataset (males 17 or younger, males 18 or over, and females but 17 or younger). Of course, this doesn’t exclude anyone from being a victim to human trafficking.