Project Proposal

Data Preparation

library(readr)
traffickingVictims <- read_csv("https://raw.githubusercontent.com/juliaDataScience-22/cuny-fall-23/stats-and-probability/data_glotip.csv")

library(plyr)   # loaded before dplyr so that plyr does not mask dplyr verbs such as summarise()
library(dplyr)
library(ggplot2)

# Less than 5 can include 0, 1, 2, 3, or 4.
# On average, we can expect 2 to be the value, 
#   so I changed every value that was <5 to 2.
traffickingVictims$txtVALUE <- gsub("<5",2,traffickingVictims$txtVALUE)
traffickingVictims$txtVALUE <- as.numeric(traffickingVictims$txtVALUE)

traffickingVictims <- traffickingVictims |>
  filter(Indicator == "Detected trafficking victims")
traffickingVictims <- traffickingVictims[,c(-1, -11, -13)]

totals <- traffickingVictims |>
  filter(Dimension == "Total")

others <- traffickingVictims |>
  filter(Dimension != "Total")

countryUnique <- unique(na.omit(traffickingVictims$Country))

newOthersDF <- data.frame(Country = c(NA),
                          Year = c(NA),
                          Dimension = c(NA),
                          Sex = c(NA),
                          Age = c(NA),
                          Num_Of_People = c(NA))

for (country in countryUnique)
{
  print("country loop")
  if(!is.na(country))
  {
    # Distinct years reported for this country
    yearsForData <- others |>
      filter(Country %in% country) |>
      distinct(Year) |>
      pull(Year)
    for (year in yearsForData)
    {
      if(!is.na(year))
      {
        tempDF <- others |>
          filter(Country %in% country) |>
          filter(Year %in% year)
        dimensions <- unique(tempDF$Dimension)
        
        for (dim in dimensions)
        {
          if(!is.na(dim))
          {
            temp2DF <- tempDF |>
              filter(Dimension %in% dim)
            sexes <- unique(temp2DF$Sex)
            for (sex in sexes)
            {
              if(!is.na(sex))
              {
                temp3DF <- temp2DF |>
                  filter(Sex %in% sex)
                ages <- unique(temp3DF$Age)
                for (age in ages)
                {
                  if(!is.na(age))
                  {
                    temp4DF <- temp3DF |>
                      filter(Age %in% age)
                    sumForThis <- sum(temp4DF$txtVALUE, na.rm = TRUE)
                    newlist <- data.frame(Country = country,
                                          Year = year,
                                          Dimension = dim,
                                          Sex = sex,
                                          Age = age,
                                          Num_Of_People = sumForThis)
                    newOthersDF <- bind_rows(newOthersDF, newlist)
                  } #End age if statement
                } #End age for loop
              } #End sex if statement
            } #End sex for loop
          } #End dim if statement
            
        } #End dim for loop
      } #End year if statement
    } #End year for loop
  } #End country if statement
} #End country for loop

#This will get rid of the NA row
newOthersDF <- newOthersDF[-1,]

finalDFTrafficking <- data.frame(Country = c(newOthersDF$Country, totals$Country),
                                 Year = c(newOthersDF$Year, totals$Year),
                                 Dimension = c(newOthersDF$Dimension,
                                               totals$Dimension),
                                 Sex = c(newOthersDF$Sex, totals$Sex),
                                 Age = c(newOthersDF$Age, totals$Age),
                                 Num_Of_People = c(newOthersDF$Num_Of_People,
                                                   totals$txtVALUE))
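
The nested loops above sum txtVALUE within each combination of Country, Year, Dimension, Sex, and Age. A minimal sketch of the same aggregation as a grouped sum follows. The data frame `toy_others` and its values are made up for illustration and only stand in for `others`. `dplyr::summarise` is called with its namespace because plyr exports a function of the same name.

```r
library(dplyr)

# Toy stand-in for `others`; every value here is hypothetical.
toy_others <- data.frame(
  Country   = c("A", "A", "A", "B", NA),
  Year      = c(2018, 2018, 2019, 2018, 2018),
  Dimension = rep("Hypothetical dimension", 5),
  Sex       = c("Male", "Male", "Female", "Male", "Male"),
  Age       = rep("Total", 5),
  txtVALUE  = c(2, 10, NA, 7, 99)
)

grouped <- toy_others |>
  # The loops skip any row whose grouping value is missing
  filter(!is.na(Country), !is.na(Year), !is.na(Dimension),
         !is.na(Sex), !is.na(Age)) |>
  group_by(Country, Year, Dimension, Sex, Age) |>
  # One row per combination, with missing counts ignored in the sum
  dplyr::summarise(Num_Of_People = sum(txtVALUE, na.rm = TRUE),
                   .groups = "drop")

print(grouped)
```

On the full data, this form avoids growing a data frame one row at a time, which is usually the slow part of the loop version.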

Research focus

Compare the average number of detected trafficking victims by age group and by gender over the years. Use hypothesis tests to determine whether the age groups differ and whether the genders differ. For locations that do not distinguish between ages or genders, split the counts equally among the categories.
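
One possible way to split undivided counts equally is sketched below. This is only an illustration under assumptions: the data frame `totals_only`, its values, and the choice of two categories are hypothetical, and the final project may handle the split differently.

```r
# Hypothetical rows that report only a total (values are made up)
totals_only <- data.frame(
  Country       = c("A", "B"),
  Year          = c(2019, 2019),
  Num_Of_People = c(10, 7)
)

sexes <- c("Male", "Female")

# merge() with no shared column names returns every row/category pairing
split_equally <- merge(totals_only, data.frame(Sex = sexes))

# Each category receives an equal share of the original total
split_equally$Num_Of_People <- split_equally$Num_Of_People / length(sexes)

print(split_equally)
```

The shares still add up to each country's original total, so overall counts are unchanged by the split.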

Cases

Each case is a record for one country. When multiple rows contain the same country, they come from different years or describe different genders or ages. Because of my research question, I include only data about victims, not convicted people or any other category. After reformatting, the data contained 18,257 cases.

Data collection

The data table was downloaded from the United Nations Office on Drugs and Crime (UNODC). The data were collected from national authorities, such as law enforcement and the criminal justice system in each region. UNODC collected the data with the Questionnaire for the Global Report on Trafficking in Persons (GLOTIP). The information reported in the questionnaire is verified against other sources. Although most data were collected with the questionnaire, some information came from other sources, such as official websites of national authorities or government reports.

Type of study

This is an observational study. No treatment was assigned and no variable was manipulated. Human trafficking can only be observed and reported on, with the goal of reducing its prevalence in the world.

Data Source

The original data were downloaded from the following link: https://dataunodc.un.org/dp-trafficking-persons

Dependent Variable

The dependent variable is the number of victims of trafficking. This is quantitative.

Independent Variable(s)

The independent variables are gender (males, females) and age group (less than 18, 18 and older).

Relevant summary statistics

The variables were age and gender. The categories for age were 0 to 17, and 18 or older. The categories for gender were Totals, Males, and Females. Combinations of these showed counts, indicating the number of people who fit in those categories. Please note that the graphs below will be improved for the final project.

Statistics for gender

options(scipen=999)

new <- finalDFTrafficking |> filter(Sex == "Female")
females <- sum(new$Num_Of_People, na.rm = TRUE)
new2 <- finalDFTrafficking |> filter(Sex == "Male")
males <- sum(new2$Num_Of_People, na.rm = TRUE)

df <- data.frame(sex = c("females", "males"),
                 count = c(females, males))

ggplot(df, aes(x=sex, y=count)) + 
  geom_bar(stat = "identity")

Statistics for age

new <- finalDFTrafficking |> filter(Age == "0 to 17 years")
less_than_18 <- sum(new$Num_Of_People, na.rm = TRUE)
new2 <- finalDFTrafficking |> filter(Age == "18 years or over")
eighteen_plus <- sum(new2$Num_Of_People, na.rm = TRUE)

df <- data.frame(age = c("0 to 17 years", "18 years or over"),
                 count = c(less_than_18, eighteen_plus))

ggplot(df, aes(x=age, y=count)) + 
  geom_bar(stat = "identity")

new <- finalDFTrafficking |> filter(Age == "0 to 17 years") |> filter(Sex == "Male")
less_than_18_male <- sum(new$Num_Of_People, na.rm = TRUE)

new1 <- finalDFTrafficking |> filter(Age == "0 to 17 years") |> filter(Sex == "Female")
less_than_18_female <- sum(new1$Num_Of_People, na.rm = TRUE)

new2 <- finalDFTrafficking |> filter(Age == "18 years or over") |> filter(Sex == "Male")
eighteen_plus_male <- sum(new2$Num_Of_People, na.rm = TRUE)

new3 <- finalDFTrafficking |> filter(Age == "18 years or over") |> filter(Sex == "Female")
eighteen_plus_female <- sum(new3$Num_Of_People, na.rm = TRUE)

df <- data.frame(category = c("17 or less + Males", "17 or less + Females",
                              "18 or more + Males", "18 or more + Females"),
                 count = c(less_than_18_male, less_than_18_female, 
                           eighteen_plus_male, eighteen_plus_female))

ggplot(df, aes(x=category, y=count)) + 
  geom_bar(stat = "identity")

Counts by gender and age group by country

countries <- unique(na.omit(finalDFTrafficking$Country))
otherDF <- c()

for (country in countries)
{
  temp <- finalDFTrafficking |> filter(Country == country) |> filter(Age == "0 to 17 years") |> filter(Sex == "Male")
  males_17_minus <- sum(temp$Num_Of_People, na.rm = TRUE)
  temp <- finalDFTrafficking |> filter(Country == country) |> filter(Age == "0 to 17 years") |> filter(Sex == "Female")
  females_17_minus <- sum(temp$Num_Of_People, na.rm = TRUE)
  temp <- finalDFTrafficking |> filter(Country == country) |> filter(Age == "18 years or over") |> filter(Sex == "Male")
  males_18_plus <- sum(temp$Num_Of_People, na.rm = TRUE)
  temp <- finalDFTrafficking |> filter(Country == country) |> filter(Age == "18 years or over") |> filter(Sex == "Female")
  females_18_plus <- sum(temp$Num_Of_People, na.rm = TRUE)
  
  newlist <- data.frame(Country = c(country),
                        males_17_or_less = c(males_17_minus),
                        females_17_or_less = c(females_17_minus),
                        males_18_plus = c(males_18_plus),
                        females_18_plus = c(females_18_plus))
  otherDF <- bind_rows(otherDF, newlist)
}

otherDF$Country <- gsub("United States of America", 'USA', otherDF$Country, useBytes = TRUE)

library(maps)

## 
## Attaching package: 'maps'

## The following object is masked from 'package:plyr':
## 
##     ozone

world_map <- map_data(map = "world")

# map_data("world") already names the United States "USA", which matches the
# renamed value in otherDF$Country, so world_map$region needs no change.


ggplot(otherDF) +
  geom_map(aes(map_id = Country, fill = males_17_or_less), map = world_map) +
  geom_polygon(data = world_map, aes(x = long, y = lat, group = group), colour = 'black', fill = NA) +
  expand_limits(x = world_map$long, y = world_map$lat) +
  scale_fill_viridis_c(option = "magma") +
  theme_void() +
  coord_fixed()

ggplot(otherDF) +
  geom_map(aes(map_id = Country, fill = females_17_or_less), map = world_map) +
  geom_polygon(data = world_map, aes(x = long, y = lat, group = group), colour = 'black', fill = NA) +
  expand_limits(x = world_map$long, y = world_map$lat) +
  scale_fill_viridis_c(option = "magma") +
  theme_void() +
  coord_fixed()

ggplot(otherDF) +
  geom_map(aes(map_id = Country, fill = males_18_plus), map = world_map) +
  geom_polygon(data = world_map, aes(x = long, y = lat, group = group), colour = 'black', fill = NA) +
  expand_limits(x = world_map$long, y = world_map$lat) +
  scale_fill_viridis_c(option = "magma") +
  theme_void() +
  coord_fixed()

ggplot(otherDF) +
  geom_map(aes(map_id = Country, fill = females_18_plus), map = world_map) +
  geom_polygon(data = world_map, aes(x = long, y = lat, group = group), colour = 'black', fill = NA) +
  expand_limits(x = world_map$long, y = world_map$lat) +
  scale_fill_viridis_c(option = "magma") +
  theme_void() +
  coord_fixed()

Statistical Tests

In the final project, I will create graphs showing the trends over time for certain countries by Sex and by Age. Some countries do not split their data by Sex or Age, so I will exclude them from these graphs. I will focus on the U.S. and the countries with the highest numbers of detected victims. I will also graph the trend over time in the total for each country I include.
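
A minimal sketch of one such trend graph, assuming a data frame with one row per year and sex. The data frame `toy_trend` and its counts are made up and do not come from the UNODC data.

```r
library(ggplot2)

# Hypothetical yearly counts for a single country (values are made up)
toy_trend <- data.frame(
  Year          = rep(2015:2019, times = 2),
  Sex           = rep(c("Female", "Male"), each = 5),
  Num_Of_People = c(50, 55, 61, 58, 70, 20, 24, 23, 30, 33)
)

# One line per sex, with points marking each reported year
trend_plot <- ggplot(toy_trend,
                     aes(x = Year, y = Num_Of_People, colour = Sex)) +
  geom_line() +
  geom_point() +
  labs(x = "Year", y = "Detected victims", colour = "Sex")

print(trend_plot)
```

The same layout would work for age groups by mapping `colour` to an age column instead.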

Next, I will perform two t-tests, one comparing the sexes and one comparing the age groups. The first will help determine whether one sex experiences human trafficking more than the other, and the second whether one age group experiences it more than the other. These tests respond directly to the focus of this research.
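
A minimal sketch of the test for sexes, using simulated counts rather than the real data. The data frame `toy` and the Poisson means are assumptions chosen only to make the example run. By default, `t.test()` performs Welch's two-sample test, which does not assume equal variances.

```r
set.seed(1)

# Simulated counts per record; these are not real trafficking figures
toy <- data.frame(
  Sex           = rep(c("Female", "Male"), each = 30),
  Num_Of_People = c(rpois(30, lambda = 40), rpois(30, lambda = 25))
)

# Two-sample t-test comparing mean counts between the sexes
sex_test <- t.test(Num_Of_People ~ Sex, data = toy)

print(sex_test)
```

The test for age groups would have the same form, with an age column on the right-hand side of the formula.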

Lastly, I will use linear regression for the totals by sex (one regression) and by age (a second regression) in each country to predict future totals. The purpose of this regression is to show an extension of this project and practice linear regression beyond classwork.
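
A minimal sketch of one such regression, assuming yearly totals for a single country and group. The data frame `toy_totals` and its values are made up, and a straight-line trend is itself an assumption that the real data may not support.

```r
# Hypothetical yearly totals for one country (values are made up)
toy_totals <- data.frame(
  Year          = 2010:2019,
  Num_Of_People = c(120, 131, 138, 150, 161, 170, 178, 192, 199, 210)
)

# Fit a straight-line trend of counts on year
fit <- lm(Num_Of_People ~ Year, data = toy_totals)

# Extend the fitted line to years beyond the observed range
future <- data.frame(Year = 2020:2022)
future$Predicted <- predict(fit, newdata = future)

print(summary(fit)$coefficients)
print(future)
```

Predictions of this kind extrapolate beyond the observed years, so they should be read as a rough extension of the trend rather than a forecast.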

Sources

https://stackoverflow.com/questions/71858134/create-ggplot2-map-in-r-using-count-by-country

The source above helped guide me in creating the maps.