Data Dive Documentation

#1 The “unfounded count” column is causing me some confusion. It is not clear how many reports were actually unfounded, or why most values in the column are zero. A zero could mean that no counts were unfounded, or it could have a different interpretation in the context of this column. The data originates from the FBI crime data site, and I suspect the encoding of this column was influenced by how many unfounded-count reports the FBI received from the agencies reporting human trafficking crimes.

The “Juvenile Cleared Count” column is predominantly zero, while the “Cleared Count” column has far fewer zero values. I’m curious why the two differ and whether the columns are correlated. It is odd that juvenile clearances are so rarely recorded when clearances overall are not. Is there a specific reason for this discrepancy in the data?
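One way to explore this question is to compare the share of zeros in each column and then check how the two columns move together. The sketch below uses base R on a small made-up data frame; the column names CLEARED_COUNT and JUVENILE_CLEARED_COUNT and all values are assumptions for illustration and may not match the real file.

```r
# Toy data only: the values and the column names CLEARED_COUNT and
# JUVENILE_CLEARED_COUNT are assumptions, not taken from the real file.
toy <- data.frame(
  CLEARED_COUNT          = c(0, 2, 1, 0, 5, 3, 0, 4),
  JUVENILE_CLEARED_COUNT = c(0, 0, 0, 0, 1, 0, 0, 2)
)

# Share of rows that are exactly zero in each column (NA rows are ignored).
zero_share <- sapply(toy, function(col) mean(col == 0, na.rm = TRUE))
print(zero_share)

# Rank-based correlation, using only rows where both columns are present.
rho <- cor(toy$CLEARED_COUNT, toy$JUVENILE_CLEARED_COUNT,
           method = "spearman", use = "complete.obs")
print(rho)

# Checks whether the juvenile count never exceeds the overall count.
juvenile_within_total <- all(
  toy$JUVENILE_CLEARED_COUNT <= toy$CLEARED_COUNT, na.rm = TRUE
)
print(juvenile_within_total)
```

If the last check holds on the real data, that would be consistent with juvenile clearances being a subset of all clearances, which would explain why the juvenile column has more zeros. That interpretation is a guess until the FBI documentation confirms it.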

The “population group description” column has also left me with questions. Specifically, I’m unsure what the terms “MSA counties” and “non-MSA counties” mean, beyond MSA presumably standing for Metropolitan Statistical Area. I suspect that the encoding of this column was influenced by the reports the FBI received from the counties or agencies in the areas where the reports originated.

#2 After reviewing the data documentation, I am still uncertain about the ‘unfounded count’. It is not clear what this count signifies or how it is calculated. The confusion grows when considering the offense name ‘human trafficking’: it is the only offense name in the data, yet it has two distinct subcategories, and their relationship to the ‘unfounded count’ is unclear. The count could be a collective total for both subcategories, or each subcategory could have its own separate ‘unfounded count’. The term ‘unfounded’ itself also needs more explanation. Does it refer to cases that were dismissed due to lack of evidence, or to cases that were reported but later retracted? The role and representation of ‘unfounded counts’ within these two subcategories of the ‘human trafficking’ offense needs further clarification before the data can be interpreted and analyzed accurately.
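A quick way to see whether each subcategory carries its own ‘unfounded count’ is to summarise the column once per subcategory. The sketch below is base R on made-up rows; "Subcategory A" and "Subcategory B" are placeholder labels, not the real subcategory names, and the values are invented.

```r
# Toy rows only: subcategory labels and counts are placeholders.
toy <- data.frame(
  OFFENSE_SUBCAT_NAME = c("Subcategory A", "Subcategory A", "Subcategory A",
                          "Subcategory B", "Subcategory B", "Subcategory B"),
  UNFOUNDED_COUNT     = c(0, 5, NA, 0, 0, 9)
)

# Build one summary row per subcategory.
by_subcat <- do.call(rbind, lapply(
  split(toy, toy$OFFENSE_SUBCAT_NAME),
  function(d) {
    data.frame(
      OFFENSE_SUBCAT_NAME = d$OFFENSE_SUBCAT_NAME[1],
      rows                = nrow(d),
      n_zero              = sum(d$UNFOUNDED_COUNT == 0, na.rm = TRUE),
      total_unfounded     = sum(d$UNFOUNDED_COUNT, na.rm = TRUE)
    )
  }
))
rownames(by_subcat) <- NULL
print(by_subcat)
```

If the real data gives different totals for the two subcategories, the count is most likely recorded per subcategory rather than as one collective figure, although only the source documentation can settle that.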

htd <- read.csv("C:\\Users\\moore\\OneDrive\\Desktop\\Fall 2023\\Intro to statistics\\project\\Statistics Project\\Statistics Project\\Human Trafficking data.csv")

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(ggthemes)

## Warning: package 'ggthemes' was built under R version 4.2.3

library(ggrepel)

htd |>
  ggplot() +
  geom_boxplot(mapping = aes(x = UNFOUNDED_COUNT, y = "")) +
  labs(title = "Unfounded count outliers",
       x = "Unfounded", y = "") +
  theme_classic()

## Warning: Removed 1969 rows containing non-finite values (stat_boxplot).

The boxplot of the ‘unfounded count’ column shows that most values are zero, with numerous outliers. Because the zeros dominate, the box collapses at zero and the plot says little about how significant the outliers are or what the distribution really looks like. The warning above also shows that 1969 rows with non-finite (missing) values were removed before plotting.
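The reason so many points are flagged can be sketched with base R's boxplot.stats(), which applies the usual 1.5 × IQR rule. The values below are made up; the point is only that when most values are zero the interquartile range is zero, so every non-zero value falls outside the whiskers. ggplot2 computes its hinges slightly differently, so treat this as an approximation of what the plot does.

```r
# Toy values: mostly zeros, a few positive counts, one missing value.
unfounded <- c(rep(0, 20), 5, 9, 10, 25, NA)

# boxplot.stats() ignores NA values when computing the summary.
bs <- boxplot.stats(unfounded)

# Lower whisker, lower hinge, median, upper hinge, upper whisker.
print(bs$stats)

# Values beyond the whiskers, i.e. the points drawn as outliers.
flagged <- bs$out
print(flagged)
```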

htd |>
  ggplot() +
  geom_histogram(mapping = aes(x = UNFOUNDED_COUNT), color = "white", 
                 fill = "#3182bd") +
  labs(title = "Unfounded count from Data",
       x = "Unfounded count", y = "") +
  theme_classic()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 1969 rows containing non-finite values (stat_bin).

htd |>
  mutate(unfounded_is_na = is.na(UNFOUNDED_COUNT)) |>
  ggplot() +
  geom_bar(mapping = aes(x = OFFENSE_SUBCAT_NAME, fill = unfounded_is_na)) +
  scale_fill_brewer(palette = "Dark2") +
  labs(title = "Missing Values by offense subcategory name",
       y = "Number of Rows",
       fill = "Missing Unfounded?") +
  theme_classic()

The visualization shows, for each offense subcategory, how many rows are missing an ‘unfounded count’. These missing values make the column ambiguous within each subcategory, and ignoring them can lead to misinterpretation of the data and skew any analysis or conclusions drawn from it. The risks include biased decision-making and inaccurate inferential statistics. To mitigate these risks, it is crucial to acknowledge this limitation when interpreting the visualization. Statistical techniques that can handle missing data, such as imputation methods or probabilistic models, can help produce more reliable outcomes.
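The bar chart gives raw row counts, so it can help to put a number on how much of each subcategory is missing. A minimal base R sketch on made-up rows follows; the subcategory labels are placeholders and the values are invented.

```r
# Toy rows only: subcategory labels and counts are placeholders.
toy <- data.frame(
  OFFENSE_SUBCAT_NAME = c(rep("Subcategory A", 4), rep("Subcategory B", 4)),
  UNFOUNDED_COUNT     = c(0, NA, NA, 3, 0, 0, NA, 0)
)

# Proportion of rows with a missing UNFOUNDED_COUNT in each subcategory.
# The mean of a logical vector is the share of TRUE values.
missing_share <- tapply(
  is.na(toy$UNFOUNDED_COUNT), toy$OFFENSE_SUBCAT_NAME, mean
)
print(round(missing_share, 2))
```

If the shares differ a lot between subcategories in the real data, the values are probably not missing at random, which matters when choosing an imputation method.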

value_counts <- htd |>
  group_by(UNFOUNDED_COUNT) |>
  summarise(count = n())
print(value_counts)

## # A tibble: 34 × 2
##    UNFOUNDED_COUNT count
##              <int> <int>
##  1               0   980
##  2               5     2
##  3               9     1
##  4              10     1
##  5              25     1
##  6              30     1
##  7              43     1
##  8              47     1
##  9              49     1
## 10              51     1
## # … with 24 more rows

# Bar chart of how often each distinct UNFOUNDED_COUNT value occurs
ggplot(value_counts, aes(x = factor(UNFOUNDED_COUNT), y = count, fill = factor(UNFOUNDED_COUNT))) +
  geom_col() +  # same result as geom_bar(stat = "identity")
  theme_minimal() +
  ggtitle("Bar Chart of Value Counts") +
  xlab("Values") +
  ylab("Count")

The visualization, which displays how often each value occurs in the ‘unfounded count’ column, shows that zeros and NA values are by far the most prevalent. Their dominance overshadows the other values and makes the column hard to interpret. The risk is that this skews any analysis or conclusions drawn from the column, leading to inaccurate insights. To reduce these negative consequences, it is important to interpret the visualization with this limitation in mind. Statistical techniques designed for sparse data can help produce more reliable outcomes. Overall, the visualizations show the unfounded count column’s potential problems for further analysis and emphasize the uncertainty about what a value of 0 represents.
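Before any modelling, it may help to separate the three situations that the column mixes together: missing, zero, and positive. The sketch below does this in base R on made-up values; whether a zero in the real data means "none unfounded" or "not reported" is still an open question that this code cannot answer.

```r
# Toy values only.
unfounded <- c(0, 0, 0, NA, NA, 5, 0, 43, NA, 0)

# Label each row as missing, zero, or positive.
status <- ifelse(is.na(unfounded), "missing",
                 ifelse(unfounded == 0, "zero", "positive"))
status <- factor(status, levels = c("missing", "zero", "positive"))

# Row counts and proportions per label.
status_counts <- table(status)
print(status_counts)
print(round(prop.table(status_counts), 2))
```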

2023-09-24