library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 4.2.3
library(ggrepel)
library(broom)
library(lindia)
## Warning: package 'lindia' was built under R version 4.2.3
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(readxl)
htd <- read.csv("htd.csv")
linear_model <- lm(ACTUAL_COUNT ~ JUVENILE_CLEARED_COUNT + CLEARED_COUNT, data = htd)
summary(linear_model)
##
## Call:
## lm(formula = ACTUAL_COUNT ~ JUVENILE_CLEARED_COUNT + CLEARED_COUNT,
## data = htd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2654.2 -108.6 2.8 2.8 8116.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.20660 6.99213 8.754 < 2e-16 ***
## JUVENILE_CLEARED_COUNT -0.32616 0.11336 -2.877 0.00404 **
## CLEARED_COUNT 1.73992 0.02073 83.947 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 373.5 on 3095 degrees of freedom
## Multiple R-squared: 0.7123, Adjusted R-squared: 0.7121
## F-statistic: 3831 on 2 and 3095 DF, p-value: < 2.2e-16
Interpretation: The intercept (about 61.2) is the predicted ‘ACTUAL_COUNT’ when both juvenile cleared count and cleared count are zero, which might not have a practical interpretation. The negative coefficient for ‘JUVENILE_CLEARED_COUNT’ (-0.326) suggests that each additional juvenile clearance is associated with a decrease of about 0.33 in actual count, holding cleared count constant. Conversely, the positive coefficient for ‘CLEARED_COUNT’ (1.740) indicates that each additional clearance is associated with an increase of about 1.74 in actual count, holding juvenile cleared count constant.
The adjusted R-squared value of 0.7121 indicates that approximately 71.21% of the variance in ‘ACTUAL_COUNT’ is explained by the linear combination of ‘JUVENILE_CLEARED_COUNT’ and ‘CLEARED_COUNT’ in the model, after adjusting for the number of predictors. The very low p-value of the overall F-test (< 2.2e-16) indicates that the model as a whole is statistically significant. Individually, ‘CLEARED_COUNT’ is significant at p < 2e-16 and ‘JUVENILE_CLEARED_COUNT’ at p = 0.00404.
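The difference between the overall F-test p-value and the per-coefficient p-values can be read directly off a fitted model. The sketch below is illustrative only: it uses R's built-in `mtcars` data as a stand-in (an assumption, since `htd.csv` is not reproduced here), and the names `fit`, `coef_p`, and `f_p` are hypothetical.

```r
# Illustrative stand-in: mtcars replaces htd, which is not available here.
fit <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(fit)

# Per-coefficient p-values come from the t-tests in the coefficient table.
coef_p <- coef(s)[, "Pr(>|t|)"]

# The p-value on the last line of summary() belongs to the overall F-test,
# computed from the F statistic and its two degrees of freedom.
f <- s$fstatistic
f_p <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)

print(coef_p)
print(f_p)
```

Applied to `linear_model`, the same calls would presumably reproduce the three coefficient p-values and the single F-test p-value shown in the summary output.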
ggplot(htd, aes(x = JUVENILE_CLEARED_COUNT, y = ACTUAL_COUNT)) +
  geom_point(shape = 1, color = "blue") +
  geom_smooth(method = "lm", formula = y ~ x, color = "blue", se = FALSE) +
  labs(title = "Actual Count vs. Juvenile Cleared Count",
       x = "Juvenile Cleared Count",
       y = "Actual Count") +
  theme_minimal()
ggplot(htd, aes(x = CLEARED_COUNT, y = ACTUAL_COUNT)) +
  geom_point(shape = 1, color = "green") +
  geom_smooth(method = "lm", formula = y ~ x, color = "green", se = FALSE) +
  labs(title = "Actual Count vs. Cleared Count",
       x = "Cleared Count",
       y = "Actual Count") +
  theme_minimal()
Chart 1: Juvenile Cleared Count vs. Actual Count. In this chart, as the
juvenile cleared count increases, the actual count decreases. This
implies a negative correlation between the number of juvenile cases
cleared (potentially human trafficking incidents involving juveniles)
and the overall count of human trafficking incidents. One possible
interpretation is that law enforcement efforts focusing on juvenile
cases contribute to a reduction in overall reported human trafficking
incidents, although a correlation in observational data cannot by
itself establish that cause.
Chart 2: Cleared Count vs. Actual Count. In contrast to the first chart, an increase in the cleared count (which could represent the total number of cases cleared, not limited to juveniles) is associated with an increase in the actual count of human trafficking incidents. This suggests a positive correlation between the overall number of cases cleared by law enforcement and the total count of human trafficking incidents. A potential explanation is that as law enforcement agencies successfully address more cases, they also uncover more incidents, leading to a higher overall count.
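Each chart draws a single-predictor regression line, whereas the model coefficients are estimated with the other predictor held constant, so the two need not agree in sign. The sketch below illustrates this on simulated data only. It assumes that juvenile clearances are a subset of all clearances (an assumption, not something the source confirms), and every variable name and parameter value in it is hypothetical; it says nothing about the actual `htd` data.

```r
set.seed(1)
n <- 500

# Simulated counts (hypothetical): juvenile clearances are drawn as a
# subset of total clearances, so the two predictors are correlated.
cleared  <- rpois(n, lambda = 20)
juvenile <- rbinom(n, size = cleared, prob = 0.3)
actual   <- 2 * cleared - 0.5 * juvenile + rnorm(n)

# Slope seen in a one-predictor chart (what geom_smooth(method = "lm") draws).
simple_fit   <- lm(actual ~ juvenile)
simple_slope <- coef(simple_fit)[["juvenile"]]

# Coefficient on the same predictor once cleared count is held constant.
multi_fit     <- lm(actual ~ juvenile + cleared)
partial_slope <- coef(multi_fit)[["juvenile"]]

print(c(simple = simple_slope, partial = partial_slope))
```

In this simulation the one-predictor slope is positive while the two-predictor coefficient is negative, which is why a chart and a coefficient table are best interpreted separately.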
Transformation:
htd_cleaned <- htd |>
  mutate(JUVENILE_CLEARED_COUNT = ifelse(
    JUVENILE_CLEARED_COUNT == -Inf,
    median(JUVENILE_CLEARED_COUNT, na.rm = TRUE),
    JUVENILE_CLEARED_COUNT
  ))
htd_transformed <- htd_cleaned |>
  mutate(
    log_JUVENILE_CLEARED_COUNT = log(JUVENILE_CLEARED_COUNT + 1), # Adding 1 to avoid log(0)
    sqrt_CLEARED_COUNT = sqrt(CLEARED_COUNT)
  )
model_transformed <- lm(ACTUAL_COUNT ~ log_JUVENILE_CLEARED_COUNT + sqrt_CLEARED_COUNT, data = htd_transformed)
summary(model_transformed)
##
## Call:
## lm(formula = ACTUAL_COUNT ~ log_JUVENILE_CLEARED_COUNT + sqrt_CLEARED_COUNT,
## data = htd_transformed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2088.5 -295.5 114.9 132.9 10049.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -68.912 11.140 -6.186 6.97e-10 ***
## log_JUVENILE_CLEARED_COUNT 55.915 11.943 4.682 2.97e-06 ***
## sqrt_CLEARED_COUNT 53.551 1.184 45.222 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 518 on 3095 degrees of freedom
## Multiple R-squared: 0.4465, Adjusted R-squared: 0.4462
## F-statistic: 1248 on 2 and 3095 DF, p-value: < 2.2e-16
The coefficients of the transformed model give the expected change in ACTUAL_COUNT for a one-unit change in each transformed predictor, so they must be read on the transformed scale rather than in raw counts. The positive coefficients for both transformed predictors indicate that increases in the logarithm of juvenile cleared count and in the square root of cleared count are associated with higher ACTUAL_COUNT. The adjusted R-squared value of 0.4462 means that approximately 44.62% of the variance in ACTUAL_COUNT is explained by the transformed predictors, which is lower than the 0.7121 obtained by the untransformed model.
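Because both models share the same response variable, their adjusted R-squared values can be compared side by side. The following is a minimal sketch of that comparison, again using the built-in `mtcars` data as a stand-in (an assumption, since `htd.csv` is not reproduced here); the names `raw_fit`, `transformed_fit`, and `adj_r2` are hypothetical.

```r
# Illustrative stand-in: mtcars replaces htd, which is not available here.
raw_fit <- lm(mpg ~ hp + disp, data = mtcars)

# log1p(x) computes log(x + 1), the same "add 1 to avoid log(0)" transform.
transformed_fit <- lm(mpg ~ log1p(hp) + sqrt(disp), data = mtcars)

# Collect adjusted R-squared for both fits; valid to compare because the
# response (mpg) is identical and untransformed in both models.
adj_r2 <- c(
  raw         = summary(raw_fit)$adj.r.squared,
  transformed = summary(transformed_fit)$adj.r.squared
)
print(adj_r2)
```

The same pattern, applied to `linear_model` and `model_transformed`, would presumably return the 0.7121 and 0.4462 values reported in the summaries.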
ggplot(htd_transformed, aes(x = log_JUVENILE_CLEARED_COUNT, y = ACTUAL_COUNT)) +
  geom_point(shape = 1, color = "blue") +
  geom_smooth(method = "lm", formula = y ~ x, color = "blue", se = FALSE) +
  labs(title = "Actual Count vs. Log(Juvenile Cleared Count)",
       x = "Log(Juvenile Cleared Count)",
       y = "Actual Count") +
  theme_minimal()
ggplot(htd_transformed, aes(x = sqrt_CLEARED_COUNT, y = ACTUAL_COUNT)) +
  geom_point(shape = 1, color = "green") +
  geom_smooth(method = "lm", formula = y ~ x, color = "green", se = FALSE) +
  labs(title = "Actual Count vs. Square Root(Cleared Count)",
       x = "Square Root(Cleared Count)",
       y = "Actual Count") +
  theme_minimal()
Chart 1: Actual Count vs. Log(Juvenile Cleared Count): This chart
visualizes the relationship between the actual count of human
trafficking cases and the logarithm of the juvenile cleared count. In
cases related to human trafficking, law enforcement efforts often focus
on identifying and clearing cases involving juveniles. The positive
slope of the regression line in this chart suggests that an increase in
the log-transformed juvenile cleared count (indicative of successful
resolutions involving juveniles) corresponds to an increase in the
predicted actual count of human trafficking cases. This indicates that
as law enforcement successfully resolves cases involving juveniles,
there tends to be an increase in the overall reported human trafficking
cases. The logarithmic transformation might be capturing diminishing
returns - as law enforcement efforts target more juvenile cases, the
impact on overall human trafficking cases becomes proportionally
smaller.
Chart 2: Actual Count vs. Square Root(Cleared Count): This chart illustrates the relationship between the actual count of human trafficking cases and the square root of the cleared count. In this context, cleared count could represent the number of human trafficking cases successfully investigated and resolved by law enforcement agencies. The positive slope of the regression line signifies that an increase in the square root of cleared count (indicating a moderate increase in resolved cases) corresponds to an increase in the predicted actual count of human trafficking cases. Here, the square root transformation might be addressing the diminishing returns in a different way, capturing a more gradual impact of successful resolutions on the overall reported human trafficking cases.
These interpretations highlight the nuanced relationships between law enforcement efforts (specifically focusing on juveniles and overall case resolutions) and the reported instances of human trafficking. The logarithmic and square root transformations offer one mathematical means of capturing nonlinearity in these relationships, although here the transformed model explains less of the variance in actual count (adjusted R-squared of 0.4462 versus 0.7121), so it does not necessarily give a more accurate picture of the dynamics within the data.
Issues with models: The regular linear regression model and the transformation-based model both have potential issues. In the regular model, without transformations, there might be concerns about the assumptions of linearity and homoscedasticity, which could affect the reliability of predictions. The transformed predictors in the second model introduce interpretation challenges: while transformations like the logarithm and square root can mitigate skewed distributions, they also make the coefficients less intuitive, especially for readers not well-versed in these transformations. Furthermore, both models are limited in explaining the variability in the actual count of human trafficking cases, with adjusted R-squared values of 0.7121 and 0.4462 indicating a moderate fit at best. These limitations underscore the complexity of human trafficking data and suggest that factors beyond the variables included in the models influence the reported cases, which reduces the models’ explanatory power and reliability.
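The homoscedasticity concern raised above can be examined numerically. The sketch below is one possible check, not the method used in this analysis: it regresses the squared residuals on the fitted values, in the spirit of a Breusch-Pagan test, and uses the built-in `mtcars` data as a stand-in (an assumption). The names `aux`, `lm_stat`, and `p_value` are hypothetical. The `car` and `lindia` packages loaded at the top also provide `ncvTest()` and `gg_diagnose()`, which may be more convenient for the same purpose.

```r
# Illustrative stand-in: mtcars replaces htd, which is not available here.
fit <- lm(mpg ~ wt + hp, data = mtcars)

res         <- resid(fit)
fitted_vals <- fitted(fit)

# Auxiliary regression: if the residual variance is constant, the squared
# residuals should not be explained by the fitted values.
aux <- lm(I(res^2) ~ fitted_vals)

# n * R^2 of the auxiliary regression is compared against a chi-squared
# distribution with 1 degree of freedom (one auxiliary predictor).
lm_stat <- length(res) * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)

print(c(statistic = lm_stat, p_value = p_value))
```

A small p-value would suggest that the residual spread changes with the fitted values, which is the pattern the large residual extremes in both summaries hint at.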