Regression

Null hypothesis: There is no significant difference in correlation between the clear counts and actual counts among the regions.

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(ggthemes)

## Warning: package 'ggthemes' was built under R version 4.2.3

library(ggrepel)
library(AmesHousing)

## Warning: package 'AmesHousing' was built under R version 4.2.3

library(boot)
library(broom)
library(lindia)

## Warning: package 'lindia' was built under R version 4.2.3

htd <- read.csv("C:\\Users\\moore\\OneDrive\\Desktop\\Fall 2023\\Intro to statistics\\project\\Statistics Project\\Statistics Project\\htd.csv")

n <- nrow(htd)
k <- n_distinct(htd$REGION_NAME)

ggplot() +
  geom_function(xlim = c(0, 10), 
                fun = \(x) df(x, k - 1, n - k)) +
  geom_vline(xintercept = 1, color = 'orange') +
  labs(title = 'F Distribution for Regions',
       x = "F Values",
       y = "Probability Density") +
  theme_hc()

m <- aov(CLEARED_COUNT ~ REGION_NAME, data = htd)
summary(m)

##               Df    Sum Sq Mean Sq F value   Pr(>F)    
## REGION_NAME    4   3258497  814624   7.055 1.19e-05 ***
## Residuals   3093 357157683  115473                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This anova test concludes there is enough evidence to suggest that there is significant difference among means of cleared counts for each state.

model <- lm(CLEARED_COUNT ~ ACTUAL_COUNT, htd)
model$coefficients

##  (Intercept) ACTUAL_COUNT 
##    3.4871437    0.4133829

south_region <- htd |>
  filter(REGION_NAME == "South")

I focused on the southern region due to its consistently high counts, not just in actual count but also in overall dataset prevalence. This strategic choice illuminates specific patterns and trends within a region known for its significant data representation, providing valuable insights for analysis.

south_region |>
  ggplot(aes(x = CLEARED_COUNT, y = ACTUAL_COUNT, group = REGION_NAME)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  facet_wrap(vars(REGION_NAME), labeller = label_both) +
  labs(title = "Scatter Plot and Regression Lines",
       subtitle = "Faceted by South Region",
       x = "CLEARED_COUNT", y = "ACTUAL_COUNT") +
  theme_minimal()

## `geom_smooth()` using formula 'y ~ x'

From an earlier data dive I conducted an analysis that the south region had the highest actual counts among all regions. This shows that as the cleared count increased actual count increases to showcasing a positive correlation.

model2 <- lm(CLEARED_COUNT ~ JUVENILE_CLEARED_COUNT + ACTUAL_COUNT, htd)
model2$coefficients

##            (Intercept) JUVENILE_CLEARED_COUNT           ACTUAL_COUNT 
##              3.0615945              0.6552555              0.3993505

htd |>
  ggplot(aes(x = CLEARED_COUNT, y = JUVENILE_CLEARED_COUNT)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "Scatter Plot and Regression Lines",
       subtitle = "",
       x = "CLEARED_COUNT", y = "Juvenile clear counts") +
  theme_minimal()

## `geom_smooth()` using formula 'y ~ x'

This is the relationship between cleared counts and juvenile cleared counts in the dataset. The plot displays another linear realtionship which for every increase in clear counts juvenile clear counts increase.

model3 <- lm(JUVENILE_CLEARED_COUNT ~ ACTUAL_COUNT, htd)
model3

## 
## Call:
## lm(formula = JUVENILE_CLEARED_COUNT ~ ACTUAL_COUNT, data = htd)
## 
## Coefficients:
##  (Intercept)  ACTUAL_COUNT  
##      0.64944       0.02142

htd |>
  ggplot(aes(x = JUVENILE_CLEARED_COUNT, y = ACTUAL_COUNT)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "Scatter Plot and Regression Lines",
       subtitle = "",
       x = "Juvenile Count", y = "ACTUAL_COUNT") +
  theme_minimal()

## `geom_smooth()` using formula 'y ~ x'

This plot represents the predicted relationship between Juvenile Count and ACTUAL_COUNT according to the linear regression model. The positive slope of the line indicates that as Juvenile Count increases, ACTUAL_COUNT also tends to increase.

htd |>
  ggplot(mapping = aes(x = CLEARED_COUNT, y = ACTUAL_COUNT,
                       color = factor(REGION_NAME))) +
  geom_jitter(height = 0, width = 0.1, shape = 'o', size = 3) +
  geom_smooth(method = 'lm', se = FALSE, linewidth = 0.5) +
  scale_y_continuous(labels = \(x) paste("$", x / 1000, "K")) +
  scale_color_brewer(palette = 'Dark2') +
  labs(title = "Actual Count by Cleared Count",
       subtitle = "Colored by Region Abbreviation",
       x = "Cleared Count (x-jittered)", y = "Actual Count",
       color = 'Region Abbreviation') +
  theme_minimal()

## Warning: Ignoring unknown parameters: linewidth

## `geom_smooth()` using formula 'y ~ x'

# include all variables and their interaction
model <- lm(ACTUAL_COUNT ~ CLEARED_COUNT + JUVENILE_CLEARED_COUNT, south_region)

# to view more coefficients a bit easier
tidy(model) |>
  select(term, estimate) |>
  mutate(estimate = round(estimate, 1))

## # A tibble: 3 × 2
##   term                   estimate
##   <chr>                     <dbl>
## 1 (Intercept)                76.5
## 2 CLEARED_COUNT               1.7
## 3 JUVENILE_CLEARED_COUNT     -1.3

The positive coefficient for CLEARED_COUNT suggests that an increase in the number of cases cleared is associated with a rise in the predicted count of actual trafficking incidents. This could indicate that law enforcement efforts leading to cleared cases are effective in identifying and addressing human trafficking. Conversely, the negative coefficient for JUVENILE_CLEARED_COUNT implies that as the number of cleared juvenile cases increases, the predicted count of actual trafficking incidents decreases. This might suggest a preventive effect, indicating that successful intervention in juvenile cases might deter potential human trafficking incidents involving minors.

west_region <- htd |>
  filter(REGION_NAME == "West")
model2 <- lm(ACTUAL_COUNT ~ CLEARED_COUNT + JUVENILE_CLEARED_COUNT, west_region)

# to view more coefficients a bit easier
tidy(model2) |>
  select(term, estimate) |>
  mutate(estimate = round(estimate, 1))

## # A tibble: 3 × 2
##   term                   estimate
##   <chr>                     <dbl>
## 1 (Intercept)                92.1
## 2 CLEARED_COUNT               2  
## 3 JUVENILE_CLEARED_COUNT      0.7

Now we see compared to the south region the west is completely different. The coefficients from the linear regression model provide insights into the relationship between different types of cleared cases and the actual count of events in the West region. The intercept of 92.1 suggests a baseline value for ACTUAL_COUNT when there are no cleared cases. The positive coefficients for both CLEARED_COUNT (2.0) and JUVENILE_CLEARED_COUNT (0.7) indicate that as the number of cleared regular cases increases by one unit, the actual count of events is expected to rise by 2.0 units, and a similar pattern holds for juvenile cleared cases. This implies a positive correlation between law enforcement activities (measured by cleared cases) and the occurrence of actual events, indicating that higher law enforcement clearance rates are associated with a higher actual count of events in the West region of the dataset.

Regression

2023-10-19