Null hypothesis: There is no significant difference in correlation between the clear counts and actual counts among the regions.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 4.2.3
library(ggrepel)
library(AmesHousing)
## Warning: package 'AmesHousing' was built under R version 4.2.3
library(boot)
library(broom)
library(lindia)
## Warning: package 'lindia' was built under R version 4.2.3
htd <- read.csv("C:\\Users\\moore\\OneDrive\\Desktop\\Fall 2023\\Intro to statistics\\project\\Statistics Project\\Statistics Project\\htd.csv")
n <- nrow(htd)
k <- n_distinct(htd$REGION_NAME)
ggplot() +
geom_function(xlim = c(0, 10),
fun = \(x) df(x, k - 1, n - k)) +
geom_vline(xintercept = 1, color = 'orange') +
labs(title = 'F Distribution for Regions',
x = "F Values",
y = "Probability Density") +
theme_hc()
m <- aov(CLEARED_COUNT ~ REGION_NAME, data = htd)
summary(m)
## Df Sum Sq Mean Sq F value Pr(>F)
## REGION_NAME 4 3258497 814624 7.055 1.19e-05 ***
## Residuals 3093 357157683 115473
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This anova test concludes there is enough evidence to suggest that there is significant difference among means of cleared counts for each state.
model <- lm(CLEARED_COUNT ~ ACTUAL_COUNT, htd)
model$coefficients
## (Intercept) ACTUAL_COUNT
## 3.4871437 0.4133829
south_region <- htd |>
filter(REGION_NAME == "South")
I focused on the southern region due to its consistently high counts, not just in actual count but also in overall dataset prevalence. This strategic choice illuminates specific patterns and trends within a region known for its significant data representation, providing valuable insights for analysis.
south_region |>
ggplot(aes(x = CLEARED_COUNT, y = ACTUAL_COUNT, group = REGION_NAME)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "blue") +
facet_wrap(vars(REGION_NAME), labeller = label_both) +
labs(title = "Scatter Plot and Regression Lines",
subtitle = "Faceted by South Region",
x = "CLEARED_COUNT", y = "ACTUAL_COUNT") +
theme_minimal()
## `geom_smooth()` using formula 'y ~ x'
From an earlier data dive I conducted an analysis that the south region had the highest actual counts among all regions. This shows that as the cleared count increased actual count increases to showcasing a positive correlation.
model2 <- lm(CLEARED_COUNT ~ JUVENILE_CLEARED_COUNT + ACTUAL_COUNT, htd)
model2$coefficients
## (Intercept) JUVENILE_CLEARED_COUNT ACTUAL_COUNT
## 3.0615945 0.6552555 0.3993505
htd |>
ggplot(aes(x = CLEARED_COUNT, y = JUVENILE_CLEARED_COUNT)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "blue") +
labs(title = "Scatter Plot and Regression Lines",
subtitle = "",
x = "CLEARED_COUNT", y = "Juvenile clear counts") +
theme_minimal()
## `geom_smooth()` using formula 'y ~ x'
This is the relationship between cleared counts and juvenile cleared
counts in the dataset. The plot displays another linear realtionship
which for every increase in clear counts juvenile clear counts
increase.
model3 <- lm(JUVENILE_CLEARED_COUNT ~ ACTUAL_COUNT, htd)
model3
##
## Call:
## lm(formula = JUVENILE_CLEARED_COUNT ~ ACTUAL_COUNT, data = htd)
##
## Coefficients:
## (Intercept) ACTUAL_COUNT
## 0.64944 0.02142
htd |>
ggplot(aes(x = JUVENILE_CLEARED_COUNT, y = ACTUAL_COUNT)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "blue") +
labs(title = "Scatter Plot and Regression Lines",
subtitle = "",
x = "Juvenile Count", y = "ACTUAL_COUNT") +
theme_minimal()
## `geom_smooth()` using formula 'y ~ x'
This plot represents the predicted relationship between Juvenile Count
and ACTUAL_COUNT according to the linear regression model. The positive
slope of the line indicates that as Juvenile Count increases,
ACTUAL_COUNT also tends to increase.
htd |>
ggplot(mapping = aes(x = CLEARED_COUNT, y = ACTUAL_COUNT,
color = factor(REGION_NAME))) +
geom_jitter(height = 0, width = 0.1, shape = 'o', size = 3) +
geom_smooth(method = 'lm', se = FALSE, linewidth = 0.5) +
scale_y_continuous(labels = \(x) paste("$", x / 1000, "K")) +
scale_color_brewer(palette = 'Dark2') +
labs(title = "Actual Count by Cleared Count",
subtitle = "Colored by Region Abbreviation",
x = "Cleared Count (x-jittered)", y = "Actual Count",
color = 'Region Abbreviation') +
theme_minimal()
## Warning: Ignoring unknown parameters: linewidth
## `geom_smooth()` using formula 'y ~ x'
# include all variables and their interaction
model <- lm(ACTUAL_COUNT ~ CLEARED_COUNT + JUVENILE_CLEARED_COUNT, south_region)
# to view more coefficients a bit easier
tidy(model) |>
select(term, estimate) |>
mutate(estimate = round(estimate, 1))
## # A tibble: 3 × 2
## term estimate
## <chr> <dbl>
## 1 (Intercept) 76.5
## 2 CLEARED_COUNT 1.7
## 3 JUVENILE_CLEARED_COUNT -1.3
The positive coefficient for CLEARED_COUNT suggests that an increase in the number of cases cleared is associated with a rise in the predicted count of actual trafficking incidents. This could indicate that law enforcement efforts leading to cleared cases are effective in identifying and addressing human trafficking. Conversely, the negative coefficient for JUVENILE_CLEARED_COUNT implies that as the number of cleared juvenile cases increases, the predicted count of actual trafficking incidents decreases. This might suggest a preventive effect, indicating that successful intervention in juvenile cases might deter potential human trafficking incidents involving minors.
west_region <- htd |>
filter(REGION_NAME == "West")
model2 <- lm(ACTUAL_COUNT ~ CLEARED_COUNT + JUVENILE_CLEARED_COUNT, west_region)
# to view more coefficients a bit easier
tidy(model2) |>
select(term, estimate) |>
mutate(estimate = round(estimate, 1))
## # A tibble: 3 × 2
## term estimate
## <chr> <dbl>
## 1 (Intercept) 92.1
## 2 CLEARED_COUNT 2
## 3 JUVENILE_CLEARED_COUNT 0.7
Now we see compared to the south region the west is completely
different. The coefficients from the linear regression model provide
insights into the relationship between different types of cleared cases
and the actual count of events in the West region. The intercept of 92.1
suggests a baseline value for ACTUAL_COUNT when there are
no cleared cases. The positive coefficients for both
CLEARED_COUNT (2.0) and JUVENILE_CLEARED_COUNT
(0.7) indicate that as the number of cleared regular cases increases by
one unit, the actual count of events is expected to rise by 2.0 units,
and a similar pattern holds for juvenile cleared cases. This implies a
positive correlation between law enforcement activities (measured by
cleared cases) and the occurrence of actual events, indicating that
higher law enforcement clearance rates are associated with a higher
actual count of events in the West region of the dataset.