This homework gives students the opportunity to practice the second set of inferential statistics lessons. At least as much as the last homework, we want to be thinking about the INTERPRETATION of the output, so most of the work here is accurately describing what you found, and what it means. We are going to use a new dataset, covering the Uniform Crime Reporting Program from 1960 through 2020. You should find the dataset, alongside a manual, in the same module as this assignment on Canvas. In this dataset, a row can be considered an agency-year combination - each row reflects the crimes. arrests, and clearances reported to a particular field office in a particular year.
setwd("C:/Users/emily/OneDrive/바탕 화면/FELS/Use_Misuse of Data")
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.4.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
ucr <- readRDS("ucr.rds")
The codebook for the dataset is available in the same folder as this homework, and is very useful to look through when completing this assignment. However, we are going to start simply. I want you to focus on areas where the population is over 100,000, using the column “population_1.” Create a second dataset that subsets to only those observations, and keep only the population column and the coloumns “agency_name”, “state_abb”, “year”, “actual_robbery_total”, “actual_all_crimes”, “tot_clr_robbery_total”, and “tot_clr_all_crimes”.
ucr_100k <- ucr %>%
filter(population_1 > 100000) %>%
select(
population_1,
agency_name,
state_abb,
year,
actual_robbery_total,
actual_all_crimes,
tot_clr_robbery_total,
tot_clr_all_crimes
)
head(ucr_100k)
## population_1 agency_name state_abb year actual_robbery_total
## 1 286388 anchorage AK 2020 558
## 2 287731 anchorage AK 2019 621
## 3 291992 anchorage AK 2018 717
## 4 296188 anchorage AK 2017 778
## 5 299097 anchorage AK 2016 695
## 6 301239 anchorage AK 2015 621
## actual_all_crimes tot_clr_robbery_total tot_clr_all_crimes
## 1 16856 207 5001
## 2 19993 196 5872
## 3 22398 276 6417
## 4 23750 216 5997
## 5 22336 205 5679
## 6 19474 213 5686
When data comes to you, it is frequently not in the final version you would prefer. You may have to create new variables that are more similar to outcomes that you want to use in your analysis. Here, raw numbers of crimes and clearances may not be very helpful, particularly since units have such different populations (even after pulling only the largest geographies). So create two new variables: the percent of robberies “cleared” (solved), and the percent of all crimes cleared. What is the average clearance rate for robberies across all units? For all crimes? Which state has the highest average clearance rate for all crimes over the years (here, consider each unit within a state equally, rather than weighting by population)? Finally, which years saw the highest average clearance rate for robberies? Is there a trend you can spot there?
# Create percent clearance variables
ucr_100k <- ucr_100k %>%
mutate(
robbery_clr_pct = (tot_clr_robbery_total / actual_robbery_total) * 100,
all_clr_pct = (tot_clr_all_crimes / actual_all_crimes) * 100)
# Averages across all observations
mean_robbery_clr <- mean(ucr_100k$robbery_clr_pct, na.rm = TRUE)
mean_allclr <- mean(ucr_100k$all_clr_pct, na.rm = TRUE)
mean_robbery_clr #30.4
## [1] 30.43298
mean_allclr #24.9
## [1] 24.9432
state_highest_allclr <- ucr_100k %>%
group_by(state_abb) %>%
summarize(avg_allclr = mean(all_clr_pct, na.rm = TRUE)) %>%
arrange(desc(avg_allclr))
state_highest_allclr #DE has the highest average all-crime clearance rate
## # A tibble: 50 × 2
## state_abb avg_allclr
## <chr> <dbl>
## 1 DE 45.6
## 2 SD 39.7
## 3 WV 34.4
## 4 ID 33.8
## 5 ND 32.8
## 6 AK 31.3
## 7 NC 30.9
## 8 NE 30.4
## 9 MD 30.4
## 10 FL 29.0
## # ℹ 40 more rows
year_robbery_clr <- ucr_100k %>%
group_by(year) %>%
summarize(avg_robbery_clr = mean(robbery_clr_pct, na.rm = TRUE)) %>%
arrange(desc(avg_robbery_clr))
year_robbery_clr #Top three years with highest robbery clearance rates are 1965, 1964, and 2018.
## # A tibble: 61 × 2
## year avg_robbery_clr
## <dbl> <dbl>
## 1 1965 38.9
## 2 1964 36.9
## 3 2018 36.2
## 4 1966 35.9
## 5 2013 35.9
## 6 2019 35.7
## 7 2020 35.5
## 8 2017 35.3
## 9 2015 35.1
## 10 2014 34.9
## # ℹ 51 more rows
#Comment on trends: Robbery clearance rates peak in the mid-1960s, decline over the following decades, and then rise again in the mid to late 2010s. However, the high clearance rates in the 1960s may be due to limited agency participation and less standardized reporting early in the UCR program, which could've inflated clearance percentages. Similarly, more recent increases could result from improved reporting systems or fewer robbery incidents being reported, rather than a true increase in the number of cases solved.
Now on to some new questions. Let’s ask three basic questions that could be answered via regression:
1.) We might think that there are returns to size (efficiencies discovered in scale), so that bigger departments might be better at fighting crime. Using a simple bivariate regression, test whether the size of a city (in population) is correlated with a higher clearance rate for crimes.
2.) Consider only the last 30 years. Has law enforcement gotten better during this time at clearing robberies? (Note that these are reported crimes, so technically we cannot fully answer this question, but use the data as if it were complete)
3.) Has the rate of robbery increased over the last 30 years? Is it better or worse in smaller cities? Use a multivariate regression explaining the per capita rate of robbery over the past 30 years with both a time trend and a city’s population.
For each of the three questions above, run the necessary regression, and explain each of the estimates, including the intercept where appropriate. Answer the actual underlying substantive question to the best of your ability.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.2
library(lmtest)
## Warning: package 'lmtest' was built under R version 4.4.2
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.4.2
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(sandwich) #for SEs
## Warning: package 'sandwich' was built under R version 4.4.2
#1.Simple linear model (population in raw persons)
lm_pop_raw <- lm(all_clr_pct ~ population_1, data = ucr_100k)
summary(lm_pop_raw)
##
## Call:
## lm(formula = all_clr_pct ~ population_1, data = ucr_100k)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.343 -6.956 -0.208 7.027 74.692
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.554e+01 1.003e-01 254.58 <2e-16 ***
## population_1 -1.969e-06 1.624e-07 -12.12 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.33 on 19912 degrees of freedom
## (627 observations deleted due to missingness)
## Multiple R-squared: 0.007328, Adjusted R-squared: 0.007278
## F-statistic: 147 on 1 and 19912 DF, p-value: < 2.2e-16
#(heteroskedasticity-consistent) SEs
coeftest(lm_pop_raw, vcov = vcovHC(lm_pop_raw, type = "HC1"))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.5540e+01 1.0195e-01 250.516 < 2.2e-16 ***
## population_1 -1.9692e-06 1.6312e-07 -12.072 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#This regression examined whether larger city populations are associated with higher clearance rates for all crimes. The model shows a statistically significant but negative relationship between population and clearance rate (b = −1.97e−06, p < 0.001). This means that for every additional 100,000 residents, the predicted clearance rate decreases by roughly 0.2 percentage points. The intercept of 25.5 represents the expected clearance rate when population is near zero (baseline). Although the association is highly significant, the model’s R² of 0.007 indicates that population size alone explains less than 1% of the variation in clearance rates. So, larger jurisdictions tend to have slightly lower clearance rates, but the effect is small.
#2.
# Filter to last 30 years (e.g., 1990–2020)
ucr_30 <- subset(ucr_100k, year >= 1990)
# Regression: robbery clearance rate on year
lm_rob_year <- lm(robbery_clr_pct ~ year, data = ucr_30)
summary(lm_rob_year)
##
## Call:
## lm(formula = robbery_clr_pct ~ year, data = ucr_30)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.67 -10.13 -1.50 8.70 366.40
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -560.22067 33.52415 -16.71 <2e-16 ***
## year 0.29500 0.01671 17.66 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.5 on 12493 degrees of freedom
## (320 observations deleted due to missingness)
## Multiple R-squared: 0.02434, Adjusted R-squared: 0.02426
## F-statistic: 311.7 on 1 and 12493 DF, p-value: < 2.2e-16
#The regression shows a significant positive relationship between year and robbery clearance rate (b = 0.295, p < 0.001). This indicates that clearance rates have increased by about 0.3 percentage points per year over the last 30 years. So perhaps, law enforcement has gotten slightly better at clearing robberies over time, though the low R² (0.024) suggests that year alone explains little of the variation.
#3
# Run multivariate regression
# Create robbery rate per 100k and run multivariate regression
ucr_30 <- ucr_100k %>%
mutate(robbery_rate = (actual_robbery_total / population_1) * 100000) %>%
filter(year >= 1990)
lm_rob_rate <- lm(robbery_rate ~ year + log(population_1 + 1), data = ucr_30)
summary(lm_rob_rate)
##
## Call:
## lm(formula = robbery_rate ~ year + log(population_1 + 1), data = ucr_30)
##
## Residuals:
## Min 1Q Median 3Q Max
## -414.52 -118.35 -39.89 67.26 1973.80
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13758.5449 385.8827 35.66 <2e-16 ***
## year -7.3373 0.1917 -38.27 <2e-16 ***
## log(population_1 + 1) 93.4454 2.4677 37.87 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 192.3 on 12812 degrees of freedom
## Multiple R-squared: 0.1849, Adjusted R-squared: 0.1848
## F-statistic: 1453 on 2 and 12812 DF, p-value: < 2.2e-16
coeftest(lm_rob_rate, vcov = vcovHC(lm_rob_rate, type = "HC1"))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13758.54489 454.07250 30.300 < 2.2e-16 ***
## year -7.33734 0.22955 -31.964 < 2.2e-16 ***
## log(population_1 + 1) 93.44543 2.97514 31.409 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#This regression examines how robbery rates (per 100,000 residents) vary with time and city size over the past 30 years. The coefficient on year is −7.34 (p < 0.001), indicating that robbery rates have decreased by about 7 incidents per 100,000 residents each year on average. The coefficient on log(population) is 93.45 (p < 0.001), showing that larger cities tend to have higher robbery rates per capita than smaller ones. The R² of 0.18 suggests that the model explains about 18% of the variation in robbery rates. So, robbery has become less common over time, but remains more frequent in larger urban areas.