Inferential Statistics II

This homework gives students the opportunity to practice the second set of inferential statistics lessons. At least as much as the last homework, we want to be thinking about the INTERPRETATION of the output, so most of the work here is accurately describing what you found, and what it means. We are going to use a new dataset, covering the Uniform Crime Reporting Program from 1960 through 2020. You should find the dataset, alongside a manual, in the same module as this assignment on Canvas. In this dataset, a row can be considered an agency-year combination - each row reflects the crimes. arrests, and clearances reported to a particular field office in a particular year.

setwd("C:/Users/emily/OneDrive/바탕 화면/FELS/Use_Misuse of Data")
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.4.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
ucr <- readRDS("ucr.rds")

Question 1 - Practice Cleaning Data

The codebook for the dataset is available in the same folder as this homework, and is very useful to look through when completing this assignment. However, we are going to start simply. I want you to focus on areas where the population is over 100,000, using the column “population_1.” Create a second dataset that subsets to only those observations, and keep only the population column and the coloumns “agency_name”, “state_abb”, “year”, “actual_robbery_total”, “actual_all_crimes”, “tot_clr_robbery_total”, and “tot_clr_all_crimes”.

ucr_100k <- ucr %>%
  filter(population_1 > 100000) %>%
  select(
    population_1,
    agency_name,
    state_abb,
    year,
    actual_robbery_total,
    actual_all_crimes,
    tot_clr_robbery_total,
    tot_clr_all_crimes
  )
head(ucr_100k)
##   population_1 agency_name state_abb year actual_robbery_total
## 1       286388   anchorage        AK 2020                  558
## 2       287731   anchorage        AK 2019                  621
## 3       291992   anchorage        AK 2018                  717
## 4       296188   anchorage        AK 2017                  778
## 5       299097   anchorage        AK 2016                  695
## 6       301239   anchorage        AK 2015                  621
##   actual_all_crimes tot_clr_robbery_total tot_clr_all_crimes
## 1             16856                   207               5001
## 2             19993                   196               5872
## 3             22398                   276               6417
## 4             23750                   216               5997
## 5             22336                   205               5679
## 6             19474                   213               5686

Question 2 - Attribute Creation and Descriptive Statistics

When data comes to you, it is frequently not in the final version you would prefer. You may have to create new variables that are more similar to outcomes that you want to use in your analysis. Here, raw numbers of crimes and clearances may not be very helpful, particularly since units have such different populations (even after pulling only the largest geographies). So create two new variables: the percent of robberies “cleared” (solved), and the percent of all crimes cleared. What is the average clearance rate for robberies across all units? For all crimes? Which state has the highest average clearance rate for all crimes over the years (here, consider each unit within a state equally, rather than weighting by population)? Finally, which years saw the highest average clearance rate for robberies? Is there a trend you can spot there?

# Create percent clearance variables
ucr_100k <- ucr_100k %>%
  mutate(
    robbery_clr_pct = (tot_clr_robbery_total / actual_robbery_total) * 100,
    all_clr_pct = (tot_clr_all_crimes / actual_all_crimes) * 100)
# Averages across all observations
mean_robbery_clr <- mean(ucr_100k$robbery_clr_pct, na.rm = TRUE)
mean_allclr <- mean(ucr_100k$all_clr_pct, na.rm = TRUE)

mean_robbery_clr #30.4
## [1] 30.43298
mean_allclr #24.9
## [1] 24.9432
state_highest_allclr <- ucr_100k %>%
  group_by(state_abb) %>%
  summarize(avg_allclr = mean(all_clr_pct, na.rm = TRUE)) %>%
  arrange(desc(avg_allclr))

state_highest_allclr #DE has the highest average all-crime clearance rate
## # A tibble: 50 × 2
##    state_abb avg_allclr
##    <chr>          <dbl>
##  1 DE              45.6
##  2 SD              39.7
##  3 WV              34.4
##  4 ID              33.8
##  5 ND              32.8
##  6 AK              31.3
##  7 NC              30.9
##  8 NE              30.4
##  9 MD              30.4
## 10 FL              29.0
## # ℹ 40 more rows
year_robbery_clr <- ucr_100k %>%
  group_by(year) %>%
  summarize(avg_robbery_clr = mean(robbery_clr_pct, na.rm = TRUE)) %>%
  arrange(desc(avg_robbery_clr))

year_robbery_clr #Top three years with highest robbery clearance rates are 1965, 1964, and 2018.
## # A tibble: 61 × 2
##     year avg_robbery_clr
##    <dbl>           <dbl>
##  1  1965            38.9
##  2  1964            36.9
##  3  2018            36.2
##  4  1966            35.9
##  5  2013            35.9
##  6  2019            35.7
##  7  2020            35.5
##  8  2017            35.3
##  9  2015            35.1
## 10  2014            34.9
## # ℹ 51 more rows
#Comment on trends: Robbery clearance rates peak in the mid-1960s, decline over the following decades, and then rise again in the mid to late 2010s. However, the high clearance rates in the 1960s may be due to limited agency participation and less standardized reporting early in the UCR program, which could've inflated clearance percentages. Similarly, more recent increases could result from improved reporting systems or fewer robbery incidents being reported, rather than a true increase in the number of cases solved.

Question 3 - Regressions!!!!

Now on to some new questions. Let’s ask three basic questions that could be answered via regression:

1.) We might think that there are returns to size (efficiencies discovered in scale), so that bigger departments might be better at fighting crime. Using a simple bivariate regression, test whether the size of a city (in population) is correlated with a higher clearance rate for crimes.

2.) Consider only the last 30 years. Has law enforcement gotten better during this time at clearing robberies? (Note that these are reported crimes, so technically we cannot fully answer this question, but use the data as if it were complete)

3.) Has the rate of robbery increased over the last 30 years? Is it better or worse in smaller cities? Use a multivariate regression explaining the per capita rate of robbery over the past 30 years with both a time trend and a city’s population.

For each of the three questions above, run the necessary regression, and explain each of the estimates, including the intercept where appropriate. Answer the actual underlying substantive question to the best of your ability.

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.2
library(lmtest)    
## Warning: package 'lmtest' was built under R version 4.4.2
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.4.2
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(sandwich)   #for SEs
## Warning: package 'sandwich' was built under R version 4.4.2
#1.Simple linear model (population in raw persons)
lm_pop_raw <- lm(all_clr_pct ~ population_1, data = ucr_100k)
summary(lm_pop_raw)
## 
## Call:
## lm(formula = all_clr_pct ~ population_1, data = ucr_100k)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.343  -6.956  -0.208   7.027  74.692 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.554e+01  1.003e-01  254.58   <2e-16 ***
## population_1 -1.969e-06  1.624e-07  -12.12   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.33 on 19912 degrees of freedom
##   (627 observations deleted due to missingness)
## Multiple R-squared:  0.007328,   Adjusted R-squared:  0.007278 
## F-statistic:   147 on 1 and 19912 DF,  p-value: < 2.2e-16
#(heteroskedasticity-consistent) SEs
coeftest(lm_pop_raw, vcov = vcovHC(lm_pop_raw, type = "HC1"))
## 
## t test of coefficients:
## 
##                 Estimate  Std. Error t value  Pr(>|t|)    
## (Intercept)   2.5540e+01  1.0195e-01 250.516 < 2.2e-16 ***
## population_1 -1.9692e-06  1.6312e-07 -12.072 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#This regression examined whether larger city populations are associated with higher clearance rates for all crimes. The model shows a statistically significant but negative relationship between population and clearance rate (b = −1.97e−06, p < 0.001). This means that for every additional 100,000 residents, the predicted clearance rate decreases by roughly 0.2 percentage points. The intercept of 25.5 represents the expected clearance rate when population is near zero (baseline). Although the association is highly significant, the model’s R² of 0.007 indicates that population size alone explains less than 1% of the variation in clearance rates. So, larger jurisdictions tend to have slightly lower clearance rates, but the effect is small.

#2. 
# Filter to last 30 years (e.g., 1990–2020)
ucr_30 <- subset(ucr_100k, year >= 1990)

# Regression: robbery clearance rate on year
lm_rob_year <- lm(robbery_clr_pct ~ year, data = ucr_30)
summary(lm_rob_year)
## 
## Call:
## lm(formula = robbery_clr_pct ~ year, data = ucr_30)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -35.67 -10.13  -1.50   8.70 366.40 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -560.22067   33.52415  -16.71   <2e-16 ***
## year           0.29500    0.01671   17.66   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.5 on 12493 degrees of freedom
##   (320 observations deleted due to missingness)
## Multiple R-squared:  0.02434,    Adjusted R-squared:  0.02426 
## F-statistic: 311.7 on 1 and 12493 DF,  p-value: < 2.2e-16
#The regression shows a significant positive relationship between year and robbery clearance rate (b = 0.295, p < 0.001). This indicates that clearance rates have increased by about 0.3 percentage points per year over the last 30 years. So perhaps, law enforcement has gotten slightly better at clearing robberies over time, though the low R² (0.024) suggests that year alone explains little of the variation.

#3
# Run multivariate regression
# Create robbery rate per 100k and run multivariate regression
ucr_30 <- ucr_100k %>%
  mutate(robbery_rate = (actual_robbery_total / population_1) * 100000) %>%
  filter(year >= 1990)

lm_rob_rate <- lm(robbery_rate ~ year + log(population_1 + 1), data = ucr_30)
summary(lm_rob_rate)
## 
## Call:
## lm(formula = robbery_rate ~ year + log(population_1 + 1), data = ucr_30)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -414.52 -118.35  -39.89   67.26 1973.80 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           13758.5449   385.8827   35.66   <2e-16 ***
## year                     -7.3373     0.1917  -38.27   <2e-16 ***
## log(population_1 + 1)    93.4454     2.4677   37.87   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 192.3 on 12812 degrees of freedom
## Multiple R-squared:  0.1849, Adjusted R-squared:  0.1848 
## F-statistic:  1453 on 2 and 12812 DF,  p-value: < 2.2e-16
coeftest(lm_rob_rate, vcov = vcovHC(lm_rob_rate, type = "HC1"))
## 
## t test of coefficients:
## 
##                          Estimate  Std. Error t value  Pr(>|t|)    
## (Intercept)           13758.54489   454.07250  30.300 < 2.2e-16 ***
## year                     -7.33734     0.22955 -31.964 < 2.2e-16 ***
## log(population_1 + 1)    93.44543     2.97514  31.409 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#This regression examines how robbery rates (per 100,000 residents) vary with time and city size over the past 30 years. The coefficient on year is −7.34 (p < 0.001), indicating that robbery rates have decreased by about 7 incidents per 100,000 residents each year on average. The coefficient on log(population) is 93.45 (p < 0.001), showing that larger cities tend to have higher robbery rates per capita than smaller ones. The R² of 0.18 suggests that the model explains about 18% of the variation in robbery rates. So, robbery has become less common over time, but remains more frequent in larger urban areas.