Introduction

Crime rates across the United States vary greatly by state, city, demographic and community type. In metropolitan areas, violent crime and property crime rates are higher than elsewhere. Rural areas experience lower rates of both property crime and violent crime. In cities outside of metropolitan areas, the violent crime rate is lower than the national average while the property crime rate is higher. The populations and demographics of the 50 states vary greatly, as does their distribution across each state’s cities and rural areas. This in turn may be responsible for the large variation in crime rates across the states.

This analysis concerns the prevalence of crime across the United States. Specifically, it aims to determine the relationships, if any, between the rates of different crimes per city and per state in the US in the year 2013.

The Data

The data used for this analysis was taken from the FBI website, at www.fbi.gov/table-8/2013. The FBI data lists the offences known to law enforcement in 2013, by city within each state. It is provided as one table per state, with the following variables as columns:

| Variable Name | Type | Description |
| --- | --- | --- |
| City | character | name of city |
| Population | continuous | number of inhabitants |
| Violent crime | continuous | counts of violent crime |
| Murder and nonnegligent manslaughter | continuous | counts of murder and nonnegligent manslaughter |
| Rape (revised definition) | continuous | counts of rape under the revised definition |
| Rape (legacy definition) | continuous | counts of rape under the legacy definition |
| Robbery | continuous | counts of robbery |
| Aggravated assault | continuous | counts of aggravated assault |
| Property crime | continuous | counts of property crime |
| Burglary | continuous | counts of burglary |
| Larceny-theft | continuous | counts of larceny |
| Motor vehicle theft | continuous | counts of motor-vehicle theft |
| Arson | continuous | counts of arson |

Data Preparation

Each of the 50 tables on the FBI website was on its own page, so collecting all 50 required scraping each page in turn. The tables were scraped in a single loop using rvest, with stringr for URL formatting. The following code was used:

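library(rvest)    # read_html, html_nodes, html_table
library(stringr)  # str_replace

# url.template (page URL with a STATENAME placeholder) and state.names
# (state names as they appear in the URLs) are assumed to be defined earlier.
table.list <- list()
for (i in seq_along(state.names)) {
  curr.state.url <- str_replace(url.template, "STATENAME", state.names[i])
  curr.page <- read_html(curr.state.url)
  curr.nodes <- html_nodes(curr.page, "table")
  curr.table <- html_table(curr.nodes[[3]])  # the third table on the page is used
  curr.table$State <- state.names[i]         # remember which state each row came from
  table.list[[i]] <- curr.table
}
# Give every table the same column names so that the tables can be row-bound.
cn <- c(COL_NAMES)
for (i in seq_along(table.list)) {
  colnames(table.list[[i]]) <- cn
}
fbi <- do.call(rbind, table.list)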

The table, fbi, was full of inappropriate data types and NA values. To replace the NA values with zeros and convert the necessary columns to numerics, the following code was used (requiring the stringr package):

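library(stringr)  # str_trim, str_replace_all

# Convert a scraped column to numeric: NA becomes 0, then surrounding
# whitespace and thousands separators are stripped before the conversion.
numerify <- function(vec) {
  vec[is.na(vec)] <- 0
  vec <- str_trim(vec)
  vec <- str_replace_all(vec, ",", "")
  as.numeric(vec)
}
# Columns 2 to 13 hold the population and the crime counts.
for (i in 2:13) {
  fbi[,i] <- numerify(fbi[,i])
}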
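As a quick sanity check of the cleaning step, the sketch below applies the same logic to a few made-up strings. It swaps in base R's trimws and gsub for the stringr calls so that it runs without extra packages; the name numerify.base and the sample values are illustrative only and are not part of the original analysis.

```r
# Base-R stand-in for numerify(): same steps, no stringr dependency.
numerify.base <- function(vec) {
  vec[is.na(vec)] <- 0                      # missing counts become zero
  vec <- trimws(vec)                        # strip surrounding whitespace
  vec <- gsub(",", "", vec, fixed = TRUE)   # remove thousands separators
  as.numeric(vec)
}

# Made-up values in the style of the scraped tables.
sample.counts <- c("1,234", " 56 ", NA, "7,890,123")
cleaned <- numerify.base(sample.counts)
print(cleaned)
```

Note that replacing NA with 0 treats an unreported count as zero crimes, which may understate totals for cities that did not report a category.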

The fbi table had two columns for rape: one for counts of rape under a legacy definition and one for rape under a revised definition. Most cities used exclusively one or the other definition of rape, so the two columns were combined into a single one called ‘Rape’ using the following code:

fbi$Rape <- fbi$Rape_legacydef + fbi$Rape_reviseddef
fbi <- fbi[,-which(colnames(fbi) %in% c("Rape_legacydef", "Rape_reviseddef"))]
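Summing the two columns is a faithful merge only for cities that report under a single definition; a city with counts under both would be double counted. A minimal sketch of that check, on an invented three-city table that reuses the column names above:

```r
# Invented rows: after cleaning, the unused definition holds 0.
toy <- data.frame(
  City = c("A", "B", "C"),
  Rape_legacydef = c(12, 0, 0),
  Rape_reviseddef = c(0, 30, 0)
)

# Rows with a count under both definitions would be double counted by the sum.
both.reported <- sum(toy$Rape_legacydef > 0 & toy$Rape_reviseddef > 0)

toy$Rape <- toy$Rape_legacydef + toy$Rape_reviseddef
# Logical indexing keeps all columns even if neither name is found.
toy <- toy[, !(colnames(toy) %in% c("Rape_legacydef", "Rape_reviseddef"))]
print(both.reported)
print(toy)
```

The logical index is used here because negative indexing with which() selects zero columns when nothing matches.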

Because city population was expected to have a trivial, removable relationship with the counts of the different offences in that city, a second table, fbi.pc, was produced containing only the per capita rate of each type of crime in each city:

fbi.pc <- fbi  # start from a copy of the counts table
for (i in 4:13) {
  fbi.pc[,i] <- fbi.pc[,i]/fbi.pc[,3]  # column 3 is Population
}
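The same division can be written without a loop. The sketch below uses an invented three-city table (the column layout mirrors fbi, the values do not) and also shows one way to handle a recorded population of zero, should any exist in the data, since 0/0 yields NaN.

```r
# Invented toy table with the same leading columns as fbi.
toy <- data.frame(
  City = c("A", "B", "C"),
  State = c("X", "X", "Y"),
  Population = c(1000, 0, 50000),
  Violent_crime = c(5, 0, 200),
  Property_crime = c(40, 0, 1500)
)

toy.pc <- toy
crime.cols <- 4:ncol(toy.pc)
# Divide every crime column by population in a single step.
toy.pc[, crime.cols] <- toy.pc[, crime.cols] / toy.pc$Population
# A zero population gives NaN; mark those rates as missing instead.
toy.pc[toy.pc$Population == 0, crime.cols] <- NA
print(toy.pc)
```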

Two more tables were produced: ‘states.data’, summarizing the ‘fbi’ table to one state per row, and ‘states.data.pc’ which was the per capita analogue of ‘states.data’. The code below accomplished this:

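# The allocation of states.data was not shown; a zero-filled data frame is assumed here.
# Also assumes fbi$State is a factor, so that levels() returns the state names.
states.data <- data.frame(matrix(0, nrow = nlevels(fbi$State), ncol = 11))
colnames(states.data) <- colnames(fbi)[-(1:2)]
rownames(states.data) <- levels(fbi$State)
j <- 1
for (currstate in levels(fbi$State)) {
  currstate.data <- fbi[fbi$State == currstate,3:13]
  currstate.row <- c()
  for (i in 1:11) {
    currstate.row <- c(currstate.row, sum(currstate.data[,i]))  # state total for column i
  }
  states.data[j,] <- currstate.row
  j <- j + 1
}
# Per capita analogue: divide each crime total by the state population (column 1).
states.data.pc <- states.data
for(i in 2:ncol(states.data.pc)) {
  states.data.pc[,i] <- states.data.pc[,i]/states.data.pc[,1]
}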
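The per-state totals can also be obtained with base R's rowsum(), which sums numeric columns within groups. This is an alternative sketch on invented data, not the code used for the analysis; the toy values and the names toy.states and toy.states.pc are hypothetical.

```r
# Invented city rows in the same layout as fbi (State, then numeric columns).
toy <- data.frame(
  City = c("A", "B", "C"),
  State = c("X", "X", "Y"),
  Population = c(1000, 3000, 50000),
  Violent_crime = c(5, 10, 200),
  Property_crime = c(40, 90, 1500)
)

# One row per state; row names are the state labels in sorted order.
toy.states <- as.data.frame(rowsum(toy[, 3:5], group = toy$State))

# Per capita analogue: divide each crime total by the summed population.
toy.states.pc <- toy.states
toy.states.pc[, -1] <- toy.states.pc[, -1] / toy.states.pc[, 1]
print(toy.states)
print(toy.states.pc)
```

Because the source tables list cities, totals built this way cover only the listed cities of each state rather than the whole state population.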

These two tables were created in anticipation of exploratory analysis that would reveal stronger relationships between the variables at a state-wide scale or possible geographic relationships that could be exposed when represented on a map of the 50 states.

Exploratory Analysis

The first step in exploring the data was testing the ‘normality’ of the distribution of the various crime counts. Histograms revealed that in the fbi table, the distribution of crime counts was clearly not normal. However, this may have been due to the non-normal distribution of populations per city. Histograms of the per capita crime rates in the fbi.pc table showed that per capita crime rates were not normally distributed either. For example:

range(fbi.pc$Violent_crime)
## [1] 0.0000000 0.2286996
hist(fbi.pc$Violent_crime, breaks = 300, xlim = c(0, 0.05))
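A histogram is a visual check; it could be complemented by a numeric one. The sketch below runs on simulated rates, not the FBI figures, and computes the share of zero values, the sample skewness, and a Shapiro-Wilk test. The mix of 30% zeros and log-normal positive rates is an assumption chosen only to resemble a zero-heavy, right-skewed variable.

```r
set.seed(1)  # simulated data only; not the FBI figures
# 300 crime-free cities plus 700 right-skewed positive rates.
rates <- c(rep(0, 300), rlnorm(700, meanlog = log(0.003), sdlog = 1))

share.zero <- mean(rates == 0)
# Sample skewness: near 0 for symmetric data, positive for a long right tail.
skewness <- mean((rates - mean(rates))^3) / sd(rates)^3

# Shapiro-Wilk normality test; it accepts between 3 and 5000 observations.
sw <- shapiro.test(rates)
print(c(share.zero = share.zero, skewness = skewness, p.value = sw$p.value))
```

The city table has more than 5000 rows (the regression output reports 9273 residual degrees of freedom), so shapiro.test() would have to be applied to a random sample of it.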

The non-normality may have been caused by the large number of crime-free, tiny cities in the data. Regardless, the next step was to determine the level of correlation between the various numeric variables. Correlation matrices were used to this end, on both the original fbi table and its per-capita analogue, fbi.pc.

correlations <- matrix(nrow = ncol(fbi) - 2, ncol = ncol(fbi) - 2)
colnames(correlations) <- strtrim(colnames(fbi)[-(1:2)], width = 5)
rownames(correlations) <- colnames(correlations)
for(i in 1:nrow(correlations)) {
  for(j in 1:ncol(correlations)) {
    correlations[i,j] <- cor(fbi[,i+2], fbi[,j+2])
    # show only correlations greater than 0.85; all others are set to 0
    correlations[i,j] <- ifelse(correlations[i,j]  > 0.85, correlations[i,j], 0)
  }
}
##           Popul     Viole     Murde      Rape     Robbe     Aggra
## Popul 1.0000000 0.9125152 0.0000000 0.0000000 0.9133497 0.9116237
## Viole 0.9125152 1.0000000 0.0000000 0.0000000 0.9107158 0.9912989
## Murde 0.0000000 0.0000000 1.0000000 0.0000000 0.9000605 0.0000000
## Rape  0.0000000 0.0000000 0.0000000 1.0000000 0.0000000 0.0000000
## Robbe 0.9133497 0.9107158 0.9000605 0.0000000 1.0000000 0.8810148
## Aggra 0.9116237 0.9912989 0.0000000 0.0000000 0.8810148 1.0000000
## Prope 0.8900635 0.8712587 0.8535042 0.8541784 0.9015219 0.8530655
## Burgl 0.0000000 0.0000000 0.0000000 0.8620383 0.0000000 0.0000000
## Larce 0.9157740 0.8809077 0.0000000 0.0000000 0.9021707 0.8696344
## Vehic 0.0000000 0.0000000 0.8632082 0.0000000 0.0000000 0.0000000
## Arson 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
##           Prope     Burgl     Larce     Vehic Arson
## Popul 0.8900635 0.0000000 0.9157740 0.0000000     0
## Viole 0.8712587 0.0000000 0.8809077 0.0000000     0
## Murde 0.8535042 0.0000000 0.0000000 0.8632082     0
## Rape  0.8541784 0.8620383 0.0000000 0.0000000     0
## Robbe 0.9015219 0.0000000 0.9021707 0.0000000     0
## Aggra 0.8530655 0.0000000 0.8696344 0.0000000     0
## Prope 1.0000000 0.9539204 0.9908607 0.8977033     0
## Burgl 0.9539204 1.0000000 0.9116071 0.9104283     0
## Larce 0.9908607 0.9116071 1.0000000 0.0000000     0
## Vehic 0.8977033 0.9104283 0.0000000 1.0000000     0
## Arson 0.0000000 0.0000000 0.0000000 0.0000000     1
correlations.pc <- matrix(nrow = ncol(fbi.pc) - 2, ncol = ncol(fbi.pc) - 2)
colnames(correlations.pc) <- strtrim(colnames(fbi.pc)[-(1:2)], width = 5)
rownames(correlations.pc) <- colnames(correlations.pc)
for(i in 1:nrow(correlations.pc)) {
  for(j in 1:ncol(correlations.pc)) {
    correlations.pc[i,j] <- cor(fbi.pc[,i+2], fbi.pc[,j+2])
    # show only correlations greater than 0.6; all others are set to 0
    correlations.pc[i,j] <- ifelse(correlations.pc[i,j]  > 0.6, correlations.pc[i,j], 0)
  }
}
##       Popul     Viole Murde Rape     Robbe     Aggra     Prope     Burgl
## Popul     1 0.0000000     0    0 0.0000000 0.0000000 0.0000000 0.0000000
## Viole     0 1.0000000     0    0 0.8064598 0.9430857 0.7579023 0.7648670
## Murde     0 0.0000000     1    0 0.0000000 0.0000000 0.0000000 0.0000000
## Rape      0 0.0000000     0    1 0.0000000 0.0000000 0.0000000 0.0000000
## Robbe     0 0.8064598     0    0 1.0000000 0.0000000 0.8074161 0.7266518
## Aggra     0 0.9430857     0    0 0.0000000 1.0000000 0.0000000 0.6710672
## Prope     0 0.7579023     0    0 0.8074161 0.0000000 1.0000000 0.8267095
## Burgl     0 0.7648670     0    0 0.7266518 0.6710672 0.8267095 1.0000000
## Larce     0 0.7172965     0    0 0.7456065 0.0000000 0.9857455 0.7610126
## Vehic     0 0.7042295     0    0 0.8545309 0.0000000 0.9030646 0.7468966
## Arson     0 0.0000000     0    0 0.0000000 0.0000000 0.0000000 0.0000000
##           Larce     Vehic Arson
## Popul 0.0000000 0.0000000     0
## Viole 0.7172965 0.7042295     0
## Murde 0.0000000 0.0000000     0
## Rape  0.0000000 0.0000000     0
## Robbe 0.7456065 0.8545309     0
## Aggra 0.0000000 0.0000000     0
## Prope 0.9857455 0.9030646     0
## Burgl 0.7610126 0.7468966     0
## Larce 1.0000000 0.8308301     0
## Vehic 0.8308301 1.0000000     0
## Arson 0.0000000 0.0000000     1
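The nested loops above compute one pairwise correlation at a time. A more compact route is to pass all numeric columns to cor() at once and threshold the result, as sketched here on simulated columns. The column names are borrowed for readability; the values are random and unrelated to the FBI data.

```r
set.seed(2)  # simulated stand-in for numeric columns; not the FBI data
base <- rnorm(200)
toy <- data.frame(
  Population = base,
  Violent_crime = base + rnorm(200, sd = 0.2),  # strongly tied to the first column
  Arson = rnorm(200)                            # unrelated to the others
)

# cor() on a data frame of numeric columns returns the full correlation matrix.
cm <- cor(toy)
# Keep entries above the display threshold and set the rest to 0.
cm.shown <- ifelse(cm > 0.85, cm, 0)
short.names <- strtrim(colnames(cm), width = 5)
dimnames(cm.shown) <- list(short.names, short.names)
print(round(cm.shown, 3))
```

A one-sided filter such as `> 0.85` also hides strong negative correlations; filtering on `abs(cm)` would keep them.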

We see that population is (trivially) strongly correlated with the annual counts of numerous types of crime. Yet the second matrix shows that population never has a correlation above 0.6 with the per-capita rate of any type of crime recorded in the fbi data. This is evidence against the hypothesis that a larger city population goes with higher per-capita crime rates, although correlation alone cannot settle questions of cause, and the data records population rather than population density. Both matrices show that the prevalences of many types of crime are strongly correlated with one another, though the values in the second matrix are noticeably lower than those in the first.

For good measure, the correlations between population and all other variables were computed for both tables.

sapply(4:13, function(n) {
  cr <- cor(fbi$Population, fbi[,n])
  paste(colnames(fbi)[n], ": ", cr)
})
##  [1] "Violent_crime :  0.912515194934291"                  
##  [2] "Murder_and_non-neg_manslaughter :  0.770750968663591"
##  [3] "Rape :  0.748843905530843"                           
##  [4] "Robbery :  0.913349657807348"                        
##  [5] "Aggravated_assualt :  0.911623695209433"             
##  [6] "Property_crime :  0.890063492158245"                 
##  [7] "Burglary :  0.771194800760887"                       
##  [8] "Larceny_theft :  0.915774015391173"                  
##  [9] "Vehicle_theft :  0.739215722833229"                  
## [10] "Arson :  0.582871007162747"
sapply(4:13, function(n) {
  cr <- cor(fbi.pc$Population, fbi.pc[,n])
  paste(colnames(fbi.pc)[n], ": ", cr)
})
##  [1] "Violent_crime :  0.0627790030458067"                  
##  [2] "Murder_and_non-neg_manslaughter :  0.0342151354450646"
##  [3] "Rape :  0.0129530821526244"                           
##  [4] "Robbery :  0.0993535919376426"                        
##  [5] "Aggravated_assualt :  0.0388286803307819"             
##  [6] "Property_crime :  0.0148087116243106"                 
##  [7] "Burglary :  0.0286886254138961"                       
##  [8] "Larceny_theft :  0.011479505747391"                   
##  [9] "Vehicle_theft :  0.0147462796972066"                  
## [10] "Arson :  0.0325553205066627"

As expected, in the first table, population is strongly correlated with the count of every kind of crime (0.74 to 0.92) except arson (0.58). Consistent with the previous two matrices, the second table shows that population is not a good predictor of per-capita crime rates in a given city: no correlation exceeds 0.10.
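Why the count correlations are "trivial" can be illustrated with a small simulation. In the sketch below, every simulated city draws its per-capita rate independently of its size, so any correlation between population and counts arises purely from scale. All numbers are invented assumptions, not estimates from the FBI data.

```r
set.seed(3)  # simulated cities; not the FBI figures
n <- 2000
# City sizes spread evenly on a log scale between 500 and 1,000,000.
population <- round(exp(runif(n, log(500), log(1e6))))
# Each city's per-capita rate is drawn independently of its size.
rate <- runif(n, 0.001, 0.008)
count <- rpois(n, lambda = population * rate)

cor.counts <- cor(population, count)              # counts against population
cor.rates <- cor(population, count / population)  # per-capita rates against population
print(c(counts = cor.counts, rates = cor.rates))
```

In this setup the count correlation is high and the rate correlation is close to zero, which is the same qualitative pattern as in the two lists above.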

The last step of the data exploration was to determine whether the relationships being investigated were more or less pronounced when the data was aggregated by state. The following plots show a strong correlation between population and the number of violent crimes in a state, but a weak correlation between population and violent crimes per capita in a state.

Modelling, Model Evaluation and Results

A final measure of the relationship between population and the prevalences of various types of crime is the accuracy of predictive models that attempt to determine one value from the others. Given that all the relevant variables in the data (all four tables) are numeric, simple multivariable linear regression models were the obvious first choice. The same linear model was fitted to each of the four tables to predict violent crime, using population and all non-violent crime columns as predictor variables. The results are shown below.

## 
## Call:
## lm(formula = Violent_crime ~ . - Rape - Robbery - Aggravated_assualt - 
##     `Murder_and_non-neg_manslaughter` - City - State, data = fbi)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17678.0     -9.3     27.4     40.6   8782.3 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -4.733e+01  3.132e+00 -15.113  < 2e-16 ***
## Population      5.130e-03  7.112e-05  72.137  < 2e-16 ***
## Property_crime  1.198e-01  4.743e-02   2.527  0.01153 *  
## Burglary        3.020e-01  4.573e-02   6.603 4.24e-11 ***
## Larceny_theft  -1.556e-01  4.911e-02  -3.169  0.00154 ** 
## Vehicle_theft  -1.115e-01  5.266e-02  -2.117  0.03429 *  
## Arson          -3.707e+00  2.504e-01 -14.802  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 294 on 9273 degrees of freedom
## Multiple R-squared:  0.8589, Adjusted R-squared:  0.8588 
## F-statistic:  9404 on 6 and 9273 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = Violent_crime ~ . - Rape - Robbery - Aggravated_assualt - 
##     `Murder_and_non-neg_manslaughter`, data = states.data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19650.5  -1474.7   -414.5    967.8  14641.3 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -1.114e+02  1.091e+03  -0.102   0.9191    
## Population      2.857e-03  6.414e-04   4.455 5.89e-05 ***
## Property_crime  5.561e-01  1.064e+00   0.522   0.6040    
## Burglary       -2.310e-01  9.979e-01  -0.231   0.8181    
## Larceny_theft  -5.376e-01  1.091e+00  -0.493   0.6247    
## Vehicle_theft  -4.827e-01  1.190e+00  -0.406   0.6869    
## Arson          -7.878e+00  4.234e+00  -1.861   0.0696 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5522 on 43 degrees of freedom
## Multiple R-squared:  0.949,  Adjusted R-squared:  0.9419 
## F-statistic: 133.3 on 6 and 43 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = Violent_crime ~ . - Rape - Robbery - Aggravated_assualt - 
##     `Murder_and_non-neg_manslaughter` - City - State, data = fbi.pc)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.042965 -0.001167 -0.000564  0.000543  0.069127 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     5.634e-04  4.107e-05  13.717  < 2e-16 ***
## Population      1.655e-09  2.425e-10   6.823 9.50e-12 ***
## Property_crime -1.059e-01  8.526e-02  -1.242   0.2143    
## Burglary        3.797e-01  8.539e-02   4.447 8.81e-06 ***
## Larceny_theft   1.281e-01  8.529e-02   1.502   0.1331    
## Vehicle_theft   1.650e-01  8.533e-02   1.934   0.0532 .  
## Arson           1.291e+00  1.133e-01  11.391  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.002779 on 9273 degrees of freedom
## Multiple R-squared:  0.6439, Adjusted R-squared:  0.6437 
## F-statistic:  2795 on 6 and 9273 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = Violent_crime ~ . - Rape - Robbery - Aggravated_assualt - 
##     `Murder_and_non-neg_manslaughter`, data = states.data.pc)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0040596 -0.0008516 -0.0001300  0.0005987  0.0037609 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)    -1.139e-04  1.137e-03  -0.100    0.921
## Population     -2.969e-12  4.857e-11  -0.061    0.952
## Property_crime  6.949e-01  1.379e+00   0.504    0.617
## Burglary       -6.003e-01  1.368e+00  -0.439    0.663
## Larceny_theft  -5.781e-01  1.386e+00  -0.417    0.679
## Vehicle_theft   1.032e-01  1.418e+00   0.073    0.942
## Arson          -5.409e+00  3.557e+00  -1.521    0.136
## 
## Residual standard error: 0.001632 on 43 degrees of freedom
## Multiple R-squared:  0.5018, Adjusted R-squared:  0.4323 
## F-statistic: 7.218 on 6 and 43 DF,  p-value: 2.271e-05

These models confirm that while there is a statistically detectable relationship between population and violent crime per capita at the city level (model 3), it is weak: the population coefficient is about 1.7e-09 per inhabitant, and at the state level (model 4) it is not significant (p = 0.952). Population alone is therefore not sufficient to predict the per capita crime rate. The models also show that although the per capita rates of some types of crime help predict the per capita violent crime rate at the city level (model 3, R-squared 0.64), they are weaker predictors for the data aggregated by state (model 4, R-squared 0.50, with no individually significant coefficient).
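For readers who want to reproduce the model form, the sketch below fits the same kind of formula to simulated per-capita rates and extracts the R-squared. The data frame toy, its coefficients, and its noise level are invented for illustration; only the formula pattern ("." minus excluded columns) follows the calls shown above.

```r
set.seed(4)  # simulated per-capita rates; not the FBI data
n <- 500
toy <- data.frame(
  City = paste0("city", seq_len(n)),
  Burglary = runif(n, 0, 0.02),
  Arson = runif(n, 0, 0.001)
)
# Invented relationship: violent crime depends on burglary plus noise.
toy$Violent_crime <- 0.4 * toy$Burglary + rnorm(n, sd = 0.002)

# "." stands for all other columns; "- City" drops the identifier column.
fit <- lm(Violent_crime ~ . - City, data = toy)
r2 <- summary(fit)$r.squared
print(round(coef(fit), 4))
print(r2)
```

R-squared reports in-sample fit only; holding out some cities for evaluation would give a fairer picture of predictive accuracy.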