Load and clean the data

require(dplyr, quietly = T, warn.conflicts = F)
require(ggplot2, quietly = T, warn.conflicts = F)

arrest_df <- read.csv('https://raw.githubusercontent.com/mehtablocker/CUNY_bridge/master/Arrests.csv')
arrest_df <- arrest_df %>% select(-1)
## convert "Yes/No" to TRUE/FALSE
arrest_df$released <- arrest_df$released=="Yes"
arrest_df$employed <- arrest_df$employed=="Yes"
arrest_df$citizen <- arrest_df$citizen=="Yes"

Arrests by year

Have possession arrests gone down over the years?

arrest_df %>% ggplot(aes(year) ) + geom_histogram(bins = 30) + labs(title = "Arrests by Year", x="Year", y="Number of Arrests")

The overall trend is not a steady decline. But we do see a substantial drop between 2001 and 2002. Unfortunately the data is not more recent. I suspect the drops would continue.

Age

How is age related to arrests?

arrest_df$age %>% median; arrest_df$age %>% mean
## [1] 21
## [1] 23.84654

The median age of 21 is lower than I might have thought. The mean is of course higher than the median because the age distribution is skewed to the right (i.e., older), which can be reaffirmed by a simple boxplot:

arrest_df$age %>% boxplot(main="Boxplot of Arrest Age", horizontal=T, xlab="Age")

Gender

Were males more likely to be arrested than females?

arrest_df$sex %>% summary()
## Female   Male 
##    443   4783

That is a resounding “YES!”
Of course, this in and of itself does not prove causation (i.e., gender bias in arrests) because we don’t know if males are in fact more likely to use marijuana than females. But the fact that the ratio here is so high (over 10 to 1) relative to the ratio of men to women in the overall population (around 1 to 1) is definitely a bit… interesting.

Race

What percentage of those arrested were black?

arrest_df$colour %>% summary()
## Black White 
##  1288  3938
summary(arrest_df$colour)/length(arrest_df$colour)
##   Black   White 
## 0.24646 0.75354

While once again we don’t necessarily know the causation, we can note that the percentage of black folks arrested (~25%) is quite a bit higher than the population percentage (recent figures estimate 8 to 9 percent of Toronto is black.)

Here is the percentage by year:

arrest_df %>% group_by(year) %>% summarise(pct_black=mean(colour=="Black")) %>% ggplot(aes(x=year, y=pct_black)) + geom_line() + geom_point() + labs(title="Black Arrest Percentage by Year", x="Year", y="Percentage of Those Arrested Who Were Black")

Conclusions

This dataset has some interesting information. I wish it were more recent and more specific (e.g., the current binary labels of “Black/White” for race are way too general and could be biasing the proportions.) Perhaps the most interesting statistic is the high ratio of male arrests to female arrests.