Read a CSV file from my GitHub.
library("rio")
ArrestMarij <- import("https://raw.githubusercontent.com/SieSiongWong/CUNY-SPS-R-Workshop/master/Arrests.csv")
There are three goals I would like to achieve in this analysis. First, I want to figure out which race was arrested significantly more than the other for simple possession of small quantities of marijuana in Toronto between year 1997 and 2002. Second, I want to see whether police released on summons based on previous records of arrest or based on other factors. Third, whether there is a particular group of individuals were arrested more than the other. The factors to consider in this analysis would be such as their race, age, gender, citizen status, and employment status.
First of all, lets take a look at the structure of the Arrests for Marijuana Possession dataset. There are total of 5,226 observations and have 8 actual variables (not including the number list) in this dataset. In this data frame, it has 3 numeric type columns and 5 character type columns (not including the number list).
str(ArrestMarij)
## 'data.frame': 5226 obs. of 9 variables:
## $ V1 : int 1 2 3 4 5 6 7 8 9 10 ...
## $ released: chr "Yes" "No" "Yes" "No" ...
## $ colour : chr "White" "Black" "White" "Black" ...
## $ year : int 2002 1999 2000 2000 1999 1998 1999 1998 2000 2001 ...
## $ age : int 21 17 24 46 27 16 40 34 23 30 ...
## $ sex : chr "Male" "Male" "Male" "Male" ...
## $ employed: chr "Yes" "Yes" "Yes" "Yes" ...
## $ citizen : chr "Yes" "Yes" "Yes" "Yes" ...
## $ checks : int 3 3 3 1 1 0 0 1 4 3 ...
Now, lets do some data exploration and visualization by plotting some charts. By looking at the first two figures and the table, you will notify that age group between 15 to 24 has the largest number of arrests of marijuana possession. The third figure also shows that the age group between 15 to 24 which number of arrests, because of other crimes, was significantly larger than other age group.
## Histogram of age:- Figure 1
hist(ArrestMarij$age, main="Total Marijuana Possession Arrests at Each Age", xlab="Age", ylab="Total Arrests", ylim=c(0,2500), col=rainbow(30))
library(dplyr)
library(ggplot2)

## Create age group.
labs <- c(paste(seq(0, 95, by=5), seq(0+5-1, 100-1, by=5), sep="-"), paste(100, "+", sep=""))
ArrestMarij$AgeGroup <- cut(ArrestMarij$age, breaks = c(seq(0, 100, by = 5), Inf), labels = labs, right = FALSE)
## Group by age group and then count the arrests.
AgeGroup_tbl <- ArrestMarij %>% group_by(AgeGroup) %>% summarize(
TotalMarijuanaArrests=length(colour), TotalPreviousArrests=sum(checks), TotalArrests=TotalMarijuanaArrests+TotalPreviousArrests)
## Display the top 10 Age Group for Marijuana Possession Arrests.
head(arrange(AgeGroup_tbl, desc(TotalMarijuanaArrests)), n=10)
## # A tibble: 10 x 4
## AgeGroup TotalMarijuanaArrests TotalPreviousArrests TotalArrests
## <fct> <int> <int> <int>
## 1 15-19 1901 2713 4614
## 2 20-24 1526 2542 4068
## 3 25-29 615 1060 1675
## 4 30-34 411 798 1209
## 5 35-39 320 679 999
## 6 40-44 192 385 577
## 7 10-14 107 95 202
## 8 45-49 96 164 260
## 9 50-54 38 78 116
## 10 55-59 10 22 32
## GGPlot Histogram for Age Group vs. TotalMarijuanaArrests:- Figure 2
ggplot(AgeGroup_tbl, aes(x=AgeGroup, y=TotalMarijuanaArrests)) + geom_histogram(stat="identity", fill="hotpink") + xlab("Age Group") + ylab("Total Arrests") + ggtitle("Age Group vs. Total Marijuana Possession Arrests") + theme(plot.title = element_text(hjust=0.5), axis.title.x = element_text(size=11, face="bold"),
axis.title.y = element_text(size=11, face="bold"), axis.text.x = element_text(face="bold"), axis.text.y = element_text(face="bold")) + ylim(0,2000)

## GGPlot Histogram for Age Group vs. Previous Arrests:- Figure 3
ggplot(AgeGroup_tbl, aes(x=AgeGroup, y=TotalPreviousArrests)) + geom_histogram(stat="identity", fill="hotpink") + xlab("Age Group") + ylab("Total Arrests") + ggtitle("Age Group vs. Previous Arrests") + theme(plot.title = element_text(hjust=0.5), axis.title.x = element_text(size=11, face="bold"),
axis.title.y = element_text(size=11, face="bold"), axis.text.x = element_text(face="bold"), axis.text.y = element_text(face="bold")) + ylim(0,3000)

You can see that the trend and pattern for both marijuana possession arrests and previous arrest records are very identical. The age between 15 to 24 has the largest amount of arrests and decreasing as getting older.
## Scatter Plot of marijuana arrests vs. previous arrests.
library(dplyr)
library(ggplot2)
AgeMarijArrests_tbl <- ArrestMarij %>% group_by(age) %>% summarize(MarijArrests=length(colour))
AgePreviousArrests_tbl <- ArrestMarij %>% group_by(age) %>% summarize(PreviousArrests=sum(checks))
ggplot(NULL) + geom_point(data=AgeMarijArrests_tbl, aes(x=age, y=MarijArrests, color="Marijuana Arrests")) + ggtitle("Marijuana Possession Arrests vs. Previous Arrests") + xlab("Age") + ylab("Total Arrests") + geom_line(data=AgePreviousArrests_tbl, aes(x=age, y=PreviousArrests, color="Previous Arrests")) + xlab("Age") + ylab("Total Arrests") + theme(plot.title = element_text(hjust=0.5), axis.title.x = element_text(size=11, face="bold"),
axis.title.y = element_text(size=11, face="bold"), axis.text.x = element_text(face="bold"), axis.text.y = element_text(face="bold")) + scale_fill_manual(values=c("Marijuana Arrests"="red", "Previous Arrests"="darkblue")) + theme(legend.title=element_blank())

From the table 1 and figure 1 below, you can see the total number of white arrests was triple the black. This shows us that white individuals were much more likely to have been arrested for marijuana possession. This also means that out of 10 marijuana possession arrests, about 7.5 arrests would be white individuals. Also, from the table 2 and figure 2, it shows the white total arrest was increase every year from 1997 to 2000 but starting to drop a bit in 2001 and drop sharply in 2002. I believe the year 2002 data is not complete that is why the total arrests was significantly less than previous year.
## Bar plot of white vs. black:- Table 1 & Figure 1
WhiteCount <- length(which(ArrestMarij$colour=="White"))
BlackCount <- length(which(ArrestMarij$colour=="Black"))
BlackWhiteTable <- matrix(c(WhiteCount,BlackCount),ncol=2,nrow=1,byrow=TRUE)
dimnames(BlackWhiteTable) = list(c("Total of Arrests"), c("White", "Black"))
barplot(BlackWhiteTable, beside=T,las=0.5, cex.names=1.2,cex.axis=1.0, ylab="Total of Arrests", axes=FALSE, ylim=c(0,4500), cex.lab=1.1, main="White vs. Black", col=c("blue","red"))
axis(2, at = seq(0, 4500, 500), las = 0.9)

BlackWhiteTable
## White Black
## Total of Arrests 3938 1288
## Scatter Plot of year vs. black & white:- Table 2 & Figure 2
YearWhiteBlackCount_tbl <- ArrestMarij %>% group_by(year) %>% summarize(WhiteArrests=length(which(colour=="White")), BlackArrests=length(which(colour=="Black")))
names(YearWhiteBlackCount_tbl) <- sub("^year$", "Year", names(YearWhiteBlackCount_tbl))
YearWhiteBlackCount_tbl
## # A tibble: 6 x 3
## Year WhiteArrests BlackArrests
## <int> <int> <int>
## 1 1997 369 123
## 2 1998 637 240
## 3 1999 849 250
## 4 2000 991 279
## 5 2001 900 311
## 6 2002 192 85
YearWhiteBlackCount_tbl <- transform(YearWhiteBlackCount_tbl, Year = as.Date(as.character(Year), "%Y"))
plot(YearWhiteBlackCount_tbl$WhiteArrests~as.Date(YearWhiteBlackCount_tbl$Year, "%y"),type="b", col="red", lty=2, main="Year Vs White & Black Arrests", xlab="Year", ylab="Arrests", ylim=c(0,1200), pch=19)
lines(YearWhiteBlackCount_tbl$BlackArrests~as.Date(YearWhiteBlackCount_tbl$Year, "%y"), type="b", col="green", pch=18)
legend("topright", lty=c(2,1), legend=c("White", "Black"), col=c("red", "green"), cex=0.8, text.font=4)

It is good to know whether the released on summons has any correlation with other factors. From the conditional distribution table and chart, it shows that black individuals was at 26% released rate as compared to the white at 14%. And the white was 10% more likely not be able to be released on summons as compared to the black.
library(dplyr)
## Group by released and then count the white and black arrests.
ReleasedSummonsRace_tbl <- ArrestMarij %>% group_by(released) %>% summarize(White=length(which(colour=="White")), Black=length(which(colour=="Black")))
## Create into a table.
ReleasedSummonsRace_tbl2 <- matrix(c(ReleasedSummonsRace_tbl$White, ReleasedSummonsRace_tbl$Black), ncol=2)
colnames(ReleasedSummonsRace_tbl2) <- c("White", "Black")
rownames(ReleasedSummonsRace_tbl2) <- c("Released", "Custody")
ReleasedSummonsRace_tbl2 <- as.table(ReleasedSummonsRace_tbl2)
## Conditional Distribution.
round(prop.table(ReleasedSummonsRace_tbl2,2)*100, digits=2)
## White Black
## Released 14.20 25.85
## Custody 85.80 74.15
barplot(prop.table(ReleasedSummonsRace_tbl2,2)*100, beside=T, ylab="%", ylim=c(0,100), main=" Released or Custody by Race", col=c("mediumspringgreen", "lightslateblue"))
legend("center", legend = c("Released", "Custody"), fill = c("mediumspringgreen", "lightslateblue"),cex=0.85)

Conclusion: Based on all the analysis done above, it can be concluded that the white individuals had significantly larger arrests of marijuana possession as compared to the black individuals. Also, the released on summons favor those individuals who had zero or less arrest records, were employed, and were citizen of Canada. Lastly, individuals who were white, male, employed, citizen, and age between 15-24 were arrested more than other because this particular group of individuals has much higher chance of marijuana possession on them.