Selected (CSV) dataset: “Arrests” from carData library.
We selected Arrests dataset from carData library in R.This data is on police treatment of individuals arrested in Toronto for simple possession of small quantities of marijuana. The data are part of a larger data set featured in a series of articles in the Toronto Star newspaper.
Data <- carData::Arrests
str(Data)
## 'data.frame': 5226 obs. of 8 variables:
## $ released: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 2 2 2 ...
## $ colour : Factor w/ 2 levels "Black","White": 2 1 2 1 1 1 2 2 1 2 ...
## $ year : int 2002 1999 2000 2000 1999 1998 1999 1998 2000 2001 ...
## $ age : int 21 17 24 46 27 16 40 34 23 30 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 2 1 2 2 ...
## $ employed: Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 1 2 2 2 ...
## $ citizen : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ checks : int 3 3 3 1 1 0 0 1 4 3 ...
##Questions for analysis The goal of doing analyis on this Marijuana Arrests dataset is to find out the answers for the following questions. (1) Figure out which race was arrested significantly more than the other for simple possession of small quantities of marijuana in Toronto between year 1997 and 2002. (2) Whether police released on summons based on previous records of arrest or based on other factors. (3) And whether there is a particular group of individuals were arrested more than the other. The factors to consider in this analysis would be such as their race, age, gender, citizen status, and employment status.
##Data Exploration It is a data frame with 5226 observations on the following 8 variables.
released - Whether or not the arrestee was released with a summons; a factor with levels: No; Yes.
colour- The arrestee’s race; a factor with levels: Black; White.
year-1997 through 2002; a numeric vector.
age- in years; a numeric vector.
sex a factor with levels: Female; Male.
employed a factor with levels: No; Yes.
citizen a factor with levels: No; Yes.
checks- Number of police data bases (of previous arrests, previous convictions, parole status, etc. - 6 in all) on which the arrestee’s name appeared; a numeric vector
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.4 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
library(DataExplorer)
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
library(gt)
##
## Attaching package: 'gt'
## The following object is masked from 'package:Hmisc':
##
## html
describe(Data)
## Data
##
## 8 Variables 5226 Observations
## --------------------------------------------------------------------------------
## released
## n missing distinct
## 5226 0 2
##
## Value No Yes
## Frequency 892 4334
## Proportion 0.171 0.829
## --------------------------------------------------------------------------------
## colour
## n missing distinct
## 5226 0 2
##
## Value Black White
## Frequency 1288 3938
## Proportion 0.246 0.754
## --------------------------------------------------------------------------------
## year
## n missing distinct Info Mean Gmd
## 5226 0 6 0.958 2000 1.564
##
## lowest : 1997 1998 1999 2000 2001, highest: 1998 1999 2000 2001 2002
##
## Value 1997 1998 1999 2000 2001 2002
## Frequency 492 877 1099 1270 1211 277
## Proportion 0.094 0.168 0.210 0.243 0.232 0.053
## --------------------------------------------------------------------------------
## age
## n missing distinct Info Mean Gmd .05 .10
## 5226 0 53 0.996 23.85 8.626 15 16
## .25 .50 .75 .90 .95
## 18 21 27 37 41
##
## lowest : 12 13 14 15 16, highest: 60 61 62 64 66
## --------------------------------------------------------------------------------
## sex
## n missing distinct
## 5226 0 2
##
## Value Female Male
## Frequency 443 4783
## Proportion 0.085 0.915
## --------------------------------------------------------------------------------
## employed
## n missing distinct
## 5226 0 2
##
## Value No Yes
## Frequency 1115 4111
## Proportion 0.213 0.787
## --------------------------------------------------------------------------------
## citizen
## n missing distinct
## 5226 0 2
##
## Value No Yes
## Frequency 771 4455
## Proportion 0.148 0.852
## --------------------------------------------------------------------------------
## checks
## n missing distinct Info Mean Gmd
## 5226 0 7 0.94 1.636 1.708
##
## lowest : 0 1 2 3 4, highest: 2 3 4 5 6
##
## Value 0 1 2 3 4 5 6
## Frequency 1851 854 789 953 643 127 9
## Proportion 0.354 0.163 0.151 0.182 0.123 0.024 0.002
## --------------------------------------------------------------------------------
#Data wrangling and Visualization ##Univariate Analysis
plot_bar(Data ,title = "Total Marijuana Possession Arrests by each factor")
We can see that there is huge variablility shown by each levels of factor variables.
WhiteCount <- length(which(Data$colour=="White"))
BlackCount <- length(which(Data$colour=="Black"))
BlackWhiteTable <- matrix(c(WhiteCount,BlackCount),ncol=2,nrow=1,byrow=TRUE)
dimnames(BlackWhiteTable) = list(c("Total of Arrests"), c("White", "Black"))
BlackWhiteTable /sum(BlackWhiteTable)
## White Black
## Total of Arrests 0.75354 0.24646
We can see the total number of white arrests was triple the black. This shows us that white individuals were much more likely to have been arrested for marijuana possession. This also means that out of 10 marijuana possession arrests, about 7.5 arrests would be white individuals.
#Year
Data %>% ggplot(aes(year))+
geom_bar(fill =colors()[550]) +
labs(title = "Total Marijuana Possession Arrests by year",
x = "year",
y = "Total Arrests")
#Create age group
labs <- c(paste(seq(0, 95, by=5), seq(0+5-1, 100-1, by=5), sep="-"), paste(100, "+", sep=""))
Data$AgeGroup <- cut(Data$age, breaks = c(seq(0, 100, by = 5), Inf), labels = labs, right = FALSE)
Data %>% ggplot(aes(AgeGroup))+
geom_bar(fill =colors()[550]) +
labs(title = "Total Marijuana Possession Arrests at Each Age",
x = "Age Group",
y = "Total Arrests")
we can see that no. of arrests are more in year 2000 than any other year. Also, the age group 15-19 has more no. of arrests that means youngsters are more inclined in criminal activities.And it is decreasing as getting older.
##Multivariate Analysis ###Scatter Plot of marijuana arrests vs. previous arrests:
AgeMarijArrests_tbl <- Data %>% group_by(age) %>% summarise(MarijArrests=length(colour))
## `summarise()` ungrouping output (override with `.groups` argument)
AgePreviousArrests_tbl <- Data %>% group_by(age) %>% summarise(PreviousArrests=sum(checks))
## `summarise()` ungrouping output (override with `.groups` argument)
ggplot(NULL) + geom_point(data=AgeMarijArrests_tbl, aes(x=age, y=MarijArrests, color="Marijuana Arrests")) +
ggtitle("Marijuana Possession Arrests vs. Previous Arrests") +
xlab("Age") + ylab("Total Arrests") +
geom_line(data=AgePreviousArrests_tbl, aes(x=age, y=PreviousArrests, color="Previous Arrests")) +
xlab("Age") + ylab("Total Arrests") +
theme(plot.title = element_text(hjust=0.5), axis.title.x = element_text(size=11, face="bold"),axis.title.y = element_text(size=11, face="bold"), axis.text.x = element_text(face="bold"), axis.text.y = element_text(face="bold")) +
scale_fill_manual(values=c("Marijuana Arrests"="red", "Previous Arrests"="darkblue")) +
theme(legend.title=element_blank())+
scale_color_manual(values = colours()[551:552])
###Scatter Plot of year vs. black & white:
YearWhiteBlackCount_tbl <- Data %>% group_by(year) %>% summarise(WhiteArrests=length(which(colour=="White")), BlackArrests=length(which(colour=="Black")))
## `summarise()` ungrouping output (override with `.groups` argument)
names(YearWhiteBlackCount_tbl) <- sub("^year$", "Year", names(YearWhiteBlackCount_tbl))
YearWhiteBlackCount_tbl
## # A tibble: 6 x 3
## Year WhiteArrests BlackArrests
## <int> <int> <int>
## 1 1997 369 123
## 2 1998 637 240
## 3 1999 849 250
## 4 2000 991 279
## 5 2001 900 311
## 6 2002 192 85
YearWhiteBlackCount_tbl <- transform(YearWhiteBlackCount_tbl, Year = as.Date(as.character(Year), "%Y"))
plot(YearWhiteBlackCount_tbl$WhiteArrests~as.Date(YearWhiteBlackCount_tbl$Year, "%y"),type="b", col=colors()[551],lty =2, main="Total Marijuana Possession Arrests by skin color and Year ", xlab="Year", ylab="Arrests", ylim=c(0,1200), pch=18)
lines(YearWhiteBlackCount_tbl$BlackArrests~as.Date(YearWhiteBlackCount_tbl$Year, "%y"), type="b", col=colors()[552], pch=18)
legend("topright", lty=c(2,1), legend=c("White", "Black"),col = colors()[551:552], cex=0.8)
It shows the white total arrest was increasing every year from 1997 to 2000 but starting to drop a bit in 2001 and dropped sharply in 2002.
###Scatter plot of released vs. previous records:
ReleasedCustodyRecord_tbl <- Data %>% group_by(checks) %>% summarise("Released"=length(which(released=="Yes")), "Custody"=length(which(released=="No")))
## `summarise()` ungrouping output (override with `.groups` argument)
names(ReleasedCustodyRecord_tbl) <- sub("^checks$", "PreviousArrests", names(ReleasedCustodyRecord_tbl))
ggplot(NULL) + geom_line(data=ReleasedCustodyRecord_tbl, aes(x=PreviousArrests, y=Released), color=colors()[551]) +
ggtitle("Released vs Previous Arrests") +
xlab("Previous Arrests") + ylab("Total Released") +
theme(plot.title = element_text(hjust=0.5), axis.title.x = element_text(size=11, face="bold"),axis.title.y = element_text(size=11, face="bold"),
axis.text.x = element_text(face="bold"), axis.text.y = element_text(face="bold")) +
theme(legend.position = "none")
it is very obvious from the figurw that “ZERO” on previous records of arrest would have much higher chance of getting released on summons and would be more likely to get into custody when the number previous records of arrest was high.
###Released v/s color:
## Group by released and then count the white and black arrests.
ReleasedSummonsRace_tbl <- Data %>% group_by(released) %>% summarise(White=length(which(colour=="White")), Black=length(which(colour=="Black")))
## `summarise()` ungrouping output (override with `.groups` argument)
## Create into a table.
ReleasedSummonsRace_tbl2 <- matrix(c(ReleasedSummonsRace_tbl$White, ReleasedSummonsRace_tbl$Black), ncol=2)
colnames(ReleasedSummonsRace_tbl2) <- c("White", "Black")
rownames(ReleasedSummonsRace_tbl2) <- c("Released", "Custody")
ReleasedSummonsRace_tbl2 <- as.table(ReleasedSummonsRace_tbl2)
## Conditional Distribution:
round(prop.table(ReleasedSummonsRace_tbl2,2)*100, digits=2)
## White Black
## Released 14.20 25.85
## Custody 85.80 74.15
barplot(prop.table(ReleasedSummonsRace_tbl2,2)*100, beside=T, ylab="%", ylim=c(0,100), main=" Released or Custody by Race", col=colors()[551:552])
legend("bottom", legend = c("Released", "Custody"), fill =colors()[551:552] ,cex=0.7)
It is good to know whether the released on summons has any correlation with other factors. From the conditional distribution table and figure , both shows that black individuals was at 26% released rate as compared to the white at 14%. And the white was 10% more likely not be able to be released on summons as compared to the black.
###Released v/s Gender:
## Group by released and then count the gender.
ReleasedSummonsGender_tbl <- Data %>% group_by(released) %>% summarise(Male=length(which(sex=="Male")), Female=length(which(sex=="Female")))
## `summarise()` ungrouping output (override with `.groups` argument)
## Create into a table.
ReleasedSummonsGender_tbl2 <- matrix(c(ReleasedSummonsGender_tbl$Male, ReleasedSummonsGender_tbl$Female), ncol=2)
colnames(ReleasedSummonsGender_tbl2) <- c("Male", "Female")
rownames(ReleasedSummonsGender_tbl2) <- c("Custody", "Released")
ReleasedSummonsGender_tbl2 <- as.table(ReleasedSummonsGender_tbl2)
## Conditional Distribution:
round(prop.table(ReleasedSummonsGender_tbl2,2)*100, digits=2)
## Male Female
## Custody 17.33 14.22
## Released 82.67 85.78
barplot(prop.table(ReleasedSummonsGender_tbl2,2)*100, beside=T, ylab="%", main=" Released or Custody by Gender", ylim=c(0,100), col=colors()[551:552])
legend("bottom", legend = c("Custody", "Released"), fill = colors()[551:552],cex=0.75)
From the figure and table above, we can see there are not much different between getting a released on summons vs gender.
###Released v/s employment status:
## Group by released and then count the employment status.
ReleasedSummonsEmploy_tbl <- Data %>% group_by(released) %>% summarise(Employed=length(which(employed=="Yes")), Unemployed=length(which(employed=="No")))
## `summarise()` ungrouping output (override with `.groups` argument)
## Create into a table.
ReleasedSummonsEmploy_tbl2 <- matrix(c(ReleasedSummonsEmploy_tbl$Employed, ReleasedSummonsEmploy_tbl$Unemployed), ncol=2)
colnames(ReleasedSummonsEmploy_tbl2) <- c("Employed", "Unemployed")
rownames(ReleasedSummonsEmploy_tbl2) <- c("Custody", "Released")
ReleasedSummonsEmploy_tbl2 <- as.table(ReleasedSummonsEmploy_tbl2)
## Conditional Distribution: Table 6 and Figure 10
round(prop.table(ReleasedSummonsEmploy_tbl2,2)*100, digits=2)
## Employed Unemployed
## Custody 13.21 31.30
## Released 86.79 68.70
barplot(prop.table(ReleasedSummonsEmploy_tbl2,2)*100, beside=T, ylab="%", main=" Released or Custody by Employment Status", ylim=c(0,100), col =colors()[551:552])
legend("bottom", legend = c("Custody", "Released"), fill = colors()[551:552],cex=0.75)
From the figure and table above , we can see that unemployment individuals have larger amount in custody and smaller amount in released on summons; therefore, they would have higher chance of not succeed in getting released on summons as compared to those who were employed.
###Released v/s Citizenship status:
## Group by released and then count the citizenship status.
ReleasedSummonsCitiz_tbl <- Data %>% group_by(released) %>% summarise(Citizen=length(which(citizen=="Yes")), NonCitizen=length(which(citizen=="No")))
## `summarise()` ungrouping output (override with `.groups` argument)
## Create into a table.
ReleasedSummonsCitiz_tbl2 <- matrix(c(ReleasedSummonsCitiz_tbl$Citizen, ReleasedSummonsCitiz_tbl$NonCitizen), ncol=2)
colnames(ReleasedSummonsCitiz_tbl2) <- c("Citizen", "Non Citizen")
rownames(ReleasedSummonsCitiz_tbl2) <- c("Custody", "Released")
ReleasedSummonsCitiz_tbl2 <- as.table(ReleasedSummonsCitiz_tbl2)
## Conditional Distribution:
round(prop.table(ReleasedSummonsCitiz_tbl2,2)*100, digits=2)
## Citizen Non Citizen
## Custody 15.26 27.50
## Released 84.74 72.50
barplot(prop.table(ReleasedSummonsCitiz_tbl2,2)*100, beside=T, main=" Released or Custody by Citizenship Status", ylab="%", ylim=c(0,100), col=colors()[551:552])
legend("bottom", legend = c("Custody", "Released"), fill = colors()[551:552],cex=0.75)
From the figure and table above , we can see the same thing like the above employment status analysis. Individuals who were citizen have higher chance of getting released on summons as compared to those who were non citizen.
##Summary of all variable
Age_tbl <- Data %>% group_by(AgeGroup) %>% summarise("Released"=length(which(released=="Yes")), "Custody"=length(which(released=="No")), "White"=length(which(colour=="White")), "Black"=length(which(colour=="Black")), "Male"=length(which(sex=="Male")), "Female"=length(which(sex=="Female")), "Employed"=length(which(employed=="Yes")), "Unemployed"=length(which(employed=="No")), "Citizen"=length(which(citizen=="Yes")), "Non Citizen"=length(which(citizen=="No")))
## `summarise()` ungrouping output (override with `.groups` argument)
head(arrange(Age_tbl, desc(Released))) %>% gt::gt()
| AgeGroup | Released | Custody | White | Black | Male | Female | Employed | Unemployed | Citizen | Non Citizen |
|---|---|---|---|---|---|---|---|---|---|---|
| 15-19 | 1590 | 311 | 1532 | 369 | 1710 | 191 | 1599 | 302 | 1671 | 230 |
| 20-24 | 1285 | 241 | 1097 | 429 | 1430 | 96 | 1165 | 361 | 1294 | 232 |
| 25-29 | 498 | 117 | 430 | 185 | 562 | 53 | 472 | 143 | 512 | 103 |
| 30-34 | 344 | 67 | 304 | 107 | 378 | 33 | 299 | 112 | 338 | 73 |
| 35-39 | 258 | 62 | 234 | 86 | 291 | 29 | 239 | 81 | 262 | 58 |
| 40-44 | 145 | 47 | 144 | 48 | 173 | 19 | 127 | 65 | 156 | 36 |
Table shows the summary of total arrests for all variables by age group. We can clearly see that from age 15-24 accounts for majority of the arrests for each variable. The figure shows the variation of arrests for each variable spread across each age group. The released, white, male, employed, and citizen variables which total arrests variation are much larger than others.
###Boxplot for each variable:
Age_tbl2 <- melt(Age_tbl, id.vars=c("AgeGroup"))
ggplot(Age_tbl2, aes(x=reorder(factor(variable),value, fun=median),y=value,fill=factor(variable))) + geom_boxplot() + labs(title="Marijuana Possession Arrests by Different Variables") + ylab("Total Arrests") + theme(legend.position = "none", axis.title.x = element_blank(), axis.text.x=element_text(angle=45)) + theme(plot.title = element_text(hjust=0.5)) +
theme(axis.text.x = element_text(margin = margin(t = 20, r = 20, b = 0, l = 0)))
The figure shows the variation of arrests for each variable spread across each age group. The released, white, male, employed, and citizen variables which total arrests variation are much larger than others.
#Conclusion ###Based on all the analysis done above, it can be concluded that the white individuals have significantly larger arrests of marijuana possession as compared to the black individuals. Also, the released on summons favor those individuals who had zero or less arrest records, were employed, and were citizen of Canada. Lastly, individuals who were white, male, employed, citizen, and age between 15-24 were arrested more often than other because the analysis shows that this particular group of individuals has much higher chance of marijuana possession on them.