Week 2 R Homework

by Jhalak Das

01/05/2020

Selected (CSV) dataset: “Arrests” from carData library.

Introduction

We selected Arrests dataset from carData library in R.This data is on police treatment of individuals arrested in Toronto for simple possession of small quantities of marijuana. The data are part of a larger data set featured in a series of articles in the Toronto Star newspaper.

Data <- carData::Arrests
str(Data)

## 'data.frame':    5226 obs. of  8 variables:
##  $ released: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 2 2 2 ...
##  $ colour  : Factor w/ 2 levels "Black","White": 2 1 2 1 1 1 2 2 1 2 ...
##  $ year    : int  2002 1999 2000 2000 1999 1998 1999 1998 2000 2001 ...
##  $ age     : int  21 17 24 46 27 16 40 34 23 30 ...
##  $ sex     : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 2 1 2 2 ...
##  $ employed: Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 1 2 2 2 ...
##  $ citizen : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ checks  : int  3 3 3 1 1 0 0 1 4 3 ...

##Questions for analysis The goal of doing analyis on this Marijuana Arrests dataset is to find out the answers for the following questions. (1) Figure out which race was arrested significantly more than the other for simple possession of small quantities of marijuana in Toronto between year 1997 and 2002. (2) Whether police released on summons based on previous records of arrest or based on other factors. (3) And whether there is a particular group of individuals were arrested more than the other. The factors to consider in this analysis would be such as their race, age, gender, citizen status, and employment status.

##Data Exploration It is a data frame with 5226 observations on the following 8 variables.

released - Whether or not the arrestee was released with a summons; a factor with levels: No; Yes.

colour- The arrestee’s race; a factor with levels: Black; White.

year-1997 through 2002; a numeric vector.

age- in years; a numeric vector.

sex a factor with levels: Female; Male.

employed a factor with levels: No; Yes.

citizen a factor with levels: No; Yes.

checks- Number of police data bases (of previous arrests, previous convictions, parole status, etc. - 6 in all) on which the arrestee’s name appeared; a numeric vector

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.4     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(Hmisc)

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, units

library(DataExplorer)
library(reshape2)

## 
## Attaching package: 'reshape2'

## The following object is masked from 'package:tidyr':
## 
##     smiths

library(gt)

## 
## Attaching package: 'gt'

## The following object is masked from 'package:Hmisc':
## 
##     html

describe(Data)

## Data 
## 
##  8  Variables      5226  Observations
## --------------------------------------------------------------------------------
## released 
##        n  missing distinct 
##     5226        0        2 
##                       
## Value         No   Yes
## Frequency    892  4334
## Proportion 0.171 0.829
## --------------------------------------------------------------------------------
## colour 
##        n  missing distinct 
##     5226        0        2 
##                       
## Value      Black White
## Frequency   1288  3938
## Proportion 0.246 0.754
## --------------------------------------------------------------------------------
## year 
##        n  missing distinct     Info     Mean      Gmd 
##     5226        0        6    0.958     2000    1.564 
## 
## lowest : 1997 1998 1999 2000 2001, highest: 1998 1999 2000 2001 2002
##                                               
## Value       1997  1998  1999  2000  2001  2002
## Frequency    492   877  1099  1270  1211   277
## Proportion 0.094 0.168 0.210 0.243 0.232 0.053
## --------------------------------------------------------------------------------
## age 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     5226        0       53    0.996    23.85    8.626       15       16 
##      .25      .50      .75      .90      .95 
##       18       21       27       37       41 
## 
## lowest : 12 13 14 15 16, highest: 60 61 62 64 66
## --------------------------------------------------------------------------------
## sex 
##        n  missing distinct 
##     5226        0        2 
##                         
## Value      Female   Male
## Frequency     443   4783
## Proportion  0.085  0.915
## --------------------------------------------------------------------------------
## employed 
##        n  missing distinct 
##     5226        0        2 
##                       
## Value         No   Yes
## Frequency   1115  4111
## Proportion 0.213 0.787
## --------------------------------------------------------------------------------
## citizen 
##        n  missing distinct 
##     5226        0        2 
##                       
## Value         No   Yes
## Frequency    771  4455
## Proportion 0.148 0.852
## --------------------------------------------------------------------------------
## checks 
##        n  missing distinct     Info     Mean      Gmd 
##     5226        0        7     0.94    1.636    1.708 
## 
## lowest : 0 1 2 3 4, highest: 2 3 4 5 6
##                                                     
## Value          0     1     2     3     4     5     6
## Frequency   1851   854   789   953   643   127     9
## Proportion 0.354 0.163 0.151 0.182 0.123 0.024 0.002
## --------------------------------------------------------------------------------

#Data wrangling and Visualization ##Univariate Analysis

plot_bar(Data ,title = "Total Marijuana Possession Arrests by each factor")

We can see that there is huge variablility shown by each levels of factor variables.

WhiteCount <- length(which(Data$colour=="White"))
BlackCount <- length(which(Data$colour=="Black"))
BlackWhiteTable <- matrix(c(WhiteCount,BlackCount),ncol=2,nrow=1,byrow=TRUE)
dimnames(BlackWhiteTable) = list(c("Total of Arrests"), c("White", "Black"))
BlackWhiteTable /sum(BlackWhiteTable)

##                    White   Black
## Total of Arrests 0.75354 0.24646

We can see the total number of white arrests was triple the black. This shows us that white individuals were much more likely to have been arrested for marijuana possession. This also means that out of 10 marijuana possession arrests, about 7.5 arrests would be white individuals.

#Year
Data %>% ggplot(aes(year))+
  geom_bar(fill =colors()[550]) +
  labs(title = "Total Marijuana Possession Arrests by year",
       x = "year",
       y = "Total Arrests")

#Create age group
labs <- c(paste(seq(0, 95, by=5), seq(0+5-1, 100-1, by=5), sep="-"), paste(100, "+", sep=""))
Data$AgeGroup <- cut(Data$age, breaks = c(seq(0, 100, by = 5), Inf), labels = labs, right = FALSE)

Data %>% ggplot(aes(AgeGroup))+
  geom_bar(fill =colors()[550]) +
  labs(title = "Total Marijuana Possession Arrests at Each Age",
       x = "Age Group",
       y = "Total Arrests")

we can see that no. of arrests are more in year 2000 than any other year. Also, the age group 15-19 has more no. of arrests that means youngsters are more inclined in criminal activities.And it is decreasing as getting older.

##Multivariate Analysis ###Scatter Plot of marijuana arrests vs. previous arrests:

AgeMarijArrests_tbl <- Data %>% group_by(age) %>% summarise(MarijArrests=length(colour))

## `summarise()` ungrouping output (override with `.groups` argument)

AgePreviousArrests_tbl <- Data %>% group_by(age) %>% summarise(PreviousArrests=sum(checks))

## `summarise()` ungrouping output (override with `.groups` argument)

ggplot(NULL) + geom_point(data=AgeMarijArrests_tbl, aes(x=age, y=MarijArrests, color="Marijuana Arrests")) + 
  ggtitle("Marijuana Possession Arrests vs. Previous Arrests") +
  xlab("Age") + ylab("Total Arrests") + 
  geom_line(data=AgePreviousArrests_tbl, aes(x=age, y=PreviousArrests, color="Previous Arrests")) + 
  xlab("Age") + ylab("Total Arrests") + 
  theme(plot.title = element_text(hjust=0.5), axis.title.x = element_text(size=11, face="bold"),axis.title.y = element_text(size=11, face="bold"), axis.text.x = element_text(face="bold"), axis.text.y = element_text(face="bold")) + 
  scale_fill_manual(values=c("Marijuana Arrests"="red", "Previous Arrests"="darkblue")) +
  theme(legend.title=element_blank())+
  scale_color_manual(values = colours()[551:552])

###Scatter Plot of year vs. black & white:

YearWhiteBlackCount_tbl <- Data %>% group_by(year) %>% summarise(WhiteArrests=length(which(colour=="White")), BlackArrests=length(which(colour=="Black")))

## `summarise()` ungrouping output (override with `.groups` argument)

names(YearWhiteBlackCount_tbl) <- sub("^year$", "Year", names(YearWhiteBlackCount_tbl))
YearWhiteBlackCount_tbl

## # A tibble: 6 x 3
##    Year WhiteArrests BlackArrests
##   <int>        <int>        <int>
## 1  1997          369          123
## 2  1998          637          240
## 3  1999          849          250
## 4  2000          991          279
## 5  2001          900          311
## 6  2002          192           85

YearWhiteBlackCount_tbl <-  transform(YearWhiteBlackCount_tbl, Year = as.Date(as.character(Year), "%Y"))

plot(YearWhiteBlackCount_tbl$WhiteArrests~as.Date(YearWhiteBlackCount_tbl$Year, "%y"),type="b", col=colors()[551],lty =2, main="Total Marijuana Possession Arrests by skin color and Year ", xlab="Year", ylab="Arrests", ylim=c(0,1200), pch=18)
lines(YearWhiteBlackCount_tbl$BlackArrests~as.Date(YearWhiteBlackCount_tbl$Year, "%y"), type="b", col=colors()[552], pch=18)
legend("topright", lty=c(2,1), legend=c("White", "Black"),col = colors()[551:552], cex=0.8)

It shows the white total arrest was increasing every year from 1997 to 2000 but starting to drop a bit in 2001 and dropped sharply in 2002.

###Scatter plot of released vs. previous records:

ReleasedCustodyRecord_tbl <- Data %>% group_by(checks) %>% summarise("Released"=length(which(released=="Yes")), "Custody"=length(which(released=="No")))

## `summarise()` ungrouping output (override with `.groups` argument)

names(ReleasedCustodyRecord_tbl) <- sub("^checks$", "PreviousArrests", names(ReleasedCustodyRecord_tbl))

ggplot(NULL) + geom_line(data=ReleasedCustodyRecord_tbl, aes(x=PreviousArrests, y=Released), color=colors()[551]) + 
  ggtitle("Released vs Previous Arrests") + 
  xlab("Previous Arrests") + ylab("Total Released") + 
  theme(plot.title = element_text(hjust=0.5), axis.title.x = element_text(size=11, face="bold"),axis.title.y = element_text(size=11, face="bold"), 
        axis.text.x = element_text(face="bold"), axis.text.y = element_text(face="bold")) + 
  theme(legend.position = "none")

it is very obvious from the figurw that “ZERO” on previous records of arrest would have much higher chance of getting released on summons and would be more likely to get into custody when the number previous records of arrest was high.

###Released v/s color:

## Group by released and then count the white and black arrests.
ReleasedSummonsRace_tbl <- Data %>% group_by(released) %>% summarise(White=length(which(colour=="White")), Black=length(which(colour=="Black")))

## `summarise()` ungrouping output (override with `.groups` argument)

## Create into a table.
ReleasedSummonsRace_tbl2 <- matrix(c(ReleasedSummonsRace_tbl$White, ReleasedSummonsRace_tbl$Black), ncol=2)
colnames(ReleasedSummonsRace_tbl2) <- c("White", "Black")
rownames(ReleasedSummonsRace_tbl2) <- c("Released", "Custody")
ReleasedSummonsRace_tbl2 <- as.table(ReleasedSummonsRace_tbl2)

## Conditional Distribution: 
round(prop.table(ReleasedSummonsRace_tbl2,2)*100, digits=2)

##          White Black
## Released 14.20 25.85
## Custody  85.80 74.15

barplot(prop.table(ReleasedSummonsRace_tbl2,2)*100, beside=T, ylab="%", ylim=c(0,100), main=" Released or Custody by Race", col=colors()[551:552])
legend("bottom", legend = c("Released", "Custody"), fill =colors()[551:552] ,cex=0.7)

It is good to know whether the released on summons has any correlation with other factors. From the conditional distribution table and figure , both shows that black individuals was at 26% released rate as compared to the white at 14%. And the white was 10% more likely not be able to be released on summons as compared to the black.

###Released v/s Gender:

## Group by released and then count the gender.
ReleasedSummonsGender_tbl <- Data %>% group_by(released) %>% summarise(Male=length(which(sex=="Male")), Female=length(which(sex=="Female")))

## `summarise()` ungrouping output (override with `.groups` argument)

## Create into a table.
ReleasedSummonsGender_tbl2 <- matrix(c(ReleasedSummonsGender_tbl$Male, ReleasedSummonsGender_tbl$Female), ncol=2)
colnames(ReleasedSummonsGender_tbl2) <- c("Male", "Female")
rownames(ReleasedSummonsGender_tbl2) <- c("Custody", "Released")
ReleasedSummonsGender_tbl2 <- as.table(ReleasedSummonsGender_tbl2)

## Conditional Distribution: 
round(prop.table(ReleasedSummonsGender_tbl2,2)*100, digits=2)

##           Male Female
## Custody  17.33  14.22
## Released 82.67  85.78

barplot(prop.table(ReleasedSummonsGender_tbl2,2)*100, beside=T, ylab="%", main=" Released or Custody by Gender", ylim=c(0,100), col=colors()[551:552])
legend("bottom", legend = c("Custody", "Released"), fill = colors()[551:552],cex=0.75)

From the figure and table above, we can see there are not much different between getting a released on summons vs gender.

###Released v/s employment status:

## Group by released and then count the employment status.
ReleasedSummonsEmploy_tbl <- Data %>% group_by(released) %>% summarise(Employed=length(which(employed=="Yes")), Unemployed=length(which(employed=="No")))

## `summarise()` ungrouping output (override with `.groups` argument)

## Create into a table.
ReleasedSummonsEmploy_tbl2 <- matrix(c(ReleasedSummonsEmploy_tbl$Employed, ReleasedSummonsEmploy_tbl$Unemployed), ncol=2)
colnames(ReleasedSummonsEmploy_tbl2) <- c("Employed", "Unemployed")
rownames(ReleasedSummonsEmploy_tbl2) <- c("Custody", "Released")
ReleasedSummonsEmploy_tbl2 <- as.table(ReleasedSummonsEmploy_tbl2)

## Conditional Distribution: Table 6 and Figure 10
round(prop.table(ReleasedSummonsEmploy_tbl2,2)*100, digits=2)

##          Employed Unemployed
## Custody     13.21      31.30
## Released    86.79      68.70

barplot(prop.table(ReleasedSummonsEmploy_tbl2,2)*100, beside=T, ylab="%", main=" Released or Custody by Employment Status", ylim=c(0,100), col =colors()[551:552])
legend("bottom", legend = c("Custody", "Released"), fill = colors()[551:552],cex=0.75)

From the figure and table above , we can see that unemployment individuals have larger amount in custody and smaller amount in released on summons; therefore, they would have higher chance of not succeed in getting released on summons as compared to those who were employed.

###Released v/s Citizenship status:

## Group by released and then count the citizenship status.
ReleasedSummonsCitiz_tbl <- Data %>% group_by(released) %>% summarise(Citizen=length(which(citizen=="Yes")), NonCitizen=length(which(citizen=="No")))

## `summarise()` ungrouping output (override with `.groups` argument)

## Create into a table.
ReleasedSummonsCitiz_tbl2 <- matrix(c(ReleasedSummonsCitiz_tbl$Citizen, ReleasedSummonsCitiz_tbl$NonCitizen), ncol=2)
colnames(ReleasedSummonsCitiz_tbl2) <- c("Citizen", "Non Citizen")
rownames(ReleasedSummonsCitiz_tbl2) <- c("Custody", "Released")

ReleasedSummonsCitiz_tbl2 <- as.table(ReleasedSummonsCitiz_tbl2)

## Conditional Distribution: 
round(prop.table(ReleasedSummonsCitiz_tbl2,2)*100, digits=2)

##          Citizen Non Citizen
## Custody    15.26       27.50
## Released   84.74       72.50

barplot(prop.table(ReleasedSummonsCitiz_tbl2,2)*100, beside=T, main=" Released or Custody by Citizenship Status", ylab="%", ylim=c(0,100), col=colors()[551:552])
legend("bottom", legend = c("Custody", "Released"), fill = colors()[551:552],cex=0.75)

From the figure and table above , we can see the same thing like the above employment status analysis. Individuals who were citizen have higher chance of getting released on summons as compared to those who were non citizen.

##Summary of all variable

Age_tbl <- Data %>% group_by(AgeGroup) %>% summarise("Released"=length(which(released=="Yes")), "Custody"=length(which(released=="No")), "White"=length(which(colour=="White")), "Black"=length(which(colour=="Black")), "Male"=length(which(sex=="Male")), "Female"=length(which(sex=="Female")), "Employed"=length(which(employed=="Yes")), "Unemployed"=length(which(employed=="No")), "Citizen"=length(which(citizen=="Yes")), "Non Citizen"=length(which(citizen=="No")))

## `summarise()` ungrouping output (override with `.groups` argument)

head(arrange(Age_tbl, desc(Released))) %>% gt::gt()

AgeGroup	Released	Custody	White	Black	Male	Female	Employed	Unemployed	Citizen	Non Citizen
15-19	1590	311	1532	369	1710	191	1599	302	1671	230
20-24	1285	241	1097	429	1430	96	1165	361	1294	232
25-29	498	117	430	185	562	53	472	143	512	103
30-34	344	67	304	107	378	33	299	112	338	73
35-39	258	62	234	86	291	29	239	81	262	58
40-44	145	47	144	48	173	19	127	65	156	36

Table shows the summary of total arrests for all variables by age group. We can clearly see that from age 15-24 accounts for majority of the arrests for each variable. The figure shows the variation of arrests for each variable spread across each age group. The released, white, male, employed, and citizen variables which total arrests variation are much larger than others.

###Boxplot for each variable:

Age_tbl2 <- melt(Age_tbl, id.vars=c("AgeGroup"))
ggplot(Age_tbl2, aes(x=reorder(factor(variable),value, fun=median),y=value,fill=factor(variable))) + geom_boxplot() + labs(title="Marijuana Possession Arrests by Different Variables") + ylab("Total Arrests") + theme(legend.position = "none", axis.title.x = element_blank(), axis.text.x=element_text(angle=45)) + theme(plot.title = element_text(hjust=0.5)) + 
  theme(axis.text.x = element_text(margin = margin(t = 20, r = 20, b = 0, l = 0)))

The figure shows the variation of arrests for each variable spread across each age group. The released, white, male, employed, and citizen variables which total arrests variation are much larger than others.

#Conclusion ###Based on all the analysis done above, it can be concluded that the white individuals have significantly larger arrests of marijuana possession as compared to the black individuals. Also, the released on summons favor those individuals who had zero or less arrest records, were employed, and were citizen of Canada. Lastly, individuals who were white, male, employed, citizen, and age between 15-24 were arrested more often than other because the analysis shows that this particular group of individuals has much higher chance of marijuana possession on them.