This open-source dataset was downloaded from http://vincentarelbundock.github.io/Rdatasets/
Description:
Data on police treatment of individuals arrested in Toronto for simple possession of small quantities of marijuana. The data are part of a larger data set featured in a series of articles in the Toronto Star newspaper.
Format:
A data frame with 5226 observations on the following 8 variables.
released - Whether or not the arrestee was released with a summons; a factor with levels: No; Yes.
colour - The arrestee’s race; a factor with levels: Black; White.
year - 1997 through 2002; a numeric vector.
age - in years; a numeric vector.
sex - a factor with levels: Female; Male.
employed - a factor with levels: No; Yes.
citizen - a factor with levels: No; Yes.
checks - Number of police databases (of previous arrests, previous convictions, parole status, etc.; 6 in all) on which the arrestee’s name appeared; a numeric vector.
Questions for analysis:
1. At which ages are people most likely to be arrested for possession of marijuana?
2. Is the total number of people arrested for possession of marijuana increasing or decreasing each year?
3. Do an individual’s demographics (gender, race, employment, citizenship status) affect the likelihood of being released?
4. Does an individual’s record of previous arrests affect the likelihood of being released?
Import the data set “arrests” from the working directory.
arrests <- read.csv(file="Arrests.csv", header=TRUE, sep=",")
Alternatively, place the original .csv file in a GitHub repository and have R read it from the link.
# read.csv() is part of base R, so no additional package (such as readr) is needed
gitHubArrests <- read.csv("https://raw.githubusercontent.com/ezaccountz/SPS_Bridge_R_HW3/master/Arrests.csv")
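The format description above lists several variables as factors, but since R 4.0 read.csv() reads text columns as character vectors by default. The sketch below shows one way to get factors on import. The two rows in csv_text are illustrative sample rows typed in by hand so that the example runs without the file, and sample_rows is a name used only here.

```r
# Two hand-typed rows in the same column layout as the data set
# (illustrative only, so the example runs without Arrests.csv)
csv_text <- "released,colour,year,age,sex,employed,citizen,checks
Yes,White,2002,21,Male,Yes,Yes,3
No,Black,1999,17,Male,Yes,Yes,3"

# stringsAsFactors = TRUE converts the text columns to factors on import
sample_rows <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# Inspect the column types and the levels of one factor
str(sample_rows)
levels(sample_rows$released)
```

The same stringsAsFactors argument can be passed when reading Arrests.csv. The comparisons used later (for example arrests$colour == "White") work with either character or factor columns.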
First, let’s take a quick look at the summary of the age variable.
summary(arrests$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.00 18.00 21.00 23.85 27.00 66.00
boxplot(arrests$age)
The mean is 23.85, and the box plot confirms that the middle half of the arrestees (first to third quartile) are between 18 and 27 years old.
Next, let’s look at a more detailed distribution of the age variable.
hist(x = arrests$age,main = "Arrests for Marijuana Possession", xlab="Age", ylab="Number of Arrests")
We can conclude that arrests are concentrated among people around age 20, which suggests that people of that age are the most likely to be arrested for possession of marijuana.
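To put numbers on what the histogram shows, the ages can also be binned into groups and counted. The sketch below uses a short made-up vector of ages (not taken from the data set) and age-group boundaries chosen only for illustration. With the real data, arrests$age would take the place of ages.

```r
# Made-up ages, used only so the example runs on its own
ages <- c(15, 18, 19, 20, 21, 21, 23, 27, 34, 52)

# cut() assigns each age to an interval; with the default right = TRUE
# the intervals are (0,17], (17,24], (24,34] and (34,Inf]
age_group <- cut(ages,
                 breaks = c(0, 17, 24, 34, Inf),
                 labels = c("under 18", "18-24", "25-34", "35+"))

# Count the number of people in each age group
age_counts <- table(age_group)
age_counts

# The group with the most people
names(which.max(age_counts))
```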
Next, let’s check the number of people arrested each year.
arrests_years <- data.frame(matrix(ncol= 6, nrow = 1))
colnames(arrests_years) <- c(1997:2002)
rownames(arrests_years) <- c("Number_of_Arrested")
for (i in c(1997:2002))
{
arrests_years["Number_of_Arrested",toString(i)] <- length(arrests$year[arrests$year == i])
}
arrests_years
## 1997 1998 1999 2000 2001 2002
## Number_of_Arrested 492 877 1099 1270 1211 277
The number increased from 1997 to 2000, began to decrease in 2001, and then dropped dramatically in 2002.
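# attach() is not needed here; convert the column names to numbers and the one-row data frame to a plain vector
plot(x = as.numeric(colnames(arrests_years)), y = unlist(arrests_years["Number_of_Arrested",]), type = "b", main = "Arrests for Marijuana Possession", xlab = "Year", ylab = "Number of Arrests", las = 2)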
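The loop above can also be written with table(), which counts how often each value occurs. To keep this sketch runnable without the CSV file, the year column is rebuilt from the yearly counts printed above. With the data loaded, table(arrests$year) would do the same job. The names arrests_per_year and yearly_change are used only in this example.

```r
# Rebuild the year column from the yearly counts reported above
# (one entry per arrest), so this sketch runs without the CSV file
year <- rep(1997:2002, times = c(492, 877, 1099, 1270, 1211, 277))

# table() counts how often each distinct value occurs
arrests_per_year <- table(year)
arrests_per_year

# Change in the number of arrests compared with the previous year
yearly_change <- diff(as.vector(arrests_per_year))
names(yearly_change) <- 1998:2002
yearly_change
```

A negative value in yearly_change marks a year in which fewer people were arrested than in the year before.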
Next, let’s summarize the number of people arrested in each demographic group, by year.
demographics <- data.frame(matrix(ncol= 8, nrow = 7))
colnames(demographics) <- c("White", "Black", "Male", "Female", "Employed", "Unemployed", "Citizen","Non_Citizen")
rownames(demographics) <- c(1997,1998,1999,2000,2001,2002,"Total")
for (i in c(1997:2002))
{
# sum() over a logical vector counts the rows that satisfy both conditions
demographics[toString(i),"White"] <- sum(arrests$colour == "White" & arrests$year == i)
demographics[toString(i),"Black"] <- sum(arrests$colour == "Black" & arrests$year == i)
demographics[toString(i),"Male"] <- sum(arrests$sex == "Male" & arrests$year == i)
demographics[toString(i),"Female"] <- sum(arrests$sex == "Female" & arrests$year == i)
demographics[toString(i),"Employed"] <- sum(arrests$employed == "Yes" & arrests$year == i)
demographics[toString(i),"Unemployed"] <- sum(arrests$employed == "No" & arrests$year == i)
demographics[toString(i),"Citizen"] <- sum(arrests$citizen == "Yes" & arrests$year == i)
demographics[toString(i),"Non_Citizen"] <- sum(arrests$citizen == "No" & arrests$year == i)
}
for (cn in colnames(demographics))
{
demographics["Total",cn] = sum(demographics[1:6,cn])
}
demographics
## White Black Male Female Employed Unemployed Citizen Non_Citizen
## 1997 369 123 460 32 357 135 178 314
## 1998 637 240 782 95 686 191 789 88
## 1999 849 250 999 100 862 237 989 110
## 2000 991 279 1165 105 1042 228 1156 114
## 2001 900 311 1118 93 950 261 1095 116
## 2002 192 85 259 18 214 63 248 29
## Total 3938 1288 4783 443 4111 1115 4455 771
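A two-way table() gives the same kind of year-by-group counts without a loop. The sketch below covers only the colour variable and rebuilds one row per arrest from the White and Black counts printed above, so that it runs without the CSV file. With the data loaded, table(arrests$year, arrests$colour) would be the direct equivalent. The names counts, rows and tab are used only in this example.

```r
# Yearly counts for the colour variable, copied from the table above
counts <- data.frame(
  year   = rep(1997:2002, times = 2),
  colour = rep(c("White", "Black"), each = 6),
  n      = c(369, 637, 849, 991, 900, 192,
             123, 240, 250, 279, 311, 85)
)

# Expand to one row per arrest, mimicking the layout of the raw data
rows <- counts[rep(seq_len(nrow(counts)), times = counts$n), c("year", "colour")]

# Cross-tabulate year against colour
tab <- table(rows$year, rows$colour)

# addmargins() appends row and column totals, labelled "Sum"
addmargins(tab)
```

The columns come out in alphabetical order (Black, then White), and the "Sum" row plays the role of the "Total" row in the table above.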
Next, create the same summary for the people who were released.
demographics_Released <- data.frame(matrix(ncol= 8, nrow = 7))
colnames(demographics_Released) <- c("White", "Black", "Male", "Female", "Employed", "Unemployed", "Citizen","Non_Citizen")
rownames(demographics_Released) <- c(1997,1998,1999,2000,2001,2002,"Total")
for (i in c(1997:2002))
{
# Same counts as before, restricted to arrestees who were released
demographics_Released[toString(i),"White"] <- sum(arrests$colour == "White" & arrests$year == i & arrests$released == "Yes")
demographics_Released[toString(i),"Black"] <- sum(arrests$colour == "Black" & arrests$year == i & arrests$released == "Yes")
demographics_Released[toString(i),"Male"] <- sum(arrests$sex == "Male" & arrests$year == i & arrests$released == "Yes")
demographics_Released[toString(i),"Female"] <- sum(arrests$sex == "Female" & arrests$year == i & arrests$released == "Yes")
demographics_Released[toString(i),"Employed"] <- sum(arrests$employed == "Yes" & arrests$year == i & arrests$released == "Yes")
demographics_Released[toString(i),"Unemployed"] <- sum(arrests$employed == "No" & arrests$year == i & arrests$released == "Yes")
demographics_Released[toString(i),"Citizen"] <- sum(arrests$citizen == "Yes" & arrests$year == i & arrests$released == "Yes")
demographics_Released[toString(i),"Non_Citizen"] <- sum(arrests$citizen == "No" & arrests$year == i & arrests$released == "Yes")
}
for (cn in colnames(demographics_Released))
{
demographics_Released["Total",cn] = sum(demographics_Released[1:6,cn])
}
demographics_Released
## White Black Male Female Employed Unemployed Citizen Non_Citizen
## 1997 292 84 346 30 298 78 142 234
## 1998 560 163 640 83 598 125 667 56
## 1999 733 182 831 84 750 165 830 85
## 2000 884 208 997 95 926 166 1013 79
## 2001 755 251 931 75 815 191 924 82
## 2002 155 67 209 13 181 41 199 23
## Total 3379 955 3954 380 3568 766 3775 559
Now we can compare the proportion of people released in each demographic group.
demographics_compare <- demographics_Released/demographics
demographics_compare
## White Black Male Female Employed Unemployed
## 1997 0.7913279 0.6829268 0.7521739 0.9375000 0.8347339 0.5777778
## 1998 0.8791209 0.6791667 0.8184143 0.8736842 0.8717201 0.6544503
## 1999 0.8633687 0.7280000 0.8318318 0.8400000 0.8700696 0.6962025
## 2000 0.8920283 0.7455197 0.8557940 0.9047619 0.8886756 0.7280702
## 2001 0.8388889 0.8070740 0.8327370 0.8064516 0.8578947 0.7318008
## 2002 0.8072917 0.7882353 0.8069498 0.7222222 0.8457944 0.6507937
## Total 0.8580498 0.7414596 0.8266778 0.8577878 0.8679153 0.6869955
## Citizen Non_Citizen
## 1997 0.7977528 0.7452229
## 1998 0.8453739 0.6363636
## 1999 0.8392315 0.7727273
## 2000 0.8762976 0.6929825
## 2001 0.8438356 0.7068966
## 2002 0.8024194 0.7931034
## Total 0.8473625 0.7250324
Let’s quickly compare the totals.
# attach() is not needed here because the data frame is indexed directly
barplot(as.matrix(demographics_compare["Total",]),beside=TRUE, col=c("darkgreen","lightblue"), ylim = c(0, 1))
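From the chart: Whites (85.8% versus 74.1% for Blacks), people who are employed (86.8% versus 68.7% for the unemployed), and citizens (84.7% versus 72.5% for non-citizens) are more likely to be released. The release rates for males (82.7%) and females (85.8%) are close.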
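A bar chart shows the size of a gap but not whether it could be due to chance. As a rough check that is not part of the original analysis, the sketch below applies prop.test() from base R to the White and Black totals from the tables above. The test treats the arrests as independent observations, which is an assumption. The names released, arrested, rates and test are used only in this example.

```r
# Totals copied from the "Total" rows of the two summary tables above
released <- c(White = 3379, Black = 955)
arrested <- c(White = 3938, Black = 1288)

# Proportion released in each group
rates <- released / arrested
round(rates, 3)

# Two-sample test for equality of proportions
test <- prop.test(released, arrested)
test$p.value
```

A small p-value means a gap this large would be unlikely if both groups had the same release rate. It does not explain the gap, because the groups may differ in other ways, such as employment or the number of previous records.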
Next, let’s summarize the number of people arrested by their number of previous records (the checks variable), by year.
previousArrested <- data.frame(matrix(ncol= 7, nrow = 7))
colnames(previousArrested) <- c(0:6)
rownames(previousArrested) <- c(1997,1998,1999,2000,2001,2002,"Total")
for (i in c(1997:2002))
{
for (j in c(0:6))
{
previousArrested[toString(i),j+1] <- sum(arrests$checks == j & arrests$year == i)
}
}
for (cn in colnames(previousArrested))
{
previousArrested["Total",cn] = sum(previousArrested[1:6,cn])
}
previousArrested
## 0 1 2 3 4 5 6
## 1997 164 60 86 77 84 21 0
## 1998 311 129 132 180 98 27 0
## 1999 408 170 158 182 155 22 4
## 2000 445 232 190 220 151 29 3
## 2001 424 218 181 243 120 23 2
## 2002 99 45 42 51 35 5 0
## Total 1851 854 789 953 643 127 9
Next, create the same summary for the people who were released.
previousArrested_Released <- data.frame(matrix(ncol= 7, nrow = 7))
colnames(previousArrested_Released) <- c(0:6)
rownames(previousArrested_Released) <- c(1997,1998,1999,2000,2001,2002,"Total")
for (i in c(1997:2002))
{
for (j in c(0:6))
{
previousArrested_Released[toString(i),j+1] <- sum(arrests$checks == j & arrests$year == i & arrests$released == "Yes")
}
}
for (cn in colnames(previousArrested_Released))
{
previousArrested_Released["Total",cn] = sum(previousArrested_Released[1:6,cn])
}
previousArrested_Released
## 0 1 2 3 4 5 6
## 1997 140 49 63 58 54 12 0
## 1998 279 109 115 138 63 19 0
## 1999 375 147 132 129 112 16 4
## 2000 420 213 172 176 95 14 2
## 2001 392 198 152 175 74 15 0
## 2002 86 41 35 36 22 2 0
## Total 1692 757 669 712 420 78 6
Now we can compare the proportion of people released for each number of previous records.
previousArrested_compare <- previousArrested_Released/previousArrested
previousArrested_compare
## 0 1 2 3 4 5
## 1997 0.8536585 0.8166667 0.7325581 0.7532468 0.6428571 0.5714286
## 1998 0.8971061 0.8449612 0.8712121 0.7666667 0.6428571 0.7037037
## 1999 0.9191176 0.8647059 0.8354430 0.7087912 0.7225806 0.7272727
## 2000 0.9438202 0.9181034 0.9052632 0.8000000 0.6291391 0.4827586
## 2001 0.9245283 0.9082569 0.8397790 0.7201646 0.6166667 0.6521739
## 2002 0.8686869 0.9111111 0.8333333 0.7058824 0.6285714 0.4000000
## Total 0.9141005 0.8864169 0.8479087 0.7471144 0.6531882 0.6141732
## 6
## 1997 NaN
## 1998 NaN
## 1999 1.0000000
## 2000 0.6666667
## 2001 0.0000000
## 2002 NaN
## Total 0.6666667
Let’s quickly compare the totals.
# attach() is not needed here because the data frame is indexed directly
barplot(as.matrix(previousArrested_compare["Total",]),beside=TRUE,ylim = c(0, 1), xlab = "Number of previous arrests")
abline(h = 0)
Only 9 people were arrested with 6 previous records, which is too few to estimate the likelihood of release for that group. For people with 0 to 5 previous records, the chance of being released decreases as the number of records increases, from 91.4% with no records to 61.4% with 5 records.
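One way to see why 9 people are too few is to attach a confidence interval to each release rate. The sketch below is an addition to the original analysis. It uses binom.test() from base R on the "Total" rows of the two tables above and assumes the arrests are independent. The names released, arrested, ci and result are used only in this example.

```r
# "Total" rows copied from the two tables above, for 0 to 6 previous records
released <- c(1692, 757, 669, 712, 420, 78, 6)
arrested <- c(1851, 854, 789, 953, 643, 127, 9)

# Exact 95% binomial confidence interval for each group;
# mapply() returns a 2 x 7 matrix, so t() puts one group per row
ci <- t(mapply(function(x, n) binom.test(x, n)$conf.int, released, arrested))

result <- data.frame(
  checks = 0:6,
  rate   = round(released / arrested, 3),
  lower  = round(ci[, 1], 3),
  upper  = round(ci[, 2], 3)
)
result
```

The interval for the group with 6 records is far wider than the intervals for the larger groups, which is why its release rate is not a reliable estimate.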
Conclusion:
Through this analysis, we find that arrests for possession of marijuana are most common among people around age 20. People over 55 are rarely arrested for it.
The number of people arrested for possession of marijuana increased each year from 1997 to 2000, began to decrease in 2001, and then dropped dramatically in 2002. The drop may indicate that a great deal of effort went into stopping people from possessing marijuana. However, it is also possible that the local police simply chose to arrest fewer people. The data alone cannot tell these explanations apart.
Among the people who were arrested, Whites were released at a higher rate than Blacks, employed people at a higher rate than unemployed people, and citizens at a higher rate than non-citizens. The release rates for males and females were about the same. People with more records of previous arrests were less likely to be released.