SPS_Bridge_R

This open source dataset was downloaded from http://vincentarelbundock.github.io/Rdatasets/
Description:
Data on police treatment of individuals arrested in Toronto for simple possession of small quantities of marijuana. The data are part of a larger data set featured in a series of articles in the Toronto Star newspaper.

Format:
A data frame with 5226 observations on the following 8 variables.
released - Whether or not the arrestee was released with a summons; a factor with levels: No; Yes.
colour - The arrestee’s race; a factor with levels: Black; White.
year - 1997 through 2002; a numeric vector.
age - in years; a numeric vector.
sex - a factor with levels: Female; Male.
employed - a factor with levels: No; Yes.
citizen - a factor with levels: No; Yes.
checks - Number of police data bases (of previous arrests, previous convictions, parole status, etc. – 6 in all) on which the arrestee’s name appeared; a numeric vector

Questions for analysis:
1. At which ages are people more likely to be arrested for possession of marijuana?
2. Is the total number of people arrested for possession of marijuana increasing or decreasing each year?
3. Do an individual’s demographics (gender, race, employment, citizen status) affect the person’s likelihood of being released?
4. Does an individual’s previous record of arrests affect the person’s likelihood of being released?

Import the data set “arrests” from the working directory.

arrests <- read.csv(file="Arrests.csv", header=TRUE, sep=",")

Alternatively, place the original .csv file in a GitHub repository and have R read it from the raw link.

# read.csv() is part of base R, so no additional package needs to be loaded here
gitHubArrests <- read.csv("https://raw.githubusercontent.com/ezaccountz/SPS_Bridge_R_HW3/master/Arrests.csv")

First, let’s take a quick look at the summary of the age variable.

summary(arrests$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.00   18.00   21.00   23.85   27.00   66.00

boxplot(arrests$age)

The mean age is 23.85, and the box plot confirms that the middle half of the arrestees (the interquartile range) are between 18 and 27 years old.
Next, let’s look at the distribution of the age variable in more detail.

hist(x = arrests$age,main = "Arrests for Marijuana Possession", xlab="Age", ylab="Number of Arrests")

We can conclude that people around age 20 account for the largest share of arrests for marijuana possession.
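As a rough numeric complement to the histogram, ages can be binned with `cut()` and tabulated. The sketch below is only an illustration: `toy_arrests`, the break points, and the labels are hypothetical choices, and the values are made up rather than taken from Arrests.csv. With the real data, `arrests$age` would take the place of `toy_arrests$age`.

```r
# Hypothetical toy data for illustration only; NOT taken from Arrests.csv.
toy_arrests <- data.frame(age = c(15, 18, 19, 21, 22, 24, 27, 33, 41, 58))

# Assumed break points; right = FALSE makes each bin closed on the left, e.g. [18, 25).
age_breaks <- c(0, 18, 25, 35, 55, Inf)
age_labels <- c("<18", "18-24", "25-34", "35-54", "55+")
age_group <- cut(toy_arrests$age, breaks = age_breaks,
                 labels = age_labels, right = FALSE)

# Count arrests per age group, then convert the counts to shares of the total.
age_counts <- table(age_group)
age_share <- prop.table(age_counts)

print(age_counts)
print(round(age_share, 2))
```

The group with the largest share is the numeric counterpart of the tallest region of the histogram.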

Next, let’s check the number of people arrested each year.

arrests_years <- data.frame(matrix(ncol= 6, nrow = 1))
colnames(arrests_years) <- c(1997:2002)
rownames(arrests_years) <- c("Number_of_Arrested")
for (i in c(1997:2002))
{
  arrests_years["Number_of_Arrested",toString(i)] <- length(arrests$year[arrests$year == i])
}
arrests_years

##                    1997 1998 1999 2000 2001 2002
## Number_of_Arrested  492  877 1099 1270 1211  277

The number of arrests increases from 1997 to 2000, starts to decrease in 2001, and then drops dramatically in 2002.
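To put numbers on that trend, the year-over-year change can be computed directly from the counts printed above. This is a minimal sketch: the counts are copied by hand from the table, and the names `arrest_counts`, `year_change`, and `pct_change` are introduced here for illustration. With the data frame loaded, `table(arrests$year)` would give the same counts without the loop.

```r
# Yearly arrest counts copied from the output table above.
arrest_counts <- c("1997" = 492, "1998" = 877, "1999" = 1099,
                   "2000" = 1270, "2001" = 1211, "2002" = 277)

# Absolute change from one year to the next (named by the later year).
year_change <- diff(arrest_counts)

# Percentage change relative to the previous year's count.
pct_change <- 100 * year_change / head(arrest_counts, -1)

print(year_change)
print(round(pct_change, 1))
```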

# Pass the years as numbers and the counts as a plain numeric vector; attach() is not needed
plot(x = 1997:2002, y = unlist(arrests_years["Number_of_Arrested",]),type = "b",main = "Arrests for Marijuana Possession", xlab="Year", ylab="Number of Arrests", las = 2)

Next, let’s summarize the number of people arrested each year by demographic group.

demographics <- data.frame(matrix(ncol= 8, nrow = 7))
colnames(demographics) <- c("White", "Black", "Male", "Female", "Employed", "Unemployed", "Citizen","Non_Citizen")
rownames(demographics) <- c(1997,1998,1999,2000,2001,2002,"Total")
for (i in c(1997:2002))
{
  demographics[toString(i),"White"] <- length(arrests$colour[arrests$colour == "White" & arrests$year == i])
  demographics[toString(i),"Black"] <- length(arrests$colour[arrests$colour == "Black" & arrests$year == i])
  demographics[toString(i),"Male"] <- length(arrests$colour[arrests$sex == "Male" & arrests$year == i])
  demographics[toString(i),"Female"] <- length(arrests$colour[arrests$sex == "Female" & arrests$year == i])
  demographics[toString(i),"Employed"] <- length(arrests$colour[arrests$employed == "Yes" & arrests$year == i])
  demographics[toString(i),"Unemployed"] <- length(arrests$colour[arrests$employed == "No" & arrests$year == i])
  demographics[toString(i),"Citizen"] <- length(arrests$colour[arrests$citizen == "Yes" & arrests$year == i])
  demographics[toString(i),"Non_Citizen"] <- length(arrests$colour[arrests$citizen == "No" & arrests$year == i])
}
for (cn in colnames(demographics))
{
  demographics["Total",cn] = sum(demographics[1:6,cn])
}
demographics

##       White Black Male Female Employed Unemployed Citizen Non_Citizen
## 1997    369   123  460     32      357        135     178         314
## 1998    637   240  782     95      686        191     789          88
## 1999    849   250  999    100      862        237     989         110
## 2000    991   279 1165    105     1042        228    1156         114
## 2001    900   311 1118     93      950        261    1095         116
## 2002    192    85  259     18      214         63     248          29
## Total  3938  1288 4783    443     4111       1115    4455         771
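The same kind of year-by-group summary can be built more compactly with `table()`, which cross-tabulates two variables in one call. The sketch below assumes a small made-up data frame (`toy_arrests`, hypothetical values that are not from Arrests.csv) so that it runs on its own; the column names follow the data set's `year`, `colour`, and `sex` variables.

```r
# Hypothetical toy data for illustration only; NOT taken from Arrests.csv.
toy_arrests <- data.frame(
  year   = c(1997, 1997, 1997, 1998, 1998, 1999),
  colour = c("White", "Black", "White", "White", "White", "Black"),
  sex    = c("Male", "Male", "Female", "Male", "Male", "Male")
)

# Cross-tabulate year against each demographic variable and join the columns.
counts <- cbind(
  table(toy_arrests$year, toy_arrests$colour),
  table(toy_arrests$year, toy_arrests$sex)
)

# Append a "Total" row holding the column sums.
counts <- rbind(counts, Total = colSums(counts))
print(counts)
```

Note that `table()` orders the levels alphabetically, so the columns come out as Black, White, Female, Male rather than in the order used in the loop above.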

Create the same summary for the people who were released.

demographics_Released <- data.frame(matrix(ncol= 8, nrow = 7))
colnames(demographics_Released) <- c("White", "Black", "Male", "Female", "Employed", "Unemployed", "Citizen","Non_Citizen")
rownames(demographics_Released) <- c(1997,1998,1999,2000,2001,2002,"Total")
for (i in c(1997:2002))
{
  demographics_Released[toString(i),"White"] <- length(arrests$colour[arrests$colour == "White" & arrests$year == i & arrests$released == "Yes"])
  demographics_Released[toString(i),"Black"] <- length(arrests$colour[arrests$colour == "Black" & arrests$year == i & arrests$released == "Yes"])
  demographics_Released[toString(i),"Male"] <- length(arrests$colour[arrests$sex == "Male" & arrests$year == i & arrests$released == "Yes"])
  demographics_Released[toString(i),"Female"] <- length(arrests$colour[arrests$sex == "Female" & arrests$year == i & arrests$released == "Yes"])
  demographics_Released[toString(i),"Employed"] <- length(arrests$colour[arrests$employed == "Yes" & arrests$year == i & arrests$released == "Yes"])
  demographics_Released[toString(i),"Unemployed"] <- length(arrests$colour[arrests$employed == "No" & arrests$year == i & arrests$released == "Yes"])
  demographics_Released[toString(i),"Citizen"] <- length(arrests$colour[arrests$citizen == "Yes" & arrests$year == i & arrests$released == "Yes"])
  demographics_Released[toString(i),"Non_Citizen"] <- length(arrests$colour[arrests$citizen == "No" & arrests$year == i & arrests$released == "Yes"])
}

for (cn in colnames(demographics_Released))
{
  demographics_Released["Total",cn] = sum(demographics_Released[1:6,cn])
}
demographics_Released

##       White Black Male Female Employed Unemployed Citizen Non_Citizen
## 1997    292    84  346     30      298         78     142         234
## 1998    560   163  640     83      598        125     667          56
## 1999    733   182  831     84      750        165     830          85
## 2000    884   208  997     95      926        166    1013          79
## 2001    755   251  931     75      815        191     924          82
## 2002    155    67  209     13      181         41     199          23
## Total  3379   955 3954    380     3568        766    3775         559

Now we can compare the proportion of people released within each demographic group.

demographics_compare <- demographics_Released/demographics
demographics_compare

##           White     Black      Male    Female  Employed Unemployed
## 1997  0.7913279 0.6829268 0.7521739 0.9375000 0.8347339  0.5777778
## 1998  0.8791209 0.6791667 0.8184143 0.8736842 0.8717201  0.6544503
## 1999  0.8633687 0.7280000 0.8318318 0.8400000 0.8700696  0.6962025
## 2000  0.8920283 0.7455197 0.8557940 0.9047619 0.8886756  0.7280702
## 2001  0.8388889 0.8070740 0.8327370 0.8064516 0.8578947  0.7318008
## 2002  0.8072917 0.7882353 0.8069498 0.7222222 0.8457944  0.6507937
## Total 0.8580498 0.7414596 0.8266778 0.8577878 0.8679153  0.6869955
##         Citizen Non_Citizen
## 1997  0.7977528   0.7452229
## 1998  0.8453739   0.6363636
## 1999  0.8392315   0.7727273
## 2000  0.8762976   0.6929825
## 2001  0.8438356   0.7068966
## 2002  0.8024194   0.7931034
## Total 0.8473625   0.7250324
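Each entry in the table above is a count of released arrestees divided by the count of all arrestees in the same group. An equivalent way to get such a rate, sketched here under the assumption of a small made-up data frame (`toy_arrests`, hypothetical values), is to take the mean of a logical vector within each group using `tapply()`.

```r
# Hypothetical toy data for illustration only; NOT taken from Arrests.csv.
toy_arrests <- data.frame(
  colour   = c("White", "White", "White", "White",
               "Black", "Black", "Black", "Black"),
  released = c("Yes", "Yes", "Yes", "No",
               "Yes", "Yes", "No", "No")
)

# The mean of a TRUE/FALSE vector is the proportion of TRUE values,
# so this yields the share of arrestees released within each group.
release_rate <- tapply(toy_arrests$released == "Yes", toy_arrests$colour, mean)
print(release_rate)
```

With the real data, replacing `toy_arrests` by `arrests` should reproduce the White and Black entries of the "Total" row.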

Let’s quickly compare the totals.

# attach() is not needed because the data frame is indexed directly
barplot(as.matrix(demographics_compare["Total",]),beside=TRUE, col=c("darkgreen","lightblue"), ylim = c(0, 1))

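From the chart: Whites (about 86% vs. 74% for Blacks), people who are employed (about 87% vs. 69% for the unemployed), and citizens (about 85% vs. 73% for non-citizens) are more likely to be released. The rates for males (about 83%) and females (about 86%) are close.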
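One possible way to check whether a gap of this size could plausibly be due to chance is a two-sample test of proportions. This is not part of the original analysis; it is only a sketch that feeds the White and Black totals from the tables above into base R's `prop.test()`. The variable names are introduced here for illustration.

```r
# Totals copied from the "Total" rows of the two tables above.
released_totals <- c(White = 3379, Black = 955)
arrested_totals <- c(White = 3938, Black = 1288)

# Two-sample test for equality of proportions (base R, stats package).
colour_test <- prop.test(x = released_totals, n = arrested_totals)

# Estimated release proportions for the two groups, and the test's p-value.
print(colour_test$estimate)
print(colour_test$p.value)
```

A test like this only addresses the raw difference between two groups; it does not account for other variables such as employment or prior police records.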

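Next, let’s summarize the people arrested by their number of previous records. The checks variable (the number of police databases on which the arrestee’s name appeared) is used here as a measure of previous arrests.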

previousArrested <- data.frame(matrix(ncol= 7, nrow = 7))
colnames(previousArrested) <- c(0:6)
rownames(previousArrested) <- c(1997,1998,1999,2000,2001,2002,"Total")
for (i in c(1997:2002))
{
  for (j in c(0:6))
  {
    previousArrested[toString(i),j+1] <- length(arrests$colour[arrests$checks == j & arrests$year == i])
  }
}


for (cn in colnames(previousArrested))
{
  previousArrested["Total",cn] = sum(previousArrested[1:6,cn])
}
previousArrested

##          0   1   2   3   4   5 6
## 1997   164  60  86  77  84  21 0
## 1998   311 129 132 180  98  27 0
## 1999   408 170 158 182 155  22 4
## 2000   445 232 190 220 151  29 3
## 2001   424 218 181 243 120  23 2
## 2002    99  45  42  51  35   5 0
## Total 1851 854 789 953 643 127 9

Create the same summary for the people who were released.

previousArrested_Released <- data.frame(matrix(ncol= 7, nrow = 7))
colnames(previousArrested_Released) <- c(0:6)
rownames(previousArrested_Released) <- c(1997,1998,1999,2000,2001,2002,"Total")
for (i in c(1997:2002))
{
  for (j in c(0:6))
  {
    previousArrested_Released[toString(i),j+1] <- length(arrests$colour[arrests$checks == j & arrests$year == i & arrests$released == "Yes"])
  }
}
for (cn in colnames(previousArrested_Released))
{
  previousArrested_Released["Total",cn] = sum(previousArrested_Released[1:6,cn])
}
previousArrested_Released

##          0   1   2   3   4  5 6
## 1997   140  49  63  58  54 12 0
## 1998   279 109 115 138  63 19 0
## 1999   375 147 132 129 112 16 4
## 2000   420 213 172 176  95 14 2
## 2001   392 198 152 175  74 15 0
## 2002    86  41  35  36  22  2 0
## Total 1692 757 669 712 420 78 6

Now we can compare the proportion of people released according to their number of previous records.

previousArrested_compare <- previousArrested_Released/previousArrested
previousArrested_compare

##               0         1         2         3         4         5
## 1997  0.8536585 0.8166667 0.7325581 0.7532468 0.6428571 0.5714286
## 1998  0.8971061 0.8449612 0.8712121 0.7666667 0.6428571 0.7037037
## 1999  0.9191176 0.8647059 0.8354430 0.7087912 0.7225806 0.7272727
## 2000  0.9438202 0.9181034 0.9052632 0.8000000 0.6291391 0.4827586
## 2001  0.9245283 0.9082569 0.8397790 0.7201646 0.6166667 0.6521739
## 2002  0.8686869 0.9111111 0.8333333 0.7058824 0.6285714 0.4000000
## Total 0.9141005 0.8864169 0.8479087 0.7471144 0.6531882 0.6141732
##               6
## 1997        NaN
## 1998        NaN
## 1999  1.0000000
## 2000  0.6666667
## 2001  0.0000000
## 2002        NaN
## Total 0.6666667

Let’s quickly compare the totals.

# attach() is not needed because the data frame is indexed directly
barplot(as.matrix(previousArrested_compare["Total",]),beside=TRUE,ylim = c(0, 1), xlab = "Number of previous arrests")
abline(h = 0)

Only 9 of the people arrested have 6 previous records, which is too few to estimate the likelihood of release for that group. For people with 0 to 5 previous records, the chance of being released decreases as the number of records increases.
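The small-group problem can be handled explicitly by hiding rates that rest on too few arrests. The sketch below uses the "Total" rows from the two tables above; the cut-off of 30 (`min_group_size`) is a hypothetical choice made for illustration, not a value from the original analysis. The same issue explains the NaN entries in the yearly table: a year with 0 arrests in a group gives 0/0, which R reports as NaN.

```r
# "Total" rows copied from the two tables above, indexed by number of records.
arrested_totals <- c("0" = 1851, "1" = 854, "2" = 789, "3" = 953,
                     "4" = 643, "5" = 127, "6" = 9)
released_totals <- c("0" = 1692, "1" = 757, "2" = 669, "3" = 712,
                     "4" = 420, "5" = 78, "6" = 6)

release_rate <- released_totals / arrested_totals

# Hypothetical cut-off: report a rate only when the group has at least 30 arrests.
min_group_size <- 30
release_rate_masked <- ifelse(arrested_totals >= min_group_size, release_rate, NA)

print(round(release_rate_masked, 3))
```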

Conclusion:
Through the analysis, we find that people around age 20 are the most likely to be arrested for possession of marijuana, while people over age 55 rarely are. The total number of people arrested for possession of marijuana increases each year from 1997 to 2000, starts to decrease in 2001, and then drops dramatically in 2002. This may indicate that efforts to stop people from possessing marijuana were having an effect; however, it could also be that the local police simply chose to arrest fewer people. Among the people who are arrested, Whites are released at a higher rate than Blacks, people who are employed are released at a higher rate than people who are unemployed, and citizens are released at a higher rate than non-citizens. The release rates for males and females are about the same. People who have more records of previous arrests have a lower chance of being released.

SPS_Bridge_R_HW3

Euclid Zhang

8/4/2019