For this project I will explore the dataset “Sex Offenders: City of Chicago” provided by the city of Chicago state government. The dataset has 10 variables:
LAST: chr - last name of offender FIRST: chr - first name of offender BLOCK: chr - block offender lives on GENDER: chr - gender of offender RACE: chr - race of offender BIRTH.DATE: chr - birth date of offender AGE: num - age of offender HEIGHT: int - heigt of offender WEIGHT: int - weight of offender VICTIM.MINOR: chr - was the offender’s victim(s) a minor Y/N
I will explore a few questions, including:
what are the demographics of the offenders (age, race, gender)? age of the victims? location of the offenders?
City of Chicago, (2021, July 9). Sex Offenders: City of Chicago. Chicago Data Portal. https://data.cityofchicago.org/Public-Safety/Sex-Offenders/vc9r-bqvy.
#bring in data set
sexoff <- read.csv("sex-offenders.csv")
head(sexoff)
## LAST FIRST BLOCK GENDER RACE BIRTH.DATE AGE HEIGHT WEIGHT
## 1 MCGINNIS DELL 0000X E 100TH PL MALE BLACK 07/26/1982 37 505 141
## 2 WHITE CHARLES 0000X E 100TH ST MALE BLACK 05/02/1961 58 509 180
## 3 SIMON GERA 0000X E 110TH PL MALE BLACK 05/21/1952 67 504 110
## 4 WARD RICHARD 0000X E 110TH PL MALE BLACK 07/23/1949 70 506 190
## 5 WORTHON SEBASTIAN 0000X E 119TH PL MALE BLACK 10/22/1982 37 600 180
## 6 SCOTT MICHAEL 0000X E 91ST ST MALE BLACK 05/22/1957 62 506 170
## VICTIM.MINOR
## 1 Y
## 2 Y
## 3 Y
## 4 Y
## 5 Y
## 6 N
## bring in a few libraries
library(stats, dplyr, tidyr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.2 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2)
The data is clean enough to make simple graphs regarding demographics of the offender and age of victims.
## portion of offenders by race
race_table <- table(sexoff$RACE)
prop.table(race_table)
##
## ASIAN/PACIFIC ISLANDER BLACK BLACK HISPANIC
## 0.014814815 0.612037037 0.003703704
## UNKNOWN WHITE WHITE HISPANIC
## 0.001851852 0.179629630 0.187962963
## creating a count_race column so I can order the bars
count_race <- sexoff %>%
group_by(RACE) %>%
mutate(count_race = n())
##bar graph
## removing the legend for space
ggplot(count_race, aes(x = reorder(RACE, count_race), fill = RACE)) +
geom_bar(stat="count") +
scale_x_discrete(guide = guide_axis(n.dodge=2)) +
xlab("RACE")+
theme(legend.position = "none") +
ggtitle("Number of Offenders by Race")
The majority of sex offenders in Chicago are Black, followed by White Hispanic, White, Asian/Pacific Islander, Black Hispanic, and unknown. I’m going to change
## portion of offenders that are male
genders_table <- table(sexoff$GENDER)
prop.table(genders_table)
##
## FEMALE MALE
## 0.01851852 0.98148148
## bar graph for number of offenders by gender
ggplot(sexoff) +
stat_count(mapping = aes(x = GENDER, fill = GENDER)) +
scale_fill_manual(values=c("#56B4E9", "#E69F00")) +
xlab(NULL) +
ggtitle("Number of Offenders by Gender")
Sex offenders are predominantly male.
## portion of victims that are minors
minors <- table(sexoff$VICTIM.MINOR)
prop.table(minors)
##
## N Y
## 0.2833333 0.7166667
##plotting number of minor victims with bar graph, and using different colors
ggplot(sexoff) +
stat_count(mapping = aes(x = VICTIM.MINOR, fill = VICTIM.MINOR)) +
scale_fill_manual(values=c("#9977FF", "#CCFFEE")) +
xlab("VICTIM IS A MINOR?")+
theme(legend.title = element_blank()) +
ggtitle("Number of Minor Victims")
Victim’s are primarily minors.
library(tibble)
sex_off <- as_tibble(sexoff)
##boxplot to show the median age, removing the legend, changing the color
ggplot(sex_off, aes(x = RACE, y = AGE, fill = RACE)) +
geom_boxplot() +
scale_x_discrete(guide = guide_axis(n.dodge=2)) +
theme(legend.position="none") +
ggtitle("Median Age of Offender by Race") +
scale_fill_brewer(type = "seq",palette = 10,direction = 1, aesthetics = "fill")
Median age of offenders appear to be between 45 and 62 when broken down by race. Looks like there are two outliers. I’ll explore if there are outliers for all of the ages combined as one group.
## I will explore the age variable to do a outlier analysis
summary(sexoff)
## LAST FIRST BLOCK GENDER
## Length:1080 Length:1080 Length:1080 Length:1080
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## RACE BIRTH.DATE AGE HEIGHT
## Length:1080 Length:1080 Min. :19.00 Min. : 60.0
## Class :character Class :character 1st Qu.:39.00 1st Qu.:507.0
## Mode :character Mode :character Median :49.00 Median :509.0
## Mean :49.36 Mean :531.9
## 3rd Qu.:58.00 3rd Qu.:600.0
## Max. :96.00 Max. :608.0
## WEIGHT VICTIM.MINOR
## Min. :100.0 Length:1080
## 1st Qu.:160.0 Class :character
## Median :185.0 Mode :character
## Mean :190.2
## 3rd Qu.:210.0
## Max. :607.0
Mediand age for all sex offenders is 49.
## looking at max and min ages
min(sexoff$AGE)
## [1] 19
max(sexoff$AGE)
## [1] 96
#histogram of all ages
hist(sexoff$AGE,
xlab = "Age",
main = "Histogram of Age")
So the outlier for White Hispanic is an outlier for that group only, but not for all ages. The outlier for black however, is still an outlier for all ages.
Going to filter to for ages above 80 to pinpoint the two outliers.
## using a fitler to look at the top ages since my histogram and box plot show these as potential outliers
age_outlier <- sexoff %>%
filter(sexoff$AGE > 80)
age_outlier
## LAST FIRST BLOCK GENDER RACE BIRTH.DATE AGE
## 1 CORY HARDING 015XX W 68TH ST MALE BLACK 09/04/1923 96
## 2 WILLIAMS DONALD 029XX W BRYN MAWR AVE MALE WHITE 03/08/1935 84
## 3 MIRTIA ELIGHA 041XX S WENTWORTH AVE MALE BLACK 03/31/1938 81
## 4 MOODY OSEVELL 060XX S WASHTENAW AVE MALE BLACK 04/17/1935 84
## 5 CUEVAS OMAR 061XX W THORNDALE AVE MALE WHITE HISPANIC 10/06/1935 84
## 6 SHIPP SAM 086XX S MAY ST MALE BLACK 04/19/1937 82
## 7 BRYANT MARY 099XX S ABERDEEN ST FEMALE BLACK 09/21/1938 81
## HEIGHT WEIGHT VICTIM.MINOR
## 1 508 147 Y
## 2 507 225 Y
## 3 508 160 Y
## 4 507 160 Y
## 5 600 154 Y
## 6 601 235 Y
## 7 505 155 Y
## finding the percentiles
lowerage <- quantile(sexoff$AGE, 0.01)
lowerage
## 1%
## 27
upperage <- quantile(sexoff$AGE, 0.99)
upperage
## 99%
## 79
According to the quantile function using interval between 1% and 99% all ages before 27 and above 79 can be considered outliers. Hence almost all registered sex offenders living in Chicago are between the age of 27 and 79.
Exploring if there is a connection between race of offender and whether the victim is a minor.
##creating a new df
sexoff2 <- sexoff%>%
group_by(RACE, VICTIM.MINOR) %>%
summarize(MINOR_COUNT=n())
## `summarise()` has grouped output by 'RACE'. You can override using the `.groups` argument.
head(sexoff2)
## # A tibble: 6 x 3
## # Groups: RACE [3]
## RACE VICTIM.MINOR MINOR_COUNT
## <chr> <chr> <int>
## 1 ASIAN/PACIFIC ISLANDER N 1
## 2 ASIAN/PACIFIC ISLANDER Y 15
## 3 BLACK N 240
## 4 BLACK Y 421
## 5 BLACK HISPANIC N 1
## 6 BLACK HISPANIC Y 3
## I am trying to use jcolors package without much success so far
devtools::install_github("jaredhuling/jcolors")
## Skipping install of 'jcolors' from a github remote, the SHA1 (f2007550) has not changed since last install.
## Use `force = TRUE` to force installation
library(jcolors)
jcolors('default')
## kelly_green rich_electric_blue maximum_red majorelle_blue
## "#29BF12" "#00A5CF" "#DE1A1A" "#574AE2"
## fluorescent_orange
## "#FFBF00"
ggplot(sexoff2, aes(x=RACE, y = MINOR_COUNT, fill = VICTIM.MINOR)) +
geom_bar(stat='identity', position = "dodge") +
scale_x_discrete(guide = guide_axis(n.dodge=2)) +
scale_y_continuous(trans = 'log2')
This chart shows a higher nubmer of victims were minors when the offender was black. There are high instances for this as well with White and White Hispanice offendors.
## re-reading data in and chaning certain categorical variables to factors.
sexoff3 <- within(read.csv("sex-offenders.csv"), {
RACE <- as.factor(RACE)
GENDER <- as.factor(GENDER)
VICTIM.MINOR <- as.factor(VICTIM.MINOR)
})
##creating table that shows if victim was a minor grouped by the race of offender
table(sexoff3$RACE, sexoff3$VICTIM.MINOR)
##
## N Y
## ASIAN/PACIFIC ISLANDER 1 15
## BLACK 240 421
## BLACK HISPANIC 1 3
## UNKNOWN 0 2
## WHITE 31 163
## WHITE HISPANIC 33 170
## chi test for minor and race relationship.
chisq.test(sexoff3$RACE, sexoff3$VICTIM.MINOR, correct = FALSE)
## Warning in chisq.test(sexoff3$RACE, sexoff3$VICTIM.MINOR, correct = FALSE): Chi-
## squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: sexoff3$RACE and sexoff3$VICTIM.MINOR
## X-squared = 54.524, df = 5, p-value = 1.636e-10
From this data set and the low p-value, there is a relationship between whether the victim is a minor and the race of the offender.
## chi test for minor and age relationship.
chisq.test(sexoff3$AGE, sexoff3$VICTIM.MINOR, correct = FALSE)
## Warning in chisq.test(sexoff3$AGE, sexoff3$VICTIM.MINOR, correct = FALSE): Chi-
## squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: sexoff3$AGE and sexoff3$VICTIM.MINOR
## X-squared = 76.102, df = 60, p-value = 0.07847
Given the higher than 0.05 p-value, there is most likely not a relationship between age of the offender and whether the victim is a minor.
Exploring geographic data.
## breaking up block into four variables so I can isolate quadrant
sexoff4 <- sexoff %>% separate(BLOCK, c("HouseNumber", "Quadrant", "Street", "StreetType"), " ", extra = "merge")
##grouping by quandrant
sexoff4 %>%
group_by(Quadrant) %>%
summarize(quandrant_count = n())
## # A tibble: 4 x 2
## Quadrant quandrant_count
## <chr> <int>
## 1 E 71
## 2 N 179
## 3 S 538
## 4 W 292
quadrant_count2 <- table(sexoff4$Quadrant)
prop.table(quadrant_count2)
##
## E N S W
## 0.06574074 0.16574074 0.49814815 0.27037037
There are a lot more sex offenders living in southside Chicago. I don’t actually know if ‘S’ means southside of Chicago or if its just a street naming convention that does not imply quandrant so this data may not tell us much.
Exploring if there are multiple sex offenders living on the same block.
##grouping by BLOCK
sexoff5 <- sexoff3 %>%
group_by(BLOCK) %>%
summarize(block_count = n())
sexoff5
## # A tibble: 902 x 2
## BLOCK block_count
## <chr> <int>
## 1 0000X E 100TH PL 1
## 2 0000X E 100TH ST 1
## 3 0000X E 110TH PL 2
## 4 0000X E 119TH PL 1
## 5 0000X E 91ST ST 1
## 6 0000X N HOYNE AVE 1
## 7 0000X W 110TH ST 1
## 8 0000X W 112TH PL 3
## 9 0000X W 112TH ST 1
## 10 0000X W 113TH PL 1
## # … with 892 more rows
## summarizing by blocks to show how many sex offenders live at each one.
sexoff5 %>%
group_by(block_count) %>%
summarize(BLOCK = n())
## # A tibble: 8 x 2
## block_count BLOCK
## <int> <int>
## 1 1 793
## 2 2 75
## 3 3 16
## 4 4 8
## 5 5 6
## 6 6 2
## 7 7 1
## 8 8 1
I’m now going to create a treemap to show which blocks have more than two sex offenders living on it. ’m filtering out two and below to make it more managable. Treemap seems the most appropriate since I do not have time data, other than date of birth.
##without a filter I'd have too many blocks for my treemap so reducing to 2 or more sex offenders.
sexoff5 %>%
filter(block_count > 1)
## # A tibble: 109 x 2
## BLOCK block_count
## <chr> <int>
## 1 0000X E 110TH PL 2
## 2 0000X W 112TH PL 3
## 3 003XX E 120TH PL 3
## 4 004XX S CLARK ST 2
## 5 006XX N DRAKE AVE 2
## 6 007XX W GRAND AVE 2
## 7 008XX W 51ST PL 4
## 8 010XX W BALMORAL AVE 2
## 9 012XX E 71ST PL 2
## 10 012XX E 79TH ST 5
## # … with 99 more rows
##bringing in treemap library
library(treemap)
##creating treemap with which uses block_count for size and color.
treemap(sexoff5, index="BLOCK", vSize="block_count",
vColor="block_count", type="value",
title = "Chicago Registered Sex Offenders",
title.legend = "Number of Registered Sex Offenders By Block",
palette="PuOr")
Not very useful looking map and I don’t like the color. I’ll filter out count of 2 and pick a different color.
##bringin viridis colors
library(viridis)
## Loading required package: viridisLite
##filtering out counts of 2
sexoff6 <- sexoff5 %>%
filter(block_count > 2)
##creating treemap
treemap(sexoff6, index="BLOCK", vSize="block_count",
vColor="block_count", type="value",
title = "Chicago Registered Sex Offenders",
title.legend = "Number of Registered Sex Offenders",
palette="Set1")
This is better, but I cannot get the color viridis right and I want more control over the text format so switching to geom_treemap using treemapify library.
##bring in treemapify library
library(treemapify)
##create geom_treemap. Making the font white, border black, left justifying font, and other font formats.
ggplot(sexoff6, aes(area=block_count, fill = block_count, label = BLOCK)) +
geom_treemap(colour="black") +
geom_treemap_text(colour = "white", place = "left", size = 1, grow = TRUE, reflow = TRUE) +
scale_fill_viridis_c() +
labs(title="Chicago Registered Sex Offenders by Block", fill="# of Offenders")
So now we have a treemap which shows the blocks in Chicago that have more than two sex offenders living there. This could be useful for a parent who is moving and wants to move to a block with two or less offenders. Conversely, sex offenders who are having trouble finding a place to rent or buy due to their criminal history could also use this map to find potential blocks more receptive to a sex offender moving to that location.
Brief Essay:
One reason I was drawn to this dataset is due to my line of work in homeless services. I work for a city government agency similar in size to Chicago, and one unique challenge we face is housing homeless clients whom are registered sex offenders. A crucial tenet one should have when working in this industry is that everyone deserves a home, even registered sex offenders. In fact, many social workers would attest that the best way to avoid a homeless sex offender is to stabilize their housing. Individuals who are unhoused are much more likely to not take medication or receive counseling, both of which are key factors in reducing instances of relapse among the sex offender community, hence, a housed sex offender is less likely to re-offend. The crux of the challenge lies with finding both a landlord and a neighborhood that will rent a unit to a sex offender, as most are understandably reluctant to do so.
Given that at least 21 states and 400 municipalities have laws restricting where a sex offenders can live, where should they go? (Sandra Norman-Eady) To ensure lower rates of recidivism you want to both keep sex offenders away from potential targets, but also not isolate them to the point of homelessness or retreating to an underground status which makes them difficult to track.
The above treemap represents blocks in Chicago that contain at least three sex offenders as residences. Keeping track of where sex offenders are being allowed to rent or buy in such a way could direct sex offenders to stabilize housing and provide the rest of society needed information that will help inform where they wish to move, or not move, their families to.
After completing this treemap there are two areas that warrant further exploration if more time were available. First, I could not create an interactivity layer for this treemap. Ideally, a tooltip would allow for more blocks to be show on the treemap that would leave little room for text that a tool tip or hoover widget could compensate for. Howver, I have included a link to a tableu treemap visualization I created that does include tooltips (https://public.tableau.com/app/profile/kris.sutton/viz/KSProject2ChicagoSexOffenders/Treemap). Second, the data contained some location information that would lend itself to a GIS map if I were able to find the needed GIS files that could accommodate its creation. Also, next time I would include historical data to explore any associated trends.
Despite the heavy material, I had a lot of fun creating these visualizations and I am happy with the results.
Sandra Norman-Eady, C. A. (n.d.). SEX OFFENDERS’ RESIDENCY RESTRICTIONS. https://www.cga.ct.gov/2007/rpt/2007-r-0380.htm.