MITx: 15.071x The Analytics Edge - VISUALIZING ATTRIBUTES OF PAROLE VIOLATORS

Introduction

In the crime lecture, we saw how we can use heatmaps to give a 2-dimensional representation of 3-dimensional data: we made heatmaps of crime counts by time of the day and day of the week. In this problem, we'll learn how to use histograms to show counts by one variable, and then how to visualize 3 dimensions by creating multiple histograms.

The data set is parole.csv. Its variables are:

- male = 1 if the parolee is male, 0 if female
- race = 1 if the parolee is white, 2 otherwise
- age = the parolee's age in years at the time of release from prison
- state = a code for the parolee's state. 2 is Kentucky, 3 is Louisiana, 4 is Virginia, and 1 is any other state. These three states were selected due to having a high representation in the dataset.
- time.served = the number of months the parolee served in prison (limited by the inclusion criteria to not exceed 6 months)
- max.sentence = the maximum sentence length for all charges, in months (limited by the inclusion criteria to not exceed 18 months)
- multiple.offenses = 1 if the parolee was incarcerated for multiple offenses, 0 otherwise
- crime = a code for the parolee's main crime leading to incarceration. 2 is larceny, 3 is drug-related crime, 4 is driving-related crime, and 1 is any other crime.
- violator = 1 if the parolee violated the parole, and 0 if the parolee completed the parole without violation

Loading the Data

parole <- read.csv("parole.csv")
# Since male, state, and crime are all unordered factors, convert them to
# factor variables
parole$male = as.factor(parole$male)
parole$state = as.factor(parole$state)
parole$crime = as.factor(parole$crime)
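
After as.factor, the level names are still the numeric codes, so tables and legends show 1 to 4 rather than crime names. As a minimal sketch of an alternative (not what the rest of this walkthrough does), factor() can attach readable labels at conversion time. The vector crime_codes is a hypothetical stand-in for parole$crime, and the label strings are paraphrased from the variable descriptions above.

```r
# Hypothetical stand-in for parole$crime (made-up values, not from parole.csv)
crime_codes <- c(1, 2, 3, 4, 2, 1)

# Map codes 1..4 to readable level names while converting to a factor
crime <- factor(
  crime_codes,
  levels = 1:4,
  labels = c("Other", "Larceny", "Drug-related", "Driving-related")
)

levels(crime)
table(crime)
```

If the real column were labeled this way, any later breaks= arguments would need to use the new level names instead of the numeric codes.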

# Fraction of parole violators who are female.
# Rows are male (0/1), columns are violator (0/1): 14 of the 14 + 64 = 78
# violators are female.
table(parole$male, parole$violator)
##    
##       0   1
##   0 116  14
##   1 481  64
14/78
## [1] 0.1795
# Crimes by state (rows: crime code, columns: state code)
table(parole$crime, parole$state)
##    
##       1   2   3   4
##   1  66  42  42 165
##   2   9  10  15  72
##   3  34  64  20  35
##   4  34   4   5  58
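
The 14/78 calculation above can also be done without reading numbers off the table by hand. The sketch below rebuilds the male-by-violator counts printed earlier as a matrix (the object names are illustrative) and uses prop.table to get column proportions. With the real data, the same call should work directly on table(parole$male, parole$violator).

```r
# Counts copied from the printed table: rows = male (0/1), columns = violator (0/1)
counts <- matrix(
  c(116, 481, 14, 64),
  nrow = 2,
  dimnames = list(male = c("0", "1"), violator = c("0", "1"))
)

# margin = 2 divides each column by its column total
col_share <- prop.table(counts, margin = 2)

# Share of violators (column "1") who are female (row "0")
frac_female_violators <- col_share["0", "1"]
round(frac_female_violators, 4)
```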

Creating a Basic Histogram

# Create a histogram to find out the distribution of the age of parolees
library(ggplot2)
ggplot(data = parole, aes(x = age)) + geom_histogram(binwidth = 5, color = "blue")

Adding Another Dimension

Now suppose we are interested in seeing how the age distribution of male parolees compares to the age distribution of female parolees.

One option would be to create a heatmap with age on one axis and male (a binary variable in our data set) on the other axis. Another option would be to stick with histograms, but to create a separate histogram for each gender. ggplot has the ability to do this automatically using the facet_grid command.

# One histogram per value of male, stacked vertically (one facet per row)
ggplot(data = parole, aes(x = age)) +
    geom_histogram(binwidth = 5) +
    facet_grid(male ~ .)

# Same facets placed side by side (one facet per column)
ggplot(data = parole, aes(x = age)) +
    geom_histogram(binwidth = 5) +
    facet_grid(. ~ male)
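
In both layouts the facet strips show the raw codes 0 and 1. As a hedged sketch, the labeller argument of facet_grid can replace them with readable names. The data frame demo and the vector sex_labels are hypothetical and made up for illustration; with the real data, parole would take the place of demo.

```r
library(ggplot2)

# Hypothetical toy data (not from parole.csv)
demo <- data.frame(
  age  = c(19, 23, 27, 31, 35, 42, 48, 55),
  male = factor(c(0, 1, 1, 0, 1, 1, 0, 1))
)

# Lookup from factor level to the text shown in the facet strip
sex_labels <- c("0" = "Female", "1" = "Male")

p <- ggplot(data = demo, aes(x = age)) +
    geom_histogram(binwidth = 5) +
    facet_grid(male ~ ., labeller = labeller(male = sex_labels))

print(p)
```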

An alternative to faceting is to color the groups differently. To color the bars by group, we need to tell ggplot that a property of the data (male or not male) should be translated to an aesthetic property of the histogram. We can do this by setting the fill parameter within the aesthetic to male.

ggplot(data = parole, aes(x = age, fill = male)) + geom_histogram(binwidth = 5)

Coloring the groups differently is a good way to see the breakdown of age by sex within a single, aggregated histogram. However, the bars here are stacked: the top of each stacked bar marks the total number of parolees in that age bin, not the number in any one group, so the count for a group is the height of its colored segment alone.
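
To make the stacking point concrete, the sketch below counts a small made-up sample by age bin and group in base R. The data frame demo is hypothetical, and the bin edges are chosen for illustration, so they may not match the bins ggplot2 picks by default.

```r
# Hypothetical toy data (not from parole.csv)
demo <- data.frame(
  age  = c(19, 22, 24, 26, 28, 31, 33, 37, 41, 44),
  male = factor(c(1, 0, 1, 1, 0, 1, 1, 0, 1, 1))
)

# 5-year bins, closed on the left: [15,20), [20,25), ...
demo$age_bin <- cut(demo$age, breaks = seq(15, 45, by = 5), right = FALSE)

# Counts per group within each bin: the height of each colored segment
per_group <- table(demo$male, demo$age_bin)

# Column sums: the full height of each stacked bar
stacked_height <- colSums(per_group)

per_group
stacked_height
```

The column sums are what a stacked histogram draws as the top of each bar, while each row of per_group is what an overlaid, unstacked histogram would draw for that group.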

An alternative to a single, stacked histogram is to create two histograms and overlay them on top of each other.

1) Tell ggplot not to stack the histograms by adding the argument position = "identity" to the geom_histogram function.

2) Make the bars semi-transparent so we can see both colors by adding the argument alpha=0.5 to the geom_histogram function.

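# Overlay the two histograms; breaks are given as strings because the levels
# of the factor male are the character strings "0" and "1"
ggplot(data = parole, aes(x = age, fill = male)) +
    geom_histogram(binwidth = 5, position = "identity", alpha = 0.5) +
    labs(fill = "Gender") +
    scale_fill_hue(guide = "legend", breaks = c("0", "1"), labels = c("Female", "Male"))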

Time Served

# Create a basic histogram with time.served on the x-axis. Set the bin width
# to one month.
ggplot(data = parole, aes(x = time.served)) + geom_histogram(binwidth = 1)

# Repeat with a bin width of 0.1 months for a finer-grained view
ggplot(data = parole, aes(x = time.served)) + geom_histogram(binwidth = 0.1)
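
The effect of the bin width can also be checked numerically. The sketch below bins a small made-up vector at both widths using base R's hist with plot = FALSE. The values in time_served are hypothetical, and hist closes its bins on the right, so edge cases may be assigned differently than in geom_histogram.

```r
# Hypothetical time-served values in months (not from parole.csv)
time_served <- c(1.25, 1.85, 2.55, 3.05, 3.12, 3.14, 4.45, 4.65, 5.25, 5.95)

# Same data, two bin widths; plot = FALSE returns the counts without drawing
coarse <- hist(time_served, breaks = seq(0, 6, by = 1), plot = FALSE)
fine   <- hist(time_served, breaks = seq(0, 6, by = 0.1), plot = FALSE)

coarse$counts                      # six 1-month bins
fine$mids[which.max(fine$counts)]  # midpoint of the fullest 0.1-month bin
```

Wide bins give a smooth overall shape, while narrow bins reveal which specific values are most common at the cost of a noisier picture.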

Now, suppose we suspect that it is unlikely that each type of crime has the same distribution of time served. To visualize this, change the binwidth back to 1 month, and use facet_grid to create a separate histogram of time.served for each value of the variable crime.

# One facet per crime type; breaks are strings because the factor levels are
# the character strings "1" to "4"
ggplot(data = parole, aes(x = time.served, fill = crime)) +
    geom_histogram(binwidth = 1) +
    facet_grid(. ~ crime) +
    ggtitle("Time Served by Crime") +
    xlab("Time Served") +
    ylab("Number of People") +
    labs(fill = "Crime") +
    scale_fill_hue(guide = "legend", breaks = c("1", "2", "3", "4"),
        labels = c("Other", "Larceny", "Drug Related Crime", "Driving Related Crime"))

Now, instead of faceting the histograms, overlay them. Remember to set the position parameter so that the histograms are not stacked, and the alpha parameter so that overlapping bars remain visible.

# Overlay the four histograms; breaks are strings because the factor levels
# are the character strings "1" to "4"
ggplot(data = parole, aes(x = time.served, fill = crime)) +
    geom_histogram(binwidth = 1, position = "identity", alpha = 0.5) +
    labs(fill = "Crime") +
    scale_fill_hue(guide = "legend", breaks = c("1", "2", "3", "4"),
        labels = c("Other", "Larceny", "Drug Related Crime", "Driving Related Crime")) +
    ggtitle("Time Served by Crime") +
    xlab("Time Served") +
    ylab("Number of People")
