Lab Overview
Analyzing Crime in Chicago
Crime is an international concern, but it is documented and handled in very different ways in different countries. In the United States, violent crimes and property crimes are recorded by the Federal Bureau of Investigation (FBI). Additionally, each city documents crime, and some cities release data regarding crime rates. The city of Chicago, Illinois releases crime data from 2001 onward online.
Chicago is the third most populous city in the United States, with a population of over 2.7 million people. The city of Chicago is shown in the map below, with the state of Illinois highlighted in red.
There are two main types of crime: violent crimes and property crimes. In this problem, we’ll focus on one specific type of property crime, called “motor vehicle theft” (sometimes referred to as grand theft auto). This is the act of stealing, or attempting to steal, a car. We’ll use some basic data analysis in R to understand the motor vehicle thefts in Chicago.
Crime Data
Please download the file mvtWeek1.csv for this problem (do not open this file in any spreadsheet software before completing this problem because it might change the format of the Date field).
Here is a list of descriptions of the variables:
- ID: a unique identifier for each observation
- Date: the date the crime occurred
- LocationDescription: the location where the crime occurred
- Arrest: whether or not an arrest was made for the crime (TRUE if an arrest was made, and FALSE if an arrest was not made)
- Domestic: whether or not the crime was a domestic crime, meaning that it was committed against a family member (TRUE if it was domestic, and FALSE if it was not domestic)
- Beat: the area, or “beat”, in which the crime occurred. This is the smallest regional division defined by the Chicago Police Department.
- District: the police district in which the crime occurred. Each district is composed of many beats and is defined by the Chicago Police Department.
- CommunityArea: the community area in which the crime occurred. Since the 1920s, Chicago has been divided into what are called “community areas”, of which there are now 77. The community areas were devised in an attempt to create socially homogeneous regions.
- Year: the year in which the crime occurred.
- Latitude: the latitude of the location at which the crime occurred.
- Longitude: the longitude of the location at which the crime occurred.
My Solutions
Problem 1: Loading the Data
1.1)
Read the dataset mvtWeek1.csv into R, using the read.csv function, and call the data frame mvt. Remember to navigate to the directory on your computer containing the file mvtWeek1.csv first. It may take a few minutes to read in the data, since it is pretty large. Then, use the str and summary functions to answer the following questions.
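library(dplyr)  # provides %>% and as_tibble()
# tbl_df() is deprecated in current dplyr releases; as_tibble() is its replacement
mvt <- read.csv('./Data/mvtWeek1.csv') %>% as_tibble()
summary(mvt)
str(mvt)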
Question:
How many rows of data (observations) are in this dataset?
Answer:
There are 191641 observations in the mvt dataset.
1.2)
Question:
How many variables are in this dataset?
Answer:
There are 11 variables in the mvt dataset.
1.3)
Question:
Using the max function, what is the maximum value of the variable ID?
Answer:
The maximum value of the variable ID is 9181151.
1.4)
Question:
What is the minimum value of the variable Beat?
Answer:
The minimum value of the variable Beat is 111.
1.5)
Question:
How many observations have value TRUE in the Arrest variable (this is the number of crimes for which an arrest was made)?
Answer:
There are 15536 observations where mvt$Arrest == TRUE.
1.6)
Question:
How many observations have a LocationDescription value of “ALLEY”?
Answer:
There are 2308 observations where mvt$LocationDescription == 'ALLEY'.
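The answers in 1.1 through 1.6 can each be reproduced with a one-line call. The sketch below uses a small made-up data frame, toy, with some of the same column names as mvt (the values are invented for illustration); running the same calls on mvt should give the numbers reported above.

```r
# Hypothetical stand-in for mvt: same column names, invented values.
toy <- data.frame(
  ID = c(101, 205, 150, 320),
  Beat = c(1622, 111, 2412, 835),
  Arrest = c(FALSE, TRUE, FALSE, TRUE),
  LocationDescription = c("STREET", "ALLEY", "STREET", "ALLEY"),
  stringsAsFactors = FALSE
)

n_obs    <- nrow(toy)                                # 1.1: number of observations
n_vars   <- ncol(toy)                                # 1.2: number of variables
max_id   <- max(toy$ID)                              # 1.3: largest ID
min_beat <- min(toy$Beat)                            # 1.4: smallest Beat
n_arrest <- sum(toy$Arrest)                          # 1.5: each TRUE counts as 1
n_alley  <- sum(toy$LocationDescription == "ALLEY")  # 1.6: rows equal to "ALLEY"
```

Summing a logical vector works because R treats TRUE as 1 and FALSE as 0, so sum() counts the matching rows directly.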
Problem 2: Understanding Dates in R
2.1)
In many datasets, like this one, you have a date field. Unfortunately, R does not automatically recognize entries that look like dates. We need to use a function in R to extract the date and time. Take a look at the first entry of Date. Use square brackets when looking at a certain entry of a variable.
Question:
In what format are the entries in the variable Date?
- Month/Day/Year Hour:Minute
- Day/Month/Year Hour:Minute
- Hour:Minute Month/Day/Year
- Hour:Minute Day/Month/Year
Answer:
mvt$Date is formatted as Month/Day/Year Hour:Minute.
2.2)
Now, let’s convert the mvt$Date character vector into Date objects in R.
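One way to do the conversion, assuming the Month/Day/Year Hour:Minute format identified in 2.1 and a two-digit year, is to parse the text with strptime() and wrap the result in as.Date(). The sample strings below are invented for illustration; in the lab the input would be mvt$Date.

```r
# Invented sample strings in Month/Day/Year Hour:Minute form (two-digit year assumed).
sample_dates <- c("12/31/12 23:15", "5/21/06 12:30", "1/1/01 0:01")

# strptime() parses the text; as.Date() keeps the calendar date and drops the time of day.
DateConvert <- as.Date(strptime(sample_dates, format = "%m/%d/%y %H:%M"))

summary(DateConvert)
```

If the year in the file has four digits instead, the format string would need %Y in place of %y.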
Take a look at the variable DateConvert using the summary() function.
Question:
What is the month and year of the median date in our dataset? Enter your answer as “Month Year”, without the quotes. (Ex: if the answer was 2008-03-28, you would give the answer “March 2008”, without the quotes.)
medianDate <- median(DateConvert)
answerMonthYear <- paste(months(medianDate), format(medianDate, '%Y'))
paste(medianDate, '->', answerMonthYear)
Answer:
The median date in the mvt dataset is May 2006.
2.3)
Now, let’s extract the month and the day of the week, and add these variables to mvt. We can do this with two simple functions.
This creates two new variables in our data frame, Month and Weekday, and sets them equal to the month and weekday values that we can extract from the Date object. Lastly, replace the old Date variable with DateConvert by typing:
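A minimal sketch of these steps, using the base R functions months() and weekdays() on a small invented data frame (toy and its values are hypothetical stand-ins for mvt). Both functions return names in the language of the current locale.

```r
# Hypothetical stand-in for mvt with a character Date column.
toy <- data.frame(
  Date = c("12/31/12 23:15", "2/14/06 8:00"),
  stringsAsFactors = FALSE
)
DateConvert <- as.Date(strptime(toy$Date, format = "%m/%d/%y %H:%M"))

toy$Month   <- months(DateConvert)    # month name of each date
toy$Weekday <- weekdays(DateConvert)  # weekday name of each date
toy$Date    <- DateConvert            # replace the text column with Date objects
```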
Using the table() function, answer the following questions.
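As a sketch of how table() can answer these questions, the example below counts thefts per month, finds the least frequent month with which.min(), and cross-tabulates month against arrest. The Month and Arrest vectors are invented; in the lab they would be mvt$Month and mvt$Arrest.

```r
# Invented values standing in for mvt$Month and mvt$Arrest.
Month  <- c("January", "January", "February", "March", "March", "March")
Arrest <- c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE)

month_counts <- table(Month)                # thefts per month
fewest <- names(which.min(month_counts))    # month with the fewest thefts

by_arrest <- table(Month, Arrest)           # rows: month, columns: FALSE and TRUE
most_arrests <- names(which.max(by_arrest[, "TRUE"]))  # month with the most arrests
```

The same pattern with a Weekday vector in place of Month gives the counts per day of the week.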
Question:
In which month did the fewest motor vehicle thefts occur?
Answer:
The fewest motor vehicle thefts occurred in February.
2.4)
Question:
On which weekday did the most motor vehicle thefts occur?
Answer:
Most motor vehicle thefts occurred on Friday.
2.5)
Each observation in the dataset represents a motor vehicle theft, and the Arrest variable indicates whether an arrest was later made for this theft.
Question:
Which month has the largest number of motor vehicle thefts for which an arrest was made?
Answer:
The greatest number of arrests made for motor vehicle thefts occurred in January.
Problem 3: Visualizing Crime Trends
3.1)
Now, let’s make some plots to help us better understand how crime has changed over time in Chicago. Throughout this problem, and in general, you can save your plot to a file. For more information, this website very clearly explains the process.
First, let’s make a histogram of the variable Date. We’ll add an extra argument, to specify the number of bars we want in our histogram. In your R console, type:
hist(mvt$Date, breaks=100)
Looking at the histogram, answer the following questions.
Question:
In general, does it look like crime increases or decreases from 2002 - 2012?
Answer:
Crime appears to decrease from 2002 - 2012.
Question:
In general, does it look like crime increases or decreases from 2005 - 2008?
Answer:
Crime appears to decrease from 2005 - 2008.
Question:
In general, does it look like crime increases or decreases from 2009 - 2011?
Answer:
Crime appears to increase from 2009 - 2011.
3.2)
Now, let’s see how arrests have changed over time. Create a boxplot of the variable Date, sorted by the variable Arrest (if you are not familiar with boxplots and would like to learn more, check out this tutorial).
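library(ggplot2)  # ggplot(), geom_boxplot(), and related functions

# Boxplot of theft dates, split by whether an arrest was made
mvt %>%
  ggplot(aes(x = Arrest, y = Date, fill = Arrest)) +
  geom_boxplot() +
  coord_flip() +  # put the dates on the horizontal axis
  scale_fill_brewer(palette = "Dark2") +
  theme(legend.position = "none") +
  labs(title = 'Frequency of Arrests Over Time', x = 'Arrest Made', y = 'Year')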
In a boxplot, the bold line inside the box is the median value of the data, the box spans the range of values between the first quartile and third quartile, and the whiskers (the lines extending outside the box) show the minimum and maximum values, excluding any outliers (which are plotted as individual points). Outliers are defined by first computing the difference between the first and third quartile values, which is the length of the box. This number is called the Inter-Quartile Range (IQR). By default in R, any point that is greater than the third quartile plus 1.5 times the IQR, or less than the first quartile minus 1.5 times the IQR, is considered an outlier.
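To make the outlier rule concrete, here is a small worked example on an invented numeric vector. It uses a multiplier of 1.5 times the IQR, which is the default in both boxplot() and geom_boxplot(), and computes the quartiles with quantile().

```r
# Invented sample; 40 is deliberately far from the rest.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 40)

q     <- quantile(x, c(0.25, 0.75))  # first and third quartiles
iqr   <- q[[2]] - q[[1]]             # inter-quartile range
lower <- q[[1]] - 1.5 * iqr          # points below this are outliers
upper <- q[[2]] + 1.5 * iqr          # points above this are outliers

outliers <- x[x < lower | x > upper]
```

Here the quartiles are 3 and 7, so the IQR is 4 and the cutoffs are -3 and 13; only the value 40 falls outside them.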
Question:
Does it look like there were more crimes for which arrests were made in the first half of the time period or the second half of the time period? (Note that the time period is from 2001 to 2012, so the middle of the time period is the beginning of 2007.)
Answer:
From the boxplot, we can see that more of the crimes for which arrests were made occurred in the first half of the time period (2001 to 2012).
3.3)
Let’s investigate this further. Use the table() function for the next few questions.
Note: in this question and many others in the course, we are asking for an answer as a proportion. Therefore, your answer should take a value between 0 and 1.
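One way to get these proportions is to cross-tabulate year against arrest and divide the TRUE count by the row total; prop.table() with margin = 1 does this for every year at once. The vectors below are invented stand-ins for mvt$Year and mvt$Arrest.

```r
# Invented values standing in for mvt$Year and mvt$Arrest.
Year   <- c(2001, 2001, 2001, 2001, 2012, 2012, 2012, 2012, 2012)
Arrest <- c(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE)

counts <- table(Year, Arrest)             # rows: year, columns: FALSE and TRUE
props  <- prop.table(counts, margin = 1)  # each row now sums to 1

arrest_rate_2001 <- props["2001", "TRUE"]  # 1 arrest out of 4 thefts
arrest_rate_2012 <- props["2012", "TRUE"]  # 1 arrest out of 5 thefts
```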
Question:
For what proportion of motor vehicle thefts in 2001 was an arrest made?
Answer:
An arrest was made in 0.1041173 of motor vehicle thefts during 2001.
3.4)
Question:
For what proportion of motor vehicle thefts in 2007 was an arrest made?
Answer:
An arrest was made in 0.0848739 of motor vehicle thefts during 2007.
3.5)
Question:
For what proportion of motor vehicle thefts in 2012 was an arrest made?
Answer:
An arrest was made in 0.0390292 of motor vehicle thefts during 2012.
Note: Since there may still be open investigations for recent crimes, this could explain the trend we are seeing in the data. There could also be other factors at play, and this trend should be investigated further. However, since we don’t know when the arrests were actually made, our detective work in this area has reached a dead end.
Problem 4: Popular Locations
4.1)
Analyzing this data could be useful to the Chicago Police Department when deciding where to allocate resources. If they want to increase the number of arrests that are made for motor vehicle thefts, where should they focus their efforts?
We want to find the top five locations where motor vehicle thefts occur. If you create a table of the LocationDescription variable, it is unfortunately very hard to read since there are 78 different locations in the data set. By using the sort function, we can view this same table, but sorted by the number of observations in each category. In your R console, type:
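A sketch of the idea, assuming base R's sort() applied to the result of table(). The location vector is invented; in the lab it would be mvt$LocationDescription.

```r
# Invented values standing in for mvt$LocationDescription.
loc <- c("STREET", "ALLEY", "STREET", "GAS STATION", "STREET", "ALLEY")

sorted_counts <- sort(table(loc))         # ascending, so the most common come last
top_two <- names(tail(sorted_counts, 2))  # the two most frequent locations
```

Because sort() orders from smallest to largest by default, the most common locations appear at the end of the output; passing decreasing = TRUE puts them first instead.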
Question:
Which locations are the top five locations for motor vehicle thefts, excluding the “Other” category?
Answer:
Top five locations for motor vehicle theft (excluding “Other”):
1. STREET
2. PARKING LOT/GARAGE(NON.RESID.)
3. ALLEY
4. GAS STATION
5. DRIVEWAY - RESIDENTIAL
4.2)
Create a subset of your data, only taking observations for which the theft happened in one of these five locations, and call this new data set Top5.
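A minimal sketch of the subsetting step, assuming the five location names listed in 4.1 and using %in% inside subset(). The toy data frame is invented for illustration; in the lab the first argument would be mvt.

```r
# The five location names from 4.1.
top_locations <- c("STREET", "PARKING LOT/GARAGE(NON.RESID.)", "ALLEY",
                   "GAS STATION", "DRIVEWAY - RESIDENTIAL")

# Hypothetical stand-in for mvt.
toy <- data.frame(
  ID = 1:5,
  LocationDescription = c("STREET", "OTHER", "ALLEY", "SIDEWALK", "GAS STATION"),
  stringsAsFactors = FALSE
)

# Keep only the rows whose location is one of the five names.
Top5 <- subset(toy, LocationDescription %in% top_locations)
nrow(Top5)
```

Using %in% avoids chaining five separate == comparisons with the | operator.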
Question:
How many observations are in Top5?
Answer:
There are 177510 observations in Top5.
4.3)
R will remember the other categories of the LocationDescription variable from the original dataset, so running table(Top5$LocationDescription) will have a lot of unnecessary output. To make our tables a bit nicer to read, we can refresh this factor variable. In your R console, type:
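One common way to refresh a factor is to pass it back through factor(), which rebuilds the levels from the values actually present (droplevels() is an alternative). A sketch on an invented factor, standing in for Top5$LocationDescription:

```r
# Invented factor with three levels.
loc  <- factor(c("STREET", "ALLEY", "OTHER", "STREET"))
kept <- loc[loc != "OTHER"]  # "OTHER" is gone from the values but not from the levels

levels_before <- levels(kept)  # still includes "OTHER"
kept <- factor(kept)           # rebuild the levels from the remaining values
levels_after <- levels(kept)   # only the levels that are still in use
```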
If you run the str() or table() function on Top5 now, you should see that LocationDescription now only has 5 values, as we expect.
Use the Top5 data frame to answer the remaining questions.
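As a sketch of how the remaining questions might be approached, the example below computes an arrest rate per location from row-wise proportions, then counts thefts per weekday at a single location. Top5_toy and its values are invented; the column names match those used in this lab.

```r
# Hypothetical stand-in for Top5.
Top5_toy <- data.frame(
  LocationDescription = c("STREET", "STREET", "STREET", "GAS STATION", "GAS STATION"),
  Arrest  = c(FALSE, FALSE, TRUE, TRUE, TRUE),
  Weekday = c("Monday", "Friday", "Friday", "Saturday", "Saturday"),
  stringsAsFactors = FALSE
)

# Arrest rate per location: the TRUE column of the row-wise proportions.
loc_by_arrest <- table(Top5_toy$LocationDescription, Top5_toy$Arrest)
rates <- prop.table(loc_by_arrest, margin = 1)[, "TRUE"]
highest_rate <- names(which.max(rates))

# Thefts per weekday at one location.
gas <- subset(Top5_toy, LocationDescription == "GAS STATION")
gas_by_day <- table(gas$Weekday)
```

Comparing rates rather than raw counts matters here, since the location with the most thefts is not necessarily the one with the highest share of arrests.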
Question:
One of the locations has a much higher arrest rate than the other locations. Which is it?
Answer:
The LocationDescription of GAS STATION has a much higher arrest rate than the other locations.
4.4)
Question:
On which day of the week do the most motor vehicle thefts at gas stations happen?
Answer:
Most motor vehicle thefts at gas stations happen on Saturday.
4.5)
Question:
On which day of the week do the fewest motor vehicle thefts in residential driveways happen?
Answer:
The fewest instances of motor vehicle theft in residential driveways occur on Saturday.
