Crime is an international concern, but it is documented and handled in very different ways in different countries. In the United States, violent crimes and property crimes are recorded by the Federal Bureau of Investigation (FBI). Additionally, each city documents crime, and some cities release data regarding crime rates. The city of Chicago, Illinois releases crime data from 2001 onward online.
There are two main types of crimes: violent crimes and property crimes. In this problem, we’ll focus on one specific type of property crime, called “motor vehicle theft” (sometimes referred to as grand theft auto). This is the act of stealing, or attempting to steal, a car. We’ll use some basic data analysis in R to understand the motor vehicle thefts in Chicago.
Here is a list of descriptions of the variables:
ID : a unique identifier for each observation
Date : the date the crime occurred
LocationDescription : the location where the crime occurred
Arrest : whether or not an arrest was made for the crime (TRUE if an arrest was made, and FALSE if an arrest was not made)
Domestic : whether or not the crime was a domestic crime, meaning that it was committed against a family member (TRUE if it was domestic, and FALSE if it was not domestic)
Beat : the area, or “beat” in which the crime occurred. This is the smallest regional division defined by the Chicago police department.
District : the police district in which the crime occurred. Each district is composed of many beats, and the districts are defined by the Chicago Police Department.
CommunityArea : the community area in which the crime occurred. Since the 1920s, Chicago has been divided into what are called “community areas”, of which there are now 77. The community areas were devised in an attempt to create socially homogeneous regions.
Year : the year in which the crime occurred.
Latitude : the latitude of the location at which the crime occurred.
Longitude : the longitude of the location at which the crime occurred.
mvt <- read.csv("mvtWeek1.csv")
Analyzing the structure and summary of data
str(mvt)
## 'data.frame': 191641 obs. of 11 variables:
## $ ID : int 8951354 8951141 8952745 8952223 8951608 8950793 8950760 8951611 8951802 8950706 ...
## $ Date : Factor w/ 131680 levels "1/1/01 0:01",..: 42824 42823 42823 42823 42822 42821 42820 42819 42817 42816 ...
## $ LocationDescription: Factor w/ 78 levels "ABANDONED BUILDING",..: 72 72 62 72 72 72 72 72 72 72 ...
## $ Arrest : logi FALSE FALSE FALSE FALSE FALSE TRUE ...
## $ Domestic : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Beat : int 623 1213 1622 724 211 2521 423 231 1021 1215 ...
## $ District : int 6 12 16 7 2 25 4 2 10 12 ...
## $ CommunityArea : int 69 24 11 67 35 19 48 40 29 24 ...
## $ Year : int 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
## $ Latitude : num 41.8 41.9 42 41.8 41.8 ...
## $ Longitude : num -87.6 -87.7 -87.8 -87.7 -87.6 ...
** There are 191641 observations and 11 variables in the data set.
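The `str()` output above shows Date and LocationDescription as factors. That is what `read.csv` produces when `stringsAsFactors` is TRUE, which was the default before R 4.0.0; from R 4.0.0 onward the default is FALSE and the same columns load as character vectors. The sketch below illustrates the difference on a tiny inline CSV whose values are made up for the example (it is not the real mvtWeek1.csv).

```r
# Tiny inline CSV standing in for mvtWeek1.csv (values are illustrative only).
csv_text <- "ID,Date,LocationDescription,Arrest
1,12/31/12 23:15,STREET,FALSE
2,12/31/12 22:00,STREET,TRUE
3,12/30/12 9:00,ALLEY,FALSE"

# Text columns read as character vectors (the default in R >= 4.0.0).
as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Text columns read as factors, matching the structure shown in this document.
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_chr$LocationDescription)    # "character"
class(as_fct$LocationDescription)    # "factor"
nlevels(as_fct$LocationDescription)  # 2 distinct locations in this toy sample
```

On a recent R version, passing `stringsAsFactors = TRUE` when reading the real file should give factor columns like those shown here.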
Looking at the summary of data
summary(mvt)
## ID Date
## Min. :1310022 5/16/08 0:00 : 11
## 1st Qu.:2832144 10/17/01 22:00: 10
## Median :4762956 4/13/04 21:00 : 10
## Mean :4968629 9/17/05 22:00 : 10
## 3rd Qu.:7201878 10/12/01 22:00: 9
## Max. :9181151 10/13/01 22:00: 9
## (Other) :191582
## LocationDescription Arrest Domestic
## STREET :156564 Mode :logical Mode :logical
## PARKING LOT/GARAGE(NON.RESID.): 14852 FALSE:176105 FALSE:191226
## OTHER : 4573 TRUE :15536 TRUE :415
## ALLEY : 2308 NA's :0 NA's :0
## GAS STATION : 2111
## DRIVEWAY - RESIDENTIAL : 1675
## (Other) : 9558
## Beat District CommunityArea Year
## Min. : 111 Min. : 1.00 Min. : 0 Min. :2001
## 1st Qu.: 722 1st Qu.: 6.00 1st Qu.:22 1st Qu.:2003
## Median :1121 Median :10.00 Median :32 Median :2006
## Mean :1259 Mean :11.82 Mean :38 Mean :2006
## 3rd Qu.:1733 3rd Qu.:17.00 3rd Qu.:60 3rd Qu.:2009
## Max. :2535 Max. :31.00 Max. :77 Max. :2012
## NA's :43056 NA's :24616
## Latitude Longitude
## Min. :41.64 Min. :-87.93
## 1st Qu.:41.77 1st Qu.:-87.72
## Median :41.85 Median :-87.68
## Mean :41.84 Mean :-87.68
## 3rd Qu.:41.92 3rd Qu.:-87.64
## Max. :42.02 Max. :-87.52
## NA's :2276 NA's :2276
Now we will try to answer a few simple questions based on the summary of the overall data.
1.) What is the maximum value of the variable “ID”?
max(mvt$ID)
## [1] 9181151
2.) What is the minimum value of the variable “Beat”?
min(mvt$Beat)
## [1] 111
3.) How many observations have value TRUE in the Arrest variable (this is the number of crimes for which an arrest was made)?
summary(mvt$Arrest)
## Mode FALSE TRUE NA's
## logical 176105 15536 0
** There are 15536 crimes for which an arrest was made.
4.) How many observations have a LocationDescription value of ALLEY?
summary(mvt$LocationDescription == "ALLEY")
## Mode FALSE TRUE NA's
## logical 189333 2308 0
** There are 2308 observations that have a LocationDescription value of ALLEY.
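The two counts above can also be obtained directly with `sum()`, because R treats TRUE as 1 and FALSE as 0. The following is a minimal sketch on a toy data frame (the name `toy` and its values are assumptions for illustration, not part of the real data).

```r
# Toy stand-in for mvt (illustrative values only).
toy <- data.frame(
  Arrest = c(TRUE, FALSE, FALSE, TRUE, FALSE),
  LocationDescription = c("STREET", "ALLEY", "STREET", "ALLEY", "GAS STATION")
)

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUE values.
n_arrests <- sum(toy$Arrest)

# The same idea works for any logical condition.
n_alley <- sum(toy$LocationDescription == "ALLEY")

# table() gives the full FALSE/TRUE breakdown in one call.
arrest_counts <- table(toy$Arrest)

n_arrests  # 2
n_alley    # 2
```

On the real data the equivalent calls would be `sum(mvt$Arrest)` and `sum(mvt$LocationDescription == "ALLEY")`.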
Summarising the date variables in the data set
summary(mvt$Date)
## 5/16/08 0:00 10/17/01 22:00 4/13/04 21:00 9/17/05 22:00 10/12/01 22:00
## 11 10 10 10 9
## 10/13/01 22:00 10/26/01 22:00 11/17/01 19:00 11/26/03 21:00 12/2/05 21:00
## 9 9 9 9 9
## 3/28/02 22:00 4/29/01 22:00 4/5/02 21:00 6/8/12 9:00 7/28/01 23:00
## 9 9 9 9 9
## 7/5/02 22:00 8/10/04 23:00 8/30/02 23:00 9/30/10 22:00 9/9/01 22:00
## 9 9 9 9 9
## 10/15/01 22:00 10/22/05 19:00 10/31/07 22:00 10/5/01 22:00 11/26/04 21:00
## 8 8 8 8 8
## 12/1/04 22:00 12/22/05 12:00 12/29/01 19:00 12/7/05 16:00 2/16/02 21:00
## 8 8 8 8 8
## 3/24/05 22:00 3/9/01 20:00 4/4/11 22:00 4/7/02 22:00 5/22/10 1:00
## 8 8 8 8 8
## 6/19/12 22:00 6/2/11 23:00 6/27/10 22:00 6/28/01 22:00 7/30/12 22:00
## 8 8 8 8 8
## 7/31/08 22:00 8/10/04 20:00 8/12/06 0:00 8/18/02 1:00 8/25/02 22:00
## 8 8 8 8 8
## 8/28/06 21:00 9/10/12 22:00 9/17/01 22:00 9/18/01 22:00 9/22/05 0:00
## 8 8 8 8 8
## 9/7/07 0:00 1/1/03 21:00 1/10/07 22:00 1/14/03 12:00 1/19/02 22:00
## 8 7 7 7 7
## 1/2/04 20:00 1/6/01 19:00 1/6/03 18:00 1/6/11 20:00 1/6/11 21:00
## 7 7 7 7 7
## 10/11/01 22:00 10/12/07 13:00 10/13/01 15:00 10/13/05 0:00 10/13/06 20:00
## 7 7 7 7 7
## 10/15/07 22:00 10/21/01 19:00 10/21/02 19:00 10/22/03 22:00 10/22/07 20:00
## 7 7 7 7 7
## 10/27/10 22:00 10/28/04 22:00 10/29/02 17:00 10/3/01 18:00 10/3/01 21:00
## 7 7 7 7 7
## 10/30/04 22:00 11/10/03 22:00 11/13/12 21:00 11/19/02 20:00 11/24/01 17:00
## 7 7 7 7 7
## 11/25/03 21:00 11/28/10 22:00 11/3/02 0:00 12/12/03 23:00 12/18/12 20:00
## 7 7 7 7 7
## 12/22/10 8:00 12/23/05 12:00 12/7/03 22:00 2/1/03 22:00 2/17/06 22:00
## 7 7 7 7 7
## 2/25/02 0:00 3/1/02 21:00 3/19/01 20:00 3/21/07 20:00 3/23/12 21:00
## 7 7 7 7 7
## 3/28/07 22:00 3/6/01 0:00 4/11/10 22:00 4/15/01 20:00 (Other)
## 7 7 7 7 190872
Converting the Date variable from text into R's Date format and summarising it
DateConvert = as.Date(strptime(mvt$Date, "%m/%d/%y %H:%M"))
summary(DateConvert)
## Min. 1st Qu. Median Mean 3rd Qu.
## "2001-01-01" "2003-07-10" "2006-05-21" "2006-08-23" "2009-10-24"
## Max.
## "2012-12-31"
median(DateConvert)
## [1] "2006-05-21"
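To make the format string used in the conversion easier to follow, here is a small sketch that parses a few strings laid out like the values of mvt$Date. The vector `raw` is a hand-picked sample for illustration only.

```r
# Example strings in the same "month/day/2-digit-year hour:minute" layout as mvt$Date.
raw <- c("12/31/12 23:15", "1/1/01 0:01", "5/16/08 0:00")

# %m = month, %d = day, %y = two-digit year, %H = hour (0-23), %M = minute.
parsed <- strptime(raw, format = "%m/%d/%y %H:%M")

# as.Date() drops the time of day and keeps only the calendar date.
dates <- as.Date(parsed)

format(dates)   # "2012-12-31" "2001-01-01" "2008-05-16"
median(dates)   # "2008-05-16"

# A string that does not match the format becomes NA rather than raising an error.
bad <- as.Date(strptime("2012-12-31", format = "%m/%d/%y %H:%M"))
is.na(bad)      # TRUE
```

Because mismatched strings silently become NA, checking `sum(is.na(DateConvert))` after the conversion is a cheap way to confirm that every row was parsed.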
Now, let’s extract the month and the day of the week, and add these variables to our data frame mvt.
mvt$Month = months(DateConvert)
mvt$Weekday = weekdays(DateConvert)
mvt$Date = DateConvert
str(mvt)
## 'data.frame': 191641 obs. of 13 variables:
## $ ID : int 8951354 8951141 8952745 8952223 8951608 8950793 8950760 8951611 8951802 8950706 ...
## $ Date : Date, format: "2012-12-31" "2012-12-31" ...
## $ LocationDescription: Factor w/ 78 levels "ABANDONED BUILDING",..: 72 72 62 72 72 72 72 72 72 72 ...
## $ Arrest : logi FALSE FALSE FALSE FALSE FALSE TRUE ...
## $ Domestic : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Beat : int 623 1213 1622 724 211 2521 423 231 1021 1215 ...
## $ District : int 6 12 16 7 2 25 4 2 10 12 ...
## $ CommunityArea : int 69 24 11 67 35 19 48 40 29 24 ...
## $ Year : int 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
## $ Latitude : num 41.8 41.9 42 41.8 41.8 ...
## $ Longitude : num -87.6 -87.7 -87.8 -87.7 -87.6 ...
## $ Month : chr "December" "December" "December" "December" ...
## $ Weekday : chr "Monday" "Monday" "Monday" "Monday" ...
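Note that `months()` and `weekdays()` return names in the language of the current locale, so the English names shown here depend on the session settings. The sketch below, using a few hand-picked dates, shows numeric alternatives that do not depend on the locale.

```r
d <- as.Date(c("2012-12-31", "2008-05-16", "2001-01-01"))

# Locale-dependent names (English in the session used for this document).
month_names <- months(d)
day_names   <- weekdays(d)

# Locale-independent numeric alternatives.
month_num <- as.integer(format(d, "%m"))   # 1 = January ... 12 = December
wday_num  <- as.POSIXlt(d)$wday            # 0 = Sunday ... 6 = Saturday

month_num  # 12 5 1
wday_num   # 1 5 1
```

Numeric codes also sort in calendar order, whereas tables built from the names are ordered alphabetically.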
Exploring a few questions related to the dates of the thefts.
1.) In which month did the fewest motor vehicle thefts occur?
which.min(table(mvt$Month))
## February
## 4
2.) On which weekday did the most motor vehicle thefts occur?
which.max(table(mvt$Weekday))
## Friday
## 1
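The number printed under each name by `which.min()` and `which.max()` is the position of that level in the table, not a count of thefts. Table levels are sorted alphabetically, so February is the fourth month name and Friday is the first weekday name. A small sketch with made-up month labels shows the difference between the position and the count.

```r
# Toy month labels (illustrative only): 3 in March, 1 in February, 2 in April.
m <- c("March", "February", "April", "March", "April", "March")

counts <- table(m)          # levels sorted alphabetically: April, February, March
pos <- which.min(counts)    # named integer: name = level, value = its position

names(pos)                  # "February"
unname(pos)                 # 2 (second level alphabetically, not a count)
counts[[names(pos)]]        # 1 (the actual number of observations)

# Sorting the table shows every count at once.
sort(counts)
```

To see the actual number of thefts per month or weekday in the real data, `sort(table(mvt$Month))` and `sort(table(mvt$Weekday))` can be used in the same way.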
3.) Which month has the largest number of motor vehicle thefts for which an arrest was made?
which.max(table(mvt$Arrest, mvt$Month)["TRUE", ])
## January
## 5
4.) For what proportion of motor vehicle thefts in 2007 was an arrest made?
sum(mvt$Year == 2007 & mvt$Arrest) / sum(mvt$Year == 2007)
## [1] 0.08487395
5.) For what proportion of motor vehicle thefts in 2012 was an arrest made?
sum(mvt$Year == 2012 & mvt$Arrest) / sum(mvt$Year == 2012)
## [1] 0.03902924
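The per-year proportions can also be computed for every year in one step, since the mean of a logical vector is the share of TRUE values. This is a minimal sketch on a toy data frame; `toy` and its values are assumptions made for the example.

```r
# Toy stand-in for mvt (illustrative values only).
toy <- data.frame(
  Year   = c(2007, 2007, 2007, 2007, 2012, 2012, 2012, 2012, 2012),
  Arrest = c(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE)
)

# The mean of a logical vector is the proportion of TRUE values.
rate_2007 <- mean(toy$Arrest[toy$Year == 2007])

# tapply() applies the same calculation to every year in one call.
rate_by_year <- tapply(toy$Arrest, toy$Year, mean)

rate_2007      # 0.25
rate_by_year   # 2007: 0.25, 2012: 0.2
```

On the real data, `tapply(mvt$Arrest, mvt$Year, mean)` would list the arrest proportion for each year from 2001 to 2012.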
Analyzing this data could be useful to the Chicago Police Department when deciding where to allocate resources. If they want to increase the number of arrests that are made for motor vehicle thefts, where should they focus their efforts?
** We want to find the top five locations where motor vehicle thefts occur.
sort(table(mvt$LocationDescription))
##
## AIRPORT BUILDING NON-TERMINAL - SECURE AREA
## 1
## AIRPORT EXTERIOR - SECURE AREA
## 1
## ANIMAL HOSPITAL
## 1
## APPLIANCE STORE
## 1
## CTA TRAIN
## 1
## JAIL / LOCK-UP FACILITY
## 1
## NEWSSTAND
## 1
## BRIDGE
## 2
## COLLEGE/UNIVERSITY RESIDENCE HALL
## 2
## CURRENCY EXCHANGE
## 2
## BOWLING ALLEY
## 3
## CLEANING STORE
## 3
## MEDICAL/DENTAL OFFICE
## 3
## ABANDONED BUILDING
## 4
## AIRPORT BUILDING NON-TERMINAL - NON-SECURE AREA
## 4
## BARBERSHOP
## 4
## LAKEFRONT/WATERFRONT/RIVERBANK
## 4
## LIBRARY
## 4
## SAVINGS AND LOAN
## 4
## AIRPORT TERMINAL UPPER LEVEL - NON-SECURE AREA
## 5
## CHA APARTMENT
## 5
## DAY CARE CENTER
## 5
## FIRE STATION
## 5
## FOREST PRESERVE
## 6
## BANK
## 7
## CONVENIENCE STORE
## 7
## DRUG STORE
## 8
## OTHER COMMERCIAL TRANSPORTATION
## 8
## ATHLETIC CLUB
## 9
## AIRPORT VENDING ESTABLISHMENT
## 10
## AIRPORT PARKING LOT
## 11
## SCHOOL, PRIVATE, BUILDING
## 14
## TAVERN/LIQUOR STORE
## 14
## FACTORY/MANUFACTURING BUILDING
## 16
## BAR OR TAVERN
## 17
## WAREHOUSE
## 17
## MOVIE HOUSE/THEATER
## 18
## RESIDENCE PORCH/HALLWAY
## 18
## NURSING HOME/RETIREMENT HOME
## 21
## TAXICAB
## 21
## DEPARTMENT STORE
## 22
## HIGHWAY/EXPRESSWAY
## 22
## SCHOOL, PRIVATE, GROUNDS
## 23
## VEHICLE-COMMERCIAL
## 23
## AIRPORT EXTERIOR - NON-SECURE AREA
## 24
## OTHER RAILROAD PROP / TRAIN DEPOT
## 28
## SMALL RETAIL STORE
## 33
## CONSTRUCTION SITE
## 35
## CAR WASH
## 44
## COLLEGE/UNIVERSITY GROUNDS
## 47
## GOVERNMENT BUILDING/PROPERTY
## 48
## RESTAURANT
## 49
## CHURCH/SYNAGOGUE/PLACE OF WORSHIP
## 56
## GROCERY FOOD STORE
## 80
## HOSPITAL BUILDING/GROUNDS
## 101
## SCHOOL, PUBLIC, BUILDING
## 114
## HOTEL/MOTEL
## 124
## COMMERCIAL / BUSINESS OFFICE
## 126
## CTA GARAGE / OTHER PROPERTY
## 148
## SPORTS ARENA/STADIUM
## 166
## APARTMENT
## 184
## SCHOOL, PUBLIC, GROUNDS
## 206
## PARK PROPERTY
## 255
## POLICE FACILITY/VEH PARKING LOT
## 266
## AIRPORT/AIRCRAFT
## 363
## CHA PARKING LOT/GROUNDS
## 405
## SIDEWALK
## 462
## VEHICLE NON-COMMERCIAL
## 817
## VACANT LOT/LAND
## 985
## RESIDENCE-GARAGE
## 1176
## RESIDENCE
## 1302
## RESIDENTIAL YARD (FRONT/BACK)
## 1536
## DRIVEWAY - RESIDENTIAL
## 1675
## GAS STATION
## 2111
## ALLEY
## 2308
## OTHER
## 4573
## PARKING LOT/GARAGE(NON.RESID.)
## 14852
## STREET
## 156564
** Excluding the catch-all OTHER category, the five locations with the most thefts are Street, Parking Lot/Garage (Non. Resid.), Alley, Gas Station, and Driveway - Residential.
Create a subset of the data containing only the observations for which the theft happened in one of these five locations, and call this new data set “Top5”.
Top5 <- subset(mvt, mvt$LocationDescription == "STREET" | mvt$LocationDescription == "PARKING LOT/GARAGE(NON.RESID.)"
| mvt$LocationDescription == "ALLEY" | mvt$LocationDescription == "GAS STATION"
| mvt$LocationDescription == "DRIVEWAY - RESIDENTIAL")
summary(Top5)
## ID Date
## Min. :1310022 Min. :2001-01-01
## 1st Qu.:2827268 1st Qu.:2003-07-08
## Median :4752514 Median :2006-05-16
## Mean :4959006 Mean :2006-08-18
## 3rd Qu.:7184899 3rd Qu.:2009-10-15
## Max. :9181151 Max. :2012-12-31
##
## LocationDescription Arrest Domestic
## STREET :156564 Mode :logical Mode :logical
## PARKING LOT/GARAGE(NON.RESID.): 14852 FALSE:163492 FALSE:177193
## ALLEY : 2308 TRUE :14018 TRUE :317
## GAS STATION : 2111 NA's :0 NA's :0
## DRIVEWAY - RESIDENTIAL : 1675
## ABANDONED BUILDING : 0
## (Other) : 0
## Beat District CommunityArea Year
## Min. : 111 Min. : 1.00 Min. : 0.00 Min. :2001
## 1st Qu.: 722 1st Qu.: 6.00 1st Qu.:22.00 1st Qu.:2003
## Median :1121 Median :10.00 Median :31.00 Median :2006
## Mean :1264 Mean :11.88 Mean :37.74 Mean :2006
## 3rd Qu.:1733 3rd Qu.:17.00 3rd Qu.:59.00 3rd Qu.:2009
## Max. :2535 Max. :31.00 Max. :77.00 Max. :2012
## NA's :39988 NA's :22857
## Latitude Longitude Month Weekday
## Min. :41.64 Min. :-87.92 Length:177510 Length:177510
## 1st Qu.:41.77 1st Qu.:-87.72 Class :character Class :character
## Median :41.85 Median :-87.68 Mode :character Mode :character
## Mean :41.85 Mean :-87.68
## 3rd Qu.:41.92 3rd Qu.:-87.64
## Max. :42.02 Max. :-87.52
## NA's :2099 NA's :2099
str(Top5)
## 'data.frame': 177510 obs. of 13 variables:
## $ ID : int 8951354 8951141 8952223 8951608 8950793 8950760 8951611 8951802 8950706 8951585 ...
## $ Date : Date, format: "2012-12-31" "2012-12-31" ...
## $ LocationDescription: Factor w/ 78 levels "ABANDONED BUILDING",..: 72 72 72 72 72 72 72 72 72 72 ...
## $ Arrest : logi FALSE FALSE FALSE FALSE TRUE FALSE ...
## $ Domestic : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Beat : int 623 1213 724 211 2521 423 231 1021 1215 1011 ...
## $ District : int 6 12 7 2 25 4 2 10 12 10 ...
## $ CommunityArea : int 69 24 67 35 19 48 40 29 24 29 ...
## $ Year : int 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
## $ Latitude : num 41.8 41.9 41.8 41.8 41.9 ...
## $ Longitude : num -87.6 -87.7 -87.7 -87.6 -87.8 ...
## $ Month : chr "December" "December" "December" "December" ...
## $ Weekday : chr "Monday" "Monday" "Monday" "Monday" ...
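The chain of `==` comparisons joined by `|` that builds Top5 can be written more compactly with `%in%`. The sketch below uses a toy data frame (an assumption for illustration) and also shows why the summary of Top5 still lists unused locations with a count of 0: subsetting keeps every factor level until the levels are dropped.

```r
# Toy stand-in for mvt (illustrative values only).
toy <- data.frame(
  LocationDescription = factor(c("STREET", "ALLEY", "BANK",
                                 "GAS STATION", "STREET", "OTHER")),
  Arrest = c(FALSE, TRUE, FALSE, TRUE, FALSE, FALSE)
)

top_locations <- c("STREET", "PARKING LOT/GARAGE(NON.RESID.)", "ALLEY",
                   "GAS STATION", "DRIVEWAY - RESIDENTIAL")

# %in% tests membership in a set, replacing the chain of == comparisons.
toy_top <- subset(toy, LocationDescription %in% top_locations)

# Subsetting keeps unused factor levels; droplevels() removes them.
levels_before <- nlevels(toy_top$LocationDescription)   # 5 (all original levels)
toy_top <- droplevels(toy_top)
levels_after <- nlevels(toy_top$LocationDescription)    # 3 (ALLEY, GAS STATION, STREET)
nrow(toy_top)                                           # 4
```

Calling `factor()` again on the column has the same effect on the levels as `droplevels()`.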
** There are 177510 observations in the Top5 subset.
Answering a few questions about arrests and thefts at the top five locations.
1.) One of the locations has a much higher arrest rate than the other locations. Which is it?
Top5$LocationDescription = factor(Top5$LocationDescription)
table(Top5$LocationDescription, Top5$Arrest)
##
## FALSE TRUE
## ALLEY 2059 249
## DRIVEWAY - RESIDENTIAL 1543 132
## GAS STATION 1672 439
## PARKING LOT/GARAGE(NON.RESID.) 13249 1603
## STREET 144969 11595
** Gas Station has by far the highest percentage of arrests, with over 20% of motor vehicle thefts resulting in an arrest.
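The arrest rates behind this statement can be checked directly from the counts in the table output above. The sketch below re-enters those counts by hand into a matrix (the object name `counts` is an assumption) and converts each row to proportions.

```r
# Arrest counts per location, copied from the table output in this document.
counts <- matrix(
  c(2059, 249,
    1543, 132,
    1672, 439,
    13249, 1603,
    144969, 11595),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("ALLEY", "DRIVEWAY - RESIDENTIAL", "GAS STATION",
      "PARKING LOT/GARAGE(NON.RESID.)", "STREET"),
    c("FALSE", "TRUE")
  )
)

# Row-wise proportions: share of thefts at each location that led to an arrest.
rates <- prop.table(counts, margin = 1)[, "TRUE"]
round(rates, 3)

names(which.max(rates))   # "GAS STATION"
```

With the data frame loaded, `prop.table(table(Top5$LocationDescription, Top5$Arrest), margin = 1)` would give the same proportions without retyping the counts.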
2.) On which day of the week do the most motor vehicle thefts at gas stations happen?
** Saturday
3.) On which day of the week do the fewest motor vehicle thefts in residential driveways happen?
** Saturday
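The commands used for the two weekday questions are not shown. One plausible way to answer them is a two-way table of location by weekday. The sketch below uses a toy data frame, so its counts are made up and only the approach carries over to the real data.

```r
# Toy stand-in for Top5 (illustrative values only).
toy <- data.frame(
  LocationDescription = c("GAS STATION", "GAS STATION", "GAS STATION",
                          "DRIVEWAY - RESIDENTIAL", "DRIVEWAY - RESIDENTIAL",
                          "DRIVEWAY - RESIDENTIAL"),
  Weekday = c("Saturday", "Saturday", "Monday",
              "Monday", "Monday", "Saturday")
)

# Rows are locations, columns are weekdays.
by_day <- table(toy$LocationDescription, toy$Weekday)

# Busiest weekday for gas stations.
gas <- by_day["GAS STATION", ]
names(gas)[which.max(gas)]        # "Saturday" in this toy sample

# Quietest weekday for residential driveways.
drive <- by_day["DRIVEWAY - RESIDENTIAL", ]
names(drive)[which.min(drive)]    # "Saturday" in this toy sample
```

On the real data the table would be `table(Top5$LocationDescription, Top5$Weekday)`, with the GAS STATION and DRIVEWAY - RESIDENTIAL rows read off in the same way.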
** If you look at a boxplot of Date grouped by Arrest, the box for Arrest=TRUE is clearly shifted towards the bottom of the plot, meaning that more of the crimes for which arrests were made occurred in the first half of the time period.
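The boxplot referred to here is not reproduced in this section. As a rough sketch of how such a plot might be drawn, the example below uses a toy data frame in which arrests are deliberately concentrated in the earlier dates. The data are invented for illustration, and the exact call used for the original plot is an assumption.

```r
# Toy data (illustrative only): arrests concentrated in the earlier dates.
toy <- data.frame(
  Date = as.Date(c("2001-03-01", "2002-06-15", "2003-01-10", "2004-09-30",
                   "2005-05-05", "2008-02-20", "2010-07-04", "2012-11-11")),
  Arrest = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)
)

# One box per Arrest value, with Date on the vertical axis.
boxplot(Date ~ Arrest, data = toy, xlab = "Arrest", ylab = "Date")

# Median date per group; an earlier median for TRUE matches a box sitting lower.
med <- sapply(split(toy$Date, toy$Arrest), function(d) format(median(d)))
med   # FALSE: "2008-02-20", TRUE: "2002-06-15"
```

On the real data the corresponding call would be along the lines of `boxplot(mvt$Date ~ mvt$Arrest)`.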