Chicago is the third most populous city in the United States, with a population of over 2.7 million people. We’ll focus on one specific type of property crime, called “motor vehicle theft” (sometimes referred to as grand theft auto). This is the act of stealing, or attempting to steal, a car.
The dataset mvtWeek1.csv contains the following variables:
- ID: a unique identifier for each observation
- Date: the date the crime occurred
- LocationDescription: the location where the crime occurred
- Arrest: whether or not an arrest was made for the crime (TRUE if an arrest was made, and FALSE if an arrest was not made)
- Domestic: whether or not the crime was a domestic crime, meaning that it was committed against a family member (TRUE if it was domestic, and FALSE if it was not domestic)
- Beat: the area, or “beat”, in which the crime occurred. This is the smallest regional division defined by the Chicago Police Department.
- District: the police district in which the crime occurred. Each district is composed of many beats, and districts are defined by the Chicago Police Department.
- CommunityArea: the community area in which the crime occurred. Since the 1920s, Chicago has been divided into what are called “community areas”, of which there are now 77. The community areas were devised in an attempt to create socially homogeneous regions.
- Year: the year in which the crime occurred.
- Latitude: the latitude of the location at which the crime occurred.
- Longitude: the longitude of the location at which the crime occurred.
mvt = read.csv("mvtWeek1.csv")
str(mvt)
## 'data.frame': 191641 obs. of 11 variables:
## $ ID : int 8951354 8951141 8952745 8952223 8951608 8950793 8950760 8951611 8951802 8950706 ...
## $ Date : Factor w/ 131680 levels "1/1/01 0:01",..: 42824 42823 42823 42823 42822 42821 42820 42819 42817 42816 ...
## $ LocationDescription: Factor w/ 78 levels "ABANDONED BUILDING",..: 72 72 62 72 72 72 72 72 72 72 ...
## $ Arrest : logi FALSE FALSE FALSE FALSE FALSE TRUE ...
## $ Domestic : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Beat : int 623 1213 1622 724 211 2521 423 231 1021 1215 ...
## $ District : int 6 12 16 7 2 25 4 2 10 12 ...
## $ CommunityArea : int 69 24 11 67 35 19 48 40 29 24 ...
## $ Year : int 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
## $ Latitude : num 41.8 41.9 42 41.8 41.8 ...
## $ Longitude : num -87.6 -87.7 -87.8 -87.7 -87.6 ...
summary(mvt)
## ID Date
## Min. :1310022 5/16/08 0:00 : 11
## 1st Qu.:2832144 10/17/01 22:00: 10
## Median :4762956 4/13/04 21:00 : 10
## Mean :4968629 9/17/05 22:00 : 10
## 3rd Qu.:7201878 10/12/01 22:00: 9
## Max. :9181151 10/13/01 22:00: 9
## (Other) :191582
## LocationDescription Arrest Domestic
## STREET :156564 Mode :logical Mode :logical
## PARKING LOT/GARAGE(NON.RESID.): 14852 FALSE:176105 FALSE:191226
## OTHER : 4573 TRUE :15536 TRUE :415
## ALLEY : 2308 NA's :0 NA's :0
## GAS STATION : 2111
## DRIVEWAY - RESIDENTIAL : 1675
## (Other) : 9558
## Beat District CommunityArea Year
## Min. : 111 Min. : 1.00 Min. : 0 Min. :2001
## 1st Qu.: 722 1st Qu.: 6.00 1st Qu.:22 1st Qu.:2003
## Median :1121 Median :10.00 Median :32 Median :2006
## Mean :1259 Mean :11.82 Mean :38 Mean :2006
## 3rd Qu.:1733 3rd Qu.:17.00 3rd Qu.:60 3rd Qu.:2009
## Max. :2535 Max. :31.00 Max. :77 Max. :2012
## NA's :43056 NA's :24616
## Latitude Longitude
## Min. :41.64 Min. :-87.93
## 1st Qu.:41.77 1st Qu.:-87.72
## Median :41.85 Median :-87.68
## Mean :41.84 Mean :-87.68
## 3rd Qu.:41.92 3rd Qu.:-87.64
## Max. :42.02 Max. :-87.52
## NA's :2276 NA's :2276
How many observations have the value TRUE in the Arrest variable? 15536.
table(mvt$Arrest)
##
## FALSE TRUE
## 176105 15536
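To put the count in context, the share of thefts that led to an arrest can be computed from the same two numbers. The sketch below re-enters the counts from the output above instead of reading the CSV, so it runs on its own; `arrest_counts` and `arrest_share` are names introduced here for illustration.

```r
# Counts copied from the table(mvt$Arrest) output above
arrest_counts <- c("FALSE" = 176105, "TRUE" = 15536)

# Share of each outcome among all recorded thefts
arrest_share <- arrest_counts / sum(arrest_counts)
print(round(arrest_share, 4))
```

With the data frame loaded, `mean(mvt$Arrest)` should give the same TRUE share, because R treats TRUE as 1 and FALSE as 0 and the summary shows no missing values in Arrest.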
Check the date format: Month/Day/Year Hour:Minute
mvt$Date[1]
## [1] 12/31/12 23:15
## 131680 Levels: 1/1/01 0:01 1/1/01 0:05 1/1/01 0:30 1/1/01 1:17 ... 9/9/12 9:50
Convert these characters into a Date object
DateConvert = as.Date(strptime(mvt$Date, "%m/%d/%y %H:%M"))
Extract the month and the day of the week and add these variables to our data frame mvt.
mvt$Month = months(DateConvert)
mvt$Weekday = weekdays(DateConvert)
Replace the old Date variable with DateConvert.
mvt$Date = DateConvert
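The conversion above can be checked on a few strings without loading the full file. This is a minimal sketch: `raw_dates` holds values in the same layout as the Date column (the first two appear in the output above), and the helper names are illustrative. Because months() and weekdays() return names in the current locale, the sketch also extracts locale-independent numbers.

```r
# Sample strings in the same Month/Day/Year Hour:Minute layout as mvt$Date
raw_dates <- c("12/31/12 23:15", "1/1/01 0:01", "5/16/08 0:00")

# %y is a two-digit year; tz is fixed so the result does not depend on the local time zone
parsed <- strptime(raw_dates, format = "%m/%d/%y %H:%M", tz = "UTC")
dates <- as.Date(parsed)
print(dates)

# Locale-independent month and weekday numbers
month_number <- as.POSIXlt(dates)$mon + 1    # 1 = January ... 12 = December
weekday_number <- as.POSIXlt(dates)$wday     # 0 = Sunday ... 6 = Saturday
print(data.frame(raw_dates, dates, month_number, weekday_number))
```

Note that as.Date() discards the time of day, so the hour and minute are no longer available after the Date column is replaced.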
In which month did the fewest motor vehicle thefts occur? February.
table(mvt$Month)
##
## April August December February January July June
## 15280 16572 16426 13511 16047 16801 16002
## March May November October September
## 15758 16035 16063 17086 16060
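Instead of scanning the table by eye, the smallest and largest counts can be picked out programmatically. The sketch below copies the monthly counts from the output above into a named vector (`thefts_by_month` is an illustrative name) and uses R's built-in `month.name` constant to put them in calendar order.

```r
# Monthly counts copied from the table(mvt$Month) output above
thefts_by_month <- c(
  April = 15280, August = 16572, December = 16426, February = 13511,
  January = 16047, July = 16801, June = 16002, March = 15758,
  May = 16035, November = 16063, October = 17086, September = 16060
)

# Month with the fewest and the most thefts
fewest <- names(which.min(thefts_by_month))
most <- names(which.max(thefts_by_month))
print(c(fewest = fewest, most = most))

# Same counts in calendar order instead of alphabetical order
print(thefts_by_month[month.name])
```

On the real data frame, `which.min(table(mvt$Month))` should return the same month, assuming the month names are in English.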
On which weekday did the most motor vehicle thefts occur? Friday.
table(mvt$Weekday)
##
## Friday Monday Saturday Sunday Thursday Tuesday Wednesday
## 29284 27397 27118 26316 27319 26791 27416
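table() lists the weekdays alphabetically because character values are sorted. One way to get calendar order is to convert the values to a factor with explicit levels before tabulating. The sketch below uses a small made-up vector, `sample_days`, not the real data.

```r
# Calendar order for the weekday names (English locale assumed)
weekday_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                    "Friday", "Saturday", "Sunday")

# Made-up sample values standing in for mvt$Weekday
sample_days <- c("Friday", "Monday", "Friday", "Sunday", "Wednesday")

# A factor with explicit levels controls the order of the table columns
day_counts <- table(factor(sample_days, levels = weekday_levels))
print(day_counts)
```

Levels that never occur, such as Tuesday here, still appear in the table with a count of zero.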
Each observation in the dataset represents a motor vehicle theft, and the Arrest variable indicates whether an arrest was later made for this theft. Which month has the largest number of motor vehicle thefts for which an arrest was made? January.
table(mvt$Arrest, mvt$Month)
##
## April August December February January July June March May
## FALSE 14028 15243 15029 12273 14612 15477 14772 14460 14848
## TRUE 1252 1329 1397 1238 1435 1324 1230 1298 1187
##
## November October September
## FALSE 14807 15744 14812
## TRUE 1256 1342 1248
Make a histogram of the Date variable to see how the number of thefts changes over time.
hist(mvt$Date, breaks=100)

- Create a boxplot of the variable “Date”, grouped by the variable “Arrest”. In a boxplot, the bold horizontal line is the median value of the data, the box shows the range of values between the first quartile and third quartile, and the whiskers (the dotted lines extending outside the box) show the minimum and maximum values, excluding any outliers (which are plotted as circles). Outliers are defined by first computing the difference between the third and first quartile values, which is the height of the box. This number is called the interquartile range (IQR). With R's default settings, any point that lies more than 1.5 times the IQR above the third quartile or below the first quartile is considered an outlier.
- The box for Arrest=TRUE is clearly shifted towards the bottom of the plot, meaning that more of the crimes for which arrests were made occurred in the first half of the time period.
boxplot(mvt$Date ~ mvt$Arrest)
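The outlier rule can be checked numerically with boxplot.stats(), which returns the values that boxplot() draws. The sketch below uses made-up numbers, not the theft data, and the variable names are illustrative. R's default multiplier is 1.5 (the `coef` argument here, `range` in boxplot()), and the box edges are Tukey's hinges, which can differ slightly from the quartiles returned by quantile().

```r
# Toy data (not from the dataset): nine evenly spaced values plus one extreme point
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

# boxplot.stats() returns the five numbers drawn by boxplot() and the flagged outliers
bs <- boxplot.stats(x, coef = 1.5)   # coef = 1.5 is the default multiplier

lower_hinge <- bs$stats[2]
upper_hinge <- bs$stats[4]
box_height <- upper_hinge - lower_hinge

# Fences: points outside these limits are plotted as circles
upper_fence <- upper_hinge + 1.5 * box_height
lower_fence <- lower_hinge - 1.5 * box_height

print(bs$stats)                      # lower whisker, hinge, median, hinge, upper whisker
print(c(lower_fence, upper_fence))
print(bs$out)                        # values flagged as outliers
```

In this toy example only the value 30 falls outside the fences, so the upper whisker stops at 9, the largest value that is not an outlier.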

For what proportion of motor vehicle thefts in 2001 was an arrest made?
- 2152/(2152+18517) = 0.1041
table(mvt$Arrest, mvt$Year)
##
## 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011
## FALSE 18517 16638 14859 15169 14956 14796 13068 13425 11327 14796 15012
## TRUE 2152 2115 1798 1693 1528 1302 1212 1020 840 701 625
##
## 2012
## FALSE 13542
## TRUE 550
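The same proportion can be computed for every year at once with prop.table(). The sketch below rebuilds the table from the counts printed above, so it runs without the CSV; `arrests_by_year` and `arrest_rate` are illustrative names.

```r
# Counts copied from the table(mvt$Arrest, mvt$Year) output above
arrests_by_year <- rbind(
  "FALSE" = c(18517, 16638, 14859, 15169, 14956, 14796,
              13068, 13425, 11327, 14796, 15012, 13542),
  "TRUE"  = c(2152, 2115, 1798, 1693, 1528, 1302,
              1212, 1020, 840, 701, 625, 550)
)
colnames(arrests_by_year) <- as.character(2001:2012)

# margin = 2 divides each column (year) by its own total
year_shares <- prop.table(arrests_by_year, margin = 2)
arrest_rate <- year_shares["TRUE", ]
print(round(arrest_rate, 4))
```

On the real data frame, `prop.table(table(mvt$Arrest, mvt$Year), margin = 2)` should give the same result. The rate is about 10.4% in 2001 and about 3.9% in 2012, which is consistent with the boxplot above.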
If you create a table of the LocationDescription variable, it is unfortunately very hard to read, since there are 78 different locations in the data set. By using the sort function, we can view the same table sorted by the number of observations in each category, from least to most frequent.
sort(table(mvt$LocationDescription))
##
## AIRPORT BUILDING NON-TERMINAL - SECURE AREA
## 1
## AIRPORT EXTERIOR - SECURE AREA
## 1
## ANIMAL HOSPITAL
## 1
## APPLIANCE STORE
## 1
## CTA TRAIN
## 1
## JAIL / LOCK-UP FACILITY
## 1
## NEWSSTAND
## 1
## BRIDGE
## 2
## COLLEGE/UNIVERSITY RESIDENCE HALL
## 2
## CURRENCY EXCHANGE
## 2
## BOWLING ALLEY
## 3
## CLEANING STORE
## 3
## MEDICAL/DENTAL OFFICE
## 3
## ABANDONED BUILDING
## 4
## AIRPORT BUILDING NON-TERMINAL - NON-SECURE AREA
## 4
## BARBERSHOP
## 4
## LAKEFRONT/WATERFRONT/RIVERBANK
## 4
## LIBRARY
## 4
## SAVINGS AND LOAN
## 4
## AIRPORT TERMINAL UPPER LEVEL - NON-SECURE AREA
## 5
## CHA APARTMENT
## 5
## DAY CARE CENTER
## 5
## FIRE STATION
## 5
## FOREST PRESERVE
## 6
## BANK
## 7
## CONVENIENCE STORE
## 7
## DRUG STORE
## 8
## OTHER COMMERCIAL TRANSPORTATION
## 8
## ATHLETIC CLUB
## 9
## AIRPORT VENDING ESTABLISHMENT
## 10
## AIRPORT PARKING LOT
## 11
## SCHOOL, PRIVATE, BUILDING
## 14
## TAVERN/LIQUOR STORE
## 14
## FACTORY/MANUFACTURING BUILDING
## 16
## BAR OR TAVERN
## 17
## WAREHOUSE
## 17
## MOVIE HOUSE/THEATER
## 18
## RESIDENCE PORCH/HALLWAY
## 18
## NURSING HOME/RETIREMENT HOME
## 21
## TAXICAB
## 21
## DEPARTMENT STORE
## 22
## HIGHWAY/EXPRESSWAY
## 22
## SCHOOL, PRIVATE, GROUNDS
## 23
## VEHICLE-COMMERCIAL
## 23
## AIRPORT EXTERIOR - NON-SECURE AREA
## 24
## OTHER RAILROAD PROP / TRAIN DEPOT
## 28
## SMALL RETAIL STORE
## 33
## CONSTRUCTION SITE
## 35
## CAR WASH
## 44
## COLLEGE/UNIVERSITY GROUNDS
## 47
## GOVERNMENT BUILDING/PROPERTY
## 48
## RESTAURANT
## 49
## CHURCH/SYNAGOGUE/PLACE OF WORSHIP
## 56
## GROCERY FOOD STORE
## 80
## HOSPITAL BUILDING/GROUNDS
## 101
## SCHOOL, PUBLIC, BUILDING
## 114
## HOTEL/MOTEL
## 124
## COMMERCIAL / BUSINESS OFFICE
## 126
## CTA GARAGE / OTHER PROPERTY
## 148
## SPORTS ARENA/STADIUM
## 166
## APARTMENT
## 184
## SCHOOL, PUBLIC, GROUNDS
## 206
## PARK PROPERTY
## 255
## POLICE FACILITY/VEH PARKING LOT
## 266
## AIRPORT/AIRCRAFT
## 363
## CHA PARKING LOT/GROUNDS
## 405
## SIDEWALK
## 462
## VEHICLE NON-COMMERCIAL
## 817
## VACANT LOT/LAND
## 985
## RESIDENCE-GARAGE
## 1176
## RESIDENCE
## 1302
## RESIDENTIAL YARD (FRONT/BACK)
## 1536
## DRIVEWAY - RESIDENTIAL
## 1675
## GAS STATION
## 2111
## ALLEY
## 2308
## OTHER
## 4573
## PARKING LOT/GARAGE(NON.RESID.)
## 14852
## STREET
## 156564
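With 78 rows, the interesting part of the sorted table is its tail. A minimal sketch of picking the most frequent locations programmatically follows; it copies the seven largest counts from the output above into a named vector (`location_counts` is an illustrative name) and skips the catch-all OTHER category.

```r
# Seven largest counts copied from the sorted table above
location_counts <- c(
  "RESIDENTIAL YARD (FRONT/BACK)" = 1536,
  "DRIVEWAY - RESIDENTIAL" = 1675,
  "GAS STATION" = 2111,
  "ALLEY" = 2308,
  "OTHER" = 4573,
  "PARKING LOT/GARAGE(NON.RESID.)" = 14852,
  "STREET" = 156564
)

# Sort from most to least frequent, drop OTHER, and keep the first five names
ranked <- sort(location_counts, decreasing = TRUE)
named_locations <- names(ranked)[names(ranked) != "OTHER"]
top5_names <- named_locations[1:5]
print(top5_names)
```

On the real data, `sort(table(mvt$LocationDescription), decreasing = TRUE)` would play the role of `ranked`.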
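Create a subset of the data, keeping only observations for which the theft happened in one of the five most frequent locations, excluding the catch-all “OTHER” category: STREET, PARKING LOT/GARAGE(NON.RESID.), ALLEY, GAS STATION, and DRIVEWAY - RESIDENTIAL. The factor() call then drops the location levels that no longer occur in the subset.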
top5_locations = c("STREET", "PARKING LOT/GARAGE(NON.RESID.)", "ALLEY", "GAS STATION", "DRIVEWAY - RESIDENTIAL")
Top5 = subset(mvt, LocationDescription %in% top5_locations)
Top5$LocationDescription = factor(Top5$LocationDescription)
str(Top5)
## 'data.frame': 177510 obs. of 13 variables:
## $ ID : int 8951354 8951141 8952223 8951608 8950793 8950760 8951611 8951802 8950706 8951585 ...
## $ Date : Date, format: "2012-12-31" "2012-12-31" ...
## $ LocationDescription: Factor w/ 5 levels "ALLEY","DRIVEWAY - RESIDENTIAL",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ Arrest : logi FALSE FALSE FALSE FALSE TRUE FALSE ...
## $ Domestic : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Beat : int 623 1213 724 211 2521 423 231 1021 1215 1011 ...
## $ District : int 6 12 7 2 25 4 2 10 12 10 ...
## $ CommunityArea : int 69 24 67 35 19 48 40 29 24 29 ...
## $ Year : int 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
## $ Latitude : num 41.8 41.9 41.8 41.8 41.9 ...
## $ Longitude : num -87.6 -87.7 -87.7 -87.6 -87.8 ...
## $ Month : chr "December" "December" "December" "December" ...
## $ Weekday : chr "Monday" "Monday" "Monday" "Monday" ...
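The str() output shows LocationDescription with 5 levels instead of 78 because the column was passed through factor() again after subsetting. A subset of a factor keeps all of its original levels, so without that step the table that follows would list every unused location with a count of zero. The sketch below shows the effect on a made-up factor, `loc`.

```r
# Made-up factor with three levels: ALLEY, BANK, STREET
loc <- factor(c("STREET", "ALLEY", "BANK", "STREET"))

# Subsetting removes the BANK observation but keeps the BANK level
kept <- loc[loc %in% c("STREET", "ALLEY")]
print(nlevels(kept))

# Re-creating the factor, or calling droplevels(), removes unused levels
refactored <- factor(kept)
print(levels(refactored))
print(levels(droplevels(kept)))
```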
table(Top5$LocationDescription, Top5$Arrest)
##
## FALSE TRUE
## ALLEY 2059 249
## DRIVEWAY - RESIDENTIAL 1543 132
## GAS STATION 1672 439
## PARKING LOT/GARAGE(NON.RESID.) 13249 1603
## STREET 144969 11595
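Raw counts are dominated by STREET, so the locations are easier to compare by arrest rate. The sketch below rebuilds the table from the counts printed above and divides the arrests by each row total; `arrests_by_location` and `location_rate` are illustrative names.

```r
# Counts copied from the table(Top5$LocationDescription, Top5$Arrest) output above
arrests_by_location <- rbind(
  "ALLEY" = c(2059, 249),
  "DRIVEWAY - RESIDENTIAL" = c(1543, 132),
  "GAS STATION" = c(1672, 439),
  "PARKING LOT/GARAGE(NON.RESID.)" = c(13249, 1603),
  "STREET" = c(144969, 11595)
)
colnames(arrests_by_location) <- c("FALSE", "TRUE")

# Fraction of thefts at each location that led to an arrest
location_rate <- arrests_by_location[, "TRUE"] / rowSums(arrests_by_location)
print(round(sort(location_rate, decreasing = TRUE), 3))
```

In these counts, GAS STATION has the highest arrest rate (about 21%) and STREET the lowest (about 7%), even though STREET has by far the most arrests in absolute terms.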