How many rows of data (observations) are in this dataset?
mvt<-read.table("D:/Data/Unit1/mvtWeek1.csv", header = TRUE, sep = ",")
nrow(mvt)
## [1] 191641
How many variables are in this dataset?
ncol(mvt)
## [1] 11
Using the “max” function, what is the maximum value of the variable “ID”?
max(mvt$ID)
## [1] 9181151
What is the minimum value of the variable “Beat”?
min(mvt$Beat)
## [1] 111
How many observations have value TRUE in the Arrest variable (this is the number of crimes for which an arrest was made)?
sum(mvt$Arrest==TRUE)
## [1] 15536
How many observations have a LocationDescription value of ALLEY?
sum(mvt$LocationDescription=="ALLEY")
## [1] 2308
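The two counts above rely on R treating TRUE as 1 and FALSE as 0 when a logical vector is summed. A minimal sketch on a made-up data frame (the `toy` object and its values are illustrative and not taken from mvtWeek1.csv):

```r
# Hypothetical mini data frame standing in for mvt
toy <- data.frame(
  Arrest = c(TRUE, FALSE, FALSE, TRUE, FALSE),
  LocationDescription = c("ALLEY", "STREET", "ALLEY", "STREET", "ALLEY"),
  stringsAsFactors = FALSE
)

# A comparison returns a logical vector; sum() counts its TRUE entries
n_arrests <- sum(toy$Arrest)   # equivalent to sum(toy$Arrest == TRUE)
n_alley   <- sum(toy$LocationDescription == "ALLEY")

# table() reports the count of every category in one call
location_counts <- table(toy$LocationDescription)
```

Since Arrest is already logical, the `== TRUE` comparison is optional.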
In many datasets, like this one, you have a date field. Unfortunately, R does not automatically recognize entries that look like dates. We need to use a function in R to extract the date and time. Take a look at the first entry of Date (remember to use square brackets when looking at a certain entry of a variable).
In what format are the entries in the variable Date?
mvt$Date[1]
## [1] 12/31/12 23:15
## 131680 Levels: 1/1/01 0:01 1/1/01 0:05 1/1/01 0:30 1/1/01 1:17 ... 9/9/12 9:50
#Month/Day/Year Hour:Minute
Now, let’s convert these characters into a Date object in R. In your R console, type
DateConvert = as.Date(strptime(mvt$Date, "%m/%d/%y %H:%M"))
This converts the variable “Date” into a Date object in R. Take a look at the variable DateConvert using the summary function.
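To see what the format string does, it may help to convert a single entry by hand. The sketch below uses the first Date value shown earlier; the variable names are illustrative only.

```r
# One entry in the same "Month/Day/Year Hour:Minute" layout as mvt$Date
raw <- "12/31/12 23:15"

# %m = month, %d = day, %y = two-digit year, %H = hour (00-23), %M = minute
parsed <- strptime(raw, format = "%m/%d/%y %H:%M")

# as.Date() drops the time of day and keeps the calendar date
converted <- as.Date(parsed)

# Text that does not match the format becomes NA instead of raising an error
bad <- as.Date(strptime("2012-12-31", format = "%m/%d/%y %H:%M"))
```

Checking `sum(is.na(DateConvert))` after the real conversion is a cheap way to confirm that every entry matched the format.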
What is the month and year of the median date in our dataset? Enter your answer as “Month Year”, without the quotes. (Ex: if the answer was 2008-03-28, you would give the answer “March 2008”, without the quotes.)
DateConvert = as.Date(strptime(mvt$Date, "%m/%d/%y %H:%M"))
summary(DateConvert)
## Min. 1st Qu. Median Mean 3rd Qu.
## "2001-01-01" "2003-07-10" "2006-05-21" "2006-08-23" "2009-10-24"
## Max.
## "2012-12-31"
#May 2006
Now, let’s extract the month and the day of the week, and add these variables to our data frame mvt. We can do this with two simple functions. Type the following commands in R:
mvt$Month = months(DateConvert)
mvt$Weekday = weekdays(DateConvert)
This creates two new variables in our data frame, Month and Weekday, and sets them equal to the month and weekday values that we can extract from the Date object. Lastly, replace the old Date variable with DateConvert by typing:
mvt$Date = DateConvert
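Note that months() and weekdays() return names in the language of the R session's locale, so the labels (and the column order in table output) can differ between machines. As a hedged alternative, the sketch below derives English names from the numeric fields of a POSIXlt object; the helper vector `day_names` is introduced here for illustration.

```r
d <- as.Date(c("2012-12-31", "2006-05-21"))
lt <- as.POSIXlt(d)

# lt$mon counts months from 0 (January); month.name is built into R
month_en <- month.name[lt$mon + 1]

# lt$wday counts weekdays from 0 (Sunday)
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")
weekday_en <- day_names[lt$wday + 1]

# A factor with explicit levels keeps table() output in calendar order
month_fac <- factor(month_en, levels = month.name)
```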
Using the table command, answer the following questions.
In which month did the fewest motor vehicle thefts occur?
mvt$Month = months(DateConvert)
mvt$Weekday = weekdays(DateConvert)
mvt$Date = DateConvert
table(mvt$Month)
##
##   January      July September  February    August  November  December   October     March       May
##     16047     16801     16060     13511     16572     16063     16426     17086     15758     16035
##      June     April
##     16002     15280
#February
On which weekday did the most motor vehicle thefts occur?
table(mvt$Weekday)
##
##    Monday   Tuesday Wednesday    Friday  Saturday    Sunday  Thursday
##     27397     26791     27416     29284     27118     26316     27319
#Friday
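Rather than reading the extreme value off the table by eye, which.max() and which.min() return the position and name of the largest and smallest entry. The sketch below re-enters the weekday counts from the output above as a named vector; on the real data the same calls can be applied directly to table(mvt$Weekday).

```r
# Weekday counts copied from the table(mvt$Weekday) output above
weekday_counts <- c(Monday = 27397, Tuesday = 26791, Wednesday = 27416,
                    Thursday = 27319, Friday = 29284, Saturday = 27118,
                    Sunday = 26316)

busiest  <- names(which.max(weekday_counts))  # weekday with the most thefts
quietest <- names(which.min(weekday_counts))  # weekday with the fewest thefts
```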
Each observation in the dataset represents a motor vehicle theft, and the Arrest variable indicates whether an arrest was later made for this theft. Which month has the largest number of motor vehicle thefts for which an arrest was made?
table(mvt$Arrest,mvt$Month)
##
##         January  July September February August November December October March   May
##   FALSE   14612 15477     14812    12273  15243    14807    15029   15744 14460 14848
##   TRUE     1435  1324      1248     1238   1329     1256     1397    1342  1298  1187
##
##          June April
##   FALSE 14772 14028
##   TRUE   1230  1252
#January
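A two-way table can be indexed by its row and column labels, which makes it easy to pull out only the arrest counts. A small sketch on invented data (the `toy` data frame is hypothetical):

```r
toy <- data.frame(
  Arrest = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE),
  Month  = c("January", "January", "January", "February", "February", "March"),
  stringsAsFactors = FALSE
)

tab <- table(toy$Arrest, toy$Month)

# The row labelled "TRUE" holds the number of arrests per month
arrests_by_month <- tab["TRUE", ]

# Month with the largest number of arrests
top_month <- names(which.max(arrests_by_month))
```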
Now, let’s make some plots to help us better understand how crime has changed over time in Chicago. Throughout this problem, and in general, you can save your plot to a file. For more information, this website very clearly explains the process.
First, let’s make a histogram of the variable Date. We’ll add an extra argument, to specify the number of bars we want in our histogram. In your R console, type
hist(mvt$Date, breaks=100)
hist(mvt$Date, breaks=100)
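Saving a plot to a file generally means opening a graphics device, drawing, and then closing the device. A minimal sketch, assuming a PDF is acceptable; the synthetic `dates` vector and the `out_file` path are placeholders and not part of the assignment:

```r
# Synthetic dates for illustration only (not the Chicago data)
set.seed(1)
dates <- as.Date("2001-01-01") + sample(0:4382, 500, replace = TRUE)

# Hypothetical output path; tempfile() avoids cluttering the working directory
out_file <- tempfile(fileext = ".pdf")

pdf(out_file)                # open a PDF graphics device
hist(dates, breaks = 100)    # the plot is drawn into the file, not on screen
dev.off()                    # close the device so the file is written
```

Other devices such as png() follow the same open, draw, close pattern.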
Looking at the histogram, answer the following questions.
In general, does it look like crime increases or decreases from 2002 - 2012?
#Decreases
In general, does it look like crime increases or decreases from 2005 - 2008?
#Decreases
Now, let’s see how arrests have changed over time. Create a boxplot of the variable “Date”, sorted by the variable “Arrest” (if you are not familiar with boxplots and would like to learn more, check out this tutorial). In a boxplot, the bold horizontal line is the median value of the data, the box shows the range of values between the first quartile and third quartile, and the whiskers (the dotted lines extending outside the box) show the minimum and maximum values, excluding any outliers (which are plotted as circles). Outliers are defined by first computing the difference between the first and third quartile values, or the height of the box. This number is called the Inter-Quartile Range (IQR). With R’s default settings, any point that is greater than the third quartile plus 1.5 times the IQR, or less than the first quartile minus 1.5 times the IQR, is considered an outlier.
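The outlier rule can be checked numerically with boxplot.stats(), the function boxplot() relies on for its statistics. Its default coef = 1.5 places the fences 1.5 box heights beyond the box. The vector `x` below is made up for illustration; note that boxplot.stats() measures the box with Tukey's hinges, which can differ slightly from quantile()-based quartiles.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

stats <- boxplot.stats(x)            # default coef = 1.5

hinges <- stats$stats[c(2, 4)]       # lower and upper edge of the box
box_height <- diff(hinges)           # the IQR as boxplot() measures it

upper_fence <- hinges[2] + 1.5 * box_height
lower_fence <- hinges[1] - 1.5 * box_height

outliers <- stats$out                # points lying beyond the fences
```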
Does it look like there were more crimes for which arrests were made in the first half of the time period or the second half of the time period? (Note that the time period is from 2001 to 2012, so the middle of the time period is the beginning of 2007.)
boxplot(mvt$Date~mvt$Arrest)
#First half
Let’s investigate this further. Use the table function for the next few questions.
For what proportion of motor vehicle thefts in 2001 was an arrest made?
Note: in this question and many others in the course, we are asking for an answer as a proportion. Therefore, your answer should take a value between 0 and 1.
table(mvt$Year,mvt$Arrest)
##
## FALSE TRUE
## 2001 18517 2152
## 2002 16638 2115
## 2003 14859 1798
## 2004 15169 1693
## 2005 14956 1528
## 2006 14796 1302
## 2007 13068 1212
## 2008 13425 1020
## 2009 11327 840
## 2010 14796 701
## 2011 15012 625
## 2012 13542 550
2152/(18517+2152)
## [1] 0.1041173
For what proportion of motor vehicle thefts in 2007 was an arrest made?
1212/(13068+1212)
## [1] 0.08487395
For what proportion of motor vehicle thefts in 2012 was an arrest made?
550/(13542+550)
## [1] 0.03902924
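Instead of dividing by hand for each year, prop.table() can convert a table of counts into row proportions in one step. The sketch below re-enters three rows of the Year by Arrest table as a matrix; on the real data, prop.table(table(mvt$Year, mvt$Arrest), 1) applies the same calculation to every year.

```r
# Counts copied from three rows of the table(mvt$Year, mvt$Arrest) output
counts <- matrix(c(18517, 2152,
                   13068, 1212,
                   13542,  550),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(c("2001", "2007", "2012"), c("FALSE", "TRUE")))

# margin = 1 divides each entry by its row total
rates <- prop.table(counts, margin = 1)

# Proportion of thefts with an arrest, per year
arrest_rate <- rates[, "TRUE"]
```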
Since there may still be open investigations for recent crimes, this could explain the trend we are seeing in the data. There could also be other factors at play, and this trend should be investigated further. However, since we don’t know when the arrests were actually made, our detective work in this area has reached a dead end.
Analyzing this data could be useful to the Chicago Police Department when deciding where to allocate resources. If they want to increase the number of arrests that are made for motor vehicle thefts, where should they focus their efforts?
We want to find the top five locations where motor vehicle thefts occur. If you create a table of the LocationDescription variable, it is unfortunately very hard to read since there are 78 different locations in the data set. By using the sort function, we can view this same table, but sorted by the number of observations in each category. In your R console, type:
sort(table(mvt$LocationDescription))
Which locations are the top five locations for motor vehicle thefts, excluding the “Other” category? You should select 5 of the following options.
sort(table(mvt$LocationDescription))
##
## AIRPORT BUILDING NON-TERMINAL - SECURE AREA
## 1
## AIRPORT EXTERIOR - SECURE AREA
## 1
## ANIMAL HOSPITAL
## 1
## APPLIANCE STORE
## 1
## CTA TRAIN
## 1
## JAIL / LOCK-UP FACILITY
## 1
## NEWSSTAND
## 1
## BRIDGE
## 2
## COLLEGE/UNIVERSITY RESIDENCE HALL
## 2
## CURRENCY EXCHANGE
## 2
## BOWLING ALLEY
## 3
## CLEANING STORE
## 3
## MEDICAL/DENTAL OFFICE
## 3
## ABANDONED BUILDING
## 4
## AIRPORT BUILDING NON-TERMINAL - NON-SECURE AREA
## 4
## BARBERSHOP
## 4
## LAKEFRONT/WATERFRONT/RIVERBANK
## 4
## LIBRARY
## 4
## SAVINGS AND LOAN
## 4
## AIRPORT TERMINAL UPPER LEVEL - NON-SECURE AREA
## 5
## CHA APARTMENT
## 5
## DAY CARE CENTER
## 5
## FIRE STATION
## 5
## FOREST PRESERVE
## 6
## BANK
## 7
## CONVENIENCE STORE
## 7
## DRUG STORE
## 8
## OTHER COMMERCIAL TRANSPORTATION
## 8
## ATHLETIC CLUB
## 9
## AIRPORT VENDING ESTABLISHMENT
## 10
## AIRPORT PARKING LOT
## 11
## SCHOOL, PRIVATE, BUILDING
## 14
## TAVERN/LIQUOR STORE
## 14
## FACTORY/MANUFACTURING BUILDING
## 16
## BAR OR TAVERN
## 17
## WAREHOUSE
## 17
## MOVIE HOUSE/THEATER
## 18
## RESIDENCE PORCH/HALLWAY
## 18
## NURSING HOME/RETIREMENT HOME
## 21
## TAXICAB
## 21
## DEPARTMENT STORE
## 22
## HIGHWAY/EXPRESSWAY
## 22
## SCHOOL, PRIVATE, GROUNDS
## 23
## VEHICLE-COMMERCIAL
## 23
## AIRPORT EXTERIOR - NON-SECURE AREA
## 24
## OTHER RAILROAD PROP / TRAIN DEPOT
## 28
## SMALL RETAIL STORE
## 33
## CONSTRUCTION SITE
## 35
## CAR WASH
## 44
## COLLEGE/UNIVERSITY GROUNDS
## 47
## GOVERNMENT BUILDING/PROPERTY
## 48
## RESTAURANT
## 49
## CHURCH/SYNAGOGUE/PLACE OF WORSHIP
## 56
## GROCERY FOOD STORE
## 80
## HOSPITAL BUILDING/GROUNDS
## 101
## SCHOOL, PUBLIC, BUILDING
## 114
## HOTEL/MOTEL
## 124
## COMMERCIAL / BUSINESS OFFICE
## 126
## CTA GARAGE / OTHER PROPERTY
## 148
## SPORTS ARENA/STADIUM
## 166
## APARTMENT
## 184
## SCHOOL, PUBLIC, GROUNDS
## 206
## PARK PROPERTY
## 255
## POLICE FACILITY/VEH PARKING LOT
## 266
## AIRPORT/AIRCRAFT
## 363
## CHA PARKING LOT/GROUNDS
## 405
## SIDEWALK
## 462
## VEHICLE NON-COMMERCIAL
## 817
## VACANT LOT/LAND
## 985
## RESIDENCE-GARAGE
## 1176
## RESIDENCE
## 1302
## RESIDENTIAL YARD (FRONT/BACK)
## 1536
## DRIVEWAY - RESIDENTIAL
## 1675
## GAS STATION
## 2111
## ALLEY
## 2308
## OTHER
## 4573
## PARKING LOT/GARAGE(NON.RESID.)
## 14852
## STREET
## 156564
#STREET, PARKING LOT/GARAGE(NON.RESID.), ALLEY, GAS STATION, DRIVEWAY - RESIDENTIAL
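With 78 categories, it can be easier to sort in decreasing order and keep only the first few entries. A sketch on a made-up vector of locations (the values are illustrative):

```r
locations <- c(rep("STREET", 4), rep("OTHER", 3), rep("ALLEY", 2), "GAS STATION")

# Largest counts first
counts <- sort(table(locations), decreasing = TRUE)

# The two most frequent categories
top2 <- head(counts, 2)

# The same, after dropping the catch-all "OTHER" category
top2_no_other <- head(counts[names(counts) != "OTHER"], 2)
```

On the real data, the analogous call would be head(sort(table(mvt$LocationDescription), decreasing = TRUE), 6), followed by dropping OTHER.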
Create a subset of your data, only taking observations for which the theft happened in one of these five locations, and call this new data set “Top5”. To do this, you can use the | symbol. In lecture, we used the & symbol to use two criteria to make a subset of the data. To only take observations that have a certain value in one variable or the other, the | character can be used in place of the & symbol. This is also called a logical “or” operation.
Alternately, you could create five different subsets, and then merge them together into one data frame using rbind.
How many observations are in Top5?
TopLocations = c("STREET", "PARKING LOT/GARAGE(NON.RESID.)", "ALLEY", "GAS STATION", "DRIVEWAY - RESIDENTIAL")
Top5 = subset(mvt, LocationDescription %in% TopLocations)
nrow(Top5)
#177510 (156564 + 14852 + 2308 + 2111 + 1675, the five counts from the sorted table)
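The three approaches mentioned above (chaining | comparisons, using %in%, and stacking subsets with rbind) should select the same rows. A small sketch on invented data, with a hypothetical `toy` data frame:

```r
toy <- data.frame(
  ID = 1:6,
  LocationDescription = c("STREET", "ALLEY", "BANK", "STREET", "GAS STATION", "BANK"),
  stringsAsFactors = FALSE
)

# Option 1: chain comparisons with | (logical "or")
with_or <- subset(toy, LocationDescription == "STREET" | LocationDescription == "ALLEY")

# Option 2: %in% tests membership in a vector of values
with_in <- subset(toy, LocationDescription %in% c("STREET", "ALLEY"))

# Option 3: one subset per value, stacked with rbind()
with_rbind <- rbind(subset(toy, LocationDescription == "STREET"),
                    subset(toy, LocationDescription == "ALLEY"))
```

Options 1 and 2 keep the original row order, while rbind() groups the rows by location; the set of rows is the same.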
R will remember the other categories of the LocationDescription variable from the original dataset, so running table(Top5$LocationDescription) will have a lot of unnecessary output. To make our tables a bit nicer to read, we can refresh this factor variable. In your R console, type:
Top5$LocationDescription = factor(Top5$LocationDescription)
If you run the str or table function on Top5 now, you should see that LocationDescription now only has 5 values, as we expect.
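The effect of refreshing a factor can be seen on a tiny example. Subsetting keeps every original level, while factor() (or the equivalent droplevels()) keeps only the levels still present. The vector `loc` is made up for illustration.

```r
loc <- factor(c("STREET", "ALLEY", "BANK", "STREET"))

kept <- loc[loc != "BANK"]      # "BANK" is gone from the data but remains a level
n_before <- nlevels(kept)

refreshed <- factor(kept)       # rebuild the factor from the values that remain
dropped   <- droplevels(kept)   # built-in shortcut with the same effect here

empty_count <- table(kept)[["BANK"]]   # the stale level shows up with a count of 0
```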
Use the Top5 data frame to answer the remaining questions.
One of the locations has a much higher arrest rate than the other locations. Which is it? Please enter the text in exactly the same way as how it looks in the answer options for Problem 4.1.
Top5$LocationDescription = factor(Top5$LocationDescription)
table(Top5$LocationDescription,Top5$Arrest)
##
##                                   FALSE   TRUE
##   ALLEY                            2059    249
##   DRIVEWAY - RESIDENTIAL           1543    132
##   GAS STATION                      1672    439
##   PARKING LOT/GARAGE(NON.RESID.)  13249   1603
##   STREET                         144969  11595
#GAS STATION
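Comparing arrest rates requires dividing the TRUE column by each row total, since the raw counts differ greatly in size. The sketch below re-enters the five non-zero rows of the table above as a matrix; the object names are illustrative.

```r
# Counts copied from the five non-zero rows of the Top5 table
counts <- matrix(c(  2059,   249,
                     1543,   132,
                     1672,   439,
                    13249,  1603,
                   144969, 11595),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(c("ALLEY", "DRIVEWAY - RESIDENTIAL", "GAS STATION",
                                   "PARKING LOT/GARAGE(NON.RESID.)", "STREET"),
                                 c("FALSE", "TRUE")))

# Share of thefts at each location that led to an arrest
arrest_rate <- counts[, "TRUE"] / rowSums(counts)

highest <- names(which.max(arrest_rate))
```

On the data frame itself, tapply(Top5$Arrest, Top5$LocationDescription, mean) should give the same rates, because the mean of a logical vector is the share of TRUE values.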
On which day of the week do the most motor vehicle thefts at gas stations happen? (Monday~Sunday)
table(Top5$Weekday,Top5$LocationDescription=="GAS STATION")
##
##             FALSE  TRUE
##   Monday    25008   280
##   Tuesday   24527   270
##   Wednesday 25025   273
##   Friday    26746   332
##   Saturday  24917   338
##   Sunday    24220   336
##   Thursday  24956   282
#Saturday
On which day of the week do the fewest motor vehicle thefts in residential driveways happen? (Monday~Sunday)
table(mvt$Weekday,mvt$LocationDescription=="DRIVEWAY - RESIDENTIAL")
##
##             FALSE  TRUE
##   Monday    27142   255
##   Tuesday   26548   243
##   Wednesday 27182   234
##   Friday    29027   257
##   Saturday  26916   202
##   Sunday    26095   221
##   Thursday  27056   263
#Saturday
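An alternative to tabulating against a TRUE/FALSE comparison is to filter first and then tabulate, which leaves only the column of interest. A sketch on invented data (the `toy` data frame is hypothetical):

```r
toy <- data.frame(
  Weekday = c("Monday", "Friday", "Friday", "Saturday", "Monday", "Friday"),
  LocationDescription = c("GAS STATION", "GAS STATION", "STREET",
                          "GAS STATION", "STREET", "GAS STATION"),
  stringsAsFactors = FALSE
)

# Keep only the weekdays of thefts that happened at gas stations
gas_days <- toy$Weekday[toy$LocationDescription == "GAS STATION"]

# One count per weekday, with no FALSE column to skip over
gas_by_day <- table(gas_days)

busiest_day <- names(which.max(gas_by_day))
```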