How many rows of data (observations) are in this dataset?
mvt = read.csv("Unit1/mvtWeek1.csv")
nrow(mvt)
[1] 191641
How many variables are in this dataset?
str(mvt)
'data.frame': 191641 obs. of 11 variables:
$ ID : int 8951354 8951141 8952745 8952223 8951608 8950793 8950760 8951611 8951802 8950706 ...
$ Date : Factor w/ 131680 levels "1/1/01 0:01",..: 42824 42823 42823 42823 42822 42821 42820 42819 42817 42816 ...
$ LocationDescription: Factor w/ 78 levels "ABANDONED BUILDING",..: 72 72 62 72 72 72 72 72 72 72 ...
$ Arrest : logi FALSE FALSE FALSE FALSE FALSE TRUE ...
$ Domestic : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ Beat : int 623 1213 1622 724 211 2521 423 231 1021 1215 ...
$ District : int 6 12 16 7 2 25 4 2 10 12 ...
$ CommunityArea : int 69 24 11 67 35 19 48 40 29 24 ...
$ Year : int 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
$ Latitude : num 41.8 41.9 42 41.8 41.8 ...
$ Longitude : num -87.6 -87.7 -87.8 -87.7 -87.6 ...
Using the “max” function, what is the maximum value of the variable “ID”?
max(mvt$ID)
[1] 9181151
What is the minimum value of the variable “Beat”?
min(mvt$Beat)
[1] 111
How many observations have value TRUE in the Arrest variable (this is the number of crimes for which an arrest was made)?
summary(mvt)
ID Date
Min. :1310022 5/16/08 0:00 : 11
1st Qu.:2832144 10/17/01 22:00: 10
Median :4762956 4/13/04 21:00 : 10
Mean :4968629 9/17/05 22:00 : 10
3rd Qu.:7201878 10/12/01 22:00: 9
Max. :9181151 10/13/01 22:00: 9
(Other) :191582
LocationDescription
STREET :156564
PARKING LOT/GARAGE(NON.RESID.): 14852
OTHER : 4573
ALLEY : 2308
GAS STATION : 2111
DRIVEWAY - RESIDENTIAL : 1675
(Other) : 9558
Arrest Domestic Beat
Mode :logical Mode :logical Min. : 111
FALSE:176105 FALSE:191226 1st Qu.: 722
TRUE :15536 TRUE :415 Median :1121
Mean :1259
3rd Qu.:1733
Max. :2535
District CommunityArea Year
Min. : 1.00 Min. : 0 Min. :2001
1st Qu.: 6.00 1st Qu.:22 1st Qu.:2003
Median :10.00 Median :32 Median :2006
Mean :11.82 Mean :38 Mean :2006
3rd Qu.:17.00 3rd Qu.:60 3rd Qu.:2009
Max. :31.00 Max. :77 Max. :2012
NA's :43056 NA's :24616
Latitude Longitude
Min. :41.64 Min. :-87.93
1st Qu.:41.77 1st Qu.:-87.72
Median :41.85 Median :-87.68
Mean :41.84 Mean :-87.68
3rd Qu.:41.92 3rd Qu.:-87.64
Max. :42.02 Max. :-87.52
NA's :2276 NA's :2276
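The number of TRUE values can also be counted directly, without scanning the full summary. This is a minimal sketch on a small made-up data frame; `toy` and its values are illustrative and not taken from mvtWeek1.csv.

```r
# Toy stand-in for mvt; the values are invented for illustration.
toy <- data.frame(Arrest = c(FALSE, TRUE, FALSE, TRUE, TRUE))

# TRUE counts as 1 and FALSE as 0, so sum() gives the number of arrests.
n_arrests <- sum(toy$Arrest)

# table() shows both counts side by side.
arrest_counts <- table(toy$Arrest)

print(n_arrests)
print(arrest_counts)
```

On the real data, sum(mvt$Arrest) should agree with the TRUE count of 15536 shown in the summary, since Arrest has no missing values there.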
How many observations have a LocationDescription value of ALLEY?
sum(mvt$LocationDescription == "ALLEY")
[1] 2308
In many datasets, like this one, you have a date field. Unfortunately, R does not automatically recognize entries that look like dates. We need to use a function in R to extract the date and time. Take a look at the first entry of Date (remember to use square brackets when looking at a certain entry of a variable).
In what format are the entries in the variable Date?
mvt$Date[1]
ANSWER: Month/Day/Year Hour:Minute (for example, 5/16/08 0:00)
Now, let’s convert these characters into a Date object in R. In your R console, type
DateConvert = as.Date(strptime(mvt$Date, "%m/%d/%y %H:%M"))
This converts the variable “Date” into a Date object in R. Take a look at the variable DateConvert using the summary function.
What is the month and year of the median date in our dataset? Enter your answer as “Month Year”, without the quotes. (Ex: if the answer was 2008-03-28, you would give the answer “March 2008”, without the quotes.)
median(DateConvert)
[1] "2006-05-21"
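The format string can be checked on a few values before it is applied to the whole column. This is a minimal sketch, assuming three sample strings in the same Month/Day/Year Hour:Minute layout; the vector `raw` is illustrative.

```r
# Hypothetical sample strings in the same layout as the Date column.
raw <- c("12/31/12 23:15", "1/1/01 0:01", "5/16/08 0:00")

# %m = month, %d = day, %y = two-digit year, %H:%M = 24-hour time.
parsed <- as.Date(strptime(raw, "%m/%d/%y %H:%M"))

print(parsed)
print(median(parsed))
```

If the format string does not match the text, strptime returns NA for that entry, so a quick look at the parsed values is a cheap way to catch a wrong format.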
Now, let’s extract the month and the day of the week, and add these variables to our data frame mvt. We can do this with two simple functions. Type the following commands in R:
mvt$Month = months(DateConvert)
mvt$Weekday = weekdays(DateConvert)
This creates two new variables in our data frame, Month and Weekday, and sets them equal to the month and weekday values that we can extract from the Date object. Lastly, replace the old Date variable with DateConvert by typing:
mvt$Date = DateConvert
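The functions months and weekdays return names in the language of the R session's locale, so tables built from them may print in a language other than English. This is a minimal sketch of one way to get English names, assuming the "C" time locale is available on the platform (it normally is).

```r
# Switch only the date/time part of the locale to the portable "C" locale,
# which uses English month and weekday names.
Sys.setlocale("LC_TIME", "C")

d <- as.Date("2012-12-31")
month_name   <- months(d)
weekday_name <- weekdays(d)

print(month_name)    # "December"
print(weekday_name)  # "Monday"
```

The setting lasts for the current session only, so it would need to run before Month and Weekday are created.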
Using the table command, answer the following questions.
In which month did the fewest motor vehicle thefts occur?
table(mvt$Month)
April August December February January July June March May
15280 16572 16426 13511 16047 16801 16002 15758 16035
November October September
16063 17086 16060
On which weekday did the most motor vehicle thefts occur?
table(mvt$Weekday)
Friday Monday Saturday Sunday Thursday Tuesday Wednesday
29284 27397 27118 26316 27319 26791 27416
Each observation in the dataset represents a motor vehicle theft, and the Arrest variable indicates whether an arrest was later made for this theft. Which month has the largest number of motor vehicle thefts for which an arrest was made?
table(mvt$Arrest,mvt$Month)
April August December February January July June
FALSE 14028 15243 15029 12273 14612 15477 14772
TRUE 1252 1329 1397 1238 1435 1324 1230
March May November October September
FALSE 14460 14848 14807 15744 14812
TRUE 1298 1187 1256 1342 1248
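The largest or smallest entry of a table can be picked out with which.max and which.min instead of by eye. This short sketch re-enters the counts from the two month tables above as named vectors, with the month labels written in English.

```r
# Total thefts per month, copied from table(mvt$Month).
thefts <- c(January = 16047, February = 13511, March = 15758, April = 15280,
            May = 16035, June = 16002, July = 16801, August = 16572,
            September = 16060, October = 17086, November = 16063,
            December = 16426)

# Thefts with an arrest per month, copied from the TRUE row of
# table(mvt$Arrest, mvt$Month).
arrests <- c(January = 1435, February = 1238, March = 1298, April = 1252,
             May = 1187, June = 1230, July = 1324, August = 1329,
             September = 1248, October = 1342, November = 1256,
             December = 1397)

fewest_thefts <- names(which.min(thefts))
most_arrests  <- names(which.max(arrests))

print(fewest_thefts)  # "February"
print(most_arrests)   # "January"
```

With the real data the same calls work on the table itself, for example which.min(table(mvt$Month)).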
Now, let’s make some plots to help us better understand how crime has changed over time in Chicago. Throughout this problem, and in general, you can save your plot to a file. For more information, this website very clearly explains the process.
First, let’s make a histogram of the variable Date. We’ll add an extra argument, to specify the number of bars we want in our histogram. In your R console, type
hist(mvt$Date, breaks=100)
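Saving a plot to a file follows an open-device, draw, close-device pattern. This is a minimal sketch, assuming a handful of made-up dates in place of mvt$Date and a temporary PDF file as the target (both illustrative).

```r
# Made-up dates standing in for mvt$Date.
dates <- as.Date(c("2001-03-01", "2003-07-15", "2005-01-20",
                   "2008-11-02", "2010-06-30", "2012-12-31"))

out_file <- tempfile(fileext = ".pdf")

pdf(out_file)             # open a PDF graphics device
hist(dates, breaks = 5)   # draw the histogram into it
dev.off()                 # close the device so the file is written

print(file.exists(out_file))
```

Other devices such as png work the same way, provided the R installation supports them.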
Looking at the histogram, answer the following questions.
In general, does it look like crime increases or decreases from 2002 - 2012?
In general, does it look like crime increases or decreases from 2005 - 2008?
ANSWER: Decreases
Now, let’s see how arrests have changed over time. Create a boxplot of the variable “Date”, sorted by the variable “Arrest” (if you are not familiar with boxplots and would like to learn more, check out this tutorial). In a boxplot, the bold horizontal line is the median value of the data, the box shows the range of values between the first quartile and third quartile, and the whiskers (the dotted lines extending outside the box) show the minimum and maximum values, excluding any outliers (which are plotted as circles). Outliers are defined by first computing the difference between the first and third quartile values, or the height of the box. This number is called the Inter-Quartile Range (IQR). By default, R treats any point that is greater than the third quartile plus 1.5 times the IQR, or less than the first quartile minus 1.5 times the IQR, as an outlier.
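The boxplot can be drawn with a formula of the form Date ~ Arrest, and boxplot.stats reports which points fall outside the whiskers. This is a minimal sketch on made-up data; the data frame `toy` and the vector `x` are illustrative only.

```r
# Made-up thefts: a date and whether an arrest was made.
toy <- data.frame(
  Date   = as.Date(c("2001-02-01", "2002-06-10", "2004-09-05",
                     "2006-03-12", "2009-08-20", "2012-11-30")),
  Arrest = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
)

# One box per value of Arrest; on the real data the analogous call
# would be boxplot(mvt$Date ~ mvt$Arrest).
boxplot(Date ~ Arrest, data = toy)

# boxplot.stats applies R's default rule: points more than
# 1.5 * IQR beyond the box edges are reported as outliers.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)
outliers <- boxplot.stats(x)$out
print(outliers)  # 50
```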
Does it look like there were more crimes for which arrests were made in the first half of the time period or the second half of the time period? (Note that the time period is from 2001 to 2012, so the middle of the time period is the beginning of 2007.)
Let’s investigate this further. Use the table function for the next few questions.
For what proportion of motor vehicle thefts in 2001 was an arrest made?
Note: in this question and many others in the course, we are asking for an answer as a proportion. Therefore, your answer should take a value between 0 and 1.
2152/(18517+2152)
[1] 0.1041173
For what proportion of motor vehicle thefts in 2007 was an arrest made?
1212/(13068+1212)
[1] 0.08487395
For what proportion of motor vehicle thefts in 2012 was an arrest made?
550/(13542+550)
[1] 0.03902924
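These proportions can also be computed in one step with prop.table, which avoids retyping counts by hand. This short sketch re-enters the six counts used above as a small matrix; on the full data the equivalent call would presumably be prop.table(table(mvt$Arrest, mvt$Year), margin = 2).

```r
# Counts for three of the years, copied from the calculations above.
counts <- matrix(c(18517, 2152,
                   13068, 1212,
                   13542,  550),
                 nrow = 2,
                 dimnames = list(Arrest = c("FALSE", "TRUE"),
                                 Year   = c("2001", "2007", "2012")))

# margin = 2 divides each column by its column total, giving per-year shares.
shares <- prop.table(counts, margin = 2)
arrest_share <- shares["TRUE", ]

print(round(arrest_share, 4))
```

The TRUE row then holds the arrest proportion for each year, matching the hand calculations.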
Since there may still be open investigations for recent crimes, this could explain the trend we are seeing in the data. There could also be other factors at play, and this trend should be investigated further. However, since we don’t know when the arrests were actually made, our detective work in this area has reached a dead end.
Analyzing this data could be useful to the Chicago Police Department when deciding where to allocate resources. If they want to increase the number of arrests that are made for motor vehicle thefts, where should they focus their efforts?
We want to find the top five locations where motor vehicle thefts occur. If you create a table of the LocationDescription variable, it is unfortunately very hard to read since there are 78 different locations in the data set. By using the sort function, we can view this same table, but sorted by the number of observations in each category. In your R console, type:
sort(table(mvt$LocationDescription))
Which locations are the top five locations for motor vehicle thefts, excluding the “Other” category? You should select 5 of the following options.
sort(table(mvt$LocationDescription))
AIRPORT BUILDING NON-TERMINAL - SECURE AREA
1
AIRPORT EXTERIOR - SECURE AREA
1
ANIMAL HOSPITAL
1
APPLIANCE STORE
1
CTA TRAIN
1
JAIL / LOCK-UP FACILITY
1
NEWSSTAND
1
BRIDGE
2
COLLEGE/UNIVERSITY RESIDENCE HALL
2
CURRENCY EXCHANGE
2
BOWLING ALLEY
3
CLEANING STORE
3
MEDICAL/DENTAL OFFICE
3
ABANDONED BUILDING
4
AIRPORT BUILDING NON-TERMINAL - NON-SECURE AREA
4
BARBERSHOP
4
LAKEFRONT/WATERFRONT/RIVERBANK
4
LIBRARY
4
SAVINGS AND LOAN
4
AIRPORT TERMINAL UPPER LEVEL - NON-SECURE AREA
5
CHA APARTMENT
5
DAY CARE CENTER
5
FIRE STATION
5
FOREST PRESERVE
6
BANK
7
CONVENIENCE STORE
7
DRUG STORE
8
OTHER COMMERCIAL TRANSPORTATION
8
ATHLETIC CLUB
9
AIRPORT VENDING ESTABLISHMENT
10
AIRPORT PARKING LOT
11
SCHOOL, PRIVATE, BUILDING
14
TAVERN/LIQUOR STORE
14
FACTORY/MANUFACTURING BUILDING
16
BAR OR TAVERN
17
WAREHOUSE
17
MOVIE HOUSE/THEATER
18
RESIDENCE PORCH/HALLWAY
18
NURSING HOME/RETIREMENT HOME
21
TAXICAB
21
DEPARTMENT STORE
22
HIGHWAY/EXPRESSWAY
22
SCHOOL, PRIVATE, GROUNDS
23
VEHICLE-COMMERCIAL
23
AIRPORT EXTERIOR - NON-SECURE AREA
24
OTHER RAILROAD PROP / TRAIN DEPOT
28
SMALL RETAIL STORE
33
CONSTRUCTION SITE
35
CAR WASH
44
COLLEGE/UNIVERSITY GROUNDS
47
GOVERNMENT BUILDING/PROPERTY
48
RESTAURANT
49
CHURCH/SYNAGOGUE/PLACE OF WORSHIP
56
GROCERY FOOD STORE
80
HOSPITAL BUILDING/GROUNDS
101
SCHOOL, PUBLIC, BUILDING
114
HOTEL/MOTEL
124
COMMERCIAL / BUSINESS OFFICE
126
CTA GARAGE / OTHER PROPERTY
148
SPORTS ARENA/STADIUM
166
APARTMENT
184
SCHOOL, PUBLIC, GROUNDS
206
PARK PROPERTY
255
POLICE FACILITY/VEH PARKING LOT
266
AIRPORT/AIRCRAFT
363
CHA PARKING LOT/GROUNDS
405
SIDEWALK
462
VEHICLE NON-COMMERCIAL
817
VACANT LOT/LAND
985
RESIDENCE-GARAGE
1176
RESIDENCE
1302
RESIDENTIAL YARD (FRONT/BACK)
1536
DRIVEWAY - RESIDENTIAL
1675
GAS STATION
2111
ALLEY
2308
OTHER
4573
PARKING LOT/GARAGE(NON.RESID.)
14852
STREET
156564
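Passing decreasing = TRUE to sort puts the largest counts first, so the top locations can be read from the start of the table instead of the end. This is a minimal sketch on a made-up vector of location labels; `loc` is illustrative.

```r
# Made-up location labels with distinct frequencies (4, 3, 2, 1).
loc <- c(rep("STREET", 4), rep("ALLEY", 3),
         rep("GAS STATION", 2), "OTHER")

counts <- table(loc)

# Most frequent category first; keep only the first three entries.
top3 <- sort(counts, decreasing = TRUE)[1:3]

print(top3)
print(names(top3))
```

On the real data, sort(table(mvt$LocationDescription), decreasing = TRUE)[1:6] would show the six largest categories, including OTHER.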
Create a subset of your data, only taking observations for which the theft happened in one of these five locations, and call this new data set “Top5”. To do this, you can use the | symbol. In lecture, we used the & symbol to use two criteria to make a subset of the data. To only take observations that have a certain value in one variable or the other, the | character can be used in place of the & symbol. This is also called a logical “or” operation.
Alternately, you could create five different subsets, and then merge them together into one data frame using rbind.
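The subset can be written either with | between equality tests or, more compactly, with %in%. This is a minimal sketch on a made-up data frame; `toy` and its rows are illustrative, and with the real data the same pattern would apply to mvt and the five location names.

```r
toy <- data.frame(
  ID = 1:6,
  LocationDescription = c("STREET", "ALLEY", "BRIDGE",
                          "GAS STATION", "STREET", "LIBRARY")
)

# Version 1: logical "or" between individual comparisons.
sub_or <- subset(toy, LocationDescription == "STREET" |
                      LocationDescription == "ALLEY" |
                      LocationDescription == "GAS STATION")

# Version 2: %in% tests membership in a vector of allowed values.
keep <- c("STREET", "ALLEY", "GAS STATION")
sub_in <- subset(toy, LocationDescription %in% keep)

print(nrow(sub_or))
print(identical(sub_or, sub_in))
```

Both versions select the same rows; %in% tends to be easier to read once there are more than two or three values.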
How many observations are in Top5?
str(Top5)
'data.frame': 177510 obs. of 13 variables:
$ ID : int 8951354 8951141 8952223 8951608 8950793 8950760 8951611 8951802 8950706 8951585 ...
$ Date : Date, format: "2012-12-31" ...
$ LocationDescription: Factor w/ 78 levels "ABANDONED BUILDING",..: 72 72 72 72 72 72 72 72 72 72 ...
$ Arrest : logi FALSE FALSE FALSE FALSE TRUE FALSE ...
$ Domestic : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ Beat : int 623 1213 724 211 2521 423 231 1021 1215 1011 ...
$ District : int 6 12 7 2 25 4 2 10 12 10 ...
$ CommunityArea : int 69 24 67 35 19 48 40 29 24 29 ...
$ Year : int 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
$ Latitude : num 41.8 41.9 41.8 41.8 41.9 ...
$ Longitude : num -87.6 -87.7 -87.7 -87.6 -87.8 ...
$ Month : chr "December" "December" "December" "December" ...
$ Weekday : chr "Monday" "Monday" "Monday" "Monday" ...
R will remember the other categories of the LocationDescription variable from the original dataset, so running table(Top5$LocationDescription) will have a lot of unnecessary output. To make our tables a bit nicer to read, we can refresh this factor variable. In your R console, type:
Top5$LocationDescription = factor(Top5$LocationDescription)
If you run the str or table function on Top5 now, you should see that LocationDescription now only has 5 values, as we expect.
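The effect of refreshing a factor can be seen on a tiny example. This is a minimal sketch with a made-up three-level factor; the names `f`, `kept` and `refreshed` are illustrative.

```r
# Made-up factor with three levels.
f <- factor(c("STREET", "ALLEY", "BRIDGE", "STREET"))

# Subsetting keeps all original levels, even ones no longer present.
kept <- f[f != "BRIDGE"]
levels_before <- nlevels(kept)

# Re-creating the factor drops the unused level; droplevels() does the same.
refreshed <- factor(kept)
levels_after <- nlevels(refreshed)

print(c(before = levels_before, after = levels_after))
```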
Use the Top5 data frame to answer the remaining questions.
One of the locations has a much higher arrest rate than the other locations. Which is it? Please enter the text in exactly the same way as how it looks in the answer options for Problem 4.1.
Top5$LocationDescription = factor(Top5$LocationDescription)
str(Top5)
'data.frame': 177510 obs. of 14 variables:
$ ID : int 8951354 8951141 8952223 8951608 8950793 8950760 8951611 8951802 8950706 8951585 ...
$ Date : Date, format: "2012-12-31" ...
$ LocationDescription: Factor w/ 5 levels "ALLEY","DRIVEWAY - RESIDENTIAL",..: 5 5 5 5 5 5 5 5 5 5 ...
$ Arrest : logi FALSE FALSE FALSE FALSE TRUE FALSE ...
$ Domestic : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ Beat : int 623 1213 724 211 2521 423 231 1021 1215 1011 ...
$ District : int 6 12 7 2 25 4 2 10 12 10 ...
$ CommunityArea : int 69 24 67 35 19 48 40 29 24 29 ...
$ Year : int 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
$ Latitude : num 41.8 41.9 41.8 41.8 41.9 ...
$ Longitude : num -87.6 -87.7 -87.7 -87.6 -87.8 ...
$ Month : chr "December" "December" "December" "December" ...
$ Weekday : chr "Monday" "Monday" "Monday" "Monday" ...
$ ArrestRate : num 0.1079 0.0788 0.208 0.1079 0.0741 ...
table(Top5$LocationDescription,Top5$Arrest)
FALSE TRUE
ALLEY 2059 249
DRIVEWAY - RESIDENTIAL 1543 132
GAS STATION 1672 439
PARKING LOT/GARAGE(NON.RESID.) 13249 1603
STREET 144969 11595
ArrestRate = c(249/(249+2059), 132/(132+1543), 439/(439+1672), 1603/(1603+13249), 11595/(11595+144969))
ArrestRate
[1] 0.10788562 0.07880597 0.20795831 0.10793159
[5] 0.07405917
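The arrest rate per location can also be computed from the table itself rather than typed out count by count, which avoids transcription slips. This short sketch re-enters the counts from the table above as a matrix; on the real data the equivalent would presumably be prop.table(table(Top5$LocationDescription, Top5$Arrest), margin = 1).

```r
# Counts copied from table(Top5$LocationDescription, Top5$Arrest).
counts <- matrix(c(  2059,   249,
                     1543,   132,
                     1672,   439,
                    13249,  1603,
                   144969, 11595),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(c("ALLEY", "DRIVEWAY - RESIDENTIAL",
                                   "GAS STATION",
                                   "PARKING LOT/GARAGE(NON.RESID.)",
                                   "STREET"),
                                 c("FALSE", "TRUE")))

# margin = 1 divides each row by its row total.
rates <- prop.table(counts, margin = 1)[, "TRUE"]

print(round(rates, 4))
print(names(which.max(rates)))  # "GAS STATION"
```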
On which day of the week do the most motor vehicle thefts at gas stations happen? (Monday~Sunday)
table(Top5$Weekday, Top5$LocationDescription)
ALLEY DRIVEWAY - RESIDENTIAL GAS STATION
Friday 385 257 332
Monday 320 255 280
Saturday 341 202 338
Sunday 307 221 336
Thursday 315 263 282
Tuesday 323 243 270
Wednesday 317 234 273
PARKING LOT/GARAGE(NON.RESID.) STREET
Friday 2331 23773
Monday 2128 22305
Saturday 2199 22175
Sunday 1936 21756
Thursday 2082 22296
Tuesday 2073 21888
Wednesday 2103 22371
On which day of the week do the fewest motor vehicle thefts in residential driveways happen? (Monday~Sunday)
table(Top5$Weekday, Top5$LocationDescription)
ALLEY DRIVEWAY - RESIDENTIAL GAS STATION
Friday 385 257 332
Monday 320 255 280
Saturday 341 202 338
Sunday 307 221 336
Thursday 315 263 282
Tuesday 323 243 270
Wednesday 317 234 273
PARKING LOT/GARAGE(NON.RESID.) STREET
Friday 2331 23773
Monday 2128 22305
Saturday 2199 22175
Sunday 1936 21756
Thursday 2082 22296
Tuesday 2073 21888
Wednesday 2103 22371
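Individual columns of the two-way table can be pulled out and passed to which.max or which.min. This short sketch re-enters the GAS STATION and DRIVEWAY - RESIDENTIAL columns from the table above as named vectors, with the weekday labels written in English.

```r
# Counts per weekday, copied from the two-way table above.
gas_station <- c(Monday = 280, Tuesday = 270, Wednesday = 273,
                 Thursday = 282, Friday = 332, Saturday = 338, Sunday = 336)
driveway    <- c(Monday = 255, Tuesday = 243, Wednesday = 234,
                 Thursday = 263, Friday = 257, Saturday = 202, Sunday = 221)

busiest_gas_day   <- names(which.max(gas_station))
quietest_driveway <- names(which.min(driveway))

print(busiest_gas_day)    # "Saturday"
print(quietest_driveway)  # "Saturday"
```

On the real table, a column can be selected by name, for example tab[, "GAS STATION"] where tab holds the result of table(Top5$Weekday, Top5$LocationDescription).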