【1.1】 How many rows of data (observations) are in this dataset?
D = read.csv("data/mvtWeek1.csv", stringsAsFactors=FALSE)
nrow(D)
[1] 191641
summary(D)
       ID               Date           LocationDescription   Arrest
 Min.   :1310022   Length:191641       Length:191641       Mode :logical
 1st Qu.:2832144   Class :character    Class :character    FALSE:176105
 Median :4762956   Mode  :character    Mode  :character    TRUE :15536
 Mean   :4968629
 3rd Qu.:7201878
 Max.   :9181151
  Domestic            Beat          District      CommunityArea
 Mode :logical   Min.   : 111   Min.   : 1      Min.   : 0
 FALSE:191226    1st Qu.: 722   1st Qu.: 6      1st Qu.:22
 TRUE :415       Median :1121   Median :10      Median :32
                 Mean   :1259   Mean   :12      Mean   :38
                 3rd Qu.:1733   3rd Qu.:17      3rd Qu.:60
                 Max.   :2535   Max.   :31      Max.   :77
                                NA's   :43056   NA's   :24616
      Year         Latitude       Longitude
 Min.   :2001   Min.   :41.6   Min.   :-87.9
 1st Qu.:2003   1st Qu.:41.8   1st Qu.:-87.7
 Median :2006   Median :41.9   Median :-87.7
 Mean   :2006   Mean   :41.8   Mean   :-87.7
 3rd Qu.:2009   3rd Qu.:41.9   3rd Qu.:-87.6
 Max.   :2012   Max.   :42.0   Max.   :-87.5
                NA's   :2276   NA's   :2276
【1.2】 How many variables are in this dataset?
ncol(D)
[1] 11
【1.3】Using the “max” function, what is the maximum value of the variable “ID”?
max(D$ID)
[1] 9181151
【1.4】 What is the minimum value of the variable “Beat”?
min(D$Beat)
[1] 111
【1.5】 How many observations have value TRUE in the Arrest variable (this is the number of crimes for which an arrest was made)?
sum(D$Arrest) # TRUE counts as 1, so the sum is the number of arrests
[1] 15536
【1.6】 How many observations have a LocationDescription value of ALLEY?
sum(D$LocationDescription == "ALLEY")
[1] 2308
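Counting with sum() in 1.5 and 1.6 works because R treats TRUE as 1 and FALSE as 0. A minimal sketch on a hypothetical vector (the labels below are made up, not taken from the dataset) that also shows what to expect if a comparison yields NA:

```r
# Hypothetical location labels, including one missing value
loc <- c("ALLEY", "STREET", "ALLEY", NA)

# Element-wise comparison: TRUE FALSE TRUE NA
is_alley <- loc == "ALLEY"

sum(is_alley)                # NA, because one comparison is NA
sum(is_alley, na.rm = TRUE)  # 2, missing values are skipped
```

The counts in 1.5 and 1.6 came back as plain numbers, which suggests neither column contains missing values.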
【2.1】 In what format are the entries in the variable Date?
head(D$Date) # Month/Day/Year Hour:Minute
[1] "12/31/12 23:15" "12/31/12 22:00" "12/31/12 22:00" "12/31/12 22:00"
[5] "12/31/12 21:30" "12/31/12 20:30"
ts = as.POSIXct(D$Date, format="%m/%d/%y %H:%M")
par(cex=0.7)
hist(ts,"quarter",las=2,freq=TRUE,xlab="")
table(format(ts,'%w'))
0 1 2 3 4 5 6
26316 27397 26791 27416 27319 29284 27118
table(format(ts,'%m'))
01 02 03 04 05 06 07 08 09 10 11 12
16047 13511 15758 15280 16035 16002 16801 16572 16060 17086 16063 16426
table(weekday=format(ts,'%w'), month=format(ts,'%m'))
month
weekday 01 02 03 04 05 06 07 08 09 10 11 12
0 2110 1837 2075 2070 2168 2239 2339 2304 2352 2424 2254 2144
1 2395 1937 2200 2323 2359 2187 2457 2288 2258 2399 2323 2271
2 2317 1885 2270 2118 2222 2183 2412 2251 2142 2416 2258 2317
3 2259 2007 2242 2060 2345 2347 2408 2428 2239 2484 2182 2415
4 2334 1904 2263 2099 2402 2190 2385 2464 2320 2280 2253 2425
5 2392 2036 2443 2388 2340 2566 2459 2591 2390 2692 2475 2512
6 2240 1905 2265 2222 2199 2290 2341 2246 2359 2391 2318 2342
library(dplyr)
library(d3heatmap)
table(format(ts,"%m"), format(ts,"%Y")) %>%
  as.data.frame.matrix %>%
  d3heatmap(Rowv=FALSE, Colv=FALSE, colors=colorRamp(c('seagreen','lightyellow','red')))
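The format string passed to as.POSIXct in 2.1 can be checked on a couple of strings before trusting it on the whole column. A minimal sketch, assuming two sample strings in the same Month/Day/Year Hour:Minute layout and a fixed time zone (tz = "UTC" is an assumption added here so the result does not depend on the local machine):

```r
# Hypothetical sample strings in the same layout as D$Date
x <- c("12/31/12 23:15", "01/05/01 08:00")

# %m month, %d day, %y two-digit year, %H hour, %M minute
parsed <- as.POSIXct(x, format = "%m/%d/%y %H:%M", tz = "UTC")

format(parsed, "%Y")  # four-digit year
format(parsed, "%m")  # two-digit month
format(parsed, "%w")  # weekday as a digit, 0 = Sunday ... 6 = Saturday
```

A string that does not match the format parses to NA, so anyNA(parsed) is a quick sanity check.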
【2.2】 What is the month and year of the median date in our dataset?
median(ts)
[1] "2006-05-21 12:30:00 CST"
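To read the month and year off directly, the median can be passed through format(). A minimal sketch on three hypothetical timestamps (made up for illustration, with tz = "UTC" assumed):

```r
# Hypothetical timestamps; with an odd count the median is the middle value
when <- as.POSIXct(c("2001-01-01", "2006-05-21", "2012-12-31"), tz = "UTC")

# Year and month of the median date
format(median(when), "%Y-%m")  # "2006-05"
```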
【2.3】 In which month did the fewest motor vehicle thefts occur?
sort(table(format(ts,"%m")))
02 04 03 06 05 01 09 11 12 08 07 10
13511 15280 15758 16002 16035 16047 16060 16063 16426 16572 16801 17086
【2.4】 On which weekday did the most motor vehicle thefts occur?
format(ts,"%w") %>% table %>% sort
.
0 2 6 4 1 3 5
26316 26791 27118 27319 27397 27416 29284
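The "%w" codes run from 0 (Sunday) to 6 (Saturday), so the table above is easier to read with day names attached. A minimal sketch that copies the counts from the output above and relabels them through a lookup vector (the name day_names is introduced here for illustration):

```r
# Counts copied from the sorted table above, keyed by the "%w" code
counts <- c("0" = 26316, "2" = 26791, "6" = 27118, "4" = 27319,
            "1" = 27397, "3" = 27416, "5" = 29284)

# "%w" numbers weekdays from 0 (Sunday) to 6 (Saturday)
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")

# Codes are 0-based and R vectors are 1-based, hence the + 1
names(counts) <- day_names[as.integer(names(counts)) + 1]

names(which.max(counts))  # "Friday"
```

A fixed lookup vector is used instead of weekdays() so the labels do not depend on the session locale.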
【2.5】 Which month has the largest number of motor vehicle thefts for which an arrest was made?
ts[D$Arrest] %>% format('%m') %>% table %>% sort
.
05 06 02 09 04 11 03 07 08 10 12 01
1187 1230 1238 1248 1252 1256 1298 1324 1329 1342 1397 1435
【3.1】 (a) In general, does it look like crime increases or decreases from 2002 - 2012? (b) In general, does it look like crime increases or decreases from 2005 - 2008? (c) In general, does it look like crime increases or decreases from 2009 - 2011?
hist(ts,'year',las=2)
# 2002-2012: decrease
# 2005-2008: decrease
# 2009-2011: increase
【3.2】 Does it look like there were more crimes for which arrests were made in the first half of the time period or the second half of the time period?
table(ts > as.POSIXct("2007-01-01")) # counts all thefts; subset with ts[D$Arrest] to count only those with an arrest
FALSE TRUE
105523 86118
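Question 3.2 asks about thefts with an arrest, while the two totals above add up to all 191641 rows. A minimal sketch on hypothetical toy data (six made-up timestamps and arrest flags, tz = "UTC" assumed) of how the split changes once the timestamps are restricted to arrests:

```r
# Hypothetical toy data: six theft timestamps and whether an arrest was made
when <- as.POSIXct(c("2002-03-01", "2004-07-15", "2006-11-30",
                     "2008-02-10", "2010-05-05", "2011-09-09"), tz = "UTC")
arrest <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
cutoff <- as.POSIXct("2007-01-01", tz = "UTC")

# All thefts on each side of the cutoff
all_split <- table(when > cutoff)

# Only thefts with an arrest: subset the timestamps first
arrest_split <- table(when[arrest] > cutoff)

all_split
arrest_split
```

In this toy data the thefts split 3 to 3, but the arrests split 2 to 1 in favor of the first half.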
【3.3】 For what proportion of motor vehicle thefts in 2001 was an arrest made?
table(D$Arrest, format(ts,'%Y')) %>% prop.table(2) %>% round(3) # 0.104
2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011
FALSE 0.896 0.887 0.892 0.900 0.907 0.919 0.915 0.929 0.931 0.955 0.960
TRUE 0.104 0.113 0.108 0.100 0.093 0.081 0.085 0.071 0.069 0.045 0.040
2012
FALSE 0.961
TRUE 0.039
【3.4】 For what proportion of motor vehicle thefts in 2007 was an arrest made?
tapply(D$Arrest, format(ts,'%Y'), mean) %>% round(3) # 0.085
2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
0.104 0.113 0.108 0.100 0.093 0.081 0.085 0.071 0.069 0.045 0.040 0.039
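The tapply call in 3.4 and the prop.table call in 3.3 give the same arrest rates, because the mean of a logical vector is the share of TRUE values. A minimal sketch on hypothetical toy data (eight made-up rows, not from the dataset) comparing the two routes:

```r
# Hypothetical toy data: arrest flag and year for eight thefts
arrest <- c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE)
year   <- c("2001", "2001", "2001", "2001", "2007", "2007", "2007", "2007")

# mean() of a logical vector is the proportion of TRUE values per group
rate_tapply <- tapply(arrest, year, mean)

# Same numbers from the column-wise proportions of the cross-table
rate_table <- prop.table(table(arrest, year), margin = 2)["TRUE", ]

rate_tapply  # 2001: 0.25, 2007: 0.50
```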
【3.5】 For what proportion of motor vehicle thefts in 2012 was an arrest made?
# 0.039
【4.1】 Which locations are the top five locations for motor vehicle thefts, excluding the “Other” category? You should select 5 of the following options.
table(D$LocationDescription) %>% sort %>% tail(6)
DRIVEWAY - RESIDENTIAL GAS STATION
1675 2111
ALLEY OTHER
2308 4573
PARKING LOT/GARAGE(NON.RESID.) STREET
14852 156564
【4.2】 How many observations are in Top5?
(top5 = names(table(D$LocationDescription) %>% sort %>% tail(6))[-4])
[1] "DRIVEWAY - RESIDENTIAL" "GAS STATION"
[3] "ALLEY" "PARKING LOT/GARAGE(NON.RESID.)"
[5] "STREET"
sum(D$LocationDescription %in% top5) # 177510
[1] 177510
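The positional index [-4] used above depends on where OTHER happens to fall in the sorted table. A minimal sketch of a name-based alternative, using only the six counts copied from the 4.1 output (the names loc_counts, kept, and top5_by_name are introduced here for illustration):

```r
# Counts copied from the 4.1 output (six largest categories only)
loc_counts <- c("DRIVEWAY - RESIDENTIAL" = 1675, "GAS STATION" = 2111,
                "ALLEY" = 2308, "OTHER" = 4573,
                "PARKING LOT/GARAGE(NON.RESID.)" = 14852, "STREET" = 156564)

# Drop "OTHER" by name, then keep the five largest remaining counts
kept <- loc_counts[names(loc_counts) != "OTHER"]
top5_by_name <- names(sort(kept, decreasing = TRUE))[1:5]

top5_by_name
sum(kept[top5_by_name])  # 177510
```

The total matches the 177510 observations found with %in% above.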
【4.3】 One of the locations has a much higher arrest rate than the other locations. Which is it?
TOP5 = subset(D, LocationDescription %in% top5)
tapply(TOP5$Arrest, TOP5$LocationDescription, mean) %>% sort
STREET DRIVEWAY - RESIDENTIAL
0.074059 0.078806
ALLEY PARKING LOT/GARAGE(NON.RESID.)
0.107886 0.107932
GAS STATION
0.207958
【4.4】 On which day of the week do the most motor vehicle thefts at gas stations happen?
ts[D$LocationDescription == "GAS STATION"] %>% format('%w') %>% table %>% sort
.
2 3 1 4 5 0 6
270 273 280 282 332 336 338
【4.5】 On which day of the week do the fewest motor vehicle thefts in residential driveways happen?
ts[D$LocationDescription == "DRIVEWAY - RESIDENTIAL"] %>%
  format('%w') %>% table %>% sort
.
6 0 3 2 1 5 4
202 221 234 243 255 257 263