Section 1 - Loading the Data
1.1
How many rows of data (observations) are in this dataset?
D=read.csv("C:/bussiness analytics/data/mvtWeek1.csv")
nrow(D)
[1] 191641
nrow() returns the number of rows (observations).
1.2
How many variables are in this dataset?
ncol(D)
[1] 11
ncol() returns the number of columns (variables).
1.3
Using the “max” function, what is the maximum value of the variable “ID”?
max(D$ID)
[1] 9181151
The $ operator selects a column (variable) of a data frame by name.
1.4
What is the minimum value of the variable “Beat”?
min(D$Beat)
[1] 111
1.5
How many observations have value TRUE in the Arrest variable (this is the number of crimes for which an arrest was made)?
sum(D$Arrest==TRUE)
[1] 15536
== tests for equality.
1.6
How many observations have a LocationDescription value of ALLEY?
sum(D$LocationDescription=="ALLEY")
[1] 2308
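sum() of a logical vector gives the count; mean() gives the proportion. For example, mean(D$LocationDescription == "ALLEY") returns the share of thefts that happened in an alley.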
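To see why sum() counts and mean() gives a proportion, here is a minimal sketch on a made-up vector (the values in `loc` are illustrative and not taken from the dataset):

```r
# Hypothetical toy data, not from mvtWeek1.csv
loc <- c("ALLEY", "STREET", "ALLEY", "STREET", "STREET")

is_alley <- loc == "ALLEY"   # logical vector: TRUE FALSE TRUE FALSE FALSE
n_alley  <- sum(is_alley)    # TRUE counts as 1, so this is the number of matches
p_alley  <- mean(is_alley)   # share of TRUE values

print(n_alley)  # 2
print(p_alley)  # 0.4
```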
Section 2 - Understanding Dates in R
In many datasets, like this one, you have a date field. Unfortunately, R does not automatically recognize entries that look like dates. We need to use a function in R to extract the date and time. Take a look at the first entry of Date (remember to use square brackets when looking at a certain entry of a variable).
2.1
In what format are the entries in the variable Date?
- Month/Day/Year Hour:Minute
- Day/Month/Year Hour:Minute
- Hour:Minute Month/Day/Year
- Hour:Minute Day/Month/Year
D$Date = as.character(D$Date)
head(D$Date,5)
[1] "12/31/12 23:15" "12/31/12 22:00" "12/31/12 22:00" "12/31/12 22:00"
[5] "12/31/12 21:30"
# Month/Day/Year Hour:Minute
head() shows the first six elements by default; passing a number as the second argument, as in head(x, 5), shows only that many.
2.2
Now, let’s convert these characters into a Date object in R. In your R console, type
DateConvert = as.Date(strptime(mvt$Date, "%m/%d/%y %H:%M"))
This converts the variable “Date” into a Date object in R. Take a look at the variable DateConvert using the summary function.
What is the month and year of the median date in our dataset? Enter your answer as “Month Year”, without the quotes. (Ex: if the answer was 2008-03-28, you would give the answer “March 2008”, without the quotes.)
ts = as.POSIXct(D$Date, format="%m/%d/%y %H:%M")
median(ts)
[1] "2006-05-21 12:30:00 CST"
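as.POSIXct(x, format = ...) parses date-times in the given format and keeps both the date and the time of day (hour and minute).
as.Date() keeps only the calendar date (day resolution).
ts is the variable that stores the result of as.POSIXct().
median() computes the median.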
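The difference between the two parsers can be sketched on a single string. The sample value and the choice of tz = "UTC" are assumptions made here so the result does not depend on the local time zone; the notebook itself parsed in the local zone (CST).

```r
# Hypothetical sample string in the dataset's "%m/%d/%y %H:%M" layout
x <- "12/31/12 23:15"

# tz = "UTC" is an assumption for reproducibility
dt <- as.POSIXct(x, format = "%m/%d/%y %H:%M", tz = "UTC")
d  <- as.Date(x, format = "%m/%d/%y")   # trailing " 23:15" is ignored

dt_text <- format(dt, "%Y-%m-%d %H:%M")  # date and time of day are both kept
d_text  <- format(d)                     # only the calendar date is kept
print(dt_text)
print(d_text)
```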
2.3
Now, let’s extract the month and the day of the week, and add these variables to our data frame mvt. We can do this with two simple functions. Type the following commands in R:
mvt$Month = months(DateConvert)
mvt$Weekday = weekdays(DateConvert)
This creates two new variables in our data frame, Month and Weekday, and sets them equal to the month and weekday values that we can extract from the Date object. Lastly, replace the old Date variable with DateConvert by typing:
mvt$Date = DateConvert
Using the table command, answer the following questions.
D$Month = format(ts, "%m")
D$Weekday = format(ts, "%w")
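D$Month refers to the Month column of D; here it is defined as format(ts, "%m"), the two-digit month. Likewise, D$Weekday is defined as format(ts, "%w"), the weekday as a digit from 0 (Sunday) to 6 (Saturday).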
In which month did the fewest motor vehicle thefts occur?
sort(table(D$Month))
02 04 03 06 05 01 09 11 12 08 07 10
13511 15280 15758 16002 16035 16047 16060 16063 16426 16572 16801 17086
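table() tabulates how often each value occurs; sort() arranges the counts in increasing order, so the smallest comes first. Here the fewest thefts occurred in February (02).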
2.4
On which weekday did the most motor vehicle thefts occur?
sort(table(D$Weekday))
0 2 6 4 1 3 5
26316 26791 27118 27319 27397 27416 29284
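Because "%w" returns digits rather than names, the table above is easier to read once the codes are mapped to weekday names (code 5, the largest count, is Friday). A minimal sketch of that mapping, using a hand-written `day_names` vector and made-up dates as assumptions:

```r
# Names indexed by the %w code: 0 = Sunday, ..., 6 = Saturday
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")

# Hypothetical dates; 2012-12-30 was a Sunday, 2012-12-31 a Monday
dates <- as.Date(c("2012-12-30", "2012-12-31", "2012-12-31"))
codes <- format(dates, "%w")                 # "0" "1" "1"
named <- day_names[as.integer(codes) + 1]    # shift by 1: R vectors are 1-indexed
print(sort(table(named)))
```

Using an explicit name vector instead of weekdays() keeps the labels in English regardless of the system locale.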
2.5
Each observation in the dataset represents a motor vehicle theft, and the Arrest variable indicates whether an arrest was later made for this theft. Which month has the largest number of motor vehicle thefts for which an arrest was made?
library(dplyr)
package ‘dplyr’ was built under R version 3.5.1
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
library(dplyr) loads the dplyr package, which makes data manipulation more efficient and provides the %>% pipe used below.
tapply(D$Arrest, D$Month, sum) %>% sort
05 06 02 09 04 11 03 07 08 10 12 01
1187 1230 1238 1248 1252 1256 1298 1324 1329 1342 1397 1435
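tapply(x, group, f) applies f to x within each group, e.g. the sum of Arrest by month or the mean of Arrest by month. %>% passes the result on its left to the function on its right, which replaces a second or third layer of nested parentheses. Here January (01) has the most arrests.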
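A minimal sketch of tapply() on made-up data (the vectors `arrest` and `month` are hypothetical). It uses the nested form so that it runs without any package; with dplyr loaded, `tapply(arrest, month, sum) %>% sort` gives the same result.

```r
# Hypothetical toy data
arrest <- c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE)
month  <- c("01", "01", "02", "02", "02", "03")

by_sum  <- tapply(arrest, month, sum)    # number of arrests per month
by_mean <- tapply(arrest, month, mean)   # arrest rate per month

sorted <- sort(by_sum)                   # increasing order, names are kept
print(sorted)
print(round(by_mean, 3))
```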
Section 3 - Visualizing Crime Trends
3.1
Now, let’s make some plots to help us better understand how crime has changed over time in Chicago. Throughout this problem, and in general, you can save your plot to a file. For more information, this website very clearly explains the process.
First, let’s make a histogram of the variable Date. We’ll add an extra argument, to specify the number of bars we want in our histogram. In your R console, type
hist(mvt$Date, breaks=100)
par(cex=0.7)
hist(ts, "year", las=2, main="No. Motor Crimes by year")

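par() sets graphical parameters; cex scales the size of text and symbols.
hist() draws a histogram.
las controls the orientation of the axis labels: las=1 keeps them horizontal, las=2 draws them perpendicular to the axis, so the dates on the x-axis become vertical.
The y-axis can show either Density (freq=F) or Frequency (freq=T). For date-time data the default is Density; to show counts, pass freq=T to hist().
main sets the plot title, e.g. main="No. Motor Crimes by Quarter".
A histogram with a title: hist(ts, "quarter", las=2, freq=T, main="No. Motor Crimes by Quarter")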
Looking at the histogram, answer the following questions.
In general, does it look like crime increases or decreases from 2002 - 2012?
hist(ts,'year',las=2)
# Decreases
In general, does it look like crime increases or decreases from 2005 - 2008?
hist(ts,'year',las=2)
# Decreases
3.2
Now, let’s see how arrests have changed over time. Create a boxplot of the variable “Date”, sorted by the variable “Arrest” (if you are not familiar with boxplots and would like to learn more, check out this tutorial). In a boxplot, the bold horizontal line is the median value of the data, the box shows the range of values between the first quartile and third quartile, and the whiskers (the dotted lines extending outside the box) show the minimum and maximum values, excluding any outliers (which are plotted as circles). Outliers are defined by first computing the difference between the first and third quartile values, or the height of the box. This number is called the Inter-Quartile Range (IQR). Any point that is greater than the third quartile plus the IQR or less than the first quartile minus the IQR is considered an outlier.
boxplot( ts ~ D$Arrest )

boxplot(ts ~ D$Arrest) draws one box of theft dates for each value of Arrest.
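The numbers behind a boxplot can be inspected with boxplot.stats(). Note that R's boxplot() by default lets the whiskers reach up to 1.5 times the box height (its range argument), so it flags slightly fewer points than the one-IQR description above. A minimal sketch on made-up values:

```r
# Hypothetical values with one extreme point
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

st <- boxplot.stats(x)   # same statistics that boxplot() draws; default coef = 1.5
print(st$stats)          # lower whisker, lower hinge, median, upper hinge, upper whisker
print(st$out)            # points that would be drawn as circles
```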
Does it look like there were more crimes for which arrests were made in the first half of the time period or the second half of the time period? (Note that the time period is from 2001 to 2012, so the middle of the time period is the beginning of 2007.)
tapply(D$Arrest, ts > as.POSIXct("2007-01-01"), sum)
FALSE TRUE
10588 4948
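ts > as.POSIXct("2007-01-01") is TRUE for thefts that happened after the start of 2007, so tapply(D$Arrest, ts > as.POSIXct("2007-01-01"), sum) adds up the arrests in each half of the period: 10588 in the first half and 4948 in the second.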
3.3
Let’s investigate this further. Use the table function for the next few questions.
For what proportion of motor vehicle thefts in 2001 was an arrest made?
Note: in this question and many others in the course, we are asking for an answer as a proportion. Therefore, your answer should take a value between 0 and 1.
D$Year = format(ts, "%Y")
table(D$Arrest, D$Year)
       2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012
FALSE 18517 16638 14859 15169 14956 14796 13068 13425 11327 14796 15012 13542
TRUE   2152  2115  1798  1693  1528  1302  1212  1020   840   701   625   550
tapply(D$Arrest, D$Year, mean) %>% round(3)
 2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012
0.104 0.113 0.108 0.100 0.093 0.081 0.085 0.071 0.069 0.045 0.040 0.039
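First use table() to tabulate the counts, then use tapply() with mean to get the arrest proportion for each year, and pipe the result into round(3) to round to three decimal places. For 2001 the proportion is 0.104.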
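The same proportions can also be read directly off the two-way table with prop.table(). A minimal sketch on made-up data (the vectors `arrest` and `year` are hypothetical):

```r
# Hypothetical toy data
arrest <- c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE)
year   <- c("2001", "2001", "2001", "2001", "2002", "2002")

counts <- table(arrest, year)
rates  <- prop.table(counts, margin = 2)   # margin = 2: each column (year) sums to 1
print(round(rates["TRUE", ], 3))           # arrest proportion per year
```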
3.4
For what proportion of motor vehicle thefts in 2007 was an arrest made?
# 0.085
3.5
For what proportion of motor vehicle thefts in 2012 was an arrest made?
#0.039
Since there may still be open investigations for recent crimes, this could explain the trend we are seeing in the data. There could also be other factors at play, and this trend should be investigated further. However, since we don’t know when the arrests were actually made, our detective work in this area has reached a dead end.
Section 4 - Popular Locations
4.1
Analyzing this data could be useful to the Chicago Police Department when deciding where to allocate resources. If they want to increase the number of arrests that are made for motor vehicle thefts, where should they focus their efforts?
We want to find the top five locations where motor vehicle thefts occur. If you create a table of the LocationDescription variable, it is unfortunately very hard to read since there are 78 different locations in the data set. By using the sort function, we can view this same table, but sorted by the number of observations in each category. In your R console, type:
sort(table(mvt$LocationDescription))
Which locations are the top five locations for motor vehicle thefts, excluding the “Other” category? You should select 5 of the following options.
- Bank
- Gas Station
- Hotel/Motel
- Street
- Car Wash
- Restaurant
- Parking Lot/Garage (Non-Residential)
- Alley
- Driveway (Residential)
- Vacant Lot/Land
table(D$LocationDescription) %>% sort %>% tail
DRIVEWAY - RESIDENTIAL GAS STATION
1675 2111
ALLEY OTHER
2308 4573
PARKING LOT/GARAGE(NON.RESID.) STREET
14852 156564
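tail() returns the last elements of a vector; without a number it shows the default of six. Because the table is sorted in increasing order, these are the six most frequent locations. Excluding OTHER, the top five are Street, Parking Lot/Garage (Non-Residential), Alley, Gas Station and Driveway (Residential).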
4.2
Create a subset of your data, only taking observations for which the theft happened in one of these five locations, and call this new data set “Top5”. To do this, you can use the | symbol. In lecture, we used the & symbol to use two criteria to make a subset of the data. To only take observations that have a certain value in one variable or the other, the | character can be used in place of the & symbol. This is also called a logical “or” operation.
Alternately, you could create five different subsets, and then merge them together into one data frame using rbind.
How many observations are in Top5?
(Top5=names(table(D$LocationDescription) %>% sort %>% tail(6))[-4])
[1] "DRIVEWAY - RESIDENTIAL" "GAS STATION"
[3] "ALLEY" "PARKING LOT/GARAGE(NON.RESID.)"
[5] "STREET"
sum(D$LocationDescription %in% Top5)
[1] 177510
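x %in% set tests whether each element of x appears in set. Combined with subset() or sum(), it filters the data to the chosen values, e.g. LocationDescription against the names in Top5.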
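The | approach described in the question and the %in% approach used here select the same rows. A minimal sketch on a made-up data frame (`df` and `keep` are hypothetical names):

```r
# Hypothetical toy data frame
df <- data.frame(loc = c("STREET", "ALLEY", "BANK", "STREET", "OTHER"),
                 stringsAsFactors = FALSE)
keep <- c("STREET", "ALLEY")

a <- subset(df, loc == "STREET" | loc == "ALLEY")   # explicit "or" conditions
b <- subset(df, loc %in% keep)                      # same filter via %in%

print(nrow(a))
print(identical(a, b))
```

With five locations, %in% stays a single condition, while the | form needs five comparisons.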
4.3
R will remember the other categories of the LocationDescription variable from the original dataset, so running table(Top5$LocationDescription) will have a lot of unnecessary output. To make our tables a bit nicer to read, we can refresh this factor variable. In your R console, type:
Top5$LocationDescription = factor(Top5$LocationDescription)
If you run the str or table function on Top5 now, you should see that LocationDescription now only has 5 values, as we expect.
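A minimal sketch of why the factor needs refreshing, on a made-up factor `f`: subsetting keeps levels that no longer occur, and calling factor() again drops them (droplevels() has the same effect).

```r
# Hypothetical toy factor
f   <- factor(c("STREET", "ALLEY", "BANK", "STREET"))
sub <- f[f != "BANK"]       # subsetting keeps the unused level "BANK"
before <- levels(sub)
print(table(sub))           # BANK still appears, with count 0

sub   <- factor(sub)        # re-creating the factor drops unused levels
after <- levels(sub)
print(table(sub))
```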
Use the Top5 data frame to answer the remaining questions.
One of the locations has a much higher arrest rate than the other locations. Which is it? Please enter the text in exactly the same way as how it looks in the answer options for Problem 4.1.
Top5 = subset(D, LocationDescription %in% Top5)
tapply(Top5$Arrest, Top5$LocationDescription, mean) %>% sort %>% round(3)
STREET DRIVEWAY - RESIDENTIAL
0.074 0.079
ALLEY PARKING LOT/GARAGE(NON.RESID.)
0.108 0.108
GAS STATION
0.208
GAS STATION has by far the highest arrest rate (0.208).
as.character() converts a value to character type.
4.4
On which day of the week do the most motor vehicle thefts at gas stations happen? (Monday~Sunday)
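ts[D$LocationDescription == "GAS STATION"] %>% format('%w') %>% table %>% sort
.
3 2 0 4 6 1 5
293 305 315 321 325 329 344
# Saturday
ts has one entry per row of D, so the condition must be built from D rather than from the shorter Top5. The counts shown were produced with Top5$Location as the index, which R recycles against ts; they total 2232 instead of the 2111 gas-station thefts counted in 4.1 and should be regenerated with the call above. In the %w coding, 0 is Sunday and 6 is Saturday.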
For a more detailed breakdown, group by two variables at once: tapply(Top5$Arrest, list(Top5$LocationDescription, Top5$Weekday), mean) %>% round(3)
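A minimal sketch of grouping by two variables with tapply(), on made-up vectors (`arrest`, `loc` and `weekday` are hypothetical). The result is a matrix with one row per location and one column per weekday code; combinations that never occur are NA.

```r
# Hypothetical toy data
arrest  <- c(TRUE, FALSE, TRUE, FALSE)
loc     <- c("STREET", "STREET", "ALLEY", "ALLEY")
weekday <- c("0", "0", "0", "1")

m <- tapply(arrest, list(loc, weekday), mean)   # rows: location, columns: weekday code
print(round(m, 3))
```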
4.5
On which day of the week do the fewest motor vehicle thefts in residential driveways happen?(Monday~Sunday)
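ts[D$LocationDescription == "DRIVEWAY - RESIDENTIAL"] %>% format('%w') %>% table %>% sort
.
2 4 3 0 1 6 5
223 237 239 247 270 282 293
# Thursday
As in 4.4, ts must be indexed with a condition built from D. The counts shown were produced with Top5$Location as the index; they total 1791 instead of the 1675 residential-driveway thefts counted in 4.1 and should be regenerated with the call above.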
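One thing to watch when indexing a separate vector such as ts with a condition: the condition must come from a data frame with the same number of rows, because R silently recycles a shorter logical index. A minimal sketch with made-up objects (`ts_all`, `full` and `top` are hypothetical stand-ins for ts, D and Top5):

```r
# Hypothetical stand-ins: one value per row of the full data
ts_all <- 1:6
full   <- data.frame(loc = c("A", "B", "A", "B", "A", "B"),
                     stringsAsFactors = FALSE)
top    <- subset(full, loc == "A")        # 3 rows

misaligned <- ts_all[top$loc == "A"]      # length-3 index is recycled over 6 values
aligned    <- ts_all[full$loc == "A"]     # index has the same length as ts_all

print(length(misaligned))
print(aligned)
```

Storing the derived column in the data frame before subsetting (as with D$Weekday) avoids the problem, since the subset then carries its own aligned values.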
