Section 1 - Loading the Data

1.1

How many rows of data (observations) are in this dataset?

D=read.csv("C:/bussiness analytics/data/mvtWeek1.csv")
nrow(D)
[1] 191641

nrow() returns the number of rows (observations).

1.2

How many variables are in this dataset?

ncol(D)
[1] 11

ncol() returns the number of columns (variables).

1.3

Using the “max” function, what is the maximum value of the variable “ID”?

max(D$ID)
[1] 9181151

The $ operator selects a column of a data frame by name.

1.4

What is the minimum value of the variable “Beat”?

min(D$Beat)
[1] 111

1.5

How many observations have value TRUE in the Arrest variable (this is the number of crimes for which an arrest was made)?

sum(D$Arrest==TRUE)
[1] 15536

== tests for equality.

1.6

How many observations have a LocationDescription value of ALLEY?

sum(D$LocationDescription=="ALLEY")
[1] 2308

Use sum() to get a count and mean() to get a proportion: mean(D$LocationDescription=="ALLEY") returns the proportion of observations located in an alley.
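The count-versus-proportion idea can be sketched on a small vector. The data below are made up for illustration and only stand in for D$LocationDescription; the variable names are hypothetical.

```r
# Toy data (hypothetical), standing in for D$LocationDescription
loc <- c("ALLEY", "STREET", "ALLEY", "PARKING LOT", "STREET")

# Comparing with == gives a logical vector: TRUE FALSE TRUE FALSE FALSE
is_alley <- loc == "ALLEY"

# TRUE counts as 1 and FALSE as 0, so sum() is the number of matches
n_alley <- sum(is_alley)    # 2

# mean() divides that count by the length, giving the proportion
p_alley <- mean(is_alley)   # 2 / 5 = 0.4
```

If the column contained missing values, both calls would return NA unless na.rm = TRUE is passed.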

Section 2 - Understanding Dates in R

In many datasets, like this one, you have a date field. Unfortunately, R does not automatically recognize entries that look like dates. We need to use a function in R to extract the date and time. Take a look at the first entry of Date (remember to use square brackets when looking at a certain entry of a variable).

2.1

In what format are the entries in the variable Date?

  • Month/Day/Year Hour:Minute
  • Day/Month/Year Hour:Minute
  • Hour:Minute Month/Day/Year
  • Hour:Minute Day/Month/Year
D$Date = as.character(D$Date)
head(D$Date,5)
[1] "12/31/12 23:15" "12/31/12 22:00" "12/31/12 22:00" "12/31/12 22:00"
[5] "12/31/12 21:30"
# Month/Day/Year Hour:Minute

head() returns the first six elements by default; passing a number as the second argument, as in head(x, 5), returns only that many.

2.2

Now, let’s convert these characters into a Date object in R. In your R console, type

DateConvert = as.Date(strptime(mvt$Date, "%m/%d/%y %H:%M"))

This converts the variable “Date” into a Date object in R. Take a look at the variable DateConvert using the summary function.

What is the month and year of the median date in our dataset? Enter your answer as “Month Year”, without the quotes. (Ex: if the answer was 2008-03-28, you would give the answer “March 2008”, without the quotes.)

ts = as.POSIXct(D$Date, format="%m/%d/%y %H:%M")
median(ts)
[1] "2006-05-21 12:30:00 CST"

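as.POSIXct(x, format = <time format>) parses date-times and keeps both the date and the time of day (here, down to hours and minutes).

as.Date() keeps only the calendar date and drops the time of day.

ts is the variable that stores the result of as.POSIXct().

median() computes the median; here it falls in May 2006.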
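The difference between the two conversions can be seen on a few sample strings. This is a minimal sketch: the strings are made up in the same format as D$Date, and tz = "UTC" is an assumption added here so the result does not depend on the local time zone (the original call uses the system default).

```r
# Sample strings (hypothetical) in the Month/Day/Year Hour:Minute format
x   <- c("12/31/12 23:15", "12/31/12 22:00", "01/05/06 09:30")
fmt <- "%m/%d/%y %H:%M"

# POSIXct keeps date and time of day; tz = "UTC" is an assumption for this sketch
ts_demo <- as.POSIXct(x, format = fmt, tz = "UTC")

# Date keeps only the calendar day
d_demo <- as.Date(strptime(x, fmt, tz = "UTC"))

first_ts   <- format(ts_demo[1], "%Y-%m-%d %H:%M")  # "2012-12-31 23:15"
first_date <- format(d_demo[1])                     # "2012-12-31"

# median() works on date objects as well
mid_date <- format(median(d_demo))                  # "2012-12-31"
```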

2.3

Now, let’s extract the month and the day of the week, and add these variables to our data frame mvt. We can do this with two simple functions. Type the following commands in R:

mvt$Month = months(DateConvert)

mvt$Weekday = weekdays(DateConvert)

This creates two new variables in our data frame, Month and Weekday, and sets them equal to the month and weekday values that we can extract from the Date object. Lastly, replace the old Date variable with DateConvert by typing:

mvt$Date = DateConvert

Using the table command, answer the following questions.

D$Month = format(ts, "%m")
D$Weekday = format(ts, "%w")

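To refer to the Month column in D, write D$Month, then define it with D$Month = format(ts, "%m"). Likewise, refer to the Weekday column as D$Weekday and define it with D$Weekday = format(ts, "%w"). Here %m gives the month as a two-digit number (01 to 12) and %w gives the weekday as a number (0 to 6, where 0 is Sunday).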
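A small sketch of what these two format codes return, using one made-up timestamp. The tz = "UTC" argument and the weekday lookup vector are assumptions added for illustration, not part of the original analysis.

```r
# One sample timestamp (hypothetical); tz = "UTC" is an assumption for this sketch
ts_demo <- as.POSIXct("12/31/12 23:15", format = "%m/%d/%y %H:%M", tz = "UTC")

m <- format(ts_demo, "%m")   # month as two digits: "12"
w <- format(ts_demo, "%w")   # weekday as a number, 0 = Sunday: "1"

# Hypothetical lookup to turn the %w code into an English weekday name
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")
w_name <- day_names[as.integer(w) + 1]   # "Monday"
```

Unlike months() and weekdays(), these numeric codes do not depend on the locale. Under this coding, a Weekday value of 5 corresponds to Friday.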

In which month did the fewest motor vehicle thefts occur?

sort(table(D$Month))

   02    04    03    06    05    01    09    11    12    08    07    10 
13511 15280 15758 16002 16035 16047 16060 16063 16426 16572 16801 17086 

table() tabulates the values into counts, which makes it possible to see which is smallest. sort() arranges the counts in increasing order.
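The table-then-sort pattern can be sketched on a short vector. The values below are made up and only stand in for D$Month.

```r
# Toy month codes (hypothetical), standing in for D$Month
month <- c("02", "01", "03", "01", "03", "03")

counts <- table(month)   # counts per value: 01 -> 2, 02 -> 1, 03 -> 3
sorted <- sort(counts)   # increasing order: 02, 01, 03

fewest <- names(sorted)[1]                # "02"
most   <- names(sorted)[length(sorted)]   # "03"
```

The first name in the sorted table is the least frequent value and the last is the most frequent, which is how the outputs in 2.3 and 2.4 are read.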

2.4

On which weekday did the most motor vehicle thefts occur?

sort(table(D$Weekday))

    0     2     6     4     1     3     5 
26316 26791 27118 27319 27397 27416 29284 

2.5

Each observation in the dataset represents a motor vehicle theft, and the Arrest variable indicates whether an arrest was later made for this theft. Which month has the largest number of motor vehicle thefts for which an arrest was made?

library(dplyr)
package 'dplyr' was built under R version 3.5.1
Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(dplyr) loads dplyr, a package for more efficient data manipulation; it also makes the %>% pipe available.

tapply(D$Arrest, D$Month, sum) %>% sort 
  05   06   02   09   04   11   03   07   08   10   12   01 
1187 1230 1238 1248 1252 1256 1298 1324 1329 1342 1397 1435 

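tapply(D$Arrest, D$Month, sum) computes the sum of Arrest by month; replacing sum with mean gives the mean of Arrest by month. The %>% pipe passes the result on its left to the function on its right, so it replaces a second or third level of nested parentheses: x %>% sort is equivalent to sort(x).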
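A minimal sketch of the grouped sum and grouped mean on made-up data. The vectors below are hypothetical stand-ins for D$Arrest and D$Month, and base R nesting is used instead of the pipe so the example runs without dplyr.

```r
# Toy data (hypothetical), standing in for D$Arrest and D$Month
arrest <- c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE)
month  <- c("01", "01", "02", "02", "02", "03")

# Number of arrests per month: 01 -> 1, 02 -> 2, 03 -> 0
by_sum <- tapply(arrest, month, sum)

# Proportion of thefts with an arrest per month: 01 -> 0.5, 02 -> 2/3, 03 -> 0
by_mean <- tapply(arrest, month, mean)

# Nested form; with dplyr loaded, by_sum %>% sort gives the same result
sorted <- sort(by_sum)
top_month <- names(sorted)[length(sorted)]   # "02"
```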

