Section 1 - Loading the Data
1.1
How many rows of data (observations) are in this dataset?
nrow(D)
[1] 191641
1.2
How many variables are in this dataset?
ncol(D)
[1] 11
1.3
Using the “max” function, what is the maximum value of the variable “ID”?
max(D$ID)
[1] 9181151
1.4
What is the minimum value of the variable “Beat”?
min(D$Beat)
[1] 111
1.5
How many observations have value TRUE in the Arrest variable (this is the number of crimes for which an arrest was made)?
summary(D$Arrest) #15536
Mode FALSE TRUE
logical 176105 15536
1.6
How many observations have a LocationDescription value of ALLEY?
summary(D$LocationDescription == "ALLEY") #2308
Mode FALSE TRUE
logical 189333 2308
Section 2 - Understanding Dates in R
In many datasets, like this one, you have a date field. Unfortunately, R does not automatically recognize entries that look like dates. We need to use a function in R to extract the date and time. Take a look at the first entry of Date (remember to use square brackets when looking at a certain entry of a variable).
2.1
In what format are the entries in the variable Date?
- Month/Day/Year Hour:Minute
- Day/Month/Year Hour:Minute
- Hour:Minute Month/Day/Year
- Hour:Minute Day/Month/Year
head(D$Date) #Month/Day/Year Hour:Minute
[1] 12/31/12 23:15 12/31/12 22:00 12/31/12 22:00 12/31/12 22:00
[5] 12/31/12 21:30 12/31/12 20:30
131680 Levels: 10/10/01 0:00 10/10/01 0:01 10/10/01 0:30 ... 9/9/12 9:50
2.2
Now, let’s convert these characters into a Date object in R. In your R console, type
DateConvert = as.Date(strptime(mvt$Date, "%m/%d/%y %H:%M"))
This converts the variable “Date” into a Date object in R. Take a look at the variable DateConvert using the summary function.
What is the month and year of the median date in our dataset? Enter your answer as “Month Year”, without the quotes. (Ex: if the answer was 2008-03-28, you would give the answer “March 2008”, without the quotes.)
DateConvert = as.Date(strptime(D$Date, "%m/%d/%y %H:%M"))
median(DateConvert)
[1] "2006-05-21"
# May 2006
2.3
Now, let’s extract the month and the day of the week, and add these variables to our data frame mvt. We can do this with two simple functions. Type the following commands in R:
mvt$Month = months(DateConvert)
mvt$Weekday = weekdays(DateConvert)
This creates two new variables in our data frame, Month and Weekday, and sets them equal to the month and weekday values that we can extract from the Date object. Lastly, replace the old Date variable with DateConvert by typing:
mvt$Date = DateConvert
Using the table command, answer the following questions.
In which month did the fewest motor vehicle thefts occur?
D$Month = months(DateConvert)
D$Weekday = weekdays(DateConvert)
D$Date = DateConvert
table(D$Month) #February 13511
April August December February January July June
15280 16572 16426 13511 16047 16801 16002
March May November October September
15758 16035 16063 17086 16060
2.4
On which weekday did the most motor vehicle thefts occur?
table(D$Weekday) #Friday 29284
Friday Monday Saturday Sunday Thursday Tuesday Wednesday
29284 27397 27118 26316 27319 26791 27416
2.5
Each observation in the dataset represents a motor vehicle theft, and the Arrest variable indicates whether an arrest was later made for this theft. Which month has the largest number of motor vehicle thefts for which an arrest was made?
table(D$Month,D$Arrest) #January 1435
FALSE TRUE
April 14028 1252
August 15243 1329
December 15029 1397
February 12273 1238
January 14612 1435
July 15477 1324
June 14772 1230
March 14460 1298
May 14848 1187
November 14807 1256
October 15744 1342
September 14812 1248
Section 3 - Visualizing Crime Trends
3.1
Now, let’s make some plots to help us better understand how crime has changed over time in Chicago. Throughout this problem, and in general, you can save your plot to a file. For more information, this website very clearly explains the process.
First, let’s make a histogram of the variable Date. We’ll add an extra argument, to specify the number of bars we want in our histogram. In your R console, type
hist(mvt$Date, breaks=100)
hist(D$Date, breaks=100)

Looking at the histogram, answer the following questions.
In general, does it look like crime increases or decreases from 2002 - 2012?
#Decreases
In general, does it look like crime increases or decreases from 2005 - 2008?
#decreases
3.2
Now, let’s see how arrests have changed over time. Create a boxplot of the variable “Date”, sorted by the variable “Arrest” (if you are not familiar with boxplots and would like to learn more, check out this tutorial). In a boxplot, the bold horizontal line is the median value of the data, the box shows the range of values between the first quartile and third quartile, and the whiskers (the dotted lines extending outside the box) show the minimum and maximum values, excluding any outliers (which are plotted as circles). Outliers are defined by first computing the difference between the first and third quartile values, or the height of the box. This number is called the Inter-Quartile Range (IQR). By R's default, any point that is greater than the third quartile plus 1.5 times the IQR, or less than the first quartile minus 1.5 times the IQR, is considered an outlier.
Does it look like there were more crimes for which arrests were made in the first half of the time period or the second half of the time period? (Note that the time period is from 2001 to 2012, so the middle of the time period is the beginning of 2007.)

3.3
Let’s investigate this further. Use the table function for the next few questions.
For what proportion of motor vehicle thefts in 2001 was an arrest made?
Note: in this question and many others in the course, we are asking for an answer as a proportion. Therefore, your answer should take a value between 0 and 1.
table(D$Year)
2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
20669 18753 16657 16862 16484 16098 14280 14445 12167 15497 15637 14092
ans = mean(D$Arrest[D$Year == 2001]) #arrests in 2001 / thefts in 2001 (20669/nrow(D) is only 2001's share of all thefts)
ans
3.4
For what proportion of motor vehicle thefts in 2007 was an arrest made?
ans = mean(D$Arrest[D$Year == 2007]) #arrests in 2007 / thefts in 2007 (14280/nrow(D) is only 2007's share of all thefts)
ans
3.5
For what proportion of motor vehicle thefts in 2012 was an arrest made?
ans = mean(D$Arrest[D$Year == 2012]) #arrests in 2012 / thefts in 2012 (14092/nrow(D) is only 2012's share of all thefts)
ans
Since there may still be open investigations for recent crimes, this could explain the trend we are seeing in the data. There could also be other factors at play, and this trend should be investigated further. However, since we don’t know when the arrests were actually made, our detective work in this area has reached a dead end.
Section 4 - Popular Locations
4.1
Analyzing this data could be useful to the Chicago Police Department when deciding where to allocate resources. If they want to increase the number of arrests that are made for motor vehicle thefts, where should they focus their efforts?
We want to find the top five locations where motor vehicle thefts occur. If you create a table of the LocationDescription variable, it is unfortunately very hard to read since there are 78 different locations in the data set. By using the sort function, we can view this same table, but sorted by the number of observations in each category. In your R console, type:
sort(table(mvt$LocationDescription))
Which locations are the top five locations for motor vehicle thefts, excluding the “Other” category? You should select 5 of the following options.
- Bank
- Gas Station
- Hotel/Motel
- Street
- Car Wash
- Restaurant
- Parking Lot/Garage (Non-Residential)
- Alley
- Driveway (Residential)
- Vacant Lot/Land
head(sort(table(D$LocationDescription),decreasing = TRUE))
STREET PARKING LOT/GARAGE(NON.RESID.)
156564 14852
OTHER ALLEY
4573 2308
GAS STATION DRIVEWAY - RESIDENTIAL
2111 1675
4.2
Create a subset of your data, only taking observations for which the theft happened in one of these five locations, and call this new data set “Top5”. To do this, you can use the | symbol. In lecture, we used the & symbol to use two criteria to make a subset of the data. To only take observations that have a certain value in one variable or the other, the | character can be used in place of the & symbol. This is also called a logical “or” operation.
Alternately, you could create five different subsets, and then merge them together into one data frame using rbind.
How many observations are in Top5?
Top5 = subset(D,LocationDescription == "STREET" |
LocationDescription =="PARKING LOT/GARAGE(NON.RESID.)"|
LocationDescription =="ALLEY"|
LocationDescription =="GAS STATION"|
LocationDescription =="DRIVEWAY - RESIDENTIAL")
nrow(Top5) #156564 + 14852 + 2308 + 2111 + 1675 = 177510 ("OTHER" is excluded, as the question requires)
4.3
R will remember the other categories of the LocationDescription variable from the original dataset, so running table(Top5$LocationDescription) will have a lot of unnecessary output. To make our tables a bit nicer to read, we can refresh this factor variable. In your R console, type:
Top5$LocationDescription = factor(Top5$LocationDescription)
If you run the str or table function on Top5 now, you should see that LocationDescription now only has 5 values, as we expect.
Use the Top5 data frame to answer the remaining questions.
One of the locations has a much higher arrest rate than the other locations. Which is it? Please enter the text in exactly the same way as how it looks in the answer options for Problem 4.1.
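table(Top5$LocationDescription)
ALLEY DRIVEWAY - RESIDENTIAL
2308 1675
GAS STATION PARKING LOT/GARAGE(NON.RESID.)
2111 14852
STREET
156564
prop.table(table(Top5$LocationDescription,Top5$Arrest),1) #arrest rate is the TRUE column; compare rates, not counts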
4.4
On which day of the week do the most motor vehicle thefts at gas stations happen? (Monday~Sunday)
table(Top5$LocationDescription,Top5$Weekday) #ans: Saturday (GAS STATION row, 338)
Friday Monday Saturday Sunday Thursday
ALLEY 385 320 341 307 315
DRIVEWAY - RESIDENTIAL 257 255 202 221 263
GAS STATION 332 280 338 336 282
PARKING LOT/GARAGE(NON.RESID.) 2331 2128 2199 1936 2082
STREET 23773 22305 22175 21756 22296
Tuesday Wednesday
ALLEY 323 317
DRIVEWAY - RESIDENTIAL 243 234
GAS STATION 270 273
PARKING LOT/GARAGE(NON.RESID.) 2073 2103
STREET 21888 22371
4.5
On which day of the week do the fewest motor vehicle thefts in residential driveways happen?(Monday~Sunday)
table(D$LocationDescription == "DRIVEWAY - RESIDENTIAL",D$Weekday) #ans:Saturday
Friday Monday Saturday Sunday Thursday Tuesday Wednesday
FALSE 29027 27142 26916 26095 27056 26548 27182
TRUE 257 255 202 221 263 243 234
---
title: "AS1-1 An Analytical Detective"
author: "<name> <student ID>"
output: html_notebook
---

- - -

### Section 1 - Loading the Data

#### 1.1 
How many rows of data (observations) are in this dataset?

```{r}
D = read.csv("data/mvtWeek1.csv")
nrow(D)
```


#### 1.2 
How many variables are in this dataset?
```{r}
ncol(D)
```


#### 1.3 
Using the "max" function, what is the maximum value of the variable "ID"?

```{r}
max(D$ID)
```

#### 1.4 
What is the minimum value of the variable "Beat"?
```{r}
min(D$Beat)
```

#### 1.5 
How many observations have value TRUE in the Arrest variable (this is the number of crimes for which an arrest was made)?

```{r}
summary(D$Arrest) #15536
```

#### 1.6 
How many observations have a LocationDescription value of ALLEY?

```{r}
summary(D$LocationDescription == "ALLEY") #2308
```
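
Counting `TRUE` values can also be done with `sum`, since R treats `TRUE` as 1 and `FALSE` as 0. The chunk below is a minimal sketch on made-up vectors; `toy_arrest` and `toy_location` are hypothetical stand-ins for `D$Arrest` and `D$LocationDescription`, not values from the dataset.

```{r}
# Hypothetical toy vectors, not taken from mvtWeek1.csv
toy_arrest <- c(TRUE, FALSE, FALSE, TRUE, FALSE)
toy_location <- c("ALLEY", "STREET", NA, "ALLEY", "STREET")

# TRUE counts as 1 and FALSE as 0, so sum() gives the number of TRUEs
n_arrests <- sum(toy_arrest)

# A comparison against NA yields NA; na.rm = TRUE skips those entries
n_alley <- sum(toy_location == "ALLEY", na.rm = TRUE)

n_arrests
n_alley
```

On the real data the same idea would read `sum(D$Arrest)` and `sum(D$LocationDescription == "ALLEY")`.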

### Section 2 - Understanding Dates in R


In many datasets, like this one, you have a date field. Unfortunately, R does not automatically recognize entries that look like dates. We need to use a function in R to extract the date and time. Take a look at the first entry of Date (remember to use square brackets when looking at a certain entry of a variable).

#### 2.1 
In what format are the entries in the variable Date?

+ Month/Day/Year Hour:Minute
+ Day/Month/Year Hour:Minute
+ Hour:Minute Month/Day/Year
+ Hour:Minute Day/Month/Year

```{r}
head(D$Date) #Month/Day/Year Hour:Minute
```

#### 2.2 

Now, let's convert these characters into a Date object in R. In your R console, type

    DateConvert = as.Date(strptime(mvt$Date, "%m/%d/%y %H:%M"))

This converts the variable "Date" into a Date object in R. Take a look at the variable DateConvert using the summary function.

What is the month and year of the median date in our dataset? Enter your answer as "Month Year", without the quotes. (Ex: if the answer was 2008-03-28, you would give the answer "March 2008", without the quotes.)

```{r}
DateConvert = as.Date(strptime(D$Date, "%m/%d/%y %H:%M"))
median(DateConvert)
# May 2006
```
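
To see what the format string does, here is a small sketch on three made-up timestamps written in the same Month/Day/Year Hour:Minute layout. The vector `toy_stamps` is hypothetical; only `strptime` and `as.Date` come from the text above.

```{r}
# Hypothetical timestamps in the same layout as D$Date
toy_stamps <- c("12/31/12 23:15", "1/1/01 0:01", "7/4/06 12:30")

# %m = month, %d = day, %y = two-digit year, %H:%M = 24-hour time
toy_dates <- as.Date(strptime(toy_stamps, "%m/%d/%y %H:%M"))

class(toy_dates)    # a Date vector: the time of day is dropped
format(toy_dates)   # ISO year-month-day strings
median(toy_dates)   # Date objects can be ordered, so median() works
```

Because the result is a `Date`, functions such as `median`, `months` and `weekdays` can be applied to it directly.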

#### 2.3
Now, let's extract the month and the day of the week, and add these variables to our data frame mvt. We can do this with two simple functions. Type the following commands in R:

    mvt$Month = months(DateConvert)

    mvt$Weekday = weekdays(DateConvert)

This creates two new variables in our data frame, Month and Weekday, and sets them equal to the month and weekday values that we can extract from the Date object. Lastly, replace the old Date variable with DateConvert by typing:

    mvt$Date = DateConvert

Using the table command, answer the following questions.

In which month did the fewest motor vehicle thefts occur?

```{r}
D$Month = months(DateConvert)
D$Weekday = weekdays(DateConvert)
D$Date = DateConvert
table(D$Month) #February 13511
```
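
`table` lists month names alphabetically, which makes the counts harder to scan. One possible workaround, sketched here on made-up dates, is to build a factor whose levels follow the calendar. The names `toy_d`, `toy_month` and `toy_counts` are hypothetical; `month.name` is R's built-in vector of English month names, and indexing it by month number sidesteps the locale-dependent output of `months`.

```{r}
# Hypothetical dates, not taken from the dataset
toy_d <- as.Date(c("2006-02-10", "2006-10-03", "2007-10-21", "2006-04-15"))

# Month number -> English month name, with levels in calendar order
toy_month <- factor(month.name[as.integer(format(toy_d, "%m"))],
                    levels = month.name)

toy_counts <- table(toy_month)
toy_counts[toy_counts > 0]   # February, April, October in calendar order
```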

#### 2.4 
On which weekday did the most motor vehicle thefts occur?

```{r}
table(D$Weekday) #Friday 29284
```
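
Rather than reading the largest or smallest count off the table by eye, `which.max` and `which.min` can pick the label. A minimal sketch on a made-up vector (`toy_weekday` is hypothetical):

```{r}
# Hypothetical weekday labels
toy_weekday <- c("Friday", "Monday", "Friday", "Sunday", "Friday", "Monday")
toy_tab <- table(toy_weekday)

# which.max/which.min return the position of the extreme count, keeping its name
most_common <- names(which.max(toy_tab))
least_common <- names(which.min(toy_tab))

most_common
least_common
```

On the real data this would be `names(which.max(table(D$Weekday)))`. Note that both functions return only the first position when there are ties.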

#### 2.5 
Each observation in the dataset represents a motor vehicle theft, and the Arrest variable indicates whether an arrest was later made for this theft. Which month has the largest number of motor vehicle thefts for which an arrest was made?

```{r}
table(D$Month,D$Arrest) #January 1435
```

### Section 3 - Visualizing Crime Trends

#### 3.1

Now, let's make some plots to help us better understand how crime has changed over time in Chicago. Throughout this problem, and in general, you can save your plot to a file. For more information, this website very clearly explains the process.

First, let's make a histogram of the variable Date. We'll add an extra argument, to specify the number of bars we want in our histogram. In your R console, type

    hist(mvt$Date, breaks=100)

```{r}
hist(D$Date, breaks=100)
```

Looking at the histogram, answer the following questions.

In general, does it look like crime increases or decreases from 2002 - 2012?

+ Increases
+ Decreases

```{r}
#Decreases
```

In general, does it look like crime increases or decreases from 2005 - 2008?

+ Increases
+ Decreases

```{r}
#decreases
```

#### 3.2
Now, let's see how arrests have changed over time. Create a boxplot of the variable "Date", sorted by the variable "Arrest" (if you are not familiar with boxplots and would like to learn more, check out this tutorial). In a boxplot, the bold horizontal line is the median value of the data, the box shows the range of values between the first quartile and third quartile, and the whiskers (the dotted lines extending outside the box) show the minimum and maximum values, excluding any outliers (which are plotted as circles). Outliers are defined by first computing the difference between the first and third quartile values, or the height of the box. This number is called the Inter-Quartile Range (IQR). By R's default, any point that is greater than the third quartile plus 1.5 times the IQR, or less than the first quartile minus 1.5 times the IQR, is considered an outlier.

Does it look like there were more crimes for which arrests were made in the first half of the time period or the second half of the time period? (Note that the time period is from 2001 to 2012, so the middle of the time period is the beginning of 2007.)

+ First half
+ Second half

```{r}
# Date is the plotted value and Arrest the grouping variable (one box per FALSE/TRUE)
boxplot(Date ~ Arrest, data = D)
```
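
The outlier rule can be checked numerically. The sketch below uses made-up numbers and `boxplot.stats`, the helper that `boxplot` relies on for its statistics; note that R's default multiplier (`coef`) is 1.5 times the box height. The vector `toy_x` and the fence variables are hypothetical names.

```{r}
# Hypothetical values: 14 is large but inside the fence, 30 is not
toy_x <- c(1, 2, 3, 4, 5, 6, 7, 8, 14, 30)

toy_stats <- boxplot.stats(toy_x, coef = 1.5)  # coef = 1.5 is the default
hinges <- toy_stats$stats[c(2, 4)]   # lower and upper hinge (roughly Q1 and Q3)
box_height <- diff(hinges)           # the IQR as drawn by the box

upper_fence <- hinges[2] + 1.5 * box_height
lower_fence <- hinges[1] - 1.5 * box_height

c(lower_fence, upper_fence)
toy_stats$out   # points beyond the fences, drawn as circles
```

With a multiplier of 1 instead of 1.5 the upper fence would drop to 13 and the value 14 would also be flagged, so the multiplier matters when reading the plot.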


#### 3.3
Let's investigate this further. Use the table function for the next few questions.

For what proportion of motor vehicle thefts in 2001 was an arrest made?

Note: in this question and many others in the course, we are asking for an answer as a proportion. Therefore, your answer should take a value between 0 and 1.

```{r}
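table(D$Year, D$Arrest) # thefts per year, split by arrest outcome
ans = mean(D$Arrest[D$Year == 2001]) # arrests in 2001 / thefts in 2001 (20669/nrow(D) is only 2001's share of all thefts)
ans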
```
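
An arrest proportion for a year divides the arrests in that year by the thefts in that year, not by the size of the whole dataset. A minimal sketch on a made-up data frame shows two equivalent ways to get it; `toy_thefts` and its values are hypothetical, while `Year` and `Arrest` mirror the column names in `D`.

```{r}
# Hypothetical data: four thefts in 2001 (one arrest), two in 2007 (one arrest)
toy_thefts <- data.frame(
  Year = c(2001, 2001, 2001, 2001, 2007, 2007),
  Arrest = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE)
)

# margin = 1 divides each row (year) by its own total
toy_rate <- prop.table(table(toy_thefts$Year, toy_thefts$Arrest), margin = 1)
toy_rate

# The mean of a logical vector is the share of TRUEs, computed per year here
toy_by_year <- tapply(toy_thefts$Arrest, toy_thefts$Year, mean)
toy_by_year
```

The `TRUE` column of `toy_rate` and the entries of `toy_by_year` hold the same numbers.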

#### 3.4
For what proportion of motor vehicle thefts in 2007 was an arrest made?

```{r}
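ans = mean(D$Arrest[D$Year == 2007]) # arrests in 2007 / thefts in 2007 (14280/nrow(D) is only 2007's share of all thefts)
ans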
```

#### 3.5
For what proportion of motor vehicle thefts in 2012 was an arrest made?

```{r}
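ans = mean(D$Arrest[D$Year == 2012]) # arrests in 2012 / thefts in 2012 (14092/nrow(D) is only 2012's share of all thefts)
ans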
```

Since there may still be open investigations for recent crimes, this could explain the trend we are seeing in the data. There could also be other factors at play, and this trend should be investigated further. However, since we don't know when the arrests were actually made, our detective work in this area has reached a dead end.

### Section 4 - Popular Locations

#### 4.1
Analyzing this data could be useful to the Chicago Police Department when deciding where to allocate resources. If they want to increase the number of arrests that are made for motor vehicle thefts, where should they focus their efforts?

We want to find the top five locations where motor vehicle thefts occur. If you create a table of the LocationDescription variable, it is unfortunately very hard to read since there are 78 different locations in the data set. By using the sort function, we can view this same table, but sorted by the number of observations in each category. In your R console, type:

    sort(table(mvt$LocationDescription))

Which locations are the top five locations for motor vehicle thefts, excluding the "Other" category? You should select 5 of the following options.

+ Bank
+ Gas Station
+ Hotel/Motel
+ Street
+ Car Wash
+ Restaurant
+ Parking Lot/Garage (Non-Residential)
+ Alley
+ Driveway (Residential)
+ Vacant Lot/Land

```{r}
head(sort(table(D$LocationDescription),decreasing = TRUE))


```
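
The top locations can also be pulled out as a character vector, which is handy for reuse. This is a sketch on made-up labels, assuming no ties in the counts; `toy_loc`, `toy_sorted` and `toy_top` are hypothetical names.

```{r}
# Hypothetical labels: OTHER 4, STREET 3, ALLEY 2, BANK 1
toy_loc <- c("STREET", "STREET", "STREET", "ALLEY", "ALLEY",
             "OTHER", "OTHER", "OTHER", "OTHER", "BANK")

toy_sorted <- sort(table(toy_loc), decreasing = TRUE)

# Drop the catch-all "OTHER" label, then keep the two most frequent names
toy_top <- head(names(toy_sorted)[names(toy_sorted) != "OTHER"], 2)
toy_top
```

On the real data, replacing `2` with `5` would give the five labels needed for the next question.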

#### 4.2 
Create a subset of your data, only taking observations for which the theft happened in one of these five locations, and call this new data set "Top5". To do this, you can use the | symbol. In lecture, we used the & symbol to use two criteria to make a subset of the data. To only take observations that have a certain value in one variable or the other, the | character can be used in place of the & symbol. This is also called a logical "or" operation.

Alternately, you could create five different subsets, and then merge them together into one data frame using rbind.

How many observations are in Top5?

```{r}
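# Five most common locations, excluding "OTHER" as the question requires
Top5 = subset(D, LocationDescription == "STREET" |
                LocationDescription == "PARKING LOT/GARAGE(NON.RESID.)" |
                LocationDescription == "ALLEY" |
                LocationDescription == "GAS STATION" |
                LocationDescription == "DRIVEWAY - RESIDENTIAL")
nrow(Top5)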
```
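
A long chain of `|` comparisons is easy to mistype. As an alternative, `%in%` tests membership in a vector of labels in one step. A minimal sketch on a made-up data frame; `toy_df` and `keep` are hypothetical.

```{r}
# Hypothetical data frame with the same column names as D
toy_df <- data.frame(
  LocationDescription = c("STREET", "ALLEY", "BANK", "GAS STATION", "OTHER"),
  Arrest = c(FALSE, TRUE, FALSE, TRUE, FALSE)
)

# Labels to keep; %in% is TRUE for rows whose label appears in this vector
keep <- c("STREET", "ALLEY", "GAS STATION")
toy_sub <- subset(toy_df, LocationDescription %in% keep)

nrow(toy_sub)
```

The same pattern applies to `D` with a vector of the five location labels.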

#### 4.3
R will remember the other categories of the LocationDescription variable from the original dataset, so running table(Top5$LocationDescription) will have a lot of unnecessary output. To make our tables a bit nicer to read, we can refresh this factor variable. In your R console, type:

    Top5$LocationDescription = factor(Top5$LocationDescription)

If you run the str or table function on Top5 now, you should see that LocationDescription now only has 5 values, as we expect.

Use the Top5 data frame to answer the remaining questions.

One of the locations has a much higher arrest rate than the other locations. Which is it? Please enter the text in exactly the same way as how it looks in the answer options for Problem 4.1.

```{r}
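Top5$LocationDescription = factor(Top5$LocationDescription) # R keeps the factor levels of the original dataset, so table() would list many levels with zero counts; rebuilding the factor drops them
table(Top5$LocationDescription)
# Arrest rate per location: row-wise proportions, read the TRUE column
prop.table(table(Top5$LocationDescription, Top5$Arrest), 1)
# ans: the location with the largest TRUE share (STREET has the largest count, which is not the same as the highest rate)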

```
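
The "remembered" levels are easy to see on a small factor. The sketch below uses made-up labels and `droplevels`, a base R function that has the same effect as calling `factor` again; `toy_f` and `toy_kept` are hypothetical names.

```{r}
# Hypothetical factor with three levels
toy_f <- factor(c("ALLEY", "STREET", "BANK", "STREET"))

# Subsetting removes the BANK observation but not the BANK level
toy_kept <- toy_f[toy_f != "BANK"]
n_before <- nlevels(toy_kept)

# droplevels() discards levels that no longer occur
n_after <- nlevels(droplevels(toy_kept))

c(n_before, n_after)
```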


#### 4.4 
On which day of the week do the most motor vehicle thefts at gas stations happen?
(Monday~Sunday)

```{r}
table(Top5$LocationDescription,Top5$Weekday) #ans: Saturday (read the GAS STATION row)
```

#### 4.5
On which day of the week do the fewest motor vehicle thefts in residential driveways happen?(Monday~Sunday)

```{r}
sort(unique(as.character(D$LocationDescription))) # look up the exact label "DRIVEWAY - RESIDENTIAL" (works for factor or character columns)
table(D$LocationDescription == "DRIVEWAY - RESIDENTIAL",D$Weekday) #ans:Saturday
```





