The Analytics Edge - Assignment 1

How many rows of data (observations) are in this dataset? How many variables are in this dataset?

mvt <- read.csv("mvtWeek1.csv")
str(mvt)

## 'data.frame':    191641 obs. of  11 variables:
##  $ ID                 : int  8951354 8951141 8952745 8952223 8951608 8950793 8950760 8951611 8951802 8950706 ...
##  $ Date               : Factor w/ 131680 levels "1/1/01 0:01",..: 42824 42823 42823 42823 42822 42821 42820 42819 42817 42816 ...
##  $ LocationDescription: Factor w/ 78 levels "ABANDONED BUILDING",..: 72 72 62 72 72 72 72 72 72 72 ...
##  $ Arrest             : logi  FALSE FALSE FALSE FALSE FALSE TRUE ...
##  $ Domestic           : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Beat               : int  623 1213 1622 724 211 2521 423 231 1021 1215 ...
##  $ District           : int  6 12 16 7 2 25 4 2 10 12 ...
##  $ CommunityArea      : int  69 24 11 67 35 19 48 40 29 24 ...
##  $ Year               : int  2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
##  $ Latitude           : num  41.8 41.9 42 41.8 41.8 ...
##  $ Longitude          : num  -87.6 -87.7 -87.8 -87.7 -87.6 ...

Using the “max” function, what is the maximum value of the variable “ID”?

max(mvt$ID)

## [1] 9181151

What is the minimum value of the variable “Beat”?

min(mvt$Beat)

## [1] 111

How many observations have value TRUE in the Arrest variable (this is the number of crimes for which an arrest was made)?

sum(mvt$Arrest)

## [1] 15536

How many observations have a LocationDescription value of ALLEY?

sum(mvt$LocationDescription == "ALLEY")

## [1] 2308

In what format are the entries in the variable Date?

head(mvt$Date)

## [1] 12/31/12 23:15 12/31/12 22:00 12/31/12 22:00 12/31/12 22:00
## [5] 12/31/12 21:30 12/31/12 20:30
## 131680 Levels: 1/1/01 0:01 1/1/01 0:05 1/1/01 0:30 1/1/01 1:17 ... 9/9/12 9:50

What is the month and year of the median date in our dataset? Enter your answer as “Month Year”, without the quotes. (Ex: if the answer was 2008-03-28, you would give the answer “March 2008”, without the quotes.)

DateConvert <- as.Date(strptime(mvt$Date, "%m/%d/%y %H:%M"))
summary(DateConvert)

##         Min.      1st Qu.       Median         Mean      3rd Qu. 
## "2001-01-01" "2003-07-10" "2006-05-21" "2006-08-23" "2009-10-24" 
##         Max. 
## "2012-12-31"

mvt$Month <- months(DateConvert)
mvt$Weekday <- weekdays(DateConvert)
mvt$Date <- DateConvert
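
As a minimal sketch of the conversion above, the format string can be checked on a single made-up timestamp before applying it to the whole column. The timestamp and the variable names below are illustrative assumptions, and the month and weekday names returned depend on the session locale.

```r
# Hypothetical single timestamp in the same "month/day/2-digit year hour:minute" layout
raw <- "12/31/12 23:15"

# %m = month, %d = day, %y = two-digit year, %H = hour (24h), %M = minute
parsed <- as.Date(strptime(raw, "%m/%d/%y %H:%M"))

parsed_year <- format(parsed, "%Y")   # four-digit year as a character string
parsed_month <- months(parsed)        # locale-dependent month name
parsed_weekday <- weekdays(parsed)    # locale-dependent weekday name
```

If the format string does not match the text, strptime returns NA rather than raising an error, so checking for NA values after the conversion is a cheap safeguard.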

In which month did the fewest motor vehicle thefts occur?

(t <- table(mvt$Month))

## 
##     April    August  December  February   January      July      June 
##     15280     16572     16426     13511     16047     16801     16002 
##     March       May  November   October September 
##     15758     16035     16063     17086     16060

which.min(t)

## February 
##        4
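
The number printed under the month name by which.min is the position of that cell in the alphabetically sorted table, not a theft count. A small sketch on made-up month labels (hypothetical data, not the mvt dataset) shows one way to read off the label and the count separately:

```r
# Toy month labels (hypothetical data, not the mvt dataset)
toy_months <- c("January", "February", "March", "March", "January", "March")

# table() orders character labels alphabetically: February, January, March
counts <- table(toy_months)

smallest_label <- names(which.min(counts))  # label of the smallest cell
smallest_count <- min(counts)               # the count in that cell
```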

On which weekday did the most motor vehicle thefts occur?

(t <- table(mvt$Weekday))

## 
##    Friday    Monday  Saturday    Sunday  Thursday   Tuesday Wednesday 
##     29284     27397     27118     26316     27319     26791     27416

which.max(t)

## Friday 
##      1

Each observation in the dataset represents a motor vehicle theft, and the Arrest variable indicates whether an arrest was later made for this theft. Which month has the largest number of motor vehicle thefts for which an arrest was made?

arrest <- subset(mvt, mvt$Arrest == TRUE)
(t <- table(arrest$Month))

## 
##     April    August  December  February   January      July      June 
##      1252      1329      1397      1238      1435      1324      1230 
##     March       May  November   October September 
##      1298      1187      1256      1342      1248

which.max(t)

## January 
##       5

First, let’s make a histogram of the variable Date. We’ll add an extra argument, to specify the number of bars we want in our histogram. In your R console, type

hist(mvt$Date, breaks=100)

Does it look like there were more crimes for which arrests were made in the first half of the time period or the second half of the time period? (Note that the time period is from 2001 to 2012, so the middle of the time period is the beginning of 2007.)

boxplot(mvt$Date ~ mvt$Arrest)

For what proportion of motor vehicle thefts in 2001 was an arrest made?

mvt$Year <- as.numeric(format(mvt$Date, "%Y"))
mvt2001 <- subset(mvt, mvt$Year == 2001)
sum(mvt2001$Arrest) / nrow(mvt2001)

## [1] 0.1041173

For what proportion of motor vehicle thefts in 2007 was an arrest made?

mvt2007 <- subset(mvt, mvt$Year == 2007)
sum(mvt2007$Arrest) / nrow(mvt2007)

## [1] 0.08487395

For what proportion of motor vehicle thefts in 2012 was an arrest made?

mvt2012 <- subset(mvt, mvt$Year == 2012)
sum(mvt2012$Arrest) / nrow(mvt2012)

## [1] 0.03902924
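
The three yearly proportions above could also be computed in one pass. The sketch below uses a made-up data frame (hypothetical values, not the mvt dataset) and relies on the fact that the mean of a logical vector is the share of TRUE values:

```r
# Toy data (hypothetical, not the mvt dataset)
toy <- data.frame(
  Year   = c(2001, 2001, 2001, 2001, 2012, 2012),
  Arrest = c(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)
)

# mean() of a logical vector is the share of TRUE values,
# so this matches sum(Arrest) / nrow() within each year
arrest_rate <- tapply(toy$Arrest, toy$Year, mean)
arrest_rate
```

Applied to the real data, the same pattern would be tapply(mvt$Arrest, mvt$Year, mean), assuming Arrest contains no missing values.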

Which locations are the top five locations for motor vehicle thefts, excluding the “Other” category?

head(sort(table(mvt$LocationDescription), decreasing=TRUE), 6)

## 
##                         STREET PARKING LOT/GARAGE(NON.RESID.) 
##                         156564                          14852 
##                          OTHER                          ALLEY 
##                           4573                           2308 
##                    GAS STATION         DRIVEWAY - RESIDENTIAL 
##                           2111                           1675

Create a subset of your data, only taking observations for which the theft happened in one of these five locations, and call this new data set “Top5”. To do this, you can use the | symbol. In lecture, we used the & symbol to use two criteria to make a subset of the data. To only take observations that have a certain value in one variable or the other, the | character can be used in place of the & symbol. This is also called a logical “or” operation.

Alternately, you could create five different subsets, and then merge them together into one data frame using rbind.
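
The two approaches described above can be sketched on a small made-up data frame. The values and variable names here are hypothetical and only two locations are used to keep the example short. The %in% operator is shown as a third option, since it is shorthand for a chain of == comparisons joined by |.

```r
# Toy data frame (hypothetical values, not the mvt dataset)
toy <- data.frame(
  ID = 1:5,
  LocationDescription = c("STREET", "ALLEY", "OTHER", "GAS STATION", "STREET"),
  stringsAsFactors = FALSE
)

# Option 1: one subset with a logical "or"
by_or <- subset(toy, LocationDescription == "STREET" | LocationDescription == "ALLEY")

# Option 2: separate subsets stacked with rbind (rows come out grouped by location)
by_rbind <- rbind(
  subset(toy, LocationDescription == "STREET"),
  subset(toy, LocationDescription == "ALLEY")
)

# Option 3: %in% tests membership in a vector of allowed values
by_in <- subset(toy, LocationDescription %in% c("STREET", "ALLEY"))

nrow(by_or)
nrow(by_rbind)
nrow(by_in)
```

All three select the same rows. Only the rbind version changes their order.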

How many observations are in Top5?

Top5 <- subset(mvt, LocationDescription %in% c("STREET", "PARKING LOT/GARAGE(NON.RESID.)", "ALLEY", "GAS STATION", "DRIVEWAY - RESIDENTIAL"))
nrow(Top5)

Andy

Thursday, June 11, 2015