How many rows of data (observations) are in this dataset? How many variables are in this dataset?
mvt <- read.csv("mvtWeek1.csv")
str(mvt)
## 'data.frame': 191641 obs. of 11 variables:
## $ ID : int 8951354 8951141 8952745 8952223 8951608 8950793 8950760 8951611 8951802 8950706 ...
## $ Date : Factor w/ 131680 levels "1/1/01 0:01",..: 42824 42823 42823 42823 42822 42821 42820 42819 42817 42816 ...
## $ LocationDescription: Factor w/ 78 levels "ABANDONED BUILDING",..: 72 72 62 72 72 72 72 72 72 72 ...
## $ Arrest : logi FALSE FALSE FALSE FALSE FALSE TRUE ...
## $ Domestic : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Beat : int 623 1213 1622 724 211 2521 423 231 1021 1215 ...
## $ District : int 6 12 16 7 2 25 4 2 10 12 ...
## $ CommunityArea : int 69 24 11 67 35 19 48 40 29 24 ...
## $ Year : int 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
## $ Latitude : num 41.8 41.9 42 41.8 41.8 ...
## $ Longitude : num -87.6 -87.7 -87.8 -87.7 -87.6 ...
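The two counts in the first line of the str output can also be queried directly. A minimal sketch on a small made-up data frame (`toy` and its values are hypothetical, not the course data):

```r
# Toy data frame standing in for mvt (values are made up for illustration)
toy <- data.frame(ID = c(1L, 2L, 3L), Arrest = c(FALSE, TRUE, FALSE))

nrow(toy)  # number of observations (rows)
ncol(toy)  # number of variables (columns)
dim(toy)   # both at once: rows first, then columns
```

Calling the same three functions on mvt should agree with the "obs. of ... variables" line that str prints.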
Using the “max” function, what is the maximum value of the variable “ID”?
max(mvt$ID)
## [1] 9181151
What is the minimum value of the variable “Beat”?
min(mvt$Beat)
## [1] 111
How many observations have value TRUE in the Arrest variable (this is the number of crimes for which an arrest was made)?
sum(mvt$Arrest)
## [1] 15536
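sum works here because R treats TRUE as 1 and FALSE as 0 when doing arithmetic on a logical vector. A small sketch with a made-up vector (`arrest_flag` is hypothetical):

```r
# Made-up logical vector standing in for mvt$Arrest
arrest_flag <- c(FALSE, TRUE, TRUE, FALSE, TRUE)

n_arrests <- sum(arrest_flag)       # TRUE counts as 1, FALSE as 0
share_arrests <- mean(arrest_flag)  # proportion of TRUE values
counts <- table(arrest_flag)        # FALSE and TRUE counts side by side
```

If the column contained missing values, sum would return NA unless na.rm = TRUE is passed.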
How many observations have a LocationDescription value of ALLEY?
sum(mvt$LocationDescription == "ALLEY")
## [1] 2308
In what format are the entries in the variable Date?
head(mvt$Date)
## [1] 12/31/12 23:15 12/31/12 22:00 12/31/12 22:00 12/31/12 22:00
## [5] 12/31/12 21:30 12/31/12 20:30
## 131680 Levels: 1/1/01 0:01 1/1/01 0:05 1/1/01 0:30 1/1/01 1:17 ... 9/9/12 9:50
What is the month and year of the median date in our dataset? Enter your answer as “Month Year”, without the quotes. (Ex: if the answer was 2008-03-28, you would give the answer “March 2008”, without the quotes.)
DateConvert <- as.Date(strptime(mvt$Date, "%m/%d/%y %H:%M"))
summary(DateConvert)
## Min. 1st Qu. Median Mean 3rd Qu.
## "2001-01-01" "2003-07-10" "2006-05-21" "2006-08-23" "2009-10-24"
## Max.
## "2012-12-31"
mvt$Month <- months(DateConvert)
mvt$Weekday <- weekdays(DateConvert)
mvt$Date <- DateConvert
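The conversion depends on the format string matching the text layout: %m/%d/%y reads month, day and a two-digit year, and %H:%M reads the time. A minimal sketch on three made-up timestamps (`raw`, `d`, `med` and `label` are hypothetical names), which also shows one way to get the "Month Year" form directly:

```r
# Made-up timestamps in the same "month/day/2-digit-year hour:minute" layout
raw <- c("12/31/12 23:15", "1/1/01 0:01", "5/21/06 12:30")

# strptime parses the text; as.Date then drops the time of day
d <- as.Date(strptime(raw, "%m/%d/%y %H:%M"))

med <- median(d)                # middle date of the three
label <- format(med, "%B %Y")   # full month name and 4-digit year
```

Note that %B, like months() and weekdays(), returns names in the current locale, so the labels are English only in an English locale.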
In which month did the fewest motor vehicle thefts occur?
(t <- table(mvt$Month))
##
## April August December February January July June
## 15280 16572 16426 13511 16047 16801 16002
## March May November October September
## 15758 16035 16063 17086 16060
which.min(t)
## February
## 4
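The number printed under the month name by which.min is the position of that entry in the table, not a count; February is 4th because table sorts character labels alphabetically. A sketch with made-up counts (`m` is hypothetical and assumes English month names) that puts the table in calendar order instead:

```r
# Made-up month labels standing in for mvt$Month (counts are invented)
m <- rep(month.name, times = c(5, 2, 4, 3, 6, 3, 4, 5, 3, 7, 4, 5))

# Fix the level order to the calendar order (month.name is built into R)
m_ordered <- factor(m, levels = month.name)
counts <- table(m_ordered)

counts             # January, February, March, ... in calendar order
which.min(counts)  # name of the smallest entry and its position in the table
```

With calendar-ordered levels, the position reported for February is 2 rather than 4.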
On which weekday did the most motor vehicle thefts occur?
(t <- table(mvt$Weekday))
##
## Friday Monday Saturday Sunday Thursday Tuesday Wednesday
## 29284 27397 27118 26316 27319 26791 27416
which.max(t)
## Friday
## 1
Each observation in the dataset represents a motor vehicle theft, and the Arrest variable indicates whether an arrest was later made for this theft. Which month has the largest number of motor vehicle thefts for which an arrest was made?
arrest <- subset(mvt, mvt$Arrest == TRUE)
(t <- table(arrest$Month))
##
## April August December February January July June
## 1252 1329 1397 1238 1435 1324 1230
## March May November October September
## 1298 1187 1256 1342 1248
which.max(t)
## January
## 5
First, let’s make a histogram of the variable Date. We’ll add an extra argument to specify the number of bars we want in our histogram. In your R console, type:
hist(mvt$Date, breaks=100)
Does it look like there were more crimes for which arrests were made in the first half of the time period or the second half of the time period? (Note that the time period is from 2001 to 2012, so the middle of the time period is the beginning of 2007.)
boxplot(mvt$Date ~ mvt$Arrest)
For what proportion of motor vehicle thefts in 2001 was an arrest made?
mvt$Year <- as.numeric(format(mvt$Date, "%Y"))
mvt2001 <- subset(mvt, mvt$Year == 2001)
sum(mvt2001$Arrest) / nrow(mvt2001)
## [1] 0.1041173
For what proportion of motor vehicle thefts in 2007 was an arrest made?
mvt2007 <- subset(mvt, mvt$Year == 2007)
sum(mvt2007$Arrest) / nrow(mvt2007)
## [1] 0.08487395
For what proportion of motor vehicle thefts in 2012 was an arrest made?
mvt2012 <- subset(mvt, mvt$Year == 2012)
sum(mvt2012$Arrest) / nrow(mvt2012)
## [1] 0.03902924
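The three year-by-year calculations repeat the same pattern, so they can be collapsed into one call. A minimal sketch using tapply on made-up data (`toy` and `rate_by_year` are hypothetical; this is one possible shortcut, not the approach used above):

```r
# Made-up data standing in for the Year and Arrest columns of mvt
toy <- data.frame(
  Year   = c(2001, 2001, 2001, 2001, 2007, 2007, 2012, 2012, 2012, 2012),
  Arrest = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE)
)

# mean() of a logical vector is the share of TRUE values;
# tapply applies it separately within each year
rate_by_year <- tapply(toy$Arrest, toy$Year, mean)
rate_by_year
```

On the real data, the entries for 2001, 2007 and 2012 should match the three proportions computed with subset.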
Which locations are the top five locations for motor vehicle thefts, excluding the “Other” category?
head(sort(table(mvt$LocationDescription), decreasing=TRUE), 6)
##
## STREET PARKING LOT/GARAGE(NON.RESID.)
## 156564 14852
## OTHER ALLEY
## 4573 2308
## GAS STATION DRIVEWAY - RESIDENTIAL
## 2111 1675
Create a subset of your data, only taking observations for which the theft happened in one of these five locations, and call this new data set “Top5”. To do this, you can use the | symbol. In lecture, we used the & symbol to use two criteria to make a subset of the data. To only take observations that have a certain value in one variable or the other, the | character can be used in place of the & symbol. This is also called a logical “or” operation.
Alternatively, you could create five different subsets, and then merge them together into one data frame using rbind.
How many observations are in Top5?
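Top5 <- subset(mvt, LocationDescription == "STREET" | LocationDescription == "PARKING LOT/GARAGE(NON.RESID.)" | LocationDescription == "ALLEY" | LocationDescription == "GAS STATION" | LocationDescription == "DRIVEWAY - RESIDENTIAL")
nrow(Top5)  # number of observations in the five locations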
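The | and rbind approaches described above select the same rows, and %in% is a more compact way to write a long chain of | comparisons. A sketch on made-up data with only two target locations (`toy`, `with_or`, `with_in` and `with_rbind` are hypothetical names):

```r
# Made-up locations standing in for mvt$LocationDescription
toy <- data.frame(
  ID = 1:6,
  LocationDescription = c("STREET", "ALLEY", "OTHER",
                          "STREET", "GAS STATION", "OTHER"),
  stringsAsFactors = TRUE
)

# Option 1: chain the conditions with | (logical "or")
with_or <- subset(toy, LocationDescription == "STREET" |
                       LocationDescription == "ALLEY")

# Option 2: %in% tests membership in a whole vector of values at once
with_in <- subset(toy, LocationDescription %in% c("STREET", "ALLEY"))

# Option 3: build one subset per location and stack them with rbind
with_rbind <- rbind(subset(toy, LocationDescription == "STREET"),
                    subset(toy, LocationDescription == "ALLEY"))

nrow(with_or)  # same row count for all three options
```

The rbind version returns the rows grouped by location rather than in their original order. In all three cases the factor keeps its unused levels after subsetting, so a later table call still lists the excluded locations with a count of 0 unless droplevels is applied.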