We’ll be using the dataset mvt.csv (motor vehicle theft in Chicago) from the Chicago Police Department to visualize crime trends in Chicago and to point out crime hot spots, which can be used to predict and prevent future crime in those areas.

The data has over 190,000 observations of three different variables: the date of the crime and the location of the crime, in terms of latitude and longitude. Below is the structure of the data set.

## 'data.frame':    191641 obs. of  3 variables:
##  $ Date     : chr  "12/31/12 23:15" "12/31/12 22:00" "12/31/12 22:00" "12/31/12 22:00" ...
##  $ Latitude : num  41.8 41.9 42 41.8 41.8 ...
##  $ Longitude: num  -87.6 -87.7 -87.8 -87.7 -87.6 ...

We first convert the Date variable to a format that R will recognize so that we can extract the day of the week and the hour of the day.

mvt$Date <- strptime(mvt$Date, format = "%m/%d/%y %H:%M")
head(mvt)
##                  Date Latitude Longitude
## 1 2012-12-31 23:15:00 41.75628 -87.62164
## 2 2012-12-31 22:00:00 41.89879 -87.66130
## 3 2012-12-31 22:00:00 41.96919 -87.76767
## 4 2012-12-31 22:00:00 41.76933 -87.65773
## 5 2012-12-31 21:30:00 41.83757 -87.62176
## 6 2012-12-31 20:30:00 41.92856 -87.75400
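To see what strptime is doing, here is a minimal sketch on a single made-up timestamp. The tz = "UTC" argument is our own addition, used only so the example behaves the same on any machine; it is not part of the code above.

```r
# Parse one timestamp in the same "month/day/2-digit-year hour:minute" layout.
# tz = "UTC" is an assumption added here for reproducibility.
d <- strptime("12/31/12 23:15", format = "%m/%d/%y %H:%M", tz = "UTC")
class(d)                      # strptime returns a POSIXlt object
format(d, "%Y-%m-%d %H:%M")   # print it back in an unambiguous layout

# A string that does not match the format comes back as NA, not as an error,
# so it is worth checking for NAs after converting a whole column.
bad <- strptime("2012-12-31 23:15", format = "%m/%d/%y %H:%M", tz = "UTC")
is.na(bad)
```

On the real data, something like sum(is.na(mvt$Date)) after the conversion would tell us whether any rows failed to parse.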

In this format, we can now extract the hour and the day of the week from the Date variable and add them as new variables to our data frame. We start with the day of the week.

mvt$Weekday <- weekdays(mvt$Date)
head(mvt)
##                  Date Latitude Longitude Weekday
## 1 2012-12-31 23:15:00 41.75628 -87.62164  Monday
## 2 2012-12-31 22:00:00 41.89879 -87.66130  Monday
## 3 2012-12-31 22:00:00 41.96919 -87.76767  Monday
## 4 2012-12-31 22:00:00 41.76933 -87.65773  Monday
## 5 2012-12-31 21:30:00 41.83757 -87.62176  Monday
## 6 2012-12-31 20:30:00 41.92856 -87.75400  Monday

Next, we add the hour variable to the dataset. The hour component only exists because we converted the Date variable with strptime.

mvt$Hour <- mvt$Date$hour
head(mvt)
##                  Date Latitude Longitude Weekday Hour
## 1 2012-12-31 23:15:00 41.75628 -87.62164  Monday   23
## 2 2012-12-31 22:00:00 41.89879 -87.66130  Monday   22
## 3 2012-12-31 22:00:00 41.96919 -87.76767  Monday   22
## 4 2012-12-31 22:00:00 41.76933 -87.65773  Monday   22
## 5 2012-12-31 21:30:00 41.83757 -87.62176  Monday   21
## 6 2012-12-31 20:30:00 41.92856 -87.75400  Monday   20
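Both new variables come from the list-like object that strptime returns. The sketch below, on two made-up timestamps, shows the components involved. One thing to keep in mind is that weekdays() returns names in the language of the current locale, while the numeric wday component (0 = Sunday) does not depend on it. The tz = "UTC" argument is again our own assumption.

```r
# Two illustrative timestamps: 2012-12-31 was a Monday, 2012-01-01 a Sunday.
d <- strptime(c("12/31/12 23:15", "01/01/12 00:05"),
              format = "%m/%d/%y %H:%M", tz = "UTC")

d$hour        # hour of the day, 0 to 23
d$wday        # day of the week as a number, 0 = Sunday
weekdays(d)   # day names; the language depends on the locale
```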

Now we have two more variables: Weekday, which gives the day of the week, and Hour, which gives the hour of the day. Our dataset is now ready for visualization.

Line Plots

Let’s start by creating a line plot with just one line and a value for every day of the week. The value we want to plot is the total number of crimes on each day of the week. We can get this information by creating a table of the Weekday variable, which counts the crimes on each day of the week. We save this table as a data frame so that we can pass it to ggplot as our data.

weekdayCounts <- as.data.frame(table(mvt$Weekday))
str(weekdayCounts)
## 'data.frame':    7 obs. of  2 variables:
##  $ Var1: Factor w/ 7 levels "Friday","Monday",..: 1 2 3 4 5 6 7
##  $ Freq: int  29284 27397 27118 26316 27319 26791 27416

We can see that our data frame has seven observations, one for each day of the week, and two different variables. The first variable, called Var1, gives the name of the day of the week, and the second variable, called Freq, for frequency, gives the total amount of crime on that day of the week. Now, we’re ready to make our plot.
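The table-to-data-frame step can be tried on a tiny example first. This is a sketch with a hypothetical toy data frame (the values are invented), not the Chicago data.

```r
# Toy stand-in for mvt with only a Weekday column (made-up values).
toy <- data.frame(Weekday = c("Monday", "Friday", "Monday",
                              "Sunday", "Friday", "Monday"))

# table() counts each distinct value; as.data.frame() turns the counts
# into one row per value, with columns Var1 and Freq.
counts <- as.data.frame(table(toy$Weekday))
counts
```

The rows come out in alphabetical order (Friday, Monday, Sunday), which is the same behavior we will see in the plot below.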

library(colorspace)
library(ggplot2)
ggplot(weekdayCounts, aes(x = Var1, y = Freq))+ geom_line(aes(group = 1)) # group = 1 just groups all of our data into one line, since we want one line in our plot.

We can see that this is very close to the plot we want. We have the total number of crimes plotted by day of the week, but our days of the week are out of order. We have Friday first, then Monday, then Saturday, then Sunday, etc. ggplot put the days of the week in alphabetical order, because that is the order of the factor levels in Var1. We actually want the days in chronological order to make this plot easier to read. We can do this by making the Var1 variable an ordered factor variable with the levels listed in the order we want. This signals to ggplot that the ordering is meaningful.

weekdayCounts$Var1 <- factor(weekdayCounts$Var1, ordered = TRUE,
    levels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))

ggplot(weekdayCounts, aes(x = Var1, y = Freq))+ geom_line(aes(group = 1), linetype = 2) + xlab("Day of the Week") + ylab("Total Motor Vehicle Theft")    
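The effect of the levels argument is easy to check without plotting. A minimal sketch on a made-up vector of three days:

```r
# By default, factor levels are sorted alphabetically.
days <- factor(c("Friday", "Monday", "Sunday"))
levels(days)

# Listing the levels explicitly fixes the order; ordered = TRUE also
# makes comparisons and sorting follow that order.
week <- c("Sunday", "Monday", "Tuesday", "Wednesday",
          "Thursday", "Friday", "Saturday")
ordered_days <- factor(days, ordered = TRUE, levels = week)
sort(ordered_days)
```

ggplot places discrete values along the axis in level order, which is why redefining the levels is enough to rearrange the plot.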

We can add the hour of the day to our line plot. We can do this by creating a line for each day of the week and making the x-axis the hour of the day. We first need to create a counts table for the weekday and the hour. The table gives, for each day of the week and each hour, the total number of motor vehicle thefts that occurred. Let’s save this table to a data frame so that we can use it in our visualizations.

DayHourCounts <- as.data.frame(table(mvt$Weekday, mvt$Hour))
str(DayHourCounts)
## 'data.frame':    168 obs. of  3 variables:
##  $ Var1: Factor w/ 7 levels "Friday","Monday",..: 1 2 3 4 5 6 7 1 2 3 ...
##  $ Var2: Factor w/ 24 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 2 2 ...
##  $ Freq: int  1873 1900 2050 2028 1856 1691 1814 932 825 1267 ...
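A two-way table always contains every combination of its two variables, which is why we get 7 x 24 = 168 rows here. A small sketch with a hypothetical toy data frame (invented values) shows that combinations which never occur still appear, with a count of zero:

```r
# Toy data: two days and two distinct hours (made-up values).
toy <- data.frame(Weekday = c("Monday", "Monday", "Sunday"),
                  Hour    = c(0L, 23L, 0L))

# One row per (day, hour) pair: 2 days x 2 hours = 4 rows.
counts <- as.data.frame(table(toy$Weekday, toy$Hour))
counts
```

Here Sunday at hour 23 never occurs in the toy data, but it still gets a row with Freq equal to 0.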

We now have 168 observations, one for each pair of day of the week and hour, and three different variables. The first variable, Var1, gives the day of the week. The second variable, Var2, gives the hour of the day. The third variable, Freq for frequency, gives the total crime count. Var2 is stored as a factor, so let’s convert it to actual numbers and call it Hour, since this is the hour of the day and it makes sense for it to be numerical.

DayHourCounts$Hour <- as.numeric(as.character(DayHourCounts$Var2))
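The detour through as.character matters. Calling as.numeric directly on a factor returns the internal level codes, not the labels. A short sketch on made-up hours:

```r
# A factor built from three hours; its levels are "0", "20", "23".
f <- factor(c(20, 0, 23))

as.numeric(f)                 # level codes (positions among the levels)
as.numeric(as.character(f))   # the actual hours
```

The first call gives 2, 1, 3, which are positions in the level list, while the second gives back 20, 0, 23.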

Now we’re ready to create our plot. We just need to change the group to Var1, which is the day of the week.

ggplot(DayHourCounts, aes(x = Hour, y = Freq)) + geom_line(aes(group = Var1, color = Var1), size = 2) # size is a constant, so it goes outside aes()

Now in our plot, each line is colored according to the day of the week. This helps us see that on Saturday and Sunday, for example (the green and the teal lines), there are fewer motor vehicle thefts in the morning. While we can get some information from this plot, it’s still quite hard to interpret. Seven lines is a lot. Let’s instead try to visualize the same information with a heat map.

A Heatmap

To make a heat map, we’ll use our data in our data frame DayHourCounts. First, though, we need to fix the order of the days so that they’ll show up in chronological order instead of in alphabetical order.

DayHourCounts$Var1 <- factor(DayHourCounts$Var1, ordered = TRUE,
    levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))

Now we make the heat map.

library(plotly)
hmap <- ggplot(DayHourCounts, aes(x = Hour, y = Var1))+ geom_tile(aes(fill = Freq))+ scale_fill_gradient(name = "Total MV Thefts",low= "white",high = "red")+theme(axis.title.y = element_blank())
ggplotly(hmap)

So how do we read this? For each hour and each day of the week, we have a rectangle in our heat map. The color of that rectangle indicates the frequency, or the number of crimes that occur in that hour and on that day. Our legend tells us that darker (redder) colors correspond to more crime. So we can see that a lot of crime happens around midnight, particularly on the weekends.
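One way to double-check what the colors suggest is to sort the counts directly. The sketch below uses a hypothetical toy data frame in the same shape as DayHourCounts; the numbers are invented and are not the Chicago counts.

```r
# Toy data in the same shape as DayHourCounts (made-up numbers).
toy <- data.frame(
  Var1 = c("Friday", "Saturday", "Sunday", "Monday"),
  Hour = c(23, 0, 0, 5),
  Freq = c(90, 120, 110, 15)
)

# Sort from the busiest (day, hour) cell to the quietest.
ranked <- toy[order(toy$Freq, decreasing = TRUE), ]
head(ranked, 2)
```

Applying the same two lines to DayHourCounts would list the day and hour combinations behind the darkest rectangles.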

A Geographical Hot Spot Map

We’ll now move on and plot crime on a map of Chicago. First, we need to install and load the ggmap package; the code below installs it from GitHub and then downloads a map for a bounding box around the city.

devtools::install_github("dkahle/ggmap", ref = "tidyup")
library(ggmap)
chicago <- get_stamenmap(bbox = c(left = -88.0225, bottom = 41.5949, 
                                  right = -87.2713, top = 42.0677), 
                         zoom = 11)
ggmap(chicago)

Now let’s plot the first 100 motor vehicle thefts in our data set on this map.

ggmap(chicago) + geom_point(data = mvt[1:100,], aes(x = Longitude, y = Latitude))
## Warning: Removed 4 rows containing missing values (geom_point).
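ggplot2 emits this kind of warning when some points cannot be drawn, for instance because a coordinate is NA or falls outside the plotted limits. We have not checked which of these applies to the four rows here. The sketch below, on made-up points, shows one way to count such rows using the same bounding box as the map.

```r
# Bounding box copied from the get_stamenmap() call above.
bbox <- c(left = -88.0225, bottom = 41.5949, right = -87.2713, top = 42.0677)

# Three illustrative points: one valid, one with a missing longitude,
# and one west of the bounding box (made-up values).
pts <- data.frame(Longitude = c(-87.62, NA, -89.00),
                  Latitude  = c(41.76, 41.90, 41.80))

# A point is drawable if both coordinates are present and inside the box.
inside <- !is.na(pts$Longitude) & !is.na(pts$Latitude) &
  pts$Longitude >= bbox[["left"]]   & pts$Longitude <= bbox[["right"]] &
  pts$Latitude  >= bbox[["bottom"]] & pts$Latitude  <= bbox[["top"]]

sum(!inside)   # number of rows that would be dropped
```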

If we plotted all 190,000 motor vehicle thefts, we would just see a big black box, which wouldn’t be helpful at all. We’re more interested in whether or not an area has a high amount of crime, so let’s round our latitude and longitude to two decimal places and create a crime counts data frame for each area.

LatLonCounts <- as.data.frame(table(round(mvt$Longitude,2),round(mvt$Latitude,2)))
str(LatLonCounts)
## 'data.frame':    1638 obs. of  3 variables:
##  $ Var1: Factor w/ 42 levels "-87.93","-87.92",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Var2: Factor w/ 39 levels "41.64","41.65",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Freq: int  0 0 0 0 0 0 0 0 0 0 ...

The new data frame gives us the total crimes at every point on a grid. We have 1,638 observations (42 longitude values times 39 latitude values) and three variables. The first two variables, Var1 and Var2, are the longitude and latitude coordinates, in that order, and the third variable is the number of motor vehicle thefts that occur in that area. Let’s convert our longitude and latitude variables to numbers and call them Long and Lat.

LatLonCounts$Long <- as.numeric(as.character(LatLonCounts$Var1))
LatLonCounts$Lat <- as.numeric(as.character(LatLonCounts$Var2))
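The rounding and counting can be checked on a handful of points. This is a sketch with a hypothetical toy data frame (made-up coordinates). The final subset step is our own addition, not part of the code above; it shows how the empty grid cells could be dropped if they get in the way.

```r
# Three made-up thefts; the first two fall into the same rounded cell.
toy <- data.frame(Longitude = c(-87.6216, -87.6184, -87.7012),
                  Latitude  = c(41.7563, 41.7601, 41.9004))

# Count thefts per (rounded longitude, rounded latitude) cell.
cells <- as.data.frame(table(round(toy$Longitude, 2), round(toy$Latitude, 2)))
cells$Long <- as.numeric(as.character(cells$Var1))
cells$Lat  <- as.numeric(as.character(cells$Var2))
nrow(cells)   # 2 longitudes x 2 latitudes = 4 cells, some of them empty

# Optional: keep only the cells where at least one theft occurred.
occupied <- subset(cells, Freq > 0)
occupied
```

Because the table covers the full grid, many cells in LatLonCounts have a count of zero, as the first rows of the str output above already show.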

Now, let’s plot these points on our map, making the size and color of the points depend on the total number of motor vehicle thefts.

ggmap(chicago) + geom_point(data = LatLonCounts, aes(x = Long, y = Lat, color = Freq, size = Freq)) +
    scale_color_gradient(low="yellow",high="red")

The plot shows a point for every area defined by our rounded latitude and longitude values. The areas with more crime are closer to red and the areas with less crime are closer to yellow. This helps us see where in Chicago more crimes occur.

We can also overlay a heatmap on the Chicago map to locate the hot spots.

ggmap(chicago) + geom_tile(data = LatLonCounts, aes(x = Long, y = Lat, alpha = Freq), fill = "red") + ggtitle("Chicago Hot Spots")