Data Preparation

The dataset is first loaded into R and the structure of the dataset is observed. This is the crime dataset released by the Baltimore police department. It seems to be a dataset with a lot of categorical variables and very few numerical variables. There are variables related to location which might be useful to visualize reported incidents on a map.

crimes <- read.csv("BPD_Part_1_Victim_Based_Crime_Data.csv")
str(crimes)
## 'data.frame':    276529 obs. of  15 variables:
##  $ CrimeDate      : Factor w/ 2072 levels "01/01/2012","01/01/2013",..: 1472 1472 1472 1472 1472 1472 1472 1472 1472 1472 ...
##  $ CrimeTime      : Factor w/ 3017 levels "00:00:00","00:01:00",..: 2950 2880 2865 2857 2812 2748 2642 2691 2610 2610 ...
##  $ CrimeCode      : Factor w/ 81 levels "1F","1K","1O",..: 27 62 81 41 43 45 1 12 41 43 ...
##  $ Location       : Factor w/ 26009 levels "","& E BIDDLE ST",..: 15896 22874 20665 5391 967 24218 12556 15165 7697 2698 ...
##  $ Description    : Factor w/ 15 levels "AGG. ASSAULT",..: 13 4 15 1 6 5 7 14 1 6 ...
##  $ Inside.Outside : Factor w/ 5 levels "","I","Inside",..: 2 4 5 2 4 2 5 4 4 2 ...
##  $ Weapon         : Factor w/ 5 levels "","FIREARM","HANDS",..: 4 1 2 5 3 1 2 1 5 3 ...
##  $ Post           : int  913 133 524 934 113 922 232 123 641 332 ...
##  $ District       : Factor w/ 10 levels "","CENTRAL","EASTERN",..: 8 2 5 8 2 8 7 2 6 3 ...
##  $ Neighborhood   : Factor w/ 279 levels "","Abell","Allendale",..: 28 216 270 39 65 47 33 251 269 17 ...
##  $ Longitude      : num  -76.6 -76.6 -76.6 -76.6 -76.6 ...
##  $ Latitude       : num  39.2 39.3 39.3 39.3 39.3 ...
##  $ Location.1     : Factor w/ 95498 levels "","(39.2004100000, -76.5560200000)",..: 930 53781 81981 15697 22055 3586 14103 42341 53910 45269 ...
##  $ Premise        : Factor w/ 124 levels "","Alley","ALLEY",..: 97 108 107 97 108 97 107 108 108 97 ...
##  $ Total.Incidents: int  1 1 1 1 1 1 1 1 1 1 ...

This dataset needs to ber properly cleaned - we need to perform missing value treatment and create new time-related variables. The following operations were performed:

  • CrimeDate variable is formatted in the proper date format.
  • Variables such as Year, Month, Day, Week, and Weekday are extracted from the date, using the lubridate package.
  • The Month Week is also extracted using a custom function
  • Null values for variables such as Inside.Outside, Location, District, and Weapon are set to NA
  • Month names and Hour and District variables are factorized
  • Categorical variables are formed for different times of the day.
  • The non-null values in Inside.Outside are exapanded to Inside/Outside
  • Using the variable Description, a new variable called Crime Type is created from Descriptions which contain one of the below descriptions. A new variable called Crime Score is created with values corresponding to the severity of each crime type:
    • Robbery - 7
    • Assault - 8
    • Larceny - 4
    • Arson - 5
    • Rape - 9
    • Shooting - 6
    • Auto Theft - 6
    • Homicide - 10
    • Burglary - 6
  • For Rape Crime Group, the missing values for Weapon are renamed to ‘Other’
#Loading packages
library("lubridate")

#Formatting the date
crimes$CrimeDate <- as.Date(crimes$CrimeDate, format="%m/%d/%Y")

#Creating various time variables
crimes$Year <- as.numeric(format(crimes$CrimeDate,"%Y"))
crimes$Month <- as.numeric(format(crimes$CrimeDate,"%m"))
crimes$Day <- as.numeric(format(crimes$CrimeDate,"%d"))
crimes$Hour_Int<- as.integer(substr(crimes$CrimeTime,0,2))
crimes$Hour_Factor <- as.numeric(hour(hms(as.character(factor(crimes$CrimeTime)))))
crimes$Weekday <- wday(crimes$CrimeDate, label=TRUE)

library("data.table")
library("dplyr")
library("wordcloud")
library("ggplot2")
library("scales")
library("zoo")

#Create Month Week
crimes$yearmonth <- as.yearmon(crimes$CrimeDate)
crimes$yearmonthf <- factor(crimes$yearmonth)

monthweeks <- function(x) {
  UseMethod("monthweeks")
}
monthweeks.Date <- function(x) {
  ceiling(as.numeric(format(x, "%d")) / 7)
}
monthweeks.POSIXlt <- function(x) {
  ceiling(as.numeric(format(x, "%d")) / 7)
}
monthweeks.character <- function(x) {
  ceiling(as.numeric(format(as.Date(x), "%d")) / 7)
}
crimes$MonthWeeks <- monthweeks(crimes$CrimeDate)

#Creating month names
crimes$Month_names<- as.character(substr(crimes$yearmonth,0,3))
crimes$Month_names <- factor(crimes$Month_names,levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun","Jul","Aug","Sep","Oct","Nov","Dec"))

#Creating a new variable for time of the day
crimes$TimeOfDay[crimes$Hour_Int == 0] <-  '(0,6]'
crimes$TimeOfDay <- cut(crimes$Hour_Int, breaks=c(0,6,12,18,24), right=TRUE)

#Renaming Inside and Outside and setting null values to NA
crimes$Inside.Outside[crimes$Inside.Outside == "O"] <- "Outside"
crimes$Inside.Outside[crimes$Inside.Outside == "I"] <- "Inside"
crimes$Inside.Outside[crimes$Inside.Outside == ""] <- NA

#Setting null location values to NA
crimes$Location[crimes$Location == ""] <- NA

#Setting null district values to NA
crimes$District[crimes$District == ""] <- NA

#Factorizing the District field and reordering
crimes$District <- factor(crimes$District,levels = c("CENTER", "WESTERN", "NORTHWESTERN", "NORTHERN", "NORTHEASTERN", "EASTERN","SOUTHEASTERN","SOUTHERN","SOUTHWESTERN"))

#Creating a new field CrimeGroup for different types of crimes based on similar types of description
crimes$CrimeGroup[like(crimes$Description,"ROBBERY")] <- "ROBBERY"
crimes$CrimeGroup[like(crimes$Description,"ASSAULT")] <- "ASSAULT"
crimes$CrimeGroup[like(crimes$Description,"LARCENY")] <- "LARCENY"
crimes$CrimeGroup[like(crimes$Description,"ARSON")] <- "ARSON"
crimes$CrimeGroup[like(crimes$Description,"RAPE")] <- "RAPE"
crimes$CrimeGroup[like(crimes$Description,"SHOOTING")] <- "SHOOTING"
crimes$CrimeGroup[like(crimes$Description,"AUTO THEFT")] <- "AUTO THEFT"
crimes$CrimeGroup[like(crimes$Description,"HOMICIDE")] <- "HOMICIDE"
crimes$CrimeGroup[like(crimes$Description,"BURGLARY")] <- "BURGLARY"

#Updating null weapon values for Rape Crime Type to Other and other null values to NA
crimes$Weapon[crimes$Weapon == "" & crimes$CrimeGroup == "RAPE"] <- "OTHER"
crimes$Weapon[crimes$Weapon == ""] <- NA

#Creating a new variable called CrimeScore indicating the seriousness of the crime
crimes$CrimeScore[crimes$CrimeGroup=="ROBBERY"] <- 7
crimes$CrimeScore[crimes$CrimeGroup=="AUTO THEFT"] <- 6
crimes$CrimeScore[crimes$CrimeGroup=="SHOOTING"] <- 6
crimes$CrimeScore[crimes$CrimeGroup=="ASSAULT"] <- 8
crimes$CrimeScore[crimes$CrimeGroup=="BURGLARY"] <- 6
crimes$CrimeScore[crimes$CrimeGroup=="HOMICIDE"] <- 10
crimes$CrimeScore[crimes$CrimeGroup=="LARCENY"] <- 4
crimes$CrimeScore[crimes$CrimeGroup=="ARSON"] <- 5
crimes$CrimeScore[crimes$CrimeGroup=="RAPE"] <- 9
crimes$CrimeScore <- as.factor(crimes$CrimeScore)

#Setting null weapon values to NA
crimes$Weapon[crimes$Weapon == ""] <- NA

Time Series Calendar Heatmap

Let’s first try to visualize our cleaned dataset in a time series calendar heatmap, which shows how a numerical variable of interest has changed over the course of time during which this data was collected. The first numerical variable of interest would naturally be the total number of reported incidents. However, when this was plotted in a time-series calendar heatmap, the results were’t very insightful, due to the high number of reported incidents on specific days. To get an idea of the level of violence (or seriousness of crime), we plot the average daily crime scores based on our earlier definition in the time-series calendar heatmap.

sub1 <- crimes
sub1$CrimeScore <- as.integer(sub1$CrimeScore)
sub1 <- sub1 %>% group_by(Year, Month_names, MonthWeeks, Weekday) %>% summarize(sum = mean(CrimeScore, na.rm = TRUE))

ggplot(sub1, aes(MonthWeeks, Weekday, fill = sum)) + 
  geom_tile(colour = "white") + 
  facet_grid(Year~Month_names) + 
  scale_fill_gradient(low="blue", high="red") +
  labs(x="Week of Month",
       y="",
       title = "Time-Series Calendar Heatmap", 
       subtitle="Mean Crime Scores in Baltimore", 
       fill="#Incidents")

  • The violence associated the various reported crimes seems to have dropped from 2012 to 2014.
  • But it seems to have picked up after that
  • The second half of every year seems to be dominated by less violent crimes
  • There doesn’t seem to be much trend between beginning and end of the month and on different days of the week

Yearly Frequency of Incidents by District

Let’s look at the frequency of incidents by district

library("ggthemes")

ggplot(subset(crimes,!is.na(District)  & Year!=2017))+
  aes(x=Year, color=District)+
  geom_line(stat="count")+
  scale_x_continuous(breaks = seq(2012,2017,1))+
  scale_y_continuous(breaks = seq(5000,50000,5000))+
  labs(title="Frequency of Incidents by Disctrict",x="Districts",y="Number of Incidents") +
  theme_solarized(light = FALSE) +
  scale_colour_solarized("blue")

  • Looks like the number of reported crimes dropped from 2012 to 2014 and then increased
  • The number of incidents in the Northeastern district is higher than any other district

Animated Time-Series Barplot

Let’s make an animated bar plot for number of incidents by type of crime and district and vary it over time

library("animation")
library("gganimate")

crimegroup.animate <- subset(crimes,!is.na(District)) %>%
  group_by(CrimeGroup,District,Year) %>%
  summarise(cnt=n()) %>%
  arrange(desc(cnt))

ani.options(interval = 0.4)
ggplot(crimegroup.animate, mapping = aes(x=District ,y=cnt, fill=CrimeGroup, frame = Year))+
  geom_bar(stat = "Identity", position = "identity") + 
  labs(title = "Number of Incidents: Year {frame_time}" ,x="District",y="Number of Incidents")+
  theme_solarized(light = FALSE) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  transition_time(Year) +
  ease_aes('linear')

  • The number of larcenies have sharply risen in the Northern, Northwestern, Northeastern, and Southeastern districts over time
  • Auto Thefts dropped from 2012 - 2015.

Monthly frequency plot faceted by year

Let’s look at the frequency of reported crimes by month and year to see if there exists seasonality

CrimebyYearMonth <- crimes %>%
  group_by(Year,Month) %>%
  summarize(cnt = n())

CrimebyYearMonth$Year <- factor(CrimebyYearMonth$Year)

ggplot(data=CrimebyYearMonth, aes(x=Month, y=cnt)) + 
  geom_bar(stat="identity",fill = "maroon") +
  labs(title="Frequency of Incidents by Year",x="Month",y="Number of Incidents")+
  theme_solarized(light = FALSE) +
  scale_x_continuous(breaks = seq(0,12,1))+
  facet_wrap(~Year)

  • Looks like there is some seasonality - Lesser crimes are reported in winter than summer

Frequency of crimes by crime type

Let’s again look at which crime types are most prominent, and fill them based on year to see if it has changed drastically across years

CrimeGroupbyYear <- crimes %>%
  group_by(CrimeGroup,Year) %>%
  summarize(cnt = n()) %>%
  transform(CrimeGroup = reorder(CrimeGroup,-cnt))

ggplot(CrimeGroupbyYear, aes(x = reorder(CrimeGroup, cnt), y = cnt, fill = Year)) + 
  geom_bar(stat = 'identity') + 
  coord_flip() +
  labs(x = '', y = 'Total Number of Crimes') +
  annotate("text", x = 3.5, y = 150, label = "Larceny and Assault dominate")

  • Larcenies and Assaults are the most reported crimes, followed by Burglary, Auto Theft, and Robbery
  • There doesn’t seem to be much variation across years

Mean of Crime Score by District over time

Let’s see how the mean crime score has changed over time in each of the districts.

crimes.District.CrimeScore <- crimes[!is.na(crimes$District),] %>% 
  group_by(District, Year) %>% 
  summarise(mean_CrimeScore = mean(as.integer(CrimeScore)))

colnames(crimes.District.CrimeScore)[1] <- "District"

ggplot(data = crimes.District.CrimeScore, aes(x = Year, y = mean_CrimeScore, color = District))+
  geom_line() + 
  labs(title = "Mean of Crime Score by District\n", x = "Year", y = "Average Crime Score", color = "District") +
  theme_solarized(light = FALSE)

  • At an overall level, as previously discussed, the violence involved in the crimes committed seems to have declined from 2012 to 2014 and picked up again after that.
  • 2017 doesn’t have complete data. However, the mean crime score seems to have risen a lot

Crimes Scores v/s Place of Occurence (Inside/Outside)

Let’s see if being inside or outside has any effect on the seriousness of the crime committed.

ggplot(data =crimes[!is.na(crimes$Inside.Outside),] , aes(x = CrimeScore, fill = Inside.Outside))+
  geom_bar(position = 'dodge') +
  labs(title = "Crime Score by Inside Outside", x = "Crime Score", y = "Number", color = "Inside or Outside") +
  theme_solarized(light = FALSE)

  • Most larcenies occur inside
  • The number of reported arsons, rapes, and homicides were far fewer than the other reported crimes
  • Robberies usually occur outside, while assualts usually occur outside
  • Looks like the seriousness of the crime is not very dependent on the place of occurence

Hourly Frequency of reported crimes by Day of the week

Let’s now try to take a look at how the number of crimes varies by time of the day and the day of the week. Naturally, we would assume that more crimes happen in the evening - let’s see if this is actually true.

ggplot(subset(crimes))+
  aes(x=Hour_Factor, colour= Weekday)+
  geom_line(stat="count")+
  scale_x_continuous(breaks = seq(0,23,1),limit=c(0,23))+
  scale_y_continuous(breaks = seq(500,3000,250),limit=c(500,2500))+
  labs(title="Hours of Day",x="Hour",y="Number of Incidents") +
  theme_solarized(light = FALSE)

* Clearly, the reported crimes are more in the evening * The number of reported crimes at 11.30 AM is abruptly high. It drops till about 2 PM when it picks back up on its upward trend * Weekends (Saturdays and Sundays) experience higher reported crimes starting 12 midnight till ealry morning. However, after 6 AM the number of reported crimes is lower than the weekdays.

Map of Baltimore

It is always in everyone’s interest to know which areas are unsafe for visiting by plotting the reported incidents on a map. The below map of Baltimore has reported crimes plotted using red circles. Areas with high rates of crime will appear pre-dominantly red.

library(ggmap)

#plot the  hybrid Google Maps basemap
map <- qmap('Baltimore', zoom = 13, maptype = 'hybrid')
#plot the crime points on top
map + geom_point(data = crimes, aes(x = Longitude, y = Latitude), color="red", size=1, alpha=0.01)

As can be seen from the map, majority of the crimes take place in and around Downtown Baltimore. There are relatively higher rates of reported crimes in the highways leading to the city. South Baltimore looks like a relatively peaceful area.

Word Cloud

Let’s conclude with one final visualization highlighting the most dangerous neighborhoods in Baltimore in terms of reported crimes. As can be validated from the map above, the most dangerous neighborhoods in Baltimore are the following:

  • Downtown
  • Frankford
  • Belair-Edison
  • Brooklyn
  • Cherry Hill

Concusion

These are some of the main things that I learned through this VDE in order of importance are the followings:

  • Not everyhting works easily, but if adjusting different values makes things happen
  • Some things have to be given up on due to time constraints - got stuck with Sankey diagram, Tree maps, and Violins plots, until I gave up
  • Made Word Cloud and animated bar plot and used Google Maps for plotting on R for the first time
  • Figured how to install the gganimate package (not released to CRAN yet) from github

References

  • Wickham - 2010 - A Layered Grammar of Graphics
  • Shneiderman - 1996 - The eyes have it a task by data type taxonomy for informatio nvisualizations File