The dataset is first loaded into R and the structure of the dataset is observed. This is the crime dataset released by the Baltimore police department. It seems to be a dataset with a lot of categorical variables and very few numerical variables. There are variables related to location which might be useful to visualize reported incidents on a map.
crimes <- read.csv("BPD_Part_1_Victim_Based_Crime_Data.csv")str(crimes)## 'data.frame': 276529 obs. of 15 variables:
## $ CrimeDate : Factor w/ 2072 levels "01/01/2012","01/01/2013",..: 1472 1472 1472 1472 1472 1472 1472 1472 1472 1472 ...
## $ CrimeTime : Factor w/ 3017 levels "00:00:00","00:01:00",..: 2950 2880 2865 2857 2812 2748 2642 2691 2610 2610 ...
## $ CrimeCode : Factor w/ 81 levels "1F","1K","1O",..: 27 62 81 41 43 45 1 12 41 43 ...
## $ Location : Factor w/ 26009 levels "","& E BIDDLE ST",..: 15896 22874 20665 5391 967 24218 12556 15165 7697 2698 ...
## $ Description : Factor w/ 15 levels "AGG. ASSAULT",..: 13 4 15 1 6 5 7 14 1 6 ...
## $ Inside.Outside : Factor w/ 5 levels "","I","Inside",..: 2 4 5 2 4 2 5 4 4 2 ...
## $ Weapon : Factor w/ 5 levels "","FIREARM","HANDS",..: 4 1 2 5 3 1 2 1 5 3 ...
## $ Post : int 913 133 524 934 113 922 232 123 641 332 ...
## $ District : Factor w/ 10 levels "","CENTRAL","EASTERN",..: 8 2 5 8 2 8 7 2 6 3 ...
## $ Neighborhood : Factor w/ 279 levels "","Abell","Allendale",..: 28 216 270 39 65 47 33 251 269 17 ...
## $ Longitude : num -76.6 -76.6 -76.6 -76.6 -76.6 ...
## $ Latitude : num 39.2 39.3 39.3 39.3 39.3 ...
## $ Location.1 : Factor w/ 95498 levels "","(39.2004100000, -76.5560200000)",..: 930 53781 81981 15697 22055 3586 14103 42341 53910 45269 ...
## $ Premise : Factor w/ 124 levels "","Alley","ALLEY",..: 97 108 107 97 108 97 107 108 108 97 ...
## $ Total.Incidents: int 1 1 1 1 1 1 1 1 1 1 ...
This dataset needs to ber properly cleaned - we need to perform missing value treatment and create new time-related variables. The following operations were performed:
#Loading packages
library("lubridate")
#Formatting the date
crimes$CrimeDate <- as.Date(crimes$CrimeDate, format="%m/%d/%Y")
#Creating various time variables
crimes$Year <- as.numeric(format(crimes$CrimeDate,"%Y"))
crimes$Month <- as.numeric(format(crimes$CrimeDate,"%m"))
crimes$Day <- as.numeric(format(crimes$CrimeDate,"%d"))
crimes$Hour_Int<- as.integer(substr(crimes$CrimeTime,0,2))
crimes$Hour_Factor <- as.numeric(hour(hms(as.character(factor(crimes$CrimeTime)))))
crimes$Weekday <- wday(crimes$CrimeDate, label=TRUE)
library("data.table")
library("dplyr")
library("wordcloud")
library("ggplot2")
library("scales")
library("zoo")
#Create Month Week
crimes$yearmonth <- as.yearmon(crimes$CrimeDate)
crimes$yearmonthf <- factor(crimes$yearmonth)
monthweeks <- function(x) {
UseMethod("monthweeks")
}
monthweeks.Date <- function(x) {
ceiling(as.numeric(format(x, "%d")) / 7)
}
monthweeks.POSIXlt <- function(x) {
ceiling(as.numeric(format(x, "%d")) / 7)
}
monthweeks.character <- function(x) {
ceiling(as.numeric(format(as.Date(x), "%d")) / 7)
}
crimes$MonthWeeks <- monthweeks(crimes$CrimeDate)
#Creating month names
crimes$Month_names<- as.character(substr(crimes$yearmonth,0,3))
crimes$Month_names <- factor(crimes$Month_names,levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun","Jul","Aug","Sep","Oct","Nov","Dec"))
#Creating a new variable for time of the day
crimes$TimeOfDay[crimes$Hour_Int == 0] <- '(0,6]'
crimes$TimeOfDay <- cut(crimes$Hour_Int, breaks=c(0,6,12,18,24), right=TRUE)
#Renaming Inside and Outside and setting null values to NA
crimes$Inside.Outside[crimes$Inside.Outside == "O"] <- "Outside"
crimes$Inside.Outside[crimes$Inside.Outside == "I"] <- "Inside"
crimes$Inside.Outside[crimes$Inside.Outside == ""] <- NA
#Setting null location values to NA
crimes$Location[crimes$Location == ""] <- NA
#Setting null district values to NA
crimes$District[crimes$District == ""] <- NA
#Factorizing the District field and reordering
crimes$District <- factor(crimes$District,levels = c("CENTER", "WESTERN", "NORTHWESTERN", "NORTHERN", "NORTHEASTERN", "EASTERN","SOUTHEASTERN","SOUTHERN","SOUTHWESTERN"))
#Creating a new field CrimeGroup for different types of crimes based on similar types of description
crimes$CrimeGroup[like(crimes$Description,"ROBBERY")] <- "ROBBERY"
crimes$CrimeGroup[like(crimes$Description,"ASSAULT")] <- "ASSAULT"
crimes$CrimeGroup[like(crimes$Description,"LARCENY")] <- "LARCENY"
crimes$CrimeGroup[like(crimes$Description,"ARSON")] <- "ARSON"
crimes$CrimeGroup[like(crimes$Description,"RAPE")] <- "RAPE"
crimes$CrimeGroup[like(crimes$Description,"SHOOTING")] <- "SHOOTING"
crimes$CrimeGroup[like(crimes$Description,"AUTO THEFT")] <- "AUTO THEFT"
crimes$CrimeGroup[like(crimes$Description,"HOMICIDE")] <- "HOMICIDE"
crimes$CrimeGroup[like(crimes$Description,"BURGLARY")] <- "BURGLARY"
#Updating null weapon values for Rape Crime Type to Other and other null values to NA
crimes$Weapon[crimes$Weapon == "" & crimes$CrimeGroup == "RAPE"] <- "OTHER"
crimes$Weapon[crimes$Weapon == ""] <- NA
#Creating a new variable called CrimeScore indicating the seriousness of the crime
crimes$CrimeScore[crimes$CrimeGroup=="ROBBERY"] <- 7
crimes$CrimeScore[crimes$CrimeGroup=="AUTO THEFT"] <- 6
crimes$CrimeScore[crimes$CrimeGroup=="SHOOTING"] <- 6
crimes$CrimeScore[crimes$CrimeGroup=="ASSAULT"] <- 8
crimes$CrimeScore[crimes$CrimeGroup=="BURGLARY"] <- 6
crimes$CrimeScore[crimes$CrimeGroup=="HOMICIDE"] <- 10
crimes$CrimeScore[crimes$CrimeGroup=="LARCENY"] <- 4
crimes$CrimeScore[crimes$CrimeGroup=="ARSON"] <- 5
crimes$CrimeScore[crimes$CrimeGroup=="RAPE"] <- 9
crimes$CrimeScore <- as.factor(crimes$CrimeScore)
#Setting null weapon values to NA
crimes$Weapon[crimes$Weapon == ""] <- NALet’s first try to visualize our cleaned dataset in a time series calendar heatmap, which shows how a numerical variable of interest has changed over the course of time during which this data was collected. The first numerical variable of interest would naturally be the total number of reported incidents. However, when this was plotted in a time-series calendar heatmap, the results were’t very insightful, due to the high number of reported incidents on specific days. To get an idea of the level of violence (or seriousness of crime), we plot the average daily crime scores based on our earlier definition in the time-series calendar heatmap.
sub1 <- crimes
sub1$CrimeScore <- as.integer(sub1$CrimeScore)
sub1 <- sub1 %>% group_by(Year, Month_names, MonthWeeks, Weekday) %>% summarize(sum = mean(CrimeScore, na.rm = TRUE))
ggplot(sub1, aes(MonthWeeks, Weekday, fill = sum)) +
geom_tile(colour = "white") +
facet_grid(Year~Month_names) +
scale_fill_gradient(low="blue", high="red") +
labs(x="Week of Month",
y="",
title = "Time-Series Calendar Heatmap",
subtitle="Mean Crime Scores in Baltimore",
fill="#Incidents")Let’s look at the frequency of incidents by district
library("ggthemes")
ggplot(subset(crimes,!is.na(District) & Year!=2017))+
aes(x=Year, color=District)+
geom_line(stat="count")+
scale_x_continuous(breaks = seq(2012,2017,1))+
scale_y_continuous(breaks = seq(5000,50000,5000))+
labs(title="Frequency of Incidents by Disctrict",x="Districts",y="Number of Incidents") +
theme_solarized(light = FALSE) +
scale_colour_solarized("blue")Let’s make an animated bar plot for number of incidents by type of crime and district and vary it over time
library("animation")
library("gganimate")
crimegroup.animate <- subset(crimes,!is.na(District)) %>%
group_by(CrimeGroup,District,Year) %>%
summarise(cnt=n()) %>%
arrange(desc(cnt))
ani.options(interval = 0.4)
ggplot(crimegroup.animate, mapping = aes(x=District ,y=cnt, fill=CrimeGroup, frame = Year))+
geom_bar(stat = "Identity", position = "identity") +
labs(title = "Number of Incidents: Year {frame_time}" ,x="District",y="Number of Incidents")+
theme_solarized(light = FALSE) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
transition_time(Year) +
ease_aes('linear')Let’s look at the frequency of reported crimes by month and year to see if there exists seasonality
CrimebyYearMonth <- crimes %>%
group_by(Year,Month) %>%
summarize(cnt = n())
CrimebyYearMonth$Year <- factor(CrimebyYearMonth$Year)
ggplot(data=CrimebyYearMonth, aes(x=Month, y=cnt)) +
geom_bar(stat="identity",fill = "maroon") +
labs(title="Frequency of Incidents by Year",x="Month",y="Number of Incidents")+
theme_solarized(light = FALSE) +
scale_x_continuous(breaks = seq(0,12,1))+
facet_wrap(~Year)Let’s again look at which crime types are most prominent, and fill them based on year to see if it has changed drastically across years
CrimeGroupbyYear <- crimes %>%
group_by(CrimeGroup,Year) %>%
summarize(cnt = n()) %>%
transform(CrimeGroup = reorder(CrimeGroup,-cnt))
ggplot(CrimeGroupbyYear, aes(x = reorder(CrimeGroup, cnt), y = cnt, fill = Year)) +
geom_bar(stat = 'identity') +
coord_flip() +
labs(x = '', y = 'Total Number of Crimes') +
annotate("text", x = 3.5, y = 150, label = "Larceny and Assault dominate")Let’s see how the mean crime score has changed over time in each of the districts.
crimes.District.CrimeScore <- crimes[!is.na(crimes$District),] %>%
group_by(District, Year) %>%
summarise(mean_CrimeScore = mean(as.integer(CrimeScore)))
colnames(crimes.District.CrimeScore)[1] <- "District"
ggplot(data = crimes.District.CrimeScore, aes(x = Year, y = mean_CrimeScore, color = District))+
geom_line() +
labs(title = "Mean of Crime Score by District\n", x = "Year", y = "Average Crime Score", color = "District") +
theme_solarized(light = FALSE)Let’s see if being inside or outside has any effect on the seriousness of the crime committed.
ggplot(data =crimes[!is.na(crimes$Inside.Outside),] , aes(x = CrimeScore, fill = Inside.Outside))+
geom_bar(position = 'dodge') +
labs(title = "Crime Score by Inside Outside", x = "Crime Score", y = "Number", color = "Inside or Outside") +
theme_solarized(light = FALSE)Let’s now try to take a look at how the number of crimes varies by time of the day and the day of the week. Naturally, we would assume that more crimes happen in the evening - let’s see if this is actually true.
ggplot(subset(crimes))+
aes(x=Hour_Factor, colour= Weekday)+
geom_line(stat="count")+
scale_x_continuous(breaks = seq(0,23,1),limit=c(0,23))+
scale_y_continuous(breaks = seq(500,3000,250),limit=c(500,2500))+
labs(title="Hours of Day",x="Hour",y="Number of Incidents") +
theme_solarized(light = FALSE) * Clearly, the reported crimes are more in the evening * The number of reported crimes at 11.30 AM is abruptly high. It drops till about 2 PM when it picks back up on its upward trend * Weekends (Saturdays and Sundays) experience higher reported crimes starting 12 midnight till ealry morning. However, after 6 AM the number of reported crimes is lower than the weekdays.
It is always in everyone’s interest to know which areas are unsafe for visiting by plotting the reported incidents on a map. The below map of Baltimore has reported crimes plotted using red circles. Areas with high rates of crime will appear pre-dominantly red.
library(ggmap)
#plot the hybrid Google Maps basemap
map <- qmap('Baltimore', zoom = 13, maptype = 'hybrid')
#plot the crime points on top
map + geom_point(data = crimes, aes(x = Longitude, y = Latitude), color="red", size=1, alpha=0.01) As can be seen from the map, majority of the crimes take place in and around Downtown Baltimore. There are relatively higher rates of reported crimes in the highways leading to the city. South Baltimore looks like a relatively peaceful area.
Let’s conclude with one final visualization highlighting the most dangerous neighborhoods in Baltimore in terms of reported crimes. As can be validated from the map above, the most dangerous neighborhoods in Baltimore are the following:
These are some of the main things that I learned through this VDE in order of importance are the followings: