In this project, I analyzed criminal incident data for San Francisco from Summer 2014. The data set contains 28993 incident records from June 1st, 2014 to August 31st, 2014. Each row represents one incident record, with variables such as Category (crime type), Date, DayOfWeek (Monday, Tuesday, …), PdDistrict (neighborhood), etc. I address the following questions:

The visualizations include time series, a histogram, heat maps and pie charts. The findings are summarized at the end of the report.


Useful Packages

Here I used R to produce the visualizations. The R packages used are listed below. The package “lubridate” is used to manipulate dates and times; “dplyr” provides functions for grouping and summarising data frames; “ggplot2” is an implementation of the grammar of graphics in R; “scales” provides scaling and label-formatting helpers for plots.

library(lubridate)
library(scales)
library(ggplot2)
library(dplyr)

Data Manipulation

It is necessary to manipulate the raw data. Here I loaded the data into R and created some new variables for my analysis.

# load the data into R
remove(list = objects())
mydata = read.csv(file = "sanfrancisco_incidents_summer_2014.csv", header = TRUE, sep = ",")

# data manipulation
mydata$Date = as.POSIXct(mydata$Date, format="%m/%d/%Y")
mydata$Year = year(mydata$Date) 
mydata$Month = month(mydata$Date)
mydata$Day = day(mydata$Date)
mydata$Time = as.POSIXct(mydata$Time, format="%H:%M")
mydata$Hour = hour(mydata$Time)
mydata$Minute = minute(mydata$Time)
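
Before running the conversions on the full file, it can help to check them on a few hand-made rows. The sketch below is only an illustration: the data frame `toy` and its values are made up, and `tz = "UTC"` is an added assumption so that the result does not depend on the local time zone. It shows that a date string which does not match the format is turned into NA rather than raising an error, so counting NAs after parsing is a cheap sanity check.

```r
library(lubridate)

# Hypothetical rows that mimic the Date and Time columns of the CSV
toy = data.frame(Date = c("06/01/2014", "08/31/2014", "bad value"),
                 Time = c("00:05", "17:30", "23:59"),
                 stringsAsFactors = FALSE)

# Same conversions as above; tz = "UTC" is an added assumption that keeps
# the result independent of the local time zone
toy$Date = as.POSIXct(toy$Date, format = "%m/%d/%Y", tz = "UTC")
toy$Time = as.POSIXct(toy$Time, format = "%H:%M", tz = "UTC")
toy$Month = month(toy$Date)
toy$Day = day(toy$Date)
toy$Hour = hour(toy$Time)

# Rows whose date did not match the format come back as NA
n_bad_dates = sum(is.na(toy$Date))
print(toy[, c("Month", "Day", "Hour")])
print(n_bad_dates)
```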

Exploratory Data Analysis

It can be seen that there is a weekly cycle in the number of incidents in San Francisco. Four crime peaks occurred: in early June, in late June, in the week of August 7 and in late August.

# time series
daily = group_by(mydata, Date)
day_counts = summarise(daily, count = n())

ggplot(day_counts, aes(x = Date, y = count)) + geom_point(colour = "red") + 
  geom_line(colour = "red", size = 1.5) + 
  theme_light(base_size = 12) + xlab("Date") + ylab("Count of incidents") + 
  ggtitle("The number of incidents in Summer 2014 of San Francisco") + 
  theme(plot.title=element_text(size=16))
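
One way to check the weekly cycle numerically is to average the daily counts by day of week. The following is a minimal sketch on made-up numbers, not results from the data set: `toy_days` is a hypothetical table of daily counts. With the real data, a similar table could presumably be built with `summarise(group_by(mydata, Date, DayOfWeek), count = n())`.

```r
library(dplyr)

# Hypothetical daily counts for two weeks (made-up numbers for illustration)
toy_days = data.frame(
  Date = as.Date("2014-06-01") + 0:13,
  DayOfWeek = rep(c("Sunday", "Monday", "Tuesday", "Wednesday",
                    "Thursday", "Friday", "Saturday"), 2),
  count = c(300, 310, 305, 320, 330, 360, 340,
            290, 315, 300, 325, 335, 370, 345),
  stringsAsFactors = FALSE)

# Average the daily counts per weekday and sort from busiest to quietest
weekday_avg = summarise(group_by(toy_days, DayOfWeek), avg_count = mean(count))
weekday_avg = arrange(weekday_avg, desc(avg_count))
print(weekday_avg)
```

If the weekly cycle is real, the same weekdays should sit at the top of this table regardless of which weeks are included.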

The following time series and histogram show the average number of incidents by time of day. Crime clearly increases from the afternoon to midnight. Note that “hour X” here means the period from “X:00” to “X:59”. There are two peaks of more than 20 incidents on average at hours 17 and 18, indicating that more incidents take place during the periods 5pm - 5:59pm and 6pm - 6:59pm. There are fewer incidents in the early morning (1am - 7am), with fewer than 10 on average. In particular, very few incidents happen during hour 4 (4am - 4:59am).

# average counts per hour
daily_group = group_by(mydata, Month, Day, Hour)
day_hour_counts = summarise(daily_group, count = n())
hour_group = group_by(day_hour_counts, Hour)
hour_avg_counts = summarise(hour_group, count = mean(count))

# time series: average counts by time of day
ggplot(hour_avg_counts, aes(x = Hour, y = count)) + geom_point(colour = "red") + 
  geom_line(colour = "red", size = 1.5) + 
  theme_light(base_size = 12) + xlab("Time of day") + ylab("Count of incidents") + 
  scale_x_continuous(breaks=c(0:23)) + 
  ggtitle("The average number of incidents by time of day") + 
  theme(plot.title = element_text(size = 16))

# histogram: average counts by time of day
ggplot(hour_avg_counts, aes(x = Hour, y = count)) + 
  geom_bar(position = "dodge", stat = "identity", fill = "#FF9933") +
  theme_light(base_size = 12) + labs(x = "Time of day", y = "Count of Incidents") + 
  scale_x_continuous(breaks=c(0:23)) + 
  ggtitle("The average number of incidents by time of day") + 
  theme(plot.title = element_text(size = 16))
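
Note that the averages above are taken over the day-hour combinations that contain at least one record, because grouping and counting produces no row for an hour with zero incidents. For quiet hours this can overstate the mean. The sketch below, on a made-up data frame `toy`, contrasts that approach with an alternative that divides the total count per hour by the number of days. The alternative is my own suggestion for comparison, not part of the analysis above.

```r
library(dplyr)

# Hypothetical incident records: one row per incident (made-up values)
toy = data.frame(Day  = c(1, 1, 1, 2, 2, 3),
                 Hour = c(4, 17, 17, 17, 17, 17))

# Average over observed day-hour cells only (the approach used above)
cell_counts = summarise(group_by(toy, Day, Hour), count = n())
avg_observed = summarise(group_by(cell_counts, Hour), count = mean(count))

# Average over all days: total incidents in the hour divided by the number of days
n_days = n_distinct(toy$Day)
avg_all_days = summarise(group_by(toy, Hour), count = n() / n_days)

print(avg_observed)
print(avg_all_days)
```

In this toy example hour 4 has a single incident in three days, so the first method reports an average of 1 while the second reports 1/3. Hour 17 has incidents on every day, so both methods agree.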


Heat maps are graphical representations of data in which each value of a matrix is shown as a color. I used heat maps to explore the relationships among the variables PdDistrict (neighborhood), Category (crime type) and Hour (time of day). The first heat map shows the relationship between Category (crime type) and Hour (time of day). LARCENY/THEFT is on average the most common crime type (the darker green areas in the heat map), with 8.89 and 7.81 incidents at the peak hours 6pm and 7pm respectively. The types OTHER OFFENSES, NON-CRIMINAL, MISSING PERSON, DRUG/NARCOTIC and ASSAULT are also common during the day in San Francisco.

hourly_group = group_by(mydata, Category, Month, Day, Hour)
category_day_hour_counts = summarise(hourly_group, count = n())
category_hourly_group = group_by(category_day_hour_counts, Category, Hour)
category_hour_avg_counts = summarise(category_hourly_group, count = mean(count))

ggplot(category_hour_avg_counts, aes(x = Hour, y = Category)) + 
  geom_tile(aes(fill = count)) + 
  scale_fill_gradient(name = "Average counts", low = "white", high = "green") +
  scale_x_continuous(breaks=c(0:23)) + 
  # theme_light() must come first: a complete theme resets earlier theme() settings
  theme_light(base_size = 10) + theme(axis.title.y = element_blank()) + 
  theme(plot.title = element_text(size=16)) + 
  ggtitle("The number of incidents: Hour vs. Category")

The second heat map explores the relationship between the variables PdDistrict (neighborhood) and Hour (time of day). SOUTHERN turns out to be the neighborhood with the most incidents, with on average 5.04 and 5.11 incidents during the periods 5pm - 5:59pm and 6pm - 6:59pm. Incidents also happen frequently in MISSION, NORTHERN and CENTRAL.

hourly_group = group_by(mydata, PdDistrict, Month, Day, Hour)
district_day_hour_counts = summarise(hourly_group, count = n())
district_hourly_group = group_by(district_day_hour_counts, PdDistrict, Hour)
district_hour_avg_counts = summarise(district_hourly_group, count = mean(count))

ggplot(district_hour_avg_counts, aes(x = Hour, y = PdDistrict)) + 
  geom_tile(aes(fill = count)) + 
  scale_fill_gradient(name = "Average counts", low = "white", high = "green") +
  scale_x_continuous(breaks=c(0:23)) + 
  # theme_light() must come first: a complete theme resets earlier theme() settings
  theme_light(base_size = 10) + theme(axis.title.y = element_blank()) + 
  theme(plot.title = element_text(size = 16)) + 
  ggtitle("The number of incidents: Hour vs. PdDistrict") 

The third heat map reports the number of incidents in terms of the variables PdDistrict (neighborhood) and Category (crime type). LARCENY/THEFT, the most common incident type, happens frequently in the three areas SOUTHERN, NORTHERN and CENTRAL, with 27.85, 16.29 and 17.10 incidents per day respectively. It can also be observed that VEHICLE THEFT, OTHER OFFENSES and ASSAULT are common types of crime across the neighborhoods.

category_group = group_by(mydata, Month, Day, PdDistrict, Category)
day_district_category_counts = summarise(category_group, count = n())
district_category_group = group_by(day_district_category_counts, PdDistrict, Category)
district_category_avg_counts = summarise(district_category_group, count = mean(count))

ggplot(district_category_avg_counts, aes(x = PdDistrict, y = Category)) + 
  geom_tile(aes(fill = count)) + 
  scale_fill_gradient(name="Average counts", low="white", high="green") +
  # theme_light() must come first: a complete theme resets earlier theme() settings
  theme_light(base_size = 10) + theme(axis.title.y = element_blank()) + 
  theme(plot.title = element_text(size = 16)) + 
  ggtitle("The number of incidents: PdDistrict vs. Category") + 
  theme(axis.text.x = element_text(angle = 45,size = 8, vjust = 0.5)) 


Two pie charts are used to represent the proportion of crime in terms of the variables Category (crime type) and PdDistrict (neighborhood). Because there are 28 types of crime, I listed only the top 5 types and classified the remaining ones (types 6 - 28) as “OTHERS”. In the first pie chart, LARCENY/THEFT accounts for 32.65% of all incidents, followed by OTHER OFFENSES, NON-CRIMINAL, ASSAULT and VEHICLE THEFT. In the pie chart for neighborhood, one can see that the top 4 areas are SOUTHERN, MISSION, NORTHERN and CENTRAL, with shares of 19.79%, 12.70%, 12.38% and 12.11% respectively. These results are in line with those of the heat maps.

# Pie chart for Category
category = group_by(mydata, Category)
category_counts = summarise(category, count = n())
top = 5
sorted_category_counts = category_counts[rev(order(category_counts$count)),]
top_category = sorted_category_counts[1:top,]
# nrow() gives the number of categories; length() would give the number of columns
end = nrow(sorted_category_counts)
# parentheses are needed because ":" binds tighter than "+"
other_counts = sum(sorted_category_counts$count[(top + 1):end])
other_category = data.frame(Category = c("OTHERS"), count = c(other_counts))
new_category_counts = rbind(top_category, other_category)
proportion = prop.table(new_category_counts$count)
category_counts_prop = cbind(new_category_counts,proportion)
y.breaks = cumsum(category_counts_prop$count) - category_counts_prop$count/2

ggplot(category_counts_prop, aes(x = 1, y = count, fill = Category)) + geom_bar(width = 1, stat = "identity", color = "black") + 
  coord_polar(theta = "y", start = 0) + ggtitle("Crime type in Summer 2014 of San Francisco")+ 
  guides(fill = guide_legend(override.aes = list(colour = NA))) + 
  theme(axis.ticks = element_blank(), axis.title = element_blank(), 
        axis.text.y = element_blank(), axis.text.x = element_text(color = "black"),
        panel.background = element_rect(fill = "white")) + 
  scale_y_continuous(breaks = y.breaks, labels = category_counts_prop$Category) 

# Pie chart for PdDistrict
district = group_by(mydata, PdDistrict)
district_counts = summarise(district, count = n())
sorted_district_counts = district_counts[rev(order(district_counts$count)),]
district_prop = prop.table(sorted_district_counts$count)
district_counts_prop = cbind(sorted_district_counts,district_prop)
y.breaks = cumsum(district_counts_prop$count) - sorted_district_counts$count/2

ggplot(sorted_district_counts, aes(x = 1, y = count, fill = PdDistrict)) + 
  geom_bar(width = 1, stat = "identity", color = "black") + 
  coord_polar(theta = "y", start = 0) + ggtitle("Crime area in Summer 2014 of San Francisco")+ 
  guides(fill = guide_legend(override.aes = list(colour = NA))) + 
  theme(axis.ticks = element_blank(), axis.title = element_blank(), 
        axis.text.y = element_blank(), axis.text.x = element_text(color = "black"),
        panel.background = element_rect(fill = "white")) + 
  scale_y_continuous(breaks = y.breaks, labels = sorted_district_counts$PdDistrict) 
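
The percentage figures quoted for the pie charts can be tabulated directly from the counts. The sketch below uses a made-up table `toy_counts` (the district names "A", "B", "C" and their counts are hypothetical) together with `prop.table()` and the `percent()` helper from the “scales” package. With the real data, `sorted_district_counts` would take the place of `toy_counts`.

```r
library(scales)

# Hypothetical district counts (made-up numbers, not taken from the data set)
toy_counts = data.frame(PdDistrict = c("A", "B", "C"),
                        count = c(30, 50, 20),
                        stringsAsFactors = FALSE)

# Share of each district, sorted from largest to smallest
toy_counts = toy_counts[order(toy_counts$count, decreasing = TRUE), ]
toy_counts$share = prop.table(toy_counts$count)

# Format the shares as percentages with two decimals
toy_counts$label = percent(toy_counts$share, accuracy = 0.01)
print(toy_counts)
```

Printing such a table next to a pie chart makes the exact shares easy to read, since slices of similar size are hard to compare by eye.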

Summary

The main findings of the report are summarized as follows. In San Francisco during Summer 2014,

Appendix

I also made scatter plots of the location variables X and Y. However, this visualization is not effective. The reason is that there are a large number of overlapping data points, so our eyes cannot accurately distinguish the many color levels or judge the density. Only two of the scatter plots are shown here for illustrative purposes.

# scatter plot
ggplot(mydata, aes(x = X, y = Y)) + geom_point(aes(colour = factor(PdDistrict)), size = 1.25) + 
  theme_light(base_size = 10) + xlab("X") + ylab("Y") +
  ggtitle("PdDistrict") + theme(plot.title=element_text(size = 16))

ggplot(mydata, aes(x = X, y = Y)) + geom_point(aes(colour = factor(DayOfWeek)), size = 1.25) + 
  theme_light(base_size = 10) + xlab("X") + ylab("Y") +
  ggtitle("Day of Week") + theme(plot.title=element_text(size = 16))