In this project I will investigate the Crime Data from the City of Cambridge from 2009 to 2016
First, I’ll install the packages I might need for this analysis.
library(readr)
library(stringr)
library(ggplot2)
library(dplyr)
library(purrr)
library(tidyr)
library(magrittr)
getwd()
[1] "C:/Users/Ana/Desktop/Data Analytics"
Importing data:
setwd("C:/Users/Ana/Desktop/Data Analytics/CSV Files")
The working directory was changed to C:/Users/Ana/Desktop/Data Analytics/CSV Files inside a notebook chunk. The working directory will be reset when the chunk is finished running. Use the knitr root.dir option in the setup chunk to change the working directory for notebook chunks.
raw_data <- read_csv("crime_data_csv.csv")
Parsed with column specification:
cols(
`File Number` = [31mcol_character()[39m,
`Date of Report` = [31mcol_character()[39m,
`Crime Date Time` = [31mcol_character()[39m,
Crime = [31mcol_character()[39m,
`Reporting Area` = [32mcol_double()[39m,
Neighborhood = [31mcol_character()[39m,
Location = [31mcol_character()[39m
)
#column_names <- colnames(raw_data)
#print(column_names)
Here is the first 10 rows of the raw data:
head(raw_data, 10)
The data has 7 columns:
- File number - a reference number for the crime
- Date of report - when the report was made
- Crime Date Time - The date and time of the crime
- Reporting Area - Subset of Neighborhood
- Neighbourhood - Neighborhood within Cambridge
- Location - specific location where crime took place
Next, find out:
- How many rows of raw data are there?
- How many different categories of crimes are there?
- How many reporting areas are there?
- How many neighbourhoods are there?
count(raw_data[,1])
crime_types <- unique(raw_data$Crime)
no_of_reporting_areas <- unique(raw_data$`Reporting Area`)
neighborhoods <- unique(raw_data$Neighborhood)
print(as.data.frame(neighborhoods))
length(crime_types)
[1] 54
length(no_of_reporting_areas)
[1] 118
length(neighborhoods)
[1] 14
So there are 54 different types of crime, 118 different reporting areas, and 13 different neighborhoods (note, not 14 as one is an NA entry).
Here are some things I will investigate with this data:
- What are the most commonly occurring crimes in Cambridge?
- What neighbourhoods have the highest crime rate?
- Are the most common types of crime the same across all areas?
- Is there any seasonal variation with crime types?
- Is there any time variation with crime types? i.e. do certain crimes happen at certain times of the day?
- How have crimes changed over the years?
What is the best way to present all of this data?
- Bar chart (x = crime type, y = no. of occurrances)
- Bar chart (x = neighborhood, y = no. of crimes)
- Bar chart of the number of incidents of each crime and split the incidents into colours representing each neighborhood.
- Line graph (x = month, y = number of incidents of each type of crime. Plot different crime types in each colour)
- Line graph (x = time of day (in hours), y = number of incidents of each type of crime.)
- Line graph (x = year, y = totals of each crime (plot different crimes in different colours))
I’m not going to do any analysis on ‘date of report’ as this is just sometime around the date of crime so I shall remove this column from the data set.
Next I will clean the data and check for any missing values. The ‘Crime Date Time’ column needs to be rearranged. It would be good to split this column up. Sometimes a range of dates and times are inputted into this column. Here are the first 10 entries in the existing Crime Date Time column:
data_clean <- raw_data %>%
select(-'Date of Report')
date_time_v <- raw_data$`Crime Date Time`
head(date_time_v, 10)
[1] "01/01/2009 00:39" "01/01/2009 01:34"
[3] "01/01/2009 02:20 - 02:35" "01/01/2009 02:20 - 02:45"
[5] "01/01/2009 02:15 - 02:35" "01/01/2009 02:45"
[7] "01/01/2009 02:37" "01/01/2009 10:30 - 10:42"
[9] "01/01/2009 10:35 - 10:47" "12/31/2008 10:00 - 01/01/2009 10:00"
Here are the new columns: Date Start, Time Start, Date End, Time End.
col_names <- c("crime_st_date", "crime_st_time", "crime_end_date", "crime_end_time")
info_splitted <- as.data.frame(str_split(date_time_v, " ", simplify = TRUE))
info_splitted <- info_splitted %>%
select(-V3) %>%
mutate(V5 = if_else(V5 == "", V4, V5)) %>%
mutate(V4 = if_else(V5 == V4, "", V4))
colnames(info_splitted) <- col_names
print(head(info_splitted, 10))
NA
Now I want to add these columns back to the main dataframe. I can also now remove the original Crime Date Time column. Here’s the dataframe as it currently stands:
crime_data <- cbind(data_clean, info_splitted)
crime_data <- crime_data %>%
select(-`Crime Date Time`)
print(head(crime_data, 10))
Next I should check for any missing data.
logic_na <- as.logical(rowSums(is.na(crime_data)))
#print(head(logic_na, 10))
na_rows <- crime_data[logic_na,]
#na_rows
length(na_rows[,1])
[1] 85
There are 85 rows which contain NA values. Next I want to quickly see if there are certain columns which contain most of the NA values. To do this I can sum the number of NA values for each column. I can also plot a visual aid to show the number of NA values in each column.
no_nas <- as.data.frame(colSums(is.na(na_rows)))
colnames(no_nas) <- "No. of NA Values"
no_nas
The table shows that it is mostly data in the Location column that is missing. Here is a visual aid to illustrate this.
library(purrr)
df_na <- map_df(na_rows, function(x) as.numeric(is.na(x)))
df_na_heat <- df_na %>%
pivot_longer(cols = everything(),
names_to = "x") %>%
group_by(x) %>%
mutate(y = row_number())
plot_na_matrix <- function(df_na) {
# Preparing the dataframe for heatmaps
df_heat <- df_na %>%
pivot_longer(cols = everything(),
names_to = "x") %>%
group_by(x) %>%
mutate(y = row_number())
# Ensuring the order of columns is kept as it is
df_heat <- df_heat %>%
ungroup() %>%
mutate(x = factor(x,levels = colnames(df_na)))
# Plotting data
g <- ggplot(data = df_heat, aes(x=x, y=y, fill=value)) +
geom_tile() +
theme(legend.position = "none",
axis.title.y=element_blank(),
axis.text.y =element_blank(),
axis.ticks.y=element_blank(),
axis.title.x=element_blank(),
axis.text.x = element_text(angle = 90, hjust = 1))
# Returning the plot
g
}
plot_na_matrix(df_na)

As we could already see before, most of the missing values are in the Location column. There are also 2 missing values in the ‘Reporting Area’ and ‘Neighborhood’ columns but for both these instances there IS an entry in the Location column.
There isn’t any way of finding the exact location data from the rest of the data. There is probably a way of getting the Reporting Area and Neighborhood from the Location data however 2 out of 56000+ rows of data is not significant so the best thing to do is to probably remove the rows of data which don’t have the Reporting Area and Neighborhood. There’s also an entry for “admin error” under Crime so I am going to disregard these rows.
#length(crime_data[,1]) # See how many rows there were before
crime_data <- crime_data %>%
drop_na(`Reporting Area`) %>%
filter(Crime != "Admin Error")
#length(crime_data[,1]) # See how many rows there are after dropping the NA values from 'Reporting Area' and removing entries for "Admin Error" in the Crime column
Now I have cleaned the data and removed the rows containing NA values, I can start to analyse the data.
Q1: what is the overall crime profile of Cambridge?
crime_data_grouped <- crime_data %>%
group_by(Crime) %>%
summarise(no_of_incidents = n()) %>%
arrange(desc(no_of_incidents))
crime_data_grouped
ggplot(data = crime_data_grouped,
aes(x = Crime, y = no_of_incidents, fill = 'blue')) +
theme(axis.text.x = element_text(angle=60, hjust=1)) +
scale_x_discrete(limits=crime_data_grouped$Crime) +
geom_bar(stat='identity')+
labs(title = "No. of Incidents (all years) vs Crime Type", x = "Crime Type", y = "No. of Incidents") +
theme(panel.background = element_rect(fill = "white"))

Q2:Which Neighborhoods have the highest crime rate?
neighborhood_data_grouped <- crime_data %>%
group_by(Neighborhood) %>%
summarise(no_of_incidents = n()) %>%
arrange(desc(no_of_incidents))
neighborhood_data_grouped
ggplot(data = neighborhood_data_grouped,
aes(x = Neighborhood, y = no_of_incidents, fill = Neighborhood)) +
theme(axis.text.x = element_text(angle=60, hjust=1)) +
scale_x_discrete(limits=neighborhood_data_grouped$Neighborhood) +
geom_bar(stat='identity') +
labs(title = "Crime Rate vs Neighborhood", x = "Neighborhood", y = "No. of Incidents (all crimes)") +
theme(panel.background = element_rect(fill = "white"))

I should really find out the population of each of these neighborhoods as that would give a better idea of the crime rate.
Q3: Are the most common types of crime the same across all areas?
I will create a bar chart of the number of incidents of each crime and split the incidents into colours representing each neighborhood.
nc_data_grouped <- crime_data %>%
group_by(Neighborhood, Crime) %>%
summarise(no_of_incidents = n())
#nc_data_grouped
ggplot(data = nc_data_grouped,
aes(x = Crime, y = no_of_incidents, fill = Neighborhood)) +
theme(axis.text.x = element_text(angle=60, hjust=1)) +
geom_bar(stat='identity') +
labs(title = "No. of Crime incidents per Crime Type", x = "Crime Type", y = "No. of Incidents") +
theme(panel.background = element_rect(fill = "white"))

From this plot, it looks like most crimes occur across all the neighborhoods. i.e. there is no crime which is more one colour than another. There’s probably some more in depth analysis possible, but for now, we’ll just consider that the crime profiles are pretty similar across the different neighborhoods.
Q4: Is there any seasonal variation with Crime types?
For this, it would be good to plot a line graph with the months on the x-axis and the total number of crimes in each month on the y-axis. However, the crime_st_date data needs to be manipulated to extract the month. Use regex to extract the month in digits and add a column for ‘crime_month’
Here’s the dataframe again with the new column ‘crime_month’ added:
#head(crime_data, 20)
pattern = "\\b(.+)\\/.+\\/"
month <- str_match(crime_data$crime_st_date, pattern)[,2]
#head(month)
crime_data <- crime_data %>%
mutate(crime_month = month) %>%
mutate(crime_month = str_pad(crime_month, 2, side = "left", pad = "0"))
#print(unique(crime_data$crime_month)) # checking what months I get - it should be 01 - 12
head(crime_data,10)
Now I have a column with the months. I now want to plot the frequency of each type of crime for each month and then plot line graphs to see if there’s any seasonal variation.
#head(crime_data, 20)
total_seasonal_data <- crime_data %>%
group_by(crime_month) %>%
summarise(monthly_total = n()) %>%
mutate(crime_month = as.integer(crime_month)) #%>%
#mutate(crime_month = month.abb[crime_month])
ggplot(data = total_seasonal_data,
aes(x = crime_month, y = monthly_total, fill = "blue")) +
theme(axis.text.x = element_text(angle=60, hjust=1)) +
geom_bar(stat = 'identity') +
labs(title = "All Crime Rate Vs Month", x = "Month", y = "No. of Crimes") +
theme(panel.background = element_rect(fill = "white")) +
scale_x_discrete(limits = month.abb)

seasonal_data <- crime_data %>%
group_by(Crime, crime_month) %>%
summarise(monthly_totals = n()) %>%
mutate(crime_month = as.integer(crime_month))
ggplot(data = seasonal_data,
aes(x = crime_month, y = monthly_totals, color = Crime)) +
geom_line() +
labs(title = "Crime Rate Vs Month", x = "Month", y = "No. of Crimes") +
theme(panel.background = element_rect(fill = "white")) +
scale_x_discrete(limits=month.abb) +
theme(axis.text.x = element_text(angle=90, hjust=1))

NA
NA
NA
The bar graph showing all crimes over the course of the year shows clearly that the number of crimes increases across the summer months and decreases in the winter months, with a small increase in crimes in January.
It would be really interesting to see whether this overall trend is driven by specific crimes or whether all crimes follow this general trend.
However, when I plotted all the crimes separately on a line graph then there were too many lines to work out what was going on. I think it would be best to plot them all on separate graphs, but there’s a lot. I can tell that some of the crimes show significant variation in incidence across the year.
I think it would be good to plot only the graphs that are interesting i.e. there is some significant variation in incidence level accross the year. To do this, I will look at the variation of the data within each crime.
The next table shows the Crime and the corresponding standard deviation of the monthly incidents of each crime.
sd_seasonal_data <- crime_data %>%
group_by(Crime, crime_month) %>%
summarise(monthly_totals = n()) %>%
group_by(Crime) %>%
summarise(st_dev = sd(monthly_totals)) %>%
arrange(desc(st_dev))
head(sd_seasonal_data, 20)
This confirms what I could see from the data. Next, I am going to plot graphs of crimes which show signficant variation in incidence level across the year. I’ll plot the 10 crimes with the greatest standard deviation.
high_sd_crimes <- sd_seasonal_data$Crime[1:10]
#length(sd_seasonal_data$Crime)
#high_sd_crimes
high_sd_seasonal_data <- subset(seasonal_data, Crime %in% high_sd_crimes)
ggplot(data = high_sd_seasonal_data,
aes(x = crime_month, y = monthly_totals)) +
facet_wrap(~Crime, nrow = 2) +
geom_line() +
labs(title = "Top 10 Most Season-Dependant Crimes: Crime Rate Vs Month", x = "Month", y = "No. of Crimes") +
theme(panel.background = element_rect(fill = "white")) +
scale_x_discrete(limits=month.abb) +
theme(axis.text.x = element_text(angle=90, hjust=1))

In summary - there is seasonal variation with total numbers of crimes committed. This overall variation is caused by a number of crimes which have a distict summer preference, most notably, larceny of bicycle. This particular crime sways the overall data so much because there are a lot of incidents of this crime. Interestingly, Domestic Disputes seem to occur more in the early summer, peaking in June. I don’t know why this is. Hit and Runs seem to peak in January and February. Would be interesting to find out why but don’t think this data will tell us that!
There are 53 crime types in total. Let’s also see which have the lowest standard deviation.
print(low_sd_crimes <- sd_seasonal_data$Crime[44:53])
[1] "Prostitution" "Liquor Possession/Sale" "Annoying & Accosting" "Peeping & Spying"
[5] "Kidnapping" "Sex Offender Violation" "Extortion/Blackmail" "Homicide"
[9] "Stalking" "Gambling"
These crimes seem to occur throughout the year with very little variation between months.
Q5: Is there any time variation with crime type?
For this, I need to create a new column with the time grouped in hours. I can just take the first 2 digits from the crime_st_time column.I’m going to call this new column ‘crime_hour’
pattern <- "(\\d+):"
crime_data_hours_1 <- crime_data %>%
mutate(crime_hour = str_match(crime_st_time, pattern)[,2]) %>%
mutate(crime_hour = str_pad(crime_hour, 2, side = "left", pad = "0"))
crime_data_hours <- crime_data_hours_1 %>%
group_by(Crime, crime_hour) %>%
summarise(no_crimes_per_hour = n()) %>%
mutate(crime_hour = as.integer(crime_hour)) %>%
filter(crime_hour != is.na(crime_hour))
#print(unique(crime_data_hours$crime_hour))
head(crime_data_hours, 20)
all_crime_data_hours <- crime_data_hours %>%
group_by(crime_hour) %>%
summarise(total_crimes_per_hour = sum(no_crimes_per_hour)) %>%
filter(crime_hour != is.na(crime_hour))
ggplot(data = all_crime_data_hours,
aes(x = crime_hour, y = total_crimes_per_hour)) +
geom_line() +
labs(title = "Crime Rate Vs Time in 24 hours", x = "Time (hours)", y = "No. of Crimes") +
#theme(panel.background = element_rect(fill = "white")) +
scale_x_continuous(breaks=c(0:23))

ggplot(data = crime_data_hours,
aes(x = crime_hour, y = no_crimes_per_hour)) +
facet_wrap(~Crime, nrow = 5) +
geom_line() +
labs(title = "Crime Rate Vs Time in 24 hours", x = "Time (hours)", y = "No. of Crimes") +
#theme(panel.background = element_rect(fill = "white"))+
scale_x_continuous(breaks=c(0,4,8,12,16,20))

NA
NA
The plot of overall crimes seems to show that the peak time for crimes (of any sort) to be committed is 6pm. The least likely time for a crime (any sort) to be committed is 4am-5am. There is also another peak crime time which is midday. This seems unusual and may be worth further investigation. Could it be that if no time is entered, the default is midday? Or is it that midday day is usually popular with criminals? Maybe looking at the individual crimes will shed more light on this.
From the individual crime plots - once again, some crimes appear to have no or very little relation to time of day, however with some others it is quite marked.
Again, calculate the standard deviation of each crime and plot the crimes which show the most variation (top 10).
crime_data_hours_sd <- crime_data_hours %>%
group_by(Crime) %>%
summarise(st_dev = sd(no_crimes_per_hour)) %>%
arrange(desc(st_dev))
crime_data_hours_sd
crime_data_hrs_sd_top10 <- crime_data_hours_sd$Crime[1:10]
#crime_data_hrs_sd_top10
top10_sd_time_data <- subset(crime_data_hours, Crime %in% crime_data_hrs_sd_top10)
ggplot(data = top10_sd_time_data,
aes(x = crime_hour, y = no_crimes_per_hour)) +
facet_wrap(~Crime, nrow = 2) +
geom_line() +
labs(title = "Top 10 most Time-Dependant Crimes: Crime Rate Vs Time in 24 hours", x = "Time", y = "No. of Crimes") +
theme(panel.background = element_rect(fill = "white")) +
scale_x_continuous(breaks=c(0, 4, 8, 12, 16, 20))

NA
NA
You can see from these plots that certain crimes are more likely to happen at certain times of the day. Domestic disputes are most likely to occur in the evening, peaking at 8pm. Car theft is most likely to happen in the evening as well, highest between 6pm and 8pm. Shoplifting unsurprising occurs during most shop opening hours and peaks between 3pm and 6pm. The graph for forgery is interesting. There seems to be a very large peak at midday. It seems unlikely that criminals have a preference for carrying out forgery at midday. This -could- be due to the fact that forgery is reported after the actual crime is commited and people generally don’t know when it was committed and therefore the time is recorded as default midday. Some qualitative analysis/questionnaires would perhaps be able to answer this question. Quite a few of the other crimes have peaks at midday as well.
One way of potentially discovering why is to look back at the original data for time. Time data is given in HH:MM. i.e 12:35, 15:15. If there is a default setting for midday, that would mean that many of the crimes would be entered as 12:00 instead of, say 12:10, 12:40
Try this: for each hour, count the total number of entires and also count the number of entires which are of the format HH:00.
pattern = ":00"
crime_data_midday <- crime_data_hours_1 %>%
mutate(logic_midday = str_detect(crime_st_time, pattern)) %>%
group_by(crime_hour, logic_midday) %>%
summarise(no_1200 = n()) %>%
pivot_wider(names_from = logic_midday, values_from = no_1200) %>%
mutate(percent_00 = `TRUE`/`FALSE`*100) %>%
arrange(desc = -percent_00)
crime_data_midday
Interestingly the highest proportion of entires for exactly on the hours is for 00:00 i.e. midnight. Next is 12:00 and then 08:00.
Even though 95% of entries for 00:00-01:00 are recorded as 00:00 exactly, from looking at the graphs, there doesn’t seem to be an unusual peak. This could suggest that people record crimes at exactly midnight but they largely do occur around midnight.
For the midday entries, 78% are recorded as exactly midday, but there seems to be a peak purely for this hour on the graphs.
This does support the theory that these entries could be due to default time settings but I’d need to do some qual research to check.
Q6: How have crimes changed over the years? For this, I want to see if any crimes have any particular trend across the years that this data has been collected. The data covers the years 2009 - 2016. First, I’ll plot the total crime trend across the years, then I’ll look at individual crimes, and again plot those crimes with the biggest variability in data across the years.
#head(crime_data)
pattern = "/\\d+/(\\d+)\\b"
dates <- crime_data$crime_st_date
#head(str_match(dates, pattern))
year_crime_data_1 <- crime_data %>%
mutate(crime_year = str_match(crime_st_date, pattern)[,2]) %>%
mutate(crime_year = str_sub(crime_year, -2, -1))
year_crime_data <- year_crime_data_1 %>%
group_by(crime_year) %>%
summarise(crimes_per_year = n())
#print(unique(year_crime_data$crime_year))
correct_years <- c("09", "10", "11", "12", "13", "14", "15", "16")
year_crime_data <- subset(year_crime_data, crime_year %in% correct_years)
head(year_crime_data,50)
#print(sum(year_crime_data$crimes_per_year))
year_crime_data <- year_crime_data %>%
mutate(crime_year = str_pad(crime_year, 3, side = "left", pad = "0")) %>%
mutate(crime_year = str_pad(crime_year, 4, side = "left", pad = "2"))
ggplot(data = year_crime_data,
aes(x = as.numeric(crime_year), y = crimes_per_year)) +
geom_line() +
labs(title = "Crimes Committed Per Year in Cambridge", x = "Year", y = "No. of Crimes") +
theme(panel.background = element_rect(fill = "white")) +
scale_x_continuous(breaks=c(2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016))

(Note: From grouping by year and counting the number of crimes that occur in each year, I can see that there are a few ‘rogue’ years in there… I was expecting to see 2009-2016 but sometimes the years are entered as 2009 and sometimes as 09. I can sort this by subsetting the string for only the last two digits. However the other problem is that there are years that I didn’t expect like 1999 and 01. However the number of entries for these years is very small so I’ve removed these entries from the data. I still have 55127 data entries.)
From the plot of crimes over time in years, it looks like crime has overall decreased, with a dip in 2013. However the big dip for 2016 is probably unrealistic and suggests the data grab was taken part way through the year. I can check to see if the data for 2016 contains all months.
check_2016 <- subset(year_crime_data_1, crime_year %in% correct_years)
check_2016 <- check_2016 %>%
select(crime_year, crime_month) %>%
filter(crime_year == 16) %>%
group_by(crime_month) %>%
summarise(no_per_month = n())
colnames(check_2016) <- c("Month in 2016", "No. Crimes per Month")
check_2016
NA
This shows what was expected - the months only go up to 09 i.e. September. So we’d expect the number of crimes to be ~75% of the other years.
Seeing as this is going to mess up the overall trends, I am going to exclude the year 2016 from the analysis.
The next table shows the Crime and the corresponding standard deviation of the yearly incidents of each crime.
#year_crime_data_1
crimes_by_year <- year_crime_data_1 %>%
mutate(crime_year = str_pad(crime_year, 3, side = "left", pad = "0")) %>%
mutate(crime_year = str_pad(crime_year, 4, side = "left", pad = "2")) %>%
mutate(crime_year = as.numeric(crime_year)) %>%
filter(crime_year == 2009:2015) %>%
group_by(Crime, crime_year) %>%
summarise(no_per_year = n())
longer object length is not a multiple of shorter object length
#crimes_by_year
top10_sd_year_data <- crimes_by_year %>%
group_by(Crime) %>%
summarise(st_dev = sd(no_per_year)) %>%
arrange(desc(st_dev))
top10 <- top10_sd_year_data$Crime[1:10]
top_10_df <- top10_sd_year_data[1:10,]
top_10_df
crimes_by_year <- subset(crimes_by_year, Crime %in% top10)
#crimes_by_year
ggplot(data = crimes_by_year,
aes(x = crime_year, y = no_per_year)) +
facet_wrap(~Crime, nrow = 2) +
geom_line() +
labs(title = "Crimes Committed Per Year in Cambridge", x = "Year", y = "No. of Crimes") +
theme(panel.background = element_rect(fill = "white"))

NA
NA
Although these graphs show that some of the crimes are quite variable year to year, most of them don’t show an obvious downward or upward trend. The only one worth noting here is Larceny from MV which seems to be decreasing each year.
And here ends my analysis of Cambridge, MA Crime Data!
---
title: "Investigating Cambridge Crime Data"
output:
  html_notebook: default
  pdf_document: default
---
In this project I will investigate the Crime Data from the City of Cambridge from 2009 to 2016

First, I'll install the packages I might need for this analysis.

```{r}
library(readr)
library(stringr)
library(ggplot2)
library(dplyr)
library(purrr)
library(tidyr)
library(magrittr)

getwd()

```

Importing data:
```{r}
setwd("C:/Users/Ana/Desktop/Data Analytics/CSV Files")
raw_data <- read_csv("crime_data_csv.csv")

#column_names <- colnames(raw_data)
#print(column_names)
```

Here is the first 10 rows of the raw data:
  
```{r}
head(raw_data, 10)
```
The data has 7 columns: 

  - File number - a reference number for the crime
  - Date of report - when the report was made
  - Crime Date Time - The date and time of the crime
  - Reporting Area - Subset of Neighborhood
  - Neighbourhood - Neighborhood within Cambridge 
  - Location - specific location where crime took place

Next, find out:

  - How many rows of raw data are there?
  - How many different categories of crimes are there?
  - How many reporting areas are there?
  - How many neighbourhoods are there?

```{r}
count(raw_data[,1])

crime_types <- unique(raw_data$Crime)
no_of_reporting_areas <- unique(raw_data$`Reporting Area`)
neighborhoods <- unique(raw_data$Neighborhood)

print(as.data.frame(neighborhoods))

length(crime_types)
length(no_of_reporting_areas)
length(neighborhoods)
```
So there are 54 different types of crime, 118 different reporting areas, and 13 different neighborhoods (note, not 14 as one is an NA entry). 

Here are some things I will investigate with this data:

1. What are the most commonly occurring crimes in Cambridge?
2. What neighbourhoods have the highest crime rate?
3. Are the most common types of crime the same across all areas?
4. Is there any seasonal variation with crime types?
5. Is there any time variation with crime types? i.e. do certain crimes happen at certain times of the day?
6. How have crimes changed over the years? 

What is the best way to present all of this data? 

1. Bar chart (x = crime type, y = no. of occurrances)
2. Bar chart (x = neighborhood, y = no. of crimes)
3. Bar chart of the number of incidents of each crime and split the incidents into colours representing each neighborhood.
4. Line graph (x = month, y = number of incidents of each type of crime. Plot different crime types in each colour)
5. Line graph (x = time of day (in hours), y = number of incidents of each type of crime.)
6. Line graph (x = year, y = totals of each crime (plot different crimes in different colours))

I'm not going to do any analysis on 'date of report' as this is just sometime around the date of crime so I shall remove this column from the data set.

Next I will clean the data and check for any missing values. The 'Crime Date Time' column needs to be rearranged. It would be good to split this column up. Sometimes a range of dates and times are inputted into this column. Here are the first 10 entries in the existing `Crime Date Time` column:

```{r}
data_clean <- raw_data %>%
  select(-'Date of Report') 

date_time_v <- raw_data$`Crime Date Time`
head(date_time_v, 10)

```


Here are the new columns: Date Start, Time Start, Date End, Time End.

```{r}
col_names <- c("crime_st_date", "crime_st_time", "crime_end_date", "crime_end_time")

info_splitted <- as.data.frame(str_split(date_time_v, " ", simplify = TRUE))

info_splitted <- info_splitted %>%
  select(-V3) %>%
  mutate(V5 = if_else(V5 == "", V4, V5)) %>%
  mutate(V4 = if_else(V5 == V4, "", V4)) 

colnames(info_splitted) <- col_names

print(head(info_splitted, 10))

```
Now I want to add these columns back to the main dataframe. I can also now remove the original `Crime Date Time` column. Here's the dataframe as it currently stands:

```{r}
crime_data <- cbind(data_clean, info_splitted)

crime_data <- crime_data %>%
    select(-`Crime Date Time`)

print(head(crime_data, 10))
```
Next I should check for any missing data.
```{r}
logic_na <- as.logical(rowSums(is.na(crime_data)))

#print(head(logic_na, 10))

na_rows <- crime_data[logic_na,]

#na_rows

length(na_rows[,1])
```

There are 85 rows which contain NA values. Next I want to quickly see if there are certain columns which contain most of the NA values. To do this I can sum the number of NA values for each column. I can also plot a visual aid to show the number of NA values in each column. 
```{r}

no_nas <- as.data.frame(colSums(is.na(na_rows)))
colnames(no_nas) <- "No. of NA Values"
no_nas
```
The table shows that it is mostly data in the Location column that is missing. Here is a visual aid to illustrate this.
```{r}
library(purrr)

df_na <- map_df(na_rows, function(x) as.numeric(is.na(x)))
 df_na_heat <- df_na %>%
    pivot_longer(cols = everything(),
           names_to = "x") %>%
    group_by(x) %>%
    mutate(y = row_number())

plot_na_matrix <- function(df_na) {
     # Preparing the dataframe for heatmaps 
    df_heat <- df_na %>%
        pivot_longer(cols = everything(),
           names_to = "x") %>%
        group_by(x) %>%
        mutate(y = row_number())
     # Ensuring the order of columns is kept as it is
    df_heat <- df_heat %>%
        ungroup() %>%
        mutate(x = factor(x,levels = colnames(df_na)))
     # Plotting data
    g <- ggplot(data = df_heat, aes(x=x, y=y, fill=value)) + 
        geom_tile() + 
        theme(legend.position = "none",
              axis.title.y=element_blank(),
              axis.text.y =element_blank(),
              axis.ticks.y=element_blank(),
              axis.title.x=element_blank(),
              axis.text.x = element_text(angle = 90, hjust = 1))
     # Returning the plot
    g
 } 
 
plot_na_matrix(df_na)
```
As we could already see before, most of the missing values are in the Location column. There are also 2 missing values in the 'Reporting Area' and 'Neighborhood' columns but for both these instances there IS an entry in the Location column. 

There isn't any way of finding the exact location data from the rest of the data. There is probably a way of getting the Reporting Area and Neighborhood from the Location data however 2 out of 56000+ rows of data is not significant so the best thing to do is to probably remove the rows of data which don't have the Reporting Area and Neighborhood.
There's also an entry for "admin error" under Crime so I am going to disregard these rows.
```{r}
#length(crime_data[,1]) # See how many rows there were before

crime_data <- crime_data %>%
  drop_na(`Reporting Area`) %>%
  filter(Crime != "Admin Error")

#length(crime_data[,1]) # See how many rows there are after dropping the NA values from 'Reporting Area' and removing entries for "Admin Error" in the Crime column

```
Now I have cleaned the data and removed the rows containing NA values, I can start to analyse the data.

Q1: what is the overall crime profile of Cambridge?
```{r}
crime_data_grouped <- crime_data %>%
  group_by(Crime) %>%
  summarise(no_of_incidents = n()) %>%
  arrange(desc(no_of_incidents))

crime_data_grouped

ggplot(data = crime_data_grouped,
  aes(x = Crime, y = no_of_incidents, fill = 'blue')) + 
  theme(axis.text.x = element_text(angle=60, hjust=1)) +
  scale_x_discrete(limits=crime_data_grouped$Crime) +
  geom_bar(stat='identity')+
  labs(title = "No. of Incidents (all years) vs Crime Type", x = "Crime Type", y = "No. of Incidents") +
  theme(panel.background = element_rect(fill = "white"))
```
Q2:Which Neighborhoods have the highest crime rate?

```{r}
neighborhood_data_grouped <- crime_data %>%
  group_by(Neighborhood) %>%
  summarise(no_of_incidents = n()) %>%
  arrange(desc(no_of_incidents))

neighborhood_data_grouped

ggplot(data = neighborhood_data_grouped,
  aes(x = Neighborhood, y = no_of_incidents, fill = Neighborhood)) + 
  theme(axis.text.x = element_text(angle=60, hjust=1)) +
  scale_x_discrete(limits=neighborhood_data_grouped$Neighborhood) +
  geom_bar(stat='identity') +
  labs(title = "Crime Rate vs Neighborhood", x = "Neighborhood", y = "No. of Incidents (all crimes)") +
  theme(panel.background = element_rect(fill = "white"))
```
I should really find out the population of each of these neighborhoods as that would give a better idea of the crime rate. 

Q3: Are the most common types of crime the same across all areas? 

I will create a bar chart of the number of incidents of each crime and split the incidents into colours representing each neighborhood.

```{r}
nc_data_grouped <- crime_data %>%
  group_by(Neighborhood, Crime) %>%
  summarise(no_of_incidents = n())

#nc_data_grouped

ggplot(data = nc_data_grouped,
  aes(x = Crime, y = no_of_incidents, fill = Neighborhood)) + 
  theme(axis.text.x = element_text(angle=60, hjust=1)) +
  geom_bar(stat='identity') +
  labs(title = "No. of Crime incidents per Crime Type", x = "Crime Type", y = "No. of Incidents") +
  theme(panel.background = element_rect(fill = "white"))
```
From this plot, it looks like most crimes occur across all the neighborhoods. i.e. there is no crime which is more one colour than another. There's probably some more in depth analysis possible, but for now, we'll just consider that the crime profiles are pretty similar across the different neighborhoods.  

Q4: Is there any seasonal variation with Crime types?

For this, it would be good to plot a line graph with the months on the x-axis and the total number of crimes in each month on the y-axis. However, the crime_st_date data needs to be manipulated to extract the month. 
Use regex to extract the month in digits and add a column for 'crime_month'

Here's the dataframe again with the new column 'crime_month' added:

```{r}
#head(crime_data, 20)

pattern = "\\b(.+)\\/.+\\/"
month <- str_match(crime_data$crime_st_date, pattern)[,2]
#head(month)

crime_data <- crime_data %>%
  mutate(crime_month = month) %>%
  mutate(crime_month = str_pad(crime_month, 2, side = "left", pad = "0"))

#print(unique(crime_data$crime_month)) # checking what months I get - it should be 01 - 12

head(crime_data,10)
```
Now I have a column with the months. I now want to plot the frequency of each type of crime for each month and then plot line graphs to see if there's any seasonal variation.
```{r}
#head(crime_data, 20)


total_seasonal_data <- crime_data %>%
  group_by(crime_month) %>%
  summarise(monthly_total = n()) %>%
  mutate(crime_month = as.integer(crime_month)) #%>%
  #mutate(crime_month = month.abb[crime_month])

ggplot(data = total_seasonal_data,
       aes(x = crime_month, y = monthly_total, fill = "blue")) +
       theme(axis.text.x = element_text(angle=60, hjust=1)) +
       geom_bar(stat = 'identity') +
       labs(title = "All Crime Rate Vs Month", x = "Month", y = "No. of Crimes") +
       theme(panel.background = element_rect(fill = "white")) + 
       scale_x_discrete(limits = month.abb)

seasonal_data <- crime_data %>%
  group_by(Crime, crime_month) %>%
  summarise(monthly_totals = n()) %>%
  mutate(crime_month = as.integer(crime_month)) 

ggplot(data = seasonal_data,
  aes(x = crime_month, y = monthly_totals, color = Crime)) +
  geom_line() +
  labs(title = "Crime Rate Vs Month", x = "Month", y = "No. of Crimes") +
  theme(panel.background = element_rect(fill = "white")) +
  scale_x_discrete(limits=month.abb) +
  theme(axis.text.x = element_text(angle=90, hjust=1)) 



```
The bar graph showing all crimes over the course of the year shows clearly that the number of crimes increases across the summer months and decreases in the winter months, with a small increase in crimes in January. 

It would be really interesting to see whether this overall trend is driven by specific crimes or whether all crimes follow this general trend. 

However, when I plotted all the crimes separately on a line graph then there were too many lines to work out what was going on. I think it would be best to plot them all on separate graphs, but there's a lot. I can tell that some of the crimes show significant variation in incidence across the year. 

I think it would be good to plot only the graphs that are interesting i.e. there is some significant variation in incidence level accross the year. To do this, I will look at the variation of the data within each crime. 

The next table shows the Crime and the corresponding standard deviation of the monthly incidents of each crime.
```{r}
sd_seasonal_data <- crime_data %>%
  group_by(Crime, crime_month) %>%
  summarise(monthly_totals = n()) %>%
  group_by(Crime) %>%
  summarise(st_dev = sd(monthly_totals)) %>%
  arrange(desc(st_dev))

head(sd_seasonal_data, 20)
```
This confirms what I could see from the data. Next, I am going to plot graphs of crimes which show signficant variation in incidence level across the year. I'll plot the 10 crimes with the greatest standard deviation. 

```{r}
high_sd_crimes <- sd_seasonal_data$Crime[1:10]

#length(sd_seasonal_data$Crime)

#high_sd_crimes

high_sd_seasonal_data <- subset(seasonal_data, Crime %in% high_sd_crimes)

ggplot(data = high_sd_seasonal_data,
  aes(x = crime_month, y = monthly_totals)) +
  facet_wrap(~Crime, nrow = 2) +
  geom_line() +
  labs(title = "Top 10 Most Season-Dependant Crimes: Crime Rate Vs Month", x = "Month", y = "No. of Crimes") +
  theme(panel.background = element_rect(fill = "white")) +
  scale_x_discrete(limits=month.abb) +
  theme(axis.text.x = element_text(angle=90, hjust=1)) 
```
In summary - there is seasonal variation with total numbers of crimes committed. This overall variation is caused by a number of crimes which have a distict summer preference, most notably, larceny of bicycle. This particular crime sways the overall data so much because there are a lot of incidents of this crime. Interestingly, Domestic Disputes seem to occur more in the early summer, peaking in June. I don't know why this is. Hit and Runs seem to peak in January and February. Would be interesting to find out why but don't think this data will tell us that! 

There are 53 crime types in total. Let's also see which have the lowest standard deviation.

```{r}
print(low_sd_crimes <- sd_seasonal_data$Crime[44:53])
```
These crimes seem to occur throughout the year with very little variation between months.

Q5: Is there any time variation with crime type?

For this, I need to create a new column with the time grouped in hours. I can just take the first 2 digits from the crime_st_time column.I'm going to call this new column 'crime_hour'


```{r}
pattern <- "(\\d+):"

crime_data_hours_1 <- crime_data %>%
  mutate(crime_hour = str_match(crime_st_time, pattern)[,2]) %>%
  mutate(crime_hour = str_pad(crime_hour, 2, side = "left", pad = "0")) 

crime_data_hours <- crime_data_hours_1 %>%
  group_by(Crime, crime_hour) %>%
  summarise(no_crimes_per_hour = n()) %>%
  mutate(crime_hour = as.integer(crime_hour)) %>%
  filter(crime_hour != is.na(crime_hour))

#print(unique(crime_data_hours$crime_hour))

head(crime_data_hours, 20)

all_crime_data_hours <- crime_data_hours %>%
  group_by(crime_hour) %>%
  summarise(total_crimes_per_hour = sum(no_crimes_per_hour)) %>%
  filter(crime_hour != is.na(crime_hour))

ggplot(data = all_crime_data_hours,
       aes(x = crime_hour, y = total_crimes_per_hour)) +
      geom_line() +
  labs(title = "Crime Rate Vs Time in 24 hours", x = "Time (hours)", y = "No. of Crimes") +
  #theme(panel.background = element_rect(fill = "white")) +
  scale_x_continuous(breaks=c(0:23))

ggplot(data = crime_data_hours,
       aes(x = crime_hour, y = no_crimes_per_hour)) +
      facet_wrap(~Crime, nrow = 5) +
      geom_line() +
      labs(title = "Crime Rate Vs Time in 24 hours", x = "Time (hours)", y = "No. of Crimes") +
      #theme(panel.background = element_rect(fill = "white"))+
      scale_x_continuous(breaks=c(0,4,8,12,16,20))


```
The plot of overall crimes seems to show that the peak time for crimes (of any sort) to be committed is 6pm. The least likely time for a crime (any sort) to be committed is 4am-5am. There is also another peak crime time which is midday. This seems unusual and may be worth further investigation. Could it be that if no time is entered, the default is midday? Or is it that midday day is usually popular with criminals? Maybe looking at the individual crimes will shed more light on this. 

From the individual crime plots - once again, some crimes appear to have no or very little relation to time of day, however with some others it is quite marked.

Again, calculate the standard deviation of each crime and plot the crimes which show the most variation (top 10).

```{r}
crime_data_hours_sd <- crime_data_hours %>%
  group_by(Crime) %>%
  summarise(st_dev = sd(no_crimes_per_hour)) %>%
  arrange(desc(st_dev))

crime_data_hours_sd

crime_data_hrs_sd_top10 <- crime_data_hours_sd$Crime[1:10]

#crime_data_hrs_sd_top10

top10_sd_time_data <- subset(crime_data_hours, Crime %in% crime_data_hrs_sd_top10)

ggplot(data = top10_sd_time_data,
  aes(x = crime_hour, y = no_crimes_per_hour)) +
  facet_wrap(~Crime, nrow = 2) +
  geom_line() +
  labs(title = "Top 10 most Time-Dependant Crimes: Crime Rate Vs Time in 24 hours", x = "Time", y = "No. of Crimes") +
  theme(panel.background = element_rect(fill = "white")) +
   scale_x_continuous(breaks=c(0, 4, 8, 12, 16, 20))


```
You can see from these plots that certain crimes are more likely to happen at certain times of the day. Domestic disputes are most likely to occur in the evening, peaking at 8pm. Car theft is most likely to happen in the evening as well, highest between 6pm and 8pm. Shoplifting unsurprising occurs during most shop opening hours and peaks between 3pm and 6pm. The graph for forgery is interesting. There seems to be a very large peak at midday. It seems unlikely that criminals have a preference for carrying out forgery at midday. This -could- be due to the fact that forgery is reported after the actual crime is commited and people generally don't know when it was committed and therefore the time is recorded as default midday. Some qualitative analysis/questionnaires would perhaps be able to answer this question. Quite a few of the other crimes have peaks at midday as well. 

One way of potentially discovering why is to look back at the original data for time. Time data is given in HH:MM. i.e 12:35, 15:15. If there is a default setting for midday, that would mean that many of the crimes would be entered as 12:00 instead of, say 12:10, 12:40 

Try this: for each hour, count the total number of entires and also count the number of entires which are of the format HH:00.
```{r}
pattern = ":00"

crime_data_midday <- crime_data_hours_1 %>%
  mutate(logic_midday = str_detect(crime_st_time, pattern)) %>%
  group_by(crime_hour, logic_midday) %>%
  summarise(no_1200 = n()) %>%
  pivot_wider(names_from = logic_midday, values_from = no_1200) %>%
  mutate(percent_00 = `TRUE`/`FALSE`*100) %>%
  arrange(desc = -percent_00)

crime_data_midday
```
Interestingly the highest proportion of entires for exactly on the hours is for 00:00 i.e. midnight. Next is 12:00 and then 08:00.

Even though 95% of entries for 00:00-01:00 are recorded as 00:00 exactly, from looking at the graphs, there doesn't seem to be an unusual peak. This could suggest that people record crimes at exactly midnight but they largely do occur around midnight. 

For the midday entries, 78% are recorded as exactly midday, but there seems to be a peak purely for this hour on the graphs.

This does support the theory that these entries could be due to default time settings but I'd need to do some qual research to check. 

Q6: How have crimes changed over the years? 
For this, I want to see if any crimes have any particular trend across the years that this data has been collected. The data covers the years 2009 - 2016. First, I'll plot the total crime trend across the years, then I'll look at individual crimes, and again plot those crimes with the biggest variability in data across the years. 

```{r}
#head(crime_data)

pattern = "/\\d+/(\\d+)\\b"

dates <- crime_data$crime_st_date

#head(str_match(dates, pattern))

year_crime_data_1 <- crime_data %>%
  mutate(crime_year = str_match(crime_st_date, pattern)[,2]) %>%
  mutate(crime_year = str_sub(crime_year, -2, -1)) 

year_crime_data <- year_crime_data_1 %>%
  group_by(crime_year) %>%
  summarise(crimes_per_year = n()) 

#print(unique(year_crime_data$crime_year))
correct_years <- c("09", "10", "11", "12", "13", "14", "15", "16")

year_crime_data <- subset(year_crime_data, crime_year %in% correct_years)

head(year_crime_data,50)

#print(sum(year_crime_data$crimes_per_year))

year_crime_data <- year_crime_data %>%
  mutate(crime_year = str_pad(crime_year, 3, side = "left", pad = "0")) %>%
  mutate(crime_year = str_pad(crime_year, 4, side = "left", pad = "2")) 

ggplot(data = year_crime_data,
       aes(x = as.numeric(crime_year), y = crimes_per_year)) +
       geom_line() +
       labs(title = "Crimes Committed Per Year in Cambridge", x = "Year", y = "No. of Crimes") +
       theme(panel.background = element_rect(fill = "white")) +
   scale_x_continuous(breaks=c(2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016))

```
(Note: From grouping by year and counting the number of crimes that occur in each year, I can see that there are a few 'rogue' years in there... I was expecting to see 2009-2016 but sometimes the years are entered as 2009 and sometimes as 09. I can sort this by subsetting the string for only the last two digits. However the other problem is that there are years that I didn't expect like 1999 and 01. However the number of entries for these years is very small so I've removed these entries from the data. I still have 55127 data entries.) 

From the plot of crimes over time in years, it looks like crime has overall decreased, with a dip in 2013. However the big dip for 2016 is probably unrealistic and suggests the data grab was taken part way through the year. I can check to see if the data for 2016 contains all months. 

```{r}
check_2016 <- subset(year_crime_data_1, crime_year %in% correct_years)

check_2016 <- check_2016 %>%
  select(crime_year, crime_month) %>%
  filter(crime_year == 16) %>%
  group_by(crime_month) %>%
  summarise(no_per_month = n())

colnames(check_2016) <- c("Month in 2016", "No. Crimes per Month")
  
check_2016
  
```
This shows what was expected - the months only go up to 09 i.e. September. So we'd expect the number of crimes to be ~75% of the other years. 

Seeing as this is going to mess up the overall trends, I am going to exclude the year 2016 from the analysis.

The next table shows the Crime and the corresponding standard deviation of the yearly incidents of each crime.
```{r}
#year_crime_data_1

crimes_by_year <- year_crime_data_1 %>%
  mutate(crime_year = str_pad(crime_year, 3, side = "left", pad = "0")) %>%
  mutate(crime_year = str_pad(crime_year, 4, side = "left", pad = "2")) %>%
  mutate(crime_year = as.numeric(crime_year)) %>%
  filter(crime_year == 2009:2015) %>%
  group_by(Crime, crime_year) %>%
  summarise(no_per_year = n()) 

#crimes_by_year

top10_sd_year_data <- crimes_by_year %>%
  group_by(Crime) %>%
  summarise(st_dev = sd(no_per_year)) %>%
  arrange(desc(st_dev))

top10 <- top10_sd_year_data$Crime[1:10]

top_10_df <- top10_sd_year_data[1:10,]
top_10_df

crimes_by_year <- subset(crimes_by_year, Crime %in% top10)

#crimes_by_year

ggplot(data = crimes_by_year,
       aes(x = crime_year, y = no_per_year)) +
       facet_wrap(~Crime, nrow = 2) +
       geom_line() +
       labs(title = "Crimes Committed Per Year in Cambridge", x = "Year", y = "No. of Crimes") +
       theme(panel.background = element_rect(fill = "white"))

  
```
Although these graphs show that some of the crimes are quite variable year to year, most of them don't show an obvious downward or upward trend. The only one worth noting here is Larceny from MV which seems to be decreasing each year. 

And here ends my analysis of Cambridge, MA Crime Data! 




