Crime Type Trends in the Past Five Years in the DMV Area
Author
Karen Pesca
1. Introduction
I chose the “Realtimecrimeindex_sample” dataset from the course's data source to explore crime trends in the DMV area. It includes real-time data (divided by year and month), state, cities, and crime types such as murder, rape, robbery, aggravated assault, burglary, theft, and motor vehicle theft. Murder is the intentional killing of another person, excluding deaths from negligence or accidents. Rape involves non-consensual penetration, while robbery is taking property by force or threat. Aggravated assault is an unlawful attack meant to cause severe injury, often with a weapon. Burglary is unlawfully entering a structure to commit a felony or theft, and theft is taking property without force. Motor vehicle theft refers to stealing vehicles, excluding things like airplanes or farming equipment. Because the dataset covers crime patterns across the whole country, I narrow the analysis to one specific area.
I believe crime data is an essential tool for understanding public safety trends, identifying high-risk areas, and shaping effective policies. This analysis uses real-time crime data from the Real-Time Crime Index web page, a platform that aggregates and visualizes crime reports across the United States, collected by agencies such as police departments and the FBI.
Source: https://realtimecrimeindex.com/
2. Loading data, libraries and cleaning.
First I set my working directory. Then I load my dataset and remove any spaces.
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
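The cleaning step mentioned above (removing spaces) can be sketched roughly as follows. This is only an illustration on a toy data frame: the raw column names and the helper `clean_names()` are assumptions, not the code used for this report.

```r
# Toy data frame standing in for the raw CSV.
# The raw column names below are hypothetical.
raw <- data.frame(
  check.names = FALSE,
  "Month" = c(1, 2),
  "Agency_State" = c("Alexandria, VA", "Alexandria, VA"),
  "Aggravated Assault" = c(12, 9),
  "Motor Vehicle Theft" = c(20, 17)
)

# Hypothetical helper: lower-case the names and strip all whitespace.
clean_names <- function(df) {
  names(df) <- tolower(gsub("\\s+", "", names(df)))
  df
}

crimeindex <- clean_names(raw)
print(names(crimeindex))
```

Keeping the cleaning in a small function makes it easy to apply the same rule again if the file is re-downloaded.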
3. Filter and Select the Variables and Columns for Analysis
In this case, I focus on the DMV area, which includes Maryland, Virginia, and the District of Columbia (Washington). I also removed the full-sample rows to avoid double counting.
# A tibble: 6 × 15
  month  year date   agency     state agency_state   murder  rape robbery
  <dbl> <dbl> <chr>  <chr>      <chr> <chr>           <dbl> <dbl>   <dbl>
1     1  2018 Jan-18 Alexandria VA    Alexandria, VA      0     0       8
2     2  2018 Feb-18 Alexandria VA    Alexandria, VA      0     0       1
3     3  2018 Mar-18 Alexandria VA    Alexandria, VA      1     2       5
4     4  2018 Apr-18 Alexandria VA    Alexandria, VA      0     2      10
5     5  2018 May-18 Alexandria VA    Alexandria, VA      0     1       4
6     6  2018 Jun-18 Alexandria VA    Alexandria, VA      0     4       8
# ℹ 6 more variables: aggravatedassault <dbl>, burglary <dbl>, theft <dbl>,
#   motorvehicletheft <dbl>, violentcrime <dbl>, propertycrime <dbl>
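The filtering step described above could look something like the sketch below. It runs on a toy tibble; the label "Full Sample" for the aggregate row and its state value are assumptions made for illustration, so the real labels should be checked in the dataset.

```r
library(dplyr)

# Toy rows; "Full Sample" is a hypothetical label for an aggregate row.
crimeindex <- tibble(
  agency = c("Alexandria", "Baltimore", "Washington", "Chicago", "Full Sample"),
  state  = c("VA", "MD", "DC", "IL", "All"),
  murder = c(0, 25, 14, 40, 79)
)

dmv_states <- c("MD", "VA", "DC")

# Keep DMV agencies only and drop the aggregate row explicitly,
# in case aggregate rows carry a real state code in the actual data.
crimeindex_DMV <- crimeindex %>%
  filter(state %in% dmv_states, agency != "Full Sample")

print(crimeindex_DMV)
```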
4. Correlation and Multiple Linear Regression
To analyze the correlation, I used a matrix of scatterplots, histograms, and correlation plots (ggpairs function) for multiple variables in my dataset. This matrix allowed me to visually assess the relationships between different crime types. The scatterplots show how pairs of variables are related, while the histograms offer insights into the distribution of each variable. The correlation values displayed in the upper triangle of the matrix reveal the strength and direction of the relationships between crime types. This approach helped me identify patterns and correlations, providing a better understanding of how different crimes might be interconnected. Based on this, I will create my multiple linear models.
library(GGally)
Registered S3 method overwritten by 'GGally':
method from
+.gg ggplot2
ggpairs(crimeindex_DMV, columns = 7:15)
I confirm that murders present the highest correlation with the other variables, which means that changes in other crime types, such as robbery, aggravated assault, and theft, are strongly associated with changes in the murder rate. This indicates that these crimes might be interconnected and can help predict or explain trends in murders within the DMV area.
Multiple Linear Regression
For this section, I will run some multiple linear models and check for a good adjusted R^2 and significant p-values.
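The lm1 model shows R^2 = 1, indicating a perfect fit of the model to the data. A fit this exact is a warning sign rather than a good result: it likely means that one of the predictors (for example an aggregate column such as violentcrime) already contains the murder counts, and it overlooks many other factors that explain the variation in the murder rate in the DMV area. I will therefore remove some variables.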
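One way an R^2 of exactly 1 can arise is sketched below on simulated data. The sketch assumes that violentcrime is the sum of murder, rape, robbery, and aggravated assault; that definition, the simulated counts, and the helper `r_squared()` are illustrative assumptions, not the lm1 model from this report.

```r
set.seed(1)
n <- 60

# Simulated monthly counts (not real data).
toy <- data.frame(
  murder            = rpois(n, 3),
  rape              = rpois(n, 5),
  robbery           = rpois(n, 20),
  aggravatedassault = rpois(n, 30)
)
# Assumed definition of the aggregate column.
toy$violentcrime <- with(toy, murder + rape + robbery + aggravatedassault)

# Hypothetical helper: R^2 computed directly from the residuals.
r_squared <- function(fit) {
  y <- model.response(model.frame(fit))
  1 - sum(residuals(fit)^2) / sum((y - mean(y))^2)
}

# The aggregate plus its other components reproduces murder exactly.
fit_all <- lm(murder ~ violentcrime + rape + robbery + aggravatedassault,
              data = toy)
# Dropping the aggregate removes the built-in identity.
fit_reduced <- lm(murder ~ robbery + aggravatedassault, data = toy)

print(r_squared(fit_all))
print(r_squared(fit_reduced))
```

In the simulated data the crime types are independent, so the reduced model explains very little; with real data the reduced R^2 would reflect genuine association instead of an arithmetic identity.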
If robbery, aggravated assault, and theft were zero, the model predicts a negative murder rate, which is unrealistic. This negative intercept may reflect that other factors not included in the model (such as drugs, socioeconomic status, or policing) also influence murder rates.
Robbery (0.0280):
For each additional robbery per 100,000, the murder rate is predicted to increase by 0.0280 murders, holding the other predictors constant.
Aggravated Assault (0.0186):
Each additional aggravated assault per 100,000 increases the predicted murder rate by 0.0186 murders, holding the other predictors constant.
Theft (0.0048):
Each additional theft per 100,000 increases the predicted murder rate by 0.0048 murders, holding the other predictors constant. While this coefficient is smaller, it still suggests that theft is associated with higher murder rates, although the effect per incident is weaker than for robbery and aggravated assault.
All predictors have very small p-values (<0.05), indicating they are statistically significant.
On the other hand, it is important to note that correlation between these variables does not necessarily imply causation.
Interpretation of the Diagnostic Plots:
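The way the coefficients and p-values above are read from a fitted model can be sketched as follows. The data are simulated from the three slopes quoted in the text with an arbitrary negative intercept and noise level, so the printed numbers are illustrative and are not the results of this report.

```r
set.seed(42)
n <- 200

# Simulated predictors (not real data).
toy <- data.frame(
  robbery           = rpois(n, 40),
  aggravatedassault = rpois(n, 60),
  theft             = rpois(n, 300)
)
# Response built from the quoted slopes; the intercept (-1) and the
# noise level are arbitrary assumptions.
toy$murder <- -1 + 0.0280 * toy$robbery + 0.0186 * toy$aggravatedassault +
  0.0048 * toy$theft + rnorm(n, sd = 0.5)

fit <- lm(murder ~ robbery + aggravatedassault + theft, data = toy)

# Estimate, standard error, t value, and p-value for every term.
coefs <- summary(fit)$coefficients
print(round(coefs, 4))

# Predicted change in murder for 10 extra robberies,
# with the other predictors held fixed.
delta <- 10 * coef(fit)[["robbery"]]
print(delta)
```

Each slope is a partial effect: it describes the change in the prediction when only that predictor moves.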
During my analysis, I focused on the Residuals vs. Fitted plot to identify points that might disproportionately affect my model. Points with large residuals, especially those that also have high leverage, could be outliers or influential observations, meaning they might be distorting the results. Based on this, I made adjustments to my final model to ensure a more accurate analysis.
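A numeric companion to the diagnostic plots is sketched below on simulated data with one planted unusual observation. The cutoff of 4/n for Cook's distance is a common rule of thumb used here as an assumption; it is not necessarily the criterion applied in this report.

```r
set.seed(7)
n <- 100

# Simulated data (not real data).
toy <- data.frame(
  robbery           = rpois(n, 40),
  aggravatedassault = rpois(n, 60)
)
toy$murder <- 0.03 * toy$robbery + 0.02 * toy$aggravatedassault +
  rnorm(n, sd = 0.5)

# Plant one unusual observation: extreme robbery, murder far off the trend.
toy$robbery[1] <- 120
toy$murder[1]  <- 0

fit <- lm(murder ~ robbery + aggravatedassault, data = toy)

# Quantities behind the standard lm diagnostic plots.
diag_tbl <- data.frame(
  fitted   = fitted(fit),
  residual = residuals(fit),
  leverage = hatvalues(fit),
  cooks_d  = cooks.distance(fit)
)

# Flag observations whose Cook's distance exceeds the 4/n rule of thumb.
flagged <- which(diag_tbl$cooks_d > 4 / n)
print(diag_tbl[flagged, ])
```

Calling `plot(fit, which = 1)` draws the Residuals vs. Fitted plot for the same model, and `plot(fit, which = 5)` draws Residuals vs. Leverage.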
5. Visualization
Organizing the Dataset from Wide to Long Format
To organize the dataset from wide to long format, I restructured the data so that multiple crime types, previously in separate columns, are now grouped into a single variable along with the year. This transformation simplifies the analysis by making it easier to compare trends across different crime types over time. I also filtered the data to the last five years (2019-2024).
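The wide-to-long step described above can be sketched roughly like this. The toy rows and the exact pipeline are assumptions; the object name crime_sum and its columns (year, crime_type, count) follow the plotting code in this report.

```r
library(dplyr)
library(tidyr)

# Toy wide data: one column per crime type (values are made up).
crimeindex_DMV <- tibble(
  year    = c(2018, 2019, 2019, 2020, 2020),
  agency  = c("Alexandria", "Alexandria", "Baltimore", "Alexandria", "Baltimore"),
  murder  = c(1, 0, 25, 2, 30),
  robbery = c(8, 5, 300, 4, 280),
  theft   = c(150, 140, 900, 120, 850)
)

crime_sum <- crimeindex_DMV %>%
  filter(year >= 2019, year <= 2024) %>%          # keep the study window
  pivot_longer(
    cols      = c(murder, robbery, theft),        # crime columns to stack
    names_to  = "crime_type",
    values_to = "count"
  ) %>%
  group_by(year, crime_type) %>%
  summarise(count = sum(count), .groups = "drop") # yearly total per type

print(crime_sum)
```

The result has one row per year and crime type, which is the shape that both ggplot2 and Highcharts series expect.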
ggplot(crime_sum, aes(x = year, y = crime_type, fill = count)) +
  geom_tile() +
  scale_fill_gradient(low = "lightskyblue2", high = "indianred2") +
  labs(title = "Crime Trends in DMV (2019-2024)",
       x = "Year",
       y = "Crime Type",
       fill = "Crime Count",
       caption = "Source: crimeindex_DMV dataset") +
  theme_light() +
  theme(legend.position = "right", plot.title = element_text(hjust = 0.5))
Create a Highchart
For my first project, I would like to try the Highcharts graph. I will make some modifications to see what looks best for presenting my project. I plan to use line, area, and bar graphs.
Line Trends
library(highcharter)
Registered S3 method overwritten by 'quantmod':
method from
as.zoo.data.frame zoo
Highcharts (www.highcharts.com) is a Highsoft software product which is
not free for commercial and Governmental use
highchart() %>%
  hc_chart(type = "line") %>%
  hc_title(text = "Crime Type Trends in the Past Five Years in the DMV Area") %>%
  hc_xAxis(categories = unique(crime_sum$year)) %>%
  hc_yAxis(title = list(text = "Crime Count")) %>%
  hc_add_series(data = crime_sum %>% filter(crime_type == "murder") %>% select(year, count),
                type = "line", name = "Murder", hcaes(x = year, y = count)) %>%
  hc_add_series(data = crime_sum %>% filter(crime_type == "robbery") %>% select(year, count),
                type = "line", name = "Robbery", hcaes(x = year, y = count)) %>%
  hc_add_series(data = crime_sum %>% filter(crime_type == "rape") %>% select(year, count),
                type = "line", name = "Rape", hcaes(x = year, y = count)) %>%
  hc_add_series(data = crime_sum %>% filter(crime_type == "aggravatedassault") %>% select(year, count),
                type = "line", name = "Aggravated Assault", hcaes(x = year, y = count)) %>%
  hc_add_series(data = crime_sum %>% filter(crime_type == "burglary") %>% select(year, count),
                type = "line", name = "Burglary", hcaes(x = year, y = count)) %>%
  hc_add_series(data = crime_sum %>% filter(crime_type == "theft") %>% select(year, count),
                type = "line", name = "Theft", hcaes(x = year, y = count)) %>%
  hc_add_series(data = crime_sum %>% filter(crime_type == "propertycrime") %>% select(year, count),
                type = "line", name = "Property Crime", hcaes(x = year, y = count))
Highcharts Bar Graph
highchart() %>%
  hc_chart(type = "column") %>%
  hc_title(text = "Crime Type Trends in the Past Five Years in the DMV Area") %>%
  hc_xAxis(categories = unique(crime_sum$year)) %>%
  hc_yAxis(title = list(text = "Crime Count")) %>%
  hc_add_series(data = crime_sum %>% filter(crime_type == "murder") %>% select(year, count),
                type = "column", name = "Murder", hcaes(x = year, y = count)) %>%
  hc_add_series(data = crime_sum %>% filter(crime_type == "robbery") %>% select(year, count),
                type = "column", name = "Robbery", hcaes(x = year, y = count)) %>%
  hc_add_series(data = crime_sum %>% filter(crime_type == "rape") %>% select(year, count),
                type = "column", name = "Rape", hcaes(x = year, y = count)) %>%
  hc_add_series(data = crime_sum %>% filter(crime_type == "aggravatedassault") %>% select(year, count),
                type = "column", name = "Aggravated Assault", hcaes(x = year, y = count)) %>%
  hc_add_series(data = crime_sum %>% filter(crime_type == "burglary") %>% select(year, count),
                type = "column", name = "Burglary", hcaes(x = year, y = count)) %>%
  hc_add_series(data = crime_sum %>% filter(crime_type == "theft") %>% select(year, count),
                type = "column", name = "Theft", hcaes(x = year, y = count)) %>%
  hc_add_series(data = crime_sum %>% filter(crime_type == "propertycrime") %>% select(year, count),
                type = "column", name = "Property Crime", hcaes(x = year, y = count))
Highchart area trends
highchart() %>%
  hc_chart(type = "area") %>%
  hc_title(text = "Crime Type Trends in the Past Five Years in the DMV Area") %>%
  hc_xAxis(categories = unique(crime_sum$year)) %>%
  hc_yAxis(title = list(text = "Crime Count")) %>%
  hc_add_series(data = crime_sum %>% filter(crime_type == "murder") %>% select(year, count),
                type = "area", name = "Murder", hcaes(x = year, y = count)) %>%
  hc_add_series(data = crime_sum %>% filter(crime_type == "robbery") %>% select(year, count),
                type = "area", name = "Robbery", hcaes(x = year, y = count)) %>%
  hc_add_series(data = crime_sum %>% filter(crime_type == "rape") %>% select(year, count),
                type = "area", name = "Rape", hcaes(x = year, y = count)) %>%
  hc_add_series(data = crime_sum %>% filter(crime_type == "aggravatedassault") %>% select(year, count),
                type = "area", name = "Aggravated Assault", hcaes(x = year, y = count)) %>%
  hc_add_series(data = crime_sum %>% filter(crime_type == "burglary") %>% select(year, count),
                type = "area", name = "Burglary", hcaes(x = year, y = count)) %>%
  hc_add_series(data = crime_sum %>% filter(crime_type == "theft") %>% select(year, count),
                type = "area", name = "Theft", hcaes(x = year, y = count)) %>%
  hc_add_series(data = crime_sum %>% filter(crime_type == "propertycrime") %>% select(year, count),
                type = "area", name = "Property Crime", hcaes(x = year, y = count))
Highchart stacked area trends
library(highcharter)
library(RColorBrewer)
cols <- brewer.pal(7, "Set2")
# Create an area chart
highchart() %>%
  hc_add_series(data = crime_sum %>% filter(crime_type == "murder") %>% select(year, count),
                type = "area", hcaes(x = year, y = count), name = "Murder") %>%
  hc_add_series(data = crime_sum %>% filter(crime_type == "robbery") %>% select(year, count),
                type = "area", hcaes(x = year, y = count), name = "Robbery") %>%
  hc_add_series(data = crime_sum %>% filter(crime_type == "rape") %>% select(year, count),
                type = "area", hcaes(x = year, y = count), name = "Rape") %>%
  hc_add_series(data = crime_sum %>% filter(crime_type == "aggravatedassault") %>% select(year, count),
                type = "area", hcaes(x = year, y = count), name = "Aggravated Assault") %>%
  hc_add_series(data = crime_sum %>% filter(crime_type == "burglary") %>% select(year, count),
                type = "area", hcaes(x = year, y = count), name = "Burglary") %>%
  hc_add_series(data = crime_sum %>% filter(crime_type == "theft") %>% select(year, count),
                type = "area", hcaes(x = year, y = count), name = "Theft") %>%
  hc_add_series(data = crime_sum %>% filter(crime_type == "propertycrime") %>% select(year, count),
                type = "area", hcaes(x = year, y = count), name = "Property Crime") %>%
  hc_colors(cols) %>%
  hc_chart(style = list(fontFamily = "Georgia", fontWeight = "bold")) %>%
  hc_plotOptions(series = list(stacking = "normal",
                               marker = list(enabled = FALSE,
                                             states = list(hover = list(enabled = FALSE))),
                               lineWidth = 0.5,
                               lineColor = "white")) %>%
  hc_xAxis(title = list(text = "Year")) %>%
  hc_yAxis(title = list(text = "Crime Count")) %>%
  hc_legend(align = "right", verticalAlign = "top", layout = "vertical")
Final visualization
Highchart line trends
Finally, I created a line chart using the Highcharts library to visualize crime trends in the DMV area over the past five years. I focused on seven crime categories: murder, robbery, rape, aggravated assault, burglary, theft, and property crime. For each category, I filtered the yearly summary to that crime type and added its counts as a separate line on the chart. The x-axis represents the years, while the y-axis shows the crime count. I used colors from the “Set1” palette to differentiate the lines and added a credit to the chart that links to the Real-Time Crime Index website. This setup allows me to easily compare crime trends over time for different crime types.
cols <- brewer.pal(7, "Set1")
highchart() %>%
  hc_chart(type = "line") %>%
  hc_title(text = "Crime Type Trends in the Past Five Years in the DMV Area") %>%
  hc_xAxis(categories = unique(crime_sum$year)) %>%
  hc_yAxis(title = list(text = "Crime Count")) %>%
  hc_add_series(data = crime_sum %>% filter(crime_type == "murder") %>% select(year, count),
                type = "line", name = "Murder", hcaes(x = year, y = count)) %>%
  hc_add_series(data = crime_sum %>% filter(crime_type == "robbery") %>% select(year, count),
                type = "line", name = "Robbery", hcaes(x = year, y = count)) %>%
  hc_add_series(data = crime_sum %>% filter(crime_type == "rape") %>% select(year, count),
                type = "line", name = "Rape", hcaes(x = year, y = count)) %>%
  hc_add_series(data = crime_sum %>% filter(crime_type == "aggravatedassault") %>% select(year, count),
                type = "line", name = "Aggravated Assault", hcaes(x = year, y = count)) %>%
  hc_add_series(data = crime_sum %>% filter(crime_type == "burglary") %>% select(year, count),
                type = "line", name = "Burglary", hcaes(x = year, y = count)) %>%
  hc_add_series(data = crime_sum %>% filter(crime_type == "theft") %>% select(year, count),
                type = "line", name = "Theft", hcaes(x = year, y = count)) %>%
  hc_add_series(data = crime_sum %>% filter(crime_type == "propertycrime") %>% select(year, count),
                type = "line", name = "Property Crime", hcaes(x = year, y = count)) %>%
  hc_colors(cols) %>%
  hc_credits(enabled = TRUE,
             text = "Source: Real-Time Crime Index (https://realtimecrimeindex.com)",
             href = "https://realtimecrimeindex.com",
             style = list(fontSize = "10", color = "black", fontStyle = "italic"))
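As a possible shortcut for the repeated series above, highcharter can also build one series per category from a single grouped mapping. The sketch below is an alternative formulation on made-up counts, not the code that produced the final chart; series names then come from the crime_type values instead of the hand-written labels.

```r
library(highcharter)

# Toy long-format summary with made-up counts.
crime_sum <- data.frame(
  year       = rep(2019:2021, times = 2),
  crime_type = rep(c("murder", "robbery"), each = 3),
  count      = c(10, 12, 11, 300, 280, 290)
)

# One call: the group aesthetic creates a separate line per crime type.
hc <- hchart(
  crime_sum, "line",
  hcaes(x = year, y = count, group = crime_type)
) %>%
  hc_title(text = "Crime Type Trends in the DMV Area") %>%
  hc_yAxis(title = list(text = "Crime Count"))

# Printing hc (or leaving it as the last expression of an R Markdown
# chunk) renders the interactive chart.
```

Functions such as `hc_colors()` and `hc_credits()` can be piped onto `hc` in the same way as in the chart above.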
6. Essay
To prepare the dataset for analysis, I first removed unnecessary spaces and standardized column names. I ensured that numerical values were stored as integers and that categorical variables, such as state or city, were stored as factors, so the dataset was clean and ready for analysis. I filtered the data to focus on the DMV area, removing observations outside Maryland, Virginia, and Washington, D.C. After checking for missing values (none were found), I converted the dataset from wide to long format, grouping crime types under a single variable with corresponding values and years, which made the trend analysis and the visualizations easier to build.
The visualizations provide a clear overview of crime trends in the DMV area, showing fluctuations in different crime types over the last five years. Line graphs revealed that certain crimes, such as theft and aggravated assault, had noticeable spikes in specific years. Additionally, I could see that the pandemic (2020-2022) affected these trends, as the data showed a decrease in crime during those years. I was surprised to find that property crime was the most prevalent in this area, as I did not expect it to be the largest category.
There were a few aspects I wished I could have included but couldn’t get to work. One challenge was customizing interactive maps to visualize crime data across different regions more effectively, such as by city or by making comparisons with other areas of the country. Additionally, some graphs didn’t display the data as clearly as I had hoped, especially when trying to adjust axis scales for better readability. For example, the area graph was difficult to optimize, and I struggled to display it in a way that would show the correct results, as it appeared somewhat confusing. However, in my last graph, I found a visualization that expressed what I wanted and a graph that I would like to explore further with different variables across the country.
References
https://rpubs.com/sharmaar3/Residual_Plots
https://www.highcharts.com/chat/gpt/ (to add the source in my final graph)