Crime Type Trends in the Past Five Years in the DMV Area

Author

Karen Pesca

1. Introduction

I chose the “Realtimecrimeindex_sample” dataset from the course's data sources to explore crime trends in the DMV area. It contains monthly data organized by year, state, and city (agency), with counts for crime types such as murder, rape, robbery, aggravated assault, burglary, theft, and motor vehicle theft. Murder is the intentional killing of another person, excluding deaths from negligence or accidents. Rape involves non-consensual penetration, while robbery is taking property by force or threat. Aggravated assault is an unlawful attack meant to cause severe injury, often with a weapon. Burglary is unlawfully entering a structure to commit a felony or theft, and theft is taking property without force. Motor vehicle theft refers to stealing vehicles, excluding things like airplanes or farming equipment. The dataset offers a comprehensive view of crime patterns across the country, so I decided to focus on one specific area.

I believe crime data is an essential tool for understanding public safety trends, identifying high-risk areas, and shaping effective policies. This analysis uses real-time crime data from the Real-Time Crime Index web page, a platform that aggregates and visualizes crime reports across the United States, collected by agencies such as the police and the FBI.

Source: https://realtimecrimeindex.com/

2. Loading data, libraries and cleaning.

First, I set my working directory and load the libraries. Then I load my dataset and clean the column names.

setwd("/Users/karenlizethpp/Library/Mobile Documents/com~apple~CloudDocs/Data 110")
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggfortify)
library(htmltools)
library(plotly)

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout
Crimeindex <- read_csv("realtimecrimeindex_sample.csv")
Rows: 36311 Columns: 33
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (9): Date, Agency, State, Agency_State, Source.Link, Source.Type, Sourc...
dbl (22): Month, Year, Murder, Rape, Robbery, Aggravated Assault, Burglary, ...
lgl  (2): Latitude, Longitude

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Removing spaces and capital letters from the variable names.

names(Crimeindex) <- tolower(names(Crimeindex))
names(Crimeindex) <- gsub(" ","",names(Crimeindex))
head(Crimeindex)
# A tibble: 6 × 33
  month  year date   agency  state agency_state murder  rape robbery
  <dbl> <dbl> <chr>  <chr>   <chr> <chr>         <dbl> <dbl>   <dbl>
1     1  2018 Jan-18 Abilene TX    Abilene, TX       0     9       7
2     2  2018 Feb-18 Abilene TX    Abilene, TX       0    11       5
3     3  2018 Mar-18 Abilene TX    Abilene, TX       0    11       6
4     4  2018 Apr-18 Abilene TX    Abilene, TX       1     9       8
5     5  2018 May-18 Abilene TX    Abilene, TX       0    12       8
6     6  2018 Jun-18 Abilene TX    Abilene, TX       1    13       4
# ℹ 24 more variables: aggravatedassault <dbl>, burglary <dbl>, theft <dbl>,
#   motorvehicletheft <dbl>, violentcrime <dbl>, propertycrime <dbl>,
#   murder_mvs_12mo <dbl>, burglary_mvs_12mo <dbl>, rape_mvs_12mo <dbl>,
#   robbery_mvs_12mo <dbl>, aggravatedassault_mvs_12mo <dbl>,
#   motorvehicletheft_mvs_12mo <dbl>, theft_mvs_12mo <dbl>,
#   violentcrime_mvs_12mo <dbl>, propertycrime_mvs_12mo <dbl>,
#   source.link <chr>, source.type <chr>, source.method <chr>, …
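
As a minimal sketch of what these two calls do, the snippet below applies the same renaming to a made-up data frame. The data frame `toy`, its values, and the helper `clean_names()` are illustrative assumptions, not part of the original analysis.

```r
# Toy data frame; names mimic the raw file, values are invented
toy <- data.frame(
  Month = 1,
  `Aggravated Assault` = 3,
  `Motor Vehicle Theft` = 5,
  check.names = FALSE  # keep the spaces so we can remove them ourselves
)

# Hypothetical helper: lower-case the names, then drop every space
clean_names <- function(x) {
  x <- tolower(x)   # "Aggravated Assault" -> "aggravated assault"
  gsub(" ", "", x)  # "aggravated assault" -> "aggravatedassault"
}

names(toy) <- clean_names(names(toy))
print(names(toy))
```

Once the spaces are gone, a second pass that replaces spaces with underscores has nothing left to change.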

3. Filter and Select the Variables and Columns for Analysis

In this case, I will focus on the DMV area, which includes Maryland, Virginia, and the District of Columbia (Washington, D.C.). I removed the "Full Sample" rows to avoid double counting, and I dropped columns 16 to 33, which are not needed for this analysis.

crimeindex_DMV <- Crimeindex %>%
  filter(state %in% c("DC", "MD", "VA"), agency != "Full Sample") %>%
  select(-c(16:33))
head(crimeindex_DMV)
# A tibble: 6 × 15
  month  year date   agency     state agency_state   murder  rape robbery
  <dbl> <dbl> <chr>  <chr>      <chr> <chr>           <dbl> <dbl>   <dbl>
1     1  2018 Jan-18 Alexandria VA    Alexandria, VA      0     0       8
2     2  2018 Feb-18 Alexandria VA    Alexandria, VA      0     0       1
3     3  2018 Mar-18 Alexandria VA    Alexandria, VA      1     2       5
4     4  2018 Apr-18 Alexandria VA    Alexandria, VA      0     2      10
5     5  2018 May-18 Alexandria VA    Alexandria, VA      0     1       4
6     6  2018 Jun-18 Alexandria VA    Alexandria, VA      0     4       8
# ℹ 6 more variables: aggravatedassault <dbl>, burglary <dbl>, theft <dbl>,
#   motorvehicletheft <dbl>, violentcrime <dbl>, propertycrime <dbl>
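
The toy example below sketches why the aggregate rows are dropped. It assumes, as the filter above implies, that aggregate rows carry `agency == "Full Sample"`. All values are invented for illustration, and base R subsetting is used here instead of `filter()` so the snippet runs without extra packages.

```r
# Toy rows: two DMV agencies, one aggregate row, one non-DMV agency
toy <- data.frame(
  agency = c("Alexandria", "Baltimore", "Full Sample", "Abilene"),
  state  = c("VA", "MD", "MD", "TX"),
  murder = c(1, 20, 21, 0)
)

# Keep DMV states, then drop the aggregate row
keep <- toy$state %in% c("DC", "MD", "VA") & toy$agency != "Full Sample"
dmv  <- toy[keep, ]

print(dmv)
total_murder <- sum(dmv$murder)  # counts each agency once
print(total_murder)
```

If the aggregate row were kept, the same incidents would be counted a second time in any total.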

4. Correlation and Multiple Linear Regression

To analyze the correlation, I used a matrix of scatterplots, histograms, and correlation plots (ggpairs function) for multiple variables in my dataset. This matrix allowed me to visually assess the relationships between different crime types. The scatterplots show how pairs of variables are related, while the histograms offer insights into the distribution of each variable. The correlation values displayed in the upper triangle of the matrix reveal the strength and direction of the relationships between crime types. This approach helped me identify patterns and correlations, providing a better understanding of how different crimes might be interconnected. Based on this, I will create my multiple linear models.

library(GGally)
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2
ggpairs(crimeindex_DMV, columns = 7:15)

The matrix shows that murder has the highest correlations with the other variables, which means that changes in other crime types, such as robbery, aggravated assault, and theft, are strongly associated with changes in the number of murders. This suggests that these crimes are interconnected and may help predict or explain murder trends within the DMV area.
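
The numbers shown in the upper triangle of a pairs plot can also be inspected directly with `cor()`. The sketch below uses simulated counts (the data frame `toy` and its Poisson parameters are assumptions chosen only for illustration), so the values it prints say nothing about the real DMV data.

```r
set.seed(110)
n <- 200

# Simulated monthly counts; murder is made to depend loosely on robbery
robbery <- rpois(n, 30)
toy <- data.frame(
  robbery = robbery,
  murder  = rpois(n, 1 + 0.05 * robbery),
  theft   = rpois(n, 200)
)

# Pearson correlation matrix for all numeric columns
cm <- cor(toy)
print(round(cm, 2))
```

On the real data, the equivalent call would be something like `cor(crimeindex_DMV[, 7:15])`, assuming those columns are all numeric.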

Multiple Linear Regression

For this section, I will run some multiple linear models and check for a good adjusted R^2 and significant p-values.

Model 1

Murder = Intercept + Rape + Robbery + Aggravated Assault + Burglary + Theft + Motor Vehicle Theft + Violent Crime + Property Crime

lm1 <- lm (murder ~ rape + robbery + aggravatedassault + burglary + theft+ motorvehicletheft + violentcrime + propertycrime, data= crimeindex_DMV)
summary(lm1)

Call:
lm(formula = murder ~ rape + robbery + aggravatedassault + burglary + 
    theft + motorvehicletheft + violentcrime + propertycrime, 
    data = crimeindex_DMV)

Residuals:
       Min         1Q     Median         3Q        Max 
-2.080e-11  9.400e-15  2.500e-14  3.730e-14  1.179e-13 

Coefficients: (1 not defined because of singularities)
                    Estimate Std. Error    t value Pr(>|t|)    
(Intercept)       -5.818e-13  3.321e-14 -1.752e+01   <2e-16 ***
rape              -1.000e+00  8.583e-15 -1.165e+14   <2e-16 ***
robbery           -1.000e+00  6.557e-15 -1.525e+14   <2e-16 ***
aggravatedassault -1.000e+00  6.475e-15 -1.544e+14   <2e-16 ***
burglary           2.029e-16  6.296e-16  3.220e-01    0.747    
theft              3.791e-17  8.400e-17  4.510e-01    0.652    
motorvehicletheft  2.438e-17  2.301e-16  1.060e-01    0.916    
violentcrime       1.000e+00  6.332e-15  1.579e+14   <2e-16 ***
propertycrime             NA         NA         NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.199e-13 on 1140 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 2.5e+28 on 7 and 1140 DF,  p-value: < 2.2e-16
#Diagnostic Plots
autoplot(lm1, 1:4, nrow=2, ncol=2)

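The lm1 model shows R^2 = 1, but this perfect fit is an artifact rather than evidence of a good model. The coefficients of exactly -1 on rape, robbery, and aggravated assault and +1 on violent crime indicate that violentcrime is the sum of the four violent offenses, so murder can be recovered exactly by subtraction. Likewise, propertycrime is reported as NA "because of singularities", which suggests it is the sum of burglary, theft, and motor vehicle theft. Any model that keeps violentcrime alongside its components will reproduce this result, so I will try removing some variables.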
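
One way to see how a fit with R^2 = 1 can arise is to build a toy data set in which violentcrime is defined as the sum of the four violent offenses. This is an assumption about how the column is constructed, suggested by the -1 and +1 coefficients in the output above. The simulated counts below are invented.

```r
set.seed(1)
n <- 50

# Simulated counts for the four violent offenses
toy <- data.frame(
  murder            = rpois(n, 2),
  rape              = rpois(n, 5),
  robbery           = rpois(n, 20),
  aggravatedassault = rpois(n, 40)
)

# Assumed construction of the aggregate column
toy$violentcrime <- with(toy, murder + rape + robbery + aggravatedassault)

fit <- lm(murder ~ rape + robbery + aggravatedassault + violentcrime, data = toy)

# Slopes come out as -1, -1, -1, +1 (up to floating-point error)
print(round(coef(fit), 6))

# summary() warns about an "essentially perfect fit"; that is expected here
r2 <- suppressWarnings(summary(fit)$r.squared)
print(r2)

# A possible check on the real data (not run here):
# with(crimeindex_DMV, all(violentcrime == murder + rape + robbery + aggravatedassault))
```

If that check returns TRUE on the real data, the perfect fit is pure arithmetic and carries no information about what drives murders.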

Model 2:

lm2 <- lm (murder ~ rape + robbery + aggravatedassault + violentcrime, data= crimeindex_DMV)
summary(lm2)

Call:
lm(formula = murder ~ rape + robbery + aggravatedassault + violentcrime, 
    data = crimeindex_DMV)

Residuals:
       Min         1Q     Median         3Q        Max 
-2.081e-11  1.360e-14  2.470e-14  3.400e-14  1.226e-13 

Coefficients:
                    Estimate Std. Error    t value Pr(>|t|)    
(Intercept)       -5.752e-13  2.607e-14 -2.206e+01   <2e-16 ***
rape              -1.000e+00  8.445e-15 -1.184e+14   <2e-16 ***
robbery           -1.000e+00  6.184e-15 -1.617e+14   <2e-16 ***
aggravatedassault -1.000e+00  5.935e-15 -1.685e+14   <2e-16 ***
violentcrime       1.000e+00  5.885e-15  1.699e+14   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.193e-13 on 1143 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 4.384e+28 on 4 and 1143 DF,  p-value: < 2.2e-16
#Diagnostic Plots
autoplot(lm2, 1:4, nrow=2, ncol=2)

Model 3:

lm3 <- lm (murder ~ rape + robbery + aggravatedassault + burglary + theft+ motorvehicletheft + propertycrime, data= crimeindex_DMV)
summary(lm3)

Call:
lm(formula = murder ~ rape + robbery + aggravatedassault + burglary + 
    theft + motorvehicletheft + propertycrime, data = crimeindex_DMV)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.2795  -1.2448  -0.2219   0.8762  17.8736 

Coefficients: (1 not defined because of singularities)
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -0.5550219  0.1543736  -3.595 0.000338 ***
rape               0.0198987  0.0264281   0.753 0.451644    
robbery            0.0301181  0.0031094   9.686  < 2e-16 ***
aggravatedassault  0.0202475  0.0019960  10.144  < 2e-16 ***
burglary          -0.0030990  0.0029419  -1.053 0.292390    
theft              0.0046765  0.0003675  12.725  < 2e-16 ***
motorvehicletheft -0.0014001  0.0010750  -1.302 0.193026    
propertycrime             NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.898 on 1141 degrees of freedom
Multiple R-squared:  0.8575,    Adjusted R-squared:  0.8568 
F-statistic:  1145 on 6 and 1141 DF,  p-value: < 2.2e-16
#Diagnostic Plots
autoplot(lm3, 1:4, nrow=2, ncol=2)
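
The NA coefficient for propertycrime can be reproduced with a small simulated example. The sketch assumes that propertycrime equals burglary + theft + motorvehicletheft, which the "not defined because of singularities" note suggests but the output does not prove. All values are invented.

```r
set.seed(2)
n <- 50

toy <- data.frame(
  murder            = rpois(n, 2),
  burglary          = rpois(n, 10),
  theft             = rpois(n, 100),
  motorvehicletheft = rpois(n, 15)
)

# Assumed construction of the aggregate column
toy$propertycrime <- with(toy, burglary + theft + motorvehicletheft)

fit <- lm(murder ~ burglary + theft + motorvehicletheft + propertycrime, data = toy)

# propertycrime adds no new information, so lm() leaves its coefficient as NA
print(coef(fit))

# alias() reports which terms are exact linear combinations of the others
print(alias(fit)$Complete)
```

Dropping the redundant column from the formula gives the same fitted values and a cleaner coefficient table.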

Model 4:

lm4 <- lm (murder ~ robbery + aggravatedassault + theft, data= crimeindex_DMV)
summary(lm4)

Call:
lm(formula = murder ~ robbery + aggravatedassault + theft, data = crimeindex_DMV)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.3657  -1.2475  -0.2377   0.8709  17.9786 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -0.5954090  0.1465054  -4.064 5.15e-05 ***
robbery            0.0277718  0.0026282  10.567  < 2e-16 ***
aggravatedassault  0.0196480  0.0016566  11.861  < 2e-16 ***
theft              0.0047176  0.0003035  15.544  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.898 on 1144 degrees of freedom
Multiple R-squared:  0.8572,    Adjusted R-squared:  0.8568 
F-statistic:  2288 on 3 and 1144 DF,  p-value: < 2.2e-16
#Diagnostic Plots
autoplot(lm4, 1:4, nrow=2, ncol=2)

The diagnostic plots flag observations 1185, 111, and 131: they stand out in the Residuals vs. Fitted plot and also have high Scale-Location values.

Model 5:

Removing the observations flagged in the residual plots (outliers)

To improve the adjusted R^2, I remove the outliers. After doing so, the adjusted R^2 rises from 0.8568 (lm4) to 0.8609 (lm5), which is a small improvement.

options(scipen = 0)
crimeindex_DMV2<- crimeindex_DMV[-c(1185,111,131),]
lm5 <- lm(murder ~ robbery + aggravatedassault + theft, data= crimeindex_DMV2)
summary(lm5)

Call:
lm(formula = murder ~ robbery + aggravatedassault + theft, data = crimeindex_DMV2)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.1073  -1.2183  -0.2272   0.8863  16.8428 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -0.5891500  0.1419881  -4.149 3.58e-05 ***
robbery            0.0280269  0.0025500  10.991  < 2e-16 ***
aggravatedassault  0.0185736  0.0016113  11.527  < 2e-16 ***
theft              0.0047928  0.0002943  16.285  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.807 on 1142 degrees of freedom
Multiple R-squared:  0.8613,    Adjusted R-squared:  0.8609 
F-statistic:  2364 on 3 and 1142 DF,  p-value: < 2.2e-16
#Diagnostic Plots
autoplot(lm5, 1:4, nrow=2, ncol=2)
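
In the outputs above, the residual degrees of freedom go from 1144 (lm4) to 1142 (lm5), which corresponds to two fewer observations rather than three. It may therefore be worth confirming that every flagged index exists in the data before removing rows. The sketch below shows one way to do that on simulated data. The data frame `toy`, the planted outliers, and the choice of Cook's distance as the flagging rule are all assumptions for illustration.

```r
set.seed(3)
n <- 100

# Simulated linear relationship with two planted outliers
toy <- data.frame(x = rnorm(n))
toy$y <- 2 * toy$x + rnorm(n)
toy$y[c(10, 50)] <- toy$y[c(10, 50)] + 15

fit <- lm(y ~ x, data = toy)

# Take the row indices of the two largest Cook's distances
cd <- cooks.distance(fit)
flagged <- as.integer(names(sort(cd, decreasing = TRUE))[1:2])

# Guard: every index must point at an existing row
stopifnot(all(flagged >= 1), all(flagged <= nrow(toy)))

toy2 <- toy[-flagged, ]
fit2 <- lm(y ~ x, data = toy2)

# Confirm how many rows were actually dropped
print(c(before = nrow(toy), after = nrow(toy2)))
```

Comparing `nrow()` before and after the removal is a quick way to verify that the intended rows were dropped.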

Final model (lm5):

  1. Model Equation

Murder = −0.5891 + 0.0280(Robbery) + 0.0186(Aggravated Assault) + 0.0048(Theft)

Intercept (-0.589):

If robbery, aggravated assault, and theft were zero, the model predicts a negative number of murders, which is unrealistic. This negative intercept may reflect that other factors not included in the model (such as drugs, socioeconomic status, or policing) also influence murders.

Robbery (0.0280):

The variables are monthly counts reported by each agency, so each additional robbery is associated with an increase of 0.0280 in the predicted number of murders, holding the other predictors constant.

Aggravated Assault (0.0186):

Each additional aggravated assault is associated with an increase of 0.0186 in the predicted number of murders, holding the other predictors constant.

Theft (0.0048):

Each additional theft is associated with an increase of 0.0048 in the predicted number of murders. The per-incident effect is smaller than for robbery and aggravated assault, but thefts are far more numerous, so theft still contributes noticeably to the predictions.

All predictors have very small p-values (< 2e-16), indicating they are statistically significant.

On the other hand, it is important to note that correlation between these variables does not necessarily imply causation.
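
As a worked example of reading the final equation, the snippet below plugs in hypothetical monthly counts for one agency (50 robberies, 100 aggravated assaults, 500 thefts). These input values are invented for illustration. The coefficients are copied from the lm5 summary above.

```r
# Coefficients copied from the lm5 output
coefs <- c(
  intercept         = -0.5891500,
  robbery           =  0.0280269,
  aggravatedassault =  0.0185736,
  theft             =  0.0047928
)

# Hypothetical monthly counts for a single agency
new_obs <- c(robbery = 50, aggravatedassault = 100, theft = 500)

# Prediction = intercept + sum of (coefficient * value)
pred <- coefs[["intercept"]] + sum(coefs[names(new_obs)] * new_obs)
print(pred)  # about 5.07 predicted murders
```

With a fitted model object in memory, `predict(lm5, newdata = ...)` would give the same number without copying coefficients by hand.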

Interpretation of the Diagnostic Plots:

During my analysis, I focused on the Residuals vs. Fitted plot to identify points that might disproportionately affect my model. I noticed that points far from the center with high leverage could be potential outliers or influential observations, meaning they might be distorting the results. Based on this, I made adjustments to my final model to ensure a more accurate analysis.

5. Visualization

Organizing the Dataset from Wide to Long Format

To organize the dataset from wide to long format, I restructured the data so that multiple crime types, previously in separate columns, are now grouped into a single variable along with the year. This transformation simplifies the analysis by making it easier to compare trends across different crime types over time. I also filtered the data to the years 2019 through 2024.

crime_sum <- crimeindex_DMV %>%
  filter(year >= 2019, year <= 2024) %>%
  group_by(year) %>%
  summarise(across(c(murder, robbery, rape, aggravatedassault,burglary, theft, propertycrime), sum)) %>%
  pivot_longer(cols = c(murder, robbery, rape, aggravatedassault,burglary,theft, propertycrime), names_to = "crime_type", values_to = "count")
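
The same summarise-then-pivot pattern is easier to follow on a tiny table. The sketch below uses an invented four-row tibble (`toy`, with hypothetical agencies "A" and "B") and the same dplyr and tidyr verbs as the code above.

```r
library(dplyr)
library(tidyr)

# Wide format: one column per crime type, one row per agency and year
toy <- tibble(
  year   = c(2019, 2019, 2020, 2020),
  agency = c("A", "B", "A", "B"),
  murder = c(1, 2, 3, 4),
  theft  = c(10, 20, 30, 40)
)

toy_long <- toy %>%
  group_by(year) %>%
  summarise(across(c(murder, theft), sum)) %>%   # yearly totals over agencies
  pivot_longer(
    cols = c(murder, theft),
    names_to = "crime_type",                     # former column names
    values_to = "count"                          # former cell values
  )

print(toy_long)
```

The result has one row per year and crime type, which is the shape that ggplot2 and highcharter expect for grouped series.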

Trying Different Graphs

  1. Heatmap
ggplot(crime_sum, aes(x = year, y = crime_type, fill = count)) +
  geom_tile() +
  scale_fill_gradient(low = "lightskyblue2", high = "indianred2") +
  labs(
    title = "Crime Trends in DMV (2019-2024)",
    x = "Year",
    y = "Crime Type",
    fill = "Crime Count",
    caption = "Source: crimeindex_DMV dataset"
  ) +
  theme_light()+
  theme(legend.position = "right", plot.title = element_text(hjust = 0.5))

Create a Highchart

For my first project, I would like to try the Highcharts graph. I will make some modifications to see what looks best for presenting my project. I plan to use line, area, and bar graphs.

Highcharts Bar Graph

library(highcharter)  # provides highchart(), the hc_*() functions, and hcaes()
highchart() %>%
  hc_chart(type = "column") %>% 
  hc_title(text = "Crime Type Trends in the Past Five Years in the DMV Area") %>%
  hc_xAxis(categories = unique(crime_sum$year)) %>%  
  hc_yAxis(title = list(text = "Crime Count")) %>%
  hc_add_series(data = crime_sum %>%
                  filter(crime_type == "murder") %>%
                  select(year, count), 
                type = "column", 
                name = "Murder", 
                hcaes(x = year, y = count)) %>%
  hc_add_series(data = crime_sum %>%
                  filter(crime_type == "robbery") %>%
                  select(year, count), 
                type = "column", 
                name = "Robbery", 
                hcaes(x = year, y = count)) %>%
  hc_add_series(data = crime_sum%>%
                  filter(crime_type == "rape") %>%
                  select(year, count), 
                type = "column", 
                name = "Rape", 
                hcaes(x = year, y = count)) %>%
  hc_add_series(data = crime_sum %>%
                  filter(crime_type == "aggravatedassault") %>%
                  select(year, count), 
                type = "column", 
                name = "Aggravated Assault", 
                hcaes(x = year, y = count)) %>%
  hc_add_series(data = crime_sum %>%
                  filter(crime_type == "burglary") %>%
                  select(year, count), 
                type = "column", 
                name = "Burglary", 
                hcaes(x = year, y = count)) %>%
  hc_add_series(data = crime_sum %>%
                  filter(crime_type == "theft") %>%
                  select(year, count), 
                type = "column", 
                name = "Theft", 
                hcaes(x = year, y = count))%>%
  hc_add_series(data = crime_sum %>%
                  filter(crime_type == "propertycrime") %>%
                  select(year, count), 
                type = "column", 
                name = "Property Crime", 
                hcaes(x = year, y = count))
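
Because the data are already in long format, the seven near-identical `hc_add_series()` calls could possibly be collapsed into a single `hchart()` call that groups by crime type. The following is a sketch on invented data (`toy_long` is a hypothetical stand-in for `crime_sum`), not a drop-in replacement for the chart above.

```r
library(highcharter)

# Invented long-format data; year is stored as text so it is treated as a category
toy_long <- data.frame(
  year       = rep(c("2019", "2020"), each = 2),
  crime_type = rep(c("murder", "theft"), times = 2),
  count      = c(3, 30, 7, 70)
)

# One series per crime_type, created by the group aesthetic
hc <- hchart(toy_long, "column", hcaes(x = year, y = count, group = crime_type))
hc <- hc_title(hc, text = "Toy example: crime counts by year")
hc <- hc_yAxis(hc, title = list(text = "Crime Count"))
hc
```

This version trades control over each series name for brevity. Display names such as "Aggravated Assault" would need to be set in the data before plotting.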

Final visualization

6. Essay

To prepare the dataset for analysis, I first removed unnecessary spaces and standardized column names. I ensured that numerical values were stored as integers and categorical variables, such as state or city, were stored as factors, so that the dataset was clean and ready for analysis. I filtered the data to focus on the DMV area, removing observations outside Maryland, Virginia, and Washington, D.C. After checking for missing values (none were found), I converted the dataset from wide to long format, grouping crime types under a single variable with corresponding values and years, which made the trend analysis and the visualization easier to create.

The visualizations provide a clear overview of crime trends in the DMV area, showing fluctuations in different crime types over the last five years. Line graphs revealed that certain crimes, such as theft and aggravated assault, had noticeable spikes in specific years. The data also suggest that the pandemic (2020-2022) affected these trends, since reported crime decreased during those years. I was surprised to find that property crime was the most prevalent in this area, as I did not expect it to be the largest category.

There were a few aspects I wished I could have included but couldn’t get to work. One challenge was customizing interactive maps to visualize crime data across different regions more effectively, such as by city or through comparisons with other areas of the country. Additionally, some graphs didn’t display the data as clearly as I had hoped, especially when I tried to adjust axis scales for better readability. For example, the area graph was difficult to optimize, and I struggled to display it in a way that showed the results correctly, as it appeared somewhat confusing. However, my last graph expressed what I wanted, and it is a type of visualization that I would like to explore further with different variables across the country.

References

https://rpubs.com/sharmaar3/Residual_Plots
https://www.highcharts.com/chat/gpt/ (to add the source in my final graph)