This project analyzes DUI (Driving Under the Influence) incidents in Montgomery County, Maryland by using data from the Montgomery County Police Department Open Data Portal. The dataset contains official DUI incident reports which are recorded by law enforcement and includes information about location, geography, and incident severity.
This dataset is a structured collection of real DUI incident reports from the Montgomery County Police Department. Each entry represents a specific crash or incident where alcohol or drug impairment was involved and officially documented by law enforcement.
The dataset I have it includes key details about where the incident happened (city and specific place type like roadway or intersection), the exact geographic coordinates (latitude and longitude), and how severe the incident was, often measured by the number of people injured or affected.
It also comes directly from police records, the data reflects real-world traffic safety conditions in Montgomery County. It helps show patterns in where DUI incidents tend to happen and whether certain locations or areas are associated with more severe outcomes.
Maryland law clearly defines DUI and DWI offenses and outlines as strict legal consequences for reckless driving.
According to the Maryland Motor Vehicle Administration, a blood alcohol concentration (BAC) of 0.08 or higher is considered legally impaired for drivers over 21, while drivers under 21 are prohibited from having any alcohol in their system.
If a driver is arrested for DUI, they may face immediate license suspension, fines, and long-term penalties as well as the data on their driving record or mandatory ignition interlock programs.
The severity of penalties increases such as second DUI convictions which leads to longer jail sentences, higher fines, and longer license revocations. Refusing a chemical test can result in automatic license suspension which ranges from 270 days to two years depending on the type of prior offense.
These strict laws show how DUI enforcement in Maryland is heavily regulated and that consequences escalate based on the behavior and prior history.
This background is very important for my analysis because it shows that DUI severity is not only a data variable on my dataset but an outcome which is influenced by policies and behavior of the driver.
Maryland Motor Vehicle Administration. “Driving Under the Influence (DUI) & Driving While Impaired (DWI).” MVA Maryland,
How do location (City and Place type) and geographic position (Latitude and Longitude) influence the severity of DUI incidents which are measured by the number of victims, in Montgomery County?
The data was collected by the Montgomery County Police Department (MCPD) and published through the Montgomery County Open Data Portal. Each row represents a single DUI incident recorded by police officers.
City (categorical) - Location of the incident such as Rockville or Germantown
Place (categorical) - The type of envionment and the place where the incident occured
Victims (quantative) - The Number of individuals who were involved in the incident
Latitude (quantative) - Numerical geographic coordinate representing north-south position. Higher values mean farther north in the county.
Longitude (quantative) - Numerical geographic coordinates which represent east and west positions. More negative values generally means more west
# In this chunk I am going to be loading my libraries for my data cleaning, visualization, and analysis
# I am going to use tidyverse for data wrangling and ggplot2 visualizations.
# I am going to be using plotly to create interactive visualizations.
library(readr)
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.5.2
## Warning: package 'ggplot2' was built under R version 4.5.2
## Warning: package 'dplyr' was built under R version 4.5.2
## Warning: package 'forcats' was built under R version 4.5.2
## Warning: package 'lubridate' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ purrr 1.1.0
## ✔ forcats 1.0.1 ✔ stringr 1.5.1
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(plotly)
## Warning: package 'plotly' was built under R version 4.5.3
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
dui <- read_csv("Driving_Under_the_Influence_from_Crime_Data_20260506.csv")
## Rows: 19983 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): Dispatch Date / Time, Crime Name1, Crime Name2, Crime Name3, Polic...
## dbl (9): Incident ID, Offence Code, CR Number, NIBRS Code, Victims, Zip Cod...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Citations/Disclaimer: This code follows what learned from course/class notes
# I am using the function head which shows the first few rows of the dataset so I can understand the data
# I am using the function dim which displays the number of rows and columns
# I am using the function colnames it basically helps me figure out the variables that I need to use
# I am using structure to help me see the structure of the dataset
head(dui)
## # A tibble: 6 × 26
## `Incident ID` `Offence Code` `CR Number` `Dispatch Date / Time` `NIBRS Code`
## <dbl> <dbl> <dbl> <chr> <dbl>
## 1 201573624 5404 260019134 05/04/2026 02:31:08 AM 90
## 2 201573490 5404 260018974 05/02/2026 07:50:31 PM 90
## 3 201573622 5404 260018878 05/02/2026 02:50:28 AM 90
## 4 201573460 5404 260018836 05/01/2026 07:26:27 PM 90
## 5 201573298 5404 260018696 05/01/2026 02:12:06 AM 90
## 6 201573383 5404 260018687 04/30/2026 11:13:01 PM 90
## # ℹ 21 more variables: Victims <dbl>, `Crime Name1` <chr>, `Crime Name2` <chr>,
## # `Crime Name3` <chr>, `Police District Name` <chr>, `Block Address` <chr>,
## # City <chr>, State <chr>, `Zip Code` <dbl>, Agency <chr>, Place <chr>,
## # Sector <chr>, Beat <chr>, PRA <chr>, `Address Number` <dbl>,
## # Start_Date_Time <chr>, End_Date_Time <chr>, Latitude <dbl>,
## # Longitude <dbl>, `Police District Number` <chr>, Location <chr>
dim(dui)
## [1] 19983 26
colnames(dui)
## [1] "Incident ID" "Offence Code" "CR Number"
## [4] "Dispatch Date / Time" "NIBRS Code" "Victims"
## [7] "Crime Name1" "Crime Name2" "Crime Name3"
## [10] "Police District Name" "Block Address" "City"
## [13] "State" "Zip Code" "Agency"
## [16] "Place" "Sector" "Beat"
## [19] "PRA" "Address Number" "Start_Date_Time"
## [22] "End_Date_Time" "Latitude" "Longitude"
## [25] "Police District Number" "Location"
str(dui)
## spc_tbl_ [19,983 × 26] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Incident ID : num [1:19983] 2.02e+08 2.02e+08 2.02e+08 2.02e+08 2.02e+08 ...
## $ Offence Code : num [1:19983] 5404 5404 5404 5404 5404 ...
## $ CR Number : num [1:19983] 2.6e+08 2.6e+08 2.6e+08 2.6e+08 2.6e+08 ...
## $ Dispatch Date / Time : chr [1:19983] "05/04/2026 02:31:08 AM" "05/02/2026 07:50:31 PM" "05/02/2026 02:50:28 AM" "05/01/2026 07:26:27 PM" ...
## $ NIBRS Code : num [1:19983] 90 90 90 90 90 90 90 90 90 90 ...
## $ Victims : num [1:19983] 1 1 1 1 1 1 1 1 1 1 ...
## $ Crime Name1 : chr [1:19983] "Crime Against Society" "Crime Against Society" "Crime Against Society" "Crime Against Society" ...
## $ Crime Name2 : chr [1:19983] "Driving Under the Influence" "Driving Under the Influence" "Driving Under the Influence" "Driving Under the Influence" ...
## $ Crime Name3 : chr [1:19983] "DRIVING UNDER THE INFLUENCE LIQUOR" "DRIVING UNDER THE INFLUENCE LIQUOR" "DRIVING UNDER THE INFLUENCE LIQUOR" "DRIVING UNDER THE INFLUENCE LIQUOR" ...
## $ Police District Name : chr [1:19983] "GERMANTOWN" "MONTGOMERY VILLAGE" "MONTGOMERY VILLAGE" "WHEATON" ...
## $ Block Address : chr [1:19983] NA NA NA NA ...
## $ City : chr [1:19983] "GERMANTOWN" "MONTGOMERY VILLAGE" "GAITHERSBURG" "SILVER SPRING" ...
## $ State : chr [1:19983] "MD" "MD" "MD" "MD" ...
## $ Zip Code : num [1:19983] 20874 20886 20877 20906 20876 ...
## $ Agency : chr [1:19983] "MCPD" "MCPD" "GPD" "MCPD" ...
## $ Place : chr [1:19983] "Street - In vehicle" "Street - In vehicle" "Street - In vehicle" "Street - In vehicle" ...
## $ Sector : chr [1:19983] "N" "R" "P" "L" ...
## $ Beat : chr [1:19983] "5N1" "6R2" "6P3" "4L1" ...
## $ PRA : chr [1:19983] "595" "422" "419" "524" ...
## $ Address Number : num [1:19983] NA NA NA NA NA NA NA NA NA NA ...
## $ Start_Date_Time : chr [1:19983] "05/04/2026 02:31:00 AM" "05/02/2026 07:50:00 PM" "05/02/2026 02:50:00 AM" "05/01/2026 07:26:00 PM" ...
## $ End_Date_Time : chr [1:19983] NA NA NA NA ...
## $ Latitude : num [1:19983] 39.2 39.2 39.1 39.1 39.2 ...
## $ Longitude : num [1:19983] -77.3 -77.2 -77.2 -77.1 -77.2 ...
## $ Police District Number: chr [1:19983] "5D" "6D" "6D" "4D" ...
## $ Location : chr [1:19983] "POINT (-77.2537 39.1751)" "POINT (-77.2142 39.1506)" "POINT (-77.1684 39.1469)" "POINT (-77.0773 39.0527)" ...
## - attr(*, "spec")=
## .. cols(
## .. `Incident ID` = col_double(),
## .. `Offence Code` = col_double(),
## .. `CR Number` = col_double(),
## .. `Dispatch Date / Time` = col_character(),
## .. `NIBRS Code` = col_double(),
## .. Victims = col_double(),
## .. `Crime Name1` = col_character(),
## .. `Crime Name2` = col_character(),
## .. `Crime Name3` = col_character(),
## .. `Police District Name` = col_character(),
## .. `Block Address` = col_character(),
## .. City = col_character(),
## .. State = col_character(),
## .. `Zip Code` = col_double(),
## .. Agency = col_character(),
## .. Place = col_character(),
## .. Sector = col_character(),
## .. Beat = col_character(),
## .. PRA = col_character(),
## .. `Address Number` = col_double(),
## .. Start_Date_Time = col_character(),
## .. End_Date_Time = col_character(),
## .. Latitude = col_double(),
## .. Longitude = col_double(),
## .. `Police District Number` = col_character(),
## .. Location = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
# Citations/Disclaimer: This code and analysis follows what learned from course/class notes
# I am cleaning the dataset by converting key variables into the correct format
# and removing missing values so my analysis and models are accurate.
# I am focusing on Latitude, Longitude, Victims, City, and Place since they are important for my research question about DUI location patterns.
# I am using the function colSums in order to check the missing values which are in the dataset
dui_clean <- dui %>%
mutate(
Latitude = as.numeric(gsub(",", "", Latitude)),
Longitude = as.numeric(gsub(",", "", Longitude)),
Victims = as.numeric(gsub(",", "", Victims))
) %>%
filter(
!is.na(City),
!is.na(Place),
!is.na(Victims),
!is.na(Latitude),
!is.na(Longitude)
)
colSums(is.na(dui_clean))
## Incident ID Offence Code CR Number
## 0 0 0
## Dispatch Date / Time NIBRS Code Victims
## 3631 0 0
## Crime Name1 Crime Name2 Crime Name3
## 0 0 0
## Police District Name Block Address City
## 56 10663 0
## State Zip Code Agency
## 249 232 0
## Place Sector Beat
## 0 0 0
## PRA Address Number Start_Date_Time
## 0 10661 0
## End_Date_Time Latitude Longitude
## 14717 0 0
## Police District Number Location
## 0 0
# Citations/Disclaimer: This code and analysis follows what learned from course/class notes
# I am using the summary() function to get the minimum, median, mean and the max values for key variables
# This also helps me have understanding of the distribution of DUI severity and geographic variables
# I am calculating the average number of victims per DUI incident
# I am also looking for the maximum number of victims in a single DUI incident
summary(dui_clean$Victims)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 1 1 1 1 1
summary(dui_clean$Latitude)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 39.04 39.09 38.41 39.14 39.34
summary(dui_clean$Longitude)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -77.52 -77.20 -77.12 -75.78 -77.05 0.00
mean(dui_clean$Victims, na.rm = TRUE)
## [1] 1
max(dui_clean$Victims, na.rm = TRUE)
## [1] 1
# Citations/Disclaimer: This code and analysis follows what learned from course/class notes
# I am grouping the data by City to see which locations have the highest number of DUI incidents.
# This helps me identify geographic patterns across Montgomery County.
dui_city <- dui_clean %>%
group_by(City) %>%
summarize(count = n())
dui_city
## # A tibble: 87 × 2
## City count
## <chr> <int>
## 1 20877 1
## 2 4 1
## 3 ADELPHI 1
## 4 ASHTON 25
## 5 ASPEN HILL 8
## 6 BARNESVILLE 1
## 7 BEALLSVILLE 2
## 8 BEALSVILLE 1
## 9 BEHESDA 1
## 10 BELTSVILLE 1
## # ℹ 77 more rows
# I am also grouping the data by Place to understand the types of locations where DUIs occur most often.
# This helps me see whether certain environments (like streets or residential areas) are more common.
dui_place <- dui_clean %>%
group_by(Place) %>%
summarize(count = n())
dui_place
## # A tibble: 55 × 2
## Place count
## <chr> <int>
## 1 Auto Dealership 1
## 2 Bank - ATM 5
## 3 Bar/Night Club 11
## 4 Church/Synagogue/Temple 2
## 5 Commercial - Office Building 6
## 6 Construction Site 2
## 7 Convenience Store 14
## 8 Field/ Open Space 1
## 9 Gas Station 47
## 10 Golf Course 2
## # ℹ 45 more rows
# I am building a regression model in order to determine how location and environment predict the DUI severity.
# I am also analyzing p-values and adjusted R-squared to evaluate model performance.
model <- lm(Victims ~ Latitude + Longitude, data = dui_clean)
summary(model)
##
## Call:
## lm(formula = Victims ~ Latitude + Longitude, data = dui_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.000e-15 -4.000e-15 -3.000e-15 -3.000e-15 6.292e-11
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.000e+00 2.390e-14 4.184e+13 <2e-16 ***
## Latitude 2.147e-14 6.534e-14 3.290e-01 0.742
## Longitude 1.084e-14 3.312e-14 3.270e-01 0.744
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.452e-13 on 19974 degrees of freedom
## Multiple R-squared: 0.5, Adjusted R-squared: 0.4999
## F-statistic: 9987 on 2 and 19974 DF, p-value: < 2.2e-16
Victims = 1 + (2.147 * 10^-14) Latitude + 1.084 * 10^-14 Longitude
I found that the model is statistically significant overall, with an F-statistic of 9987 and a p-value less than 2.2e−16. The adjusted R² is 0.4999, which means I can explain about 50% of the variation in Victims using Latitude and Longitude.
However, when I look at individual predictors, neither Latitude p = 0.742 or Longitude p = 0.744 is statistically significant. The p-value represents the probability of observing results as extreme if the predictor actually had no relationship with Victims. Since these p-values are much larger than 0.05 this is why we fail to reject the null hypothesis for both variables.
Even though the model is statistically significant, the individual variables do not show strong linear relationships with Victims, and the significance is influenced by the large sample size rather than the effects of the predictors.
# I am generating all default diagnostic plots for my regression model.
# This lets me quickly check whether my model meets key assumptions and if any data points are overly influential.
plot(model)
# I am creating a pivot table to summarize DUI incidents by City and Place to identify location patterns across Montgomery County.
# I am grouping the data to count how many incidents occur in each City–Place combination.
# I am reshaping the data so each Place becomes a column and each City becomes a row for easier comparison in Tableau.
# I am replacing missing values with 0 so the table is complete.
dui_pivot_city_place <- dui_clean %>%
group_by(City, Place) %>%
summarise(count = n(), .groups = "drop") %>%
pivot_wider(
names_from = Place,
values_from = count,
values_fill = 0
)
# I am renaming variables to make them clearer and more readable for Tableau analysis.
# City becomes City_Name and Place becomes Location_Type for better interpretation.
# I am exporting the cleaned dataset as a CSV file so it can be used in Tableau for visualization.
# I am removimng row names in order to keep the file clean and properly formatted.
dui_final <- dui_clean |>
rename(
City_Name = City,
Location_Type = Place
)
write.csv(dui_final, "dui_final.csv", row.names = FALSE)
# Citations/Disclaimer: This code follows what we learned from class notes
# In this chunk I created a scatterplot in order to visualize DUI locations in Montgomery County
# Longitude and Latitude are basically used in order to map geographic patterns of incidents
# I am coloring points by City in order to compare where incidents are concentrated
interactive_plot <- ggplot(dui_clean, aes(
x = Longitude,
y = Latitude,
color = City
)) +
geom_point(alpha = 0.8, size = 1.5) +
scale_color_brewer(palette = "Set3") +
theme_minimal() +
labs(
title = "Interactive DUI Incident Location Map",
x = "Longitude",
y = "Latitude",
color = "City",
caption = "Source: Montgomery County Police Department Open Data"
)
ggplotly(interactive_plot)
## Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Set3 is 12
## Returning the palette you asked for with that many colors
This interactive visualization I used to try to help answer my research question about how geographic position influences DUI incidents in Montgomery County. Most DUI incidents are concentrated between about -77.3 and -76.9 longitude and 39.0 to 39.2 latitude which shows how incidents are often clustered within specific areas of the county rather than being evenly distributed
Each point represents one DUI incident, while colors separate incidents by city. This suggests that location may play a role in where DUI incidents occur most frequently.