Final Project DATA 110

Image

Driving Under the Influence

Introduction

This project analyzes DUI (Driving Under the Influence) incidents in Montgomery County, Maryland by using data from the Montgomery County Police Department Open Data Portal. The dataset contains official DUI incident reports which are recorded by law enforcement and includes information about location, geography, and incident severity.

This dataset is a structured collection of real DUI incident reports from the Montgomery County Police Department. Each entry represents a specific crash or incident where alcohol or drug impairment was involved and officially documented by law enforcement.

The dataset I have it includes key details about where the incident happened (city and specific place type like roadway or intersection), the exact geographic coordinates (latitude and longitude), and how severe the incident was, often measured by the number of people injured or affected.

It also comes directly from police records, the data reflects real-world traffic safety conditions in Montgomery County. It helps show patterns in where DUI incidents tend to happen and whether certain locations or areas are associated with more severe outcomes.

Background Research

Maryland law clearly defines DUI and DWI offenses and outlines as strict legal consequences for reckless driving.

According to the Maryland Motor Vehicle Administration, a blood alcohol concentration (BAC) of 0.08 or higher is considered legally impaired for drivers over 21, while drivers under 21 are prohibited from having any alcohol in their system.

If a driver is arrested for DUI, they may face immediate license suspension, fines, and long-term penalties as well as the data on their driving record or mandatory ignition interlock programs.

The severity of penalties increases such as second DUI convictions which leads to longer jail sentences, higher fines, and longer license revocations. Refusing a chemical test can result in automatic license suspension which ranges from 270 days to two years depending on the type of prior offense.

These strict laws show how DUI enforcement in Maryland is heavily regulated and that consequences escalate based on the behavior and prior history.

This background is very important for my analysis because it shows that DUI severity is not only a data variable on my dataset but an outcome which is influenced by policies and behavior of the driver.

Source:

Maryland Motor Vehicle Administration. “Driving Under the Influence (DUI) & Driving While Impaired (DWI).” MVA Maryland,

https://mva.maryland.gov/your-mva-guide/drivers-revoked-or-suspended-licenses/driving-under-influence-dui

Research Question

How do location (City and Place type) and geographic position (Latitude and Longitude) influence the severity of DUI incidents which are measured by the number of victims, in Montgomery County?

Dataset:

The data was collected by the Montgomery County Police Department (MCPD) and published through the Montgomery County Open Data Portal. Each row represents a single DUI incident recorded by police officers.

City (categorical) - Location of the incident such as Rockville or Germantown

Place (categorical) - The type of envionment and the place where the incident occured

Victims (quantative) - The Number of individuals who were involved in the incident

Latitude (quantative) - Numerical geographic coordinate representing north-south position. Higher values mean farther north in the county.

Longitude (quantative) - Numerical geographic coordinates which represent east and west positions. More negative values generally means more west

Loading the Libraries and Dataset

# In this chunk I am going to be loading my libraries for my data cleaning, visualization, and analysis

# I am going to use tidyverse for data wrangling and ggplot2 visualizations.

# I am going to be using plotly to create interactive visualizations.

library(readr)
library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.5.2

## Warning: package 'ggplot2' was built under R version 4.5.2

## Warning: package 'dplyr' was built under R version 4.5.2

## Warning: package 'forcats' was built under R version 4.5.2

## Warning: package 'lubridate' was built under R version 4.5.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ purrr     1.1.0
## ✔ forcats   1.0.1     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(plotly)

## Warning: package 'plotly' was built under R version 4.5.3

## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout

dui <- read_csv("Driving_Under_the_Influence_from_Crime_Data_20260506.csv")

## Rows: 19983 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): Dispatch Date / Time, Crime Name1, Crime Name2, Crime Name3, Polic...
## dbl  (9): Incident ID, Offence Code, CR Number, NIBRS Code, Victims, Zip Cod...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Exploratory Data Analysis

# Citations/Disclaimer: This code follows what learned from course/class notes

# I am using the function head which shows the first few rows of the dataset so I can understand the data

# I am using the function dim which displays the number of rows and columns 

# I am using the function colnames it basically helps me figure out the variables that I need to use 

# I am using structure to help me see the structure of the dataset 

head(dui)

## # A tibble: 6 × 26
##   `Incident ID` `Offence Code` `CR Number` `Dispatch Date / Time` `NIBRS Code`
##           <dbl>          <dbl>       <dbl> <chr>                         <dbl>
## 1     201573624           5404   260019134 05/04/2026 02:31:08 AM           90
## 2     201573490           5404   260018974 05/02/2026 07:50:31 PM           90
## 3     201573622           5404   260018878 05/02/2026 02:50:28 AM           90
## 4     201573460           5404   260018836 05/01/2026 07:26:27 PM           90
## 5     201573298           5404   260018696 05/01/2026 02:12:06 AM           90
## 6     201573383           5404   260018687 04/30/2026 11:13:01 PM           90
## # ℹ 21 more variables: Victims <dbl>, `Crime Name1` <chr>, `Crime Name2` <chr>,
## #   `Crime Name3` <chr>, `Police District Name` <chr>, `Block Address` <chr>,
## #   City <chr>, State <chr>, `Zip Code` <dbl>, Agency <chr>, Place <chr>,
## #   Sector <chr>, Beat <chr>, PRA <chr>, `Address Number` <dbl>,
## #   Start_Date_Time <chr>, End_Date_Time <chr>, Latitude <dbl>,
## #   Longitude <dbl>, `Police District Number` <chr>, Location <chr>

dim(dui)

## [1] 19983    26

colnames(dui)

##  [1] "Incident ID"            "Offence Code"           "CR Number"             
##  [4] "Dispatch Date / Time"   "NIBRS Code"             "Victims"               
##  [7] "Crime Name1"            "Crime Name2"            "Crime Name3"           
## [10] "Police District Name"   "Block Address"          "City"                  
## [13] "State"                  "Zip Code"               "Agency"                
## [16] "Place"                  "Sector"                 "Beat"                  
## [19] "PRA"                    "Address Number"         "Start_Date_Time"       
## [22] "End_Date_Time"          "Latitude"               "Longitude"             
## [25] "Police District Number" "Location"

str(dui)

## spc_tbl_ [19,983 × 26] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Incident ID           : num [1:19983] 2.02e+08 2.02e+08 2.02e+08 2.02e+08 2.02e+08 ...
##  $ Offence Code          : num [1:19983] 5404 5404 5404 5404 5404 ...
##  $ CR Number             : num [1:19983] 2.6e+08 2.6e+08 2.6e+08 2.6e+08 2.6e+08 ...
##  $ Dispatch Date / Time  : chr [1:19983] "05/04/2026 02:31:08 AM" "05/02/2026 07:50:31 PM" "05/02/2026 02:50:28 AM" "05/01/2026 07:26:27 PM" ...
##  $ NIBRS Code            : num [1:19983] 90 90 90 90 90 90 90 90 90 90 ...
##  $ Victims               : num [1:19983] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Crime Name1           : chr [1:19983] "Crime Against Society" "Crime Against Society" "Crime Against Society" "Crime Against Society" ...
##  $ Crime Name2           : chr [1:19983] "Driving Under the Influence" "Driving Under the Influence" "Driving Under the Influence" "Driving Under the Influence" ...
##  $ Crime Name3           : chr [1:19983] "DRIVING UNDER THE INFLUENCE LIQUOR" "DRIVING UNDER THE INFLUENCE LIQUOR" "DRIVING UNDER THE INFLUENCE LIQUOR" "DRIVING UNDER THE INFLUENCE LIQUOR" ...
##  $ Police District Name  : chr [1:19983] "GERMANTOWN" "MONTGOMERY VILLAGE" "MONTGOMERY VILLAGE" "WHEATON" ...
##  $ Block Address         : chr [1:19983] NA NA NA NA ...
##  $ City                  : chr [1:19983] "GERMANTOWN" "MONTGOMERY VILLAGE" "GAITHERSBURG" "SILVER SPRING" ...
##  $ State                 : chr [1:19983] "MD" "MD" "MD" "MD" ...
##  $ Zip Code              : num [1:19983] 20874 20886 20877 20906 20876 ...
##  $ Agency                : chr [1:19983] "MCPD" "MCPD" "GPD" "MCPD" ...
##  $ Place                 : chr [1:19983] "Street - In vehicle" "Street - In vehicle" "Street - In vehicle" "Street - In vehicle" ...
##  $ Sector                : chr [1:19983] "N" "R" "P" "L" ...
##  $ Beat                  : chr [1:19983] "5N1" "6R2" "6P3" "4L1" ...
##  $ PRA                   : chr [1:19983] "595" "422" "419" "524" ...
##  $ Address Number        : num [1:19983] NA NA NA NA NA NA NA NA NA NA ...
##  $ Start_Date_Time       : chr [1:19983] "05/04/2026 02:31:00 AM" "05/02/2026 07:50:00 PM" "05/02/2026 02:50:00 AM" "05/01/2026 07:26:00 PM" ...
##  $ End_Date_Time         : chr [1:19983] NA NA NA NA ...
##  $ Latitude              : num [1:19983] 39.2 39.2 39.1 39.1 39.2 ...
##  $ Longitude             : num [1:19983] -77.3 -77.2 -77.2 -77.1 -77.2 ...
##  $ Police District Number: chr [1:19983] "5D" "6D" "6D" "4D" ...
##  $ Location              : chr [1:19983] "POINT (-77.2537 39.1751)" "POINT (-77.2142 39.1506)" "POINT (-77.1684 39.1469)" "POINT (-77.0773 39.0527)" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `Incident ID` = col_double(),
##   ..   `Offence Code` = col_double(),
##   ..   `CR Number` = col_double(),
##   ..   `Dispatch Date / Time` = col_character(),
##   ..   `NIBRS Code` = col_double(),
##   ..   Victims = col_double(),
##   ..   `Crime Name1` = col_character(),
##   ..   `Crime Name2` = col_character(),
##   ..   `Crime Name3` = col_character(),
##   ..   `Police District Name` = col_character(),
##   ..   `Block Address` = col_character(),
##   ..   City = col_character(),
##   ..   State = col_character(),
##   ..   `Zip Code` = col_double(),
##   ..   Agency = col_character(),
##   ..   Place = col_character(),
##   ..   Sector = col_character(),
##   ..   Beat = col_character(),
##   ..   PRA = col_character(),
##   ..   `Address Number` = col_double(),
##   ..   Start_Date_Time = col_character(),
##   ..   End_Date_Time = col_character(),
##   ..   Latitude = col_double(),
##   ..   Longitude = col_double(),
##   ..   `Police District Number` = col_character(),
##   ..   Location = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Data Cleaning

# Citations/Disclaimer: This code and analysis follows what learned from course/class notes

# I am cleaning the dataset by converting key variables into the correct format
# and removing missing values so my analysis and models are accurate.

# I am focusing on Latitude, Longitude, Victims, City, and Place since they are important for my research question about DUI location patterns.

# I am using the function colSums in order to check the missing values which are in the dataset



dui_clean <- dui %>%
  mutate(
    Latitude = as.numeric(gsub(",", "", Latitude)),
    Longitude = as.numeric(gsub(",", "", Longitude)),
    Victims = as.numeric(gsub(",", "", Victims))
    
  ) %>%
  
  filter(
    !is.na(City),
    !is.na(Place),
    !is.na(Victims),
    !is.na(Latitude),
    !is.na(Longitude)
  )

colSums(is.na(dui_clean))

##            Incident ID           Offence Code              CR Number 
##                      0                      0                      0 
##   Dispatch Date / Time             NIBRS Code                Victims 
##                   3631                      0                      0 
##            Crime Name1            Crime Name2            Crime Name3 
##                      0                      0                      0 
##   Police District Name          Block Address                   City 
##                     56                  10663                      0 
##                  State               Zip Code                 Agency 
##                    249                    232                      0 
##                  Place                 Sector                   Beat 
##                      0                      0                      0 
##                    PRA         Address Number        Start_Date_Time 
##                      0                  10661                      0 
##          End_Date_Time               Latitude              Longitude 
##                  14717                      0                      0 
## Police District Number               Location 
##                      0                      0

Summary Statistics

# Citations/Disclaimer: This code and analysis follows what learned from course/class notes

# I am using the summary() function to get the minimum, median, mean and the max values for key variables

# This also helps me have understanding of the distribution of DUI severity and geographic variables

# I am calculating the average number of victims per DUI incident

# I am also looking for the maximum number of victims in a single DUI incident

summary(dui_clean$Victims)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1       1       1       1       1       1

summary(dui_clean$Latitude)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   39.04   39.09   38.41   39.14   39.34

summary(dui_clean$Longitude)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -77.52  -77.20  -77.12  -75.78  -77.05    0.00

mean(dui_clean$Victims, na.rm = TRUE)

## [1] 1

max(dui_clean$Victims, na.rm = TRUE)

## [1] 1

# Citations/Disclaimer: This code and analysis follows what learned from course/class notes

# I am grouping the data by City to see which locations have the highest number of DUI incidents.
# This helps me identify geographic patterns across Montgomery County.

dui_city <- dui_clean %>%
  group_by(City) %>%
  summarize(count = n())

dui_city

## # A tibble: 87 × 2
##    City        count
##    <chr>       <int>
##  1 20877           1
##  2 4               1
##  3 ADELPHI         1
##  4 ASHTON         25
##  5 ASPEN HILL      8
##  6 BARNESVILLE     1
##  7 BEALLSVILLE     2
##  8 BEALSVILLE      1
##  9 BEHESDA         1
## 10 BELTSVILLE      1
## # ℹ 77 more rows

# I am also grouping the data by Place to understand the types of locations where DUIs occur most often.
# This helps me see whether certain environments (like streets or residential areas) are more common.

dui_place <- dui_clean %>%
  group_by(Place) %>%
  summarize(count = n())

dui_place

## # A tibble: 55 × 2
##    Place                        count
##    <chr>                        <int>
##  1 Auto Dealership                  1
##  2 Bank - ATM                       5
##  3 Bar/Night Club                  11
##  4 Church/Synagogue/Temple          2
##  5 Commercial - Office Building     6
##  6 Construction Site                2
##  7 Convenience Store               14
##  8 Field/ Open Space                1
##  9 Gas Station                     47
## 10 Golf Course                      2
## # ℹ 45 more rows

Multiple Linear Regression

# I am building a regression model in order to determine how location and environment predict the DUI severity.
# I am also analyzing p-values and adjusted R-squared to evaluate model performance.

model <- lm(Victims ~ Latitude + Longitude, data = dui_clean)

summary(model)

## 
## Call:
## lm(formula = Victims ~ Latitude + Longitude, data = dui_clean)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -8.000e-15 -4.000e-15 -3.000e-15 -3.000e-15  6.292e-11 
## 
## Coefficients:
##              Estimate Std. Error   t value Pr(>|t|)    
## (Intercept) 1.000e+00  2.390e-14 4.184e+13   <2e-16 ***
## Latitude    2.147e-14  6.534e-14 3.290e-01    0.742    
## Longitude   1.084e-14  3.312e-14 3.270e-01    0.744    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.452e-13 on 19974 degrees of freedom
## Multiple R-squared:    0.5,  Adjusted R-squared:  0.4999 
## F-statistic:  9987 on 2 and 19974 DF,  p-value: < 2.2e-16

Regression Equation

Victims = 1 + (2.147 * 10^-14) Latitude + 1.084 * 10^-14 Longitude

Regression Analysis

I found that the model is statistically significant overall, with an F-statistic of 9987 and a p-value less than 2.2e−16. The adjusted R² is 0.4999, which means I can explain about 50% of the variation in Victims using Latitude and Longitude.

However, when I look at individual predictors, neither Latitude p = 0.742 or Longitude p = 0.744 is statistically significant. The p-value represents the probability of observing results as extreme if the predictor actually had no relationship with Victims. Since these p-values are much larger than 0.05 this is why we fail to reject the null hypothesis for both variables.

Even though the model is statistically significant, the individual variables do not show strong linear relationships with Victims, and the significance is influenced by the large sample size rather than the effects of the predictors.

Model Diagnostic Check

# I am generating all default diagnostic plots for my regression model.
# This lets me quickly check whether my model meets key assumptions and if any data points are overly influential.


plot(model)

Pivot for Tableu (Using what learned in Assingment 9)

# I am creating a pivot table to summarize DUI incidents by City and Place to identify location patterns across Montgomery County.
# I am grouping the data to count how many incidents occur in each City–Place combination.
# I am reshaping the data so each Place becomes a column and each City becomes a row for easier comparison in Tableau.
# I am replacing missing values with 0 so the table is complete.

dui_pivot_city_place <- dui_clean %>%
  group_by(City, Place) %>%
  summarise(count = n(), .groups = "drop") %>%
  pivot_wider(
    names_from = Place,
    values_from = count,
    values_fill = 0
  )

# I am renaming variables to make them clearer and more readable for Tableau analysis.

# City becomes City_Name and Place becomes Location_Type for better interpretation.

# I am exporting the cleaned dataset as a CSV file so it can be used in Tableau for visualization.

# I am removimng row names in order to keep the file clean and properly formatted.

dui_final <- dui_clean |>
  rename(
    City_Name = City,
    Location_Type = Place
  )

write.csv(dui_final, "dui_final.csv", row.names = FALSE)

Tableu Visualization 1

https://public.tableau.com/app/profile/arnav.shah7897/viz/ArnavShahFinalProjectTableauVisualization/Sheet4?publish=yes

Interactive Plot

# Citations/Disclaimer: This code follows what we learned from class notes

# In this chunk I created a scatterplot in order to visualize DUI locations in Montgomery County

# Longitude and Latitude are basically used in order to map geographic patterns of incidents

# I am coloring points by City in order to compare where incidents are concentrated

interactive_plot <- ggplot(dui_clean, aes(
  x = Longitude,
  y = Latitude,
  color = City
)) +
  geom_point(alpha = 0.8, size = 1.5) +
  
  scale_color_brewer(palette = "Set3") +
  
  theme_minimal() +
  
  labs(
    title = "Interactive DUI Incident Location Map",
    x = "Longitude",
    y = "Latitude",
    color = "City",
    caption = "Source: Montgomery County Police Department Open Data"
  )

ggplotly(interactive_plot)

## Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Set3 is 12
## Returning the palette you asked for with that many colors

Interactive Plot Analysis

This interactive visualization I used to try to help answer my research question about how geographic position influences DUI incidents in Montgomery County. Most DUI incidents are concentrated between about -77.3 and -76.9 longitude and 39.0 to 39.2 latitude which shows how incidents are often clustered within specific areas of the county rather than being evenly distributed

Each point represents one DUI incident, while colors separate incidents by city. This suggests that location may play a role in where DUI incidents occur most frequently.