Is there an association between race and gender in police alcohol violations?

Introduction

My dataset covers police alcohol violations in Montgomery County, Maryland, and records details about the people cited for them. It has 1899 observations and 9 variables. I will use race, ethnicity, and gender; the other variables are id, age, district (the police district), division, bureau, and event date and time. I will look at how gender and race are associated among these alcohol violations. Since this dataset does not record the specific type of violation, I count every row as one violation. The dataset comes from data.montgomerycountymd.gov and is available at https://data.montgomerycountymd.gov/Public-Safety/Police-Alcohol-Violations/heap-55cn/about_data.

Data Analysis

First I clean the dataset so that the variable names are lowercase and contain no spaces, then look at its dimensions and structure. Next I use unique() on race to check for NA or Unknown values and filter out the Unknowns. In this dataset Hispanic identity is recorded in the ethnicity variable, while the race variable records those people as White. To treat Hispanic as its own category, I use mutate() with an ifelse() statement to create a new variable, race1: if ethnicity is HISPANIC, race1 is set to Hispanic; otherwise race1 keeps the original value of race. Then I use select() to keep only gender, race1, and id, and count() to get the number of violations for each combination of race and gender. Finally, geom_bar() shows the counts for each race, and position = "dodge" places the genders side by side, with race on the x-axis and the number of alcohol violations on the y-axis.

# load the libraries
library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.5.2

## Warning: package 'ggplot2' was built under R version 4.5.2

library(forcats)

setwd("C:/Users/rjzavaleta/Downloads/Data 101")
pav <- read_csv("Police_Alcohol_Violations_20260331.csv")

# cleaning
names(pav) <- tolower(names(pav))
names(pav) <- gsub(" ","_",names(pav))
names(pav) <- gsub("[(). //-]", "_", names(pav))
head(pav)

## # A tibble: 6 × 9
##      id race     gender age   ethnicity district division bureau event_date_time
##   <dbl> <chr>    <chr>  <chr> <chr>        <dbl> <chr>    <chr>  <chr>          
## 1   503 Black/A… Male   45    NON-HISP…        3 3D       PSB    2026 Mar 26 04…
## 2  1898 White    Male   18    NON-HISP…        2 2D       PSB    2026 Mar 25 01…
## 3   742 White    Female 32    HISPANIC         4 4D       PSB    2026 Mar 23 10…
## 4  1899 White    Male   61    HISPANIC         4 4D       PSB    2026 Mar 23 10…
## 5   420 White    Male   39    HISPANIC         4 4D       PSB    2026 Mar 23 10…
## 6   850 Black/A… Male   32    NON-HISP…        2 2D       PSB    2026 Mar 22 06…

summary(pav)

##        id             race              gender              age           
##  Min.   :   1.0   Length:1899        Length:1899        Length:1899       
##  1st Qu.: 475.5   Class :character   Class :character   Class :character  
##  Median : 950.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 950.0                                                           
##  3rd Qu.:1424.5                                                           
##  Max.   :1899.0                                                           
##                                                                           
##   ethnicity            district      division            bureau         
##  Length:1899        Min.   :1.00   Length:1899        Length:1899       
##  Class :character   1st Qu.:2.00   Class :character   Class :character  
##  Mode  :character   Median :4.00   Mode  :character   Mode  :character  
##                     Mean   :3.65                                        
##                     3rd Qu.:4.00                                        
##                     Max.   :8.00                                        
##                     NA's   :8                                           
##  event_date_time   
##  Length:1899       
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
##

dim(pav)

## [1] 1899    9

unique(pav$race)

## [1] "Black/African American" "White"                  "Asian"                 
## [4] "Unknown"

pav2 <- pav
pav2 <- pav2 |>
  filter(race != "Unknown") |>
  mutate(race1 = ifelse(ethnicity %in% c("HISPANIC"), "Hispanic", race))
head(pav2)

## # A tibble: 6 × 10
##      id race     gender age   ethnicity district division bureau event_date_time
##   <dbl> <chr>    <chr>  <chr> <chr>        <dbl> <chr>    <chr>  <chr>          
## 1   503 Black/A… Male   45    NON-HISP…        3 3D       PSB    2026 Mar 26 04…
## 2  1898 White    Male   18    NON-HISP…        2 2D       PSB    2026 Mar 25 01…
## 3   742 White    Female 32    HISPANIC         4 4D       PSB    2026 Mar 23 10…
## 4  1899 White    Male   61    HISPANIC         4 4D       PSB    2026 Mar 23 10…
## 5   420 White    Male   39    HISPANIC         4 4D       PSB    2026 Mar 23 10…
## 6   850 Black/A… Male   32    NON-HISP…        2 2D       PSB    2026 Mar 22 06…
## # ℹ 1 more variable: race1 <chr>
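As a small check of the recoding rule, the sketch below applies the same ifelse() call to a made-up four-row data frame. The rows and the name `toy` are invented for illustration and are not taken from the dataset, and "OTHER" is only a placeholder for any non-Hispanic label. It shows that a HISPANIC ethnicity overrides the recorded race, while every other row, including one with a missing ethnicity, keeps its original race.

```r
library(dplyr)

# Hypothetical toy rows, invented for illustration only
toy <- data.frame(
  race      = c("White", "White", "Black/African American", "Asian"),
  ethnicity = c("HISPANIC", "OTHER", "HISPANIC", NA)  # "OTHER" is a placeholder label
)

# Same rule as in the report: %in% returns FALSE for NA, so a missing
# ethnicity falls through to the original race instead of producing NA
toy_recoded <- toy |>
  mutate(race1 = ifelse(ethnicity %in% c("HISPANIC"), "Hispanic", race))

toy_recoded
```

Using `%in%` instead of `==` is what keeps the row with the missing ethnicity from turning into NA.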

pav3 <- pav2 |>
  select(gender, id, race1) |>
  count(race1, gender)
head(pav3)

## # A tibble: 6 × 3
##   race1                  gender     n
##   <chr>                  <chr>  <int>
## 1 Asian                  Female    24
## 2 Asian                  Male      35
## 3 Black/African American Female   113
## 4 Black/African American Male     361
## 5 Hispanic               Female    83
## 6 Hispanic               Male     738

# grouped bar chart: one bar per gender within each race
p1 <- ggplot(pav3, aes(x = race1, y = n, fill = gender, colour = gender)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Race", y = "Alcohol Violations",
       title = "Alcohol Violations by Race and Gender")
p1

Statistical Analysis

The null hypothesis \(H_0\) is that there is no association between race and gender among police alcohol violations; the alternative \(H_a\) is that there is an association. I use a chi-squared test of association on a matrix of the counts from the table built in the previous section. The p-value is < 2.2e-16, which is smaller than the significance level \(\alpha = 0.05\), so the association is statistically significant. I also stored the test result in an object called chi and checked the expected counts; they are all above 5, so the basic assumption of the chi-squared test is met.

\(H_0\): There is no association between race and gender among police alcohol violations.

\(H_a\): There is an association between race and gender among police alcohol violations.

pav_test <-  matrix(c(35,24,361,113,738,83,380,145), nrow = 2, byrow = FALSE,
                   dimnames = list(c("Male", "Female"), c("Asian", "Black/African American", "Hispanic","White")))
pav_test

##        Asian Black/African American Hispanic White
## Male      35                    361      738   380
## Female    24                    113       83   145
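Typing the counts in by hand is easy to get wrong, so one option is to build the same table directly from the counts. The sketch below assumes a data frame shaped like pav3 (columns race1, gender, n). It is rebuilt here as `counts` from the values printed above so the block runs on its own, and `counts` and `pav_tab` are illustrative names rather than objects from the report.

```r
# Counts copied from the matrix above; in the report, pav3 already has this shape
counts <- data.frame(
  race1  = rep(c("Asian", "Black/African American", "Hispanic", "White"), each = 2),
  gender = rep(c("Female", "Male"), times = 4),
  n      = c(24, 35, 113, 361, 83, 738, 145, 380)
)

# xtabs() sums n within each gender x race1 cell and returns a contingency table
pav_tab <- xtabs(n ~ gender + race1, data = counts)
pav_tab

# chisq.test() accepts the table as is; the row order does not change the result
chisq.test(pav_tab)
```

With the real data, `xtabs(n ~ gender + race1, data = pav3)` should give the same table without retyping any numbers.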

chi <- chisq.test(pav_test)
chi

## 
##  Pearson's Chi-squared test
## 
## data:  pav_test
## X-squared = 90.967, df = 3, p-value < 2.2e-16

#check expected counts
chi$expected

##           Asian Black/African American Hispanic    White
## Male   47.53912              381.92443 661.5189 423.0176
## Female 11.46088               92.07557 159.4811 101.9824
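The expected counts come from the margins of the table: each cell is (row total × column total) / grand total. A minimal sketch that recomputes them by hand, with the matrix repeated so the block is self-contained (`expected_by_hand` is an illustrative name):

```r
pav_test <- matrix(c(35, 24, 361, 113, 738, 83, 380, 145), nrow = 2,
                   dimnames = list(c("Male", "Female"),
                                   c("Asian", "Black/African American", "Hispanic", "White")))

# Expected count under independence: row total * column total / grand total
expected_by_hand <- outer(rowSums(pav_test), colSums(pav_test)) / sum(pav_test)
round(expected_by_hand, 2)

# Smallest expected count; the usual rule of thumb asks for at least 5
min(expected_by_hand)
```

The smallest value belongs to the Asian/Female cell, which is the one to watch when checking the assumption.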

#Chi-squared value
chi$statistic

## X-squared 
##  90.96737
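The test statistic alone does not say which cells drive the association or how strong it is. One possible follow-up, sketched here and not part of the original analysis, is to look at the Pearson residuals that chisq.test() returns and to compute Cramér's V as an effect size. The names `chi_check` and `cramers_v` are illustrative.

```r
pav_test <- matrix(c(35, 24, 361, 113, 738, 83, 380, 145), nrow = 2,
                   dimnames = list(c("Male", "Female"),
                                   c("Asian", "Black/African American", "Hispanic", "White")))
chi_check <- chisq.test(pav_test)

# Pearson residuals: (observed - expected) / sqrt(expected)
# Positive values mean more cases than independence would predict
round(chi_check$residuals, 2)

# Cramér's V = sqrt(X^2 / (N * min(rows - 1, cols - 1)))
cramers_v <- sqrt(unname(chi_check$statistic) /
                    (sum(pav_test) * min(dim(pav_test) - 1)))
cramers_v
```

For this table V is roughly 0.22, and the residuals are positive for both Hispanic males and Asian females, meaning both cells hold more cases than independence would predict.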

Conclusion

Based on the p-value, we reject the null hypothesis: there is a significant association between race and gender in police-reported alcohol violations. The bar graph shows that Hispanic males account for the most recorded violations (738), which is also more than the roughly 662 expected if race and gender were independent. Asian females have the fewest recorded violations (24), although that is more than the roughly 11 expected. These counts are not adjusted for the size of each group in the county, so they describe who appears in the records rather than how likely any group is to commit a violation. The results suggest that the gender makeup of recorded alcohol violations differs by race. Further research could add the specific types of alcohol violations. Combined with a numeric variable such as age, that could allow an ANOVA test across violation types. Further research could also examine whether some of these reports are false and whether gender or race plays a role in that.

References

Dataset : https://data.montgomerycountymd.gov/Public-Safety/Police-Alcohol-Violations/heap-55cn/about_data

Visualization: From Bar Charts and Diamonds from Prof. Saidi.

Project 2 - DATA 101

Ricardo Zavaleta

2026-04-07