My dataset covers police alcohol violations in Montgomery County, Maryland, and gives details about the people who committed these violations. It has 1899 observations and 9 variables. The variables I will be using are race, ethnicity, and gender; the other variables are id, age, district (the police district), division, bureau, and event date time. I will be looking at how gender and race are associated among these alcohol violations. Since this dataset doesn’t list the specific type of violation, I will count every record as one violation. I got my dataset from data.montgomerycountymd.gov, and it can be found at https://data.montgomerycountymd.gov/Public-Safety/Police-Alcohol-Violations/heap-55cn/about_data.
First I will clean the dataset so that the variable names are lowercase and have no spaces, and then look at its dimensions and structure. Next I will use unique on race to see if there are any NAs or Unknowns, filter out the Unknowns, and use mutate to bring Hispanic into the race variable. In this dataset Hispanic is recorded under ethnicity, while the race variable lists those same people as White. To change that I will use mutate with an ifelse statement to create a new variable, race1. The ifelse statement checks whether ethnicity says HISPANIC: if it does, race1 is set to Hispanic, and if it doesn’t, race1 keeps the original race value. Then I will use select to keep only gender, race1, and id, and use count to get the number of violations for each race, separated by gender. Using geom_bar I can show the counts for each race, and with position = "dodge" I can show the genders side by side, with race on the x-axis and the total number of alcohol violations on the y-axis.
# load the libraries
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.5.2
## Warning: package 'ggplot2' was built under R version 4.5.2
library(forcats)
setwd("C:/Users/rjzavaleta/Downloads/Data 101")
pav <- read_csv("Police_Alcohol_Violations_20260331.csv")
# clean the column names: lowercase, then replace spaces and punctuation with underscores
names(pav) <- tolower(names(pav))
names(pav) <- gsub(" ","_",names(pav))
names(pav) <- gsub("[(). //-]", "_", names(pav))
head(pav)
## # A tibble: 6 × 9
##      id race     gender age   ethnicity district division bureau event_date_time
##   <dbl> <chr>    <chr>  <chr> <chr>        <dbl> <chr>    <chr>  <chr>
## 1   503 Black/A… Male   45    NON-HISP…        3 3D       PSB    2026 Mar 26 04…
## 2  1898 White    Male   18    NON-HISP…        2 2D       PSB    2026 Mar 25 01…
## 3   742 White    Female 32    HISPANIC         4 4D       PSB    2026 Mar 23 10…
## 4  1899 White    Male   61    HISPANIC         4 4D       PSB    2026 Mar 23 10…
## 5   420 White    Male   39    HISPANIC         4 4D       PSB    2026 Mar 23 10…
## 6   850 Black/A… Male   32    NON-HISP…        2 2D       PSB    2026 Mar 22 06…
summary(pav)
##        id             race              gender              age
##  Min.   :   1.0   Length:1899        Length:1899        Length:1899
##  1st Qu.: 475.5   Class :character   Class :character   Class :character
##  Median : 950.0   Mode  :character   Mode  :character   Mode  :character
##  Mean   : 950.0
##  3rd Qu.:1424.5
##  Max.   :1899.0
##
##   ethnicity            district      division            bureau
##  Length:1899        Min.   :1.00   Length:1899        Length:1899
##  Class :character   1st Qu.:2.00   Class :character   Class :character
##  Mode  :character   Median :4.00   Mode  :character   Mode  :character
##                     Mean   :3.65
##                     3rd Qu.:4.00
##                     Max.   :8.00
##                     NA's   :8
##  event_date_time
##  Length:1899
##  Class :character
##  Mode  :character
##
##
##
##
dim(pav)
## [1] 1899 9
unique(pav$race)
## [1] "Black/African American" "White" "Asian"
## [4] "Unknown"
pav2 <- pav |>
  filter(race != "Unknown") |>
  mutate(race1 = ifelse(ethnicity %in% c("HISPANIC"), "Hispanic", race))
head(pav2)
## # A tibble: 6 × 10
##      id race     gender age   ethnicity district division bureau event_date_time
##   <dbl> <chr>    <chr>  <chr> <chr>        <dbl> <chr>    <chr>  <chr>
## 1   503 Black/A… Male   45    NON-HISP…        3 3D       PSB    2026 Mar 26 04…
## 2  1898 White    Male   18    NON-HISP…        2 2D       PSB    2026 Mar 25 01…
## 3   742 White    Female 32    HISPANIC         4 4D       PSB    2026 Mar 23 10…
## 4  1899 White    Male   61    HISPANIC         4 4D       PSB    2026 Mar 23 10…
## 5   420 White    Male   39    HISPANIC         4 4D       PSB    2026 Mar 23 10…
## 6   850 Black/A… Male   32    NON-HISP…        2 2D       PSB    2026 Mar 22 06…
## # ℹ 1 more variable: race1 <chr>
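As a quick check on the recode step, here is a minimal sketch on a made-up five-row tibble. The object `toy` and its values are hypothetical and are not rows from the dataset. It illustrates that `%in%` treats a missing ethnicity as not Hispanic and keeps the original race, whereas `==` would return NA for that row.

```r
library(dplyr)

# Hypothetical rows, only to illustrate the recode logic
toy <- tibble(
  race      = c("White", "White", "Black/African American", "Unknown", "Asian"),
  ethnicity = c("HISPANIC", "NON-HISPANIC", "NON-HISPANIC", "HISPANIC", NA)
)

toy2 <- toy |>
  filter(race != "Unknown") |>   # drops the "Unknown" row
  mutate(race1 = ifelse(ethnicity %in% "HISPANIC", "Hispanic", race))

# Cross-check the original race and ethnicity against the new race1
count(toy2, race, ethnicity, race1)
```

Running the same `count(pav2, race, ethnicity, race1)` on the real data would show how many rows were moved into the Hispanic category.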
pav3 <- pav2 |>
  select(gender, id, race1) |>
  count(race1, gender)
head(pav3)
## # A tibble: 6 × 3
##   race1                  gender     n
##   <chr>                  <chr>  <int>
## 1 Asian                  Female    24
## 2 Asian                  Male      35
## 3 Black/African American Female   113
## 4 Black/African American Male     361
## 5 Hispanic               Female    83
## 6 Hispanic               Male     738
p1 <- ggplot(pav3, aes(x = race1, y = n, fill = gender)) +
  geom_bar(aes(colour = gender), stat = "identity", position = "dodge") +
  labs(x = "Race", y = "Alcohol Violations",
       title = "Alcohol Violations by Race and Gender")
p1
The \(H_0\) is: there is no association between race and gender in police-reported alcohol violations. The \(H_a\) is: there is an association between race and gender in police-reported alcohol violations. I will be using a chi-squared test of association, run on a matrix built from the count table I made a few chunks earlier. The test gives a p-value < 2.2e-16, which is smaller than the alpha of 0.05, so the association is statistically significant. I also saved the test result in an object called chi and checked the expected counts, which are all above 5, meaning the chi-squared test is valid and its basic assumption is met.
\(H_0\): There is no association between race and gender in police-reported alcohol violations.
\(H_a\): There is an association between race and gender in police-reported alcohol violations.
pav_test <- matrix(c(35,24,361,113,738,83,380,145), nrow = 2, byrow = FALSE,
dimnames = list(c("Male", "Female"), c("Asian", "Black/African American", "Hispanic","White")))
pav_test
##        Asian Black/African American Hispanic White
## Male      35                    361      738   380
## Female    24                    113       83   145
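Typing the counts into matrix() by hand is easy to get wrong, so here is a hedged sketch of building the same table from a table of counts instead. The data frame `counts` is a hypothetical stand-in for pav3, filled with the eight counts from the matrix above. With the real data, `xtabs(n ~ gender + race1, data = pav3)` should produce the same table.

```r
# Hypothetical stand-in for pav3: the eight counts from pav_test
counts <- data.frame(
  race1  = rep(c("Asian", "Black/African American", "Hispanic", "White"), each = 2),
  gender = rep(c("Male", "Female"), times = 4),
  n      = c(35, 24, 361, 113, 738, 83, 380, 145)
)

# Cross-tabulate: rows are gender, columns are race1, cells are the counts
pav_tab <- xtabs(n ~ gender + race1, data = counts)
pav_tab

# chisq.test() accepts the table directly; row order does not change the result
chi_tab <- chisq.test(pav_tab)
chi_tab
```

xtabs() sorts the rows alphabetically (Female, then Male), which does not affect the chi-squared statistic.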
chi <- chisq.test(pav_test)
chi
##
## Pearson's Chi-squared test
##
## data: pav_test
## X-squared = 90.967, df = 3, p-value < 2.2e-16
#check expected counts
chi$expected
##           Asian Black/African American Hispanic    White
## Male   47.53912              381.92443 661.5189 423.0176
## Female 11.46088               92.07557 159.4811 101.9824
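For reference, each expected count comes from the row and column totals of pav_test under the assumption that race and gender are independent. Here is a worked check for the Male/Hispanic cell, using only the counts in the matrix (Male row total 1514, Hispanic column total 821, overall total 1879):

\[
E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{N},
\qquad
E_{\text{Male, Hispanic}} = \frac{1514 \times 821}{1879} \approx 661.52
\]

The test statistic then adds up the squared differences between observed counts \(O_{ij}\) and expected counts \(E_{ij}\), scaled by the expected counts:

\[
\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
df = (2 - 1)(4 - 1) = 3
\]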
#Chi-squared value
chi$statistic
## X-squared
## 90.96737
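To see which cells contribute most to the chi-squared statistic, one option is to look at the Pearson residuals, (observed - expected) / sqrt(expected), which chisq.test() returns in its `residuals` element. This is a minimal sketch that re-creates the pav_test matrix so the chunk runs on its own:

```r
# Same counts as pav_test
pav_test <- matrix(c(35, 24, 361, 113, 738, 83, 380, 145), nrow = 2,
                   dimnames = list(c("Male", "Female"),
                                   c("Asian", "Black/African American", "Hispanic", "White")))
chi <- chisq.test(pav_test)

# Pearson residuals: positive means more cases than expected under independence
round(chi$residuals, 2)

# The squared residuals add up to the chi-squared statistic
sum(chi$residuals^2)
```

Cells with large positive or negative residuals are the ones that depart most from what independence would predict.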
Based on the p-value we reject the null, meaning there is a significant association between race and gender in police-reported alcohol violations. The bar graph shows that Hispanic males account for the most recorded violations, and their observed count (738) is higher than the expected count (about 662). Asian females account for the fewest recorded violations (24). These results suggest that the gender breakdown of recorded alcohol violations differs by race. Further research on this dataset could add the different types of alcohol violations, which could allow for additional tests such as an ANOVA. There could also be research into whether some of these reports of alcohol violations are false and whether gender or race is a factor in that.
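Because the groups differ a lot in size, raw counts alone can be hard to compare. Here is a small sketch of the gender share within each race, using the counts from pav_test (re-created so the chunk is self-contained). The name `shares` is only for illustration.

```r
# Same counts as pav_test
pav_test <- matrix(c(35, 24, 361, 113, 738, 83, 380, 145), nrow = 2,
                   dimnames = list(c("Male", "Female"),
                                   c("Asian", "Black/African American", "Hispanic", "White")))

# margin = 2 divides each cell by its column total, so each race sums to 1
shares <- prop.table(pav_test, margin = 2)
round(shares, 3)
```

These are shares of the recorded violations within each race, not rates in the county population.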
Dataset: https://data.montgomerycountymd.gov/Public-Safety/Police-Alcohol-Violations/heap-55cn/about_data
Visualization: based on Bar Charts and Diamonds from Prof. Saidi.