Ames, Iowa, USA is the home of Iowa State University, a land grant university with over 36,000 students. By comparison, the city of Ames, Iowa, itself only has about 65,000 residents. As with any other college town, Ames has had its fair share of alcohol-related incidents. (For example, Google ‘VEISHEA riots 2014’.) We will take a look at some breath alcohol test data from Ames that is published by the State of Iowa.
The data contains 1,556 readings from breath alcohol tests administered by the Ames and Iowa State University Police Departments from January 2013 to December 2017. The columns in this data set are year, month, day, hour, location, gender, Res1, Res2.
# Load the packages
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
library(ggplot2)
# Read the data into your workspace
ba_data <- read_csv('datasets/breath_alcohol_ames.csv')
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## year = col_double(),
## month = col_double(),
## day = col_double(),
## hour = col_double(),
## location = col_character(),
## gender = col_character(),
## Res1 = col_double(),
## Res2 = col_double()
## )
# Quickly inspect the data
head(ba_data)
## # A tibble: 6 x 8
## year month day hour location gender Res1 Res2
## <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
## 1 2017 12 17 1 Ames PD M 0.046 0.046
## 2 2017 12 14 3 ISU PD F 0.121 0.12
## 3 2017 12 10 5 ISU PD F 0.068 0.067
## 4 2017 12 10 3 ISU PD F 0.077 0.077
## 5 2017 12 9 2 ISU PD M 0.085 0.084
## 6 2017 12 9 1 Ames PD M 0.16 0.161
# Obtain counts for each year
ba_year <- ba_data %>%
group_by(year) %>%
count()
There are two police departments in the data set: the Iowa State University Police Department and the Ames Police Department. Which one administers more breathalyzer tests?
# Count the totals for each department
pds <- ba_data %>%
group_by(location) %>%
count()
pds
## # A tibble: 2 x 2
## # Groups: location [2]
## location n
## <chr> <int>
## 1 Ames PD 616
## 2 ISU PD 940
We all know that “nothing good happens after 2am.” Thus, there are inevitably some times of the day when breath alcohol tests, especially in a college town like Ames, are most and least common. Which hours of the day have the most and least breathalyzer tests?
# Count by hour and arrange by descending frequency
# hourly <- ba_data %>%
# group_by(hour) %>%
# count() %>%
# arrange(desc(n))
hourly <- ba_data %>%
count(hour, sort = TRUE)
hourly
## # A tibble: 24 x 2
## hour n
## <dbl> <int>
## 1 2 417
## 2 3 364
## 3 1 219
## 4 4 124
## 5 0 98
## 6 23 51
## 7 5 48
## 8 20 31
## 9 22 29
## 10 21 26
## # … with 14 more rows
# Use a geom_ to create the appropriate bar chart
ggplot(hourly, aes(x = hour, weight = n)) +
geom_bar()
Now that we have discovered which time of day is most common for breath alcohol tests, we will determine which time of the year has the most breathalyzer tests. Which month will have the most recorded tests?
# Count by month and arrange by descending frequency
monthly <- ba_data %>%
count(month, sort = TRUE)
# Make month a factor
monthly$month <- as.factor(monthly$month)
# Use a geom_ to create the appropriate bar chart
ggplot(monthly, aes(x = month,weight = n)) +
geom_bar()
# ggplot(monthly, aes(x = month, y = n))+
# geom_col()
When we think of (binge) drinking in college towns in America, we usually think of something like this image at the left. And so, one might suspect that breath alcohol tests are given to men more often than women and that men drink more than women.
# Count by gender
ba_data %>%
count(gender)
## # A tibble: 3 x 2
## gender n
## <chr> <int>
## 1 F 425
## 2 M 1102
## 3 <NA> 29
# Create a dataset with no NAs in gender
clean_gender <- ba_data %>%
filter(!is.na(gender))
clean_gender
## # A tibble: 1,527 x 8
## year month day hour location gender Res1 Res2
## <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
## 1 2017 12 17 1 Ames PD M 0.046 0.046
## 2 2017 12 14 3 ISU PD F 0.121 0.12
## 3 2017 12 10 5 ISU PD F 0.068 0.067
## 4 2017 12 10 3 ISU PD F 0.077 0.077
## 5 2017 12 9 2 ISU PD M 0.085 0.084
## 6 2017 12 9 1 Ames PD M 0.16 0.161
## 7 2017 12 7 3 Ames PD M 0.131 0.131
## 8 2017 12 4 1 ISU PD M 0 0
## 9 2017 12 3 1 Ames PD M 0.091 0.09
## 10 2017 12 3 0 Ames PD M 0.095 0.095
## # … with 1,517 more rows
# Create a mean test result variable and save as mean_bas
mean_bas <- clean_gender %>%
mutate(meanRes = (Res1+Res2)/2)
mean_bas
## # A tibble: 1,527 x 9
## year month day hour location gender Res1 Res2 meanRes
## <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2017 12 17 1 Ames PD M 0.046 0.046 0.046
## 2 2017 12 14 3 ISU PD F 0.121 0.12 0.121
## 3 2017 12 10 5 ISU PD F 0.068 0.067 0.0675
## 4 2017 12 10 3 ISU PD F 0.077 0.077 0.077
## 5 2017 12 9 2 ISU PD M 0.085 0.084 0.0845
## 6 2017 12 9 1 Ames PD M 0.16 0.161 0.160
## 7 2017 12 7 3 Ames PD M 0.131 0.131 0.131
## 8 2017 12 4 1 ISU PD M 0 0 0
## 9 2017 12 3 1 Ames PD M 0.091 0.09 0.0905
## 10 2017 12 3 0 Ames PD M 0.095 0.095 0.095
## # … with 1,517 more rows
# Create side-by-side boxplots to compare the mean blood alcohol levels of men and women
ggplot(mean_bas, aes(x = gender, y = meanRes)) +
geom_boxplot()
In the USA, it is illegal to drive with a blood alcohol concentration (BAC) above 0.08%. This is the case for all 50 states. Assuming everyone tested in our data was driving (though we have no way of knowing this from the data), if either of the results (Res1, Res2) are above 0.08, the person would be charged with DUI (driving under the influence).
# Filter the data
duis <- ba_data %>%
filter(Res1 > 0.08 | Res2 > 0.08 )
duis
## # A tibble: 1,159 x 8
## year month day hour location gender Res1 Res2
## <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
## 1 2017 12 14 3 ISU PD F 0.121 0.12
## 2 2017 12 9 2 ISU PD M 0.085 0.084
## 3 2017 12 9 1 Ames PD M 0.16 0.161
## 4 2017 12 7 3 Ames PD M 0.131 0.131
## 5 2017 12 3 1 Ames PD M 0.091 0.09
## 6 2017 12 3 0 Ames PD M 0.095 0.095
## 7 2017 12 2 1 Ames PD M 0.155 0.155
## 8 2017 11 29 3 ISU PD F 0.171 0.171
## 9 2017 11 17 18 ISU PD M 0.176 0.176
## 10 2017 11 11 2 ISU PD M 0.094 0.093
## # … with 1,149 more rows
# Proportion of tests that would have resulted in a DUI
p_dui <- nrow(duis) / nrow(ba_data)
p_dui
## [1] 0.7448586
We previously saw that 2 a.m. is the most common time of day for breathalyzer tests to be administered, and August is the most common month of the year for breathalyzer tests. Now, we look at the weeks in the year over time. We briefly use the lubridate package for a bit of date-time manipulation.
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
# Create date variable using paste() and ymd()
ba_data <- ba_data %>%
mutate(date = ymd(paste(year, month, day, sep = "-")))
# Create a week variable using week()
ba_data <- ba_data %>%
mutate(week = week(date))
ba_data
## # A tibble: 1,556 x 10
## year month day hour location gender Res1 Res2 date week
## <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <date> <dbl>
## 1 2017 12 17 1 Ames PD M 0.046 0.046 2017-12-17 51
## 2 2017 12 14 3 ISU PD F 0.121 0.12 2017-12-14 50
## 3 2017 12 10 5 ISU PD F 0.068 0.067 2017-12-10 50
## 4 2017 12 10 3 ISU PD F 0.077 0.077 2017-12-10 50
## 5 2017 12 9 2 ISU PD M 0.085 0.084 2017-12-09 49
## 6 2017 12 9 1 Ames PD M 0.16 0.161 2017-12-09 49
## 7 2017 12 7 3 Ames PD M 0.131 0.131 2017-12-07 49
## 8 2017 12 4 1 ISU PD M 0 0 2017-12-04 49
## 9 2017 12 3 1 Ames PD M 0.091 0.09 2017-12-03 49
## 10 2017 12 3 0 Ames PD M 0.095 0.095 2017-12-03 49
## # … with 1,546 more rows
How do the weeks differ over time? One of the most common data visualizations is the time series, a line tracking the changes in a variable over time. We will use the new week variable to look at test frequency over time. We end with a time series plot showing frequency of breathalyzer tests by week in year, with one line for each year.
# Create the weekly data set
# weekly <- ba_data %>%
# count(week, year)
# weekly
weekly <- ba_data %>%
group_by(week, year) %>%
count()
# Make year a factor
weekly <- weekly %>%
mutate(year = as.factor(year))
# Create the time series plot with one line for each year
ggplot(weekly, aes(x = week, y = n, color = year)) +
geom_line() +
geom_point() +
scale_x_continuous(breaks = seq(0,52,2))
From Wikipedia: “VEISHEA was an annual week-long celebration held each spring on the campus of Iowa State University in Ames, Iowa. The celebration featured an annual parade and many open-house demonstrations of the university facilities and departments. Campus organizations exhibited products, technologies, and held fundraisers for various charity groups. In addition, VEISHEA brought speakers, lecturers, and entertainers to Iowa State. […] VEISHEA was the largest student-run festival in the nation, bringing in tens of thousands of visitors to the campus each year.” This over 90-year tradition in Ames was terminated permanently after riots in 2014, where drunk celebrators flipped over multiple vehicles and tore light poles down. This was not the first incidence of violence and severe property damage in VEISHEA’s history. Did former President Leath make the right decision?
# Run this code to create the plot
ggplot() +
geom_point(data = weekly, aes(x = week, y = n, color = year)) +
geom_line(data = weekly, aes(x = week, y = n, color = year)) + # included to make the plot more readable
geom_segment(data = NULL, arrow = arrow(angle = 20, length = unit(0.1, "inches"),
ends = "last", type = "closed"),
aes(x = c(20,20), xend = c(15.5,16), y = c(21, 20), yend = c(21, 12.25))) +
geom_text(data = NULL, aes(x = 23, y = 20.5, label = "VEISHEA Weeks"), size = 3) +
scale_x_continuous(breaks = seq(0,52,2))
# Make a decision about VEISHEA. TRUE or FALSE?
cancelling_VEISHEA_was_right <- TRUE