Dataset: Baltimore City Liquor Licenses
Observations: ~30,284 records
Variables: 21 columns
Each row represents a liquor license record, including:
- License class and subclass
- License year and fee
- License Status (Closed, Renewed, Transferred, Unknown)
- Establishment Type
- Business name and location (ZIP Code)
This dataset is well suited for:
- Categorical analysis (license types, statuses)
- Numerical analysis (fees, years)
- Time-based trends
- Business and Regulatory Practices
This report analyzes liquor license data rom Baltimore City. The purpose of this analysis is to explore license status, license classes, fees, and geographic patterns using descriptive statistics and data visualizations.
The data was imported from a local CSV file. Initial inspection revealed missing or empty values in the ‘License Status’ field. These values were recoded as “Unknown” to ensure accurate categorical analysis. Additionally, “NA” values were removed from the LicenseFee and LicenseYear groups.
file_path <- "R_datafiles/Liquor_Licenses.csv"
if (!file.exists(file_path)) {
stop("File not found: ", file_path)
}
df <- read.csv(file_path)
df$LicenseStatus <- as.character(df$LicenseStatus)
df$LicenseStatus[is.na(df$LicenseStatus) | (df$LicenseStatus) == ""] <- "Unknown"
df$LicenseStatus <- as.factor(df$LicenseStatus)
df <- df %>%
filter(!is.na(LicenseFee), !is.na(LicenseYear))
This visualization shows the frequency of liquor licenses by license class. The results indicate that certain license classes dominate the dataset, suggesting that specific types of alcohol sales are more common in Baltimore.
df_counts <- df %>%
count(LicenseClass) %>%
arrange(desc(n))
ggplot(df_counts, aes(x = reorder(LicenseClass, -n), y=n, fill = LicenseClass)) +
geom_col() +
geom_text(aes(label = n), vjust = -0.5, size = 4) +
scale_y_continuous(expand = expansion(mult = c(0,0.8))) +
labs(
title= "Number of Liquor Licenses by License Class",
x = "License Class",
y = "Count"
) +
theme(
axis.text.x = element_text(angle=45, hjust=1),
legend.position = "none"
)
A box plot was used to compare license fees across different license statuses. Active licenses generally show higher and more consistent fees, while expired and unknown licenses display greater variability.
How to Read this Plot:
Each box summarizes the distribution of fees within a status group:
The middle line inside each box represents the median license fee.
The box itself shows the interquartile range (IQR) - the middle of the medians.
The whiskers extend to the typical range of values (excluding extreme outliers).
The individual points beyond the whiskers represent outliers, or unusually high license fees.
What This Plot Shows:
Renewed licenses appear to have the widest spread and the highest number of extreme outliers, with some fees reaching very high values.
Closed and Unknown licenses show moderate variability, with several high-fee outliers.
Transferred licenses tend to have lower median fees and a more compressed distribution, though a few high outliers are still present.
ggplot(df, aes(x = LicenseStatus, y = LicenseFee)) +
geom_boxplot()+
labs(
title = "License Fees by License Status",
x = "License Status",
y = "License Fee"
) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The scatter plot examines whether license fees have changed over time. While individual fees vary, no extreme upward or downward trend is observed, suggesting relative fee stability across years.
This scatterplot visualizes the relationship between the year a license was issued “LicenseYear” and the associated “LicenseFee”.
Each point represents one license record in the datatset.
The x-axis (LicenseYear) shows the year the license was issued.
The y-axis (LicenseFee) shows the cost of the license in dollars.
The “alpha = 0.6” setting makes the points slightly transparent, which helps reveal areas where many observations overlap.
How To Read This Plot:
A general upward pattern would suggest that license fees have increased over time.
A flat pattern would indicate that fees have remained relatively stable across years.
Cluster of points may indicate common fee tiers or standard pricing categories.
Large vertical spread within a single year suggests variation in fees for different license types.
What This Plot Shows:
There appears to be a general upward trend in the upper range of license fees over time. In more recent years, the highest observed fess reach substantially larger amounts than in earlier years.
While many license remain in a lower fee range (roughly $3000) across all years, the number and magnitude of high-fee licenses increase after approximately 2015.
The vertical clustering of point within each year suggests that license fees are often at standardized pricing tiers, rather than being continuously distributed .
The presence of several very high values (above $20000) in later years indicates the emergence or increased frequency of premium or specialized license categories.
There is also noticeable variability within individual years, meaning license cost depends on factors beyond just time (such as license type or status).
ggplot(df, aes(x = LicenseYear, y = LicenseFee)) +
geom_point(alpha = 0.6) +
labs(
title = "License Fee vs License Year",
x = "License Year",
y = "License Fee"
)
Aggregating license fees by year provides insight into trends in total revenue generated from liquor licenses. The line chart reveals fluctuations that reflect economic conditions and regulatory changes.
fees_by_year <- aggregate(LicenseFee ~ LicenseYear, data = df, sum)
ggplot(fees_by_year, aes(x = LicenseYear, y = LicenseFee)) +
geom_line() +
geom_point() +
labs(
title = "Total License Fees Collected by Year",
x = "License Year",
y = "Total License Fees"
)
A pie chart was used to visualize the proportions of license statuses. Including an “Unknown” category ensures that missing data is transparently represented rather than excluded.
status_counts <- as.data.frame(table(df$LicenseStatus))
colnames(status_counts) <- c("LicenseStatus","Count")
status_counts <- status_counts %>%
mutate(
Percent = Count / sum(Count),
PercentLabel = percent(Percent, accuracy = 1),
ymax = cumsum(Count),
ymin = lag(ymax, default = 0),
ymid = (ymax + ymin) / 2
)
total_N <- sum(status_counts$Count)
ggplot(status_counts, aes(x = 2, y = Count, fill= LicenseStatus)) +
geom_bar(stat = "identity", width = 1, color = "white") +
coord_polar(theta = "y") +
xlim(0.5, 3) +
geom_segment(aes(x = 2.5, xend = 2.9,
y = ymid, yend = ymid),
color = "black") +
geom_text(aes(x = 3, y = ymid, label = PercentLabel),
size = 4) +
annotate("text",
x = 1.5, y = -max(status_counts$Count) * 0.05,
label = paste0("N = ", total_N),
size = 5,
fontface = "bold") +
labs(
title = "Proportion of Licenses by Status",
fill = "License Status"
) +
theme_void()
This analysis explored Baltimore City liquor licenses using multiple visualization techniques. The findings highlight dominant license classes, variability in license fees, and stable long-term trends in licensing revenue. Addressing missing data improved clarity and strengthened interpretability.