Introduction

This report analyzes hospital-associated infection data across hospitals in the United States. The dataset includes information on infection types, hospital locations, and observed infection case counts.

The goal of this analysis is to tell a meaningful story about:

  • Which states have the highest infection cases
  • Which infection types are most common
  • How infection cases are distributed across states
  • Where infection “hotspots” exist

Data Preparation

The dataset was cleaned using Python. The following steps were performed:

  • Selected relevant columns (Hospital Name, State, Measure Name, Compared to National, Score)
  • Converted Score column to numeric values
  • Removed missing or invalid values
  • Filtered dataset to include only observed infection cases

This ensured that the analysis focused only on meaningful infection data.

import pandas as pd

df = pd.read_csv("C:/Users/wwwma/OneDrive - Loyola University Maryland/IS460W/Hospital Infection/Healthcare_Associated_Infections_-_Hospital.csv")

# Keep only needed columns
df = df[[
    "Hospital Name",
    "State",
    "Measure Name",
    "Compared to National",
    "Score"
]]

# Convert Score to numeric
df["Score"] = pd.to_numeric(df["Score"], errors="coerce")

# Drop missing values
df = df.dropna(subset=["Score"])

# Keep only observed cases
df_clean = df[df["Measure Name"].str.contains("Observed Cases", na=False)]

df_clean.head()
##                     Hospital Name  ... Score
## 4   MARSHALL MEDICAL CENTER SOUTH  ...   2.0
## 10  MARSHALL MEDICAL CENTER SOUTH  ...   5.0
## 16  MARSHALL MEDICAL CENTER SOUTH  ...   3.0
## 22  MARSHALL MEDICAL CENTER SOUTH  ...   2.0
## 29  MARSHALL MEDICAL CENTER SOUTH  ...   1.0
## 
## [5 rows x 5 columns]
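
As a small illustration of the conversion step above, the sketch below shows how `pd.to_numeric` with `errors="coerce"` turns non-numeric entries into `NaN` so that `dropna` can remove them. The sample values, including the "Not Available" placeholder, are made up for the example and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical raw Score values; the text placeholder is an assumption
raw = pd.DataFrame({"Score": ["2", "Not Available", "5"]})

# Non-numeric strings become NaN instead of raising an error
raw["Score"] = pd.to_numeric(raw["Score"], errors="coerce")

# Rows whose Score could not be parsed are dropped
cleaned = raw.dropna(subset=["Score"])

print(cleaned["Score"].tolist())
```

Comparing `len(raw)` with `len(cleaned)` is a quick way to see how many rows the cleaning step discards.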

Data Overview

This project uses the Healthcare-Associated Infections – Hospital dataset, which reports observed hospital-acquired infection cases across U.S. hospitals. After selecting relevant variables and cleaning the data, the final dataset includes hospital name, state, infection type, the comparison-to-national flag, and the number of observed cases. Converting the Score column to numeric values and removing missing entries means that all visualizations are based on valid numeric case counts.

Summary Statistics

df_clean.describe()
##               Score
## count  16911.000000
## mean       9.900834
## std       23.820175
## min        0.000000
## 25%        1.000000
## 50%        3.000000
## 75%        8.000000
## max      568.000000

Interpretation: The observed case counts are strongly right-skewed. Half of the 16,911 records report 3 or fewer cases and three quarters report 8 or fewer, yet the mean is 9.9 and the maximum is 568. A small number of very high counts therefore pulls the mean well above the median and shapes how the visualizations behave, especially the bar charts and heatmap. This matters because it points to outliers and helps explain why a few states and infection types dominate the national totals.
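
The gap between mean and median can be checked directly. The following is a minimal sketch on made-up counts (not the real data) that compares the two and reports the sample skewness, where a positive value indicates a right tail.

```python
import pandas as pd

# Made-up counts with a long right tail (not the real data)
scores = pd.Series([0, 1, 1, 2, 3, 3, 5, 8, 40, 200])

mean_score = scores.mean()
median_score = scores.median()
skewness = scores.skew()  # positive values indicate a right tail

print(f"mean={mean_score:.1f}, median={median_score:.1f}, skew={skewness:.2f}")
```

The same three calls could be applied to `df_clean["Score"]` to put a number on the skew described above.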

Infection Summary

type_avg = df_clean.groupby("Measure Name")["Score"].mean().sort_values()
type_avg
## Measure Name
## SSI: Abdominal Observed Cases     2.125571
## SSI: Colon Observed Cases         4.394442
## CAUTI: Observed Cases             4.452447
## MRSA Observed Cases               4.496051
## CLABSI: Observed Cases            4.550293
## CLABSI Observed Cases             4.824412
## CAUTI Observed Cases              8.005727
## C.diff Observed Cases            30.628264
## Name: Score, dtype: float64

Interpretation: Grouping by infection type shows which categories carry the largest average case counts per record. C. diff stands far above the rest at about 30.6 cases, while the other categories average between roughly 2.1 and 8.0. Note that CAUTI and CLABSI each appear under two labels (with and without a colon); they are treated as separate categories throughout this report. This variation helps explain the patterns seen in later visualizations such as the stacked bar chart and donut chart.
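
The grouped output lists CAUTI and CLABSI twice, with labels that differ only by a colon. Whether each pair refers to the same measure or to two distinct measures is not established here, so the sketch below only flags such near-duplicate labels for review and does not merge them. The helper name `find_label_variants` is hypothetical.

```python
import re
from collections import defaultdict

def find_label_variants(labels):
    """Group labels that are identical once punctuation and case are ignored."""
    groups = defaultdict(list)
    for label in labels:
        key = re.sub(r"[^a-z0-9]+", " ", label.lower()).strip()
        groups[key].append(label)
    # Keep only keys that map to more than one original label
    return {k: v for k, v in groups.items() if len(v) > 1}

# Labels copied from the grouped output above
labels = [
    "SSI: Abdominal Observed Cases",
    "SSI: Colon Observed Cases",
    "CAUTI: Observed Cases",
    "MRSA Observed Cases",
    "CLABSI: Observed Cases",
    "CLABSI Observed Cases",
    "CAUTI Observed Cases",
    "C.diff Observed Cases",
]

variants = find_label_variants(labels)
for key, names in variants.items():
    print(key, "->", names)
```

Any pair flagged this way is worth checking against the dataset's documentation before deciding whether to combine the categories or keep them apart.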

Findings

1) States with Highest Infection Cases (Horizontal Bar Chart)

This chart shows the total number of observed infection cases for each state and territory. Red bars are above the mean of the state totals, green bars are below it, and gold marks totals within 1% of the mean.

import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

state_total = df_clean.groupby("State")["Score"].sum().sort_values()
mean_val = state_total.mean()

def pick_color_according_to_mean(values, mean):
    """Return one color per value: red above, green below, gold within 1% of the mean."""
    colors = []
    for val in values:
        if val > mean * 1.01:
            colors.append("lightcoral")
        elif val < mean * 0.99:
            colors.append("lightgreen")
        else:
            colors.append("gold")
    return colors

my_colors3 = pick_color_according_to_mean(state_total, mean_val)

# Legend handles for the three color bands
Above = mpatches.Patch(color="lightcoral", label="Above Mean")
At = mpatches.Patch(color="gold", label="At Mean")
Below = mpatches.Patch(color="lightgreen", label="Below Mean")

figs = plt.figure(figsize=(12,15))
ax1 = figs.add_subplot(1,1,1)
ax1.barh(state_total.index, state_total.values, color=my_colors3)

# Label each bar with its total
for i, val in enumerate(state_total.values):
    ax1.text(val + 0.5, i, str(round(val, 2)),
             color="black", fontsize=12, va="center")

plt.axvline(mean_val, color="black", linestyle="dashed")
ax1.text(mean_val + 1, 0, "Mean = " + str(round(mean_val, 2)),
         fontsize=12, va="bottom")

ax1.set_title("Total Hospital Infection Cases by State", size=20)
ax1.set_xlabel("Total Observed Cases", fontsize=16)
ax1.set_ylabel("State", fontsize=16)
         
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
ax1.legend(loc="lower right", handles=[Above, At, Below], fontsize=14)

plt.show()

Interpretation: This chart highlights how hospital infection cases are heavily concentrated in a small group of states. California, New York, Florida, and Texas stand far above the national mean, indicating significantly higher infection burdens compared to the rest of the country. The remaining states in the top ten also exceed the mean but at a more moderate level. Overall, the distribution is clearly skewed, with a few large states accounting for a disproportionate share of total infections.
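
The concentration described here can be expressed as the share of the overall total held by the largest few states. The sketch below uses made-up totals with placeholder state codes, and `top_k_share` is a hypothetical helper, not part of the original analysis.

```python
import pandas as pd

# Hypothetical state totals (placeholders, not values from the dataset)
state_total = pd.Series({"AA": 500, "BB": 300, "CC": 100, "DD": 60, "EE": 40})

def top_k_share(totals, k):
    """Fraction of the overall total contributed by the k largest entries."""
    return totals.nlargest(k).sum() / totals.sum()

share_top2 = top_k_share(state_total, 2)

# Entries whose total exceeds the mean of all totals
above_mean = state_total[state_total > state_total.mean()].index.tolist()

print(f"Top 2 share: {share_top2:.0%}")
print("Above the mean:", above_mean)
```

Applied to the real `state_total` series with `k=4`, the same function would quantify how much of the national total the four largest states account for.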

2) Average Infection Cases by Type (Vertical Bar Chart)

This chart shows the average number of infection cases for each infection type.

import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

type_avg = df_clean.groupby("Measure Name")["Score"].mean().sort_values()

mean_val = type_avg.mean()

def pick_color_according_to_mean(values, mean):
    colors = []
    for v in values:
        if v > mean * 1.01:
            colors.append("lightcoral")      
        elif v < mean * 0.99:
            colors.append("lightgreen")      
        else:
            colors.append("black")           
    return colors

my_colors = pick_color_according_to_mean(type_avg.values, mean_val)

Above = mpatches.Patch(color="lightcoral", label="Above Average")
At = mpatches.Patch(color="black", label="Within 1% of Average")
Below = mpatches.Patch(color="lightgreen", label="Below Average")

fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(1,1,1)

ax.bar(type_avg.index, type_avg.values, color=my_colors)

for i, v in enumerate(type_avg.values):
    ax.text(i, v + (v * 0.02), str(round(v, 2)),
            ha="center", va="bottom", fontsize=12)

plt.axhline(mean_val, color="black", linestyle="dashed")
ax.text(len(type_avg)-0.5, mean_val + 0.5,
        "Mean = " + str(round(mean_val, 2)),
        fontsize=12, va="bottom")

plt.title("Average Infection Cases by Type", fontsize=20)
plt.xlabel("Infection Type", fontsize=16)
plt.ylabel("Average Cases", fontsize=16)

plt.xticks(rotation=45, fontsize=12)
plt.yticks(fontsize=12)
ax.legend(handles=[Above, At, Below], fontsize=14, loc="upper left")

plt.show()

Interpretation: Most infection types have low average case counts: six of the eight categories average between about 2 and 5 cases, below the mean of the category averages (about 7.9). One CAUTI category sits at roughly that mean (8.0), and C. diff stands out at about 30.6 cases, nearly four times the mean. C. diff therefore contributes disproportionately to hospital‑acquired infections compared to all other categories.
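
The dashed line in this chart is `type_avg.mean()`, the unweighted mean of the category averages. That is a different quantity from the overall mean Score of 9.90 in the summary statistics, because categories with many records count no more than categories with few. A toy example, with made-up category names and scores, shows the difference:

```python
import pandas as pd

# Toy records: category "A" has three rows, category "B" has one
toy = pd.DataFrame({
    "Measure Name": ["A", "A", "A", "B"],
    "Score": [1, 2, 3, 10],
})

group_means = toy.groupby("Measure Name")["Score"].mean()

mean_of_means = group_means.mean()   # each category counts once
pooled_mean = toy["Score"].mean()    # each record counts once

print(mean_of_means, pooled_mean)
```

Which reference line is more appropriate depends on whether the comparison is meant to be between categories or between individual records.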

3) Infection Distribution in Top States (Stacked Bar)

The stacked bar chart shows how infection types contribute to total infections in the top states.

import matplotlib.pyplot as plt

top_states = df_clean.groupby("State")["Score"].sum().nlargest(5).index

df_top = df_clean[df_clean["State"].isin(top_states)]

stack_df = df_top.pivot_table(
    index="State",
    columns="Measure Name",
    values="Score",
    aggfunc="sum"
)

stack_df.plot(kind="bar", stacked=True, figsize=(14,8))

plt.title("Infection Types Distribution in Top 5 States", fontsize=18)
plt.xlabel("State", fontsize=14)
plt.ylabel("Total Cases", fontsize=14)

plt.xticks(rotation=30)
plt.savefig("infection_stackbar.png") 

plt.show()

Interpretation: This stacked bar chart shows how different infection types contribute to the total infection burden in the top five states (California, Florida, New York, Pennsylvania, and Texas). California, New York, and Florida report the highest overall case counts, but all five states display a similar pattern: multiple infection types contribute to their totals, with C. diff observed cases dominating in every state. This indicates that C. diff is the primary driver of high infection totals in these high‑burden states.
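
Because the five states differ in total size, comparing their infection mix is easier when each row is converted to shares. The sketch below normalizes a small pivot table by row; the state codes, type names, and values are placeholders rather than figures from the dataset.

```python
import pandas as pd

# Hypothetical state-by-type totals (placeholder values)
stack_df = pd.DataFrame(
    {"Type X": [60, 30], "Type Y": [40, 70]},
    index=["AA", "BB"],
)

# Divide each row by its own total so every state sums to 1
share_df = stack_df.div(stack_df.sum(axis=1), axis=0)

print(share_df.round(2))
```

Calling `share_df.plot(kind="bar", stacked=True)` on the result would draw bars of equal height, which makes differences in composition easier to see than differences in volume.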

4) What proportion of total infections comes from each type? (Donut Chart)

This chart illustrates the proportion of infections by type.

infection_total = df_clean.groupby("Measure Name")["Score"].sum()

plt.figure(figsize=(8,8))

plt.pie(infection_total, labels=infection_total.index, autopct='%1.1f%%')
## [pie output condensed; wedge shares: C.diff 59.5%, CAUTI 10.9%, CAUTI: 6.7%, CLABSI 5.8%, CLABSI: 5.6%, MRSA 5.1%, SSI: Abdominal 1.1%, SSI: Colon 5.4%]
centre_circle = plt.Circle((0,0), 0.70, fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)

plt.title("Proportion of Infection Types")
plt.savefig("infection_donut.png") 

plt.show()

Interpretation: The donut chart shows that C. diff accounts for the majority of all observed infection cases, at 59.5% of the total. Every other category contributes far less, ranging from 1.1% (SSI: Abdominal) to 10.9% (CAUTI). This highlights how disproportionately C. diff drives national infection totals compared to other hospital‑acquired infections.

5) Where are infection hotspots across states and infection types? (Heatmap)

The heatmap highlights infection intensity across states and infection types. With the coolwarm palette, warmer (red) cells indicate higher infection counts and cooler (blue) cells indicate lower counts.

import seaborn as sns 
from matplotlib.ticker import FuncFormatter
import numpy as np

heatmap_data = df_clean.pivot_table(
    index="State",
    columns="Measure Name",
    values="Score",
    aggfunc="sum"
)
heatmap_data = heatmap_data.fillna(0)

fig, ax = plt.subplots(figsize=(18, 10))

comma_fmt = FuncFormatter(lambda x, p: format(int(x), ","))

hm = sns.heatmap(
    heatmap_data,
    cmap="coolwarm", 
    annot=True,
    fmt=",.0f",
    linewidths=0.2,
    linecolor="gray",
    square=False,
    cbar_kws={"format": comma_fmt}
)


plt.title("Infection Cases Hotspots Across States and Infection Types", fontsize=18, pad=15)
plt.xlabel("Infection Type", fontsize=16, labelpad=10)
plt.ylabel("State", fontsize=16, labelpad=10)

plt.xticks(rotation=45, ha="right", fontsize=12)
plt.yticks(rotation=0, fontsize=12)
cbar = hm.collections[0].colorbar

max_val = heatmap_data.to_numpy().max()

tick_step = 1000
ticks = list(range(0, int(max_val) + tick_step, tick_step))

cbar.set_ticks(ticks)
cbar.set_ticklabels([format(t, ",") for t in ticks])

cbar.set_label("Total Infection Cases", rotation=270, labelpad=20, fontsize=14)

plt.savefig("infection_heatmap.png") 
plt.show()

Interpretation: The heatmap reveals clear infection hotspots concentrated in large states such as California, Texas, Pennsylvania, and New York, which show consistently high case counts across multiple infection types. In contrast, many smaller states display much lower values across the board. The pattern highlights both geographic disparities and the fact that certain states face a heavier and more widespread infection burden than others.
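
Because the heatmap is built from raw totals, states with more reporting hospitals will tend to show larger values. One way to check whether a hotspot reflects size alone is to divide each state's total by its number of hospitals. The sketch below does this on made-up rows; the hospital names, state codes, and counts are placeholders, and this normalization is an optional extra rather than part of the analysis above.

```python
import pandas as pd

# Made-up records: state "AA" has three hospitals, state "BB" has one
toy = pd.DataFrame({
    "Hospital Name": ["H1", "H2", "H3", "H4"],
    "State": ["AA", "AA", "AA", "BB"],
    "Score": [10, 20, 30, 40],
})

per_state = toy.groupby("State").agg(
    total_cases=("Score", "sum"),
    hospitals=("Hospital Name", "nunique"),
)

# Average cases per reporting hospital in each state
per_state["cases_per_hospital"] = per_state["total_cases"] / per_state["hospitals"]

print(per_state)
```

In this toy data "AA" has the larger total but "BB" has the higher per-hospital figure, which is the kind of reversal a per-hospital view can surface.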

Conclusion

This analysis provides a clear picture of how hospital‑acquired infections vary across states and infection types in the United States. Across all visualizations, a consistent pattern emerges: C. diff is the dominant infection type, contributing the largest share of cases nationally and driving much of the infection burden in high‑impact states. Geographic disparities are also evident, with large states such as California, Texas, Florida, New York, and Pennsylvania showing significantly higher totals across multiple infection categories.

The combination of summary statistics, bar charts, stacked distributions, and heatmaps highlights both the magnitude and complexity of infection patterns. While some states experience concentrated spikes in specific infection types, others face a broad mix of challenges. These findings underscore the importance of targeted infection‑control strategies that consider both state‑level burden and the disproportionate impact of certain infection types. Overall, the analysis demonstrates how data visualization can reveal meaningful trends that support more informed public health decision‑making.