This report analyzes healthcare-associated infection (HAI) data across hospitals in the United States. The dataset includes infection types, hospital locations, and observed infection case counts.
The goal of this analysis is to tell a meaningful story about:
- Which states have the highest infection cases
- Which infection types are most common
- How infection cases are distributed across states
- Where infection “hotspots” exist
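The dataset was cleaned using Python. The following steps were performed:
- Kept only the columns needed for the analysis
- Converted the Score column to numeric values
- Dropped rows with missing scores
- Kept only the rows that report observed cases
This ensured that the analysis focused only on meaningful infection data.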
import pandas as pd
df = pd.read_csv("C:/Users/wwwma/OneDrive - Loyola University Maryland/IS460W/Hospital Infection/Healthcare_Associated_Infections_-_Hospital.csv")
# Keep only needed columns
df = df[[
"Hospital Name",
"State",
"Measure Name",
"Compared to National",
"Score"
]]
# Convert Score to numeric
df["Score"] = pd.to_numeric(df["Score"], errors="coerce")
# Drop missing values
df = df.dropna(subset=["Score"])
# Keep only observed cases
df_clean = df[df["Measure Name"].str.contains("Observed Cases", na=False)]
df_clean.head()
## Hospital Name ... Score
## 4 MARSHALL MEDICAL CENTER SOUTH ... 2.0
## 10 MARSHALL MEDICAL CENTER SOUTH ... 5.0
## 16 MARSHALL MEDICAL CENTER SOUTH ... 3.0
## 22 MARSHALL MEDICAL CENTER SOUTH ... 2.0
## 29 MARSHALL MEDICAL CENTER SOUTH ... 1.0
##
## [5 rows x 5 columns]
This project uses the Healthcare-Associated Infections – Hospital dataset, which reports observed hospital-acquired infection cases across U.S. hospitals. After selecting relevant variables and cleaning the data, the final dataset includes hospital name, state, infection type, and the number of observed cases. Converting the Score column to numeric values and removing missing entries ensures that all visualizations reflect accurate and comparable infection counts.
df_clean.describe()
## Score
## count 16911.000000
## mean 9.900834
## std 23.820175
## min 0.000000
## 25% 1.000000
## 50% 3.000000
## 75% 8.000000
## max 568.000000
Interpretation: The summary statistics for the 16,911 observed case counts show a right-skewed distribution: the median is 3 and three quarters of the records are at 8 or below, while the maximum is 568. Most records report relatively low case numbers and a small number report very high totals. This skewness pulls the mean (9.9) well above the median and influences how visualizations behave, especially bar charts and heatmaps. Understanding this distribution is important because it highlights the presence of outliers and helps explain why some states and infection types dominate the national totals.
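One way to put numbers on the skew described above is to compare the mean with the median and to check how much of the total comes from the largest values. The sketch below uses a small made-up series (not values from the dataset), and the names `scores` and `top_share` are illustrative only.

```python
import pandas as pd

# Toy case counts, made up for illustration (NOT taken from the dataset)
scores = pd.Series([0, 1, 1, 2, 3, 3, 5, 8, 40, 120], name="Score")

mean_val = scores.mean()      # pulled upward by the two large values
median_val = scores.median()  # barely affected by the outliers
skewness = scores.skew()      # a positive value indicates a longer right tail

# Share of all cases contributed by records above the 90th percentile
cutoff = scores.quantile(0.90)
top_share = scores[scores > cutoff].sum() / scores.sum()

print(f"mean={mean_val:.1f}, median={median_val:.1f}, skew={skewness:.2f}")
print(f"share of cases above the 90th percentile: {top_share:.1%}")
```

Applied to the real data, the same calls would be made on `df_clean["Score"]`.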
type_avg = df_clean.groupby("Measure Name")["Score"].mean().sort_values()
type_avg
## Measure Name
## SSI: Abdominal Observed Cases 2.125571
## SSI: Colon Observed Cases 4.394442
## CAUTI: Observed Cases 4.452447
## MRSA Observed Cases 4.496051
## CLABSI: Observed Cases 4.550293
## CLABSI Observed Cases 4.824412
## CAUTI Observed Cases 8.005727
## C.diff Observed Cases 30.628264
## Name: Score, dtype: float64
Interpretation: The dataset includes several infection types, each representing a different hospital-acquired condition. Grouping the data by infection type shows which categories report the highest counts per record: C.diff averages about 30.6 cases, while every other category averages between about 2.1 and 8.0. Note that CAUTI and CLABSI each appear under two labels (with and without a colon), so they are treated as separate categories throughout this report. This variation helps explain the patterns seen in later visualizations such as the stacked bar chart and donut chart.
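The output above lists CAUTI and CLABSI twice, once with a colon and once without. If those spellings are meant to describe the same measure (an assumption; the source does not say), the labels could be normalized before grouping. The following is a minimal sketch on made-up rows; the `Infection Type` column and the cleaning rule are hypothetical, not part of the original analysis.

```python
import pandas as pd

# Toy rows, made up for illustration, mimicking the two label spellings
toy = pd.DataFrame({
    "Measure Name": [
        "CAUTI: Observed Cases", "CAUTI Observed Cases",
        "CLABSI: Observed Cases", "CLABSI Observed Cases",
        "C.diff Observed Cases",
    ],
    "Score": [4.0, 8.0, 5.0, 3.0, 30.0],
})

# Hypothetical normalization: drop the "Observed Cases" suffix, then any
# trailing colon. ASSUMPTION: "CAUTI:" and "CAUTI" denote the same measure.
toy["Infection Type"] = (
    toy["Measure Name"]
    .str.replace("Observed Cases", "", regex=False)
    .str.strip()
    .str.rstrip(":")
)

type_avg_merged = toy.groupby("Infection Type")["Score"].mean().sort_values()
print(type_avg_merged)
```

Labels such as "SSI: Abdominal" keep their inner colon under this rule, since only a trailing colon is removed.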
This chart shows the total number of observed infection cases for each state and territory, sorted from lowest to highest. Bars in red are more than 1% above the mean of the state totals, bars in green are more than 1% below it, and bars in gold fall within 1% of it.
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
state_total = df_clean.groupby("State")["Score"].sum().sort_values()
mean_val = state_total.mean()
def pick_color_according_to_mean():
    # One color per state, based on where its total sits relative to the mean
    colors = []
    for val in state_total:
        if val > mean_val * 1.01:
            colors.append("lightcoral")
        elif val < mean_val * 0.99:
            colors.append("lightgreen")
        else:
            colors.append("gold")
    return colors
my_colors3 = pick_color_according_to_mean()
Above = mpatches.Patch(color="lightcoral", label="Above Mean")
At = mpatches.Patch(color="gold", label="At Mean")
Below = mpatches.Patch(color="lightgreen", label="Below Mean")
figs = plt.figure(figsize=(12,15))
ax1 = figs.add_subplot(1,1,1)
ax1.barh(state_total.index, state_total.values, color=my_colors3)
for i, val in enumerate(state_total.values):
    # Label each bar with its total, just past the end of the bar
    ax1.text(val + 0.5, i, str(round(val, 2)),
             color="black", fontsize=12, va="center")
plt.axvline(mean_val, color="black", linestyle="dashed")
ax1.text(mean_val + 1, 0, "Mean = " + str(round(mean_val, 2)),
fontsize=12, va="bottom")
# Every state and territory is plotted, so the title should not claim a top-10 subset
ax1.set_title("Total Hospital Infection Cases by State", size=20)
ax1.set_xlabel("Total Observed Cases", fontsize=16)
ax1.set_ylabel("State", fontsize=16)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
ax1.legend(loc="lower right", handles=[Above, At, Below], fontsize=14)
plt.show()
Interpretation: This chart highlights how hospital infection cases are heavily concentrated in a small group of states. California, New York, Florida, and Texas stand far above the mean of the state totals, indicating significantly higher infection burdens than the rest of the country. The remaining states in the top ten (Pennsylvania, Ohio, Illinois, Michigan, Georgia, and North Carolina) also exceed the mean but at a more moderate level. Overall, the distribution is clearly skewed, with a few large states accounting for a disproportionate share of total infections.
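The red, gold, and green coloring rests on a simple rule: a value counts as "at" the mean when it falls within 1% of it, and as above or below otherwise. A vectorized version of that rule might look like the sketch below. The function name `classify_against_mean`, the `tolerance` parameter, and the toy totals are all hypothetical and are not part of the original code.

```python
import pandas as pd

def classify_against_mean(values: pd.Series, tolerance: float = 0.01) -> pd.Series:
    """Label each value 'above', 'at', or 'below' relative to the series mean.

    'at' means within +/- tolerance (as a fraction) of the mean.
    Assumes a non-negative mean, as with case counts.
    """
    mean = values.mean()
    labels = pd.Series("at", index=values.index)
    labels[values > mean * (1 + tolerance)] = "above"
    labels[values < mean * (1 - tolerance)] = "below"
    return labels

# Toy state totals, made up for illustration (mean is 400)
totals = pd.Series({"AA": 100.0, "BB": 400.0, "CC": 300.0, "DD": 800.0})
palette = {"above": "lightcoral", "at": "gold", "below": "lightgreen"}

labels = classify_against_mean(totals)
colors = labels.map(palette).tolist()
print(labels.to_dict())
```

The resulting `colors` list could be passed to the `color` argument of a bar chart call, and the same helper would serve both the state chart and the infection-type chart.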
This chart shows the average number of infection cases for each infection type.
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
type_avg = df_clean.groupby("Measure Name")["Score"].mean().sort_values()
mean_val = type_avg.mean()
def pick_color_according_to_mean(values, mean):
    # One color per infection type, based on its average relative to the mean
    colors = []
    for v in values:
        if v > mean * 1.01:
            colors.append("lightcoral")
        elif v < mean * 0.99:
            colors.append("lightgreen")
        else:
            colors.append("black")
    return colors
my_colors = pick_color_according_to_mean(type_avg.values, mean_val)
Above = mpatches.Patch(color="lightcoral", label="Above Average")
At = mpatches.Patch(color="black", label="Within 1% of Average")
Below = mpatches.Patch(color="lightgreen", label="Below Average")
fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(1,1,1)
ax.bar(type_avg.index, type_avg.values, color=my_colors)
for i, v in enumerate(type_avg.values):
    # Print each average just above its bar
    ax.text(i, v + (v * 0.02), str(round(v, 2)),
            ha="center", va="bottom", fontsize=12)
plt.axhline(mean_val, color="black", linestyle="dashed")
ax.text(len(type_avg)-0.5, mean_val + 0.5,
"Mean = " + str(round(mean_val, 2)),
fontsize=12, va="bottom")
plt.title("Average Infection Cases by Type", fontsize=20)
plt.xlabel("Infection Type", fontsize=16)
plt.ylabel("Average Cases", fontsize=16)
plt.xticks(rotation=45, fontsize=12)
plt.yticks(fontsize=12)
ax.legend(handles=[Above, At, Below], fontsize=14, loc="upper left")
plt.show()
Interpretation: This chart shows that most infection types have relatively low average case counts. Six of the eight categories average between about 2 and 5 cases, below the mean of the category averages (7.93), and one CAUTI category (8.01) sits within 1% of that mean. The major exception is C. diff, which averages 30.63 cases, nearly four times the next-highest category. This indicates that C. diff contributes disproportionately to hospital-acquired infections compared to all other categories.
The stacked bar chart shows how infection types contribute to total infections in the top states.
import matplotlib.pyplot as plt
top_states = df_clean.groupby("State")["Score"].sum().nlargest(5).index
df_top = df_clean[df_clean["State"].isin(top_states)]
stack_df = df_top.pivot_table(
index="State",
columns="Measure Name",
values="Score",
aggfunc="sum"
)
stack_df.plot(kind="bar", stacked=True, figsize=(14,8))
plt.title("Infection Types Distribution in Top 5 States", fontsize=18)
plt.xlabel("State", fontsize=14)
plt.ylabel("Total Cases", fontsize=14)
plt.xticks(rotation=30)
plt.savefig("infection_stackbar.png")
plt.show()
Interpretation: This stacked bar chart shows how different infection types contribute to the total infection burden in the top five states. California, New York, and Florida report the highest overall case counts, but all five states display a similar pattern: multiple infection types contribute to their totals, with C.diff Observed Cases dominating in every state. This indicates that C. diff is the primary driver of high infection totals in these high-burden regions.
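Whether one infection type dominates in every state can also be checked numerically by converting the pivoted totals into within-state shares. The sketch below uses made-up states and counts (not dataset values); `share_df` and `dominant` are illustrative names.

```python
import pandas as pd

# Toy data, made up for illustration: two states, three infection types
toy = pd.DataFrame({
    "State": ["AA", "AA", "AA", "BB", "BB", "BB"],
    "Measure Name": ["C.diff", "CAUTI", "MRSA"] * 2,
    "Score": [60.0, 25.0, 15.0, 30.0, 50.0, 20.0],
})

stack_df = toy.pivot_table(index="State", columns="Measure Name",
                           values="Score", aggfunc="sum")

# Divide each row by its own total so every state's shares sum to 1.0
share_df = stack_df.div(stack_df.sum(axis=1), axis=0)

# Infection type with the largest share in each state
dominant = share_df.idxmax(axis=1)

print(share_df.round(2))
print(dominant)
```

In this toy example the two states have different dominant types. On the real pivot table, the claim that C. diff leads everywhere would hold only if `dominant` showed the same label for all five states.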
This chart illustrates the proportion of infections by type.
infection_total = df_clean.groupby("Measure Name")["Score"].sum()
plt.figure(figsize=(8,8))
plt.pie(infection_total, labels=infection_total.index, autopct='%1.1f%%')
## Wedge shares (from autopct): C.diff Observed Cases 59.5%, CAUTI Observed Cases 10.9%, CAUTI: Observed Cases 6.7%, CLABSI Observed Cases 5.8%, CLABSI: Observed Cases 5.6%, MRSA Observed Cases 5.1%, SSI: Abdominal Observed Cases 1.1%, SSI: Colon Observed Cases 5.4%
centre_circle = plt.Circle((0,0), 0.70, fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
plt.title("Proportion of Infection Types")
plt.savefig("infection_donut.png")
plt.show()
Interpretation: The donut chart shows that C. diff accounts for 59.5% of all observed infection cases. Every other category is far smaller, ranging from 1.1% (SSI: Abdominal) to 10.9% (CAUTI Observed Cases). This highlights how disproportionately C. diff drives national infection totals compared to other hospital-acquired infections.
The heatmap highlights infection intensity across states and infection types. With the coolwarm colormap, red cells indicate higher infection counts and blue cells indicate lower counts.
import seaborn as sns
from matplotlib.ticker import FuncFormatter
import numpy as np
heatmap_data = df_clean.pivot_table(
index="State",
columns="Measure Name",
values="Score",
aggfunc="sum"
)
heatmap_data = heatmap_data.fillna(0)
fig, ax = plt.subplots(figsize=(18, 10))
comma_fmt = FuncFormatter(lambda x, p: format(int(x), ","))
hm = sns.heatmap(
heatmap_data,
cmap="coolwarm",
annot=True,
fmt=",.0f",
linewidths=0.2,
linecolor="gray",
square=False,
cbar_kws={"format": comma_fmt}
)
plt.title("Infection Cases Hotspots Across States and Infection Types", fontsize=18, pad=15)
plt.xlabel("Infection Type", fontsize=16, labelpad=10)
plt.ylabel("State", fontsize=16, labelpad=10)
plt.xticks(rotation=45, ha="right", fontsize=12)
plt.yticks(rotation=0, fontsize=12)
cbar = hm.collections[0].colorbar
max_val = heatmap_data.to_numpy().max()
tick_step = 1000
ticks = list(range(0, int(max_val) + tick_step, tick_step))
cbar.set_ticks(ticks)
cbar.set_ticklabels([format(t, ",") for t in ticks])
cbar.set_label("Total Infection Cases", rotation=270, labelpad=20, fontsize=14)
plt.savefig("infection_heatmap.png")
plt.show()
Interpretation: The heatmap reveals clear infection hotspots concentrated in large states such as California, Texas, Pennsylvania, and New York, which show consistently high case counts across multiple infection types. In contrast, many smaller states display much lower values across the board. The pattern highlights both geographic disparities and the fact that certain states face a heavier and more widespread infection burden than others.
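Hotspots can also be listed directly rather than read off the colors, by reshaping the state-by-type grid into one row per cell and taking the largest values. This is a minimal sketch on a made-up grid; `heatmap_toy`, `long_df`, and the `Cases` column name are hypothetical.

```python
import pandas as pd

# Toy state-by-type grid, made up for illustration (NOT dataset values)
heatmap_toy = pd.DataFrame(
    {"C.diff": [900.0, 50.0, 400.0], "CAUTI": [200.0, 10.0, 500.0]},
    index=pd.Index(["AA", "BB", "CC"], name="State"),
)

# Reshape to long form: one row per (State, Measure Name) cell
long_df = heatmap_toy.reset_index().melt(
    id_vars="State", var_name="Measure Name", value_name="Cases"
)

# The three cells with the highest totals are the strongest hotspots
hotspots = long_df.nlargest(3, "Cases")
print(hotspots)
```

On the real data, the same reshaping would be applied to `heatmap_data` after the `fillna(0)` step.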
This analysis provides a clear picture of how hospital‑acquired infections vary across states and infection types in the United States. Across all visualizations, a consistent pattern emerges: C. diff is the dominant infection type, contributing the largest share of cases nationally and driving much of the infection burden in high‑impact states. Geographic disparities are also evident, with large states such as California, Texas, Florida, New York, and Pennsylvania showing significantly higher totals across multiple infection categories.
The combination of summary statistics, bar charts, stacked distributions, and heatmaps highlights both the magnitude and complexity of infection patterns. While some states experience concentrated spikes in specific infection types, others face a broad mix of challenges. These findings underscore the importance of targeted infection‑control strategies that consider both state‑level burden and the disproportionate impact of certain infection types. Overall, the analysis demonstrates how data visualization can reveal meaningful trends that support more informed public health decision‑making.