Analysis Report Two - Data, Data Everywhere

Author

Mary Elizabeth Magyar

Executive Summary

This report analyzes pneumonia-related data from the MIMIC-III database to determine which bacteria are most commonly linked to the diagnosis and how the identified organism relates to a patient’s length of stay in the ICU. My analysis found that the three organisms that account for the majority of cases in the given population are Pseudomonas aeruginosa, Morganella morganii, and Escherichia coli. The amount of time a patient spends in the ICU differs depending on which of the three bacteria is identified. My report recommends that healthcare organizations invest in clinical decision support systems (CDSS) that can turn this type of infection data into actionable guidance. Doing so will support faster treatment decisions, more accurate resource planning, and cost savings, alongside proper training for clinicians.

Introduction

Healthcare organizations today have access to an ample amount of clinical data, driven by electronic medical record (EMR) systems, wearable devices, and AI-powered analytic tools. Raghupathi and Raghupathi (Raghupathi and Raghupathi 2014) describe this shift as “big data analytics” in healthcare. This means that healthcare organizations can analyze large, complex data sets in order to be proactive rather than reactive. By being proactive, they can identify high-risk patients in their organization or community and intervene before issues occur. The article outlines four main types of analytics: queries, reports, online analytical processing (OLAP), and data mining (Raghupathi and Raghupathi 2014). Sutton and colleagues (Sutton et al. 2020) build on this concept by looking at clinical decision support systems, which are software tools that combine clinical knowledge, patient information, and other health information to help physicians care for their patients. They describe two categories of CDSS: knowledge-based systems that follow programmed rules, and non-knowledge-based systems that use machine learning to spot patterns. Their review found real benefits from CDSS, including fewer medication errors, lower costs, and fewer hospital stays, but also risks such as alert fatigue, disrupted workflows, and clinicians relying too much on automated recommendations rather than their own judgment. Abrams (Abrams 2025) took this further and looked at how to address the growing volume of patient information generated by wearable devices and health applications. He stated that an increased amount of health data does not improve outcomes on its own.
For example, patients now have an overload of health information at their fingertips through new and improved patient portals. According to the article The Impact of Digital Patient Portals on Health Outcomes, System Efficiency, and Patient Attitudes: Updated Systematic Literature Review (Carini et al. 2021), giving patients more access to their health data does not improve outcomes unless that information is supplemented with understanding, support, and tools that help them improve their overall health. According to Abrams (Abrams 2025), the proposed solution is an AI-driven approach that filters data and routes it to the right person. All in all, these three readings share a consistent idea: the value of healthcare data depends on how well it is analyzed, filtered, and interpreted. This report applies that idea by using the MIMIC-III data set to demonstrate how patient information can be used in the way the three articles describe.

The Healthcare Context

Technology continues to connect what used to be separate parts of healthcare organizations, such as diagnosis, microbiology systems, labs, financial planning, and ICU operations, into integrated data systems such as EMR systems. The queries and visualizations that I created for this report, ‘Data, Data Everywhere’, show how data can be integrated within the MIMIC-III database. I joined the admissions, microbiologyevents, and icustays tables to link a patient’s diagnosis to the organism identified in their microbiology results and to operational outcomes such as length of stay in the ICU, a connection that could have been missed if the data systems were not integrated. Being able to integrate different sources of data can significantly improve the quality of care provided to patients. Raghupathi and Raghupathi (Raghupathi and Raghupathi 2014) emphasize that connecting large data sets allows organizations to be proactive rather than reactive, for example by looking at which bacteria most commonly appear with a pneumonia diagnosis before cases become urgent. Sutton and colleagues (Sutton et al. 2020) explain that CDSS tools built on a foundation of this type of data can improve organizational efficiency and patient outcomes. While these tools point us in the right direction, if they are not managed carefully they can lead us down the wrong path and overwhelm not only physicians but patients as well. Abrams (Abrams 2025) expressed that even though we can connect to more data sources, such as EMRs, lab systems, and wearable devices, more data is not always the best solution. Unless it is filtered and delivered to the patient and physician correctly, this level of data accessibility can be overwhelming. Automatic alert systems can likewise overwhelm physicians, resulting in alert fatigue and in relying too heavily on an automated system rather than on the physician’s own judgment.
The visualizations that I have created in this report show just that. Visualization one shows the data coming together as intended: by combining the admissions and microbiologyevents tables, we see a clear pattern in the stacked bar chart. Visualization two shows the limitations of the same approach. I combined the diagnosis, organism, and ICU length of stay (los) data into a box plot, which demonstrates that a limited data set can raise more questions than it answers if the interpretation of the chart is not clear. Overall, technology is a great resource, especially in healthcare organizations, if it is understood and used properly. The ability to integrate it depends on the training needed to turn data into useful and trustworthy guidance for clinicians and healthcare organizations.

Data Visualizations

Visualization One - Two Table Join

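-- Returns one row per microbiology result for admissions diagnosed with
-- pneumonia, restricted to the three organisms of interest.
SELECT org_name
FROM admissions
INNER JOIN microbiologyevents
  ON admissions.hadm_id = microbiologyevents.hadm_id
WHERE diagnosis = 'PNEUMONIA'
  AND org_name IN ('PSEUDOMONAS AERUGINOSA',
                   'MORGANELLA MORGANII',
                   'ESCHERICHIA COLI')

library(ggplot2)

# myquery1.1 holds the result of the query above
ggplot(data = myquery1.1,
       mapping = aes(x = "Pneumonia Patients", fill = org_name)) +
  geom_bar() +       # One stacked bar; each segment counts the rows for one organism
  theme_minimal() +  # Cleans up the background grid lines
  labs(
    title = "Visualization showing Top 3 Bacteria Found in Pneumonia Patients",
    subtitle = "Data taken from Admissions and Microbiologyevents table within MIMIC-III",
    x = NULL,                 # The single bar is already labeled "Pneumonia Patients"
    y = "Number of Patients",
    fill = "Bacteria Found",  # Legend title for the stacked segments
    caption = "Source: MIMIC-III Clinical Database v1.4"
  )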

This visualization draws on two tables, admissions and microbiologyevents, within the MIMIC-III database. Joining these two tables on hadm_id, the hospital admission identifier, lets the query isolate all microbiology results associated with patients who were diagnosed with pneumonia. The stacked bar chart shows the three most common pneumonia bacteria that were found: Escherichia coli (in orange), Morganella morganii (in green), and Pseudomonas aeruginosa (in blue). The stacked format allows someone to assess the volume of cases for each organism against the total. This data and visualization are relevant to healthcare organizations because pneumonia is a leading cause of mortality worldwide (Shrestha 2022). According to the article, the “median time-to-first microbiology result is 26 hours,” and the “median time-to-last microbiology result is 144 hours” (Shrestha 2022). A pneumonia diagnosis is typically made before the microbiology results are available. Pneumonia is diagnosed based on symptoms, physical exam, imaging, and vital signs, so providers do not wait for microbiology results before creating a treatment plan; they begin treatment upon diagnosis (Grief and Loza 2018). By analyzing historical infection patterns like those shown in the chart above, healthcare organizations can develop informed guidelines for antibiotic therapy based on the bacteria most commonly found in pneumonia. This type of data-driven approach would fit well in a clinical decision support system because providers would be able to see which bacteria are most commonly associated with a pneumonia diagnosis (Sutton et al. 2020). A clinical decision support system could retrieve this information promptly and give the provider an evidence-based antibiotic choice while waiting for the microbiology results to confirm the bacteria.
Overall, this visualization demonstrates how a two-table query can reveal clinically meaningful patterns that support faster, data-driven treatment decisions in healthcare organizations.
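One caveat about the two-table join can be shown with a small sketch. The example below is not the report's actual pipeline: it uses Python's built-in sqlite3 module and a handful of invented rows (not MIMIC-III data) to run the same join, and it assumes that one admission can have more than one microbiology row for the same organism. Under that assumption, counting result rows and counting distinct admissions give different numbers, so the bar heights are best read as result counts unless the query deduplicates by hadm_id. The function name organism_counts is hypothetical.

```python
import sqlite3

# Toy rows invented for illustration; they are NOT taken from MIMIC-III.
ADMISSIONS = [
    (1, "PNEUMONIA"),
    (2, "PNEUMONIA"),
    (3, "PNEUMONIA"),
    (4, "SEPSIS"),
]
MICROBIOLOGY = [
    (1, "PSEUDOMONAS AERUGINOSA"),
    (1, "PSEUDOMONAS AERUGINOSA"),  # repeat result for the same admission
    (2, "ESCHERICHIA COLI"),
    (3, "MORGANELLA MORGANII"),
    (3, "ESCHERICHIA COLI"),
    (4, "ESCHERICHIA COLI"),        # dropped by the filter: diagnosis is not PNEUMONIA
    (2, None),                      # culture with no organism identified
]

# Same join and filter as the report's query, with two counts per organism.
QUERY = """
SELECT org_name,
       COUNT(*)                           AS result_rows,
       COUNT(DISTINCT admissions.hadm_id) AS admissions
FROM admissions
INNER JOIN microbiologyevents
        ON admissions.hadm_id = microbiologyevents.hadm_id
WHERE diagnosis = 'PNEUMONIA'
  AND org_name IN ('PSEUDOMONAS AERUGINOSA',
                   'MORGANELLA MORGANII',
                   'ESCHERICHIA COLI')
GROUP BY org_name
ORDER BY org_name
"""


def organism_counts():
    """Load the toy tables into an in-memory database and run the join."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE admissions (hadm_id INTEGER, diagnosis TEXT)")
    conn.execute("CREATE TABLE microbiologyevents (hadm_id INTEGER, org_name TEXT)")
    conn.executemany("INSERT INTO admissions VALUES (?, ?)", ADMISSIONS)
    conn.executemany("INSERT INTO microbiologyevents VALUES (?, ?)", MICROBIOLOGY)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for org_name, result_rows, admissions in organism_counts():
        print(f"{org_name}: {result_rows} result rows, {admissions} admissions")
```

In this toy data, Pseudomonas aeruginosa has two result rows but only one admission, which is the kind of gap worth checking before labeling an axis as a patient count.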

Visualization Two - Three Table Join

-- Returns the ICU length of stay (los, in days) alongside each microbiology
-- result from a pneumonia admission, restricted to the three organisms of interest.
SELECT admissions.hadm_id, diagnosis, org_name, los
FROM admissions
INNER JOIN microbiologyevents
  ON admissions.hadm_id = microbiologyevents.hadm_id
INNER JOIN icustays
  ON admissions.hadm_id = icustays.hadm_id
WHERE diagnosis = 'PNEUMONIA'
  AND org_name IN ('PSEUDOMONAS AERUGINOSA',
                   'MORGANELLA MORGANII',
                   'ESCHERICHIA COLI')

library(ggplot2)

# myquery2.2 holds the result of the query above
ggplot(data = myquery2.2,
       mapping = aes(x = org_name, y = los)) +
  geom_boxplot() +   # One box per organism showing the spread of ICU stay lengths
  theme_minimal() +  # Cleans up the background grid lines
  labs(
    title = "Visualization showing ICU Stays by Bacteria Type in Pneumonia Patients",
    subtitle = "Data taken from admissions, microbiologyevents, and icustays tables within MIMIC-III",
    x = "Bacteria Found",
    y = "Length of ICU Stay (in days)",
    caption = "Source: MIMIC-III Clinical Database v1.4"
  )

This visualization draws on three tables from the MIMIC-III database: admissions, microbiologyevents, and icustays. In this query, I joined the tables on hadm_id, the hospital admission identifier, to connect each patient’s pneumonia diagnosis and identified bacteria to the length of their ICU stay (los). The query isolates the same three bacteria that were identified in visualization one, Escherichia coli, Morganella morganii, and Pseudomonas aeruginosa, and pulls in each patient’s length of stay in the ICU (in days) for comparison. The data is represented as a box plot, with bacteria on the x-axis and the length of ICU stay (in days) on the y-axis. This chart allows someone to compare ICU stay length across the bacterial organisms. Length of stay (LOS) is directly tied to cost, which makes this analysis valuable for resource planning; it also informs a clinical decision support system (CDSS) by showing a pattern that helps clinicians understand how to act on similar cases in the future. According to McLean, “ICU patients cost hospitals an average of $5,982.12 per day” (McLean and Thompson 2023). Applying this to the visualization above, even a single extra day in the ICU for a patient with a longer-staying organism represents a meaningful cost to the hospital. This is important to a healthcare organization because being able to identify high-risk infections early could have a large financial impact. According to Sutton and colleagues, CDSS-driven interventions, including one in a pediatric cardiovascular ICU, have been shown to reduce inpatient LOS and produce significant cost savings (Sutton et al. 2020). One example in the article reduced unnecessary lab testing, which saved the healthcare organization $717,538 per year without increasing LOS or the mortality rate (Sutton et al. 2020).
This example is relevant to the chart because if certain bacteria are associated with a longer LOS in the ICU, healthcare organizations could use that information to flag higher-risk pneumonia cases earlier and either begin a more aggressive initial treatment or direct resources where they are most needed to shorten the overall ICU stay. By connecting diagnosis, type of organism, and ICU LOS as this query does, a CDSS could help healthcare organizations anticipate which resources will be needed and used the most, and identify which infections are associated with more complex, costly hospital stays. Overall, this visualization shows how connecting a diagnosis, organism, and ICU length of stay can help healthcare organizations anticipate risk and understand the financial impact of those risks, which in turn supports a CDSS that drives better care and resource allocation.
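The link between organism, ICU length of stay, and cost can be made concrete with a small sketch. The example below is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module, invented rows rather than MIMIC-III data, and hypothetical helper names (median_los_by_organism, extra_icu_cost). It runs the same three-table join, keeps one row per ICU stay and organism so repeated microbiology rows do not weight the summary, and applies the $5,982.12 per-day figure quoted above to the gap between two medians.

```python
import sqlite3
from statistics import median

# Average daily ICU cost quoted in the report (McLean and Thompson 2023).
COST_PER_ICU_DAY = 5982.12

ORGANISMS = ("PSEUDOMONAS AERUGINOSA", "MORGANELLA MORGANII", "ESCHERICHIA COLI")

# Toy rows invented for illustration; they are NOT taken from MIMIC-III.
ADMISSIONS = [(1, "PNEUMONIA"), (2, "PNEUMONIA"), (3, "PNEUMONIA")]
MICROBIOLOGY = [
    (1, "PSEUDOMONAS AERUGINOSA"),
    (1, "PSEUDOMONAS AERUGINOSA"),  # repeat result for the same admission
    (2, "ESCHERICHIA COLI"),
    (3, "PSEUDOMONAS AERUGINOSA"),
    (3, "ESCHERICHIA COLI"),
]
ICUSTAYS = [(10, 1, 6.0), (20, 2, 2.0), (30, 3, 4.0)]  # icustay_id, hadm_id, los (days)

# Same joins as the report's query; DISTINCT keeps one row per ICU stay and organism.
QUERY = """
SELECT DISTINCT icustays.icustay_id, org_name, los
FROM admissions
INNER JOIN microbiologyevents
        ON admissions.hadm_id = microbiologyevents.hadm_id
INNER JOIN icustays
        ON admissions.hadm_id = icustays.hadm_id
WHERE diagnosis = 'PNEUMONIA'
  AND org_name IN (?, ?, ?)
"""


def median_los_by_organism():
    """Return {organism: median ICU days} for the toy tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE admissions (hadm_id INTEGER, diagnosis TEXT)")
    conn.execute("CREATE TABLE microbiologyevents (hadm_id INTEGER, org_name TEXT)")
    conn.execute("CREATE TABLE icustays (icustay_id INTEGER, hadm_id INTEGER, los REAL)")
    conn.executemany("INSERT INTO admissions VALUES (?, ?)", ADMISSIONS)
    conn.executemany("INSERT INTO microbiologyevents VALUES (?, ?)", MICROBIOLOGY)
    conn.executemany("INSERT INTO icustays VALUES (?, ?, ?)", ICUSTAYS)
    rows = conn.execute(QUERY, ORGANISMS).fetchall()
    conn.close()

    stays = {}
    for _icustay_id, org_name, los in rows:
        stays.setdefault(org_name, []).append(los)
    return {org: median(values) for org, values in sorted(stays.items())}


def extra_icu_cost(extra_days):
    """Cost of additional ICU days at the quoted average daily rate."""
    return round(extra_days * COST_PER_ICU_DAY, 2)


if __name__ == "__main__":
    medians = median_los_by_organism()
    for org, days in medians.items():
        print(f"{org}: median ICU stay {days} days")
    gap = medians["PSEUDOMONAS AERUGINOSA"] - medians["ESCHERICHIA COLI"]
    print(f"Median gap of {gap} days is about ${extra_icu_cost(gap):,.2f} per patient")
```

The medians here come from invented numbers and say nothing about the real organisms; the point is only that a per-organism summary plus a daily cost figure turns the box plot into a rough dollar estimate.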

Recommendations for Industry

The data from both visualizations show that a small number of bacteria drive most pneumonia cases in this population. Certain organisms are associated with longer ICU stays, which can be a financial strain on a healthcare organization. Raghupathi and Raghupathi’s view of big data analytics treats patterns like these as a tool for identifying high-risk groups before issues arise (Raghupathi and Raghupathi 2014). My recommendation is that, rather than overloading clinicians with raw data, healthcare organizations build CDSS tools that filter this information into targeted alerts, in line with Abrams, who argues for making use of AI in healthcare rather than seeing it as a negative (Abrams 2025). I also recommend that administrators in healthcare organizations monitor this type of pattern in the data to aid financial planning, as Sutton and McLean both show that data-driven interventions can produce real cost savings. Taking this into consideration, my overall recommendation is for healthcare organizations to truly invest in clinical data and in the clinical decision support tools built on it. Transforming raw data into faster, safer, and more cost-effective patient care would greatly benefit a healthcare organization.
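As a rough illustration of what filtering this information into alerts could look like, the sketch below implements a tiny knowledge-based rule of the kind described in the introduction. Everything in it is an assumption made for illustration: the OrganismSummary type, the flag_long_stay_organisms function, the thresholds, and the sample numbers are hypothetical and are not drawn from MIMIC-III or from any real CDSS product. The rule surfaces only organisms whose median ICU stay is long and that are backed by enough admissions, so clinicians see a short list instead of the raw table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrganismSummary:
    """Per-organism summary a CDSS might receive (hypothetical structure)."""
    org_name: str
    median_icu_days: float
    admissions: int


def flag_long_stay_organisms(summaries, min_median_days=4.0, min_admissions=5):
    """Return organism names to surface, longest median ICU stay first.

    Both thresholds are illustrative placeholders, not clinical guidance.
    """
    flagged = [
        s for s in summaries
        if s.median_icu_days >= min_median_days and s.admissions >= min_admissions
    ]
    flagged.sort(key=lambda s: s.median_icu_days, reverse=True)
    return [s.org_name for s in flagged]


# Invented numbers for illustration only.
SAMPLE = [
    OrganismSummary("PSEUDOMONAS AERUGINOSA", 6.5, 12),
    OrganismSummary("ESCHERICHIA COLI", 3.0, 20),    # stay too short to flag
    OrganismSummary("MORGANELLA MORGANII", 7.0, 2),  # too few admissions to trust
]

if __name__ == "__main__":
    print(flag_long_stay_organisms(SAMPLE))
```

Requiring a minimum number of admissions is one simple way to limit alert fatigue, since a long median based on two patients would otherwise trigger an alert on very thin evidence.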

References

Carini, Elettra, Leonardo Villani, Angelo Maria Pezzullo, Andrea Gentili, Andrea Barbara, Walter Ricciardi, and Stefania Boccia. 2021. “The Impact of Digital Patient Portals on Health Outcomes, System Efficiency, and Patient Attitudes: Updated Systematic Literature Review.” Journal of Medical Internet Research 23 (9): e26189. https://doi.org/10.2196/26189.
Grief, Samuel N., and Julie K. Loza. 2018. “Guidelines for the Evaluation and Treatment of Pneumonia.” Primary Care: Clinics in Office Practice 45 (3): 485–503. https://doi.org/10.1016/j.pop.2018.04.001.
Abrams, Ken. 2025. “From Data Overload to Targeted Care: AI’s Role in Health.” Strategy & Analytics, Life Sciences & Health Care 1: 1.
McLean, Barbara, and Douglas Thompson. 2023. “MRI and the Critical Care Patient: Clinical, Operational, and Financial Challenges.” Critical Care Research and Practice 2023 (1): 2772181. https://doi.org/10.1155/2023/2772181.
Raghupathi, Wullianallur, and Viju Raghupathi. 2014. “Big Data Analytics in Healthcare: Promise and Potential.” Health Information Science and Systems 2: 3. https://doi.org/10.1186/2047-2501-2-3.
Shrestha, Georgiou, A. 2022. “Timeliness of Microbiology Test Result Reporting and Association with Outcomes of Adults Hospitalised with Unspecified Pneumonia: A Data Linkage Study.” International Journal of Clinical Practice 2022 (1): 8. https://doi.org/10.1155/2022/9406499.
Sutton, Reed T., David Pincock, Daniel C. Baumgart, Daniel C. Sadowski, Richard N. Fedorak, and Karen I. Kroeker. 2020. “An Overview of Clinical Decision Support Systems: Benefits, Risks, and Strategies for Success.” npj Digital Medicine 3 (1): 17. https://doi.org/10.1038/s41746-020-0221-y.