MBA 673 Practice 3

Author

Dr. Mitchell Church

Grouping and Summarizing in SQL ——————————————————

The Group By clause in SQL organizes output into logical groups based on a variable. Once variables are grouped, we can then SELECT different things based on those groups, like the COUNT of a group, the MAX or MIN of a group, or the AVG of a group.

Let’s first consider some examples using qualitative data. For example, here is some data on how the different ICU units make use of electrolytes (Potassium chloride).

SELECT drug, first_careunit
FROM prescriptions
INNER JOIN icustays
ON prescriptions.icustay_id = icustays.icustay_id
WHERE drug = "Potassium Chloride"
LIMIT 5

5 records
drug	first_careunit
Potassium Chloride	MICU
Potassium Chloride	MICU
Potassium Chloride	MICU
Potassium Chloride	MICU
Potassium Chloride	MICU

It is just a long list, not super useful. We already know that we could count how often they use Potassium chloride using a geom_bar().

SELECT drug, first_careunit
FROM prescriptions
INNER JOIN icustays
ON prescriptions.icustay_id = icustays.icustay_id
WHERE drug = "Potassium Chloride"

ggplot(data = myquery,
       aes(x = first_careunit)) +
  geom_bar()

Sometimes you don’t want a graph though, you need a number. This is why you need to know to group and count. You could try to read these off the graph I guess, but only knowing how to produce one geom_bar() is pretty limiting. Let’s look at how to create the numbers that were used to generate the graph.

COUNT and GROUP BY

The COUNT function and GROUP BY clause are SQL’s primary tools for summarizing data. GROUP BY, instructs the database to slice that giant dataset into distinct buckets based on shared values in a specific column.

Once the data is sorted into these buckets, COUNT steps in to calculate the total number of rows within each individual group. For example, if you group an orders table by customer_id, the database aggregates the rows for each unique customer and uses COUNT to output exactly how many purchases each individual made.

SELECT COUNT(*), first_careunit
FROM prescriptions
INNER JOIN icustays
ON prescriptions.icustay_id = icustays.icustay_id
WHERE drug = "Potassium Chloride"
GROUP BY first_careunit

5 records
COUNT(*)	first_careunit
111	CCU
17	CSRU
164	MICU
95	SICU
14	TSICU

Notice how the only changes I made to the code involved adding a GROUP BY (always after any WHERE statements) and instead of only SELECTing the drug I now select the COUNT(*).

Tip

COUNTING and GROUPING makes some weird variable names. Graphing something like COUNT() can be hard, so I highly recommend adding a quick alias (eg. COUNT(*) AS drug_count or something similar)

the HAVING clause

Grouped data can be filtered using the HAVING clause. It works like a WHERE clause, but for groups when you want to filter based on an attribute possessed by THE GROUP. Importantly, a WHERE filter will not override a HAVING. The order of the commands is also not interchangeable. WHERE comes first, then GROUP BY, then HAVING. Look at the difference between the output of these two queries. Can you see how HAVING has altered the results?

SELECT COUNT(*)as drug_count, first_careunit
FROM prescriptions
INNER JOIN icustays
ON prescriptions.icustay_id = icustays.icustay_id
WHERE drug = "Potassium Chloride"
GROUP BY first_careunit

ggplot(data = myquery1,
       aes(x = first_careunit, y = drug_count)) +
  geom_col() +
  labs(title = "Plot showing all drug_count data")

SELECT COUNT(*) as drug_count, first_careunit
FROM prescriptions
INNER JOIN icustays
ON prescriptions.icustay_id = icustays.icustay_id
WHERE drug = "Potassium Chloride"
GROUP BY first_careunit
HAVING drug_count > 20

ggplot(data = myquery2,
       aes(x = first_careunit, y = drug_count)) +
  geom_col() +
  labs(title = "Plot with data filtered using HAVING drug_count > 20")

A new GEOM - geom_col()

When we count and group ourselves, we can no longer rely on geom_bar() to produce our counts. We must explicitly tell GGPLOT which column contains our variable names, and which column contains our counts. We then have access to geom_col(). It is like a bar chart, but for data that has already been summarized. Notice the way I use geom_col() in the two examples above. Also, notice that I graph the name of my alias (drug_count), not COUNT(*).

You can also use fill with geom_col to create stacked bar charts like we saw in Week 2. Here I select all types of “Chloride” and group by BOTH first_careunit and drug, before creating a stacked geom_col(). Now we can see that Potassium Chloride is used most often, and which ICU units use the most.

SELECT COUNT(*) as drug_count, drug, first_careunit
FROM prescriptions
INNER JOIN icustays
ON prescriptions.icustay_id = icustays.icustay_id
WHERE drug LIKE "%Chloride"
GROUP BY first_careunit, drug

ggplot(data = myquery3,
       aes(y = drug, x = drug_count, fill = first_careunit)) +
  geom_col()

HAVING and GROUP BY with Quantitative Data

Counting is great for things that aren’t quantitative numbers. If you have quantitative variables, you can use different functions including SUM(), AVG, MAX etc. Here, I look at the amount of Potassium Chloride prescribed.

SELECT drug, first_careunit,
SUM(CAST(dose_val_rx as INTEGER)) as total_prescribed, 
MAX(CAST(dose_val_rx as INTEGER)) as biggest_dose, 
MIN(CAST(dose_val_rx as INTEGER)) as smallest_dose, 
AVG(CAST(dose_val_rx as INTEGER)) as avg_dose
FROM prescriptions
INNER JOIN icustays
ON prescriptions.icustay_id = icustays.icustay_id
WHERE drug = "Potassium Chloride"
GROUP BY first_careunit

5 records
drug	first_careunit	total_prescribed	biggest_dose	smallest_dose	avg_dose
Potassium Chloride	CCU	4050	80	10	36.48649
Potassium Chloride	CSRU	450	40	10	26.47059
Potassium Chloride	MICU	6017	80	6	36.68902
Potassium Chloride	SICU	4043	100	6	42.55789
Potassium Chloride	TSICU	480	60	20	34.28571

Compare the table above to boxplots of the same data. Can you find the min, max and average values on the boxplots?

The CAREGIVERS table

Your report this week will utilize data from the CAREGIVERS table. The CAREGIVERS table in the MIMIC-III database serves as a vital bridge between clinical event data and the types of healthcare professionals who recorded it. Each caregiver has a unique, de-identified integer called a Caregiver ID (CGID). Whenever a vital sign is charted, a medication is administered, or a clinical note is written, the corresponding event table logs the CGID of the person who entered the data, allowing us to track the provider without compromising individual identity.

By joining CAREGIVERS with clinical event tables like CHARTEVENTS or NOTEEVENTS, you can calculate metrics that show how caregivers interact with a patient over the course of their ICU stay. Ultimately, the table ensures that clinical data can be analyzed through the lens of provider roles, adding a critical layer of “employee” context to the medical records.

Let’s demonstrate how the CAREGIVERS table can reveal the working conditions for our hospital care staff with a humorous example. Being a nurse is tough, thankless work. They have to handle a bunch of (literal) crap. To show our appreciation, let’s recognize our hardworking caregivers by creating a query that will return a list of the top 10 RN caregivers that handle the most!

OUTPUTEVENTS tracks anything that comes out of a patients body. And you can probably guess what “Stool Out Stool” from OUTPUTEVENTS tracks. Here, I’m grouping the data by Caregiver ID and then calculating the SUM() amount of Stool each caregiver which class RN has “handled”. I’m then sorting the data by the SUM(value) which I call total_output and LIMITing the results to the top five RNs.

Then we can plot the summarized data on a geom_col() to see if we have a clear winner.

Congratulations CGID 15023!