MBA 673 Practice 2

Author

Dr. Mitchell Church

INNER JOINS in SQL

Many times you need to use information from multiple tables to produce a report or answer a question. For example, what are the DOB of all of our married patients?

We need fields from two tables to answer this.

  • DOB - located in patients table
  • marital_status - located in admissions table

To make this work, we need a SQL join. Joins are not difficult, as long as you understand what you’re trying to accomplish.

Joins require a common key that is shared between tables, in this case subject_id. Once you know the fields you want to SELECT and the common key field, you are ready to create your query.

SELECT patients.subject_id, dob, marital_status 
FROM patients
INNER JOIN admissions
ON patients.subject_id = admissions.subject_id
WHERE marital_status = "MARRIED"
LIMIT 5
5 records
subject_id dob marital_status
10027 2108-01-15 00:00:00 MARRIED
10033 2051-04-21 00:00:00 MARRIED
10035 2053-04-13 00:00:00 MARRIED
10036 1885-03-24 00:00:00 MARRIED
10045 2061-03-25 00:00:00 MARRIED

A couple things to note:

  • patients.subject_id.This selects the subject_id column from the patients table. The patients. prefix is used to explicitly state which table the subject_id column comes from, which is good practice, especially when you have columns with the same name in different tables like we have here.

  • INNER JOIN admissions.This is the INNER JOIN clause. It combines rows from the patients table with rows from the admissions table. An INNER JOIN only returns rows where there is a match in both tables based on the shared key field. There are other types of JOINS, but INNER JOIN is the only one we will use.

  • ON patients.subject_id = admissions.subject_id. This is the ON clause, which explicitly states the shared the key field. It tells the database to link rows from the patients table to rows from the admissions table where the subject_id in the patients table is equal to the subject_id in the admissions table. This subject_id is a unique identifier for each patient, allowing you to follow the patient across tables in the database.

Double Join example

A join can have any number of tables. To join more tables, just keep adding INNER JOIN and ON clauses. Here, I join our previous query to the icustays table to find out when these patients were admitted to the ICU. The ICUStays table has an “intime” variable and, most importantly, also has our shared key, subject_id

SELECT patients.subject_id, dob, marital_status, intime  
FROM patients
INNER JOIN admissions
ON patients.subject_id = admissions.subject_id
INNER JOIN icustays 
ON patients.subject_id = icustays.subject_id
WHERE marital_status = "MARRIED"
LIMIT 5
5 records
subject_id dob marital_status intime
10027 2108-01-15 00:00:00 MARRIED 2190-07-13 10:39:07
10033 2051-04-21 00:00:00 MARRIED 2132-12-05 13:13:18
10035 2053-04-13 00:00:00 MARRIED 2129-03-04 13:40:11
10036 1885-03-24 00:00:00 MARRIED 2185-03-24 16:57:05
10045 2061-03-25 00:00:00 MARRIED 2129-11-24 22:46:57

myquery <- ’’ dbGetQuery(mydb,myquery)

the SQL CAST() function

As we work with more and different types of data in the MIMIC-3, we are likely to encounter unusual “real-world” stuff. MIMIC-3 has all the bumps and wrinkles of a real, working database.

Let’s see an example of this. Here, I am joining chartevents to d_items to pull readings for an item called “Heart Rate” with itemid = 220045. The join is necessary because while the values and charttimes are in chartevents, the label is in d_items. I also save the data this time to a dataframe. This should be familiar to you from last week.

SELECT valuenum, label, charttime
FROM chartevents
INNER JOIN d_items
ON chartevents.itemid = d_items.itemid
WHERE d_items.itemid = 220045
LIMIT 5
5 records
valuenum label charttime
94 Heart Rate 2130-02-04 04:32:00
88 Heart Rate 2130-02-04 05:01:00
82 Heart Rate 2130-02-04 06:00:00
89 Heart Rate 2130-02-04 07:00:00
84 Heart Rate 2130-02-04 08:00:00

In the query above, the variable valuenum from CHARTEVENTS is supposed to be a numeric value for a patients heartrate, but sometimes SQL will still interpret a number as text. How do we know it is treating valuenum as text? Well, if it is a number, we should be able to calculate the mean.

SELECT valuenum, label, charttime
FROM chartevents
INNER JOIN d_items
ON chartevents.itemid = d_items.itemid
WHERE d_items.itemid = 220045
mean(myquery1$valuenum)
Warning in mean.default(myquery1$valuenum): argument is not numeric or logical:
returning NA
[1] NA

We cannot. To fix this, we need to explicitly tell SQL to treat this as an integer with CAST(). Unfortunately, using functions like CAST can create some bad variable names that are hard to work with. Now SQL wants to treat the whole “CAST(valuenum AS INT)” as the name of this variable, so I use an alias with AS to rename this to just “val”.

Tip

You can always rename any variable you want using an alias with AS. Try it yourself!

SELECT CAST(valuenum AS INT) AS val, label, charttime
FROM chartevents
INNER JOIN d_items
ON chartevents.itemid = d_items.itemid
WHERE d_items.itemid = 220045

Now the mean heart rate calculates as expected.

mean(myquery1$val)
[1] 88.71016

New ggplot geoms

Bar charts (geom_bar()) like we used last week are great for counting qualitative data (names, cities etc.). Quantitative data though (like the valuenum example above) can be shown effectively using some different geoms. Let’s look at two of them using our heartrate data from the CAST() example above.

geom_boxplot()

Boxplots are a nice go-to for quantitative data like these heart rates. Notice that the only thing that is really different is geom_boxplot() instead of geom_bar(). They are sort of demanding on your audience though. People need to know some statistics to understand boxplots. Things like interquartile range, medians, etc. Plus with grouped data it can be tough to explain whether and how groups differ.

ggplot(data = myquery1,
       aes(val)) +
  geom_boxplot()

Tip

A boxplot displays a five-number summary of data:

  • Minimum: The lowest value, excluding outliers. (Bottom of the thin line)
  • First Quartile (Q1): The 25th percentile (bottom of the box).
  • Median: The 50th percentile (middle line of the box).
  • Third Quartile (Q3): The 75th percentile (top of the box).
  • Maximum: The highest value, excluding outliers. (Top of the thin line)
  • Outliers: Points located further than 1.5 times the box height from the box edges, displayed individually.

geom_violin()

If you think your audience is less statistically inclined, in such cases I recommend geom_violin(). A violin plot shows the distribution of a continuous variable. The width of the “violin” at any given point represents the density of data points at that value. They are excellent for comparing distributions of data across different groups or categories. Let’s add gender to our query, then we will create different geom_violins for each gender and see if men and women experience different distributions in their heart rate variations.

SELECT CAST(valuenum AS INT) AS val, gender, label, charttime
FROM chartevents
INNER JOIN d_items
ON chartevents.itemid = d_items.itemid
INNER JOIN patients
on patients.subject_id = chartevents.subject_id
WHERE d_items.itemid = 220045
ggplot(data = myquery2,
       aes(x = gender, y = val)) +
  geom_violin()

We can see from the violin plots that female heart rate tends to cluster higher than men, and men show more dispersion in terms of their rates.

Stacked bar charts

Finally, here is a nice trick you can do with a geom_bar(). A “stacked bar” chart lets you show two qualitative variables on the same bar chart. Let’s count the labels from myquery2 to see how many total readings we took of heart rates on chartevents.

ggplot(data = myquery2,
       mapping = aes(label)) +
  geom_bar()

We have over 8,000 distinct readings! How many do we have for each gender? Looks like a little more than half are for men.

ggplot(data = myquery2,
       mapping = aes(label, fill = gender)) +
  geom_bar()

Now that you’ve had a chance to read about JOINS in SQL, open your 673_practice2.qmd file to start practicing on your own. Remember that your visualizations for your report should use concepts from the practice, but show how you have learned to extend the practices, not merely copy them.