Many times you need to use information from multiple tables to produce a report or answer a question. For example, what are the DOB of all of our married patients?
We need fields from two tables to answer this.
DOB - located in patients table
marital_status - located in admissions table
To make this work, we need a SQL join. Joins are not difficult, as long as you understand what you’re trying to accomplish.
Joins require a common key that is shared between tables, in this case subject_id. Once you know the fields you want to SELECT and the common key field, you are ready to create your query.
patients.subject_id.This selects the subject_id column from the patients table. The patients. prefix is used to explicitly state which table the subject_id column comes from, which is good practice, especially when you have columns with the same name in different tables like we have here.
INNER JOIN admissions.This is the INNER JOIN clause. It combines rows from the patients table with rows from the admissions table. An INNER JOIN only returns rows where there is a match in both tables based on the shared key field. There are other types of JOINS, but INNER JOIN is the only one we will use.
ON patients.subject_id = admissions.subject_id. This is the ON clause, which explicitly states the shared the key field. It tells the database to link rows from the patients table to rows from the admissions table where the subject_id in the patients table is equal to the subject_id in the admissions table. This subject_id is a unique identifier for each patient, allowing you to follow the patient across tables in the database.
Double Join example
A join can have any number of tables. To join more tables, just keep adding INNER JOIN and ON clauses. Here, I join our previous query to the icustays table to find out when these patients were admitted to the ICU. The ICUStays table has an “intime” variable and, most importantly, also has our shared key, subject_id
SELECT patients.subject_id, dob, marital_status, intime FROM patientsINNERJOIN admissionsON patients.subject_id = admissions.subject_idINNERJOIN icustays ON patients.subject_id = icustays.subject_idWHERE marital_status ="MARRIED"LIMIT5
5 records
subject_id
dob
marital_status
intime
10027
2108-01-15 00:00:00
MARRIED
2190-07-13 10:39:07
10033
2051-04-21 00:00:00
MARRIED
2132-12-05 13:13:18
10035
2053-04-13 00:00:00
MARRIED
2129-03-04 13:40:11
10036
1885-03-24 00:00:00
MARRIED
2185-03-24 16:57:05
10045
2061-03-25 00:00:00
MARRIED
2129-11-24 22:46:57
myquery <- ’’ dbGetQuery(mydb,myquery)
the SQL CAST() function
As we work with more and different types of data in the MIMIC-3, we are likely to encounter unusual “real-world” stuff. MIMIC-3 has all the bumps and wrinkles of a real, working database.
Let’s see an example of this. Here, I am joining chartevents to d_items to pull readings for an item called “Heart Rate” with itemid = 220045. The join is necessary because while the values and charttimes are in chartevents, the label is in d_items. I also save the data this time to a dataframe. This should be familiar to you from last week.
In the query above, the variable valuenum from CHARTEVENTS is supposed to be a numeric value for a patients heartrate, but sometimes SQL will still interpret a number as text. How do we know it is treating valuenum as text? Well, if it is a number, we should be able to calculate the mean.
Warning in mean.default(myquery1$valuenum): argument is not numeric or logical:
returning NA
[1] NA
We cannot. To fix this, we need to explicitly tell SQL to treat this as an integer with CAST(). Unfortunately, using functions like CAST can create some bad variable names that are hard to work with. Now SQL wants to treat the whole “CAST(valuenum AS INT)” as the name of this variable, so I use an alias with AS to rename this to just “val”.
Tip
You can always rename any variable you want using an alias with AS. Try it yourself!
Bar charts (geom_bar()) like we used last week are great for counting qualitative data (names, cities etc.). Quantitative data though (like the valuenum example above) can be shown effectively using some different geoms. Let’s look at two of them using our heartrate data from the CAST() example above.
geom_boxplot()
Boxplots are a nice go-to for quantitative data like these heart rates. Notice that the only thing that is really different is geom_boxplot() instead of geom_bar(). They are sort of demanding on your audience though. People need to know some statistics to understand boxplots. Things like interquartile range, medians, etc. Plus with grouped data it can be tough to explain whether and how groups differ.
ggplot(data = myquery1,aes(val)) +geom_boxplot()
Tip
A boxplot displays a five-number summary of data:
Minimum: The lowest value, excluding outliers. (Bottom of the thin line)
First Quartile (Q1): The 25th percentile (bottom of the box).
Median: The 50th percentile (middle line of the box).
Third Quartile (Q3): The 75th percentile (top of the box).
Maximum: The highest value, excluding outliers. (Top of the thin line)
Outliers: Points located further than 1.5 times the box height from the box edges, displayed individually.
geom_violin()
If you think your audience is less statistically inclined, in such cases I recommend geom_violin(). A violin plot shows the distribution of a continuous variable. The width of the “violin” at any given point represents the density of data points at that value. They are excellent for comparing distributions of data across different groups or categories. Let’s add gender to our query, then we will create different geom_violins for each gender and see if men and women experience different distributions in their heart rate variations.
ggplot(data = myquery2,aes(x = gender, y = val)) +geom_violin()
We can see from the violin plots that female heart rate tends to cluster higher than men, and men show more dispersion in terms of their rates.
Stacked bar charts
Finally, here is a nice trick you can do with a geom_bar(). A “stacked bar” chart lets you show two qualitative variables on the same bar chart. Let’s count the labels from myquery2 to see how many total readings we took of heart rates on chartevents.
We have over 8,000 distinct readings! How many do we have for each gender? Looks like a little more than half are for men.
ggplot(data = myquery2,mapping =aes(label, fill = gender)) +geom_bar()
Now that you’ve had a chance to read about JOINS in SQL, open your 673_practice2.qmd file to start practicing on your own. Remember that your visualizations for your report should use concepts from the practice, but show how you have learned to extend the practices, not merely copy them.