library(tidyverse)
library(DBI)
library(RSQLite)
# Connect to the database
mydb <- dbConnect(RSQLite::SQLite(), "../MIMIC-III.db")MBA 673 Practice 1
Exploring MIMIC III
MIMIC-III (Medical Information Mart for Intensive Care III) is a massive, freely accessible database comprising de-identified health data from patients who stayed in intensive care units (ICUs) at the Beth Israel Deaconess Medical Center.
It is not a synthetic dataset; it contains real, highly granular clinical data collected over more than a decade.
Because it mimics the messy, complex reality of hospital data systems, it has become the global gold standard for teaching health informatics and clinical data analysis. It spans across multiple interconnected tables containing:
- Demographics & Administrative Data: Age, gender, insurance status, and length of stay (
patientsandadmissionstables). - Clinical Measurements: Vital signs, laboratory test results, and medications administered.
- Diagnostic Codes: Highly standardized international codes (ICD-9 codes) capturing explicit primary and secondary diagnoses.
Why Healthcare Managers need to know how to Query data structures like the MIMIC III.
For an administrator or manager in the health field, data is key to regulatory compliance and quality control. Learning how to write queries against a system like MIMIC-III is highly valuable for several key reasons:
- Querying clinical data directly allows managers to answer questions without waiting days for a specialized data analyst to build a report:
- Modern healthcare reimbursement operates tightly around documented diagnostic and procedural codes. If care is delivered but not coded accurately, a hospital loses revenue; if it’s over-coded, they face massive fraud penalties. Querying a relational schema teaches a manager how to conduct internal data forensics—verifying that the clinical timeline in the chart matches the final billing output.
- Relying on pre-packaged PDF reports limits a leader to looking backward. By mastering basic SQL and connecting it to visualization engines (like ggplot), a manager gains the power to design and interpret their own real-time clinical operations dashboards.
Ultimately, knowing how to query MIMIC-III bridges the gap between raw medical data and high-level strategy. Knowing how to find your way around a resource like this transforms you from a passive consumer of reports into an agile, data-driven leader.
Database Connection
First, we load some required packages and establish a connection to the MIMIC-3 SQLite database. Once this chunk is run, RStudio’s “Connections” pane will index the tables, making the database much easier to work with.
Database Exploration
This practice is mostly about learning how to explore a SQL database.
You can view the database schema for the MIMIC-3 using SchemaSpy here.
Watch the accompanying video on Moodle so that you understand what you are seeing.
##In addition to SchemaSpy, these next commands are also useful to find your way around.
#Use this command to view tables in a database
dbListTables(mydb) [1] "ADMISSIONS" "CALLOUT" "CAREGIVERS"
[4] "CHARTEVENTS" "CPTEVENTS" "DATETIMEEVENTS"
[7] "DIAGNOSES_ICD" "DRGCODES" "D_CPT"
[10] "D_ICD_DIAGNOSES" "D_ICD_PROCEDURES" "D_ITEMS"
[13] "D_LABITEMS" "ICUSTAYS" "INPUTEVENTS_CV"
[16] "INPUTEVENTS_MV" "LABEVENTS" "MICROBIOLOGYEVENTS"
[19] "NOTEEVENTS" "OUTPUTEVENTS" "PATIENTS"
[22] "PRESCRIPTIONS" "PROCEDUREEVENTS_MV" "PROCEDURES_ICD"
[25] "SERVICES" "TRANSFERS"
##Use this command to view fields in a table, for example patients.
dbListFields(mydb, "admissions") [1] "row_id" "subject_id" "hadm_id"
[4] "admittime" "dischtime" "deathtime"
[7] "admission_type" "admission_location" "discharge_location"
[10] "insurance" "language" "religion"
[13] "marital_status" "ethnicity" "edregtime"
[16] "edouttime" "diagnosis" "hospital_expire_flag"
[19] "has_chartevents_data"
Key SQL Concepts for this Practice
SQL Clauses to Know
This first homework focuses on SELECT, FROM and WHERE commands in SQL.
- SELECT is used to pull “fields” out of the database and display them.
- FROM tells SQL which database table to pull the fields from.
- WHERE allows you to filter the results based on a variety of conditions.
Here is our first full SQL query. It uses SELECT and FROM to pull patient genders from the patients table. In this example, “gender” is the field and patients is the table. I knew the names of these variables from using SchemaSpy and the database exploration commands above. You NEVER need to guess about a variable name in a database.
-- This is a specific "SQL" block. It is designed to make working with SQL easy.
-- That is why the keywords are colored. You can hit "tab" inside this block to auto-complete a variable name.
-- Here is our first full SQL query. It uses SELECT and FROM to pull patient genders from the patients table.
-- In this example, "gender" is the field and patients is the table. LIMIT 5 makes it return just the first 5 results.
SELECT gender
FROM patients
LIMIT 5| gender |
|---|
| F |
| F |
| F |
| F |
| M |
Here’s another one that SELECTS three fields, subject_id, gender and dob (date of birth). When using SELECT, fields are always separated by commas. You can also put as much blank space as you want in your queries to make them easier to read. I’ve done this below, splitting up SELECT and FROM.
SELECT subject_id, gender, dob
FROM patients
LIMIT 5;| subject_id | gender | dob |
|---|---|---|
| 10006 | F | 2094-03-05 00:00:00 |
| 10011 | F | 2090-06-05 00:00:00 |
| 10013 | F | 2038-09-03 00:00:00 |
| 10017 | F | 2075-09-21 00:00:00 |
| 10019 | M | 2114-06-20 00:00:00 |
Adding the WHERE to filter results
Now we just need to add WHERE. We will learn different types of filters gradually. Here are some examples. Run them and see what they do. Play around with the examples A bit until you thoroughly understand them.
-- Using WHERE to select text that matches exactly with "="
SELECT subject_id, gender, dob
FROM patients
WHERE gender = "F"
LIMIT 5| subject_id | gender | dob |
|---|---|---|
| 10006 | F | 2094-03-05 00:00:00 |
| 10011 | F | 2090-06-05 00:00:00 |
| 10013 | F | 2038-09-03 00:00:00 |
| 10017 | F | 2075-09-21 00:00:00 |
| 10026 | F | 1895-05-17 00:00:00 |
-- With BETWEEN to select a range of numeric data
SELECT subject_id, gender, dob
FROM patients
WHERE subject_id BETWEEN 43779 AND 43800| subject_id | gender | dob |
|---|---|---|
| 43779 | M | 2097-01-07 00:00:00 |
| 43798 | M | 2136-07-29 00:00:00 |
-- Using LIKE to match a partial term. This matches all IDs that begin with 43.
SELECT subject_id, gender, dob
FROM patients
WHERE subject_id LIKE "43%"
LIMIT 5| subject_id | gender | dob |
|---|---|---|
| 43735 | M | 2046-07-05 00:00:00 |
| 43746 | F | 2029-12-07 00:00:00 |
| 43748 | M | 2099-09-02 00:00:00 |
| 43779 | M | 2097-01-07 00:00:00 |
| 43798 | M | 2136-07-29 00:00:00 |
-- Selecting two specific values with "IN".
SELECT subject_id, gender, dob
FROM patients
WHERE subject_id IN ("43779", "43746")| subject_id | gender | dob |
|---|---|---|
| 43746 | F | 2029-12-07 00:00:00 |
| 43779 | M | 2097-01-07 00:00:00 |
Introduction to GGPLOT
Pulling the data from the database is interesting, but the real power of Posit cloud and other similar software like Python etc. Comes from being able to then graph and visualize our results, all within the same script.
To do this, we will use a very powerful package called ggplot. ggplot2 is a powerful and flexible R package for creating a wide range of static and dynamic graphics. It follows a grammar of graphics approach, meaning it combines independent components like data, aesthetics, and geometric objects to build complex visualizations.
Basic Structure of a ggplot A typical ggplot consists of the following layers: ggplot(): Creates the base of the plot, specifying the data to be used. Aesthetic mappings (aes()): Determine how variables in the data are mapped to visual properties (x, y, color, size, etc.).
Geometries (geom_): Specify the type of graphical object to use
ggplot example
Are there more women or men in this database? We will plot gender on the “x-axis”. Two things are important to note here. The first is that the data for the plot comes from an object we will save using a SQL block. If you want to change the plot, you need to change this query in the SQL block and run that block again.
-- If you look at the code file, you'll notice this block has an output.var option, which saves the data to a dataframe object called myquery
-- you can click on myquery in your environment window (on the right side of the screen) to see what data it contains.
SELECT subject_id, gender, dob
FROM patientsNow that we have our data, we use an R code block and ggplot to graph it. Your variable names that you use MUST MATCH your data. We are plotting gender, not “Gender”, for example. Also, you can’t plot any variables that aren’t in your data.
ggplot is not some obscure tool. It is extremely popular in data analysis fields. You can google or ChatGPT any part of this code and get tons of examples and video explanations.
# Notice our data is myquery, and we put gender on the x axis.
ggplot(data = myquery,
aes(x = gender)) +
geom_bar()Now that you’ve had a chance to read about these concepts, open your 673_practice1.qmd file to start practicing on your own.