Before beginning, please test whether the Rmd file will compile on your system by clicking the "Knit to HTML" button in RStudio. If it does not compile, seek out help in the forum or online. Chances are, you are missing a component necessary for rmarkdown to compile on your system. We would be very interested in: 1) what was wrong, 2) how you fixed it, and 3) what resources you used. This will help us improve future workshops.
## INSTALL THESE PACKAGES THAT IT ASKED FOR WHEN OPENING THE RMARKDOWN: install.packages("devtools") install.packages("Hmisc") install.packages("tableone"). The first error is thrown at line 73, in $data.frame(temp, gender_num, ...): replacement has 0 rows.
How This Workshop Works: There is some text, followed by relevant examples with code and output, followed by some student questions. The intention is for students to complete the student questions, which will provide answers to the multiple choice assessment for this subsection of the course that can be entered into EdX. The goal is to provide you with a solid foundation in some concepts related to R, rmarkdown, and exploratory data analysis.
To complete the workshop, fill in necessary code in the places indicated with # Students: Insert your code here. Use the output from “Knit to HTML” to answer the multiple choice questions in EdX to receive credit for completing the workshop.
EDA’s goal is to better understand the data and the process by which it was generated.
Within statistics, it is largely considered separate from inferential/confirmatory statistics (e.g., hypothesis testing, point and interval estimation, etc.), and EDA has its own diverse and important set of goals.
EDA was coined, developed, and advocated for by John Tukey. His book, entitled "Exploratory Data Analysis," was published in 1977 and is still in use today. It may seem like an oddity now, but it represented a fundamental change in how data science and statistics were done. He sums up EDA with this quote:
“It is important to understand what you CAN DO, before you learn to measure how WELL you seem to have DONE it.” – J. W. Tukey (1977)
If you don’t understand the data, it becomes difficult to know how to analyze it. Confirmatory and exploratory analyses are not superior or inferior to one another, rather they are complementary. With all the tools available to do both, ignoring one of them is inexcusable.
“Today, exploratory and confirmatory (analysis) – can – and should – proceed side by side.” – J. W. Tukey (1977)
There is often an urge, whether due to productivity pressures, laziness, or other factors, to plow through an analysis, using sophisticated techniques to find the results you are seeking. With the proliferation of large datasets, this can be quite ineffective, as it largely separates the analyst from the data, resulting in misunderstanding, or not understanding the data at all.
There is some evidence that cognitive disfluency (making it harder to learn) can lead to deeper learning. For analysts and data scientists this means slowing things down, often using basic (and sometimes tedious) methods to integrate the primary structure and relationships contained in the data, before pulling out the heavy machinery of modern data analysis.
See: Alter, A.L., 2013. The benefits of cognitive disfluency. Current Directions in Psychological Science, 22(6), pp.437-442.
All too often the success/failure of an analysis is determined from a single number, when in reality, understanding the data should be the goal.
For this workshop we will use data from a study that examined the effect of indwelling arterial catheters (aka IAC or aline) on 28 day mortality in intensive care unit (ICU) patients on mechanical ventilation during the first day of ICU admission. The data originates from MIMIC II v2.6. The data is ready for exploratory data analysis (the data extraction and cleaning have already been completed), and contained in a comma separated value (.csv) file generated after this process and stored on Physionet. Start by loading the data file from Physionet into a data frame called dat:
# If the following command does not work for you, please download the data file from Physionet at the URL below, run file.choose(), locate your downloaded CSV file, and replace read.csv(url) with read.csv("file_path") where file_path is the output from file.choose().
#The code below did not work for importing or downloading the dataset:
#url <- "https://github.com/criticaldata/hst953-edx/blob/master/data/aline_full_cohort_data.csv"
#dat <- read.csv(url)
# Public dataset has NA values for variables required to complete the workshop.
# Replace NA values with defaults (0) since the dataset is only intended for teaching.
dat <- read.csv("C:/Users/Pilar/Desktop/CURSO_MIT/hst953-edx-master/data/aline_full_cohort_data.csv", stringsAsFactors=TRUE)
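One likely reason the original download failed is that the GitHub URL above points to the HTML "blob" page rather than to the file itself, so read.csv receives HTML. A hedged alternative, left commented out because the raw-file path below is merely inferred from that URL and has not been verified here:
# GitHub "blob" URLs return a web page; the raw.githubusercontent.com form
# (inferred from the URL above; unverified) serves the CSV file directly:
# url <- "https://raw.githubusercontent.com/criticaldata/hst953-edx/master/data/aline_full_cohort_data.csv"
# dat <- read.csv(url, stringsAsFactors = TRUE)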
#Run a summary to look at the NAs
summary(dat)
## aline_flg icu_los_day hospital_los_day age
## Min. :0.0000 Min. : 0.500 Min. : 1.000 Min. :15.18
## 1st Qu.:0.0000 1st Qu.: 1.370 1st Qu.: 3.000 1st Qu.:38.25
## Median :1.0000 Median : 2.185 Median : 6.000 Median :53.68
## Mean :0.5541 Mean : 3.346 Mean : 8.111 Mean :54.38
## 3rd Qu.:1.0000 3rd Qu.: 4.003 3rd Qu.: 10.000 3rd Qu.:72.76
## Max. :1.0000 Max. :28.240 Max. :112.000 Max. :99.11
##
## gender_num weight_first bmi sapsi_first
## Min. :0.0000 Min. : 30.00 Min. :12.78 Min. : 3.00
## 1st Qu.:0.0000 1st Qu.: 65.40 1st Qu.:22.62 1st Qu.:11.00
## Median :1.0000 Median : 77.00 Median :26.32 Median :14.00
## Mean :0.5775 Mean : 80.08 Mean :27.83 Mean :14.14
## 3rd Qu.:1.0000 3rd Qu.: 90.00 3rd Qu.:30.80 3rd Qu.:17.00
## Max. :1.0000 Max. :257.60 Max. :98.80 Max. :32.00
## NA's :1 NA's :110 NA's :466 NA's :85
## sofa_first service_unit service_num day_icu_intime
## Min. : 0.000 FICU: 62 Min. :0.0000 Friday :234
## 1st Qu.: 4.000 MICU:732 1st Qu.:0.0000 Monday :260
## Median : 6.000 SICU:982 Median :1.0000 Saturday :278
## Mean : 5.821 Mean :0.5529 Sunday :230
## 3rd Qu.: 7.000 3rd Qu.:1.0000 Thursday :261
## Max. :17.000 Max. :1.0000 Tuesday :257
## NA's :6 Wednesday:256
## day_icu_intime_num hour_icu_intime hosp_exp_flg icu_exp_flg
## Min. :1.000 Min. : 0.00 Min. :0.0000 Min. :0.00000
## 1st Qu.:2.000 1st Qu.: 3.00 1st Qu.:0.0000 1st Qu.:0.00000
## Median :4.000 Median : 9.00 Median :0.0000 Median :0.00000
## Mean :4.054 Mean :10.59 Mean :0.1374 Mean :0.09572
## 3rd Qu.:6.000 3rd Qu.:19.00 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :7.000 Max. :23.00 Max. :1.0000 Max. :1.00000
##
## day_28_flg mort_day_censored censor_flg sepsis_flg
## Min. :0.0000 Min. : 0.0 Min. :0.0000 Min. :0
## 1st Qu.:0.0000 1st Qu.: 434.3 1st Qu.:0.0000 1st Qu.:0
## Median :0.0000 Median : 731.0 Median :1.0000 Median :0
## Mean :0.1593 Mean : 614.3 Mean :0.7202 Mean :0
## 3rd Qu.:0.0000 3rd Qu.: 731.0 3rd Qu.:1.0000 3rd Qu.:0
## Max. :1.0000 Max. :3094.1 Max. :1.0000 Max. :0
##
## chf_flg afib_flg renal_flg liver_flg
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.0000 Median :0.0000 Median :0.00000 Median :0.00000
## Mean :0.1199 Mean :0.1166 Mean :0.03378 Mean :0.05574
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.00000
##
## copd_flg cad_flg stroke_flg mal_flg
## Min. :0.0000 Min. :0.00000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.000 1st Qu.:0.0000
## Median :0.0000 Median :0.00000 Median :0.000 Median :0.0000
## Mean :0.0884 Mean :0.06926 Mean :0.125 Mean :0.1441
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.00000 Max. :1.000 Max. :1.0000
##
## resp_flg map_1st hr_1st temp_1st
## Min. :0.0000 Min. : 5.00 Min. : 30.00 Min. : 32.00
## 1st Qu.:0.0000 1st Qu.: 76.67 1st Qu.: 74.75 1st Qu.: 96.90
## Median :0.0000 Median : 87.00 Median : 87.00 Median : 98.10
## Mean :0.3181 Mean : 88.25 Mean : 87.91 Mean : 97.79
## 3rd Qu.:1.0000 3rd Qu.: 99.00 3rd Qu.:100.00 3rd Qu.: 99.30
## Max. :1.0000 Max. :195.00 Max. :158.00 Max. :104.80
## NA's :3
## spo2_1st abg_count wbc_first hgb_first
## Min. : 4.00 Min. : 0.000 Min. : 0.17 Min. : 2.00
## 1st Qu.: 98.00 1st Qu.: 1.000 1st Qu.: 8.20 1st Qu.:11.10
## Median :100.00 Median : 3.000 Median : 11.30 Median :12.70
## Mean : 98.43 Mean : 5.985 Mean : 12.32 Mean :12.55
## 3rd Qu.:100.00 3rd Qu.: 7.000 3rd Qu.: 15.00 3rd Qu.:14.12
## Max. :100.00 Max. :115.000 Max. :109.80 Max. :19.00
## NA's :8 NA's :8
## platelet_first sodium_first potassium_first tco2_first
## Min. : 7.0 Min. :105.0 Min. :1.900 Min. : 2.00
## 1st Qu.:182.0 1st Qu.:137.0 1st Qu.:3.600 1st Qu.:22.00
## Median :239.0 Median :140.0 Median :4.000 Median :24.00
## Mean :246.1 Mean :139.6 Mean :4.108 Mean :24.42
## 3rd Qu.:297.0 3rd Qu.:142.0 3rd Qu.:4.400 3rd Qu.:27.00
## Max. :988.0 Max. :165.0 Max. :9.800 Max. :62.00
## NA's :8 NA's :5 NA's :5 NA's :5
## chloride_first bun_first creatinine_first po2_first
## Min. : 78.0 Min. : 2.00 Min. : 0.000 Min. : 22.0
## 1st Qu.:101.0 1st Qu.: 11.00 1st Qu.: 0.700 1st Qu.:108.0
## Median :104.0 Median : 15.00 Median : 0.900 Median :195.0
## Mean :103.8 Mean : 19.28 Mean : 1.096 Mean :227.6
## 3rd Qu.:107.0 3rd Qu.: 22.00 3rd Qu.: 1.100 3rd Qu.:323.0
## Max. :133.0 Max. :139.00 Max. :18.300 Max. :634.0
## NA's :5 NA's :5 NA's :6 NA's :186
## pco2_first iv_day_1
## Min. : 8.00 Min. : 0.0
## 1st Qu.: 36.00 1st Qu.: 329.8
## Median : 41.00 Median : 1081.5
## Mean : 43.41 Mean : 1622.9
## 3rd Qu.: 47.00 3rd Qu.: 2493.9
## Max. :158.00 Max. :13910.0
## NA's :186 NA's :143
#sofa_first has 6 NAs and gender_num has 1 NA; both are converted to 0
#Afterwards I check with summary and see that there are no NAs left
dat$gender_num[is.na(dat$gender_num)] <- 0
dat$sofa_first[is.na(dat$sofa_first)] <- 0
summary(dat$gender_num)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 1.0000 0.5771 1.0000 1.0000
summary(dat$sofa_first)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 4.000 6.000 5.801 7.000 17.000
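A quicker check than scanning the full summary output (not part of the original workshop, just base R) is to count the remaining NAs directly:
colSums(is.na(dat[, c("gender_num", "sofa_first")]))  # both counts should now be 0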
It is very important that you understand how to load CSV files, as this is often the easiest way to load data when not using bigrquery. As noted in the code, the file.choose() function can be useful for identifying the path to the CSV file; you should run it in the console. (The file.choose command helps me browse the directory where I keep the dataset.)
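Here is a minimal sketch of that interactive workflow; it is commented out because file.choose() requires user input and so cannot run while knitting:
# path <- file.choose()                           # opens a file picker and returns the chosen path
# dat <- read.csv(path, stringsAsFactors = TRUE)  # read the CSV from that path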
One form of EDA is to provide numerical summaries of the dataset. This can have many purposes:
R has a very handy function which behaves differently depending on the type of data structure you apply it to: the summary function. It provides a very useful summary of data frames, in the form of five-number summaries plus the mean (min, Q1, median, mean, Q3, max) for numeric data, and counts for categorical (factor) data. summary also works on many other types of R objects, and when you don't know what to do with an R object (obj), it is often a good idea to try summary(obj).
summary(dat)
## aline_flg icu_los_day hospital_los_day age
## Min. :0.0000 Min. : 0.500 Min. : 1.000 Min. :15.18
## 1st Qu.:0.0000 1st Qu.: 1.370 1st Qu.: 3.000 1st Qu.:38.25
## Median :1.0000 Median : 2.185 Median : 6.000 Median :53.68
## Mean :0.5541 Mean : 3.346 Mean : 8.111 Mean :54.38
## 3rd Qu.:1.0000 3rd Qu.: 4.003 3rd Qu.: 10.000 3rd Qu.:72.76
## Max. :1.0000 Max. :28.240 Max. :112.000 Max. :99.11
##
## gender_num weight_first bmi sapsi_first
## Min. :0.0000 Min. : 30.00 Min. :12.78 Min. : 3.00
## 1st Qu.:0.0000 1st Qu.: 65.40 1st Qu.:22.62 1st Qu.:11.00
## Median :1.0000 Median : 77.00 Median :26.32 Median :14.00
## Mean :0.5771 Mean : 80.08 Mean :27.83 Mean :14.14
## 3rd Qu.:1.0000 3rd Qu.: 90.00 3rd Qu.:30.80 3rd Qu.:17.00
## Max. :1.0000 Max. :257.60 Max. :98.80 Max. :32.00
## NA's :110 NA's :466 NA's :85
## sofa_first service_unit service_num day_icu_intime
## Min. : 0.000 FICU: 62 Min. :0.0000 Friday :234
## 1st Qu.: 4.000 MICU:732 1st Qu.:0.0000 Monday :260
## Median : 6.000 SICU:982 Median :1.0000 Saturday :278
## Mean : 5.801 Mean :0.5529 Sunday :230
## 3rd Qu.: 7.000 3rd Qu.:1.0000 Thursday :261
## Max. :17.000 Max. :1.0000 Tuesday :257
## Wednesday:256
## day_icu_intime_num hour_icu_intime hosp_exp_flg icu_exp_flg
## Min. :1.000 Min. : 0.00 Min. :0.0000 Min. :0.00000
## 1st Qu.:2.000 1st Qu.: 3.00 1st Qu.:0.0000 1st Qu.:0.00000
## Median :4.000 Median : 9.00 Median :0.0000 Median :0.00000
## Mean :4.054 Mean :10.59 Mean :0.1374 Mean :0.09572
## 3rd Qu.:6.000 3rd Qu.:19.00 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :7.000 Max. :23.00 Max. :1.0000 Max. :1.00000
##
## day_28_flg mort_day_censored censor_flg sepsis_flg
## Min. :0.0000 Min. : 0.0 Min. :0.0000 Min. :0
## 1st Qu.:0.0000 1st Qu.: 434.3 1st Qu.:0.0000 1st Qu.:0
## Median :0.0000 Median : 731.0 Median :1.0000 Median :0
## Mean :0.1593 Mean : 614.3 Mean :0.7202 Mean :0
## 3rd Qu.:0.0000 3rd Qu.: 731.0 3rd Qu.:1.0000 3rd Qu.:0
## Max. :1.0000 Max. :3094.1 Max. :1.0000 Max. :0
##
## chf_flg afib_flg renal_flg liver_flg
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.0000 Median :0.0000 Median :0.00000 Median :0.00000
## Mean :0.1199 Mean :0.1166 Mean :0.03378 Mean :0.05574
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.00000
##
## copd_flg cad_flg stroke_flg mal_flg
## Min. :0.0000 Min. :0.00000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.000 1st Qu.:0.0000
## Median :0.0000 Median :0.00000 Median :0.000 Median :0.0000
## Mean :0.0884 Mean :0.06926 Mean :0.125 Mean :0.1441
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.00000 Max. :1.000 Max. :1.0000
##
## resp_flg map_1st hr_1st temp_1st
## Min. :0.0000 Min. : 5.00 Min. : 30.00 Min. : 32.00
## 1st Qu.:0.0000 1st Qu.: 76.67 1st Qu.: 74.75 1st Qu.: 96.90
## Median :0.0000 Median : 87.00 Median : 87.00 Median : 98.10
## Mean :0.3181 Mean : 88.25 Mean : 87.91 Mean : 97.79
## 3rd Qu.:1.0000 3rd Qu.: 99.00 3rd Qu.:100.00 3rd Qu.: 99.30
## Max. :1.0000 Max. :195.00 Max. :158.00 Max. :104.80
## NA's :3
## spo2_1st abg_count wbc_first hgb_first
## Min. : 4.00 Min. : 0.000 Min. : 0.17 Min. : 2.00
## 1st Qu.: 98.00 1st Qu.: 1.000 1st Qu.: 8.20 1st Qu.:11.10
## Median :100.00 Median : 3.000 Median : 11.30 Median :12.70
## Mean : 98.43 Mean : 5.985 Mean : 12.32 Mean :12.55
## 3rd Qu.:100.00 3rd Qu.: 7.000 3rd Qu.: 15.00 3rd Qu.:14.12
## Max. :100.00 Max. :115.000 Max. :109.80 Max. :19.00
## NA's :8 NA's :8
## platelet_first sodium_first potassium_first tco2_first
## Min. : 7.0 Min. :105.0 Min. :1.900 Min. : 2.00
## 1st Qu.:182.0 1st Qu.:137.0 1st Qu.:3.600 1st Qu.:22.00
## Median :239.0 Median :140.0 Median :4.000 Median :24.00
## Mean :246.1 Mean :139.6 Mean :4.108 Mean :24.42
## 3rd Qu.:297.0 3rd Qu.:142.0 3rd Qu.:4.400 3rd Qu.:27.00
## Max. :988.0 Max. :165.0 Max. :9.800 Max. :62.00
## NA's :8 NA's :5 NA's :5 NA's :5
## chloride_first bun_first creatinine_first po2_first
## Min. : 78.0 Min. : 2.00 Min. : 0.000 Min. : 22.0
## 1st Qu.:101.0 1st Qu.: 11.00 1st Qu.: 0.700 1st Qu.:108.0
## Median :104.0 Median : 15.00 Median : 0.900 Median :195.0
## Mean :103.8 Mean : 19.28 Mean : 1.096 Mean :227.6
## 3rd Qu.:107.0 3rd Qu.: 22.00 3rd Qu.: 1.100 3rd Qu.:323.0
## Max. :133.0 Max. :139.00 Max. :18.300 Max. :634.0
## NA's :5 NA's :5 NA's :6 NA's :186
## pco2_first iv_day_1
## Min. : 8.00 Min. : 0.0
## 1st Qu.: 36.00 1st Qu.: 329.8
## Median : 41.00 Median : 1081.5
## Mean : 43.41 Mean : 1622.9
## 3rd Qu.: 47.00 3rd Qu.: 2493.9
## Max. :158.00 Max. :13910.0
## NA's :186 NA's :143
As you can see, this function is very verbose, but produces some useful output. At this point, it's also a good idea to verify that the number of rows and columns is correct:
nrow(dat)
## [1] 1776
ncol(dat)
## [1] 46
This should be what you’re expecting (1776 and 46). If it’s not, this could indicate a loading error, or a problem with the data extraction.
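If you would rather have the document fail loudly than knit with the wrong data, a one-line guard (not part of the original workshop) does the trick:
stopifnot(nrow(dat) == 1776, ncol(dat) == 46)  # stops knitting if the dimensions are unexpected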
As you will note, many of the flg variables listed in the summary output above are constrained to 0 and 1. This is because they have a binary encoding (usually 1 if present, 0 if not). Although not necessary in this particular instance, it is sometimes useful to encode these types of variables as factors. The function convert.bin.fac will do this; we'll use it to create a new data frame called dat2, and then call the summary function on the new data frame. We first need to install and load two packages (devtools and MIMICbook).
#Install the MIMICbook package; devtools was installed earlier
#convert.bin.fac(data) converts the binary variables into factors, such as gender_num or aline_flg
if(!("devtools" %in% installed.packages()[,1])) {
install.packages("devtools",repos="https://cloud.r-project.org")
}
library(devtools)
## Loading required package: usethis
if(!("MIMICbook" %in% installed.packages()[,1])) {
install_github("jraffa/MIMICbook")
}
library(MIMICbook)
The MIMICbook package provides some useful functions written for the textbook that we will use throughout some of the workshops. It installs via GitHub.
dat2 <- convert.bin.fac(dat)
summary(dat2)
## aline_flg icu_los_day hospital_los_day age gender_num
## 0:792 Min. : 0.500 Min. : 1.000 Min. :15.18 0: 751
## 1:984 1st Qu.: 1.370 1st Qu.: 3.000 1st Qu.:38.25 1:1025
## Median : 2.185 Median : 6.000 Median :53.68
## Mean : 3.346 Mean : 8.111 Mean :54.38
## 3rd Qu.: 4.003 3rd Qu.: 10.000 3rd Qu.:72.76
## Max. :28.240 Max. :112.000 Max. :99.11
##
## weight_first bmi sapsi_first sofa_first service_unit
## Min. : 30.00 Min. :12.78 Min. : 3.00 Min. : 0.000 FICU: 62
## 1st Qu.: 65.40 1st Qu.:22.62 1st Qu.:11.00 1st Qu.: 4.000 MICU:732
## Median : 77.00 Median :26.32 Median :14.00 Median : 6.000 SICU:982
## Mean : 80.08 Mean :27.83 Mean :14.14 Mean : 5.801
## 3rd Qu.: 90.00 3rd Qu.:30.80 3rd Qu.:17.00 3rd Qu.: 7.000
## Max. :257.60 Max. :98.80 Max. :32.00 Max. :17.000
## NA's :110 NA's :466 NA's :85
## service_num day_icu_intime day_icu_intime_num hour_icu_intime hosp_exp_flg
## 0:794 Friday :234 Min. :1.000 Min. : 0.00 0:1532
## 1:982 Monday :260 1st Qu.:2.000 1st Qu.: 3.00 1: 244
## Saturday :278 Median :4.000 Median : 9.00
## Sunday :230 Mean :4.054 Mean :10.59
## Thursday :261 3rd Qu.:6.000 3rd Qu.:19.00
## Tuesday :257 Max. :7.000 Max. :23.00
## Wednesday:256
## icu_exp_flg day_28_flg mort_day_censored censor_flg sepsis_flg chf_flg
## 0:1606 0:1493 Min. : 0.0 0: 497 0:1776 0:1563
## 1: 170 1: 283 1st Qu.: 434.3 1:1279 1: 213
## Median : 731.0
## Mean : 614.3
## 3rd Qu.: 731.0
## Max. :3094.1
##
## afib_flg renal_flg liver_flg copd_flg cad_flg stroke_flg mal_flg resp_flg
## 0:1569 0:1716 0:1677 0:1619 0:1653 0:1554 0:1520 0:1211
## 1: 207 1: 60 1: 99 1: 157 1: 123 1: 222 1: 256 1: 565
##
##
##
##
##
## map_1st hr_1st temp_1st spo2_1st
## Min. : 5.00 Min. : 30.00 Min. : 32.00 Min. : 4.00
## 1st Qu.: 76.67 1st Qu.: 74.75 1st Qu.: 96.90 1st Qu.: 98.00
## Median : 87.00 Median : 87.00 Median : 98.10 Median :100.00
## Mean : 88.25 Mean : 87.91 Mean : 97.79 Mean : 98.43
## 3rd Qu.: 99.00 3rd Qu.:100.00 3rd Qu.: 99.30 3rd Qu.:100.00
## Max. :195.00 Max. :158.00 Max. :104.80 Max. :100.00
## NA's :3
## abg_count wbc_first hgb_first platelet_first
## Min. : 0.000 Min. : 0.17 Min. : 2.00 Min. : 7.0
## 1st Qu.: 1.000 1st Qu.: 8.20 1st Qu.:11.10 1st Qu.:182.0
## Median : 3.000 Median : 11.30 Median :12.70 Median :239.0
## Mean : 5.985 Mean : 12.32 Mean :12.55 Mean :246.1
## 3rd Qu.: 7.000 3rd Qu.: 15.00 3rd Qu.:14.12 3rd Qu.:297.0
## Max. :115.000 Max. :109.80 Max. :19.00 Max. :988.0
## NA's :8 NA's :8 NA's :8
## sodium_first potassium_first tco2_first chloride_first
## Min. :105.0 Min. :1.900 Min. : 2.00 Min. : 78.0
## 1st Qu.:137.0 1st Qu.:3.600 1st Qu.:22.00 1st Qu.:101.0
## Median :140.0 Median :4.000 Median :24.00 Median :104.0
## Mean :139.6 Mean :4.108 Mean :24.42 Mean :103.8
## 3rd Qu.:142.0 3rd Qu.:4.400 3rd Qu.:27.00 3rd Qu.:107.0
## Max. :165.0 Max. :9.800 Max. :62.00 Max. :133.0
## NA's :5 NA's :5 NA's :5 NA's :5
## bun_first creatinine_first po2_first pco2_first
## Min. : 2.00 Min. : 0.000 Min. : 22.0 Min. : 8.00
## 1st Qu.: 11.00 1st Qu.: 0.700 1st Qu.:108.0 1st Qu.: 36.00
## Median : 15.00 Median : 0.900 Median :195.0 Median : 41.00
## Mean : 19.28 Mean : 1.096 Mean :227.6 Mean : 43.41
## 3rd Qu.: 22.00 3rd Qu.: 1.100 3rd Qu.:323.0 3rd Qu.: 47.00
## Max. :139.00 Max. :18.300 Max. :634.0 Max. :158.00
## NA's :5 NA's :6 NA's :186 NA's :186
## iv_day_1
## Min. : 0.0
## 1st Qu.: 329.8
## Median : 1081.5
## Mean : 1622.9
## 3rd Qu.: 2493.9
## Max. :13910.0
## NA's :143
As you can now see, instead of means (which under the old encoding equate to the proportions of patients where the variable == 1), we now have counts of patients at each level of the variable. This is because R's summary function treats factors and numeric values differently.
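To see this distinction in isolation, here is a tiny illustration (not from the workshop) of summary() applied to the same values stored as numeric and as a factor:
x <- c(0, 1, 1, 0, 1)
summary(x)          # numeric: quartiles plus the mean (here, the mean is the proportion of 1s)
summary(factor(x))  # factor: a count for each level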
Often, you will want to report these summaries separately for different groups. For instance, is the mean or median age the same for those who received an IAC and those who didn't? A multi-purpose function called tapply can help us with this. (tapply computes measures of central tendency and position for those who did and did not have an arterial line.)
tapply(dat2$age,dat2$aline_flg,summary)
## $`0`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.18 34.80 50.85 53.02 72.11 97.46
##
## $`1`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.19 40.36 56.02 55.48 73.21 99.11
This function stratifies the first argument (age) by the second argument (aline_flg) and runs the third argument (summary) on each stratum. So, in our case, it runs the summary function on age for those who received an IAC (aline_flg = 1) and those who didn't (aline_flg = 0).
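The third argument can be any function that accepts a vector, so the same pattern yields other stratified statistics. A short sketch (not part of the workshop; extra arguments after the function name are passed through to it):
tapply(dat2$age, dat2$aline_flg, median)                 # median age by IAC group
tapply(dat2$age, dat2$aline_flg, quantile, probs = 0.9)  # 90th percentile of age by IAC group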
- Using the `dat2` data frame, run the summary function for `age`, `sofa_first`, and `service_unit` separately for those with an IAC, and those without.
- Run the summary function for `age`, `sofa_first`, and `service_unit` separately for those who died within 28 days, and those who survived.
# Students: Insert your code here
tapply(dat2$sofa_first,dat2$service_unit,summary)
## $FICU
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 4.000 6.000 5.661 7.000 14.000
##
## $MICU
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 4.000 6.000 5.777 7.000 17.000
##
## $SICU
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 4.000 6.000 5.828 7.000 16.000
tapply(dat2$age,dat2$day_28_flg,summary)
## $`0`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.18 34.83 49.46 50.78 66.62 99.11
##
## $`1`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 22.06 65.32 77.83 73.35 83.83 97.46
tapply(dat2$sofa_first,dat2$day_28_flg,summary)
## $`0`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 4.000 5.000 5.661 7.000 16.000
##
## $`1`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 5.000 6.000 6.541 8.000 17.000
tapply(dat2$service_unit,dat$day_28_flg,summary)
## $`0`
## FICU MICU SICU
## 59 605 829
##
## $`1`
## FICU MICU SICU
## 3 127 153
#Code I wrote myself for the exercise
tapply(dat2$service_unit,dat2$aline_flg,summary)
## $`0`
## FICU MICU SICU
## 24 480 288
##
## $`1`
## FICU MICU SICU
## 38 252 694
tapply(dat2$day_28_flg,dat2$aline_flg,summary)
## $`0`
## 0 1
## 679 113
##
## $`1`
## 0 1
## 814 170
The output from summary is very useful, but is generally not acceptable for formal research reports, let alone a published paper. There are several ways to produce a table with a more publication-ready layout. One is described in the textbook (Chapter 15). Another, which we will cover here, is an R package called tableone.
As some of you may know, “Table 1” often refers to the table presented in most medical manuscripts which contains information used to describe the cohort. This typically includes information such as average patient age, gender distribution, and other important demographic, clinical, and socioeconomic characteristics. We will cover briefly how to use the CreateTableOne function in this package to generate a table which is closer to being publication worthy.
The following code will install the tableone package and load it. #I already have it installed, so I don't need to run this code. #Documentation: https://www.rdocumentation.org/packages/tableone/versions/0.12.0/topics/CreateTableOne
if(!("tableone" %in% installed.packages()[,1])) {
install.packages("tableone",repos="https://cloud.r-project.org")
}
library(tableone)
Here is an example function call to CreateTableOne, which computes either the mean and standard deviation for numeric variables, or the count and percentage for factors. You specify which variables you want to include in the table:
CreateTableOne(vars=c("age","service_unit","aline_flg","day_28_flg"),data=dat2)
##
## Overall
## n 1776
## age (mean (SD)) 54.38 (21.06)
## service_unit (%)
## FICU 62 ( 3.5)
## MICU 732 (41.2)
## SICU 982 (55.3)
## aline_flg = 1 (%) 984 (55.4)
## day_28_flg = 1 (%) 283 (15.9)
We may want to break down these summaries further, as we did above with tapply, but we can do it in a single call to CreateTableOne by passing the strata parameter. strata specifies which variable to stratify (break down) the others by. For example, here is the same table as in the previous chunk, broken down by whether or not a patient received an IAC.
CreateTableOne(vars=c("age","service_unit","aline_flg","day_28_flg"),strata="aline_flg",data=dat2,test=FALSE)
## Stratified by aline_flg
## 0 1
## n 792 984
## age (mean (SD)) 53.02 (21.67) 55.48 (20.51)
## service_unit (%)
## FICU 24 ( 3.0) 38 ( 3.9)
## MICU 480 (60.6) 252 ( 25.6)
## SICU 288 (36.4) 694 ( 70.5)
## aline_flg = 1 (%) 0 ( 0.0) 984 (100.0)
## day_28_flg = 1 (%) 113 (14.3) 170 ( 17.3)
- Compute a table to summarize the variables considered before (`age`, `service_unit`, `aline_flg` and `day_28_flg`), in addition to `gender_num` and `chf_flg`, but now stratified by survival at 28 days (`day_28_flg`).
- Repeat part a), but now use the `dat` data frame instead of `dat2`. Note the differences in how variables that were previously recast as factors are summarized. (I see that it takes the binary categorical variables and summarizes them as mean and SD, because it treats them as 0s and 1s.)
# Students: Insert your code here
CreateTableOne(vars=c("age","service_unit","aline_flg","day_28_flg","gender_num","chf_flg"),strata="day_28_flg",data=dat2,test=FALSE)
## Stratified by day_28_flg
## 0 1
## n 1493 283
## age (mean (SD)) 50.78 (20.06) 73.35 (15.32)
## service_unit (%)
## FICU 59 ( 4.0) 3 ( 1.1)
## MICU 605 (40.5) 127 ( 44.9)
## SICU 829 (55.5) 153 ( 54.1)
## aline_flg = 1 (%) 814 (54.5) 170 ( 60.1)
## day_28_flg = 1 (%) 0 ( 0.0) 283 (100.0)
## gender_num = 1 (%) 886 (59.3) 139 ( 49.1)
## chf_flg = 1 (%) 145 ( 9.7) 68 ( 24.0)
CreateTableOne(vars=c("age","service_unit","aline_flg","day_28_flg","gender_num","chf_flg"),strata="day_28_flg",data=dat,test=FALSE)
## Stratified by day_28_flg
## 0 1
## n 1493 283
## age (mean (SD)) 50.78 (20.06) 73.35 (15.32)
## service_unit (%)
## FICU 59 ( 4.0) 3 ( 1.1)
## MICU 605 (40.5) 127 (44.9)
## SICU 829 (55.5) 153 (54.1)
## aline_flg (mean (SD)) 0.55 (0.50) 0.60 (0.49)
## day_28_flg (mean (SD)) 0.00 (0.00) 1.00 (0.00)
## gender_num (mean (SD)) 0.59 (0.49) 0.49 (0.50)
## chf_flg (mean (SD)) 0.10 (0.30) 0.24 (0.43)
As an aside, the following code may help with your projects, as it improves the presentation of the tables above. You will still need to update the column and row names manually, but this should paste nicely into Word or LaTeX!
if(!("dplyr" %in% installed.packages()[,1])) {
install.packages("dplyr")
}
library(dplyr)
CreateTableOne(vars=c("age","service_unit","aline_flg","day_28_flg"),strata="aline_flg",data=dat2,test=FALSE) %>% print(
printToggle = FALSE,
showAllLevels = TRUE,
cramVars = "kon"
) %>%
{data.frame(
variable_name = gsub(" ", "&nbsp;", rownames(.), fixed = TRUE), .,  # non-breaking spaces preserve indentation when rendered
row.names = NULL,
check.names = FALSE,
stringsAsFactors = FALSE)} %>%
knitr::kable()
| variable_name | level | 0 | 1 |
|---|---|---|---|
| n |  | 792 | 984 |
| age (mean (SD)) |  | 53.02 (21.67) | 55.48 (20.51) |
| service_unit (%) | FICU | 24 ( 3.0) | 38 ( 3.9) |
|  | MICU | 480 ( 60.6) | 252 ( 25.6) |
|  | SICU | 288 ( 36.4) | 694 ( 70.5) |
| aline_flg (%) | 0 | 792 (100.0) | 0 ( 0.0) |
|  | 1 | 0 ( 0.0) | 984 (100.0) |
| day_28_flg (%) | 0 | 679 ( 85.7) | 814 ( 82.7) |
|  | 1 | 113 ( 14.3) | 170 ( 17.3) |
Sometimes you may wish to display the relationships between two or more variables directly. For categorical variables this can be tricky. One common way to explore relationships between categorical variables is by producing cross-tabulated tables ("crosstabs" for short). This is mainly done via the table function, which can take several categorical variables and produce the number of patients who meet the criteria for each combination. For instance, looking at how an IAC was used in men and women:
table(dat2$gender_num,dat2$aline_flg,dnn=c("Gender","IAC"))
## IAC
## Gender 0 1
## 0 345 406
## 1 447 578
We can see that an IAC was used 578 times in men (gender_num = 1) and 406 times in women (gender_num = 0). The raw numbers are often difficult to compare, so the proportions are often more useful. Applying prop.table to our existing table, with the additional argument 1 (for by row; use 2 for columns), we get the proportion of men and women who had an IAC (56% vs. 54%).
prop.table(table(dat2$gender_num,dat2$aline_flg,dnn=c("Gender","IAC")),1)
## IAC
## Gender 0 1
## 0 0.4593875 0.5406125
## 1 0.4360976 0.5639024
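For comparison, passing 2 as the margin argument gives column proportions, i.e., the gender mix within each IAC group (a sketch; this call is not shown in the original workshop):
prop.table(table(dat2$gender_num, dat2$aline_flg, dnn = c("Gender", "IAC")), 2)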
A different summary exists for bivariate numeric data to present the strength of the relationship between two variables, called the correlation coefficient. There is a cor function in R, but when dealing with only two variables, it's easiest to use the cor.test function. Under the defaults, it computes the Pearson product-moment correlation and performs a hypothesis test to assess whether there is evidence that the correlation is not zero. Other forms of correlation are computed below, including Spearman's rho and Kendall's tau. These latter methods are useful when dealing with data which is not necessarily numeric but is ordered (e.g., Likert rankings on a 1-5 scale) or has outliers. Spearman's rho and Kendall's tau are rank-based methods, and have a certain degree of robustness to outliers in the data. None of these methods are robust to non-linear relationships, and it's very easy to miss a strong relationship between two variables if you rely on these methods in isolation.
cor.test(dat2$bun_first,dat2$creatinine_first)
##
## Pearson's product-moment correlation
##
## data: dat2$bun_first and dat2$creatinine_first
## t = 33.233, df = 1768, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5905441 0.6479479
## sample estimates:
## cor
## 0.6200752
cor.test(dat2$bun_first,dat2$creatinine_first,method="spearman")
## Warning in cor.test.default(dat2$bun_first, dat2$creatinine_first, method =
## "spearman"): Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: dat2$bun_first and dat2$creatinine_first
## S = 415893613, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.5499986
cor.test(dat2$bun_first,dat2$creatinine_first,method="kendall")
##
## Kendall's rank correlation tau
##
## data: dat2$bun_first and dat2$creatinine_first
## z = 24.807, p-value < 2.2e-16
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## 0.418814
We can produce a scatterplot of the same two variables:
with(dat2,plot(bun_first,creatinine_first))
We can see that there is indeed a positive correlation between the two variables, but the data show more variability at higher values of bun_first and creatinine_first. It's advisable to consider transformations of these two variables, and to be wary of using Pearson's correlation here.
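As a sketch of one such transformation (our suggestion, not part of the workshop): taking logarithms compresses the large values and often stabilizes the variance before computing Pearson's correlation. log1p() guards against zeros by computing log(1 + x).
cor.test(log1p(dat2$bun_first), log1p(dat2$creatinine_first))  # Pearson's correlation on log-transformed values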
Going beyond two dimensions can be a little tricky. Plotting on three-dimensional axes, while possible, is not ideal, and very few people can see in four dimensions.
What is possible is to use other aspects of the plot (e.g., size, color, shape, hue, transparency, location) to identify the features you would like to see. For instance, in the above plot, we can add color to identify those who died:
with(dat2,plot(bun_first,creatinine_first,col=day_28_flg,pch=19))
Sometimes numeric variables need to be broken down into categorical variables or factors. This can be done for a variety of reasons. There is a useful function called cut2 in the Hmisc package. We install it and use it below.
if(!("Hmisc" %in% installed.packages()[,1])) {
install.packages("Hmisc",repos="https://cloud.r-project.org")
}
library(Hmisc)
dat2$age.cat <- cut2(dat2$age,g=5)
table(dat2$age.cat)
##
## [15.2,33.6) [33.6,47.6) [47.6,60.3) [60.3,76.4) [76.4,99.1]
## 356 355 355 355 355
dat2$age.cat2 <- cut2(dat2$age,c(25,40,55,70,85))
table(dat2$age.cat2)
##
## [15.2,25.0) [25.0,40.0) [40.0,55.0) [55.0,70.0) [70.0,85.0) [85.0,99.1]
## 208 287 428 341 382 130
cut2 typically needs two arguments. The first is a numeric variable to convert into a factor, and the second is how to do the splitting. Specifying g=5 (as above for age.cat) breaks the numeric variable into 5 groups, with the cut points determined by attempting to make the groups as equally sized as possible. As you can see in this example, due to the odd number of patients, they are not perfectly even. The second approach requires passing the cut points. In the second example, we tell R to cut the data at 25, 40,…. This results in 6 groups for five cut points.
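Here is a tiny illustration (not from the workshop) of why equally sized groups are not always achievable: cut2 cannot split observations that share the same value across two groups.
x <- c(1, 1, 1, 1, 1, 2, 3, 4)
table(cut2(x, g = 2))  # the five tied 1s must land in the same group, so the group sizes are unequal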
- Create a new variable in the `dat2` data frame called `sofa.cat`, made up of four (approximately) equally sized groups for SOFA. Print the sample size in each group. Consider why the group sizes may differ significantly from each other.
- For each SOFA group, calculate the number of people who survived and died in the hospital and at 28 days (use the `hosp_exp_flg` and `day_28_flg` variables). Does the mortality increase or decrease as SOFA increases?
#It happens because of the number of observations at each value; cut2 does not split a tied group
#Mortality increases with SOFA
#prop.table(table(dat2$sofa.cat,dat2$day_28_flg,dnn=c("SOFA","ICU Mortality")),1) - with 1 it computes proportions over the rows; with 2, over the columns
#CreateTableOne(vars=c("sofa.cat"),strata="hosp_exp_flg",data=dat2,test = FALSE) - I tried this option myself
# Students: Insert your code here
dat2$sofa.cat<-cut2(dat2$sofa_first,g=4)
table(dat2$sofa.cat)
##
## [0, 5) [5, 7) 7 [8,17]
## 529 640 243 364
prop.table(table(dat2$sofa.cat,dat2$hosp_exp_flg,dnn=c("SOFA","Hosp. Mortality")),1)
## Hosp. Mortality
## SOFA 0 1
## [0, 5) 0.93950851 0.06049149
## [5, 7) 0.84218750 0.15781250
## 7 0.83127572 0.16872428
## [8,17] 0.80769231 0.19230769
prop.table(table(dat2$sofa.cat,dat2$day_28_flg,dnn=c("SOFA","ICU Mortality")),1)
## ICU Mortality
## SOFA 0 1
## [0, 5) 0.92060491 0.07939509
## [5, 7) 0.82187500 0.17812500
## 7 0.80246914 0.19753086
## [8,17] 0.78296703 0.21703297
prop.table(table(dat2$sofa.cat,dat2$day_28_flg,dnn=c("SOFA","ICU Mortality")),2)
## ICU Mortality
## SOFA 0 1
## [0, 5) 0.3261889 0.1484099
## [5, 7) 0.3523108 0.4028269
## 7 0.1306095 0.1696113
## [8,17] 0.1908908 0.2791519
CreateTableOne(vars=c("sofa.cat"),strata="hosp_exp_flg",data=dat2,test = FALSE)
## Stratified by hosp_exp_flg
## 0 1
## n 1532 244
## sofa.cat (%)
## [0, 5) 497 (32.4) 32 (13.1)
## [5, 7) 539 (35.2) 101 (41.4)
## 7 202 (13.2) 41 (16.8)
## [8,17] 294 (19.2) 70 (28.7)
CreateTableOne(vars=c("sofa.cat"),strata="day_28_flg",data=dat2,test=FALSE)
## Stratified by day_28_flg
## 0 1
## n 1493 283
## sofa.cat (%)
## [0, 5) 487 (32.6) 42 (14.8)
## [5, 7) 526 (35.2) 114 (40.3)
## 7 195 (13.1) 48 (17.0)
## [8,17] 285 (19.1) 79 (27.9)
Plotting discrete data can be a little tricky, but if done right can be very effective. For an example of why it’s difficult, let’s plot two discrete variables: gender_num and aline_flg.
plot(dat2$gender_num,dat2$aline_flg,xlab="Gender",ylab="IAC")
Because we have converted gender_num and aline_flg to factors, R gives us what is called a "factor plot". The area of the light grey region is proportional to the proportion of each gender who received an IAC. In this case, there is not a big difference between the genders.
This factor plot is more useful than if we had kept the original numeric class both of these variables had. For example, if we use the original dat data frame, both gender_num and aline_flg are numeric variables, and when we plot them, we don't end up with anything very useful:
plot(dat$gender_num,dat$aline_flg,xlab="Gender",ylab="IAC")
In this case we only have four different types of data points, and although we could try to jitter the values to get a slightly more useful plot, it’s unlikely that this would give us a good visual interpretation of the data.
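For completeness, here is roughly what that jittered version might look like (a sketch, not part of the original workshop; the jitter amount and transparency are arbitrary choices):
with(dat, plot(jitter(gender_num, amount = 0.1),
               jitter(aline_flg, amount = 0.1),
               xlab = "Gender", ylab = "IAC",
               pch = 19, col = rgb(0, 0, 0, 0.2)))  # translucent points reveal the relative density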
Sometimes the covariate may take on more than two levels. Here, we plot the in-hospital mortality rate by the different SOFA values, and put a smooth curve through the points. This covers a more advanced topic, and we don't expect you to understand the technical details of the code below. (The sapply function in R is a vectorized function from the apply family that lets you iterate over a list or vector without needing a for loop.)
plot(names(table(dat2$sofa_first)),
     sapply(split(dat2, dat2$sofa_first), function(x) { mean(x$hosp_exp_flg == 1, na.rm = TRUE) }),
     xlab = "SOFA", ylab = "In-Hospital Mortality",
     cex = log(as.numeric(table(dat2$sofa_first) + 1))/5)
lines(smooth.spline(dat2$sofa_first, dat2$hosp_exp_flg == 1), type = "l")
SOFA is a validated disease severity scale for the ICU, and generally correlates strongly with mortality. Here, while the mortality rate generally increases as SOFA increases, the smooth fit isn’t necessarily non-decreasing as SOFA values increase. We have added points roughly proportional to the sample size of each SOFA level, and you’ll see towards the high levels of SOFA, very few patients are observed, with the second highest score (16) having a 100% survival rate (but with only one patient).
For binary outcomes, it is often useful to plot the proportion of patients with the outcome (e.g., the mortality rate) at the different levels of a covariate of interest. Because sample size plays such an important role in the uncertainty associated with these estimated proportions, it is appropriate to include an estimate of our uncertainty via a confidence interval.
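As a point of reference (a sketch, not from the workshop), base R's prop.test computes a single proportion together with a 95% confidence interval; for example, the overall in-hospital mortality rate:
prop.test(sum(dat2$hosp_exp_flg == 1), nrow(dat2))  # proportion of deaths with a 95% CI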
In the MIMICbook package you installed above, there is a function called plot_prop_by_level which can plot the proportion of patients with an outcome by one or two factor variables. For instance, if we wished to plot the in-hospital mortality rate by the SOFA categories (sofa.cat) we defined above, we can do so using:
#Plots in-hospital mortality by the SOFA categories defined earlier. MIMICbook package
plot_prop_by_level(dat2,"sofa.cat","hosp_exp_flg")
## Warning: `mutate_()` was deprecated in dplyr 0.7.0.
## Please use `mutate()` instead.
## See vignette('programming') for more help
## Warning: `group_by_()` was deprecated in dplyr 0.7.0.
## Please use `group_by()` instead.
## See vignette('programming') for more help
plot_prop_by_level(dat2,"sofa.cat","day_28_flg")
Often it’s useful to consider more than one covariate at a time to assess confounding and effect modification. Here, if we wished to examine sofa.cat and gender_num at the same time, we add factor.var2="gender_num" to our previous use of plot_prop_by_level.
plot_prop_by_level(dat2,"sofa.cat","hosp_exp_flg",factor.var2="gender_num")
plot_prop_by_level(dat2,"sofa.cat","day_28_flg",factor.var2="aline_flg")
Here we see that in-hospital mortality is higher in women for the SOFA groups we have considered, suggesting that gender might be an important confounder for this outcome and covariate.
- Make a factor plot of the categories of SOFA we created (`sofa.cat`) and hospital mortality (`hosp_exp_flg`). Does the trend align with your expectations based on the non-graphical EDA performed earlier? (Yes, in-hospital mortality increases as SOFA increases.)
- Use `plot_prop_by_level` with `sofa.cat` as the covariate of interest and 28 day mortality (`day_28_flg`) as the outcome.
- Include the main covariate of interest for this study, `aline_flg`, as the second factor variable and extend part b). (As SOFA increases, mortality increases. Mortality is lower in patients who had an arterial line, except for the lowest SOFA group.)
- Repeat part c), but swap the IAC and SOFA arguments. Consider how the different depictions of the underlying data could better support different objectives. (As SOFA increases, the proportion of patients with an arterial line increases, and their mortality is lower than that of patients without one.)
- Create a new variable, `sofa.cat2`, with cut points at 3, 6, 9, 12. Repeat parts b) and c). (In the lowest SOFA category and in groups 2 and 3, patients with and without an arterial line look similar; from SOFA 9 upward, lower mortality is seen in patients with an arterial line.)
- Make a plot of the 28 day mortality outcome, `aline_flg` and `chf_flg`. Ignoring statistical significance (i.e., do not perform any formal testing), consider why this plot may suggest the complexity of any potential effect of an IAC on mortality. (chf_flg is congestive heart failure. From these plots it would appear that patients with an arterial line die more than those without one. RETHINK.)
plot(dat2$sofa.cat, dat2$hosp_exp_flg,xlab="SOFA category", ylab="Died")
plot_prop_by_level(dat2,"sofa.cat","day_28_flg")
plot_prop_by_level(dat2,"sofa.cat","day_28_flg",factor.var2="aline_flg")
plot_prop_by_level(dat2,"sofa.cat","aline_flg",factor.var2="day_28_flg")
dat2$sofa.cat2<-cut2(dat2$sofa_first,c(3,6,9,12))
table(dat2$sofa.cat2)
##
## [ 0, 3) [ 3, 6) [ 6, 9) [ 9,12) [12,17]
## 83 792 685 183 33
plot(dat2$sofa.cat2, dat2$hosp_exp_flg,xlab="SOFA category 2", ylab="Died")
plot_prop_by_level(dat2,"sofa.cat2","day_28_flg")
plot_prop_by_level(dat2,"aline_flg","day_28_flg")
plot_prop_by_level(dat2,"sofa.cat2","day_28_flg",factor.var2="aline_flg")
plot_prop_by_level(dat2,"day_28_flg","aline_flg",factor.var2="chf_flg")
plot(dat2$aline_flg,dat2$day_28_flg,xlab="Arterial catheter",ylab="28 day mortality")
Note: For those with a programming background, R indexes vectors starting from 1.
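For example:
x <- c(10, 20, 30)
x[1]  # returns 10, the first element
x[0]  # returns an empty vector, not the first element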
As previously discussed, odds ratios are very commonly used to communicate relative effect sizes for binary outcomes, particularly in observational data. Calculation is straightforward, but often misunderstood. We start with a 2 x 2 table. Below is the 2 x 2 table for in hospital mortality and having an arterial line. I’ve assigned it to a new variable called egtab.
egtab <- table(dat2$aline_flg,dat2$hosp_exp_flg,dnn=c("IAC","Hosp. Mort"))
egtab
## Hosp. Mort
## IAC 0 1
## 0 702 90
## 1 830 154
It's hard to interpret the raw counts, so we'll use prop.table to compute the proportions who died and lived by row (margin 1, IAC). (This shows the percentages by row.)
pegtab <- prop.table(egtab,1)
pegtab
## Hosp. Mort
## IAC 0 1
## 0 0.8863636 0.1136364
## 1 0.8434959 0.1565041
This computes the odds as patients with the event divided by patients without the event; that is, for IAC = 0, deaths / non-deaths, i.e., column 2 / column 1. It gives 0.1282051 for IAC = 0 and 0.1855422 for IAC = 1.
Odds are \(\frac{p}{1-p}\) where \(p\) is the proportion with the outcome (death) in a group of patients, which is in the second column. We can index the above table by column (tab[,idx] will retrieve column idx from the table [or matrix] tab) to compute the odds in each group.
Oddsegtab <-pegtab[,2]/pegtab[,1]
Oddsegtab
## 0 1
## 0.1282051 0.1855422
Now we have the odds of the outcome in those who got an IAC (1) and those who didn't (0). We need to pick a reference group. We'll calculate it both ways, but let's assume we want those without an IAC to be the reference. (I divide the odds of those who had an arterial line by the odds of those who did not, since the reference is those without one.)
Oddsegtab[2]/Oddsegtab[1]
## 1
## 1.447229
If we wanted those with an IAC to be the reference group:
Oddsegtab[1]/Oddsegtab[2]
## 0
## 0.6909757
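As a quick check (not shown in the workshop), the same odds ratio can be computed directly from the raw counts as the cross-product ratio of the 2 x 2 table:
(egtab[1, 1] * egtab[2, 2]) / (egtab[1, 2] * egtab[2, 1])  # (702*154)/(90*830) = 1.447229, matching the result above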
If we wanted to plot this information and include a confidence interval, we can use the plot_OR_by_level function from the MIMICbook package:
plot_OR_by_level(dat2,"aline_flg","hosp_exp_flg")
## Warning: `rename_()` was deprecated in dplyr 0.7.0.
## Please use `rename()` instead.
This by default includes an odds ratio of 1 indicating the reference group. To remove this point use the include.ref.group.effect argument:
plot_OR_by_level(dat2,"aline_flg","hosp_exp_flg",include.ref.group.effect = FALSE)
You can also look at more than one covariate at a time. For instance, looking at aline_flg and the gender_num variable:
plot_OR_by_level(dat2,"gender_num","hosp_exp_flg",factor.var2="aline_flg",include.ref.group.effect = TRUE)
Here we have computed the odds ratio for an IAC (vs no IAC) separately for men and women.
- Create a 2 x 2 table with `chf_flg` and the `day_28_flg` outcome and assign it to a variable called `tab22`.
- Compute the odds ratio for having CHF vs. not having CHF using this table.
#Cleaner code:
#tab22 <- table(dat2$chf_flg, dat2$day_28_flg, dnn=c("CHF","28d Mortality"))
#ptab22 <- prop.table(tab22, 1)
#oddstab22 <- ptab22[,2]/ptab22[,1]
#oddstab22[2]/oddstab22[1]
- Construct a plot of the odds ratios and 95% confidence intervals using the `plot_OR_by_level` function for CHF and 28 day mortality.
- Create a 4 x 2 table with the `sofa.cat` variable and the `day_28_flg` outcome and assign it to a variable called `tab42`.
- Pick and define a reference group for `sofa.cat`, and compute the odds ratio(s) for the other levels of `sofa.cat` using `day_28_flg` as your outcome.
#Cleaner code:
#tab42 <- table(dat2$sofa.cat, dat2$day_28_flg, dnn=c("SOFA","28d Mortality"))
#ptab42 <- prop.table(tab42, 1)
#oddstab42 <- ptab42[,2]/ptab42[,1]
#oddstab42[]/oddstab42[1]
- Construct a plot of the odds ratios and 95% confidence intervals using the `plot_OR_by_level` function for the SOFA categories and 28 day mortality. Make sure the reference groups are the same as in parts d) and e). Look into the `relevel` function in R or the `ref.group` argument in the `plot_OR_by_level` function.
- Construct a plot looking at the 28 day mortality outcome and the two variables we considered here, `sofa.cat` and `chf_flg`. Exchange the variables assigned to the `factor.var1` and `factor.var2` arguments, and consider briefly two reasons why you might prefer one plot over the other, and what you would conclude from your chosen plot.
#I prefer the plot stratified by the presence of congestive heart failure:
#plot_OR_by_level(dat2, "chf_flg", "day_28_flg", factor.var2="sofa.cat", include.ref.group.effect = TRUE)
#Patients with higher SOFA who have heart failure have lower mortality than those in the same group without CHF; they also have lower mortality than the lower SOFA categories.
tab22<-table(dat2$chf_flg,dat2$day_28_flg,dnn=c("Insuficiencia Cardiaca","Mort_28d"))
tab22
## Mort_28d
## Insuficiencia Cardiaca 0 1
## 0 1348 215
## 1 145 68
pegtab_22 <- prop.table(tab22,1)
pegtab_22
## Mort_28d
## Insuficiencia Cardiaca 0 1
## 0 0.8624440 0.1375560
## 1 0.6807512 0.3192488
Oddstab22 <-pegtab_22[,2]/pegtab_22[,1]
Oddstab22
## 0 1
## 0.1594955 0.4689655
Oddstab22[2]/Oddstab22[1]
## 1
## 2.940305
plot_OR_by_level(dat2,"chf_flg","day_28_flg",include.ref.group.effect = FALSE)
tab42<-table(dat2$sofa.cat,dat2$day_28_flg,dnn = c("Cat_SOFA","Mort_28d"))
ptab42<-prop.table(tab42,1)
ptab42
## Mort_28d
## Cat_SOFA 0 1
## [0, 5) 0.92060491 0.07939509
## [5, 7) 0.82187500 0.17812500
## 7 0.80246914 0.19753086
## [8,17] 0.78296703 0.21703297
oddstab42 <- ptab42[,2]/ptab42[,1]
oddstab42
## [0, 5) [5, 7) 7 [8,17]
## 0.0862423 0.2167300 0.2461538 0.2771930
oddstab42[2]/oddstab42[1]
## [5, 7)
## 2.513036
oddstab42[3]/oddstab42[1]
## 7
## 2.854212
oddstab42[4]/oddstab42[1]
## [8,17]
## 3.214119
oddstab42[]/oddstab42[1]
## [0, 5) [5, 7) 7 [8,17]
## 1.000000 2.513036 2.854212 3.214119
plot_OR_by_level(dat2,"sofa.cat","day_28_flg",include.ref.group.effect = TRUE)
plot_OR_by_level(dat2,"sofa.cat","day_28_flg",factor.var2= "chf_flg", include.ref.group.effect = TRUE)
plot_OR_by_level(dat2,"chf_flg","day_28_flg",factor.var2= "sofa.cat", include.ref.group.effect = TRUE)
Code from one of the exercises:
plot_prop_by_level(dat2,"sofa.cat2","day_28_flg",factor.var2="aline_flg")