Before beginning, please test whether the Rmd file will compile on your system by clicking the "Knit to HTML" button in RStudio. If it does not compile, seek out help in the forum or online. Chances are, you are missing a component necessary for rmarkdown to compile on your system. We would be very interested in: 1) what was wrong, 2) how you fixed it, and 3) what resources you used. This will help us improve future workshops.
## INSTALL THESE PACKAGES THAT IT ASKED FOR WHEN OPENING THE RMARKDOWN: install.packages("devtools") install.packages("Hmisc") install.packages("tableone"). The first error is thrown at line 73, in $data.frame(temp, gender_num, ...): replacement has 0 rows.
How This Workshop Works: There is some text, followed by relevant examples with code and output, followed by some student questions. The intention is for students to complete the student questions, which will provide answers to the multiple choice assessment for this subsection of the course that can be entered into EdX. The goal is to provide you with a solid foundation in some concepts related to R, rmarkdown, and exploratory data analysis.
To complete the workshop, fill in necessary code in the places indicated with # Students: Insert your code here. Use the output from “Knit to HTML” to answer the multiple choice questions in EdX to receive credit for completing the workshop.
EDA’s goal is to better understand the data and the process by which it was generated.
Within statistics, it is largely considered separate from inferential/confirmatory statistics (e.g., hypothesis testing, point and interval estimation, etc.), and EDA has its own diverse and important set of goals.
EDA was coined, developed, and advocated for by John Tukey. His book, entitled "Exploratory Data Analysis," was published in 1977 and is still in use today. It may seem like an oddity now, but it represented a fundamental change in how data science and statistics were done. He sums up EDA with this quote:
“It is important to understand what you CAN DO, before you learn to measure how WELL you seem to have DONE it.” – J. W. Tukey (1977)
If you don’t understand the data, it becomes difficult to know how to analyze it. Confirmatory and exploratory analyses are not superior or inferior to one another, rather they are complementary. With all the tools available to do both, ignoring one of them is inexcusable.
“Today, exploratory and confirmatory (analysis) – can – and should – proceed side by side.” – J. W. Tukey (1977)
There is often an urge, whether due to productivity pressures, laziness, or other factors, to plow through an analysis, using sophisticated techniques to find the results you are seeking. With the proliferation of large datasets, this can be quite ineffective, as it largely separates the analyst from the data, resulting in misunderstanding, or not understanding the data at all.
There is some evidence that cognitive disfluency (making it harder to learn) can lead to deeper learning. For analysts and data scientists this means slowing things down, often using basic (and sometimes tedious) methods to integrate the primary structure and relationships contained in the data, before pulling out the heavy machinery of modern data analysis.
See: Alter, A.L., 2013. The benefits of cognitive disfluency. Current Directions in Psychological Science, 22(6), pp.437-442.
All too often the success/failure of an analysis is determined from a single number, when in reality, understanding the data should be the goal.
For this workshop we will use data from a study that examined the effect of indwelling arterial catheters (aka IAC or aline) on 28 day mortality in intensive care unit (ICU) patients on mechanical ventilation during the first day of ICU admission. The data originates from MIMIC II v2.6. The data is ready for exploratory data analysis (the data extraction and cleaning have already been completed), and contained in a comma separated value (.csv) file generated after this process and stored on Physionet. Start by loading the data file from Physionet into a data frame called dat:
# If the following command does not work for you, please download the data file from Physionet at the URL below, run file.choose(), locate your downloaded CSV file, and replace read.csv(url) with read.csv("file_path") where file_path is the output from file.choose().
#The code below did not work for importing or downloading the dataset:
#url <- "https://github.com/criticaldata/hst953-edx/blob/master/data/aline_full_cohort_data.csv"
#dat <- read.csv(url)
# Public dataset has NA values for variables required to complete the workshop.
# Replace NA values with defaults (0) since the dataset is only intended for teaching.
dat <- read.csv("C:/Users/Pilar/Desktop/CURSO_MIT/hst953-edx-master/data/aline_full_cohort_data.csv", stringsAsFactors=TRUE)
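One likely reason the original download failed is that the GitHub URL above points to the HTML "blob" page rather than to the file itself, so read.csv receives HTML. A hedged alternative, left commented out because the raw-file path below is merely inferred from that URL and has not been verified here:
# GitHub "blob" URLs return a web page; the raw.githubusercontent.com form
# (inferred from the URL above; unverified) serves the CSV file directly:
# url <- "https://raw.githubusercontent.com/criticaldata/hst953-edx/master/data/aline_full_cohort_data.csv"
# dat <- read.csv(url, stringsAsFactors = TRUE)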
#Run a summary to look at the NAs
summary(dat)
## aline_flg icu_los_day hospital_los_day age
## Min. :0.0000 Min. : 0.500 Min. : 1.000 Min. :15.18
## 1st Qu.:0.0000 1st Qu.: 1.370 1st Qu.: 3.000 1st Qu.:38.25
## Median :1.0000 Median : 2.185 Median : 6.000 Median :53.68
## Mean :0.5541 Mean : 3.346 Mean : 8.111 Mean :54.38
## 3rd Qu.:1.0000 3rd Qu.: 4.003 3rd Qu.: 10.000 3rd Qu.:72.76
## Max. :1.0000 Max. :28.240 Max. :112.000 Max. :99.11
##
## gender_num weight_first bmi sapsi_first
## Min. :0.0000 Min. : 30.00 Min. :12.78 Min. : 3.00
## 1st Qu.:0.0000 1st Qu.: 65.40 1st Qu.:22.62 1st Qu.:11.00
## Median :1.0000 Median : 77.00 Median :26.32 Median :14.00
## Mean :0.5775 Mean : 80.08 Mean :27.83 Mean :14.14
## 3rd Qu.:1.0000 3rd Qu.: 90.00 3rd Qu.:30.80 3rd Qu.:17.00
## Max. :1.0000 Max. :257.60 Max. :98.80 Max. :32.00
## NA's :1 NA's :110 NA's :466 NA's :85
## sofa_first service_unit service_num day_icu_intime
## Min. : 0.000 FICU: 62 Min. :0.0000 Friday :234
## 1st Qu.: 4.000 MICU:732 1st Qu.:0.0000 Monday :260
## Median : 6.000 SICU:982 Median :1.0000 Saturday :278
## Mean : 5.821 Mean :0.5529 Sunday :230
## 3rd Qu.: 7.000 3rd Qu.:1.0000 Thursday :261
## Max. :17.000 Max. :1.0000 Tuesday :257
## NA's :6 Wednesday:256
## day_icu_intime_num hour_icu_intime hosp_exp_flg icu_exp_flg
## Min. :1.000 Min. : 0.00 Min. :0.0000 Min. :0.00000
## 1st Qu.:2.000 1st Qu.: 3.00 1st Qu.:0.0000 1st Qu.:0.00000
## Median :4.000 Median : 9.00 Median :0.0000 Median :0.00000
## Mean :4.054 Mean :10.59 Mean :0.1374 Mean :0.09572
## 3rd Qu.:6.000 3rd Qu.:19.00 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :7.000 Max. :23.00 Max. :1.0000 Max. :1.00000
##
## day_28_flg mort_day_censored censor_flg sepsis_flg
## Min. :0.0000 Min. : 0.0 Min. :0.0000 Min. :0
## 1st Qu.:0.0000 1st Qu.: 434.3 1st Qu.:0.0000 1st Qu.:0
## Median :0.0000 Median : 731.0 Median :1.0000 Median :0
## Mean :0.1593 Mean : 614.3 Mean :0.7202 Mean :0
## 3rd Qu.:0.0000 3rd Qu.: 731.0 3rd Qu.:1.0000 3rd Qu.:0
## Max. :1.0000 Max. :3094.1 Max. :1.0000 Max. :0
##
## chf_flg afib_flg renal_flg liver_flg
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.0000 Median :0.0000 Median :0.00000 Median :0.00000
## Mean :0.1199 Mean :0.1166 Mean :0.03378 Mean :0.05574
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.00000
##
## copd_flg cad_flg stroke_flg mal_flg
## Min. :0.0000 Min. :0.00000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.000 1st Qu.:0.0000
## Median :0.0000 Median :0.00000 Median :0.000 Median :0.0000
## Mean :0.0884 Mean :0.06926 Mean :0.125 Mean :0.1441
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.00000 Max. :1.000 Max. :1.0000
##
## resp_flg map_1st hr_1st temp_1st
## Min. :0.0000 Min. : 5.00 Min. : 30.00 Min. : 32.00
## 1st Qu.:0.0000 1st Qu.: 76.67 1st Qu.: 74.75 1st Qu.: 96.90
## Median :0.0000 Median : 87.00 Median : 87.00 Median : 98.10
## Mean :0.3181 Mean : 88.25 Mean : 87.91 Mean : 97.79
## 3rd Qu.:1.0000 3rd Qu.: 99.00 3rd Qu.:100.00 3rd Qu.: 99.30
## Max. :1.0000 Max. :195.00 Max. :158.00 Max. :104.80
## NA's :3
## spo2_1st abg_count wbc_first hgb_first
## Min. : 4.00 Min. : 0.000 Min. : 0.17 Min. : 2.00
## 1st Qu.: 98.00 1st Qu.: 1.000 1st Qu.: 8.20 1st Qu.:11.10
## Median :100.00 Median : 3.000 Median : 11.30 Median :12.70
## Mean : 98.43 Mean : 5.985 Mean : 12.32 Mean :12.55
## 3rd Qu.:100.00 3rd Qu.: 7.000 3rd Qu.: 15.00 3rd Qu.:14.12
## Max. :100.00 Max. :115.000 Max. :109.80 Max. :19.00
## NA's :8 NA's :8
## platelet_first sodium_first potassium_first tco2_first
## Min. : 7.0 Min. :105.0 Min. :1.900 Min. : 2.00
## 1st Qu.:182.0 1st Qu.:137.0 1st Qu.:3.600 1st Qu.:22.00
## Median :239.0 Median :140.0 Median :4.000 Median :24.00
## Mean :246.1 Mean :139.6 Mean :4.108 Mean :24.42
## 3rd Qu.:297.0 3rd Qu.:142.0 3rd Qu.:4.400 3rd Qu.:27.00
## Max. :988.0 Max. :165.0 Max. :9.800 Max. :62.00
## NA's :8 NA's :5 NA's :5 NA's :5
## chloride_first bun_first creatinine_first po2_first
## Min. : 78.0 Min. : 2.00 Min. : 0.000 Min. : 22.0
## 1st Qu.:101.0 1st Qu.: 11.00 1st Qu.: 0.700 1st Qu.:108.0
## Median :104.0 Median : 15.00 Median : 0.900 Median :195.0
## Mean :103.8 Mean : 19.28 Mean : 1.096 Mean :227.6
## 3rd Qu.:107.0 3rd Qu.: 22.00 3rd Qu.: 1.100 3rd Qu.:323.0
## Max. :133.0 Max. :139.00 Max. :18.300 Max. :634.0
## NA's :5 NA's :5 NA's :6 NA's :186
## pco2_first iv_day_1
## Min. : 8.00 Min. : 0.0
## 1st Qu.: 36.00 1st Qu.: 329.8
## Median : 41.00 Median : 1081.5
## Mean : 43.41 Mean : 1622.9
## 3rd Qu.: 47.00 3rd Qu.: 2493.9
## Max. :158.00 Max. :13910.0
## NA's :186 NA's :143
#sofa_first has 6 NAs and gender_num has 1 NA; both are converted to 0
#Afterwards I check with summary and see that there are no NAs left
dat$gender_num[is.na(dat$gender_num)] <- 0
dat$sofa_first[is.na(dat$sofa_first)] <- 0
summary(dat$gender_num)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 1.0000 0.5771 1.0000 1.0000
summary(dat$sofa_first)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 4.000 6.000 5.801 7.000 17.000
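A quicker check than scanning the full summary output (not part of the original workshop, just base R) is to count the remaining NAs directly:
colSums(is.na(dat[, c("gender_num", "sofa_first")]))  # both counts should now be 0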
It is very important that you understand how to load CSV files, as this is often the easiest way to load data when not using bigrquery. As noted in the code, the file.choose() function can be useful for identifying the path to the CSV file; you should run it in the console. (The file.choose command helps me browse the directory where I keep the dataset.)
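Here is a minimal sketch of that interactive workflow; it is commented out because file.choose() requires user input and so cannot run while knitting:
# path <- file.choose()                           # opens a file picker and returns the chosen path
# dat <- read.csv(path, stringsAsFactors = TRUE)  # read the CSV from that path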
One form of EDA is to provide numerical summaries of the dataset. This can have many purposes:
R has a very handy function which behaves differently depending on the type of data structure you apply it to: the summary function. It provides a very useful summary of data frames, in the form of five-number summaries plus the mean (min, Q1, median, mean, Q3, max) for numeric data, and counts for categorical (factor) data. summary also works on many other types of R objects, and when you don't know what to do with an R object (obj), it is often a good idea to try summary(obj).
summary(dat)
## aline_flg icu_los_day hospital_los_day age
## Min. :0.0000 Min. : 0.500 Min. : 1.000 Min. :15.18
## 1st Qu.:0.0000 1st Qu.: 1.370 1st Qu.: 3.000 1st Qu.:38.25
## Median :1.0000 Median : 2.185 Median : 6.000 Median :53.68
## Mean :0.5541 Mean : 3.346 Mean : 8.111 Mean :54.38
## 3rd Qu.:1.0000 3rd Qu.: 4.003 3rd Qu.: 10.000 3rd Qu.:72.76
## Max. :1.0000 Max. :28.240 Max. :112.000 Max. :99.11
##
## gender_num weight_first bmi sapsi_first
## Min. :0.0000 Min. : 30.00 Min. :12.78 Min. : 3.00
## 1st Qu.:0.0000 1st Qu.: 65.40 1st Qu.:22.62 1st Qu.:11.00
## Median :1.0000 Median : 77.00 Median :26.32 Median :14.00
## Mean :0.5771 Mean : 80.08 Mean :27.83 Mean :14.14
## 3rd Qu.:1.0000 3rd Qu.: 90.00 3rd Qu.:30.80 3rd Qu.:17.00
## Max. :1.0000 Max. :257.60 Max. :98.80 Max. :32.00
## NA's :110 NA's :466 NA's :85
## sofa_first service_unit service_num day_icu_intime
## Min. : 0.000 FICU: 62 Min. :0.0000 Friday :234
## 1st Qu.: 4.000 MICU:732 1st Qu.:0.0000 Monday :260
## Median : 6.000 SICU:982 Median :1.0000 Saturday :278
## Mean : 5.801 Mean :0.5529 Sunday :230
## 3rd Qu.: 7.000 3rd Qu.:1.0000 Thursday :261
## Max. :17.000 Max. :1.0000 Tuesday :257
## Wednesday:256
## day_icu_intime_num hour_icu_intime hosp_exp_flg icu_exp_flg
## Min. :1.000 Min. : 0.00 Min. :0.0000 Min. :0.00000
## 1st Qu.:2.000 1st Qu.: 3.00 1st Qu.:0.0000 1st Qu.:0.00000
## Median :4.000 Median : 9.00 Median :0.0000 Median :0.00000
## Mean :4.054 Mean :10.59 Mean :0.1374 Mean :0.09572
## 3rd Qu.:6.000 3rd Qu.:19.00 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :7.000 Max. :23.00 Max. :1.0000 Max. :1.00000
##
## day_28_flg mort_day_censored censor_flg sepsis_flg
## Min. :0.0000 Min. : 0.0 Min. :0.0000 Min. :0
## 1st Qu.:0.0000 1st Qu.: 434.3 1st Qu.:0.0000 1st Qu.:0
## Median :0.0000 Median : 731.0 Median :1.0000 Median :0
## Mean :0.1593 Mean : 614.3 Mean :0.7202 Mean :0
## 3rd Qu.:0.0000 3rd Qu.: 731.0 3rd Qu.:1.0000 3rd Qu.:0
## Max. :1.0000 Max. :3094.1 Max. :1.0000 Max. :0
##
## chf_flg afib_flg renal_flg liver_flg
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.0000 Median :0.0000 Median :0.00000 Median :0.00000
## Mean :0.1199 Mean :0.1166 Mean :0.03378 Mean :0.05574
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.00000
##
## copd_flg cad_flg stroke_flg mal_flg
## Min. :0.0000 Min. :0.00000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.000 1st Qu.:0.0000
## Median :0.0000 Median :0.00000 Median :0.000 Median :0.0000
## Mean :0.0884 Mean :0.06926 Mean :0.125 Mean :0.1441
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.00000 Max. :1.000 Max. :1.0000
##
## resp_flg map_1st hr_1st temp_1st
## Min. :0.0000 Min. : 5.00 Min. : 30.00 Min. : 32.00
## 1st Qu.:0.0000 1st Qu.: 76.67 1st Qu.: 74.75 1st Qu.: 96.90
## Median :0.0000 Median : 87.00 Median : 87.00 Median : 98.10
## Mean :0.3181 Mean : 88.25 Mean : 87.91 Mean : 97.79
## 3rd Qu.:1.0000 3rd Qu.: 99.00 3rd Qu.:100.00 3rd Qu.: 99.30
## Max. :1.0000 Max. :195.00 Max. :158.00 Max. :104.80
## NA's :3
## spo2_1st abg_count wbc_first hgb_first
## Min. : 4.00 Min. : 0.000 Min. : 0.17 Min. : 2.00
## 1st Qu.: 98.00 1st Qu.: 1.000 1st Qu.: 8.20 1st Qu.:11.10
## Median :100.00 Median : 3.000 Median : 11.30 Median :12.70
## Mean : 98.43 Mean : 5.985 Mean : 12.32 Mean :12.55
## 3rd Qu.:100.00 3rd Qu.: 7.000 3rd Qu.: 15.00 3rd Qu.:14.12
## Max. :100.00 Max. :115.000 Max. :109.80 Max. :19.00
## NA's :8 NA's :8
## platelet_first sodium_first potassium_first tco2_first
## Min. : 7.0 Min. :105.0 Min. :1.900 Min. : 2.00
## 1st Qu.:182.0 1st Qu.:137.0 1st Qu.:3.600 1st Qu.:22.00
## Median :239.0 Median :140.0 Median :4.000 Median :24.00
## Mean :246.1 Mean :139.6 Mean :4.108 Mean :24.42
## 3rd Qu.:297.0 3rd Qu.:142.0 3rd Qu.:4.400 3rd Qu.:27.00
## Max. :988.0 Max. :165.0 Max. :9.800 Max. :62.00
## NA's :8 NA's :5 NA's :5 NA's :5
## chloride_first bun_first creatinine_first po2_first
## Min. : 78.0 Min. : 2.00 Min. : 0.000 Min. : 22.0
## 1st Qu.:101.0 1st Qu.: 11.00 1st Qu.: 0.700 1st Qu.:108.0
## Median :104.0 Median : 15.00 Median : 0.900 Median :195.0
## Mean :103.8 Mean : 19.28 Mean : 1.096 Mean :227.6
## 3rd Qu.:107.0 3rd Qu.: 22.00 3rd Qu.: 1.100 3rd Qu.:323.0
## Max. :133.0 Max. :139.00 Max. :18.300 Max. :634.0
## NA's :5 NA's :5 NA's :6 NA's :186
## pco2_first iv_day_1
## Min. : 8.00 Min. : 0.0
## 1st Qu.: 36.00 1st Qu.: 329.8
## Median : 41.00 Median : 1081.5
## Mean : 43.41 Mean : 1622.9
## 3rd Qu.: 47.00 3rd Qu.: 2493.9
## Max. :158.00 Max. :13910.0
## NA's :186 NA's :143
As you can see, this function is very verbose, but produces some useful output. At this point, it's also a good idea to verify that the number of rows and columns is correct:
nrow(dat)
## [1] 1776
ncol(dat)
## [1] 46
This should be what you’re expecting (1776 and 46). If it’s not, this could indicate a loading error, or a problem with the data extraction.
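If you would rather have the document fail loudly than knit with the wrong data, a one-line guard (not part of the original workshop) does the trick:
stopifnot(nrow(dat) == 1776, ncol(dat) == 46)  # stops knitting if the dimensions are unexpected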
As you will note, many of the flg variables listed in the summary output above are constrained to 0 and 1. This is because they have a binary encoding (usually 1 if present, 0 if not). Although not necessary in this particular instance, it is sometimes useful to encode these types of variables as factors. The function convert.bin.fac will do this; we'll use it to create a new data frame called dat2, and then call the summary function on the new data frame. We first need to install and load two packages (devtools and MIMICbook).
#Install the MIMICbook package; devtools was installed earlier
#convert.bin.fac(data) converts the binary variables into factors, such as gender_num or aline_flg
if(!("devtools" %in% installed.packages()[,1])) {
install.packages("devtools",repos="https://cloud.r-project.org")
}
library(devtools)
## Loading required package: usethis
if(!("MIMICbook" %in% installed.packages()[,1])) {
install_github("jraffa/MIMICbook")
}
library(MIMICbook)
The MIMICbook package provides some useful functions written for the textbook that we will use throughout some of the workshops. It installs via GitHub.
dat2 <- convert.bin.fac(dat)
summary(dat2)
## aline_flg icu_los_day hospital_los_day age gender_num
## 0:792 Min. : 0.500 Min. : 1.000 Min. :15.18 0: 751
## 1:984 1st Qu.: 1.370 1st Qu.: 3.000 1st Qu.:38.25 1:1025
## Median : 2.185 Median : 6.000 Median :53.68
## Mean : 3.346 Mean : 8.111 Mean :54.38
## 3rd Qu.: 4.003 3rd Qu.: 10.000 3rd Qu.:72.76
## Max. :28.240 Max. :112.000 Max. :99.11
##
## weight_first bmi sapsi_first sofa_first service_unit
## Min. : 30.00 Min. :12.78 Min. : 3.00 Min. : 0.000 FICU: 62
## 1st Qu.: 65.40 1st Qu.:22.62 1st Qu.:11.00 1st Qu.: 4.000 MICU:732
## Median : 77.00 Median :26.32 Median :14.00 Median : 6.000 SICU:982
## Mean : 80.08 Mean :27.83 Mean :14.14 Mean : 5.801
## 3rd Qu.: 90.00 3rd Qu.:30.80 3rd Qu.:17.00 3rd Qu.: 7.000
## Max. :257.60 Max. :98.80 Max. :32.00 Max. :17.000
## NA's :110 NA's :466 NA's :85
## service_num day_icu_intime day_icu_intime_num hour_icu_intime hosp_exp_flg
## 0:794 Friday :234 Min. :1.000 Min. : 0.00 0:1532
## 1:982 Monday :260 1st Qu.:2.000 1st Qu.: 3.00 1: 244
## Saturday :278 Median :4.000 Median : 9.00
## Sunday :230 Mean :4.054 Mean :10.59
## Thursday :261 3rd Qu.:6.000 3rd Qu.:19.00
## Tuesday :257 Max. :7.000 Max. :23.00
## Wednesday:256
## icu_exp_flg day_28_flg mort_day_censored censor_flg sepsis_flg chf_flg
## 0:1606 0:1493 Min. : 0.0 0: 497 0:1776 0:1563
## 1: 170 1: 283 1st Qu.: 434.3 1:1279 1: 213
## Median : 731.0
## Mean : 614.3
## 3rd Qu.: 731.0
## Max. :3094.1
##
## afib_flg renal_flg liver_flg copd_flg cad_flg stroke_flg mal_flg resp_flg
## 0:1569 0:1716 0:1677 0:1619 0:1653 0:1554 0:1520 0:1211
## 1: 207 1: 60 1: 99 1: 157 1: 123 1: 222 1: 256 1: 565
##
##
##
##
##
## map_1st hr_1st temp_1st spo2_1st
## Min. : 5.00 Min. : 30.00 Min. : 32.00 Min. : 4.00
## 1st Qu.: 76.67 1st Qu.: 74.75 1st Qu.: 96.90 1st Qu.: 98.00
## Median : 87.00 Median : 87.00 Median : 98.10 Median :100.00
## Mean : 88.25 Mean : 87.91 Mean : 97.79 Mean : 98.43
## 3rd Qu.: 99.00 3rd Qu.:100.00 3rd Qu.: 99.30 3rd Qu.:100.00
## Max. :195.00 Max. :158.00 Max. :104.80 Max. :100.00
## NA's :3
## abg_count wbc_first hgb_first platelet_first
## Min. : 0.000 Min. : 0.17 Min. : 2.00 Min. : 7.0
## 1st Qu.: 1.000 1st Qu.: 8.20 1st Qu.:11.10 1st Qu.:182.0
## Median : 3.000 Median : 11.30 Median :12.70 Median :239.0
## Mean : 5.985 Mean : 12.32 Mean :12.55 Mean :246.1
## 3rd Qu.: 7.000 3rd Qu.: 15.00 3rd Qu.:14.12 3rd Qu.:297.0
## Max. :115.000 Max. :109.80 Max. :19.00 Max. :988.0
## NA's :8 NA's :8 NA's :8
## sodium_first potassium_first tco2_first chloride_first
## Min. :105.0 Min. :1.900 Min. : 2.00 Min. : 78.0
## 1st Qu.:137.0 1st Qu.:3.600 1st Qu.:22.00 1st Qu.:101.0
## Median :140.0 Median :4.000 Median :24.00 Median :104.0
## Mean :139.6 Mean :4.108 Mean :24.42 Mean :103.8
## 3rd Qu.:142.0 3rd Qu.:4.400 3rd Qu.:27.00 3rd Qu.:107.0
## Max. :165.0 Max. :9.800 Max. :62.00 Max. :133.0
## NA's :5 NA's :5 NA's :5 NA's :5
## bun_first creatinine_first po2_first pco2_first
## Min. : 2.00 Min. : 0.000 Min. : 22.0 Min. : 8.00
## 1st Qu.: 11.00 1st Qu.: 0.700 1st Qu.:108.0 1st Qu.: 36.00
## Median : 15.00 Median : 0.900 Median :195.0 Median : 41.00
## Mean : 19.28 Mean : 1.096 Mean :227.6 Mean : 43.41
## 3rd Qu.: 22.00 3rd Qu.: 1.100 3rd Qu.:323.0 3rd Qu.: 47.00
## Max. :139.00 Max. :18.300 Max. :634.0 Max. :158.00
## NA's :5 NA's :6 NA's :186 NA's :186
## iv_day_1
## Min. : 0.0
## 1st Qu.: 329.8
## Median : 1081.5
## Mean : 1622.9
## 3rd Qu.: 2493.9
## Max. :13910.0
## NA's :143
As you can now see, instead of means (which under the old encoding equate to the proportions of patients where the variable == 1), we now have counts of patients at each level of the variable. This is because R's summary function treats factors and numeric values differently.
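To see this distinction in isolation, here is a tiny illustration (not from the workshop) of summary() applied to the same values stored as numeric and as a factor:
x <- c(0, 1, 1, 0, 1)
summary(x)          # numeric: quartiles plus the mean (here, the mean is the proportion of 1s)
summary(factor(x))  # factor: a count for each level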
Often, you will want to report these summaries separately for different groups. For instance, is the mean or median age the same for those who received an IAC and those who didn't? A multi-purpose function called tapply can help us with this. (tapply computes measures of central tendency and position for those who did and did not have an arterial line.)
tapply(dat2$age,dat2$aline_flg,summary)
## $`0`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.18 34.80 50.85 53.02 72.11 97.46
##
## $`1`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.19 40.36 56.02 55.48 73.21 99.11
This function stratifies the first argument (age) by the second argument (aline_flg) and runs the third argument (summary) on each stratum. So, in our case, it runs the summary function on age for those who received an IAC (aline_flg = 1) and those who didn't (aline_flg = 0).
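The third argument can be any function that accepts a vector, so the same pattern yields other stratified statistics. A short sketch (not part of the workshop; extra arguments after the function name are passed through to it):
tapply(dat2$age, dat2$aline_flg, median)                 # median age by IAC group
tapply(dat2$age, dat2$aline_flg, quantile, probs = 0.9)  # 90th percentile of age by IAC group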
- Using the `dat2` data frame, run the summary function for `age`, `sofa_first`, and `service_unit` separately for those with an IAC, and those without.
- Run the summary function for `age`, `sofa_first`, and `service_unit` separately for those who died within 28 days, and those who survived.
# Students: Insert your code here
tapply(dat2$sofa_first,dat2$service_unit,summary)
## $FICU
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 4.000 6.000 5.661 7.000 14.000
##
## $MICU
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 4.000 6.000 5.777 7.000 17.000
##
## $SICU
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 4.000 6.000 5.828 7.000 16.000
tapply(dat2$age,dat2$day_28_flg,summary)
## $`0`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.18 34.83 49.46 50.78 66.62 99.11
##
## $`1`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 22.06 65.32 77.83 73.35 83.83 97.46
tapply(dat2$sofa_first,dat2$day_28_flg,summary)
## $`0`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 4.000 5.000 5.661 7.000 16.000
##
## $`1`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 5.000 6.000 6.541 8.000 17.000
tapply(dat2$service_unit,dat$day_28_flg,summary)
## $`0`
## FICU MICU SICU
## 59 605 829
##
## $`1`
## FICU MICU SICU
## 3 127 153
#Code I wrote myself for the exercise
tapply(dat2$service_unit,dat2$aline_flg,summary)
## $`0`
## FICU MICU SICU
## 24 480 288
##
## $`1`
## FICU MICU SICU
## 38 252 694
tapply(dat2$day_28_flg,dat2$aline_flg,summary)
## $`0`
## 0 1
## 679 113
##
## $`1`
## 0 1
## 814 170
The output from summary is very useful, but is generally not acceptable for formal research reports, let alone a published paper. There are several ways to produce a table with a more publication-ready layout. One is described in the textbook (Chapter 15). Another, which we will cover here, is an R package called tableone.
As some of you may know, “Table 1” often refers to the table presented in most medical manuscripts which contains information used to describe the cohort. This typically includes information such as average patient age, gender distribution, and other important demographic, clinical, and socioeconomic characteristics. We will cover briefly how to use the CreateTableOne function in this package to generate a table which is closer to being publication worthy.
The following code will install the tableone package and load it. #I already have it installed, so I don't need to run this code. #Documentation: https://www.rdocumentation.org/packages/tableone/versions/0.12.0/topics/CreateTableOne
if(!("tableone" %in% installed.packages()[,1])) {
install.packages("tableone",repos="https://cloud.r-project.org")
}
library(tableone)
Here is an example function call to CreateTableOne, which computes either the mean and standard deviation for numeric variables, or the count and percentage for factors. You specify which variables you want to include in the table:
CreateTableOne(vars=c("age","service_unit","aline_flg","day_28_flg"),data=dat2)
##
## Overall
## n 1776
## age (mean (SD)) 54.38 (21.06)
## service_unit (%)
## FICU 62 ( 3.5)
## MICU 732 (41.2)
## SICU 982 (55.3)
## aline_flg = 1 (%) 984 (55.4)
## day_28_flg = 1 (%) 283 (15.9)
We may want to break down these summaries further, as we did above with tapply, but we can do it in a single call to CreateTableOne by passing the strata parameter. strata specifies which variable to stratify (break down) the others by. For example, here is the same table as in the previous chunk, broken down by whether or not a patient received an IAC.
CreateTableOne(vars=c("age","service_unit","aline_flg","day_28_flg"),strata="aline_flg",data=dat2,test=FALSE)
## Stratified by aline_flg
## 0 1
## n 792 984
## age (mean (SD)) 53.02 (21.67) 55.48 (20.51)
## service_unit (%)
## FICU 24 ( 3.0) 38 ( 3.9)
## MICU 480 (60.6) 252 ( 25.6)
## SICU 288 (36.4) 694 ( 70.5)
## aline_flg = 1 (%) 0 ( 0.0) 984 (100.0)
## day_28_flg = 1 (%) 113 (14.3) 170 ( 17.3)
- Compute a table to summarize the variables considered before (`age`, `service_unit`, `aline_flg` and `day_28_flg`), in addition to `gender_num` and `chf_flg`, but now stratified by survival at 28 days (`day_28_flg`).
- Repeat part a), but now use the `dat` data frame instead of `dat2`. Note the differences in how variables that were previously recast as factors are summarized. (I see that it takes the binary categorical variables and summarizes them as mean and SD, because it treats them as 0s and 1s.)
# Students: Insert your code here
CreateTableOne(vars=c("age","service_unit","aline_flg","day_28_flg","gender_num","chf_flg"),strata="day_28_flg",data=dat2,test=FALSE)
## Stratified by day_28_flg
## 0 1
## n 1493 283
## age (mean (SD)) 50.78 (20.06) 73.35 (15.32)
## service_unit (%)
## FICU 59 ( 4.0) 3 ( 1.1)
## MICU 605 (40.5) 127 ( 44.9)
## SICU 829 (55.5) 153 ( 54.1)
## aline_flg = 1 (%) 814 (54.5) 170 ( 60.1)
## day_28_flg = 1 (%) 0 ( 0.0) 283 (100.0)
## gender_num = 1 (%) 886 (59.3) 139 ( 49.1)
## chf_flg = 1 (%) 145 ( 9.7) 68 ( 24.0)
CreateTableOne(vars=c("age","service_unit","aline_flg","day_28_flg","gender_num","chf_flg"),strata="day_28_flg",data=dat,test=FALSE)
## Stratified by day_28_flg
## 0 1
## n 1493 283
## age (mean (SD)) 50.78 (20.06) 73.35 (15.32)
## service_unit (%)
## FICU 59 ( 4.0) 3 ( 1.1)
## MICU 605 (40.5) 127 (44.9)
## SICU 829 (55.5) 153 (54.1)
## aline_flg (mean (SD)) 0.55 (0.50) 0.60 (0.49)
## day_28_flg (mean (SD)) 0.00 (0.00) 1.00 (0.00)
## gender_num (mean (SD)) 0.59 (0.49) 0.49 (0.50)
## chf_flg (mean (SD)) 0.10 (0.30) 0.24 (0.43)
As an aside, the following code may help with your projects, as it improves the presentation of the tables above. You will still need to update the column and row names manually, but this should paste nicely into Word or LaTeX!
if(!("dplyr" %in% installed.packages()[,1])) {
install.packages("dplyr")
}
library(dplyr)
CreateTableOne(vars=c("age","service_unit","aline_flg","day_28_flg"),strata="aline_flg",data=dat2,test=FALSE) %>% print(
printToggle = FALSE,
showAllLevels = TRUE,
cramVars = "kon"
) %>%
{data.frame(
variable_name = gsub(" ", "&nbsp;", rownames(.), fixed = TRUE), .,  # non-breaking spaces preserve indentation when rendered
row.names = NULL,
check.names = FALSE,
stringsAsFactors = FALSE)} %>%
knitr::kable()
| variable_name | level | 0 | 1 |
|---|---|---|---|
| n |  | 792 | 984 |
| age (mean (SD)) |  | 53.02 (21.67) | 55.48 (20.51) |
| service_unit (%) | FICU | 24 ( 3.0) | 38 ( 3.9) |
|  | MICU | 480 ( 60.6) | 252 ( 25.6) |
|  | SICU | 288 ( 36.4) | 694 ( 70.5) |
| aline_flg (%) | 0 | 792 (100.0) | 0 ( 0.0) |
|  | 1 | 0 ( 0.0) | 984 (100.0) |
| day_28_flg (%) | 0 | 679 ( 85.7) | 814 ( 82.7) |
|  | 1 | 113 ( 14.3) | 170 ( 17.3) |
Sometimes you may wish to display the relationships between two or more variables directly. For categorical variables this can be tricky. One common way to explore relationships between categorical variables is by producing cross-tabulated tables ("crosstabs" for short). This is mainly done via the table function, which can take several categorical variables and produce the number of patients who meet the criteria for each combination. For instance, looking at how an IAC was used in men and women:
table(dat2$gender_num,dat2$aline_flg,dnn=c("Gender","IAC"))
## IAC
## Gender 0 1
## 0 345 406
## 1 447 578
We can see that an IAC was used 578 times in men (gender_num = 1) and 406 times in women (gender_num = 0). The raw numbers are often difficult to compare, so the proportions are often more useful. Applying prop.table to our existing table, with the additional argument 1 (for by row; use 2 for columns), we get the proportion of men and women who had an IAC (56% vs. 54%).
prop.table(table(dat2$gender_num,dat2$aline_flg,dnn=c("Gender","IAC")),1)
## IAC
## Gender 0 1
## 0 0.4593875 0.5406125
## 1 0.4360976 0.5639024
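For comparison, passing 2 as the margin argument gives column proportions, i.e., the gender mix within each IAC group (a sketch; this call is not shown in the original workshop):
prop.table(table(dat2$gender_num, dat2$aline_flg, dnn = c("Gender", "IAC")), 2)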
A different summary exists for bivariate numeric data to present the strength of the relationship between two variables, called the correlation coefficient. There is a cor function in R, but when dealing with only two variables, it's easiest to use the cor.test function. Under the defaults, it computes the Pearson product-moment correlation and performs a hypothesis test to assess whether there is evidence that the correlation is not zero. Other forms of correlation are computed below, including Spearman's rho and Kendall's tau. These latter methods are useful when dealing with data which is not necessarily numeric but is ordered (e.g., Likert rankings on a 1-5 scale) or has outliers. Spearman's rho and Kendall's tau are rank-based methods, and have a certain degree of robustness to outliers in the data. None of these methods are robust to non-linear relationships, and it's very easy to miss a strong relationship between two variables if you rely on these methods in isolation.
cor.test(dat2$bun_first,dat2$creatinine_first)
##
## Pearson's product-moment correlation
##
## data: dat2$bun_first and dat2$creatinine_first
## t = 33.233, df = 1768, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5905441 0.6479479
## sample estimates:
## cor
## 0.6200752
cor.test(dat2$bun_first,dat2$creatinine_first,method="spearman")
## Warning in cor.test.default(dat2$bun_first, dat2$creatinine_first, method =
## "spearman"): Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: dat2$bun_first and dat2$creatinine_first
## S = 415893613, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.5499986
cor.test(dat2$bun_first,dat2$creatinine_first,method="kendall")
##
## Kendall's rank correlation tau
##
## data: dat2$bun_first and dat2$creatinine_first
## z = 24.807, p-value < 2.2e-16
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## 0.418814
We can produce a scatterplot of the same two variables:
with(dat2,plot(bun_first,creatinine_first))
We can see that there is indeed a positive correlation between the two variables, but the data show more variability at higher values of bun_first and creatinine_first. It's advisable to consider transformations of these two variables, and to be wary of using Pearson's correlation here.
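As a sketch of one such transformation (our suggestion, not part of the workshop): taking logarithms compresses the large values and often stabilizes the variance before computing Pearson's correlation. log1p() guards against zeros by computing log(1 + x).
cor.test(log1p(dat2$bun_first), log1p(dat2$creatinine_first))  # Pearson's correlation on log-transformed values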
Going beyond two dimensions can be a little tricky. Plotting on three-dimensional axes, while possible, is not ideal, and very few people can see in four dimensions.
What is possible is to use other aspects of the plot (e.g., size, color, shape, hue, transparency, location) to identify the features you would like to see. For instance, in the above plot, we can add color to identify those who died:
with(dat2,plot(bun_first,creatinine_first,col=day_28_flg,pch=19))
Sometimes numeric variables need to be broken down into categorical variables or factors. This can be done for a variety of reasons. There is a useful function called cut2 in the Hmisc package. We install it and use it below.
if(!("Hmisc" %in% installed.packages()[,1])) {
install.packages("Hmisc",repos="https://cloud.r-project.org")
}
library(Hmisc)
dat2$age.cat <- cut2(dat2$age,g=5)
table(dat2$age.cat)
##
## [15.2,33.6) [33.6,47.6) [47.6,60.3) [60.3,76.4) [76.4,99.1]
## 356 355 355 355 355
dat2$age.cat2 <- cut2(dat2$age,c(25,40,55,70,85))
table(dat2$age.cat2)
##
## [15.2,25.0) [25.0,40.0) [40.0,55.0) [55.0,70.0) [70.0,85.0) [85.0,99.1]
## 208 287 428 341 382 130
cut2 typically needs two arguments. The first is a numeric variable to convert into a factor, and the second is how to do the splitting. Specifying g=5 (as above for age.cat) breaks the numeric variable into 5 groups, with the cut points determined by attempting to make the groups as equally sized as possible. As you can see in this example, due to the odd number of patients, they are not perfectly even. The second approach requires passing the cut points. In the second example, we tell R to cut the data at 25, 40,…. This results in 6 groups for five cut points.
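Here is a tiny illustration (not from the workshop) of why equally sized groups are not always achievable: cut2 cannot split observations that share the same value across two groups.
x <- c(1, 1, 1, 1, 1, 2, 3, 4)
table(cut2(x, g = 2))  # the five tied 1s must land in the same group, so the group sizes are unequal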
- Create a new variable in the `dat2` data frame called `sofa.cat`, made up of four (approximately) equally sized groups for SOFA. Print the sample size in each group. Consider why the group sizes may differ significantly from each other.
- For each SOFA group, calculate the number of people who survived and died in the hospital and at 28 days (use the `hosp_exp_flg` and `day_28_flg` variables). Does the mortality increase or decrease as SOFA increases?
#It happens because of the number of observations at each value; cut2 does not split a tied group
#Mortality increases with SOFA
#prop.table(table(dat2$sofa.cat,dat2$day_28_flg,dnn=c("SOFA","ICU Mortality")),1) - with 1 it computes proportions over the rows; with 2, over the columns
#CreateTableOne(vars=c("sofa.cat"),strata="hosp_exp_flg",data=dat2,test = FALSE) - I tried this option myself
# Students: Insert your code here
dat2$sofa.cat<-cut2(dat2$sofa_first,g=4)
table(dat2$sofa.cat)
##
## [0, 5) [5, 7) 7 [8,17]
## 529 640 243 364
prop.table(table(dat2$sofa.cat,dat2$hosp_exp_flg,dnn=c("SOFA","Hosp. Mortality")),1)
## Hosp. Mortality
## SOFA 0 1
## [0, 5) 0.93950851 0.06049149
## [5, 7) 0.84218750 0.15781250
## 7 0.83127572 0.16872428
## [8,17] 0.80769231 0.19230769
prop.table(table(dat2$sofa.cat,dat2$day_28_flg,dnn=c("SOFA","ICU Mortality")),1)
## ICU Mortality
## SOFA 0 1
## [0, 5) 0.92060491 0.07939509
## [5, 7) 0.82187500 0.17812500
## 7 0.80246914 0.19753086
## [8,17] 0.78296703 0.21703297
prop.table(table(dat2$sofa.cat,dat2$day_28_flg,dnn=c("SOFA","ICU Mortality")),2)
## ICU Mortality
## SOFA 0 1
## [0, 5) 0.3261889 0.1484099
## [5, 7) 0.3523108 0.4028269
## 7 0.1306095 0.1696113
## [8,17] 0.1908908 0.2791519
CreateTableOne(vars=c("sofa.cat"),strata="hosp_exp_flg",data=dat2,test = FALSE)
## Stratified by hosp_exp_flg
## 0 1
## n 1532 244
## sofa.cat (%)
## [0, 5) 497 (32.4) 32 (13.1)
## [5, 7) 539 (35.2) 101 (41.4)
## 7 202 (13.2) 41 (16.8)
## [8,17] 294 (19.2) 70 (28.7)
CreateTableOne(vars=c("sofa.cat"),strata="day_28_flg",data=dat2,test=FALSE)
## Stratified by day_28_flg
## 0 1
## n 1493 283
## sofa.cat (%)
## [0, 5) 487 (32.6) 42 (14.8)
## [5, 7) 526 (35.2) 114 (40.3)
## 7 195 (13.1) 48 (17.0)
## [8,17] 285 (19.1) 79 (27.9)
Plotting discrete data can be a little tricky, but if done right can be very effective. For an example of why it’s difficult, let’s plot two discrete variables: gender_num and aline_flg.
plot(dat2$gender_num,dat2$aline_flg,xlab="Gender",ylab="IAC")
Because we have converted gender_num and aline_flg to factors, R gives us what is called a "factor plot". The area of the light grey region is proportional to the proportion of each gender who received an IAC. In this case, there is not a big difference between the genders.
This factor plot is more useful than if we had kept the original numeric class both of these variables had. For example, if we use the original dat data frame, both gender_num and aline_flg are numeric variables, and when we plot them, we don't end up with anything very useful:
plot(dat$gender_num,dat$aline_flg,xlab="Gender",ylab="IAC")
In this case we only have four different types of data points, and although we could try to jitter the values to get a slightly more useful plot, it’s unlikely that this would give us a good visual interpretation of the data.
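For completeness, here is roughly what that jittered version might look like (a sketch, not part of the original workshop; the jitter amount and transparency are arbitrary choices):
with(dat, plot(jitter(gender_num, amount = 0.1),
               jitter(aline_flg, amount = 0.1),
               xlab = "Gender", ylab = "IAC",
               pch = 19, col = rgb(0, 0, 0, 0.2)))  # translucent points reveal the relative density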
Sometimes the covariate may take on more than two levels. Here, we plot the in-hospital mortality rate by the different SOFA values, and put a smooth curve through the points. This covers a more advanced topic, and we don't expect you to understand the technical details of the code below. (The sapply function in R is a vectorized function from the apply family that lets you iterate over a list or vector without needing a for loop.)
plot(names(table(dat2$sofa_first)),
     sapply(split(dat2, dat2$sofa_first), function(x) { mean(x$hosp_exp_flg == 1, na.rm = TRUE) }),
     xlab = "SOFA", ylab = "In-Hospital Mortality",
     cex = log(as.numeric(table(dat2$sofa_first) + 1))/5)
lines(smooth.spline(dat2$sofa_first, dat2$hosp_exp_flg == 1), type = "l")
SOFA is a validated disease severity scale for the ICU, and generally correlates strongly with mortality. Here, while the mortality rate generally increases as SOFA increases, the smooth fit isn’t necessarily non-decreasing as SOFA values increase. We have added points roughly proportional to the sample size of each SOFA level, and you’ll see towards the high levels of SOFA, very few patients are observed, with the second highest score (16) having a 100% survival rate (but with only one patient).
For binary outcomes, it is often useful to plot the proportion of patients with the outcome (e.g., the mortality rate) at the different levels of a covariate of interest. Because sample size plays such an important role in the uncertainty associated with these estimated proportions, it is appropriate to include an estimate of our uncertainty via a confidence interval.
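As a point of reference (a sketch, not from the workshop), base R's prop.test computes a single proportion together with a 95% confidence interval; for example, the overall in-hospital mortality rate:
prop.test(sum(dat2$hosp_exp_flg == 1), nrow(dat2))  # proportion of deaths with a 95% CI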
In the MIMICbook package you installed above, there is a function called plot_prop_by_level which can plot the proportion of patients with an outcome by one or two factor variables. For instance, if we wished to plot the in-hospital mortality rate by the SOFA categories (sofa.cat) we defined above, we can do so using:
#Plots in-hospital mortality by the SOFA categories defined earlier. MIMICbook package
plot_prop_by_level(dat2,"sofa.cat","hosp_exp_flg")
## Warning: `mutate_()` was deprecated in dplyr 0.7.0.
## Please use `mutate()` instead.
## See vignette('programming') for more help
## Warning: `group_by_()` was deprecated in dplyr 0.7.0.
## Please use `group_by()` instead.
## See vignette('programming') for more help
plot_prop_by_level(dat2,"sofa.cat","day_28_flg")
Often it’s useful to consider more than one covariate at a time to assess confounding and effect modification. Here, if we wished to examine sofa.cat and gender_num at the same time, we add factor.var2="gender_num" to our previous use of plot_prop_by_level.
plot_prop_by_level(dat2,"sofa.cat","hosp_exp_flg",factor.var2="gender_num")
plot_prop_by_level(dat2,"sofa.cat","day_28_flg",factor.var2="aline_flg")
Here we see that in-hospital mortality is higher in women for the SOFA groups we have considered, suggesting that gender might be an important confounder for this outcome and covariate.
- Make a factor plot of the categories of SOFA we created (`sofa.cat`) and hospital mortality (`hosp_exp_flg`). Does the trend align with your expectations based on the non-graphical EDA performed earlier? (Yes, in-hospital mortality increases as SOFA increases.)
- Use `plot_prop_by_level` with `sofa.cat` as the covariate of interest and 28 day mortality (`day_28_flg`) as the outcome.
- Include the main covariate of interest for this study, `aline_flg`, as the second factor variable and extend part b). (As SOFA increases, mortality increases. Mortality is lower in patients who had an arterial line, except for the lowest SOFA group.)
- Repeat part c), but swap the IAC and SOFA arguments. Consider how the different depictions of the underlying data could better support different objectives. (As SOFA increases, the proportion of patients with an arterial line increases, and their mortality is lower than that of patients without one.)
- Create a new variable, `sofa.cat2`, with cut points at 3, 6, 9, 12. Repeat parts b) and c). (In the lowest SOFA category and in groups 2 and 3, patients with and without an arterial line look similar; from SOFA 9 upward, lower mortality is seen in patients with an arterial line.)
- Make a plot of the 28 day mortality outcome, `aline_flg` and `chf_flg`. Ignoring statistical significance (i.e., do not perform any formal testing), consider why this plot may suggest the complexity of any potential effect of an IAC on mortality. (chf_flg is congestive heart failure. From these plots it would appear that patients with an arterial line die more than those without one. RETHINK.)
plot(dat2$sofa.cat, dat2$hosp_exp_flg,xlab="SOFA category", ylab="Died")
plot_prop_by_level(dat2,"sofa.cat","day_28_flg")
plot_prop_by_level(dat2,"sofa.cat","day_28_flg",factor.var2="aline_flg")
plot_prop_by_level(dat2,"sofa.cat","aline_flg",factor.var2="day_28_flg")
dat2$sofa.cat2<-cut2(dat2$sofa_first,c(3,6,9,12))
table(dat2$sofa.cat2)
##
## [ 0, 3) [ 3, 6) [ 6, 9) [ 9,12) [12,17]
## 83 792 685 183 33
plot(dat2$sofa.cat2, dat2$hosp_exp_flg,xlab="SOFA category 2", ylab="Died")
plot_prop_by_level(dat2,"sofa.cat2","day_28_flg")
plot_prop_by_level(dat2,"aline_flg","day_28_flg")
plot_prop_by_level(dat2,"sofa.cat2","day_28_flg",factor.var2="aline_flg")
plot_prop_by_level(dat2,"day_28_flg","aline_flg",factor.var2="chf_flg")
plot(dat2$aline_flg,dat2$day_28_flg,xlab="Arterial catheter",ylab="28 day mortality")
Note: For those with a programming background, R indexes vectors starting from 1.
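For example:
x <- c(10, 20, 30)
x[1]  # returns 10, the first element
x[0]  # returns an empty vector, not the first element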
As previously discussed, odds ratios are very commonly used to communicate relative effect sizes for binary outcomes, particularly in observational data. Calculation is straightforward, but often misunderstood. We start with a 2 x 2 table. Below is the 2 x 2 table for in hospital mortality and having an arterial line. I’ve assigned it to a new variable called egtab.
egtab <- table(dat2$aline_flg,dat2$hosp_exp_flg,dnn=c("IAC","Hosp. Mort"))
egtab
## Hosp. Mort
## IAC 0 1
## 0 702 90
## 1 830 154
It's hard to interpret the raw counts, so we'll use prop.table to compute the proportions who died and lived by row (margin 1, IAC). (This shows the percentages by row.)
pegtab <- prop.table(egtab,1)
pegtab
## Hosp. Mort
## IAC 0 1
## 0 0.8863636 0.1136364
## 1 0.8434959 0.1565041
This computes the odds as patients with the event divided by patients without the event; that is, for IAC = 0, deaths / non-deaths, i.e., column 2 / column 1. It gives 0.1282051 for IAC = 0 and 0.1855422 for IAC = 1.
Odds are \(\frac{p}{1-p}\) where \(p\) is the proportion with the outcome (death) in a group of patients, which is in the second column. We can index the above table by column (tab[,idx] will retrieve column idx from the table [or matrix] tab) to compute the odds in each group.
Oddsegtab <-pegtab[,2]/pegtab[,1]
Oddsegtab
## 0 1
## 0.1282051 0.1855422
Now we have the odds of the outcome in those who got an IAC (1) and those who didn't (0). We need to pick a reference group. We'll calculate it both ways, but let's assume we want those without an IAC to be the reference. (I divide the odds of those who had an arterial line by the odds of those who did not, since the reference is those without one.)
Oddsegtab[2]/Oddsegtab[1]
## 1
## 1.447229
If we wanted those with an IAC to be the reference group:
Oddsegtab[1]/Oddsegtab[2]
## 0
## 0.6909757
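As a quick check (not shown in the workshop), the same odds ratio can be computed directly from the raw counts as the cross-product ratio of the 2 x 2 table:
(egtab[1, 1] * egtab[2, 2]) / (egtab[1, 2] * egtab[2, 1])  # (702*154)/(90*830) = 1.447229, matching the result above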
If we wanted to plot this information and include a confidence interval, we can use the plot_OR_by_level function from the MIMICbook package:
plot_OR_by_level(dat2,"aline_flg","hosp_exp_flg")
## Warning: `rename_()` was deprecated in dplyr 0.7.0.
## Please use `rename()` instead.
This by default includes an odds ratio of 1 indicating the reference group. To remove this point use the include.ref.group.effect argument:
plot_OR_by_level(dat2,"aline_flg","hosp_exp_flg",include.ref.group.effect = FALSE)
You can also look at more than one covariate at a time. For instance, looking at aline_flg and the gender_num variable:
plot_OR_by_level(dat2,"gender_num","hosp_exp_flg",factor.var2="aline_flg",include.ref.group.effect = TRUE)
Here we have computed the odds ratio for an IAC (vs no IAC) separately for men and women.
- Create a 2 x 2 table with `chf_flg` and the `day_28_flg` outcome and assign it to a variable called `tab22`.
- Compute the odds ratio for having CHF vs. not having CHF using this table.
#Cleaner code:
#tab22 <- table(dat2$chf_flg, dat2$day_28_flg, dnn=c("CHF","28d Mortality"))
#ptab22 <- prop.table(tab22, 1)
#oddstab22 <- ptab22[,2]/ptab22[,1]
#oddstab22[2]/oddstab22[1]
- Construct a plot of the odds ratios and 95% confidence intervals using the `plot_OR_by_level` function for CHF and 28 day mortality.
- Create a 4 x 2 table with the `sofa.cat` variable and the `day_28_flg` outcome and assign it to a variable called `tab42`.
- Pick and define a reference group for `sofa.cat`, and compute the odds ratio(s) for the other levels of `sofa.cat` using `day_28_flg` as your outcome.
#Cleaner code:
#tab42 <- table(dat2$sofa.cat, dat2$day_28_flg, dnn=c("SOFA","28d Mortality"))
#ptab42 <- prop.table(tab42, 1)
#oddstab42 <- ptab42[,2]/ptab42[,1]
#oddstab42[]/oddstab42[1]
- Construct a plot of the odds ratios and 95% confidence intervals using the `plot_OR_by_level` function for the SOFA categories and 28 day mortality. Make sure the reference groups are the same as in parts d) and e). Look into the `relevel` function in R or the `ref.group` argument in the `plot_OR_by_level` function.
- Construct a plot looking at the 28 day mortality outcome and the two variables we considered here, `sofa.cat` and `chf_flg`. Exchange the variables assigned to the `factor.var1` and `factor.var2` arguments, and consider briefly two reasons why you might prefer one plot over the other, and what you would conclude from your chosen plot.
#I prefer the plot stratified by the presence of congestive heart failure:
#plot_OR_by_level(dat2, "chf_flg", "day_28_flg", factor.var2="sofa.cat", include.ref.group.effect = TRUE)
#Patients with higher SOFA who have heart failure have lower mortality than those in the same group without CHF; they also have lower mortality than the lower SOFA categories.
tab22<-table(dat2$chf_flg,dat2$day_28_flg,dnn=c("Insuficiencia Cardiaca","Mort_28d"))
tab22
## Mort_28d
## Insuficiencia Cardiaca 0 1
## 0 1348 215
## 1 145 68
pegtab_22 <- prop.table(tab22,1)
pegtab_22
## Mort_28d
## Insuficiencia Cardiaca 0 1
## 0 0.8624440 0.1375560
## 1 0.6807512 0.3192488
Oddstab22 <-pegtab_22[,2]/pegtab_22[,1]
Oddstab22
## 0 1
## 0.1594955 0.4689655
Oddstab22[2]/Oddstab22[1]
## 1
## 2.940305
plot_OR_by_level(dat2,"chf_flg","day_28_flg",include.ref.group.effect = FALSE)
tab42<-table(dat2$sofa.cat,dat2$day_28_flg,dnn = c("Cat_SOFA","Mort_28d"))
ptab42<-prop.table(tab42,1)
ptab42
## Mort_28d
## Cat_SOFA 0 1
## [0, 5) 0.92060491 0.07939509
## [5, 7) 0.82187500 0.17812500
## 7 0.80246914 0.19753086
## [8,17] 0.78296703 0.21703297
oddstab42 <- ptab42[,2]/ptab42[,1]
oddstab42
## [0, 5) [5, 7) 7 [8,17]
## 0.0862423 0.2167300 0.2461538 0.2771930
oddstab42[2]/oddstab42[1]
## [5, 7)
## 2.513036
oddstab42[3]/oddstab42[1]
## 7
## 2.854212
oddstab42[4]/oddstab42[1]
## [8,17]
## 3.214119
oddstab42[]/oddstab42[1]
## [0, 5) [5, 7) 7 [8,17]
## 1.000000 2.513036 2.854212 3.214119
plot_OR_by_level(dat2,"sofa.cat","day_28_flg",include.ref.group.effect = TRUE)
plot_OR_by_level(dat2,"sofa.cat","day_28_flg",factor.var2= "chf_flg", include.ref.group.effect = TRUE)
plot_OR_by_level(dat2,"chf_flg","day_28_flg",factor.var2= "sofa.cat", include.ref.group.effect = TRUE)
Code from one of the exercises:
plot_prop_by_level(dat2,"sofa.cat2","day_28_flg",factor.var2="aline_flg")