Ok, let’s get started
You will only need to do this once. I recommend running these lines in the console instead of the script editor.
# install.packages("dplyr")
# install.packages("ggplot2")
# install.packages("here")
# install.packages("readr")
# install.packages("GGally")
# install.packages("gapminder")
library("dplyr") # for manipulating data frames
library("ggplot2") # for data viz
library("here") # for simplifying folder references
library("readr") # reading and writing data files
library("GGally") # extension to ggplot2
library("gapminder") # gapminder dataset
library("janitor") # for cleaning up dataframes
R comes with some built-in datasets. Let’s start by taking a look at a few of them.
mtcars %>%
# don't use the following lines; they're specifically for Rmarkdown files
# (kable() is from the knitr package, kable_styling() from kableExtra)
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
| Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
| Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
| Valiant | 18.1 | 6 | 225.0 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
| Duster 360 | 14.3 | 8 | 360.0 | 245 | 3.21 | 3.570 | 15.84 | 0 | 0 | 3 | 4 |
| Merc 240D | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.190 | 20.00 | 1 | 0 | 4 | 2 |
| Merc 230 | 22.8 | 4 | 140.8 | 95 | 3.92 | 3.150 | 22.90 | 1 | 0 | 4 | 2 |
| Merc 280 | 19.2 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.30 | 1 | 0 | 4 | 4 |
| Merc 280C | 17.8 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.90 | 1 | 0 | 4 | 4 |
| Merc 450SE | 16.4 | 8 | 275.8 | 180 | 3.07 | 4.070 | 17.40 | 0 | 0 | 3 | 3 |
| Merc 450SL | 17.3 | 8 | 275.8 | 180 | 3.07 | 3.730 | 17.60 | 0 | 0 | 3 | 3 |
| Merc 450SLC | 15.2 | 8 | 275.8 | 180 | 3.07 | 3.780 | 18.00 | 0 | 0 | 3 | 3 |
| Cadillac Fleetwood | 10.4 | 8 | 472.0 | 205 | 2.93 | 5.250 | 17.98 | 0 | 0 | 3 | 4 |
| Lincoln Continental | 10.4 | 8 | 460.0 | 215 | 3.00 | 5.424 | 17.82 | 0 | 0 | 3 | 4 |
| Chrysler Imperial | 14.7 | 8 | 440.0 | 230 | 3.23 | 5.345 | 17.42 | 0 | 0 | 3 | 4 |
| Fiat 128 | 32.4 | 4 | 78.7 | 66 | 4.08 | 2.200 | 19.47 | 1 | 1 | 4 | 1 |
| Honda Civic | 30.4 | 4 | 75.7 | 52 | 4.93 | 1.615 | 18.52 | 1 | 1 | 4 | 2 |
| Toyota Corolla | 33.9 | 4 | 71.1 | 65 | 4.22 | 1.835 | 19.90 | 1 | 1 | 4 | 1 |
| Toyota Corona | 21.5 | 4 | 120.1 | 97 | 3.70 | 2.465 | 20.01 | 1 | 0 | 3 | 1 |
| Dodge Challenger | 15.5 | 8 | 318.0 | 150 | 2.76 | 3.520 | 16.87 | 0 | 0 | 3 | 2 |
| AMC Javelin | 15.2 | 8 | 304.0 | 150 | 3.15 | 3.435 | 17.30 | 0 | 0 | 3 | 2 |
| Camaro Z28 | 13.3 | 8 | 350.0 | 245 | 3.73 | 3.840 | 15.41 | 0 | 0 | 3 | 4 |
| Pontiac Firebird | 19.2 | 8 | 400.0 | 175 | 3.08 | 3.845 | 17.05 | 0 | 0 | 3 | 2 |
| Fiat X1-9 | 27.3 | 4 | 79.0 | 66 | 4.08 | 1.935 | 18.90 | 1 | 1 | 4 | 1 |
| Porsche 914-2 | 26.0 | 4 | 120.3 | 91 | 4.43 | 2.140 | 16.70 | 0 | 1 | 5 | 2 |
| Lotus Europa | 30.4 | 4 | 95.1 | 113 | 3.77 | 1.513 | 16.90 | 1 | 1 | 5 | 2 |
| Ford Pantera L | 15.8 | 8 | 351.0 | 264 | 4.22 | 3.170 | 14.50 | 0 | 1 | 5 | 4 |
| Ferrari Dino | 19.7 | 6 | 145.0 | 175 | 3.62 | 2.770 | 15.50 | 0 | 1 | 5 | 6 |
| Maserati Bora | 15.0 | 8 | 301.0 | 335 | 3.54 | 3.570 | 14.60 | 0 | 1 | 5 | 8 |
| Volvo 142E | 21.4 | 4 | 121.0 | 109 | 4.11 | 2.780 | 18.60 | 1 | 1 | 4 | 2 |
These functions are good for getting a quick sense of a dataset.
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
head(mtcars) # by default shows the top 6 rows
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
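Since we've already loaded dplyr, its glimpse() function is a handy alternative to str():
glimpse(mtcars) # one row per variable; shows as many values as fit on screen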
Try doing the same with some of R's other built-in datasets.
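For example (these three ship with base R; run data() for the full list):
str(iris) # measurements of 150 iris flowers
summary(airquality) # daily New York air quality readings, 1973
head(ToothGrowth) # vitamin C and tooth growth in guinea pigs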
We’re going to save mtcars as a csv file somewhere in our project folder. Note that with the here() package, we don’t have to specify a full file path - just a path relative to the project root. This means you can copy the whole project to another location on your computer, or to a completely different computer, and all file references should still work.
# Where is the root folder right now?
here() # result will be different for everyone
## [1] "D:/r-for-data-analysis-and-modelling"
# save to the sub-folder "results" >> "output from src"
# note: mtcars stores the car names as row names, which write_csv
# drops, so we first move them into a proper column called "model"
write_csv(cbind(model = rownames(mtcars), mtcars), # save WHAT?
          here("results", # save WHERE?
               "output from src",
               "mtcars-original.csv"))
Take a look at the basic data manipulation vocabulary
We’re going to do the same steps in R and Excel. Note that once you do the steps in R, you will never have to do them again - your work is “conserved”. When the data changes, you don’t have to redo the work: just re-run the code, and it will repeat all the steps you specified. The same is not true in Excel.
This may seem trivial when you’re thinking of doing analysis one file at a time in a “manual” fashion. However, to multiply the productivity and effectiveness of a data science team, it is necessary to automate these tasks so that they can be performed on dozens or even hundreds of files simultaneously without human intervention.
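As a small taste of what that automation looks like, here is a sketch of batch processing. The folder "monthly-extracts" is hypothetical, and we assume every file in it has the same columns, but the pattern - list the files, then apply one function to each - is the general recipe.
# list every csv file in a (hypothetical) folder of monthly extracts
files <- list.files(here("data", "monthly-extracts"),
                    pattern = "\\.csv$",
                    full.names = TRUE)
# read each file and apply the same steps to every one of them
all.summaries <- lapply(files, function(f) {
  read_csv(f) %>%
    summarise(n.rows = n()) # swap in your real analysis steps here
})
Now, back to mtcars: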
df1.mtcars.modified <- mtcars %>%
# Notes on notation:
# "x <- 10" means "save the number 10 with object name x"
# "%>%" translates as "then". That is, first do x %>% do y
# select certain COLUMNS
select(cyl,
mpg,
disp) %>%
# filter out certain ROWS
filter(mpg <= 30) %>% # keep rows where mpg <= 30; let's say the higher values are outliers
rename(cylinders = cyl,
miles.per.gallon = mpg,
displacement = disp) %>%
# let's say one of the entries of mpg was a known data error:
mutate(miles.per.gallon = case_when(
miles.per.gallon == 15 ~ 15.5, # "~" is like THEN in a SQL CASE statement
TRUE ~ miles.per.gallon
)) %>%
# Wait, what kind of savages use units like miles and gallons?
# let's create a new column with proper civilized units:
mutate(kilometres.per.litre = (1.609*miles.per.gallon)/3.785) %>% # 1 mile = 1.609 km; 1 US gallon = 3.785 litres
group_by(cylinders) %>%
summarise(avg.kpl = mean(kilometres.per.litre) %>% round(1),
avg.disp = mean(displacement) %>% round(1))
# show the output
df1.mtcars.modified %>%
kable() %>% # use this to print
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
| cylinders | avg.kpl | avg.disp |
|---|---|---|
| 4 | 10.1 | 119.4 |
| 6 | 8.4 | 183.3 |
| 8 | 6.4 | 353.1 |
write_csv(df1.mtcars.modified,
here("results",
"output from src",
"mtcars-summarized.csv"))
Although we already have mtcars built-in, let’s practice reading in the csv file we created.
df2.mtcars.original <- read_csv(here("results",
"output from src",
"mtcars-original.csv"))
p1.overall.mean <- df2.mtcars.original %>%
# let's add in the kpl column again:
mutate(kpl = (1.609*mpg)/3.785) %>%
# let's recode cyl as a discrete variable (aka "factor"):
mutate(cyl = factor(cyl,
levels = c(4, 6, 8))) %>%
# now start using ggplot functions:
ggplot(aes(x = disp, # specify x and y axis
y = kpl)) +
# geom_point creates a scatterplot
geom_point(aes(colour = cyl,
size = hp)) +
# overall mean: "y ~ 1" fits an intercept-only model,
# i.e. a horizontal line at the mean of kpl
stat_smooth(method = "lm",
formula = y ~ 1) +
theme_classic()
# print:
p1.overall.mean
p2.group.means <- df2.mtcars.original %>%
# let's add in the kpl column again:
mutate(kpl = (1.609*mpg)/3.785) %>%
# let's recode cyl as a discrete variable:
mutate(cyl = factor(cyl,
levels = c(4, 6, 8))) %>%
# now start using ggplot functions:
ggplot(aes(x = disp, # specify x and y axis
y = kpl)) +
geom_point(aes(colour = cyl,
size = hp)) +
# examine three different ways of summarizing behaviour within
# each level of cyl:
# mean by group:
stat_smooth(aes(group = cyl,
colour = cyl),
method = "lm",
formula = y ~ 1) +
theme_classic()
# print:
p2.group.means
p3.group.trends <- df2.mtcars.original %>%
# let's add in the kpl column again:
mutate(kpl = (1.609*mpg)/3.785) %>%
# let's recode cyl as a discrete variable:
mutate(cyl = factor(cyl,
levels = c(4, 6, 8))) %>%
# now start using ggplot functions:
ggplot(aes(x = disp, # specify x and y axis
y = kpl)) +
geom_point(aes(colour = cyl,
size = hp)) +
# examine three different ways of summarizing behaviour within
# each level of cyl:
# linear trend within each group:
stat_smooth(aes(group = cyl,
colour = cyl),
method = "lm") +
theme_classic()
# print:
p3.group.trends
p4.overall.trend <- df2.mtcars.original %>%
# let's add in the kpl column again:
mutate(kpl = (1.609*mpg)/3.785) %>%
# let's recode cyl as a discrete variable:
mutate(cyl = factor(cyl,
levels = c(4, 6, 8))) %>%
# now start using ggplot functions:
ggplot(aes(x = disp, # specify x and y axis
y = kpl)) +
geom_point(aes(colour = cyl,
size = hp)) +
# now an overall smooth trend, ignoring the levels of cyl:
stat_smooth() + # defaults to loess for n < 1000; also try method = "lm"
theme_classic()
# print:
p4.overall.trend
p5.box <- df2.mtcars.original %>%
# let's add in the kpl column again:
mutate(kpl = (1.609*mpg)/3.785) %>%
# let's recode cyl as a discrete variable:
mutate(cyl = factor(cyl,
levels = c(4, 6, 8))) %>%
# now start using ggplot functions:
ggplot(aes(x = cyl, # specify x and y axis
y = kpl)) +
geom_boxplot() +
# by default, boxplot shows only median, not mean
# we'll add in the mean here:
stat_summary(fun = mean, # "fun.y" was deprecated in ggplot2 3.3.0
geom = "point",
colour = "firebrick") +
theme_classic()
# print:
p5.box
p6.pairs <- df2.mtcars.original %>%
select(mpg,
cyl,
hp,
disp) %>%
mutate(cyl = as.factor(cyl)) %>%
ggpairs()
# print:
p6.pairs
ggsave(here("results",
"output from src",
"data-summary-plot.pdf"),
p6.pairs,
width = 10,
units = "in")
pdf(here("results",
"output from src",
"all-plots.pdf"),
width = 10)
p1.overall.mean
p2.group.means
p3.group.trends
p4.overall.trend
p5.box
p6.pairs
dev.off()
## png
## 2
Raw data is pretty much useless for decision-making in any complex environment. Simply “presenting the data” is, first, not really possible, and second, not desirable: it’s information overload, and it rarely allows the audience to draw an intelligent conclusion.
It is the job of the data analyst to convert the raw data into a form that is useful for decision-making. This means developing models that abstract away from the raw data.
We’ll be covering several different statistical models in the next few days. For now, I encourage you to think about the models we have already developed here, and how you might develop them further.
str(gapminder)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
p7.gapminder.pairs <- gapminder %>%
select(-c(country)) %>%
ggpairs()
p7.gapminder.pairs
p8.gapminder.gdppercap <- gapminder %>%
ggplot(aes(x=year,
y=gdpPercap)) +
geom_jitter(aes(colour = continent) ,
alpha = 0.2) +
stat_smooth(aes(group = continent,
colour = continent)) +
# scale_y_log10() +
theme_classic()
p8.gapminder.gdppercap
How can we spread out the y-axis to see differences between the three lowest lines?
p9.gdppercap.log <- p8.gapminder.gdppercap +
scale_y_log10()
p9.gdppercap.log
p10.life.gdp <- gapminder %>%
ggplot(aes(x = log(gdpPercap),
y = lifeExp)) +
geom_point(aes(colour = year),
alpha = 0.5) +
facet_wrap(~continent) +
geom_smooth(method = "lm")
p10.life.gdp
Use dplyr and ggplot2 (and any other packages you like) to explore the LGH ED data saved in this shared folder:
Start by saving the data in the “data” sub-folder of your project. Then use the following command.
# change the dataframe's name if you want to
df3.ed.data <- read_csv(here::here("data",
"ed-data.csv"),
na = "NULL") %>%
# convert column names to lowercase, replace space with "_", etc.
clean_names() %>% # this is from the "janitor" package
# change data types:
mutate(ed_los_minutes = as.character(ed_los_minutes) %>%
as.integer,
ad_los_days = as.character(ad_los_days) %>%
as.integer,
age = as.character(age) %>%
as.integer)
We are interested in predicting acute LOS in order to manage patient flow. Some questions to explore:

- Can we use ED LOS to predict acute LOS? (a first modelling sketch follows this list)
- Can we “adjust for” CTAS?
- What is the relationship between age and acute LOS?
- What is the relationship between ED LOS and CTAS?

Think about how quantifying these relationships can be useful for identifying anomalies, designing interventions, etc.
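As a preview of the first question (we’ll cover modelling properly over the next few days), here is a minimal sketch of a plain linear model. The choice of predictors is just a first guess; treat this as a conversation starter, not a recommended model.
m1 <- lm(ad_los_days ~ ed_los_minutes + age + factor(triage_acuity_code),
         data = df3.ed.data) # rows with NAs are dropped by default
summary(m1) # check the coefficients and R-squared before trusting this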
Let’s start by taking a quick look at some of the relationships between the variables.
str(df3.ed.data)
## Classes 'tbl_df', 'tbl' and 'data.frame': 29881 obs. of 11 variables:
## $ facility_short_name : chr "LGH" "LGH" "LGH" "LGH" ...
## $ patient_id : int 18127870 18026225 16774828 2576271 3416991 16872795 16589263 17198994 129272 17313785 ...
## $ visit_id : int 14284344 14284015 14284031 14284167 14284034 14284386 14284051 14284068 14284182 14284319 ...
## $ start_date : Date, format: "2015-01-01" "2015-01-01" ...
## $ start_date_fiscal_year : chr "14/15" "14/15" "14/15" "14/15" ...
## $ age : int 19 19 22 27 34 42 49 60 62 69 ...
## $ triage_acuity_code : int 2 2 2 2 3 3 2 2 2 2 ...
## $ charlson_index_total_weight: int 0 0 0 0 0 0 0 0 0 1 ...
## $ admission_nursing_unit_code: chr "EIP" "EIP" "EIP" "MIU" ...
## $ ed_los_minutes : int 210 2233 850 152 795 1626 585 839 384 949 ...
## $ ad_los_days : int 3 1 4 25 1 1 5 6 5 1 ...
p11.pairs <- df3.ed.data %>%
select(ed_los_minutes,
ad_los_days,
age,
triage_acuity_code) %>%
ggpairs(); p11.pairs
# save output
ggsave(here("results",
"output from src",
"01_ed-date.jpeg")) # by default ggsave saves the last plot that was called
Let’s look at the correlation between ED LOS and acute LOS. Note that Pearson’s correlation specifically measures the strength of linear relationships between two variables.
Spearman’s correlation, on the other hand, is more general: it quantifies the strength of any monotonic relationship, linear or not. For more, see here
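Here’s a quick toy illustration (made-up data, not the ED dataset): for a relationship that is monotonic but non-linear, Spearman’s correlation is exactly 1, while Pearson’s is noticeably lower.
x <- 1:100
y <- exp(x / 20) # monotonic increasing, but strongly non-linear
cor(x, y, method = "pearson") # well below 1
cor(x, y, method = "spearman") # exactly 1
Now for the real data: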
cor(df3.ed.data$ed_los_minutes,
df3.ed.data$ad_los_days,
method = "pearson", # measures strength of *linear* relationship
use = "complete.obs") # remove NA values
## [1] 0.03470739
cor(df3.ed.data$ed_los_minutes,
df3.ed.data$ad_los_days,
method = "spearman", # more general than pearson
use = "complete.obs")
## [1] 0.05261599
cor(df3.ed.data$age,
df3.ed.data$ad_los_days,
method = "spearman",
use = "complete.obs") # 0.28
## [1] 0.2892739
p12.ed.and.ad.los <- df3.ed.data %>%
ggplot(aes(x = ed_los_minutes,
y = ad_los_days)) +
geom_point(alpha = 0.1) + # alpha sets transparency of points
# data is too bunched together to see properly.
# let's convert to log-scales:
scale_y_log10(breaks = c(1, 5, 10, 30, 100)) +
scale_x_log10(breaks = c(30, 60, 600, 1000, 2000, 10000)) +
geom_smooth() + # with n >= 1000 observations, geom_smooth defaults to method = "gam"
geom_smooth(method = "lm",
colour = "skyblue",
se = FALSE) +
facet_wrap(~triage_acuity_code); p12.ed.and.ad.los
# save output
ggsave(here("results",
"output from src",
"02_ed-data.jpeg")) # by default ggsave saves the last plot that was called
Note that for CTAS 2-4, the overall pattern is that as ED LOS increases, mean acute LOS also increases. This is not true for CTAS 1 patients. Also interesting is the “wall” at 600 minutes for CTAS 3 and CTAS 4 patients: mean acute LOS actually seems to decrease slightly for patients who just cross the 600-minute mark, as if these were a completely separate patient population.
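One quick way to probe that “wall” (a sketch; the 400-800 minute window around the cutoff is an arbitrary choice):
df3.ed.data %>%
  filter(triage_acuity_code %in% c(3, 4),
         ed_los_minutes >= 400,
         ed_los_minutes <= 800) %>%
  mutate(over.600 = ed_los_minutes > 600) %>%
  group_by(triage_acuity_code, over.600) %>%
  summarise(mean.ad.los = mean(ad_los_days, na.rm = TRUE),
            n.patients = n())
Next, let’s bring age into the picture: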
p13.ed.and.ad.los.age.groups <- df3.ed.data %>%
# remove CTAS levels with too few observations:
filter(!triage_acuity_code %in% c("5", NA)) %>%
mutate(age.group = case_when(
age < 20 ~ "under 20 yr",
age >=20 & age <60 ~ "20-60 yr",
age >= 60 ~ "over 60 yr") %>%
factor(levels = c("under 20 yr",
"20-60 yr",
"over 60 yr"))) %>%
ggplot(aes(x = ed_los_minutes,
y = ad_los_days)) +
geom_point(alpha = 0.1) +
geom_smooth(aes(group = age.group,
colour = age.group),
se = TRUE,
alpha = 0.2) +
# zoom in on the x-axis:
# coord_cartesian(xlim = c(0, 600)) +
scale_y_log10(breaks = c(1,5,10,30,100)) +
scale_x_log10(breaks = c(30, 60, 300, 600, 2000)) +
facet_wrap(~triage_acuity_code); p13.ed.and.ad.los.age.groups
# save output
ggsave(here("results",
"output from src",
"03_ed-date.jpeg")) # by default ggsave saves the last plot that was called
p14.los.vs.ctas <- df3.ed.data %>%
mutate(age.group = case_when(
age < 20 ~ 1,
age >=20 & age <60 ~ 2,
age >= 60 ~ 3) %>%
as.factor) %>%
# remove NA values of age
filter(!is.na(age)) %>%
mutate(triage_acuity_code = as.factor(triage_acuity_code)) %>%
ggplot(aes(x = triage_acuity_code,
y = ad_los_days)) +
geom_boxplot() +
stat_summary(fun = mean, # "fun.y" was deprecated in ggplot2 3.3.0
geom = "point",
colour = "firebrick") +
# take logarithm of y-axis:
scale_y_log10(breaks = c(1, 2, 3, 4, 5, 10, 30, 60)) + # by default, labels are in original units, not logged units
# zoom in on y-axis:
# coord_cartesian(ylim = c(0.1, 60)) +
facet_wrap(~age.group); p14.los.vs.ctas
Note that this kind of plot helps us to determine what counts as an outlier, based on past data. E.g. for CTAS 5 patients under 20 years, acute LOS of over 5 days is an outlier, but for patients 20-60 years at CTAS 5, LOS would have to be over 30 days before it counted as an outlier.
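The boxplot’s default outlier rule is “more than 1.5 IQRs beyond the nearer quartile”. We can compute the upper fences directly. One caveat: because p14 uses scale_y_log10(), ggplot computes its boxplot statistics on the log scale, so the fences drawn in the plot will differ somewhat from this raw-scale sketch.
df3.ed.data %>%
  filter(!is.na(age),
         !is.na(triage_acuity_code)) %>%
  mutate(age.group = case_when(age < 20 ~ 1,
                               age >= 20 & age < 60 ~ 2,
                               age >= 60 ~ 3)) %>%
  group_by(age.group, triage_acuity_code) %>%
  # upper fence = Q3 + 1.5 * IQR, the cutoff for plotted outliers
  summarise(upper.fence = quantile(ad_los_days, 0.75, na.rm = TRUE) +
              1.5 * IQR(ad_los_days, na.rm = TRUE))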
Well, that’s the end of Day 1. Here’s the link to the material we’ll cover next time: https://rpubs.com/nayefahmad/day-2_time-series