Part 1

Why are we talking about R/Python?

They are the industry standard…

Set up your project

  1. Create a folder where you’ll save project files. I recommend following a folder structure similar to the one mentioned here (a minimal sketch follows this list).
  2. Open RStudio, create a new project, save it in the root folder.
  3. Open a new R script. Now let’s get started with the code!
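
If you want to set the folders up from R itself, here is a minimal sketch; the folder names are illustrative, chosen to match the paths used later in this tutorial:

# run once, from the project root:
dir.create("data")
dir.create("src")
dir.create(file.path("results", "output from src"), recursive = TRUE)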


Install the packages you need

You will only need to do this once per machine. I recommend running these lines in the console instead of the script editor, so the install commands aren’t re-run every time the script is executed.

# install.packages("dplyr")
# install.packages("ggplot2")
# install.packages("here")
# install.packages("readr")
# install.packages("GGally")
# install.packages("gapminder")
# install.packages("janitor")

Load the packages for the current session

library("dplyr")  # for manipulating data frames 
library("ggplot2")  # for data viz
library("here")  # for simplifying folder references 
library("readr")  # reading and writing data files 
library("GGally")  # extension to ggplot2
library("gapminder")  # gapminder dataset
library("janitor")  # for cleaning up dataframes 


Built-in datasets

R comes with some built-in datasets. Let’s start by taking a look at a few of them.

mtcars dataset

mtcars %>% 
    
    # don't use the following lines in a plain R script; they're specifically 
    # for Rmarkdown files, and require the knitr and kableExtra packages
    kable() %>% 
    kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")
model                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Summary functions:

These functions are good for getting a quick sense of a dataset.

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
head(mtcars)  # by default shows the first 6 rows 
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1


Mini-exercise 1

Try doing the same with the following datasets (see the example after this list):

  • iris
  • airquality
  • diamonds (this one comes with the ggplot2 package)
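
For example, the same three calls applied to iris:

str(iris)
summary(iris)
head(iris)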


Outputting data from R

We’re going to save mtcars as a CSV file somewhere in our project folder. Note that with the here() package, we don’t have to specify a full file path - just a path relative to the project root. This means you can copy the whole project to another location on your computer, or to a completely different computer, and all file references should still work.

# Where is the root folder right now? 
here()  # result will be different for everyone 
## [1] "D:/r-for-data-analysis-and-modelling"
# save to the sub-folder "results" >> "output from src"
# note: mtcars stores car names as row names, which write_csv() would drop, 
# so we bind them in as a regular column first
write_csv(cbind(model = rownames(mtcars), mtcars),  # save WHAT?
          here("results",                           # save WHERE?    
               "output from src",      
               "mtcars-original.csv"))

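Alternatively, a sketch using rownames_to_column() from the tibble package (installed as a dependency of dplyr): it turns the row names into a regular column with a clean name.

library("tibble")

mtcars %>% 
    rownames_to_column(var = "model") %>%  # car names become a proper column
    write_csv(here("results", 
                   "output from src", 
                   "mtcars-original.csv"))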

Side-by-side analysis in R and Excel

Take a look at the basic data manipulation vocabulary

We’re going to do the same steps in R and Excel. Note that once you do the steps in R, you will never have to do them again - your work is “conserved”. When the data changes, you don’t have to redo the work; just re-run the code, and it will repeat all the steps you specified. The same is not true in Excel.

This may seem trivial when you’re thinking of doing analysis one file at a time in a “manual” fashion. However, to multiply the productivity and effectiveness of a data science team, it is necessary to automate these tasks so that they can be performed on dozens or even hundreds of files without human intervention (see the sketch below).
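
For instance, here is a minimal sketch of that idea, assuming a folder of CSV files that all share the mtcars layout (the folder and file names here are hypothetical):

# hypothetical: run the same summary over every csv file in the "data" folder
input.files <- list.files(here("data"), 
                          pattern = "\\.csv$", 
                          full.names = TRUE)

summaries <- lapply(input.files, function(path) {
    read_csv(path) %>% 
        group_by(cyl) %>% 
        summarise(avg.mpg = mean(mpg) %>% round(1))
})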

df1.mtcars.modified <- mtcars %>% 
    # Notes on notation: 
    # "x <- 10" means "save the number 10 with object name x"
    # "%>%" translates as "then". That is, first do x %>% do y
    
    # select certain COLUMNS
    select(cyl, 
           mpg, 
           disp) %>% 
    
    # filter to keep certain ROWS (dropping the rest)
    filter(mpg <= 30) %>%  # let's say mpg values over 30 are outliers 
    
    # rename COLUMNS: new_name = old_name
    rename(cylinders = cyl, 
           miles.per.gallon = mpg, 
           displacement = disp) %>% 
    
    # let's say one of the entries of mpg was a known data error: 
    mutate(miles.per.gallon = case_when(
        miles.per.gallon == 15 ~ 15.5,  # "~" is like the "then" statement in SQL
        TRUE ~ miles.per.gallon
        )) %>% 
    
    # Wait, what kind of savages use units like miles and gallons? 
    # let's create a new column with proper civilized units: 
    mutate(kilometres.per.litre = (1.609*miles.per.gallon)/3.785) %>% 
    group_by(cylinders) %>% 
    
    summarise(avg.kpl = mean(kilometres.per.litre) %>% round(1), 
              avg.disp = mean(displacement) %>% round(1))
# show the output
df1.mtcars.modified %>% 
    kable() %>%  # again, these two functions are for R Markdown output only
    kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")
cylinders  avg.kpl  avg.disp
        4     10.1     119.4
        6      8.4     183.3
        8      6.4     353.1


Output the modified dataset

write_csv(df1.mtcars.modified, 
          here("results",
               "output from src",      
               "mtcars-summarized.csv"))

Read in data:

Although mtcars is already built into R, let’s practice reading in the CSV file we created.

df2.mtcars.original <- read_csv(here("results",
                                     "output from src",
                                     "mtcars-original.csv"))


Data visualization with ggplot

Scatterplots:

p1.overall.mean <- df2.mtcars.original %>% 
    
    # let's add in the kpl column again: 
    mutate(kpl = (1.609*mpg)/3.785) %>%
    
    # let's recode cyl as a discrete variable (aka "factor"): 
    mutate(cyl = factor(cyl, 
                        levels = c(4, 6, 8))) %>% 
    
    # now start using ggplot functions: 
    ggplot(aes(x = disp,  # specify x and y axis
               y = kpl)) + 
    
    # geom_point creates a scatterplot
    geom_point(aes(colour = cyl, 
                   size = hp)) + 
    
    # overall mean: "y ~ 1" fits an intercept-only model, so the 
    # fitted line is just the overall mean of kpl
    stat_smooth(method = "lm", 
                formula = y ~ 1) + 
    
    theme_classic()
    
    
# print: 
p1.overall.mean
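
As a quick check that the flat line really is the overall mean:

df2.mtcars.original %>% 
    mutate(kpl = (1.609*mpg)/3.785) %>% 
    summarise(mean.kpl = mean(kpl))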

p2.group.means <- df2.mtcars.original %>% 
    
    # let's add in the kpl column again: 
    mutate(kpl = (1.609*mpg)/3.785) %>%
    
    # let's recode cyl as a discrete variable: 
    mutate(cyl = factor(cyl, 
                        levels = c(4, 6, 8))) %>% 
    
    # now start using ggplot functions: 
    ggplot(aes(x = disp,  # specify x and y axis
               y = kpl)) + 
    
    geom_point(aes(colour = cyl, 
                   size = hp)) + 
    
    # examine three different ways of summarizing behaviour within
    # each level of cyl: 
    
    # mean by group: 
    stat_smooth(aes(group = cyl,
                    colour = cyl), 
                method = "lm", 
                formula = y ~ 1) + 
    
    theme_classic()
    
# print: 
p2.group.means

p3.group.trends <- df2.mtcars.original %>% 
    
    # let's add in the kpl column again: 
    mutate(kpl = (1.609*mpg)/3.785) %>%
    
    # let's recode cyl as a discrete variable: 
    mutate(cyl = factor(cyl, 
                        levels = c(4, 6, 8))) %>% 
    
    # now start using ggplot functions: 
    ggplot(aes(x = disp,  # specify x and y axis
               y = kpl)) + 
    
    geom_point(aes(colour = cyl, 
                   size = hp)) + 
    
    # examine three different ways of summarizing behaviour within
    # each level of cyl: 
    
    # linear trend by group: 
    stat_smooth(aes(group = cyl,
                    colour = cyl), 
                method = "lm") + 
    
    theme_classic()
    
# print: 
p3.group.trends

p4.overall.trend <- df2.mtcars.original %>% 
    
    # let's add in the kpl column again: 
    mutate(kpl = (1.609*mpg)/3.785) %>%
    
    # let's recode cyl as a discrete variable: 
    mutate(cyl = factor(cyl, 
                        levels = c(4, 6, 8))) %>% 
    
    # now start using ggplot functions: 
    ggplot(aes(x = disp,  # specify x and y axis
               y = kpl)) + 
    
    geom_point(aes(colour = cyl, 
                   size = hp)) + 
    
    # examine three different ways of summarizing behaviour within
    # each level of cyl: 
    
    # overall smooth trend (loess by default, for a dataset this small): 
    stat_smooth() +  # also try method = "lm"
    
    theme_classic()
    
# print: 
p4.overall.trend

Boxplot:

p5.box <- df2.mtcars.original %>% 
    
    # let's add in the kpl column again: 
    mutate(kpl = (1.609*mpg)/3.785) %>%
    
    # let's recode cyl as a discrete variable: 
    mutate(cyl = factor(cyl, 
                        levels = c(4, 6, 8))) %>% 
    
    # now start using ggplot functions: 
    ggplot(aes(x = cyl,  # specify x and y axis
               y = kpl)) + 
    
    geom_boxplot() + 
    
    # by default, boxplot shows only median, not mean; 
    # we'll add in the mean here 
    # (note: in ggplot2 >= 3.3.0, the fun.y argument is renamed to fun)
    stat_summary(fun.y = mean, 
                 geom = "point", 
                 colour = "firebrick") + 
    
    theme_classic()

# print: 
p5.box

Summary of all important variables in the dataset

p6.pairs <- df2.mtcars.original %>% 
    select(mpg, 
           cyl, 
           hp, 
           disp) %>% 
    
    mutate(cyl = as.factor(cyl)) %>% 
    
    ggpairs()
    
# print: 
p6.pairs

Two ways of saving plots:

Individually with ggsave():

ggsave(here("results", 
            "output from src", 
            "data-summary-plot.pdf"), 
       p6.pairs, 
       width = 10, 
       units = "in")

Multiple plots using the pdf() device:

pdf(here("results", 
         "output from src", 
         "all-plots.pdf"), 
    width = 10)

p1.overall.mean
p2.group.means
p3.group.trends
p4.overall.trend
p5.box
p6.pairs
dev.off()
## png 
##   2
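
One caveat: ggplot objects only auto-print at the top level of the console. Inside a for loop, a function, or a script run with source(), you must call print() explicitly. A minimal sketch (the output file name here is illustrative):

pdf(here("results", 
         "output from src", 
         "all-plots-loop.pdf"), 
    width = 10)

for (p in list(p1.overall.mean, p2.group.means, p3.group.trends)) {
    print(p)  # explicit print() is required inside a loop
}

dev.off()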


Let’s talk about models for data

Raw data is pretty much useless for decision-making in any complex environment. Simply “presenting the data” is, first, not really possible, and second, not desirable: it’s information overload, and rarely allows the audience to draw an intelligent conclusion.

It is the job of the data analyst to convert the raw data into a form that is useful for decision-making. This means developing models that abstract away from the raw data by doing one of three things:

  1. Summarizing the data
  2. Making inferences from the data
  3. Making predictions from the data

This is a good summary of the difference between 2 and 3.

We’ll be covering several different statistical models in the next few days. For now, I encourage you to think about the models we have already developed here, and how you might develop them further.



Visualizing the Gapminder dataset

str(gapminder)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
p7.gapminder.pairs <- gapminder %>% 
    select(-c(country)) %>% 
    ggpairs()

p7.gapminder.pairs

GDP per capita growth over time

p8.gapminder.gdppercap <- gapminder %>% 
    ggplot(aes(x=year, 
               y=gdpPercap)) + 
    geom_jitter(aes(colour = continent), 
                alpha = 0.2) + 
    stat_smooth(aes(group = continent, 
                    colour = continent)) + 
    # scale_y_log10() + 
    
    theme_classic()

p8.gapminder.gdppercap

How can we spread out the y-axis to see differences between the three lowest lines?

p9.gdppercap.log <- p8.gapminder.gdppercap + 
    scale_y_log10()

p9.gdppercap.log

Relationship between life expectancy and gdp per capita

p10.life.gdp <- gapminder %>% 
    ggplot(aes(x = log(gdpPercap), 
               y = lifeExp)) + 
    
    geom_point(aes(colour = year), 
               alpha = 0.5) + 
    
    facet_wrap(~continent) + 
    
    geom_smooth(method = "lm")

p10.life.gdp
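
Note that log(gdpPercap) inside aes() transforms the data itself, so the x-axis is labelled in log units. If you would rather keep the axis labels in the original dollar units, transform the scale instead. A sketch (p10b is a new, illustrative object name):

p10b.life.gdp <- gapminder %>% 
    ggplot(aes(x = gdpPercap, 
               y = lifeExp)) + 
    
    geom_point(aes(colour = year), 
               alpha = 0.5) + 
    
    facet_wrap(~continent) + 
    
    geom_smooth(method = "lm") + 
    
    # scale transformations are applied before the smoother is fitted, 
    # so the line is still straight in log-space: 
    scale_x_log10()

p10b.life.gdp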



Part 2

Exercise: Exploring LGH ED data

Use dplyr and ggplot2 (and any other packages you want) to explore the LGH ED data saved in this shared folder:

Start by saving the data in the “data” sub-folder of your project. Then use the following command.

# change the dataframe's name if you want to  
df3.ed.data <- read_csv(here::here("data", 
                             "ed-data.csv"), 
                        na = "NULL") %>%  # treat "NULL" strings as missing values
    
    # convert column names to lowercase, replace spaces with "_", etc.
    clean_names() %>%  # this is from the "janitor" package 
    
    # change data types: 
    mutate(ed_los_minutes = as.character(ed_los_minutes) %>% 
               as.integer, 
           ad_los_days = as.character(ad_los_days) %>% 
               as.integer, 
           age = as.character(age) %>% 
               as.integer)

Questions for investigation:

  1. We are interested in predicting acute LOS in order to manage patient flow. Can we use ED LOS to predict acute LOS?

  2. Can we “adjust for” CTAS?

  3. What is the relationship between age and acute LOS?

  4. What is the relationship between ED LOS and CTAS?

  5. Think about how quantifying these relationships can be useful for identifying anomalies, designing interventions, etc.

Let’s start by taking a quick look at some of the relationships between the variables.

str(df3.ed.data)
## Classes 'tbl_df', 'tbl' and 'data.frame':    29881 obs. of  11 variables:
##  $ facility_short_name        : chr  "LGH" "LGH" "LGH" "LGH" ...
##  $ patient_id                 : int  18127870 18026225 16774828 2576271 3416991 16872795 16589263 17198994 129272 17313785 ...
##  $ visit_id                   : int  14284344 14284015 14284031 14284167 14284034 14284386 14284051 14284068 14284182 14284319 ...
##  $ start_date                 : Date, format: "2015-01-01" "2015-01-01" ...
##  $ start_date_fiscal_year     : chr  "14/15" "14/15" "14/15" "14/15" ...
##  $ age                        : int  19 19 22 27 34 42 49 60 62 69 ...
##  $ triage_acuity_code         : int  2 2 2 2 3 3 2 2 2 2 ...
##  $ charlson_index_total_weight: int  0 0 0 0 0 0 0 0 0 1 ...
##  $ admission_nursing_unit_code: chr  "EIP" "EIP" "EIP" "MIU" ...
##  $ ed_los_minutes             : int  210 2233 850 152 795 1626 585 839 384 949 ...
##  $ ad_los_days                : int  3 1 4 25 1 1 5 6 5 1 ...
p11.pairs <- df3.ed.data %>%
    select(ed_los_minutes, 
           ad_los_days, 
           age, 
           triage_acuity_code) %>% 
    
    ggpairs(); p11.pairs

# save output 
ggsave(here("results", 
            "output from src", 
            "01_ed-data.jpeg"))  # by default, ggsave saves the last plot displayed 

Two types of correlation: Pearson’s and Spearman’s

Let’s look at the correlation between ED LOS and acute LOS. Note that Pearson’s correlation specifically measures the strength of linear relationships between two variables.

On the other hand, Spearman’s correlation is more general: it quantifies the strength of monotonic relationships, whether or not they are linear. For more, see here

cor(df3.ed.data$ed_los_minutes, 
    df3.ed.data$ad_los_days, 
    method = "pearson",  # measures strength of *linear* relationship 
    use = "complete.obs")  # remove NA values 
## [1] 0.03470739
cor(df3.ed.data$ed_los_minutes, 
    df3.ed.data$ad_los_days, 
    method = "spearman",  # more general than pearson
    use = "complete.obs")
## [1] 0.05261599
cor(df3.ed.data$age, 
    df3.ed.data$ad_los_days, 
    method = "spearman", 
    use = "complete.obs")  # 0.28
## [1] 0.2892739
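
cor() returns only the point estimate. If you also want a p-value, base R’s cor.test() is one option; it drops incomplete pairs automatically:

cor.test(df3.ed.data$ed_los_minutes, 
         df3.ed.data$ad_los_days, 
         method = "spearman", 
         exact = FALSE)  # exact p-values can't be computed when there are ties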

Plotting Acute LOS vs ED LOS

p12.ed.and.ad.los <- df3.ed.data %>% 
    ggplot(aes(x = ed_los_minutes, 
               y = ad_los_days)) + 
    geom_point(alpha = 0.1) +  # alpha sets transparency of points 
    
    # data is too bunched together to see properly. 
    # let's convert to log-scales: 
    
    scale_y_log10(breaks = c(1, 5, 10, 30, 100)) + 
    scale_x_log10(breaks = c(30, 60, 600, 1000, 2000, 10000)) + 
    
    geom_smooth() +  # with 1000+ rows, the default smoothing method is "gam" 
    
    geom_smooth(method = "lm", 
                colour = "skyblue", 
                se = FALSE) + 
    
    facet_wrap(~triage_acuity_code); p12.ed.and.ad.los

# save output 
ggsave(here("results", 
            "output from src", 
            "02_ed-data.jpeg"))  # by default, ggsave saves the last plot displayed 

Note that for CTAS 2-4, the overall pattern is that as ED LOS increases, mean acute LOS also increases. This is not true for CTAS 1 patients. Also interesting is the “wall” at 600 minutes for CTAS 3 and CTAS 4 patients. The mean acute LOS actually seems to decrease slightly for patients who just cross the 600 min mark - it’s like this is a completely separate patient population.
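
One way to start probing that “wall”, a sketch using only columns we already have: count visits on either side of the 600-minute mark.

df3.ed.data %>% 
    filter(triage_acuity_code %in% c(3, 4)) %>% 
    mutate(over.600.min = ed_los_minutes > 600) %>% 
    count(triage_acuity_code, over.600.min)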


Acute LOS vs ED LOS by age

p13.ed.and.ad.los.age.groups <- df3.ed.data %>%
    
    # remove CTAS levels with too few observations: 
    filter(!triage_acuity_code %in% c("5", NA)) %>% 
    
    mutate(age.group = case_when(
        age < 20 ~ "under 20 yr", 
        age >=20 & age <60 ~ "20-60 yr", 
        age >= 60 ~ "over 60 yr") %>% 
            factor(levels = c("under 20 yr", 
                              "20-60 yr", 
                              "over 60 yr"))) %>%
    
    ggplot(aes(x = ed_los_minutes, 
               y = ad_los_days)) + 
    geom_point(alpha = 0.1) + 
    geom_smooth(aes(group = age.group, 
                    colour = age.group), 
                se = TRUE, 
                alpha = 0.2) + 
    
    # zoom in on the x-axis: 
    # coord_cartesian(xlim = c(0, 600)) +
    scale_y_log10(breaks = c(1,5,10,30,100)) + 
    scale_x_log10(breaks = c(30, 60, 300, 600, 2000)) + 
        
    facet_wrap(~triage_acuity_code); p13.ed.and.ad.los.age.groups

# save output 
ggsave(here("results", 
            "output from src", 
            "03_ed-data.jpeg"))  # by default, ggsave saves the last plot displayed 

Acute LOS vs CTAS level:

p14.los.vs.ctas <- df3.ed.data %>% 
    
    mutate(age.group = case_when(
        age < 20 ~ 1, 
        age >=20 & age <60 ~ 2, 
        age >= 60 ~ 3) %>% 
            as.factor) %>% 
    
    # remove NA value of age
    filter(!is.na(age)) %>% 
    
    mutate(triage_acuity_code = as.factor(triage_acuity_code)) %>% 
    
    ggplot(aes(x = triage_acuity_code, 
               y = ad_los_days)) + 
    
    geom_boxplot() + 
    
    stat_summary(fun.y = mean, 
                 geom = "point", 
                 colour = "firebrick") + 
    
    # take logarithm of y-axis: 
    scale_y_log10(breaks = c(1, 2, 3, 4, 5, 10, 30, 60)) +  # by default, labels are in original units, not logged units
    
    # zoom in on y-axis: 
    # coord_cartesian(ylim = c(0.1, 60)) + 
    
    facet_wrap(~age.group); p14.los.vs.ctas

Note that this kind of plot helps us to determine what counts as an outlier, based on past data. E.g. for CTAS 5 patients under 20 years, acute LOS of over 5 days is an outlier, but for patients 20-60 years at CTAS 5, LOS would have to be over 30 days before it counted as an outlier.
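
If you want those outlier thresholds as numbers rather than reading them off the plot, you can compute geom_boxplot()’s default upper fence (Q3 + 1.5 × IQR; the hinge positions may differ very slightly for small groups). A sketch, recomputing the age.group definition from the plot above:

df3.ed.data %>% 
    filter(!is.na(age)) %>% 
    
    mutate(age.group = case_when(
        age < 20 ~ 1, 
        age >=20 & age <60 ~ 2, 
        age >= 60 ~ 3) %>% 
            as.factor) %>% 
    
    group_by(age.group, triage_acuity_code) %>% 
    summarise(upper.fence = quantile(ad_los_days, 0.75, na.rm = TRUE) + 
                  1.5 * IQR(ad_los_days, na.rm = TRUE))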


See you next time

Well, that’s the end of Day 1. Here’s the link to the material we’ll cover next time: https://rpubs.com/nayefahmad/day-2_time-series