Ok, let’s get started
You will only need to do this once. I recommend running these lines in the console instead of the script editor.
# install.packages("dplyr")
# install.packages("ggplot2")
# install.packages("here")
# install.packages("readr")
# install.packages("GGally")
# install.packages("gapminder")
library("dplyr") # for manipulating data frames
library("ggplot2") # for data viz
library("here") # for simplifying folder references
library("readr") # reading and writing data files
library("GGally") # extension to ggplot2
library("gapminder") # gapminder dataset
library("janitor") # for cleaning up dataframes
R comes with some built-in datasets. Let’s start by taking a look at a few of them.
mtcars %>%
# don't use the following lines; they're specifically for Rmarkdown files
# (kable() is from the knitr package, kable_styling() from kableExtra)
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
| Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
| Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
| Valiant | 18.1 | 6 | 225.0 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
| Duster 360 | 14.3 | 8 | 360.0 | 245 | 3.21 | 3.570 | 15.84 | 0 | 0 | 3 | 4 |
| Merc 240D | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.190 | 20.00 | 1 | 0 | 4 | 2 |
| Merc 230 | 22.8 | 4 | 140.8 | 95 | 3.92 | 3.150 | 22.90 | 1 | 0 | 4 | 2 |
| Merc 280 | 19.2 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.30 | 1 | 0 | 4 | 4 |
| Merc 280C | 17.8 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.90 | 1 | 0 | 4 | 4 |
| Merc 450SE | 16.4 | 8 | 275.8 | 180 | 3.07 | 4.070 | 17.40 | 0 | 0 | 3 | 3 |
| Merc 450SL | 17.3 | 8 | 275.8 | 180 | 3.07 | 3.730 | 17.60 | 0 | 0 | 3 | 3 |
| Merc 450SLC | 15.2 | 8 | 275.8 | 180 | 3.07 | 3.780 | 18.00 | 0 | 0 | 3 | 3 |
| Cadillac Fleetwood | 10.4 | 8 | 472.0 | 205 | 2.93 | 5.250 | 17.98 | 0 | 0 | 3 | 4 |
| Lincoln Continental | 10.4 | 8 | 460.0 | 215 | 3.00 | 5.424 | 17.82 | 0 | 0 | 3 | 4 |
| Chrysler Imperial | 14.7 | 8 | 440.0 | 230 | 3.23 | 5.345 | 17.42 | 0 | 0 | 3 | 4 |
| Fiat 128 | 32.4 | 4 | 78.7 | 66 | 4.08 | 2.200 | 19.47 | 1 | 1 | 4 | 1 |
| Honda Civic | 30.4 | 4 | 75.7 | 52 | 4.93 | 1.615 | 18.52 | 1 | 1 | 4 | 2 |
| Toyota Corolla | 33.9 | 4 | 71.1 | 65 | 4.22 | 1.835 | 19.90 | 1 | 1 | 4 | 1 |
| Toyota Corona | 21.5 | 4 | 120.1 | 97 | 3.70 | 2.465 | 20.01 | 1 | 0 | 3 | 1 |
| Dodge Challenger | 15.5 | 8 | 318.0 | 150 | 2.76 | 3.520 | 16.87 | 0 | 0 | 3 | 2 |
| AMC Javelin | 15.2 | 8 | 304.0 | 150 | 3.15 | 3.435 | 17.30 | 0 | 0 | 3 | 2 |
| Camaro Z28 | 13.3 | 8 | 350.0 | 245 | 3.73 | 3.840 | 15.41 | 0 | 0 | 3 | 4 |
| Pontiac Firebird | 19.2 | 8 | 400.0 | 175 | 3.08 | 3.845 | 17.05 | 0 | 0 | 3 | 2 |
| Fiat X1-9 | 27.3 | 4 | 79.0 | 66 | 4.08 | 1.935 | 18.90 | 1 | 1 | 4 | 1 |
| Porsche 914-2 | 26.0 | 4 | 120.3 | 91 | 4.43 | 2.140 | 16.70 | 0 | 1 | 5 | 2 |
| Lotus Europa | 30.4 | 4 | 95.1 | 113 | 3.77 | 1.513 | 16.90 | 1 | 1 | 5 | 2 |
| Ford Pantera L | 15.8 | 8 | 351.0 | 264 | 4.22 | 3.170 | 14.50 | 0 | 1 | 5 | 4 |
| Ferrari Dino | 19.7 | 6 | 145.0 | 175 | 3.62 | 2.770 | 15.50 | 0 | 1 | 5 | 6 |
| Maserati Bora | 15.0 | 8 | 301.0 | 335 | 3.54 | 3.570 | 14.60 | 0 | 1 | 5 | 8 |
| Volvo 142E | 21.4 | 4 | 121.0 | 109 | 4.11 | 2.780 | 18.60 | 1 | 1 | 4 | 2 |
These functions are good for getting a quick sense of a dataset.
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
head(mtcars) # by default shows the top 6 rows
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
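Since we've already loaded dplyr, its glimpse() function is a handy alternative to str():
glimpse(mtcars) # one row per variable; shows as many values as fit on screen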
Try doing the same with some of R's other built-in datasets.
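For example (these three ship with base R; run data() for the full list):
str(iris) # measurements of 150 iris flowers
summary(airquality) # daily New York air quality readings, 1973
head(ToothGrowth) # vitamin C and tooth growth in guinea pigs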
We’re going to save mtcars as a csv file somewhere in our project folder. Note that with the here() package, we don’t have to specify a full file path - just a path relative to the project root. This means you can copy the whole project to another location on your computer, or to a completely different computer, and all file references should still work.
# Where is the root folder right now?
here() # result will be different for everyone
## [1] "D:/r-for-data-analysis-and-modelling"
# save to the sub-folder "results" >> "output from src"
# note: mtcars stores the car names as row names, which write_csv
# drops, so we first move them into a proper column called "model"
write_csv(cbind(model = rownames(mtcars), mtcars), # save WHAT?
          here("results", # save WHERE?
               "output from src",
               "mtcars-original.csv"))
Take a look at the basic data manipulation vocabulary
We’re going to do the same steps in R and Excel. Note that once you do the steps in R, you will never have to do them again - your work is “conserved”. When the data changes, you don’t have to redo the work: just re-run the code, and it will repeat all the steps you specified. The same is not true in Excel.
This may seem trivial when you’re thinking of doing analysis one file at a time in a “manual” fashion. However, to multiply the productivity and effectiveness of a data science team, it is necessary to automate these tasks so that they can be performed on dozens or even hundreds of files simultaneously without human intervention.
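As a small taste of what that automation looks like, here is a sketch of batch processing. The folder "monthly-extracts" is hypothetical, and we assume every file in it has the same columns, but the pattern - list the files, then apply one function to each - is the general recipe.
# list every csv file in a (hypothetical) folder of monthly extracts
files <- list.files(here("data", "monthly-extracts"),
                    pattern = "\\.csv$",
                    full.names = TRUE)
# read each file and apply the same steps to every one of them
all.summaries <- lapply(files, function(f) {
  read_csv(f) %>%
    summarise(n.rows = n()) # swap in your real analysis steps here
})
Now, back to mtcars: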
df1.mtcars.modified <- mtcars %>%
# Notes on notation:
# "x <- 10" means "save the number 10 with object name x"
# "%>%" translates as "then". That is, first do x %>% do y
# select certain COLUMNS
select(cyl,
mpg,
disp) %>%
# filter out certain ROWS
filter(mpg <= 30) %>% # keep rows where mpg <= 30; let's say the higher values are outliers
rename(cylinders = cyl,
miles.per.gallon = mpg,
displacement = disp) %>%
# let's say one of the entries of mpg was a known data error:
mutate(miles.per.gallon = case_when(
miles.per.gallon == 15 ~ 15.5, # "~" is like THEN in a SQL CASE statement
TRUE ~ miles.per.gallon
)) %>%
# Wait, what kind of savages use units like miles and gallons?
# let's create a new column with proper civilized units:
mutate(kilometres.per.litre = (1.609*miles.per.gallon)/3.785) %>% # 1 mile = 1.609 km; 1 US gallon = 3.785 litres
group_by(cylinders) %>%
summarise(avg.kpl = mean(kilometres.per.litre) %>% round(1),
avg.disp = mean(displacement) %>% round(1))
# show the output
df1.mtcars.modified %>%
kable() %>% # use this to print
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
| cylinders | avg.kpl | avg.disp |
|---|---|---|
| 4 | 10.1 | 119.4 |
| 6 | 8.4 | 183.3 |
| 8 | 6.4 | 353.1 |
write_csv(df1.mtcars.modified,
here("results",
"output from src",
"mtcars-summarized.csv"))
Although we already have mtcars built-in, let’s practice reading in the csv file we created.
df2.mtcars.original <- read_csv(here("results",
"output from src",
"mtcars-original.csv"))
p1.overall.mean <- df2.mtcars.original %>%
# let's add in the kpl column again:
mutate(kpl = (1.609*mpg)/3.785) %>%
# let's recode cyl as a discrete variable (aka "factor"):
mutate(cyl = factor(cyl,
levels = c(4, 6, 8))) %>%
# now start using ggplot functions:
ggplot(aes(x = disp, # specify x and y axis
y = kpl)) +
# geom_point creates a scatterplot
geom_point(aes(colour = cyl,
size = hp)) +
# overall mean: "y ~ 1" fits an intercept-only model,
# i.e. a horizontal line at the mean of kpl
stat_smooth(method = "lm",
formula = y ~ 1) +
theme_classic()
# print:
p1.overall.mean
p2.group.means <- df2.mtcars.original %>%
# let's add in the kpl column again:
mutate(kpl = (1.609*mpg)/3.785) %>%
# let's recode cyl as a discrete variable:
mutate(cyl = factor(cyl,
levels = c(4, 6, 8))) %>%
# now start using ggplot functions:
ggplot(aes(x = disp, # specify x and y axis
y = kpl)) +
geom_point(aes(colour = cyl,
size = hp)) +
# examine three different ways of summarizing behaviour within
# each level of cyl:
# mean by group:
stat_smooth(aes(group = cyl,
colour = cyl),
method = "lm",
formula = y ~ 1) +
theme_classic()
# print:
p2.group.means
p3.group.trends <- df2.mtcars.original %>%
# let's add in the kpl column again:
mutate(kpl = (1.609*mpg)/3.785) %>%
# let's recode cyl as a discrete variable:
mutate(cyl = factor(cyl,
levels = c(4, 6, 8))) %>%
# now start using ggplot functions:
ggplot(aes(x = disp, # specify x and y axis
y = kpl)) +
geom_point(aes(colour = cyl,
size = hp)) +
# examine three different ways of summarizing behaviour within
# each level of cyl:
# linear trend within each group:
stat_smooth(aes(group = cyl,
colour = cyl),
method = "lm") +
theme_classic()
# print:
p3.group.trends
p4.overall.trend <- df2.mtcars.original %>%
# let's add in the kpl column again:
mutate(kpl = (1.609*mpg)/3.785) %>%
# let's recode cyl as a discrete variable:
mutate(cyl = factor(cyl,
levels = c(4, 6, 8))) %>%
# now start using ggplot functions:
ggplot(aes(x = disp, # specify x and y axis
y = kpl)) +
geom_point(aes(colour = cyl,
size = hp)) +
# now an overall smooth trend, ignoring the levels of cyl:
stat_smooth() + # defaults to loess for n < 1000; also try method = "lm"
theme_classic()
# print:
p4.overall.trend
p5.box <- df2.mtcars.original %>%
# let's add in the kpl column again:
mutate(kpl = (1.609*mpg)/3.785) %>%
# let's recode cyl as a discrete variable:
mutate(cyl = factor(cyl,
levels = c(4, 6, 8))) %>%
# now start using ggplot functions:
ggplot(aes(x = cyl, # specify x and y axis
y = kpl)) +
geom_boxplot() +
# by default, boxplot shows only median, not mean
# we'll add in the mean here:
stat_summary(fun = mean, # "fun.y" was deprecated in ggplot2 3.3.0
geom = "point",
colour = "firebrick") +
theme_classic()
# print:
p5.box
p6.pairs <- df2.mtcars.original %>%
select(mpg,
cyl,
hp,
disp) %>%
mutate(cyl = as.factor(cyl)) %>%
ggpairs()
# print:
p6.pairs
ggsave(here("results",
"output from src",
"data-summary-plot.pdf"),
p6.pairs,
width = 10,
units = "in")
pdf(here("results",
"output from src",
"all-plots.pdf"),
width = 10)
p1.overall.mean
p2.group.means
p3.group.trends
p4.overall.trend
p5.box
p6.pairs
dev.off()
## png
## 2
Raw data is pretty much useless for decision-making in any complex environment. Simply “presenting the data” is, first, not really possible, and second, not desirable: it’s information overload, and it rarely allows the audience to draw an intelligent conclusion.
It is the job of the data analyst to convert the raw data into a form that is useful for decision-making. This means developing models that abstract away from the raw data.
We’ll be covering several different statistical models in the next few days. For now, I encourage you to think about the models we have already developed here, and how you might develop them further.
str(gapminder)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
p7.gapminder.pairs <- gapminder %>%
select(-c(country)) %>%
ggpairs()
p7.gapminder.pairs
p8.gapminder.gdppercap <- gapminder %>%
ggplot(aes(x=year,
y=gdpPercap)) +
geom_jitter(aes(colour = continent) ,
alpha = 0.2) +
stat_smooth(aes(group = continent,
colour = continent)) +
# scale_y_log10() +
theme_classic()
p8.gapminder.gdppercap
How can we spread out the y-axis to see differences between the three lowest lines?
p9.gdppercap.log <- p8.gapminder.gdppercap +
scale_y_log10()
p9.gdppercap.log
p10.life.gdp <- gapminder %>%
ggplot(aes(x = log(gdpPercap),
y = lifeExp)) +
geom_point(aes(colour = year),
alpha = 0.5) +
facet_wrap(~continent) +
geom_smooth(method = "lm")
p10.life.gdp
Use dplyr and ggplot2 (and any other packages you like) to explore the LGH ED data saved in this shared folder:
Start by saving the data in the “data” sub-folder of your project. Then use the following command.
# change the dataframe's name if you want to
df3.ed.data <- read_csv(here::here("data",
"ed-data.csv"),
na = "NULL") %>%
# convert column names to lowercase, replace space with "_", etc.
clean_names() %>% # this is from the "janitor" package
# change data types:
mutate(ed_los_minutes = as.character(ed_los_minutes) %>%
as.integer,
ad_los_days = as.character(ad_los_days) %>%
as.integer,
age = as.character(age) %>%
as.integer)
We are interested in predicting acute LOS in order to manage patient flow. Some questions to explore:

- Can we use ED LOS to predict acute LOS? (a first modelling sketch follows this list)
- Can we “adjust for” CTAS?
- What is the relationship between age and acute LOS?
- What is the relationship between ED LOS and CTAS?

Think about how quantifying these relationships can be useful for identifying anomalies, designing interventions, etc.
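As a preview of the first question (we’ll cover modelling properly over the next few days), here is a minimal sketch of a plain linear model. The choice of predictors is just a first guess; treat this as a conversation starter, not a recommended model.
m1 <- lm(ad_los_days ~ ed_los_minutes + age + factor(triage_acuity_code),
         data = df3.ed.data) # rows with NAs are dropped by default
summary(m1) # check the coefficients and R-squared before trusting this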
Let’s start by taking a quick look at some of the relationships between the variables.
str(df3.ed.data)
## Classes 'tbl_df', 'tbl' and 'data.frame': 29881 obs. of 11 variables:
## $ facility_short_name : chr "LGH" "LGH" "LGH" "LGH" ...
## $ patient_id : int 18127870 18026225 16774828 2576271 3416991 16872795 16589263 17198994 129272 17313785 ...
## $ visit_id : int 14284344 14284015 14284031 14284167 14284034 14284386 14284051 14284068 14284182 14284319 ...
## $ start_date : Date, format: "2015-01-01" "2015-01-01" ...
## $ start_date_fiscal_year : chr "14/15" "14/15" "14/15" "14/15" ...
## $ age : int 19 19 22 27 34 42 49 60 62 69 ...
## $ triage_acuity_code : int 2 2 2 2 3 3 2 2 2 2 ...
## $ charlson_index_total_weight: int 0 0 0 0 0 0 0 0 0 1 ...
## $ admission_nursing_unit_code: chr "EIP" "EIP" "EIP" "MIU" ...
## $ ed_los_minutes : int 210 2233 850 152 795 1626 585 839 384 949 ...
## $ ad_los_days : int 3 1 4 25 1 1 5 6 5 1 ...
p11.pairs <- df3.ed.data %>%
select(ed_los_minutes,
ad_los_days,
age,
triage_acuity_code) %>%
ggpairs(); p11.pairs
# save output
ggsave(here("results",
"output from src",
"01_ed-date.jpeg")) # by default ggsave saves the last plot that was called
Let’s look at the correlation between ED LOS and acute LOS. Note that Pearson’s correlation specifically measures the strength of linear relationships between two variables.
Spearman’s correlation, on the other hand, is more general: it quantifies the strength of any monotonic relationship, linear or not. For more, see here
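Here’s a quick toy illustration (made-up data, not the ED dataset): for a relationship that is monotonic but non-linear, Spearman’s correlation is exactly 1, while Pearson’s is noticeably lower.
x <- 1:100
y <- exp(x / 20) # monotonic increasing, but strongly non-linear
cor(x, y, method = "pearson") # well below 1
cor(x, y, method = "spearman") # exactly 1
Now for the real data: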
cor(df3.ed.data$ed_los_minutes,
df3.ed.data$ad_los_days,
method = "pearson", # measures strength of *linear* relationship
use = "complete.obs") # remove NA values
## [1] 0.03470739
cor(df3.ed.data$ed_los_minutes,
df3.ed.data$ad_los_days,
method = "spearman", # more general than pearson
use = "complete.obs")
## [1] 0.05261599
cor(df3.ed.data$age,
df3.ed.data$ad_los_days,
method = "spearman",
use = "complete.obs") # 0.28
## [1] 0.2892739
p12.ed.and.ad.los <- df3.ed.data %>%
ggplot(aes(x = ed_los_minutes,
y = ad_los_days)) +
geom_point(alpha = 0.1) + # alpha sets transparency of points
# data is too bunched together to see properly.
# let's convert to log-scales:
scale_y_log10(breaks = c(1, 5, 10, 30, 100)) +
scale_x_log10(breaks = c(30, 60, 600, 1000, 2000, 10000)) +
geom_smooth() + # with n >= 1000 observations, geom_smooth defaults to method = "gam"
geom_smooth(method = "lm",
colour = "skyblue",
se = FALSE) +
facet_wrap(~triage_acuity_code); p12.ed.and.ad.los
# save output
ggsave(here("results",
"output from src",
"02_ed-data.jpeg")) # by default ggsave saves the last plot that was called
Note that for CTAS 2-4, the overall pattern is that as ED LOS increases, mean acute LOS also increases. This is not true for CTAS 1 patients. Also interesting is the “wall” at 600 minutes for CTAS 3 and CTAS 4 patients: mean acute LOS actually seems to decrease slightly for patients who just cross the 600-minute mark, as if these were a completely separate patient population.
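One quick way to probe that “wall” (a sketch; the 400-800 minute window around the cutoff is an arbitrary choice):
df3.ed.data %>%
  filter(triage_acuity_code %in% c(3, 4),
         ed_los_minutes >= 400,
         ed_los_minutes <= 800) %>%
  mutate(over.600 = ed_los_minutes > 600) %>%
  group_by(triage_acuity_code, over.600) %>%
  summarise(mean.ad.los = mean(ad_los_days, na.rm = TRUE),
            n.patients = n())
Next, let’s bring age into the picture: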
p13.ed.and.ad.los.age.groups <- df3.ed.data %>%
# remove CTAS levels with too few observations:
filter(!triage_acuity_code %in% c("5", NA)) %>%
mutate(age.group = case_when(
age < 20 ~ "under 20 yr",
age >=20 & age <60 ~ "20-60 yr",
age >= 60 ~ "over 60 yr") %>%
factor(levels = c("under 20 yr",
"20-60 yr",
"over 60 yr"))) %>%
ggplot(aes(x = ed_los_minutes,
y = ad_los_days)) +
geom_point(alpha = 0.1) +
geom_smooth(aes(group = age.group,
colour = age.group),
se = TRUE,
alpha = 0.2) +
# zoom in on the x-axis:
# coord_cartesian(xlim = c(0, 600)) +
scale_y_log10(breaks = c(1,5,10,30,100)) +
scale_x_log10(breaks = c(30, 60, 300, 600, 2000)) +
facet_wrap(~triage_acuity_code); p13.ed.and.ad.los.age.groups
# save output
ggsave(here("results",
"output from src",
"03_ed-date.jpeg")) # by default ggsave saves the last plot that was called
p14.los.vs.ctas <- df3.ed.data %>%
mutate(age.group = case_when(
age < 20 ~ 1,
age >=20 & age <60 ~ 2,
age >= 60 ~ 3) %>%
as.factor) %>%
# remove NA values of age
filter(!is.na(age)) %>%
mutate(triage_acuity_code = as.factor(triage_acuity_code)) %>%
ggplot(aes(x = triage_acuity_code,
y = ad_los_days)) +
geom_boxplot() +
stat_summary(fun = mean, # "fun.y" was deprecated in ggplot2 3.3.0
geom = "point",
colour = "firebrick") +
# take logarithm of y-axis:
scale_y_log10(breaks = c(1, 2, 3, 4, 5, 10, 30, 60)) + # by default, labels are in original units, not logged units
# zoom in on y-axis:
# coord_cartesian(ylim = c(0.1, 60)) +
facet_wrap(~age.group); p14.los.vs.ctas
Note that this kind of plot helps us to determine what counts as an outlier, based on past data. E.g. for CTAS 5 patients under 20 years, acute LOS of over 5 days is an outlier, but for patients 20-60 years at CTAS 5, LOS would have to be over 30 days before it counted as an outlier.
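The boxplot’s default outlier rule is “more than 1.5 IQRs beyond the nearer quartile”. We can compute the upper fences directly. One caveat: because p14 uses scale_y_log10(), ggplot computes its boxplot statistics on the log scale, so the fences drawn in the plot will differ somewhat from this raw-scale sketch.
df3.ed.data %>%
  filter(!is.na(age),
         !is.na(triage_acuity_code)) %>%
  mutate(age.group = case_when(age < 20 ~ 1,
                               age >= 20 & age < 60 ~ 2,
                               age >= 60 ~ 3)) %>%
  group_by(age.group, triage_acuity_code) %>%
  # upper fence = Q3 + 1.5 * IQR, the cutoff for plotted outliers
  summarise(upper.fence = quantile(ad_los_days, 0.75, na.rm = TRUE) +
              1.5 * IQR(ad_los_days, na.rm = TRUE))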
Well, that’s the end of Day 1. Here’s the link to the material we’ll cover next time: https://rpubs.com/nayefahmad/day-2_time-series