Week 1 exercises for BIOS 26318

Author

Dmitry Kondrashov and Stefano Allesina

Palmer penguin data

Let us use the library palmerpenguins to produce visualizations of the measurements of the three different species on three different islands.

Let us use the library palmerpenguins to produce visualizations of the measurements of the three different species on three different islands. The data are in the tibble penguins that contains 8 variables, as you can see below:

glimpse(penguins)

Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Make a barplot to show the counts of the penguins of different species.

ggplot(data = penguins) + aes(x = species) + geom_bar()

Make a barplot of different species, with each bar divided by sex (use the option fill).

ggplot(data = penguins) + aes(x = species, fill = sex) + geom_bar()

Make boxplots of the distributions of body masses for different years; calculate the means and variances of body mass for each year and print out (use either base R subsetting or tidyverse if you know it.) (PD - In this order, I printed 2007 mean, 2007 var, 2008 mean, 2008 var, 2009 mean, 2009 var)

ggplot(data = penguins) + aes(x = as.character(year), y = body_mass_g) + geom_boxplot()

Warning: Removed 2 rows containing non-finite values (`stat_boxplot()`).

penguins_2007 <- subset(penguins, penguins$year == 2007)
penguins_2008 <- subset(penguins, penguins$year == 2008)
penguins_2009 <- subset(penguins, penguins$year == 2009)

y2007_mean <- mean(penguins_2007$body_mass_g, na.rm = TRUE)
y2007_var <- var(penguins_2007$body_mass_g, na.rm = TRUE)

y2008_mean <- mean(penguins_2008$body_mass_g, na.rm = TRUE)
y2008_var <- var(penguins_2008$body_mass_g, na.rm = TRUE)

y2009_mean <- mean(penguins_2009$body_mass_g, na.rm = TRUE)
y2009_var <- var(penguins_2009$body_mass_g, na.rm = TRUE)

y2007_mean

[1] 4124.541

y2007_var

[1] 628090.1

y2008_mean

[1] 4266.667

y2008_var

[1] 623259.6

y2009_mean

[1] 4210.294

y2009_var

[1] 677176

Make boxplots of the distributions of body masses for different years as a grid facetted by species and sex.

ggplot(data = penguins) + aes(x = as.character(year), y = body_mass_g) + geom_boxplot() + facet_grid(species ~ sex)

Warning: Removed 2 rows containing non-finite values (`stat_boxplot()`).

Make scatterplots of the bill length as a function of flipper length colored by sex and faceted by species.

ggplot(data = penguins) + aes(x = flipper_length_mm, y = bill_length_mm, color = sex) + geom_point() + facet_wrap(~species)

Warning: Removed 2 rows containing missing values (`geom_point()`).

Make scatterplots of the bill depth as a function of body mass colored by species and faceted by sex.

ggplot(data = penguins) + aes(x = body_mass_g, y = bill_depth_mm, color = species) + geom_point() + facet_wrap(~sex)

Warning: Removed 2 rows containing missing values (`geom_point()`).

Analysis of emergency trauma calls in Chicago

The data set below contains records of all emergency medical services runs within the city of Chicago related to traumatic injury for the years 2018 and 2019. The data set was used in a recent publication to assess the impact of opening of the Trauma Center at UChicago Hospital on emergency response times.

The data set contains 23 variables and over forty thousand observations, corresponding to each EMS call. The variables include demographic characteristics of the patient, location of the call, etc. - these are the explanatory variables for the data set, while the response variables are the amount of time it took to respond to each call, recorded in three variables: disp_scene (time from dispatch to arrival on the scene), scene_arrive_dep (time from arrival on the scence to departure), and depart_dest (time from departure to arrival at the destination/hospital).

You can download the meta-data as an Excel file that describes all the variables here.

Read in the data and show all the variables:

trauma <- read_csv("https://raw.githubusercontent.com/StefanoAllesina/BIOS_26318/master/data/Chicago_trauma_dataset.csv")

New names:
Rows: 40644 Columns: 23
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(10): causeofinjury, gender, incidentpatientdisposition, providersprimar... dbl
(12): ...1, age, scene_zip, destination.zipcode, chi, tod, month, year, ... lgl
(1): transported
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...1`

glimpse(trauma)

Rows: 40,644
Columns: 23
$ ...1                       <dbl> 280382, 280407, 280412, 280425, 280467, 280…
$ age                        <dbl> 43, 18, 48, 57, 32, 4, 63, 55, 48, 56, 35, …
$ causeofinjury              <chr> "Not Known", "Not Known", "Falls (E88X.0)",…
$ gender                     <chr> "Male", "Male", "Female", "Male", "Male", "…
$ scene_zip                  <dbl> 60629, 60629, 60656, 60628, 60643, 60628, 6…
$ incidentpatientdisposition <chr> "Treated, Transported by EMS", "Treated, Tr…
$ providersprimaryimpression <chr> "Not Known", "312.90- Behavioral / psychiat…
$ primarysymptom             <chr> "Pain", "None", "None", "Pain", "None", "No…
$ destination.id             <chr> "6028", "6028", "6050", "6052", "0174", "60…
$ destination.zipcode        <dbl> 60629, 60629, 60631, 60628, 60406, 60628, 6…
$ raceth                     <chr> "Hispanic White", "Hispanic White", "Non-Hi…
$ transported                <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, T…
$ coi.class                  <chr> "Unspecified", "Unspecified", "Fall", "Fall…
$ chi                        <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ tod                        <dbl> 3, 4, 0, 0, 0, 1, 0, 1, 0, 3, 4, 3, 2, 2, 0…
$ day_of_week                <chr> "Tuesday", "Tuesday", "Tuesday", "Tuesday",…
$ month                      <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5…
$ year                       <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2…
$ median_income              <dbl> 42019, 42019, 61599, 40261, 64477, 40261, 2…
$ income_quintile            <chr> "second", "second", "third", "second", "thi…
$ disp_scene                 <dbl> 540, 120, 180, 660, 660, 360, 360, 180, 360…
$ scene_arrive_dep           <dbl> 180, 900, 900, 360, 1080, 600, 840, 180, 24…
$ depart_dest                <dbl> 120, 180, 600, 240, 360, 0, 540, 360, 180, …

Comparing distributions

Are the distributions of the response variables different for different gender of caller? Make boxplots for any one of the response variables specified above, with gender on the x axis. Calculate and print out the mean and variance of the chosen response time for different gender (response var: disp_scene; order of values: mean male, var male, mean female, var female, mean unknown, var unknown).

ggplot(data = trauma) + aes(x = gender, y = disp_scene) + geom_boxplot() + ylim(c(0,1000))

Warning: Removed 1334 rows containing non-finite values (`stat_boxplot()`).

trauma_m <- subset(trauma, trauma$gender == 'Male')
trauma_f <- subset(trauma, trauma$gender == 'Female')
trauma_u <- subset(trauma, trauma$gender == 'Unknown')

mean_m <- mean(trauma_m$disp_scene, na.rm = TRUE)
var_m <- var(trauma_m$disp_scene, na.rm = TRUE)

mean_f <- mean(trauma_f$disp_scene, na.rm = TRUE)
var_f <- var(trauma_f$disp_scene, na.rm = TRUE)

mean_u <- mean(trauma_u$disp_scene, na.rm = TRUE)
var_u <- var(trauma_u$disp_scene, na.rm = TRUE)

mean_m

[1] 417.457

var_m

[1] 82174.56

mean_f

[1] 425.2821

var_f

[1] 68618.99

mean_u

[1] 574.1905

var_u

[1] 405590.9

Are the distributions of the response variables different for different race/ethnicity of caller? Make density plots for any one of the response variables specified above, overlaying them for different race/ethnicity (raceth) and use partially transparent density plots so they can be compared. Calculate and print out the mean and variance of the chosen response time for different race/ethnicity. It looks like there is not much difference from the plot or calculated values (order: black mean, black var, hispanic white mean, hispanic white var, non-hispanic white mean, non-hispanic white var, other mean, other var, unknown mean, unknown var.

ggplot(data = trauma) + aes(x = disp_scene, fill = raceth) + geom_density(alpha = 0.5) + xlim(c(0,1000))

Warning: Removed 1334 rows containing non-finite values (`stat_density()`).

trauma_b <- subset(trauma, trauma$raceth == 'Black')
trauma_h <- subset(trauma, trauma$raceth == 'Hispanic White')
trauma_w <- subset(trauma, trauma$raceth == 'Non-Hispanic White')
trauma_o <- subset(trauma, trauma$raceth == 'Other')
trauma_u_r <- subset(trauma, trauma$raceth == 'Unknown')

mean_b <- mean(trauma_b$disp_scene, na.rm = TRUE)
var_b <- var(trauma_b$disp_scene, na.rm = TRUE)

mean_h <- mean(trauma_h$disp_scene, na.rm = TRUE)
var_h <- var(trauma_h$disp_scene, na.rm = TRUE)

mean_w <- mean(trauma_w$disp_scene, na.rm = TRUE)
var_w <- var(trauma_w$disp_scene, na.rm = TRUE)

mean_o <- mean(trauma_o$disp_scene, na.rm = TRUE)
var_o <- var(trauma_o$disp_scene, na.rm = TRUE)

mean_u_r <- mean(trauma_u_r$disp_scene, na.rm = TRUE)
var_u_r <- var(trauma_u_r$disp_scene, na.rm = TRUE)

mean_b

[1] 413.8855

var_b

[1] 73128.83

mean_h

[1] 429.8889

var_h

[1] 83564.99

mean_w

[1] 438.3689

var_w

[1] 90459.42

mean_o

[1] 420.1061

var_o

[1] 60927.76

mean_u_r

[1] 409.3336

var_u_r

[1] 59101.5

Relationships between numeric variables

Are the response times different for patients based on age? Make a scatterplot with the variable age on the x axis and one of the response times on the y axis. Try using a logarithmic scale on one or both of the axes to see if it helps identify a visual relationship between the variables. It looks like there is no trend using, even when using log transformation on either one or both of the variables. (Graphs in order are: no log transform, x log transform, y log transform, x and y log transform)

ggplot(data = trauma) + aes(x = age, y = disp_scene) + geom_point()

Warning: Removed 257 rows containing missing values (`geom_point()`).

ggplot(data = trauma) + aes(x = age, y = disp_scene) + geom_point() + scale_x_log10()

Warning: Transformation introduced infinite values in continuous x-axis
Removed 257 rows containing missing values (`geom_point()`).

ggplot(data = trauma) + aes(x = age, y = disp_scene) + geom_point() + scale_y_log10()

Warning: Transformation introduced infinite values in continuous y-axis
Removed 257 rows containing missing values (`geom_point()`).

ggplot(data = trauma) + aes(x = age, y = disp_scene) + geom_point() + scale_x_log10() + scale_y_log10()

Warning: Transformation introduced infinite values in continuous x-axis

Warning: Transformation introduced infinite values in continuous y-axis

Warning: Removed 257 rows containing missing values (`geom_point()`).

Is there a relationship between the arrival time disp_scene and the time from the scene to the hospital depart_dest? Make a scatterplot with the first variable on the x axis and the second on the y axis. Try using a logarithmic scale on one or both of the axes to see if it helps identify a visual relationship between the variables. When both variables are log transformed, it seems that there is a correlation between the two. This makes logical sense, the longer it takes to arrive is likely largely due to distance, so it should take longer to return as well.

ggplot(data = trauma) + aes(x = disp_scene, y = depart_dest) + geom_point() + scale_x_log10() + scale_y_log10()

Warning: Transformation introduced infinite values in continuous x-axis

Warning: Transformation introduced infinite values in continuous y-axis

Warning: Removed 308 rows containing missing values (`geom_point()`).

Breakdown the data by gender and race/ethnicity

Make a barplot of the counts of the number of calls by race.

ggplot(data = trauma) + aes(x = raceth) + geom_bar()

Make a barplot of the counts of the number of calls by race, each one divided by gender (using fill).

ggplot(data = trauma) + aes(x = raceth, fill = gender) + geom_bar()

Make a barplot of the counts of the number of calls by race, each one divided by gender (using fill), and faceted by month.

ggplot(data = trauma) + aes(x = raceth, fill = gender) + geom_bar() + facet_wrap(~month)