SDS 164 Supplementary Code

Author

Carlos Chavez

Introduction

Hello!

This document is where I will collect supplementary code that is not provided in the in-class activities or in the textbook. I will do my best to organize it by topic and to make the examples and functions as clear as possible.

The goal is to provide useful, relevant examples of data tidying tasks that you may encounter in your own work.

Take, for example, this data set I found on brown bears in Alaska.

Variable Description
BearPop Identifies the population to which the bear was assigned based on capture location (GAAR = Gates of the Arctic, LACL = Lake Clark, KOD = Kodiak, KATM = Katmai)
Sex Sex of individual bear (M = male or F = female).
BearNo Individual number assigned to bear. The first number is the last digit of calendar year (e.g., 4 = 2014) and the second and third numbers are the sequential number of the bears as they were captured (e.g., 01 = the first bear captured that calendar year).
EstimatedAge Age in years of bear estimated from tooth wear. 0.26 +/- 4.69 (1 SD) was determined by comparing paired estimates from tooth wear to age measured from cementum lab analyses.
NoseVent Length of the body in cm from the tip of the nose to the base of the tail, following the body contour with the bear in sternal recumbency. Measurement made with tape measure.
HeadTotal Sum of the HeadWidth and HeadLength in cm. Measurement made with calipers and tape measure.
LeanMass Lean mass of individual bear in kg. This is a calculation from body mass measurement using an electronic load cell and lean mass estimation from bioelectrical impedance analyses.

Dataset:

brownBear_Data.csv

Table with bear age, population, sex, lean body mass, body length, and skull size for individual bears that were used to assess growth rates. Presented in a Comma Separated Value (CSV) formatted table.

Part A - ggplot

When creating plots, sometimes we may want to display multiple types of plots in a grid without faceting. One way of accomplishing this is through the patchwork package.

library(readr) # lets us use the read_csv function
library(tidyverse)
library(patchwork) # provides wrap_plots() for combining plots in a grid
library(ggridges) # allows us to use the geom_density_ridges function

brownbears <- read_csv("~/Sds_164_F25/Class/Code/supplementary code/brownBear_Data.csv")

The general workflow for using wrap_plots is to assign the ggplot code for each plot to a named object, one object per plot we want to place in the grid. Then we combine the plots, either by passing them to wrap_plots or by joining Plot 1 and Plot 2 with the + symbol.

plot1 <- brownbears |> 
  ggplot(aes(x = BearPop,
             fill = Sex)) +
  geom_bar(aes(color = Sex),
           position = position_dodge(preserve = "single")) +
  theme_classic() +
  scale_fill_brewer(palette = 2) +
  scale_color_brewer(palette = 2)

plot2 <- brownbears |> 
  ggplot(aes(x = LeanMass,
             y = BearPop,
             fill = BearPop)) +
  geom_density_ridges(aes(color = BearPop), alpha = .5) +
  theme_classic() + 
  scale_fill_brewer(palette = 2) +
  scale_color_brewer(palette = 2)

wrap_plots(plot1, plot2) # equivalent to plot1 + plot2

More advanced functionality is available depending on what you are trying to achieve, such as combining legends, axes, etc. For instance, when both plots map the bear population to the same aesthetic, we can merge their shared legend by aligning the plot aesthetics and passing an additional argument to the wrap_plots function.
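As a minimal sketch of the legend-combining idea, the example below uses a small made-up data frame (toy, with hypothetical columns pop, sex, and age_group) rather than the bear data. Both plots map the same variable to fill, so patchwork can treat their legends as identical and merge them when guides = "collect" is passed to wrap_plots.

```r
library(ggplot2)
library(patchwork)

# Hypothetical toy data, used only for illustration
toy <- data.frame(
  pop       = rep(c("GAAR", "KOD"), each = 4),
  sex       = rep(c("F", "M"), times = 4),
  age_group = rep(c("Young", "Young", "Old", "Old"), times = 2)
)

# Two bar charts that share the same fill mapping (and so the same legend)
p_sex <- ggplot(toy, aes(x = sex, fill = pop)) +
  geom_bar(position = "dodge")

p_age <- ggplot(toy, aes(x = age_group, fill = pop)) +
  geom_bar(position = "dodge")

# guides = "collect" gathers the legends; identical ones are shown once
combined <- wrap_plots(p_sex, p_age, guides = "collect")
combined
```

The same layout can also be written as p_sex + p_age + plot_layout(guides = "collect").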

Part B - Data Transformation

For the most part, we have focused on transforming or mutating numerical columns. However, we will often want to recode categorical variables or convert numerical variables into categorical ones.

To do this, we can use the case_when function. We may, for example, ask, “How can I change certain values in one of my columns to corresponding values in a new column?” case_when lets us write a logical condition on the left of a ~ and, on the right, the value to assign when that condition is true. The name is apt: in the case when a row matches the condition, assign it the specified value. Conditions are checked in order, and the first one that matches is used.

Let’s take the BearPop variable in our brown bear data set. We can mutate the variable values so that they are easier to understand.

bears <- brownbears |> 
  mutate(BearPop = case_when(BearPop == "GAAR" ~ "Gates of the Arctic",
                             BearPop == "KATM" ~ "Katmai",
                             BearPop == "KOD" ~ "Kodiak",
                             BearPop == "LACL" ~ "Lake Clark"),
         Sex = case_when(Sex == "F" ~ "Female",
                         Sex == "M" ~ "Male"))

head(bears)
# A tibble: 6 × 7
  BearPop             Sex    BearNo EstimatedAge NoseVent HeadTotal LeanMass
  <chr>               <chr>   <dbl>        <dbl>    <dbl>     <dbl>    <dbl>
1 Gates of the Arctic Female  14001          5.5      156        46     58.8
2 Gates of the Arctic Female  14004         11         NA        50     NA  
3 Gates of the Arctic Female  14005         10        173        54     83.4
4 Gates of the Arctic Female  14008         10        160        51     70.6
5 Gates of the Arctic Female  14011          7.5      148        50     56.7
6 Gates of the Arctic Female  14016          8        168        50     79.2

In the code above, each condition maps an abbreviation in BearPop to its full name. So, when a row in BearPop is equal to the value “GAAR”, it is assigned the new value “Gates of the Arctic.”
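One detail worth knowing is what happens to values that match none of the conditions: they become NA. The sketch below illustrates this on a toy vector; the code "XYZ" is made up and does not appear in the bear data, and the .default argument assumes dplyr 1.1.0 or later.

```r
library(dplyr)

# Toy vector; "XYZ" is a hypothetical code used only for illustration
pop_codes <- c("GAAR", "KOD", "XYZ")

# Values that match no condition fall through and become NA
recoded <- case_when(pop_codes == "GAAR" ~ "Gates of the Arctic",
                     pop_codes == "KOD" ~ "Kodiak")

# .default supplies a fallback; here we keep the original code
recoded_keep <- case_when(pop_codes == "GAAR" ~ "Gates of the Arctic",
                          pop_codes == "KOD" ~ "Kodiak",
                          .default = pop_codes)

recoded
recoded_keep
```

Checking for unexpected NA values after a recode is a quick way to catch a category you forgot to list.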

Now let’s try to make a numeric variable into a dichotomous variable (a variable with only two possible values or categories). Below, any row whose estimated age is above the mean estimated age receives the value “Old” in the new column, Age_Old. Similarly, any row below the mean estimated age receives the value “Young.” A row whose age is exactly equal to the mean matches neither condition and is assigned NA.

bears <- bears |> 
  mutate(Age_Old = case_when(EstimatedAge > mean(EstimatedAge) ~ "Old",
                             EstimatedAge < mean(EstimatedAge) ~ "Young"))
head(bears)
# A tibble: 6 × 8
  BearPop          Sex   BearNo EstimatedAge NoseVent HeadTotal LeanMass Age_Old
  <chr>            <chr>  <dbl>        <dbl>    <dbl>     <dbl>    <dbl> <chr>  
1 Gates of the Ar… Fema…  14001          5.5      156        46     58.8 Young  
2 Gates of the Ar… Fema…  14004         11         NA        50     NA   Young  
3 Gates of the Ar… Fema…  14005         10        173        54     83.4 Young  
4 Gates of the Ar… Fema…  14008         10        160        51     70.6 Young  
5 Gates of the Ar… Fema…  14011          7.5      148        50     56.7 Young  
6 Gates of the Ar… Fema…  14016          8        168        50     79.2 Young  
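The same pattern needs a little care when the numeric column contains missing values, because mean() then returns NA and no condition can match. A minimal sketch on a made-up vector of ages (not the bear data), assuming we want ties at the mean to count as “Young”:

```r
library(dplyr)

# Hypothetical ages, including one missing value
ages <- c(2, 4, 6, NA)

mean(ages)                          # NA, because of the missing value
cutoff <- mean(ages, na.rm = TRUE)  # mean of the non-missing ages

# <= sends ages equal to the cutoff to "Young"; missing ages stay NA
age_group <- case_when(ages > cutoff ~ "Old",
                       ages <= cutoff ~ "Young")
age_group
```

Whether ties belong with “Old” or “Young” is a judgment call; the point is to decide explicitly rather than let them fall through.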

Now I can run analyses, create plots, or continue making more variables based on this newly mutated variable!

plot3 <- bears |> 
  ggplot(aes(x = Sex, fill = BearPop, color = BearPop)) +
  geom_bar(position = position_dodge(preserve = "single")) +
  theme_classic() +
  scale_fill_brewer(palette = 2) +
  scale_color_brewer(palette = 2) +
  labs(y = "Count",
       x = "Sex",
       color = "Population",
       fill = "Population")

plot4 <- bears |> 
  ggplot(aes(x = Age_Old, fill = BearPop, color = BearPop)) +
  geom_bar(position = position_dodge(preserve = "single")) +
  theme_classic() +
  scale_fill_brewer(palette = 2) +
  scale_color_brewer(palette = 2) +
  labs(y = "Count",
       x = "Age",
       color = "Population",
       fill = "Population")


wrap_plots(plot3, plot4, guides = "collect", axis_titles = "collect_y")

Modifying Factors

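One common source of confusion is the difference between fct_reorder and fct_reorder2 when modifying the order of factor levels. fct_reorder orders the levels by a summary (the median by default) of one other variable, while fct_reorder2 orders them by a summary of two other variables (by default, the value of the second variable at the largest value of the first).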

gss_cat <- forcats::gss_cat # assign gss_cat data to object

# Create a new data set: remove rows with a missing age, count the rows for
# each combination of age and marital status, group the result by age, and
# then compute the proportion of each marital status within each age.
by_age <- gss_cat |>
  filter(!is.na(age)) |>
  count(age, marital) |>
  group_by(age) |>
  mutate(
    prop = n / sum(n)
  )

# Here, I remove the previous grouping and create two reordered versions of
# marital: one ordered by age (fct_reorder) and one ordered by age and
# proportion (fct_reorder2).
by_age2 <- by_age |> 
  ungroup() |> 
  mutate(marital_age = fct_reorder(marital, age),
         marital2 = fct_reorder2(marital, age, prop))

# Lastly, I will create a data frame containing the levels of each marital status column. 
marital_levels <- data.frame(marital = levels(by_age2$marital),
                             marital_age = levels(by_age2$marital_age),
                             marital_age_prop = levels(by_age2$marital2))
# Now I can see more clearly what the levels are under each specification.
marital_levels
        marital   marital_age marital_age_prop
1     No answer     No answer          Widowed
2 Never married     Separated          Married
3     Separated Never married         Divorced
4      Divorced      Divorced    Never married
5       Widowed       Married        No answer
6       Married       Widowed        Separated
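To make the two orderings easier to trace by hand, here is a minimal sketch on a made-up data frame (toy, with hypothetical columns grp, x, and y). It relies on the forcats defaults: fct_reorder sorts levels by the median of y in ascending order, and fct_reorder2 sorts them in descending order by the y value found at the largest x.

```r
library(forcats)

# Hypothetical toy data: three groups, each observed at x = 1 and x = 2
toy <- data.frame(
  grp = factor(c("a", "a", "b", "b", "c", "c")),
  x   = c(1, 2, 1, 2, 1, 2),
  y   = c(6, 1, 2, 3, 9, 2)
)

# Median of y per group: a = 3.5, b = 2.5, c = 5.5 (ascending order)
by_median <- fct_reorder(toy$grp, toy$y)
levels(by_median)  # "b" "a" "c"

# y at the largest x per group: a = 1, b = 3, c = 2 (descending order)
by_last <- fct_reorder2(toy$grp, toy$x, toy$y)
levels(by_last)    # "b" "c" "a"
```

The second ordering is mainly useful for line plots, where it makes the legend order match the order of the lines at the right edge of the plot.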

If we create summary data sets using the count function with each of the modified factor variables, we can see more clearly how changing the factor order changes the way the data are arranged.

# Original level order
df1 <- by_age2 |> 
  group_by(age) |> 
  filter(age %in% c(18, 31, 40, 77)) |> 
  count(marital, prop)

# Levels reordered by age (fct_reorder)
df2 <- by_age2 |> 
  group_by(age) |> 
  filter(age %in% c(18, 31, 40, 77)) |> 
  count(marital_age, prop)

# Levels reordered by age and proportion (fct_reorder2)
df3 <- by_age2 |> 
  group_by(age) |> 
  filter(age %in% c(18, 31, 40, 77)) |> 
  count(marital2, prop)
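The reason the factor order matters for these summaries is that rows sorted by a factor follow its level order, not alphabetical order of the labels. A minimal sketch with a made-up table of counts (toy, with hypothetical columns status and n):

```r
library(dplyr)
library(forcats)

# Hypothetical toy counts; default factor levels are alphabetical
toy <- tibble(status = factor(c("Married", "Divorced", "Widowed")),
              n = c(3, 1, 2))

# Sorting by the original factor follows the alphabetical levels
by_default <- toy |> 
  arrange(status)

# After fct_reorder, sorting follows the levels ordered by n
by_n <- toy |> 
  mutate(status = fct_reorder(status, n)) |> 
  arrange(status)

by_default
by_n
```

Printing df1, df2, and df3 side by side should show the same kind of change in row order within each age.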