More Data Manipulations

Sometimes, data is ugly.

Often you are in the position of having ugly data. If you’re lucky, it’s not your fault. In most cases, R functions prefer data in tall, aka melted or gathered format. What this means is that each row is a unique observation. Let’s take a look at some messy data, and how to manipulate it using functions in the Tidyverse (magritrr,dplyr, etc.).

if (!require(tidyverse)) { install.packages("tidyverse"); library(tidyverse) } # If "tidyverse" isn't installed, install it; and load it.
messy <- data.frame(
  fish = c("Fish1", "Fish2", "Fish3"),
  light = c(3.2, 1.8, 2.3),
  dark = c(1.1, 0.6, 1.3)
)
messy
##    fish light dark
## 1 Fish1   3.2  1.1
## 2 Fish2   1.8  0.6
## 3 Fish3   2.3  1.3


messy is a made-up dataset describing the swimming speeds of three larval fish in “light” and “dark” treatments. Right now messy is not in tall format. Each row contains multiple observations. This is an easy way to record data, but it’s not always convenient for analyzing or plotting it. What to do?


Gather

We will use gather to make this wide (aka cast or spread) dataset, tall. ?gather.

messy %>%
  gather(treatment, speed, light:dark) # The KEY in this case is "treatment" (light or dark), and the VALUE is "speed" (the values of the table cells)
##    fish treatment speed
## 1 Fish1     light   3.2
## 2 Fish2     light   1.8
## 3 Fish3     light   2.3
## 4 Fish1      dark   1.1
## 5 Fish2      dark   0.6
## 6 Fish3      dark   1.3


Voila! Importantly, your original dataset is unchanged.


But why stop there?

messy %>%
  gather(treatment, speed, light:dark) %>%
  ggplot(aes(x = fish, y = speed, colour = treatment), data = .) +
  geom_point() +
  theme_bw()


Here’s another quick example. messy2 describes the swimming speed of three larval fish measured hourly for three hours, starting just after sunrise.

messy <- data.frame(
  fish = c("a","b","c"),
  time1 = runif(3,.5,1.5),
  time2 = runif(3,1.0,2.0),
  time3 = runif(3,1.5,2.5)
)


Challenge 1: Use pipe functions, gather(), and ggplot() to show the change in swimming speed for the three fish over the sampling period.

A more advanced example

Here we will create a complicated dataset. The dataset describes some environmental variables of three sites with 10 permanent transects at each site. The sites were visited twice - once before environmental mitigation, and once after. We will use tidyr (dplyr) functions, pipe functions (magrittr), and ggplot2 to manipulate, investigate, and plot the data.


Varibale names: site, transect, plant density (at time 1 and time 2), plant height (transect mean, at time 1 and time 2), elevation of the transect, slope of the transect, distance of the transect from human activity.

This is a completely imaginary dataset

if (!require(gridExtra)) { install.packages("gridExtra",repos = "http://cran.us.r-project.org"); library(gridExtra) } # If "gridExtra" isn't installed, install it; and load it.
# Create dataset
set.seed(1) # Allows reproducable results

messy3 <- data.frame( # Create dataframe
  site = c(rep(1, times = 10),rep(2, times = 10),rep(3, times = 10)), # Establish the site ID for 3 sites
  transect = rep(1:10, times = 3), # Establish the transect ID for 10 transects/site
  density.t1 = runif(30,5,10), # Randomly selected plant density at T1 from a uniform distribution
  height.t1 = runif(30,.5,2), # Randomly selected mean plant height at T1 at each transect from a uniform distribution
  elevation = runif(30,15,25), # Randomly selected elevation of each transect from a uniform distribution
  slope = runif(30,5,25) # Randomly selected slope of each transect from a uniform distribution
)

messy3 %>% # Use the pipe function and mutate() to add new columns to the fake dataset
  mutate(density.t2 = density.t1 * runif(30,0.9,2), height.t2 = height.t1+runif(30,.2,.7),distance = density.t1*100 *runif(30,.9,2) + runif(30,-10,10)) -> messy3 # Adding density measurements for T2 - in this case, we are creating a positive change in plant density over time. Same for plant height. We are also creating an artificial positive relationship between distance from human activity and plant density.


What are some issues with this data set? For some variables, there are multiple observations per row (density and height). Other variables are only recorded once. We need to get the data in a format suitable for plotting.

STOP HERE. LET’S TALK STRATEGY.


…………………………………………………………………….
……………………………………………………………………
…………………………………………………………………..
………………………………………………………………….
…………………………………………………………………
………………………………………………………………..
……………………………………………………………….
………………………………………………………………
……………………………………………………………..
…………………………………………………………….
……………………………………………………………
…………………………………………………………..
………………………………………………………….
…………………………………………………………
………………………………………………………..
……………………………………………………….
………………………………………………………
……………………………………………………..
…………………………………………………….
……………………………………………………
…………………………………………………..
………………………………………………….
…………………………………………………
………………………………………………..
……………………………………………….
………………………………………………
……………………………………………..
…………………………………………….
……………………………………………
…………………………………………..
………………………………………….
…………………………………………
………………………………………..
………………………………………
……………………………………..
…………………………………….
……………………………………
…………………………………..
………………………………….
…………………………………
………………………………..
……………………………….
………………………………
……………………………..
…………………………….
……………………………
…………………………..
………………………….
…………………………

Below is just one way to tackle this problem. Can you think of others? Better ways?
First, let’s gather the entire dataset. Is that useful?

messy3 %>%
  gather(key,value,-site,-transect)


Maybe not. What if we gather density and height into a single column?

messy3 %>%
  gather(MeasureTime,value, c(density.t1,density.t2, height.t1,height.t2),-site,-transect) # Gather the density and height measurements into a single column


This feels like we’re getting somewhere. We can now use separate() to separate that column by T1 and T2…

messy3 %>%
  gather(MeasureTime,value, c(density.t1,density.t2, height.t1,height.t2),-site,-transect) %>% # Gather the density and height measurements into a single column
  separate(MeasureTime, into = c("Measure", "time"), sep = "\\.") # Separate that column by T1 and T2


What else needs to happen? We need to spread() the stacked density/height column back out.

messy3 %>%
  gather(MeasureTime,value, c(density.t1,density.t2, height.t1,height.t2),-site,-transect) %>% # Gather the density and height measurements into a single column
  separate(MeasureTime, into = c("Measure", "time"), sep = "\\.") %>% # Separate that column by T1 and T2  
  spread(Measure, value) -> cleaner # Spread the stacked density/height column into two columns, and save this new, cleaner dataset


Alright, our dataset is looking a bit cleaner. Let’s group by site and time, and use summarise() to look at how density and height of plants change in our study.

cleaner %>%
  group_by(site,time) %>%
  summarise(meanden = mean(density), sdden = sd(density), meanhght = mean(height), sdhght = sd(height)) # Look at some summary statistics by site and sampling time.
## Source: local data frame [6 x 6]
## Groups: site [?]
## 
##    site  time   meanden    sdden meanhght    sdhght
##   <dbl> <chr>     <dbl>    <dbl>    <dbl>     <dbl>
## 1     1    t1  7.757569 1.578277 1.294162 0.3653170
## 2     1    t2 10.909074 1.909361 1.680681 0.4093190
## 3     2    t1  7.794102 1.352358 1.407299 0.3539017
## 4     2    t2 11.714910 3.335733 1.865323 0.4209216
## 5     3    t1  7.091615 1.535027 1.114341 0.3646625
## 6     3    t2 10.150809 2.930553 1.658753 0.4084966


Time to do some plotting. See if you can pick these apart and make them unique.

# A plot of change in plant density over the two sampling periods. Includes plotted data points to show the relationship of plant density and distance from human activity.
p1 <- ggplot(aes(x = site, y = density, group = interaction(site, time), fill = time), data = cleaner) +
    geom_boxplot(position = "dodge") +
    geom_point(aes(colour = distance), position = position_dodge(.75)) +
    theme_bw() +
    xlab("Site") +
    ylab(expression("Plant density (individuals *" ~m^-2 ~ ")")) +
    scale_y_continuous(limits = c(0,20)) +
    theme(aspect.ratio = 1) +
    scale_fill_discrete(name = "Sampling time", labels = c("Time 1", "Time 2"))

# A plot of change in plant height over the two sampling periods.
p2 <- ggplot(aes(x = site, y = height, group = interaction(site, time), fill = time), data = cleaner) +
    geom_boxplot(position = "dodge") +
    theme_bw() +
    xlab("Site") +
    ylab("Plant height (m)") +
    scale_y_continuous(limits=c(0,3)) +
    theme(aspect.ratio = 1) +
    scale_fill_discrete(name = "Sampling time", labels = c("Time 1", "Time 2"))

grid.arrange(p1,p2,ncol = 1) # Stack the two plots

# A plot further investigating the relationship of distance and plant density
ggplot(aes(x = distance, y = density, shape = as.factor(site), colour = time), data = cleaner) +
  geom_point() +
  theme_bw() +
  theme(aspect.ratio = 1, legend.key = element_blank()) +
  ylab(expression("Plant density (indivudals *" ~m^-2 ~ ")")) +
  xlab("Distance from bogey monster (m)") +
  scale_shape_discrete(name = "Site", labels = c("Site 1", "Site 2", "Site 3")) +
  scale_color_discrete(name = "Sampling time", labels = c("Time 1", "Time 2"))


Challenge: Add regression lines for sampling time 1 and 2.


That’s it for this session - check out the figure creation workshop session for more.