Often you are in the position of having ugly data. If you’re lucky, it’s not your fault. In most cases, R functions prefer data in tall, aka melted or gathered format. What this means is that each row is a unique observation. Let’s take a look at some messy data, and how to manipulate it using functions in the Tidyverse (magritrr,dplyr, etc.).
if (!require(tidyverse)) { install.packages("tidyverse"); library(tidyverse) } # If "tidyverse" isn't installed, install it; and load it.
messy <- data.frame(
fish = c("Fish1", "Fish2", "Fish3"),
light = c(3.2, 1.8, 2.3),
dark = c(1.1, 0.6, 1.3)
)
messy
## fish light dark
## 1 Fish1 3.2 1.1
## 2 Fish2 1.8 0.6
## 3 Fish3 2.3 1.3
messy is a made-up dataset describing the swimming speeds of three larval fish in “light” and “dark” treatments. Right now messy is not in tall format. Each row contains multiple observations. This is an easy way to record data, but it’s not always convenient for analyzing or plotting it. What to do?
We will use gather to make this wide (aka cast or spread) dataset, tall. ?gather.
messy %>%
gather(treatment, speed, light:dark) # The KEY in this case is "treatment" (light or dark), and the VALUE is "speed" (the values of the table cells)
## fish treatment speed
## 1 Fish1 light 3.2
## 2 Fish2 light 1.8
## 3 Fish3 light 2.3
## 4 Fish1 dark 1.1
## 5 Fish2 dark 0.6
## 6 Fish3 dark 1.3
Voila! Importantly, your original dataset is unchanged.
But why stop there?
messy %>%
gather(treatment, speed, light:dark) %>%
ggplot(aes(x = fish, y = speed, colour = treatment), data = .) +
geom_point() +
theme_bw()
Here’s another quick example. messy2 describes the swimming speed of three larval fish measured hourly for three hours, starting just after sunrise.
messy <- data.frame(
fish = c("a","b","c"),
time1 = runif(3,.5,1.5),
time2 = runif(3,1.0,2.0),
time3 = runif(3,1.5,2.5)
)
Challenge 1: Use pipe functions, gather(), and ggplot() to show the change in swimming speed for the three fish over the sampling period.
Here we will create a complicated dataset. The dataset describes some environmental variables of three sites with 10 permanent transects at each site. The sites were visited twice - once before environmental mitigation, and once after. We will use tidyr (dplyr) functions, pipe functions (magrittr), and ggplot2 to manipulate, investigate, and plot the data.
Varibale names: site, transect, plant density (at time 1 and time 2), plant height (transect mean, at time 1 and time 2), elevation of the transect, slope of the transect, distance of the transect from human activity.
This is a completely imaginary dataset
if (!require(gridExtra)) { install.packages("gridExtra",repos = "http://cran.us.r-project.org"); library(gridExtra) } # If "gridExtra" isn't installed, install it; and load it.
# Create dataset
set.seed(1) # Allows reproducable results
messy3 <- data.frame( # Create dataframe
site = c(rep(1, times = 10),rep(2, times = 10),rep(3, times = 10)), # Establish the site ID for 3 sites
transect = rep(1:10, times = 3), # Establish the transect ID for 10 transects/site
density.t1 = runif(30,5,10), # Randomly selected plant density at T1 from a uniform distribution
height.t1 = runif(30,.5,2), # Randomly selected mean plant height at T1 at each transect from a uniform distribution
elevation = runif(30,15,25), # Randomly selected elevation of each transect from a uniform distribution
slope = runif(30,5,25) # Randomly selected slope of each transect from a uniform distribution
)
messy3 %>% # Use the pipe function and mutate() to add new columns to the fake dataset
mutate(density.t2 = density.t1 * runif(30,0.9,2), height.t2 = height.t1+runif(30,.2,.7),distance = density.t1*100 *runif(30,.9,2) + runif(30,-10,10)) -> messy3 # Adding density measurements for T2 - in this case, we are creating a positive change in plant density over time. Same for plant height. We are also creating an artificial positive relationship between distance from human activity and plant density.
What are some issues with this data set? For some variables, there are multiple observations per row (density and height). Other variables are only recorded once. We need to get the data in a format suitable for plotting.
STOP HERE. LET’S TALK STRATEGY.
…………………………………………………………………….
……………………………………………………………………
…………………………………………………………………..
………………………………………………………………….
…………………………………………………………………
………………………………………………………………..
……………………………………………………………….
………………………………………………………………
……………………………………………………………..
…………………………………………………………….
……………………………………………………………
…………………………………………………………..
………………………………………………………….
…………………………………………………………
………………………………………………………..
……………………………………………………….
………………………………………………………
……………………………………………………..
…………………………………………………….
……………………………………………………
…………………………………………………..
………………………………………………….
…………………………………………………
………………………………………………..
……………………………………………….
………………………………………………
……………………………………………..
…………………………………………….
……………………………………………
…………………………………………..
………………………………………….
…………………………………………
………………………………………..
………………………………………
……………………………………..
…………………………………….
……………………………………
…………………………………..
………………………………….
…………………………………
………………………………..
……………………………….
………………………………
……………………………..
…………………………….
……………………………
…………………………..
………………………….
…………………………
Below is just one way to tackle this problem. Can you think of others? Better ways?
First, let’s gather the entire dataset. Is that useful?
messy3 %>%
gather(key,value,-site,-transect)
Maybe not. What if we gather density and height into a single column?
messy3 %>%
gather(MeasureTime,value, c(density.t1,density.t2, height.t1,height.t2),-site,-transect) # Gather the density and height measurements into a single column
This feels like we’re getting somewhere. We can now use separate() to separate that column by T1 and T2…
messy3 %>%
gather(MeasureTime,value, c(density.t1,density.t2, height.t1,height.t2),-site,-transect) %>% # Gather the density and height measurements into a single column
separate(MeasureTime, into = c("Measure", "time"), sep = "\\.") # Separate that column by T1 and T2
What else needs to happen? We need to spread() the stacked density/height column back out.
messy3 %>%
gather(MeasureTime,value, c(density.t1,density.t2, height.t1,height.t2),-site,-transect) %>% # Gather the density and height measurements into a single column
separate(MeasureTime, into = c("Measure", "time"), sep = "\\.") %>% # Separate that column by T1 and T2
spread(Measure, value) -> cleaner # Spread the stacked density/height column into two columns, and save this new, cleaner dataset
Alright, our dataset is looking a bit cleaner. Let’s group by site and time, and use summarise() to look at how density and height of plants change in our study.
cleaner %>%
group_by(site,time) %>%
summarise(meanden = mean(density), sdden = sd(density), meanhght = mean(height), sdhght = sd(height)) # Look at some summary statistics by site and sampling time.
## Source: local data frame [6 x 6]
## Groups: site [?]
##
## site time meanden sdden meanhght sdhght
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 t1 7.757569 1.578277 1.294162 0.3653170
## 2 1 t2 10.909074 1.909361 1.680681 0.4093190
## 3 2 t1 7.794102 1.352358 1.407299 0.3539017
## 4 2 t2 11.714910 3.335733 1.865323 0.4209216
## 5 3 t1 7.091615 1.535027 1.114341 0.3646625
## 6 3 t2 10.150809 2.930553 1.658753 0.4084966
Time to do some plotting. See if you can pick these apart and make them unique.
# A plot of change in plant density over the two sampling periods. Includes plotted data points to show the relationship of plant density and distance from human activity.
p1 <- ggplot(aes(x = site, y = density, group = interaction(site, time), fill = time), data = cleaner) +
geom_boxplot(position = "dodge") +
geom_point(aes(colour = distance), position = position_dodge(.75)) +
theme_bw() +
xlab("Site") +
ylab(expression("Plant density (individuals *" ~m^-2 ~ ")")) +
scale_y_continuous(limits = c(0,20)) +
theme(aspect.ratio = 1) +
scale_fill_discrete(name = "Sampling time", labels = c("Time 1", "Time 2"))
# A plot of change in plant height over the two sampling periods.
p2 <- ggplot(aes(x = site, y = height, group = interaction(site, time), fill = time), data = cleaner) +
geom_boxplot(position = "dodge") +
theme_bw() +
xlab("Site") +
ylab("Plant height (m)") +
scale_y_continuous(limits=c(0,3)) +
theme(aspect.ratio = 1) +
scale_fill_discrete(name = "Sampling time", labels = c("Time 1", "Time 2"))
grid.arrange(p1,p2,ncol = 1) # Stack the two plots
# A plot further investigating the relationship of distance and plant density
ggplot(aes(x = distance, y = density, shape = as.factor(site), colour = time), data = cleaner) +
geom_point() +
theme_bw() +
theme(aspect.ratio = 1, legend.key = element_blank()) +
ylab(expression("Plant density (indivudals *" ~m^-2 ~ ")")) +
xlab("Distance from bogey monster (m)") +
scale_shape_discrete(name = "Site", labels = c("Site 1", "Site 2", "Site 3")) +
scale_color_discrete(name = "Sampling time", labels = c("Time 1", "Time 2"))
Challenge: Add regression lines for sampling time 1 and 2.
That’s it for this session - check out the figure creation workshop session for more.