IN THIS LAB YOU WILL LEARN:
1.) How to deal with dates and time in R (using lubridate)
2.) How to subset, filter, and trim data
3.) Practice with the pipe %>%
4.) Optimizing (and possibly over-engineering) plots
Need more help? Chat with instructors and also try googling it! Learning how to effectively search for help online is a great tool for learning and mastering R!
dat <- read.csv('https://raw.githubusercontent.com/jbaumann3/Intro-to-R-for-Ecology/main/final_bucket_mesocosm_apex_data.csv')
head(dat) #take a look at the data to see how it is formatted
To do this we just need to recognize the order of our date/time components. For example, we might have year, month, day, hours, minutes OR day, month, year, hours, minutes in order from left to right.
In this case we have: 07/01/2021 00:00:00, or month/day/year hours:minutes:seconds. We care about the order of these. So, to simplify, we have mdy_hms. Lubridate has functions for all combinations of these formats, and mdy_hms() is one of them. You may also have ymd_hm() or any other combo. You just enter your date info followed by an underscore and then your time info. Here’s how you apply this!
library(lubridate) #load lubridate so that mdy_hms() is available
dat$date <- mdy_hms(dat$date) #converts our date column into a date/time object based on the format (order) of our date and time
str(dat) # date is no longer a character string (or factor) but is now a POSIXct object, which means it is in date/time format and can be used for plots and time series!
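To make the naming pattern concrete, here is a minimal sketch that parses a few made-up timestamps written in different orders. The strings and object names are hypothetical and are not taken from the mesocosm data; only the lubridate functions themselves are real.

```r
library(lubridate)

# Hypothetical timestamps written in three different orders
us_style  <- "07/01/2021 13:45:00"   # month/day/year hours:minutes:seconds
iso_style <- "2021-07-01 13:45"      # year-month-day hours:minutes
eu_style  <- "01.07.2021"            # day.month.year, no time

# The function name simply spells out the order of the pieces
parsed_us  <- mdy_hms(us_style)
parsed_iso <- ymd_hm(iso_style)
parsed_eu  <- dmy(eu_style)

class(parsed_us)          # date/time object
class(parsed_eu)          # date only, because there was no time part
month(parsed_us)          # pull the month out as a number
parsed_us == parsed_iso   # do both strings describe the same instant?
```

If the function name does not match the order in the text, lubridate usually returns NA with a warning, which is a handy signal that the order was read incorrectly.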
The package ‘Tidyverse’ in R is a really nice all-encompassing package that actually contains many other packages you’ve likely used in the past (dplyr, tidyr, and ggplot2 are all included). List of packages within tidyverse.
Tidyverse is great because all of the packages like the same kinds of data. That means we can learn the tidyverse methods and apply them to nearly any analysis we want, as long as we understand the format of our data. Tidyverse likes data formatted as columns and rows, just like Excel. This tends to be an easy way for us to think of data storage, especially if we are new to programming. In short, we can read data from Excel (or a .csv) into R and use Tidyverse to organize, trim, graph, and analyze. Since Tidyverse is so versatile and relatively simple, it is what we are going to be learning in this course. If you have programming experience beyond this course and would like to use other methods, that is ok with me. Just recognize that any skills, examples, graphs, or analysis pipelines I will show you in class are likely to be based on Tidyverse.
This section contains some worked examples of Tidyverse best practices for data manipulation. If you just want a quick refresher, you can take a look at the cheat sheet below!
We can mess with a few data sets that are built into R or into R packages.
A common one is mtcars, which is part of base R (attributes of a bunch of cars)
Another fun one is CO2, which is also part of base R (CO2 uptake from different plants). Note: co2 (no caps) is also a dataset in R. It’s the atmospheric CO2 concentration measured at Mauna Loa Observatory, stored as a time series of monthly values.
You are welcome to use these to practice with or you can choose from any of the datasets in the ‘datasets’ or ‘MASS’ packages (you have to load the package to get the datasets).
You can also load in your own data or pick something from online, as we learned how to do last time.
For example, I am fond of the ‘penguins’ data from TidyTuesday.
Let’s look at penguins
head(penguins)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18.0 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
sex year
1 male 2007
2 female 2007
3 female 2007
4 <NA> 2007
5 female 2007
6 male 2007
Now let’s say we only really care about species and bill length. We can select those columns to keep and remove the rest of the columns because they are just clutter at this point. There are two ways we can do this:
1.) Select the columns we want to keep
2.) Select the columns we want to remove
Here are two ways to do that:
Base R example
For those with some coding experience, you may like this method, as this syntax is common in other coding languages.
Step 1.) Count the column numbers. Column 1 is the leftmost column. Remember we can use ncol() to count the total number of columns (useful when we have a huge number of columns)
ncol(penguins) # we have 8 columns
[1] 8
Species is column 1 and bill length is column 3. Those are the only columns we want!
Step 2.) Select columns we want to keep using bracket syntax. Here we will use this basic syntax: df[rows, columns]. We can input the rows and/or columns we want inside our brackets. If we want more than 1 row or column we will need to use ‘c()’, which stands for concatenate (combine). To select just species and bill length we would do the following:
head(penguins[,c(1,3)]) #Selecting NO specific rows and 2 columns (numbers 1 and 3)
species bill_length_mm
1 Adelie 39.1
2 Adelie 39.5
3 Adelie 40.3
4 Adelie NA
5 Adelie 36.7
6 Adelie 39.3
IMPORTANT When we do this kind of manipulation it is super helpful to NAME the output. In the above example I didn’t do that. If I don’t name the output I cannot easily call it later. If I do name it, I can use it later and see it in my ‘Environment’ tab. So, I should do this:
pens <- penguins[,c(1,3)]
head(pens)
species bill_length_mm
1 Adelie 39.1
2 Adelie 39.5
3 Adelie 40.3
4 Adelie NA
5 Adelie 36.7
6 Adelie 39.3
Now, here’s how you do the same selection step by removing the columns you DO NOT want.
pens2 <- penguins[,-c(2,4:8)] #NOTE that ':' is just shorthand for all columns between 4 and 8. I could also use -c(2,4,5,6,7,8)
head(pens2)
species bill_length_mm
1 Adelie 39.1
2 Adelie 39.5
3 Adelie 40.3
4 Adelie NA
5 Adelie 36.7
6 Adelie 39.3
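The same keep-or-drop logic works on any data frame. As a small sketch using the built-in mtcars data (the object names below are only for illustration), both approaches should give the same result:

```r
# mtcars has 11 columns; mpg is column 1 and cyl is column 2
ncol(mtcars)

# Option 1: keep the columns we want
cars_keep <- mtcars[, c(1, 2)]

# Option 2: drop the columns we do not want (columns 3 through 11)
cars_drop <- mtcars[, -c(3:11)]

# Leaving the 'rows' slot empty means "all rows"
head(cars_keep)

# Check that both routes produce the same data frame
identical(cars_keep, cars_drop)
```

Counting column positions by hand gets error-prone on wide data, which is one reason selecting by name is often preferred.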
Tidyverse example (select())
Perhaps that example above was a little confusing? This is why we like Tidyverse! We can do the same thing using the select() function in Tidyverse and it is easier!
I still want just species and bill length. Here’s how I select them:
head(select(penguins, species, bill_length_mm))
species bill_length_mm
1 Adelie 39.1
2 Adelie 39.5
3 Adelie 40.3
4 Adelie NA
5 Adelie 36.7
6 Adelie 39.3
EASY. Don’t forget to name the output for use later :)
species bill_length_mm
1 Adelie 39.1
2 Adelie 39.5
3 Adelie 40.3
4 Adelie NA
5 Adelie 36.7
6 Adelie 39.3
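As a quick sketch of naming the output of select(), here is the same idea on the built-in mtcars data. The object names are made up for illustration, and the second call shows one way to select by removing columns instead of keeping them.

```r
library(dplyr)

# Keep two columns by name and store the result for later use
cars_small <- select(mtcars, mpg, cyl)

# Or remove what we do not want: a minus sign drops columns,
# and disp:carb means "every column from disp through carb"
cars_small2 <- select(mtcars, -(disp:carb))

head(cars_small)
names(cars_small2)
```

Because the result is stored in a named object, it shows up in the Environment tab and can be reused in later steps.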
Sometimes we only want to look at data from a subset of the data frame
For example, maybe we only want to examine data from chinstrap penguins in the penguins data. OR perhaps we only care about 4 cylinder cars in mtcars. We can filter out the data we don’t want easily using Tidyverse (filter) or base R (subset)
Tidyverse example - Using filter()
Let’s go ahead and filter the penguins data to only include chinstraps and the mtcars data to only include 4 cylinder cars.
The syntax for filter is: filter(df, column <comparison> value), where the comparison is an operator such as ==, >, <, >=, or <= and the value is a number or a factor level.
#filter penguins to only contain chinstrap
chins <- filter(penguins, species=='Chinstrap')
head(chins)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
1 Chinstrap Dream 46.5 17.9 192 3500
2 Chinstrap Dream 50.0 19.5 196 3900
3 Chinstrap Dream 51.3 19.2 193 3650
4 Chinstrap Dream 45.4 18.7 188 3525
5 Chinstrap Dream 52.7 19.8 197 3725
6 Chinstrap Dream 45.2 17.8 198 3950
sex year
1 female 2007
2 male 2007
3 male 2007
4 female 2007
5 male 2007
6 female 2007
#confirm that we only have chinstraps
chins$species
#filter mtcars to only contain 4 cylinder cars
cars4cyl <- filter(mtcars, cyl == 4)
cars4cyl$cyl #shows us only the observations in the cyl column!
[1] 4 4 4 4 4 4 4 4 4 4 4
Base R example (subset)
In this case, the subset() function that is in base R works almost exactly like the filter() function. You can essentially use them interchangeably.
#subset mtcars to include only 4 cylinder cars
cars4cyl2.0 <- subset(mtcars, cyl == 4)
cars4cyl2.0
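Both functions also accept more than one condition. The sketch below, using made-up object names, keeps only 4 cylinder cars that weigh less than 2 (wt is recorded in thousands of pounds) and compares the two approaches.

```r
library(dplyr)

# Tidyverse: conditions separated by commas are combined with AND
light_fours_tidy <- filter(mtcars, cyl == 4, wt < 2)

# Base R: the same conditions joined with &
light_fours_base <- subset(mtcars, cyl == 4 & wt < 2)

# Both should contain the same cars
nrow(light_fours_tidy)
nrow(light_fours_base)
all.equal(light_fours_tidy$mpg, light_fours_base$mpg)
```

Note that cyl is stored as a number in mtcars, so it is compared against 4 rather than '4'; text values such as 'Chinstrap' do need quotes.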
Adding a new column
Sometimes we may want to do some math on a column (or a series of columns). Maybe we want to calculate a ratio, volume, or area. Maybe we just want to scale a variable by taking the log or changing it from cm to mm. We can do all of this with the mutate() function in Tidyverse!
#convert bill length to cm (and make a new column)
head(penguins)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18.0 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
sex year
1 male 2007
2 female 2007
3 female 2007
4 <NA> 2007
5 female 2007
6 male 2007
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18.0 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
sex year bill_length_cm
1 male 2007 3.91
2 female 2007 3.95
3 female 2007 4.03
4 <NA> 2007 NA
5 female 2007 3.67
6 male 2007 3.93
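A minimal sketch of this kind of mutate() call is shown here on a tiny made-up data frame (the name birds and its values are placeholders, not the penguins data). It may not match the exact code used to produce the output above.

```r
library(dplyr)

# Hypothetical measurements in millimetres
birds <- data.frame(id = 1:3, bill_length_mm = c(39.1, 39.5, 40.3))

# mutate() adds bill_length_cm and leaves the original column alone
birds_cm <- birds %>%
  mutate(bill_length_cm = bill_length_mm / 10)

birds_cm
```

The new column name goes on the left of the = sign and the calculation goes on the right.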
Change existing column
The code above adds bill length in cm to the data frame as a new column. We could have also just done the math in the original column if we wanted. That would look like this:
head(penguins)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18.0 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
sex year
1 male 2007
2 female 2007
3 female 2007
4 <NA> 2007
5 female 2007
6 male 2007
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
1 Adelie Torgersen 3.91 18.7 181 3750
2 Adelie Torgersen 3.95 17.4 186 3800
3 Adelie Torgersen 4.03 18.0 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 3.67 19.3 193 3450
6 Adelie Torgersen 3.93 20.6 190 3650
sex year
1 male 2007
2 female 2007
3 female 2007
4 <NA> 2007
5 female 2007
6 male 2007
NOTE This is misleading because now the values in bill_length_mm are in cm. Thus, it was better to just make a new column in this case. But you don’t have to make a new column every time if you would prefer not to. Just be careful.
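To illustrate overwriting a column, here is a small sketch on a made-up data frame (birds is a placeholder, not the penguins data). Reusing an existing column name inside mutate() replaces that column instead of adding a new one.

```r
library(dplyr)

# Hypothetical measurements in millimetres
birds <- data.frame(id = 1:3, bill_length_mm = c(39.1, 39.5, 40.3))

# Same name on the left of '=' means the existing column is overwritten
birds_overwritten <- birds %>%
  mutate(bill_length_mm = bill_length_mm / 10)

birds_overwritten   # still two columns, but the values are now in cm
```

Storing the result under a new object name keeps the original data frame intact in case the change turns out to be a mistake.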
Column math in Base R
Column manipulation is easy enough in base R as well. We can do the same thing we did above without Tidyverse like this:
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18.0 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
sex year bill_length_cm
1 male 2007 3.91
2 female 2007 3.95
3 female 2007 4.03
4 <NA> 2007 NA
5 female 2007 3.67
6 male 2007 3.93
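A base R version of column math might look like the sketch below, which uses the $ operator on a tiny made-up data frame (birds is a placeholder name). Assigning to a column name that does not exist yet creates it.

```r
# Hypothetical measurements in millimetres
birds <- data.frame(id = 1:3, bill_length_mm = c(39.1, 39.5, 40.3))

# Create a new column by assigning to a name that is not in the data frame yet
birds$bill_length_cm <- birds$bill_length_mm / 10

# Any vectorized function works the same way, for example a log transform
birds$log_bill_length <- log(birds$bill_length_mm)

head(birds)
```

Unlike mutate(), this modifies the data frame in place, so there is no separate output to name.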
‘Pivoting’ data means changing the format of the data. Tidyverse and ggplot in particular tend to like data in ‘long’ format. Long format means few columns and many rows. Wide format is the opposite- many columns and fewer rows.
Wide format is usually how the human brain organizes data. For example, a spreadsheet in which every species is in its own column is wide format. You might take this sheet to the field and record presence/absence or a count of each species at each site. This is great, but it might be easier for us to calculate averages or do group-based analysis in R if we have a column called ‘species’ in which every single species observation is a row. This leads to A LOT of repeated categorical variables (site, date, etc), which is fine.
Example of Long Format
The dataset ‘fish_encounters’, which ships with the tidyr package, is a simple example of long format data. Penguins, iris, and others are also in long format but are more complex.
head(fish_encounters) # here we see 3 columns that track each fish (column 1) across MANY stations (column 2)
# A tibble: 6 × 3
fish station seen
<fct> <fct> <int>
1 4842 Release 1
2 4842 I80_1 1
3 4842 Lisbon 1
4 4842 Rstr 1
5 4842 Base_TD 1
6 4842 BCE 1
Converting from long to wide using pivot_wider (Tidyverse)
Although we know that long format is preferred for working in Tidyverse and doing graphing and data analysis in R, we sometimes do want data to be in wide format. There are certain functions and operations that may require wide format. This is also the format that we are most likely to use in the field. So, let’s convert fish_encounters back to what it likely was when the data were recorded in the field…
#fish_encounters long to wide using pivot_wider
widefish <- fish_encounters %>%
  pivot_wider(names_from = station, values_from = seen)
head(widefish)
# A tibble: 6 × 12
fish Release I80_1 Lisbon Rstr Base_TD BCE BCW BCE2 BCW2 MAE MAW
<fct> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 4842 1 1 1 1 1 1 1 1 1 1 1
2 4843 1 1 1 1 1 1 1 1 1 1 1
3 4844 1 1 1 1 1 1 1 1 1 1 1
4 4845 1 1 1 1 1 NA NA NA NA NA NA
5 4847 1 1 1 NA NA NA NA NA NA NA NA
6 4848 1 1 1 1 NA NA NA NA NA NA NA
The resulting data frame above is a wide version of the original in which each station now has its own column. This is likely how we would record the data in the field!
Example of Wide Format Data
Let’s just use widefish for this since we just made it into wide format :)
head(widefish)
# A tibble: 6 × 12
fish Release I80_1 Lisbon Rstr Base_TD BCE BCW BCE2 BCW2 MAE MAW
<fct> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 4842 1 1 1 1 1 1 1 1 1 1 1
2 4843 1 1 1 1 1 1 1 1 1 1 1
3 4844 1 1 1 1 1 1 1 1 1 1 1
4 4845 1 1 1 1 1 NA NA NA NA NA NA
5 4847 1 1 1 NA NA NA NA NA NA NA NA
6 4848 1 1 1 1 NA NA NA NA NA NA NA
Converting from Wide to Long using pivot_longer (Tidyverse)
# A tibble: 6 × 3
fish station seen
<fct> <chr> <int>
1 4842 Release 1
2 4842 I80_1 1
3 4842 Lisbon 1
4 4842 Rstr 1
5 4842 Base_TD 1
6 4842 BCE 1
And now we are back to our original data frame! The ‘!fish’ simply means that we do not wish to pivot the fish column. It remains unchanged. A ‘!’ in R means ‘not’, so here it excludes a column from the pivot. We’ve used names_to and values_to to give names to our new columns. pivot_longer takes the names of the pivoted columns (the stations) and puts them in the names_to column, and it takes the values stored in those columns and puts them in the values_to column.
NOTES There are MANY other ways to modify pivot_wider() and pivot_longer(). I encourage you to look in the help tab, the tidyr/Tidyverse documentation online, and for other examples on Google and Stack Overflow.
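A sketch of the round trip described above is given here. It assumes tidyr and dplyr are installed; the names longfish and longfish_seen are illustrative and may differ from the objects used to produce the output shown.

```r
library(tidyr)
library(dplyr)

# Long -> wide: one column per station
widefish <- fish_encounters %>%
  pivot_wider(names_from = station, values_from = seen)

# Wide -> long: pivot every column except fish
longfish <- widefish %>%
  pivot_longer(!fish, names_to = "station", values_to = "seen")

head(longfish)

# Fish/station pairs that were never recorded come back as NA rows.
# values_drop_na = TRUE leaves those rows out again.
longfish_seen <- widefish %>%
  pivot_longer(!fish, names_to = "station", values_to = "seen",
               values_drop_na = TRUE)

nrow(longfish_seen) == nrow(fish_encounters)
```

One difference from the original data is that station comes back as text rather than a factor, because it was rebuilt from column names.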
4.) Combining functions with the pipe (%>%) syntax
The pipe, denoted as ‘|’ in Unix shells and as ‘%>%’ in R (it comes from the magrittr package, which Tidyverse loads for you), is used to link functions together. This is an oversimplification, but it works for our needs.
A pipe (%>%) is useful when we want to do a sequence of actions to an original data frame. For example, maybe we want to select() some columns and then filter() the resulting selection before finally calculating an average (or something). We can do all of those steps individually or we can use pipes to do them all at once and create one output.
We can think of the pipe as the phrase “and then.” I will show examples in the next section.
When not to use a pipe:
1.) When you want to manipulate multiple data frames at the same time
2.) When there are meaningful intermediate objects (aka we want an intermediate step to produce a named data frame)
The pipe is coded as ‘%>%’ and should have a single space on either side of it at all times.
Let’s do an example with penguins. Here we will select only species and bill length and then we will filter so that we only have chinstrap penguins.
Remember that we think of pipe as the phrase ‘and then’
head(penguins)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18.0 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
sex year bill_length_cm
1 male 2007 3.91
2 female 2007 3.95
3 female 2007 4.03
4 <NA> 2007 NA
5 female 2007 3.67
6 male 2007 3.93
#pseudocode / logic: look at dataframe penguins AND THEN (%>%) select() species and bill length AND THEN (%>%) filter by chinstrap
pipepen <- penguins %>% #first step of the pipe is to call the original dataframe so we can modify it!
  select(species, bill_length_mm) %>% #selected our columns
  filter(species == 'Chinstrap') #filtered for chinstrap
head(pipepen) #it worked! We didn't have to mess with intermediate dataframes and we got exactly what we needed :)
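To see what the pipe is saving us from, the sketch below writes the same select-then-filter task three ways on the built-in mtcars data. The object names are only for illustration.

```r
library(dplyr)

# Step by step, with a named intermediate data frame
step1 <- select(mtcars, mpg, cyl)
step2 <- filter(step1, cyl == 4)

# Nested: the same thing, but it has to be read from the inside out
nested <- filter(select(mtcars, mpg, cyl), cyl == 4)

# Piped: read top to bottom as "and then"
piped <- mtcars %>%
  select(mpg, cyl) %>%
  filter(cyl == 4)

# All three routes should give the same result
identical(step2, piped)
identical(nested, piped)
```

The pipe simply passes whatever is on its left as the first argument of the function on its right, which is why the data frame name is not repeated in each step.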
Now we will learn how to use the pipe to do calculations that are more meaningful for us!
The pipe becomes especially useful when we are interested in calculating averages. This is something you’ll almost certainly be doing at some point for graphs and statistics! Pipes make this pretty easy.
When thinking about scientific hypotheses and data analysis, we often consider how groups or populations vary (both within the group and between groups). As such, a simple statistical analysis that is common is called analysis of variance (ANOVA). We often also use linear models to assess differences between groups. We will get into statistical theory later, but this does mean that it is often meaningful to graph population and group level means (with error) for the sake of comparison. So let’s learn how to calculate those!
There are three steps:
1.) Manipulate the data as needed (correct format, select what you need, filter if necessary, etc)
2.) Group the data as needed (so R knows how to calculate the averages)
3.) Do your calculations!
Here’s what that looks like in code form:
Let’s use mtcars and calculate the mean miles per gallon (mpg) of cars by cylinder.
mpgpercyl <- mtcars %>%
  group_by(cyl) %>% #group = cylinder
  summarize(mean = mean(mpg), error = sd(mpg)) # a simple summarize with just mean and standard deviation
head(mpgpercyl)
Now, maybe we want something more complex. Let’s say we want to look only at 4 cylinder cars that have more than 100 horsepower. Then we want to see the min, max, and mean mpg in addition to some error.
mpgdf <- mtcars %>%
  filter(cyl == 4, hp > 100) %>% #filters mtcars to only include cars w/ 4 cylinders and hp greater than 100
  summarize(min = min(mpg), max = max(mpg), mean = mean(mpg), error = sd(mpg))
head(mpgdf)
min max mean error
1 21.4 30.4 25.9 6.363961
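When a filter is this strict, it is worth checking how many rows are left before trusting the summary. The sketch below repeats the same filter and adds n() to the summarize call; the name mpg_check is made up for illustration.

```r
library(dplyr)

mpg_check <- mtcars %>%
  filter(cyl == 4, hp > 100) %>%
  summarize(n = n(),            # how many cars passed the filter
            min = min(mpg),
            max = max(mpg),
            mean = mean(mpg),
            error = sd(mpg))

mpg_check
```

If n turns out to be very small, the mean and standard deviation describe only a handful of cars and should be interpreted with caution.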
Let’s do one more using penguins. This time, I want to know how bill length varies between species, islands, and sex. I also prefer to use standard error of the mean in my error bars over standard deviation. So I want to calculate that in my summarize function.
head(penguins)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18.0 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
sex year bill_length_cm
1 male 2007 3.91
2 female 2007 3.95
3 female 2007 4.03
4 <NA> 2007 NA
5 female 2007 3.67
6 male 2007 3.93
sumpens <- penguins %>%
  group_by(species, island, sex) %>%
  summarize(meanbill=mean(bill_length_mm), sd=sd(bill_length_mm), n=n(), se=sd/sqrt(n)) %>%
  na.omit() #removes rows with NA values (a few rows would otherwise have NA in 'sex' due to sampling error in the field)
`summarise()` has grouped output by 'species', 'island'. You can override using
the `.groups` argument.
sumpens
# A tibble: 10 × 7
# Groups: species, island [5]
species island sex meanbill sd n se
<chr> <chr> <chr> <dbl> <dbl> <int> <dbl>
1 Adelie Biscoe female 37.4 1.76 22 0.376
2 Adelie Biscoe male 40.6 2.01 22 0.428
3 Adelie Dream female 36.9 2.09 27 0.402
4 Adelie Dream male 40.1 1.75 28 0.330
5 Adelie Torgersen female 37.6 2.21 24 0.451
6 Adelie Torgersen male 40.6 3.03 23 0.631
7 Chinstrap Dream female 46.6 3.11 34 0.533
8 Chinstrap Dream male 51.1 1.56 34 0.268
9 Gentoo Biscoe female 45.6 2.05 58 0.269
10 Gentoo Biscoe male 49.5 2.72 61 0.348
As you can see, this is complex but with just a few lines we have all of the info we might need to make some pretty cool plots and visually inspect for differences.
Some notes on the pieces of the summarize function I used up there: meanbill is just a mean() calculation. sd is just a standard deviation calculation using sd(). n=n() calculates the sample size for each group. Base R has no built-in function for standard error (and we aren’t using the packages that provide one here), so I wrote the formula for it myself. Standard Error = standard deviation / square root(sample size), in other words: se=sd/sqrt(n)
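If the standard error is needed in several places, one option is to wrap the formula in a small helper function. The function std_error below is a hypothetical helper written for this sketch (it is not part of base R or dplyr), and it drops missing values before calculating.

```r
library(dplyr)

# Hypothetical helper: standard error of the mean, ignoring NA values
std_error <- function(x) {
  x <- x[!is.na(x)]            # drop missing values first
  sd(x) / sqrt(length(x))      # standard deviation / square root of sample size
}

# Use it inside summarize() just like mean() or sd()
se_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarize(mean_mpg = mean(mpg),
            n = n(),
            se = std_error(mpg))

se_by_cyl
```

Counting only the non-missing values matters when a column contains NA, because n() counts every row in the group, including rows where the measurement is missing.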
PS: here’s the payoff… we can use the dataframe we just made to build a really nice plot, like the one below. You will be learning ggplot next time! NOTE: this plot is about as complex as we’d ever expect you to get. So don’t worry, we aren’t starting with this kind of plot.