Adapted and extended from Chapter 2 of Getting Started with R by Beckerman, Childs and Petchey.

The bird dataset in the exercise is from Mark Gardener’s Statistics for Ecologists using R and Excel

Introduction

In the following you are intended to write a script of your own, using the snippets of code in this instruction set. As you come to each one, just copy and paste it into your own script. If you feel that writing them out yourself would make you better able to remember them, do that instead, but remember that you dont need to remember them. Once you’re done with this exercise, you will have this worksheet and your own script which you can copy and adapt when you need to do something similar again.

The very useful data manipulation (dplyr) and plotting (ggplot2) packages within tidyverse work much more effectively when the data is tidy.

Tidy means - each variable appears only once per row, and each observation appears only once per column.

Tidy data typically looks ‘long and thin’ whereas untidy data, which is often how we like it to be for a human to read, or how it would likely be if a human had recorded it, is typically ‘short and fat’ ie fewer rows and more columns than the equivalent data set once it has been tidied.

The scenario in this exercise is that we have data in a spreadsheet that is as we might have wanted it as a data collection sheet. This means that we have entered the same type of information into several columns.

Let’s get started


Full version

Before you begin

Somewhere in your computer universe you should by now have an organised folder where you do your R work for this course. You may have named it something else, but here we will refer to it as the RStuff folder. Inside that folder there could be several sub-folders including at least these two - one called scripts and the other called data. You can guess what is supposed to go in each of these.

Projects

If you did not do so when setting it up, go now and make your RStuff folder into a Project. You will notice that an RStuff.Rproj file appears in the top level of the folder. In future, you can open the project by navigating to and then clicking on this from the File/Open Project menu. Using and working with Projects turns out to have several advantages that become more apparent the more you use them. In particular, it makes much easier the business of having an organised RStuff folder while also being able to read in or write out data or other stuff from particular sub-folders without having to have a Tech Guru’s knowledge of obscure pathway syntax. It also keeps your R environment tidy.

Eventually, in your R life, you may well end up using the widely known Github web host, which comes into its own when you want to collaborate with other people and share your Project contents with them. If all the files of your collaboration - data, scripts, manuscripts etc are bundled into a Project, then RStudio makes it very easy to ‘push’ the latest version of your work up to GitHub and to ‘pull’ down whatever your colleagues have done since you last made any changes, in such a way that none of you overwrite what anyone else has done.

But that’s for another day…..

Download the data into the data folder.

Go to Teams, then from the Files/Class Materials/Rstuff/data folder download the data file called nasty_format.csv and put it into your Rtuff/data folder. Or, get this file some other way.

Create a new notebook

Go to File/New File/R Notebook (the second option in the menu).

Create a new notebook and save it into your RStuff/scripts folder with the namedata_tidy. RStudio will automatically give it a .Rmd extension, so you don’t have to do that yourself.

Add and complete author: and date: lines to the yaml and delete all the exemplar text. Now you are ready to go.

Your script window should now be empty, with a tab data_tidy.Rmd, coloured black because you have no unsaved changes. If you did, the tab would be coloured red.

Check what is the working directory

Check that the Rstuff folder is the working directory. You can do this either by typing getwd() into your console window, or, in the Files tab of the bottom right hand window, click on the gear wheel then Go to Working Directory, which should appear in the window beneath.

You should find that the working directory is your RStuff folder, which is handy. Notice too that your Environment window (top-right) is empty, which is nice and neat, and that if you click on the Files tab in the bottom right window you see the internal structure of your RStuff folder. That is, it is showing you your working directory. These are all nice features of using Projects. It leaves you with less to do or worry about.

In particular, we don’t really need to worry about the whole idea of the working directory, as long as we use the here package, as we will see, shortly.

Our data

The data we will look at is an instance of where observations of the same thing, here bacterial densities in different bottles held at different temperatures, have been recorded on different dates and the observation from each date is placed in a separate column. This means that each row contains many observations - the bacterial densities on each of the dates where these were recorded. We would like each row to contain only a single observation.

Write the script

So, now, to work, or, as American academics like to say, let’s get started….

Follow through the rest and copy each of the blocks of code into code chunks in your script in turn. As you enter each block, remember to add a header and whatever commentary you think is suficient to make it clear what the code-chunk does., Then select, the code lines within the chunk line by line sequentially and implement them using Ctrl-Enter or Cmd-Enter, or run the whole chunk at once by pressing the green arrow at the top right of the chunk..

You should see each line or block being implemented in the Console pane and any objects created or removed will either appear or disappear in the Environment pane.

The set-up chunk

This set-up chunk is a bit opaque but is useful at the top of every notebook you write:

```{r,echo = FALSE}
knitr::opts_chunk$set(message = FALSE,warning = FALSE)
```

The arguments to the knitr::opts_chunk$set function are ‘chunk options’. In any other chunk these affect aspects of the way that chunk will behave. In the set-up chunk they act globally. This saves you from having to include common chunk options in every chunk where you might want them. Here, we just have two, to stop messages and warnings from cluttering up the knitted version of our script. You can add other chunk options. as you learn about them.

Load any packages you need

Packages in R are like apps on your phone. They add extra capabilities to R. In fact they are what makes R so powerful.

With phone apps, you install them once, then afterwards you just press on them to make them work.

It works in a similar way with packages in R. So, if you have never used a package before on the machine you are using, you need to install it.

Let us do this with the package tidyverse. We will use this very important package almost all the time.

Installing a package for the first time

If you have already installed tidyverse, here and lubridate onto your machine, you can skip this section.

In the Console pane (not in your script) type this:

#install.packages("tidyverse") #uncomment this line if you need to run it.

You should now see this package being installed in the Console window. tidyverse is a big package, so a lot of red ink rolls by. Just wait until the > prompt reappears.

Now that we have done this, we do not need to do it again, just like you only install an app on your phone once. That is why we wrote the instruction in the Console pane and not in a script.

Loading a package into the current R session

Each time you want to use an app on your phone, you have to press on it. So it is with packages in R, they don’t just work once they are installed, you have to tell R that you want to use them. We do this with the library() command, which you write in your script.

Enter a code chunk with these lines in your script and implement it.

library(tidyverse)
library(here)
library(lubridate)
library(kableExtra)

You will see a list of the packages that tidyverse brings together under one roof. We will make frequent use of most of these,particularly readr, dplyr and ggplot2. lubridate is another package, also part of tidyverse but hwich has to be separately installed. It

If all has gone well, you should just see the prompt again in the Console window.

Read the data into a data frame called nasty_format using read_csv()

Stop! Before you do this, look at the data file in Excel/Numbers/GoogleSheets. What do you notice about one of the columns?

read_csv() is a command from the dplyr package within tidyverse. You can guess what it is for.

filepath <- here("data","nasty_format.csv")
nasty_format <- read_csv(filepath)

Notice the use here of the package here(). Since we are working with a Project, the working directory is by default the top level of the project. That is where here() thinks of as ‘here’, so to speak. From that start point it can then be told to look down through whatever heirarchy of subfolders it needs to in order to find the data file you want to read into R.

The object nasty_format is a data frame (actually, it is a tibble, which the tidyverse people say is just like a data frame, but without some of the annoying properties. I don’t think this new name has caught on yet) - R’s word for a container that you can best think of as a spreadsheet. It has columns of data, where the data can be almost anything, including text, dates, whole numbers, decimal numbers, and so on.

Inspect the data using glimpse().

Having read your data into R, it is a good idea to inspect it in various ways. I habitually use the glimpse() command immediately.

glimpse(nasty_format)
## Rows: 36
## Columns: 11
## $ species   <chr> "P.caudatum", "P.caudatum", "P.caudatum", "P.caudatum", "P.c…
## $ bottle    <chr> "7-P.c", "8-P.c", "9-P.c", "22-P.c", "23-P.c", "24-P.c", "37…
## $ temp      <dbl> 22, 22, 22, 20, 20, 20, 15, 15, 15, 22, 22, 22, 20, 20, 20, …
## $ `1.2.13`  <dbl> 100.0, 62.5, 75.0, 75.0, 50.0, 87.5, 75.0, 50.0, 75.0, 37.5,…
## $ `2.2.13`  <dbl> 58.8, 71.3, 72.5, 73.8, NA, NA, NA, NA, NA, 52.5, 63.7, 70.0…
## $ `3.2.13`  <dbl> 67.5, 67.5, 62.3, 76.3, 81.3, 62.5, 90.0, 78.8, 78.3, 23.8, …
## $ `4.2.13`  <dbl> 6.80, 7.90, 7.90, 31.30, 32.50, 28.80, 72.50, 92.50, 77.50, …
## $ `6.2.13`  <dbl> 0.93, 0.90, 0.88, 3.12, 3.75, 3.12, 102.50, 68.75, 116.20, 0…
## $ `8.2.13`  <dbl> 0.39, 0.36, 0.25, 1.01, 1.06, 1.00, 67.50, 72.50, 60.00, 0.9…
## $ `10.2.13` <dbl> 0.19, 0.16, 0.23, 0.56, 0.49, 0.41, 37.50, 52.50, 60.00, 0.3…
## $ `12.2.13` <dbl> 0.46, 0.34, 0.31, 0.50, 0.38, 0.46, 41.25, 45.00, 36.25, 0.9…

What does glimpse() tell you about a data set? How many rows are there? How many columns? What are the columns called? What kind of data do they contain?

Some other ways to Inspect your data

Other useful ‘quick look’ functions are:

head(nasty_format) 

This shows you the first few rows. Does everything look OK?

#view(nasty_format)

You can achieve the same thing as view() by clicking on the name of the variable in the Environment pane. That is what I do. I find it annoying to have view() implemented each time I run a script, which is why I have commented it out!

summary(nasty_format)
##    species             bottle               temp        1.2.13      
##  Length:36          Length:36          Min.   :15   Min.   : 25.00  
##  Class :character   Class :character   1st Qu.:15   1st Qu.: 50.00  
##  Mode  :character   Mode  :character   Median :20   Median : 75.00  
##                                        Mean   :19   Mean   : 79.17  
##                                        3rd Qu.:22   3rd Qu.:100.00  
##                                        Max.   :22   Max.   :150.00  
##                                                                     
##      2.2.13          3.2.13           4.2.13           6.2.13       
##  Min.   :52.50   Min.   : 23.80   Min.   :  1.13   Min.   :  0.760  
##  1st Qu.:63.77   1st Qu.: 66.88   1st Qu.: 23.57   1st Qu.:  2.603  
##  Median :71.90   Median : 77.90   Median : 72.50   Median : 43.125  
##  Mean   :71.51   Mean   : 74.30   Mean   : 57.59   Mean   : 40.828  
##  3rd Qu.:78.12   3rd Qu.: 85.00   3rd Qu.: 83.45   3rd Qu.: 70.000  
##  Max.   :97.50   Max.   :108.80   Max.   :111.30   Max.   :116.200  
##  NA's   :16                                                         
##      8.2.13          10.2.13           12.2.13      
##  Min.   : 0.250   Min.   : 0.1600   Min.   :  0.30  
##  1st Qu.: 1.048   1st Qu.: 0.5475   1st Qu.:  0.50  
##  Median :26.685   Median :16.2500   Median : 15.19  
##  Mean   :34.403   Mean   :28.0172   Mean   : 30.33  
##  3rd Qu.:69.062   3rd Qu.:60.0000   3rd Qu.: 50.94  
##  Max.   :91.250   Max.   :91.3000   Max.   :102.50  
## 

summary() gives you descriptive statistics of each column. Notice all the NAs in the 2.2.13 column.

Clean up the data

Almost always, with real data, there is some cleaning up to do before you can analyse it. Here, we have a column with a lot of missing data.

Remove one of the columns

We notice in looking at the data that one of the dates, 2.2.13, seems to be problematic. There are many NAs in the column for that date. We have a number of choices as to what do about this, but suppose we decide to ignore data from that date, perhaps because we realise from our log book that our procedures were faulty on that day.

We can remove or select columns that meet some criterion using the select() command from dplyr:

nasty_format <- select(nasty_format,-"2.2.13")

The minus sign here has the effect of selecting all columns except the problematic one for the data 2.2.13.

Tidy the data

Our dataset is not tidy since columns 4 to 10 contain values of the same variable abundance. The names for these columns are different dates. Thus they are actually different values of the same variable date.

To tidy the data, we need to ‘gather’ these columns into a pair of columns which we will name date and abundance, one of which contains the dates, the other of which contains the abundance values measured on those dates.

This should give us a tidy data set since in any one row there will now be just one example of each variable: species, bottle, temp, date, abundance.

As we will see, this means that our final, tidy dataset will contain more rows and fewer columns than the original, untidy data set.

We can either do this using:

Method One - tidy your data in Excel

I don’t recommend it, but you can tidy data in Excel. With simple data sets this is straightforward - but it gets trickier, more error prone and more tedious with larger data sets. . Your goal is to put the data into ‘long and thin’ form in which each variable appears only once in each row.
You should have five columns with names: “species”, “bottle”, “temp”, “date” and “abundance” No cells should be empty.This means you may have to fill down and copies some values into other cells.

Once you have done this, save the data as a .csv file called ‘nice_format.csv’

Now load this data into R, into an object called tidy_data, like this:

#filepath <- here("data","nice_format.csv")
#tidy_data <- read_csv(filepath)

or using…

Method Two - Use the pivot_longer() function from the package tidyr, which is part of tidyverse.

The pivot_longer() function supersedes the gather() function which you may come across but can otherwise forget about. In making datasets tidy it makes them longer by increasing the number of rows and decreasing the number of columns.

In this code chunk we will tidy the data and read it into an object (a tibble, in fact) called tidy_data.

tidy_data <- nasty_format %>%
  pivot_longer(4:10,names_to = 'date',values_to = 'abundance')
glimpse(tidy_data)
## Rows: 252
## Columns: 5
## $ species   <chr> "P.caudatum", "P.caudatum", "P.caudatum", "P.caudatum", "P.c…
## $ bottle    <chr> "7-P.c", "7-P.c", "7-P.c", "7-P.c", "7-P.c", "7-P.c", "7-P.c…
## $ temp      <dbl> 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, …
## $ date      <chr> "1.2.13", "3.2.13", "4.2.13", "6.2.13", "8.2.13", "10.2.13",…
## $ abundance <dbl> 100.00, 67.50, 6.80, 0.93, 0.39, 0.19, 0.46, 62.50, 67.50, 7…
  • The first argument of pivot_longer()is the dataset to reshape, nasty_format. Here it is not included explicitly since it is fed-through from the previous line by the pipe operator %>%.
  • The second argument describes which columns need to be reshaped, In this case it is columns 4:10.
  • The names_to argument gives the name of the variable that will be created from the data stored in the column names, i.e. date.
  • The values_to argument gives the name of the variable that will be created from the data stored in the cell value, i.e. abundance.

The reshaped dataset tidy_data should now be tidy. Is it?

Clean the dates

Dates and times commonly need to be dealt with in some way. In particular, when first read into R, R often doesn’t even recognise them as dates, but just thinks of them as any old text. Look at the output of the glimpse(tidy_data) command you have just implemented, and check that this is indeed the case with our data.

So we would like them to be recognised as dates and/or times and to be given a standardised format. This will enqble us to use them in ways that we could not do with text, for example to do arithmetic on them and find the time gap between two dates, and so on.

To do this we look at the dates in the Date column in tidy_data and see how they have been formatted. The format appears to be day.month.year.

We will use mutate() and the command dmy() from the lubridate package, which is part of tidyverse and which we have already loaded up. Why dmy(), do you think? There are many other similar commands for use with dates formatted in other ways.

mutate() is one of several very useful functionss within dplyr that is used to alter data within columns, or to create new ones. mutate - geddit? Here we use it together with dmy() to alter the format of the data in the date column.

glimpse(tidy_data) # before
## Rows: 252
## Columns: 5
## $ species   <chr> "P.caudatum", "P.caudatum", "P.caudatum", "P.caudatum", "P.c…
## $ bottle    <chr> "7-P.c", "7-P.c", "7-P.c", "7-P.c", "7-P.c", "7-P.c", "7-P.c…
## $ temp      <dbl> 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, …
## $ date      <chr> "1.2.13", "3.2.13", "4.2.13", "6.2.13", "8.2.13", "10.2.13",…
## $ abundance <dbl> 100.00, 67.50, 6.80, 0.93, 0.39, 0.19, 0.46, 62.50, 67.50, 7…
tidy_data <- mutate(tidy_data,date = dmy(date)) # amend the date column
glimpse(tidy_data) # after
## Rows: 252
## Columns: 5
## $ species   <chr> "P.caudatum", "P.caudatum", "P.caudatum", "P.caudatum", "P.c…
## $ bottle    <chr> "7-P.c", "7-P.c", "7-P.c", "7-P.c", "7-P.c", "7-P.c", "7-P.c…
## $ temp      <dbl> 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, …
## $ date      <date> 2013-02-01, 2013-02-03, 2013-02-04, 2013-02-06, 2013-02-08,…
## $ abundance <dbl> 100.00, 67.50, 6.80, 0.93, 0.39, 0.19, 0.46, 62.50, 67.50, 7…

What do you notice about the change in the data type of the date column after the operation of dmy()?

Grouping the data - doing what pivot tables do

This next chunk of code is very useful. It shows how we can use a pair of commands from dplyr : group_by() and summarise() to, (guess what?) group and summarise our data set. In this way they do what pivot tables do in Excel.

It also illustrates the use of the pipe operator %>% which takes the result of each line and feeds it into the next. You can read it as saying, roughly speaking, “and then”. If you get the hang of it it makes for neater code and avoids you having to create all sorts of intermediate objects on your way to some goal.

Let us find the mean and standard error of the mean of the abundances for each species of bacteria and save them to a new data frame called bacteria_summary:

bacteria_summary <- tidy_data %>%
  group_by(species) %>%
  summarise(mean = mean(abundance),se.species = sqrt(var(abundance)/n()))

Now we will present this table in a neater form, with sensible level of precison for the numerical data:

bacteria_summary %>%
  kbl(digits = 2) %>%
  kable_styling(full_width = F) # from the kableExtra package
species mean se.species
Colpidium 48.18 4.46
P.caudatum 38.94 4.56
S. fonticola 79.59 2.13
Tetrahymena 30.22 4.79

Now let’s plot the data

We won’t learn properly how to do this here, but this next part is just included as a motivation for getting to grips with ggplot2, the very powerful plotting package that comes with tidyverse.

Just type in the following code. You can leave out the comments.

tidy_data %>% 
  ggplot(aes(x = temp,y = abundance,colour = species)) + # what we want to plot
  geom_point() + # how we want to plot it
  labs(title = 'Abundace vs Temperature', # title, axis labels
       x = 'Temperature',
       y = 'Abundance') + 
  facet_wrap(~species) + # one graph for each species
  theme_classic() + # a different look
  theme(legend.position = "none") # get rid of the legend.

What is this plot showing you? Could you have made it without having first tidied the data? No, you could not.

Try adapting the code above to show abundance vs date, for each different bottle.

tldr


The complete script

Start a new R notebook and save it in your scripts folder under a suitable, R-legal name. Make sure your script includes the yaml at the top and delete all the exemplar material below it. Amend the yaml as necessary to include author and date lines. Change the title if you want.

---
title: "Using R to read in, inspect and tidy up data"
author: Michael Hunt
date: 07-02-2022
output:
  html_document:
    df_print: paged
---

Now include code chunks for each of the following sections of code. Include suitable headers before each code chunk as illustrated below, making sure they are formatted as headers in the rendered document, just like these..

This is the set-up chunk

# This is the set-up chunk
knitr::opts_chunk$set(warning = FALSE,message = FALSE)

Load packages

# Load packages
library(tidyverse)
library(here)
library(lubridate)

Load data

# load data
filepath <- here("data","nasty_format.csv")
nasty_format <- read_csv(filepath)
glimpse(nasty_format)

Inspect the data

# inspect the data
head(nasty_format)
summary(nasty_format)

Remove the 2.2.13 column - the one with loads of NAs

# Remove the 2.2.13 column - the one with loads of NAs
nasty_format <- select(nasty_format,-"2.2.13")

Tidy the dataset

# tidy the dataset
tidy_data <- nasty_format %>%
  pivot_longer(4:10,names_to = 'date',values_to = 'abundance')
glimpse(tidy_data)

Clean the data

# Clean the dates
tidy_data <- mutate(tidy_data,date = dmy(date)) # dmy() is from the lubridate package
glimpse(tidy_data)

Summarise the data by species

# Group the data by species, and find mean and standard error of the mean of abundance for each one
bacteria_summary <- tidy_data %>%
  group_by(species) %>%
  summarise(mean = mean(abundance),st.dev = sd(abundance))
bacteria_summary

Make a nice table of the summary

bacteria_summary %>%
  kbl(digits = 2) %>%
  kable_styling(full_width = F) # from the kableExtra package
species mean se.species
Colpidium 48.18 4.46
P.caudatum 38.94 4.56
S. fonticola 79.59 2.13
Tetrahymena 30.22 4.79

Plot the data

tidy_data %>%
  ggplot(aes(x = temp,y = abundance,colour = species)) + # what we want to plot
  geom_point() + # how we want to plot it
  labs(title = 'Abundace vs Temperature', # title, axis labels
       x = 'Temperature',
       y = 'Abundance') + 
  facet_wrap(~species) + # one graph for each species
  theme_classic() + # a different look
  theme(legend.position = "none") # get rid of the legend.

Could you have made this plot without having first tidied up the data? No, you could not.

Extension exercise.


This exercise uses a set of data on birds from Mark Gardener:
Gardener, M. (2017), Statistics for Ecologists Using R and Excel

Create a new R notebook and save it in your scripts folder with the name birds. If you do not already have it, put the birds.csv data file into your RStuff/data folder. You should find this in the RStuff folder on the Teams page.

In your script:

  • Load the packages tidyverse and here.
  • Read data from the bird.csv file into an object called untidy_birds.
  • What kind of an object is this?
  • Use head(untidy_birds) to inspect the first few lines of this dataset.
  • Confirm from this or by direct inspection that the dataset is indeed untidy.
  • Use pivot_longer() to tidy the dataset into a new object called tidy_birds. This should have three columns: Species, Habitat, Sightings
  • Group the data by species and find the minimum and maximum number of Sightings across the different habitats. Use the max() and min() functions for this.
  • Group the data by habitat and find the most commonly sighted species in each habitat. You may need to peek at the answer for this one.

If you are familar with the language of ANOVA analysis, you will recognise that the original untidy dataset had one column of data for each of the different ‘treatments’ or manifestations of the single factor ‘habitat’. Those treatments were Garden, Hedgerow, Parkland, Pasture and Woodland. In the same way, “Male and”Female” are two manifestations of the single factor “Sex”. In tidying the data, we are gathering together the data from all the treatment columns and putting them in a single column with a title that describes what the data is, in this case “Sightings”, with alongside it a column with a title which describes the factor, in this case habitat, within which each row has a value according to which type of treatment ie habitat that sighting was made in.

Exercise solution


Start a new R notebook and save it in your scripts folder under a suitable, R-legal name. Make sure your script includes the yaml at the top and delete all the exemplar material below it. Amend the yaml as necessary to include author and date lines. Change the title if you want.

---
title: "Tidying up the bird sightings data set"
author: Your name
date: The date
output:
  html_document:
    df_print: paged
---

Now include code chunks for each of the following sections of code. Include suitable headers before each code chunk as illustrated below, making sure they are formatted as headers in the rendered document, just like these..

This is the set-up chunk

# This is the set-up chunk
knitr::opts_chunk$set(warning = FALSE,message = FALSE)

Load packages

library(tidyverse)
library(here)

Load and inspect the data

filepath <- here("data","bird.csv")
untidy_birds <- read_csv(filepath)

glimpse(untidy_birds)
head(untidy_birds)

We see that `untidy_birds is a tibble (the new name for new, improved data-frames) with 6 columns and six rows. One of the columns contains text and the others contain numbers, in this case counts of bird sightings.

Tidy the data

We see that the there is just one column that contains Species names, but that sightings values are contained in all the other columns, whose names are the various types of habitat. This dataset is definitely untidy. There should be just one column for Habitat, not several columns for each of the several varieties of habitat. We can use pivot_longer() to achieve this.

In the code below, !Species means ‘NOT the Species column’, ie every column but this one.

tidy_birds <- pivot_longer(untidy_birds,!Species,names_to = "Habitat",values_to = "Sightings")

now what do we have:

glimpse(tidy_birds)
head(tidy_birds)

Group by species

gp_by_species <- tidy_birds %>%
  group_by(Species) %>%
  summarise(Max = max(Sightings),Min = min(Sightings))
gp_by_species

Group by habitat

gp_by_species <- tidy_birds %>%
  group_by(Habitat) %>%
  summarise(Most_often_seen_species = Species[which.max(Sightings)],Sightings = max(Sightings))
gp_by_species

Note here the use of which.max[Column name] to find the row number that contains the maximum value in that column, and then of column_name[index] to pick out the value in column <column name> at row <index>.