Example from the powerpoint

Using the built-in “iris” dataset, let’s count how many of species “setosa” are wider than 0.2

First, load the data:

data(iris)

Using Base R

    sum(nrow(subset(iris, Species == "setosa" & Petal.Width > 0.2)))

## [1] 16

Using dylpr

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

    iris  %>% 
         filter(Species == "setosa", Petal.Width > 0.2)  %>%             summarise(sum_n = n())

##   sum_n
## 1    16

1. Making a new variable

Why? You may need to add a new category, or transform variables that you would like to analyse, a new variable that is a function of existing ones, or a new set of labels

Let’s try to make a ratio, for example Sepal by Petal Length

newIris <- iris %>% 
  mutate(ratio_SL_PL = Sepal.Length/Petal.Length)

and calculate petal leaf area, which we add directly to the new dataframe:

newIris <- newIris %>%
  mutate(Petal.Area = Petal.Length * Petal.Width)

Renaming your variable/s

newIris  <- rename(newIris,  Ratio = ratio_SL_PL)

NEW name first, then = old name

1a. Making 2 variables into one

use “unite” to combine parts of your data (i.e. if you have split colums you’d like to combine: for example: Habitat and season, year/month/day as one date column), example:

df <- unite(df, “nameNEWvariable”, c(variable1, variable2), sep = “_“)

Adding a condition into this

Back to iris data: I want to group the species based on having short or long petals (which here is me arbitrarily defining long = >5 and short anything below 5)

newIris <- newIris %>%
  mutate(
    Species_PetalLength = paste(Species, ifelse(Petal.Length > 5, "LongPetal", "ShortPetal"), sep = "_")
  )

1b. Separating a long variable into 2 separate ones:

May be usieful of you have genus & species names combined in a colum and would like to separate them:

df <- separate(df, combinedVar, c(“Genus”, “species”), sep = “_“, remove = FALSE)

library(tidyr)

newIris <- separate(newIris, Species_PetalLength, c("SpeciesNew", "PL_new"), sep = "_", remove = FALSE)

2. Subsetting rows

Based on certain criteria, you can “filter” your dataset. Common logical operators that you can use

< less than
> greater than
== equal to
<= less than or equal to
>= greater than or equal to
!= not equal to
in.NA = is NA
!is.na = is not NA

type/execute:

?Comparison

to get more detailed info!

We can also sample random subsets of your dataframe (for example if your dataframe is massibe and you’d like to explora a small random subset first, or you you’re after resampling for bootstrapping)

Here we sample 100 random rows from the iris dataset, creating a new df (the number of samples needs to be smaller than the samplescontained in your df!):

iris.100 <- sample_n(newIris, 100, replace = FALSE)

Or you select the higest value (say top 3 in Petal.Area)

iris.top <- top_n(newIris, 3, Petal.Area)

2a Removing duplicate rows

df.distinct <- distinct(df)

iris.distinct <- distinct(newIris)

2b Selecting colums (by name)

IrisReduced <- select(newIris, Species, Petal.Area, PL_new)

There’s a lot we can select,see ?select

for example we want a df that only contains information on something that is specified in the names of the columns, using Petals and sepals here:

IrisPetalOnly <- select(newIris, contains("Petal"))
IrisSepalOnly <- select(newIris, contains("Sepal"))

3. Summarising data

The dataset is clean and manageable now, so we can start exploring. For example we’d like to know how large “large” compared to “short” petals are:

newIris %>% 
  filter(PL_new == "ShortPetal") %>%
  summarise(Mean.PL = mean(Petal.Length))

##    Mean.PL
## 1 3.007407

this would require a separate operation for the LongPetal, OR: use group_by:

 newIris %>%
  group_by(PL_new) %>%
  summarise(Mean.PL = mean(Petal.Length, na.rm = TRUE))

## # A tibble: 2 × 2
##   PL_new     Mean.PL
##   <chr>        <dbl>
## 1 LongPetal     5.69
## 2 ShortPetal    3.01

of course we can do much more than getting means, such as max, min, med, sd, se, IQR, n, …

This links to the ggplot 2 workshop - especially if we’d like to lot values such as means and SEs.

# summary stats required for barchart

iris.barplot <- iris %>% 
    group_by(Species) %>% 
  summarise(meanPetal.Length = mean(Petal.Length), SEPetal.Length =
  sd(Petal.Length)/sqrt(length(Petal.Length)), lowerErr =(meanPetal.Length -
SEPetal.Length), upperErr = (meanPetal.Length + SEPetal.Length))

iris.barplot

## # A tibble: 3 × 5
##   Species    meanPetal.Length SEPetal.Length lowerErr upperErr
##   <fct>                 <dbl>          <dbl>    <dbl>    <dbl>
## 1 setosa                 1.46         0.0246     1.44     1.49
## 2 versicolor             4.26         0.0665     4.19     4.33
## 3 virginica              5.55         0.0780     5.47     5.63

library(ggplot2)

ggplot(iris.barplot, aes(x = Species, y = meanPetal.Length)) + geom_col(width
= 0.5) + geom_errorbar(aes(ymin = lowerErr, ymax = upperErr), width = 0.1) +
theme_bw()

4. Combining datasets

You may have measured and recorded different things on the same specimen - they’re now in different datasets, but there’s a “unifying” variable (such as ID number).

Let’s stay with our iris dataset and split it in two, so that we have something to combine! ;-)

setosa <- subset(iris, Species == "setosa") 
other <- subset(iris, Species != "setosa")

# Add a common identifier 
setosa$ID <- 1:nrow(setosa)
other$ID <- 1:nrow(other)

There are many different ways to combine.

Base R: rbind()
Caveat: the order of the columns must match!! It does here so let’s try

combined_rbind <- rbind(setosa, other)

dplyr: bind_rows (order of columns doesn’t strictly matter):

combined_bind_rows <- bind_rows(setosa, other)

If we’ve created summary stats and would like to merge them back into our original dataset, let’s create a new Sepal length mean value

 meanSL <- newIris %>%
  group_by(Species) %>%
  summarise(Mean.SL = mean(Sepal.Length, na.rm = TRUE))

now let’s merge the new mean

merged_iris <- merge(iris, meanSL, by = "Species")
head(merged_iris)

##   Species Sepal.Length Sepal.Width Petal.Length Petal.Width Mean.SL
## 1  setosa          5.1         3.5          1.4         0.2   5.006
## 2  setosa          4.9         3.0          1.4         0.2   5.006
## 3  setosa          4.7         3.2          1.3         0.2   5.006
## 4  setosa          4.6         3.1          1.5         0.2   5.006
## 5  setosa          5.0         3.6          1.4         0.2   5.006
## 6  setosa          5.4         3.9          1.7         0.4   5.006

This was actually bas R - we can do this with dplyr, too.

For this we’re using “left-join”. This function keeps all rows from the “first” dataframe, and preserves their order. This is particularly useful if you have one “main” dataframe that you’d like to add addional variables to.

merged_dplyr <- iris %>%
  left_join(meanSL, by = "Species")

Note: For particularly messy data (i.e. many of the IDs not overlapping in both datasets), you can look into

inner_join() this will drop all data where the identifier is not present in both datasets

right_join() like left join, but preserve the second dataframe

full_join() you guessed it - this preserves all data

5. Reshaping

For some visualisations we may need the change the format of the dataframe.

As we’re sticking with the iris dataset, let’s create a “long” format from the original one, after adding and ID in

irisID  <- iris %>%
  mutate(Flower_ID = 1:n())  # adding an ID


iris_long <- irisID %>%
  pivot_longer(
    cols = -c("Species", "Flower_ID"),  # Columns to pivot (all except Species and Flower ID)
    names_to = "Measurement", # Name of the new column for measurement type
    values_to = "Value"  #Name of the new column for the measurement value
  )

lib what could that be useful for?

ggplot(iris_long, aes(x = Measurement, y = Value, fill = Species)) +
  geom_violin() +
  labs(title = "Measurement Type and Species",
       x = "Measurement Type",
       y = "Value (cm)",
       fill = "Species") +
  theme_minimal()

The plot combines ALL measurements and shows species by colour; a similar visualisation in the original “iris” dataset would require 3 different plots.

… and finally, let’s practice to get from a long to a wide format:

iris_wide <- iris_long %>%
  pivot_wider(
    id_cols = c(Species, Flower_ID),          # Columns that will remain as identifiers
    names_from = Measurement,    # Column whose unique values will become new column names
    values_from = Value         # Column whose values will populate the new columns
  )

Bot3 wrangling

Susi Zajitschek

2025-05-08