This homework introduces you to R, R
Markdown, and the tidyverse — a collection of
R packages designed around a consistent, readable grammar for data
manipulation and visualization. You will use the built-in
iris dataset, which contains measurements of 150 flowers
across three species of iris.
By the end of this assignment you will be able to:
- Write and knit an .Rmd document
- Use the dplyr verbs filter(), select(), mutate(), arrange(), group_by(), and summarise()
- Build plots with ggplot2

Submit both files to Moodle:
- .Rmd source file — named asn0_YOUR_NAME.Rmd
- .html output — named asn0_YOUR_NAME.html

To knit, click the Knit button at the top of the RStudio editor (or press Ctrl/Cmd + Shift + K). If your document knits without errors, you are good to go.
Getting help: If a code chunk produces a red error message, read the last line first — it usually tells you exactly what went wrong. Common issues are a missing package (could not find function) or a typo in a variable name (object not found).
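The two error messages mentioned above can be reproduced safely with tryCatch(), which captures the message as text instead of stopping the script. This is only a small sketch: the helper get_error() and the names not_a_real_function and undefined_object are made up for illustration, and the exact wording assumes R is running with English messages.

```r
# Hypothetical helper: return the error message as text instead of halting.
get_error <- function(expr) {
  tryCatch(
    {
      expr          # the expression is evaluated here, inside tryCatch()
      "no error"
    },
    error = function(e) conditionMessage(e)
  )
}

# Calling a function that does not exist (for example, its package is not loaded)
msg_function <- get_error(not_a_real_function(1))

# Referring to a variable that was never created (for example, a typo)
msg_object <- get_error(undefined_object + 1)

msg_function
msg_object
```

Reading the message text this way is mainly useful for practice; in a real chunk the same text appears in red at the end of the error output.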
An R Markdown file (.Rmd) weaves together three
things:
- YAML header: the block between the --- lines at the top of the file. It sets the title, author, date, and output format. You already filled in your name above.
- Markdown text: ordinary prose. Write bold with **double asterisks**, italic with *single asterisks*, and create headers with # symbols.
- Code chunks: start with ```{r} and end with ```. R executes the code and prints the result directly below the chunk in your output.

The chunk below prints the text “Hello from R!” Run it by clicking the green play button on the right side of the chunk, or pressing Ctrl/Cmd + Enter with your cursor inside it.
## [1] "Hello from Tan :)"
✏️ Task 1.2 — Edit the chunk above to print your own name instead of “Hello from R!”. Knit the document and confirm your name appears in the output.
R evaluates expressions directly. Run the chunk below to see basic arithmetic.
# Lines starting with # are comments — R ignores them
# They are for human readers only
2 + 2 # addition
## [1] 4
## [1] 3.333333
## [1] 256
## [1] 12
## [1] 3
print("Yes, the value that I got (3) from R matched with the actual value we expect from logarithms, 10^3 = 1000")
## [1] "Yes, the value that I got (3) from R matched with the actual value we expect from logarithms, 10^3 = 1000"
✏️ Task 1.3 — Add one more line to the chunk above
that calculates log10(1000). What value do you get, and
does it match what you know about logarithms?
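As a rough illustration of the arithmetic discussed above, the sketch below shows one expression per common operator. Only 2 + 2 and log10(1000) come from the text; the other expressions are assumptions chosen to give values like those printed in the output, since the original lines are not shown.

```r
# Basic arithmetic operators in R
total      <- 2 + 2        # addition
quotient   <- 10 / 3       # division (assumed example)
power      <- 2^8          # exponentiation (assumed example)
root       <- sqrt(144)    # square root (assumed example)
log_result <- log10(1000)  # base-10 logarithm

# log10() undoes powers of ten, so raising 10 to the result recovers 1000
check <- 10^log_result

c(total, quotient, power, root, log_result, check)
```

The last line is a quick way to confirm the logarithm: if log10(1000) is 3, then 10^3 must give back 1000.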
In R, you store values in variables using the <- or = operator (read <- as “gets”). Variable names are case-sensitive: MyVar and myvar are different variables.
species_count <- 3 # store the number 3
petal_unit <- "cm" # store a text string
# Print them
species_count
## [1] 3
## [1] "cm"
## [1] "Iris has 3 species measured in cm"
## [1] "I am learning R in 2026"
✏️ Task 1.4 — Create a variable called
my_year that stores the current year as a number. Then
write a second line that prints the sentence: “I am learning R in
YEAR” using paste().
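The lines that produced the pasted sentence in the output above are not shown, so the following is a hedged sketch of how paste() and case sensitivity behave, reusing the two variables from the chunk. The exact paste() call is an assumption.

```r
# Assignment with <- (the usual R style)
species_count <- 3
petal_unit <- "cm"

# paste() joins its arguments into one string, separated by spaces by default
sentence <- paste("Iris has", species_count, "species measured in", petal_unit)
sentence

# Names are case-sensitive: Species_Count was never created
exists("species_count")
exists("Species_Count")
```

Note that paste() converts the number 3 to text automatically, which is why a numeric variable can be mixed with character strings.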
Packages extend R’s built-in capabilities. The
tidyverse is actually a bundle of several packages —
dplyr for data manipulation, ggplot2 for
plots, tidyr for reshaping data, and more.
You only need to install a package once per computer. You
need to load it every session with library().
# If you have not installed the tidyverse yet, run this line ONCE in your
# Console (not in a chunk):
# install.packages("tidyverse")
library(tidyverse)
Note: You will see some messages about which packages were loaded and which functions are “masked”. This is normal — it just means tidyverse functions like filter() now take priority over base R functions with the same name.
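To make the install-once, load-every-session distinction and the idea of masking more concrete, here is a small sketch using only base R. The moving-average call is an illustrative assumption; it is there only to show that the base filter() remains reachable through its namespace prefix after dplyr masks the bare name.

```r
# Check whether a package is installed without loading it.
# requireNamespace() returns TRUE or FALSE instead of raising an error.
has_tidyverse <- requireNamespace("tidyverse", quietly = TRUE)

if (has_tidyverse) {
  message("tidyverse is installed; load it with library(tidyverse)")
} else {
  message('tidyverse is missing; run install.packages("tidyverse") once in the Console')
}

# Masking only changes which function a bare name refers to.
# The base version is still available as stats::filter():
moving_avg <- stats::filter(1:5, rep(1 / 3, 3))  # 3-point moving average
moving_avg
```

In the same way, dplyr::filter() always refers to the dplyr version regardless of the order in which packages were loaded.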
iris is built into R — no file to download. Just type
the name to access it.
## [1] 150 5
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
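The chunk that produced this output is not shown. A minimal sketch of what it might look like follows, assuming iris_data is simply a copy of the built-in iris data frame (the name iris_data is used throughout the assignment, but how it was created is an assumption here).

```r
# Copy the built-in dataset into a working variable (assumed setup step)
iris_data <- iris

dim(iris_data)             # number of rows, then number of columns
str(iris_data)             # compact overview: column names, types, first values
names(iris_data)           # just the column names
levels(iris_data$Species)  # the categories of the factor column
```

Working on a copy keeps the original iris object untouched when columns are added later.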
✏️ Task 2.2 — Answer the following questions as a
bulleted list in the text block below. Use the output of
dim() and str() — do not Google the
answers.
Your answers:
How many rows (observations) does the dataset have?
There are 150 rows
How many columns (variables)?
There are 5 columns
What are the names of the five columns?
Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species
What data type is the Species column? (hint: look at
str() output — it will say Factor,
num, chr, etc.)
Species is a factor
What are the three species in the dataset? (hint:
levels(iris_data$Species))
setosa, versicolor, and virginica
## [1] "setosa" "versicolor" "virginica"
$ operator and basic summaries
The $ operator extracts a single column from a data frame as a vector.
## [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
## [19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
## [37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
## [55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
## [73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
## [91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
## [109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2
## [127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
## [145] 6.7 6.7 6.3 6.5 6.2 5.9
## [1] 5.843333
## [1] 5.8
## [1] 4.3 7.9
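The values printed above are consistent with calls like the ones below. The original chunk is not shown, so treat this as a sketch; it assumes iris_data is a copy of the built-in iris data.

```r
iris_data <- iris  # assumed setup: working copy of the built-in data

# $ pulls one column out of the data frame as a plain vector
sepal_length <- iris_data$Sepal.Length

is.vector(sepal_length)  # a numeric vector, not a data frame
length(sepal_length)     # one value per flower

mean(sepal_length)
median(sepal_length)
range(sepal_length)      # smallest and largest value
```

Because the result is an ordinary vector, any function that accepts a numeric vector (sd(), min(), max(), and so on) can be applied in the same way.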
The summary() function gives you a quick overview of
every column at once.
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
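summary() also works on a single column, and its result can be indexed by name. The sketch below assumes iris_data is a copy of the built-in iris data; the variable name petal_summary is made up for illustration.

```r
iris_data <- iris  # assumed setup: working copy of the built-in data

# Summary of one numeric column
petal_summary <- summary(iris_data$Petal.Length)
petal_summary

# Individual statistics can be picked out by their labels
petal_summary["Median"]
petal_summary["Max."]

# For a factor column, summary() counts the rows in each level
summary(iris_data$Species)
```

This can be handy when a single statistic is needed in a sentence or a later calculation rather than the whole printed table.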
✏️ Task 2.3 — Using summary() output or
$ with a function, answer in the text below:
Your answers:
dplyr provides six core “verbs” that cover most data
manipulation tasks. They all take a data frame as their first argument
and return a data frame, which means you can chain them together with
the pipe operator %>% (read as
“then”).
data %>% verb1(...) %>% verb2(...) %>% verb3(...)
Each verb does one thing clearly — this makes your code readable almost like a sentence.
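To see why chaining reads more naturally, the sketch below writes the same three steps twice: once as nested function calls and once with the pipe. It uses the built-in iris data directly; the particular steps (setosa only, sorted by sepal length, top three rows) are an arbitrary example.

```r
library(dplyr)

# Nested form: has to be read from the inside out
nested <- head(arrange(filter(iris, Species == "setosa"), desc(Sepal.Length)), 3)

# Piped form: reads top to bottom as
# "take iris, then filter, then arrange, then take the first rows"
piped <- iris %>%
  filter(Species == "setosa") %>%
  arrange(desc(Sepal.Length)) %>%
  head(3)

# Both forms give the same data frame
identical(nested, piped)
piped
```

The pipe simply passes the result on its left as the first argument of the function on its right, so the two versions are the same computation written in a different order.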
filter() — keep rows that match a condition
# Keep only rows where the species is setosa
setosa_only <- iris_data %>%
  filter(Species == "setosa")
# How many rows remain?
nrow(setosa_only)
## [1] 50
# You can combine conditions with & (AND) or | (OR)
# Keep setosa flowers with sepal length greater than 5
setosa_long <- iris_data %>%
  filter(Species == "setosa" & Sepal.Length > 5)
nrow(setosa_long)
## [1] 22
✏️ Task 3.1 — Write a filter that keeps only
virginica flowers with a petal width greater than 2.0.
Store the result in a variable called virginica_wide. Print
the result and report how many rows it contains.
How many virginica flowers have petal width > 2.0? (write your answer here)
select() — keep or drop columns
✏️ Task 3.2 — Create a new data frame called
petal_data that contains only Species,
Petal.Length, and Petal.Width. Display the
first 6 rows using head().
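No example chunk for select() appears in this section, so here is a hedged sketch of three common ways to use it. It works on the built-in iris data and uses sepal columns so that it does not pre-empt the task; the variable names are made up for illustration.

```r
library(dplyr)

# Keep columns by name, in the order listed
sepal_data <- iris %>%
  select(Species, Sepal.Length, Sepal.Width)

# Drop a column with a minus sign
no_species <- iris %>%
  select(-Species)

# Helper functions match columns by pattern
petal_cols <- iris %>%
  select(starts_with("Petal"))

names(sepal_data)
names(no_species)
names(petal_cols)
```

select() never changes the rows; only the set and order of columns differ between these results.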
mutate() — add or transform columns
mutate() creates new columns or overwrites existing ones. The new column is computed row by row.
# Create a new column: petal area (approximated as length × width)
iris_data <- iris_data %>%
  mutate(Petal.Area = Petal.Length * Petal.Width)
# Check it was added
head(iris_data)
# Create a categorical column: flag large vs small sepals
iris_data <- iris_data %>%
  mutate(Sepal.Size = ifelse(Sepal.Length > 5.8, "large", "small"))
head(iris_data %>% select(Species, Sepal.Length, Sepal.Size))
✏️ Task 3.3 — Add a new column to
iris_data called Sepal.Ratio that is
Sepal.Length divided by Sepal.Width. Then
display only Species, Sepal.Length,
Sepal.Width, and Sepal.Ratio for the first 8
rows.
arrange() — sort rows
# Sort by petal length, largest first (desc = descending)
iris_data %>%
  arrange(desc(Petal.Length)) %>%
  head(5)
✏️ Task 3.4 — Find the 5 flowers with the largest
Sepal.Ratio (the column you made in Task 3.3). Display only
their Species, Sepal.Length,
Sepal.Width, and Sepal.Ratio columns, sorted
from largest to smallest ratio.
Which species dominates the top 5? (write your answer here)
group_by() + summarise() — compute group statistics
This is the most powerful combination in dplyr.
group_by() splits the data into invisible groups;
summarise() collapses each group into a single row of
summary statistics.
# Mean sepal length by species
iris_data %>%
  group_by(Species) %>%
  summarise(mean_sepal_length = mean(Sepal.Length))
# Multiple summary statistics at once
species_summary <- iris_data %>%
  group_by(Species) %>%
  summarise(
    n = n(), # count rows per group
    mean_petal_l = mean(Petal.Length),
    sd_petal_l = sd(Petal.Length),
    max_petal_w = max(Petal.Width)
  )
species_summary
✏️ Task 3.5 — Create a summary table called
sepal_summary that shows, for each species: the mean,
minimum, and maximum of Sepal.Ratio (from Task 3.3). Round
all numeric columns to 2 decimal places using
round(..., 2).
In 1–2 sentences, which species has the most variable sepal ratio, and how can you tell from your table? (write your answer here)
These verbs become powerful when chained. Here is a multi-step pipeline written as a readable sequence:
# Start with all data, then:
# 1. Keep only versicolor and virginica
# 2. Select the columns we care about
# 3. Add a petal ratio column
# 4. Summarise by species
iris_data %>%
  filter(Species %in% c("versicolor", "virginica")) %>%
  select(Species, Petal.Length, Petal.Width) %>%
  mutate(Petal.Ratio = round(Petal.Length / Petal.Width, 2)) %>%
  group_by(Species) %>%
  summarise(
    mean_petal_ratio = mean(Petal.Ratio),
    sd_petal_ratio = sd(Petal.Ratio)
  )
✏️ Task 3.6 — Write your own multi-step pipeline
(minimum 3 verbs chained with %>%) that answers a
question of your choice about the iris dataset. State your question in
the text block below, write the pipeline, and write one sentence
interpreting the result.
My question: (write it here)
Interpretation: (write one sentence here)
ggplot2 builds plots layer by layer. Every plot starts
with ggplot(), which sets up the coordinate system and maps
columns to visual properties (aesthetics —
x, y, colour, size,
etc.). You then add geometry layers
(geom_point, geom_boxplot, etc.) with
+.
ggplot(data, aes(x = col1, y = col2, colour = col3)) +
  geom_point() +
  labs(title = "My plot")

ggplot(iris_data, aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  geom_point(size = 2, alpha = 0.7) +
  labs(
    title = "Sepal dimensions by species",
    x = "Sepal length (cm)",
    y = "Sepal width (cm)",
    colour = "Species"
  ) +
  theme_minimal()
✏️ Task 4.1 — Copy the scatter plot above and modify
it to plot Petal.Length vs Petal.Width instead
of sepal dimensions. Update the axis labels and title accordingly.
In one sentence, what pattern do you notice about the separation of species in petal space compared to sepal space? (write your answer here)
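Box plots show the distribution of a numeric variable across groups. The box spans the interquartile range (IQR, from Q1 to Q3) and the line inside it is the median. Each whisker extends to the most extreme observation that lies within 1.5 × IQR of the box edge; points beyond the whiskers are drawn individually as outliers.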
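The numbers behind a box plot can be computed by hand, which may help when interpreting one. The sketch below does this for versicolor petal length using base R only. It assumes the default quantile() method, so the quartiles may differ very slightly from those used by other box plot functions; all variable names are made up for illustration.

```r
# Petal lengths for one species, taken from the built-in data
petal_length <- iris$Petal.Length[iris$Species == "versicolor"]

q1  <- quantile(petal_length, 0.25, names = FALSE)
q3  <- quantile(petal_length, 0.75, names = FALSE)
iqr <- q3 - q1  # same value as IQR(petal_length)

# Fences: observations outside these limits are drawn as individual points
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Whiskers stop at the most extreme observations inside the fences
whisker_low  <- min(petal_length[petal_length >= lower_fence])
whisker_high <- max(petal_length[petal_length <= upper_fence])
outliers <- petal_length[petal_length < lower_fence | petal_length > upper_fence]

c(q1 = q1, median = median(petal_length), q3 = q3)
c(whisker_low = whisker_low, whisker_high = whisker_high)
outliers
```

Note that the whiskers end at actual data values, not at the fences themselves, which is why whisker lengths on the two sides of a box are often unequal.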
ggplot(iris_data, aes(x = Species, y = Petal.Length, fill = Species)) +
  geom_boxplot(alpha = 0.6) +
  labs(
    title = "Petal length by species",
    x = "Species",
    y = "Petal length (cm)"
  ) +
  theme_minimal() +
  theme(legend.position = "none") # legend is redundant when x is already species
✏️ Task 4.2 — Create a box plot showing the
distribution of Sepal.Ratio (from Task 3.3) by species. Add
a meaningful title and axis labels.
ggplot(iris_data, aes(x = Petal.Area, fill = Species)) +
  geom_histogram(bins = 20, alpha = 0.6, position = "identity") +
  labs(
    title = "Distribution of petal area by species",
    x = "Petal area (cm²)",
    y = "Count"
  ) +
  theme_minimal()
✏️ Task 4.3 — Create a histogram of
Sepal.Length. Set bins = 15 and colour the
bars by species. Describe the distribution in one sentence — is it
roughly normal, skewed, or multimodal?
Description: (write your answer here)
Because %>% passes a data frame forward, you can pipe
directly into ggplot() without creating intermediate
variables.
# Summarise then plot in one pipeline
iris_data %>%
  group_by(Species) %>%
  summarise(mean_petal_length = mean(Petal.Length),
            se = sd(Petal.Length) / sqrt(n())) %>%
  ggplot(aes(x = Species, y = mean_petal_length, fill = Species)) +
  geom_col(alpha = 0.8) +
  geom_errorbar(aes(ymin = mean_petal_length - se,
                    ymax = mean_petal_length + se),
                width = 0.2) +
  labs(
    title = "Mean petal length ± SE by species",
    x = "Species",
    y = "Mean petal length (cm)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")
✏️ Task 4.4 — Adapt the pipeline above to plot mean
Sepal.Ratio ± SE by species as a bar chart with error bars.
Update the title and y-axis label.
This section practises writing well-formatted Rmd documents.
You can embed R results directly in prose using backtick-r syntax:
`r expression`. The value is computed when you knit and
inserted into the text automatically — no copy-pasting numbers.
For example, the sentence below is written as:
The iris dataset contains `r nrow(iris_data)` rows.
And knits to:
The iris dataset contains 150 rows.
✏️ Task 5.1 — Complete the sentence below using inline code so that the numbers are computed automatically, not typed manually.
The three species in the iris dataset are setosa, (add the other two using inline code), and (add the third). The mean sepal length across all species is (add inline code here) cm.
Chunk options control what appears in the output. They go inside the
{r} header.
| Option | Default | Effect |
|---|---|---|
| echo = FALSE | TRUE | Hides the code but shows the output |
| eval = FALSE | TRUE | Shows the code but does not run it |
| include = FALSE | TRUE | Runs silently — hides both code and output |
| fig.width, fig.height | 7, 5 | Sets plot dimensions in inches |
| fig.cap = "..." | none | Adds a figure caption |
✏️ Task 5.2 — The chunk below contains a plot. Add chunk options so that:
- The code is hidden from the output (only the plot shows)
- The plot is 6 inches wide and 4 inches tall
- It has the caption: “Figure 1: Petal length vs petal width coloured by species.”
ggplot(iris_data, aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
  geom_point(alpha = 0.7) +
  theme_minimal()
knitr::kable()
R data frames print as plain text by default. knitr::kable() renders them as formatted tables in the HTML output.
library(knitr)
species_summary %>%
  kable(
    caption = "Table 1: Summary statistics by species",
    digits = 2,
    col.names = c("Species", "N", "Mean petal length",
                  "SD petal length", "Max petal width")
  )
| Species | N | Mean petal length | SD petal length | Max petal width |
|---|---|---|---|---|
| setosa | 50 | 1.46 | 0.17 | 0.6 |
| versicolor | 50 | 4.26 | 0.47 | 1.8 |
| virginica | 50 | 5.55 | 0.55 | 2.5 |
✏️ Task 5.3 — Take your sepal_summary
table from Task 3.5 and render it with kable(). Give it a
caption and clean column names.
Answer the following questions in complete sentences below. Each answer should be 2–4 sentences. You may reference tables or plots you produced above.
Q1. Looking at the sepal scatter plot and your petal scatter plot from Task 4.1, which pair of measurements (sepal or petal dimensions) better separates the three species, and why might this be biologically meaningful?
Your answer:
Q2. Based on your sepal_summary table
(Task 3.5) and your box plot (Task 4.2), which species has the most
consistent sepal shape (length-to-width ratio), and which is most
variable? What does high variability in this ratio suggest about the
flower’s morphology?
Your answer:
Q3. In your own words, explain what the pipe
operator %>% does and why it makes code easier to read
compared to nesting functions inside each other.
Your answer:
Q4. Describe one thing about working in R Markdown that you found confusing or unexpected, and explain how you resolved it (or what you would do to troubleshoot it).
Your answer:
Before submitting, confirm all of the following:
BIOL341 — Assignment 0