Overview

This homework introduces you to R, R Markdown, and the tidyverse — a collection of R packages designed around a consistent, readable grammar for data manipulation and visualization. You will use the built-in iris dataset, which contains measurements of 150 flowers across three species of iris.

By the end of this assignment you will be able to:

  • Write and run R code inside a .Rmd document
  • Load packages and inspect a dataset
  • Use core dplyr verbs: filter(), select(), mutate(), arrange(), group_by(), and summarise()
  • Create plots with ggplot2
  • Write narrative text around your code and knit to HTML

Submission instructions

Submit both files to Moodle:

  1. Your .Rmd source file — named asn0_YOUR_NAME.Rmd
  2. Your knitted .html output — named asn0_YOUR_NAME.html

To knit, click the Knit button at the top of the RStudio editor (or press Ctrl/Cmd + Shift + K). If your document knits without errors, you are good to go.

Getting help: If a code chunk produces a red error message, read the last line first — it usually tells you exactly what went wrong. Common issues are a missing package (could not find function) or a typo in a variable name (object not found).
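
The two error messages mentioned above can be reproduced safely with a small sketch like the one below. The names undefined_function_xyz and undefined_object_xyz are made up for illustration, and tryCatch() is used only so that the chunk captures each message instead of stopping.

```r
# Calling a function that does not exist (hypothetical name)
msg_function <- tryCatch(
  undefined_function_xyz(1),
  error = function(e) conditionMessage(e)
)

# Referring to a variable that does not exist (hypothetical name)
msg_object <- tryCatch(
  undefined_object_xyz,
  error = function(e) conditionMessage(e)
)

msg_function   # message mentions the missing function
msg_object     # message mentions the missing object
```

In both cases the message names the thing R could not find, which is usually the quickest clue to a missing library() call or a misspelled name.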


Part 1 — R Markdown basics

1.1 What is R Markdown?

An R Markdown file (.Rmd) weaves together three things:

  • YAML header — the block at the very top between the --- lines. It sets the title, author, date, and output format. You already filled in your name above.
  • Narrative text — plain prose, written in Markdown. This sentence is an example. You can make text bold with **double asterisks**, italic with *single asterisks*, and create headers with # symbols.
  • Code chunks — fenced blocks that start with ```{r} and end with ```. R executes the code and prints the result directly below the chunk in your output.

1.2 Your first code chunk

The chunk below prints the text “Hello from R!” Run it by clicking the green play button on the right side of the chunk, or pressing Ctrl/Cmd + Enter with your cursor inside it.

print("Hello from Tan :)")
## [1] "Hello from Tan :)"

✏️ Task 1.2 — Edit the chunk above to print your own name instead of “Hello from R!”. Knit the document and confirm your name appears in the output.

1.3 R as a calculator

R evaluates expressions directly. Run the chunk below to see basic arithmetic.

# Lines starting with # are comments — R ignores them
# They are for human readers only

2 + 2          # addition
## [1] 4
10 / 3         # division (notice R gives you decimals)
## [1] 3.333333
2^8            # exponentiation: 2 to the power of 8
## [1] 256
sqrt(144)      # square root — this is a function call
## [1] 12
log10(1000) #Added line
## [1] 3
print("Yes, the value that I got (3) from R matched with the actual value we expect from logarithms, 10^3 = 1000")
## [1] "Yes, the value that I got (3) from R matched with the actual value we expect from logarithms, 10^3 = 1000"

✏️ Task 1.3 — Add one more line to the chunk above that calculates log10(1000). What value do you get, and does it match what you know about logarithms?

1.4 Variables and assignment

In R, you store values in variables using the assignment operator <- (read it as “gets”). The = operator also works for assignment, but <- is the more common convention. Variable names are case-sensitive: MyVar and myvar are different.

species_count <- 3          # store the number 3
petal_unit    <- "cm"       # store a text string

# Print them
species_count
## [1] 3
petal_unit
## [1] "cm"
# Use them in an expression
paste("Iris has", species_count, "species measured in", petal_unit)
## [1] "Iris has 3 species measured in cm"
my_year = 2026
paste("I am learning R in", my_year)
## [1] "I am learning R in 2026"
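
As a minimal sketch of the two points above (case sensitivity and the two assignment operators), the chunk below uses made-up variable names purely for illustration.

```r
# Two different variables: names differ only in capitalisation
MyVar <- 1
myvar <- 2
same_value <- MyVar == myvar   # FALSE, because they hold different values

# Both operators assign a value; the results are identical
year_arrow  <- 2026
year_equals = 2026
same_year <- identical(year_arrow, year_equals)   # TRUE

same_value
same_year
```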

✏️ Task 1.4 — Create a variable called my_year that stores the current year as a number. Then write a second line that prints the sentence: “I am learning R in YEAR” using paste().


Part 2 — Loading packages and exploring the data

2.1 Installing and loading the tidyverse

Packages extend R’s built-in capabilities. The tidyverse is actually a bundle of several packages — dplyr for data manipulation, ggplot2 for plots, tidyr for reshaping data, and more.

You only need to install a package once per computer. You need to load it every session with library().

# If you have not installed the tidyverse yet, run this line ONCE in your
# Console (not in a chunk):
#   install.packages("tidyverse")

library(tidyverse)

Note: You will see some messages about which packages were loaded and which functions are “masked”. This is normal. It means that a tidyverse function such as dplyr's filter() now takes priority over a function of the same name that ships with R (here, stats::filter()).
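
To make the idea of masking concrete, here is a small sketch that calls both functions explicitly with the package::function() syntax. It assumes dplyr is installed; the variable names are illustrative only.

```r
# dplyr's filter() keeps the rows of a data frame that match a condition
setosa_rows <- dplyr::filter(iris, Species == "setosa")
nrow(setosa_rows)

# stats::filter() is an unrelated function: it applies a linear filter
# (here a 3-point moving average) to a numeric series
moving_avg <- stats::filter(1:5, rep(1 / 3, 3))
moving_avg
```

Writing the package name in front removes any ambiguity about which filter() is being called, whatever has been loaded.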

2.2 Loading and inspecting the iris dataset

iris is built into R — no file to download. Just type the name to access it.

# Load the dataset into a variable
iris_data <- iris

# How many rows and columns?
dim(iris_data)
## [1] 150   5
# Look at the first 6 rows
head(iris_data)
# Show the structure: column names, types, and a preview of values
str(iris_data)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

✏️ Task 2.2 — Answer the following questions as a bulleted list in the text block below. Use the output of dim() and str() — do not Google the answers.

Your answers:

  • How many rows (observations) does the dataset have?

    There are 150 rows

  • How many columns (variables)?

    There are 5 columns

  • What are the names of the five columns?

    Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species

  • What data type is the Species column? (hint: look at str() output — it will say Factor, num, chr, etc.)

    Species has the Factor data type

  • What are the three species in the dataset? (hint: levels(iris_data$Species))

    setosa, versicolor, and virginica

# Use this chunk to check the species levels
levels(iris_data$Species)
## [1] "setosa"     "versicolor" "virginica"

2.3 The $ operator and basic summaries

The $ operator extracts a single column from a data frame as a vector.

# Extract the Sepal.Length column
iris_data$Sepal.Length
##   [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
##  [19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
##  [37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
##  [55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
##  [73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
##  [91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
## [109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2
## [127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
## [145] 6.7 6.7 6.3 6.5 6.2 5.9
# Basic descriptive statistics
mean(iris_data$Sepal.Length)
## [1] 5.843333
median(iris_data$Sepal.Length)
## [1] 5.8
range(iris_data$Sepal.Length)
## [1] 4.3 7.9
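
As a brief sketch of what “extracts a column as a vector” means, the chunk below compares $ with two other ways of indexing. It uses the built-in iris data frame directly, and the variable names are illustrative.

```r
# $ and [[ ]] both return the column itself, as a plain numeric vector
by_dollar  <- iris$Sepal.Length
by_bracket <- iris[["Sepal.Length"]]
identical(by_dollar, by_bracket)

is.numeric(by_dollar)
length(by_dollar)

# Single brackets return a one-column data frame instead of a vector
one_column <- iris["Sepal.Length"]
class(one_column)
```

Functions such as mean() and range() expect a vector, which is why the examples above use $.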

The summary() function gives you a quick overview of every column at once.

summary(iris_data)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

✏️ Task 2.3 — Using summary() output or $ with a function, answer in the text below:

Your answers:

  • What is the mean petal length across all species?
  • Which column has the greatest range (max minus min)?

Part 3 — dplyr: manipulating data frames

dplyr provides six core “verbs” that cover most data manipulation tasks. They all take a data frame as their first argument and return a data frame, which means you can chain them together with the pipe operator %>% (read as “then”).

data %>% verb1(...) %>% verb2(...) %>% verb3(...)

Each verb does one thing clearly — this makes your code readable almost like a sentence.
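
To illustrate the chaining idea, here is a sketch that writes the same three steps twice: once as nested function calls and once as a pipeline. It assumes dplyr is installed; the object names are illustrative.

```r
library(dplyr)

# Nested calls: read from the inside out
nested <- head(arrange(filter(iris, Species == "setosa"), Sepal.Length), 3)

# Piped calls: read from top to bottom
piped <- iris %>%
  filter(Species == "setosa") %>%
  arrange(Sepal.Length) %>%
  head(3)

# Both versions produce the same data frame
identical(nested, piped)
```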

3.1 filter() — keep rows that match a condition

# Keep only rows where the species is setosa
setosa_only <- iris_data %>%
  filter(Species == "setosa")

# How many rows remain?
nrow(setosa_only)
## [1] 50
# You can combine conditions with & (AND) or | (OR)
# Keep setosa flowers with sepal length greater than 5
setosa_long <- iris_data %>%
  filter(Species == "setosa" & Sepal.Length > 5)

nrow(setosa_long)
## [1] 22
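
The example above uses & only. As a hedged sketch of the other options, the chunk below shows an OR condition with | and the comma shorthand for AND. It assumes dplyr is installed and works on iris directly; the object names are illustrative.

```r
library(dplyr)

# OR: keep a row if either condition is true
setosa_or_long <- iris %>%
  filter(Species == "setosa" | Sepal.Length > 7)
nrow(setosa_or_long)

# Conditions separated by a comma are combined with AND,
# so this is equivalent to using & inside filter()
setosa_and_long <- iris %>%
  filter(Species == "setosa", Sepal.Length > 5)
nrow(setosa_and_long)
```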

✏️ Task 3.1 — Write a filter that keeps only virginica flowers with a petal width greater than 2.0. Store the result in a variable called virginica_wide. Print the result and report how many rows it contains.

# Your code here

How many virginica flowers have petal width > 2.0? (write your answer here)

3.2 select() — keep or drop columns

# Keep only three columns
iris_data %>%
  select(Species, Petal.Length, Petal.Width) %>%
  head()
# Drop columns using a minus sign
iris_data %>%
  select(-Sepal.Length, -Sepal.Width) %>%
  head()
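
Besides listing columns one by one, select() also accepts helper functions that match column names by pattern. The sketch below assumes dplyr is installed and uses two of these helpers, starts_with() and ends_with(); the object names are illustrative.

```r
library(dplyr)

# Keep Species plus every column whose name starts with "Sepal"
sepal_cols <- iris %>%
  select(Species, starts_with("Sepal"))
names(sepal_cols)

# Keep every column whose name ends with "Width"
width_cols <- iris %>%
  select(ends_with("Width"))
names(width_cols)
```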

✏️ Task 3.2 — Create a new data frame called petal_data that contains only Species, Petal.Length, and Petal.Width. Display the first 6 rows using head().

# Your code here

3.3 mutate() — add or transform columns

mutate() creates new columns or overwrites existing ones. The calculation is vectorised: it is applied to whole columns at once, so every row gets its own value.

# Create a new column: petal area (approximated as length × width)
iris_data <- iris_data %>%
  mutate(Petal.Area = Petal.Length * Petal.Width)

# Check it was added
head(iris_data)
# Create a categorical column: flag large vs small sepals
iris_data <- iris_data %>%
  mutate(Sepal.Size = ifelse(Sepal.Length > 5.8, "large", "small"))

head(iris_data %>% select(Species, Sepal.Length, Sepal.Size))
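
Note that the examples above assign the result back to iris_data. The sketch below shows why: mutate() returns a new data frame and leaves the original untouched unless you store the result. It assumes dplyr is installed; flowers and Sepal.Area are hypothetical names used only for this illustration.

```r
library(dplyr)

flowers <- iris   # hypothetical working copy

# The result is computed but not stored anywhere
flowers %>%
  mutate(Sepal.Area = Sepal.Length * Sepal.Width) %>%
  head(3)
has_column_before <- "Sepal.Area" %in% names(flowers)

# Assigning the result back makes the new column permanent
flowers <- flowers %>%
  mutate(Sepal.Area = Sepal.Length * Sepal.Width)
has_column_after <- "Sepal.Area" %in% names(flowers)

has_column_before
has_column_after
```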

✏️ Task 3.3 — Add a new column to iris_data called Sepal.Ratio that is Sepal.Length divided by Sepal.Width. Then display only Species, Sepal.Length, Sepal.Width, and Sepal.Ratio for the first 8 rows.

# Your code here

3.4 arrange() — sort rows

# Sort by petal length, smallest first
iris_data %>%
  arrange(Petal.Length) %>%
  head(10)
# Sort by petal length, largest first (desc = descending)
iris_data %>%
  arrange(desc(Petal.Length)) %>%
  head(5)

✏️ Task 3.4 — Find the 5 flowers with the largest Sepal.Ratio (the column you made in Task 3.3). Display only their Species, Sepal.Length, Sepal.Width, and Sepal.Ratio columns, sorted from largest to smallest ratio.

# Your code here

Which species dominates the top 5? (write your answer here)

3.5 group_by() + summarise() — compute group statistics

This is the most powerful combination in dplyr. group_by() splits the data into invisible groups; summarise() collapses each group into a single row of summary statistics.

# Mean sepal length by species
iris_data %>%
  group_by(Species) %>%
  summarise(mean_sepal_length = mean(Sepal.Length))
# Multiple summary statistics at once
species_summary <- iris_data %>%
  group_by(Species) %>%
  summarise(
    n             = n(),                        # count rows per group
    mean_petal_l  = mean(Petal.Length),
    sd_petal_l    = sd(Petal.Length),
    max_petal_w   = max(Petal.Width)
  )

species_summary
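
One way to check your understanding of the split-then-collapse idea is to compute the same group means with base R's tapply() and compare. This is only a sketch, assuming dplyr is installed; the object names are illustrative.

```r
library(dplyr)

# dplyr: one row per species
by_dplyr <- iris %>%
  group_by(Species) %>%
  summarise(mean_sepal_length = mean(Sepal.Length))

# Base R: apply mean() to Sepal.Length within each species
by_base <- tapply(iris$Sepal.Length, iris$Species, mean)

# The two approaches agree
all.equal(by_dplyr$mean_sepal_length, as.numeric(by_base))
```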

✏️ Task 3.5 — Create a summary table called sepal_summary that shows, for each species: the mean, minimum, and maximum of Sepal.Ratio (from Task 3.3). Round all numeric columns to 2 decimal places using round(..., 2).

# Your code here

In 1–2 sentences, which species has the most variable sepal ratio, and how can you tell from your table? (write your answer here)

3.6 Chaining it all together

These verbs become powerful when chained. Here is a multi-step pipeline written as a readable sequence:

# Start with all data, then:
# 1. Keep only versicolor and virginica
# 2. Select the columns we care about
# 3. Add a petal ratio column
# 4. Summarise by species

iris_data %>%
  filter(Species %in% c("versicolor", "virginica")) %>%
  select(Species, Petal.Length, Petal.Width) %>%
  mutate(Petal.Ratio = round(Petal.Length / Petal.Width, 2)) %>%
  group_by(Species) %>%
  summarise(
    mean_petal_ratio = mean(Petal.Ratio),
    sd_petal_ratio   = sd(Petal.Ratio)
  )

✏️ Task 3.6 — Write your own multi-step pipeline (minimum 3 verbs chained with %>%) that answers a question of your choice about the iris dataset. State your question in the text block below, write the pipeline, and write one sentence interpreting the result.

My question: (write it here)

# Your pipeline here

Interpretation: (write one sentence here)


Part 4 — Visualisation with ggplot2

ggplot2 builds plots layer by layer. Every plot starts with ggplot(), which sets up the coordinate system and maps columns to visual properties (aesthetics: x, y, colour, size, etc.). You then add geometry layers (geom_point, geom_boxplot, etc.) with +.

ggplot(data, aes(x = col1, y = col2, colour = col3)) +
  geom_point() +
  labs(title = "My plot")
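
The layering idea can be sketched by storing a plot in a variable and adding to it step by step. The chunk below assumes ggplot2 is installed; the object names are illustrative.

```r
library(ggplot2)

# Step 1: data and aesthetic mappings only (no geometry yet)
base_plot <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, colour = Species))

# Step 2: add a layer of points
with_points <- base_plot + geom_point()

# Step 3: add labels on top
with_labels <- with_points + labs(title = "Sepal dimensions")

# Count the geometry layers at each step
length(base_plot$layers)
length(with_points$layers)

# Printing a plot object draws it
with_labels
```

Printing base_plot on its own would draw empty axes, because no geometry layer has been added yet.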

4.1 Scatter plot

ggplot(iris_data, aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  geom_point(size = 2, alpha = 0.7) +
  labs(
    title    = "Sepal dimensions by species",
    x        = "Sepal length (cm)",
    y        = "Sepal width (cm)",
    colour   = "Species"
  ) +
  theme_minimal()

✏️ Task 4.1 — Copy the scatter plot above and modify it to plot Petal.Length vs Petal.Width instead of sepal dimensions. Update the axis labels and title accordingly.

# Your modified scatter plot here

In one sentence, what pattern do you notice about the separation of species in petal space compared to sepal space? (write your answer here)

4.2 Box plot

Box plots show the distribution of a numeric variable across groups. The box spans the interquartile range (Q1–Q3), the line inside it is the median, and each whisker extends to the most extreme value within 1.5 × IQR of the box edge. Points beyond the whiskers are drawn individually as outliers.

ggplot(iris_data, aes(x = Species, y = Petal.Length, fill = Species)) +
  geom_boxplot(alpha = 0.6) +
  labs(
    title = "Petal length by species",
    x     = "Species",
    y     = "Petal length (cm)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")   # legend is redundant when x is already species
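
The quantities a box plot displays can also be computed by hand. The sketch below does this for setosa petal lengths using base R only. It is meant to mirror the description above, not to reproduce ggplot2's internals exactly, and the variable names are illustrative.

```r
# Petal lengths for one species
x <- iris$Petal.Length[iris$Species == "setosa"]

# Box edges and interquartile range
q1  <- unname(quantile(x, 0.25))
q3  <- unname(quantile(x, 0.75))
iqr <- q3 - q1

# Limits beyond which a point counts as an outlier
lower_limit <- q1 - 1.5 * iqr
upper_limit <- q3 + 1.5 * iqr

# Whiskers stop at the most extreme values inside the limits
whisker_low  <- min(x[x >= lower_limit])
whisker_high <- max(x[x <= upper_limit])

# Anything outside the limits is drawn as an individual point
outliers <- x[x < lower_limit | x > upper_limit]

c(q1 = q1, median = median(x), q3 = q3)
c(whisker_low = whisker_low, whisker_high = whisker_high)
outliers
```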

✏️ Task 4.2 — Create a box plot showing the distribution of Sepal.Ratio (from Task 3.3) by species. Add a meaningful title and axis labels.

# Your box plot here

4.3 Histogram

ggplot(iris_data, aes(x = Petal.Area, fill = Species)) +
  geom_histogram(bins = 20, alpha = 0.6, position = "identity") +
  labs(
    title = "Distribution of petal area by species",
    x     = "Petal area (cm²)",
    y     = "Count"
  ) +
  theme_minimal()

✏️ Task 4.3 — Create a histogram of Sepal.Length. Set bins = 15 and colour the bars by species. Describe the distribution in one sentence — is it roughly normal, skewed, or multimodal?

# Your histogram here

Description: (write your answer here)

4.4 Combining dplyr and ggplot2

Because %>% passes a data frame forward, you can pipe directly into ggplot() without creating intermediate variables.

# Summarise then plot in one pipeline
iris_data %>%
  group_by(Species) %>%
  summarise(mean_petal_length = mean(Petal.Length),
            se = sd(Petal.Length) / sqrt(n())) %>%
  ggplot(aes(x = Species, y = mean_petal_length, fill = Species)) +
  geom_col(alpha = 0.8) +
  geom_errorbar(aes(ymin = mean_petal_length - se,
                    ymax = mean_petal_length + se),
                width = 0.2) +
  labs(
    title = "Mean petal length ± SE by species",
    x     = "Species",
    y     = "Mean petal length (cm)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")
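
The se column in the pipeline above is the standard error of the mean: the standard deviation divided by the square root of the sample size. As a small sketch, the same calculation can be wrapped in a helper function; std_error() is a hypothetical name, not a built-in R function.

```r
# Hypothetical helper: standard error of the mean
std_error <- function(x) {
  sd(x) / sqrt(length(x))
}

# Tiny worked example: sd(c(1, 2, 3)) is 1, so the result is 1 / sqrt(3)
small_example <- std_error(c(1, 2, 3))
small_example

# Applied to petal length within each species
se_by_species <- tapply(iris$Petal.Length, iris$Species, std_error)
se_by_species
```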

✏️ Task 4.4 — Adapt the pipeline above to plot mean Sepal.Ratio ± SE by species as a bar chart with error bars. Update the title and y-axis label.

# Your pipeline + plot here

Part 5 — R Markdown formatting

This section practises writing well-formatted Rmd documents.

5.1 Inline code

You can embed R results directly in prose using backtick-r syntax: `r expression`. The value is computed when you knit and inserted into the text automatically — no copy-pasting numbers.

For example, the sentence below is written as:

The iris dataset contains `r nrow(iris_data)` rows.

And knits to:

The iris dataset contains 150 rows.

✏️ Task 5.1 — Complete the sentence below using inline code so that the values are computed automatically, not typed manually.

The three species in the iris dataset are setosa, (add the second species using inline code), and (add the third using inline code). The mean sepal length across all species is (add inline code here) cm.

5.2 Chunk options

Chunk options control what appears in the output. They go inside the {r} header.

Option                   Default   Effect
echo = FALSE             TRUE      Hides the code but shows the output
eval = FALSE             TRUE      Shows the code but does not run it
include = FALSE          TRUE      Runs the code silently: hides both code and output
fig.width, fig.height    7, 5      Sets plot dimensions in inches
fig.cap = "..."          none      Adds a figure caption
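
Chunk options can also be set once for the whole document rather than chunk by chunk, typically in a setup chunk near the top of the file. The sketch below uses knitr::opts_chunk$set() for this and assumes knitr is installed. The values shown are illustrative, not required settings for this assignment.

```r
# Set document-wide defaults; individual chunks can still override them
knitr::opts_chunk$set(
  echo       = TRUE,   # show code by default
  fig.width  = 7,      # default plot width in inches
  fig.height = 5       # default plot height in inches
)

# Check what the current default is
knitr::opts_chunk$get("fig.width")
```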

✏️ Task 5.2 — The chunk below contains a plot. Add chunk options so that:

  • The code is hidden from the output (only the plot shows)
  • The plot is 6 inches wide and 4 inches tall
  • It has the caption: “Figure 1: Petal length vs petal width coloured by species.”

ggplot(iris_data, aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
  geom_point(alpha = 0.7) +
  theme_minimal()

5.3 Tables with knitr::kable()

R data frames print as plain text by default. knitr::kable() renders them as formatted tables in the HTML output.

library(knitr)

species_summary %>%
  kable(
    caption = "Table 1: Summary statistics by species",
    digits  = 2,
    col.names = c("Species", "N", "Mean petal length",
                  "SD petal length", "Max petal width")
  )

Table 1: Summary statistics by species

Species      N    Mean petal length   SD petal length   Max petal width
setosa       50   1.46                0.17              0.6
versicolor   50   4.26                0.47              1.8
virginica    50   5.55                0.55              2.5

✏️ Task 5.3 — Take your sepal_summary table from Task 3.5 and render it with kable(). Give it a caption and clean column names.

# Your kable table here

Part 6 — Written interpretation

Answer the following questions in complete sentences below. Each answer should be 2–4 sentences. You may reference tables or plots you produced above.

Q1. Looking at the scatter plots from Section 4.1 and Task 4.1, which pair of measurements (sepal or petal dimensions) better separates the three species, and why might this be biologically meaningful?

Your answer:


Q2. Based on your sepal_summary table (Task 3.5) and your box plot (Task 4.2), which species has the most consistent sepal shape (length-to-width ratio), and which is most variable? What does high variability in this ratio suggest about the flower’s morphology?

Your answer:


Q3. In your own words, explain what the pipe operator %>% does and why it makes code easier to read compared to nesting functions inside each other.

Your answer:


Q4. Describe one thing about working in R Markdown that you found confusing or unexpected, and explain how you resolved it (or what you would do to troubleshoot it).

Your answer:


Submission checklist

Before submitting, confirm all of the following:


BIOL341 — Assignment 0