1. Getting Started

R is a powerful language for statistics and data analysis. This section introduces core ideas so students can continue independently.

1.1 What is R?

R is an open‑source language designed for:

  • Statistics
  • Data analysis
  • Data visualization
  • Reproducible science

It is widely used in biology, ecology, genomics, and environmental sciences.

1.2 Installing R and RStudio

Download and install R from CRAN (the Comprehensive R Archive Network), then install RStudio, an integrated development environment (IDE) that makes coding, visualization, and documentation easier.

1.3 How to use the Console

The console evaluates each expression as soon as you press Enter and prints the result.

1 + 1
print("Hello World")

R can also be used as a calculator, but its real power comes from vectorized operations, plotting, and statistical analysis.

1.4 Comments

Use # for comments.

# This is a comment
x <- 4  # inline comment

2. Variables & Types

R stores information in objects. Objects have types that determine how R treats the data.

2.1 Basic Types in R

  • numeric: real numbers, stored as doubles (e.g., 3.14)
  • integer: whole numbers, written with an L suffix (e.g., 10L)
  • character: text (e.g., "hello")
  • logical: TRUE/FALSE values

x <- 10        # numeric
class(x)

y <- "hello"  # character
class(y)

z <- TRUE      # logical
class(z)

R infers the type from the value being assigned; you do not declare it in advance.
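
The difference between numeric and integer is easy to miss, because a bare number such as 10 is stored as a double. A small sketch (the names `a` and `b` are illustrative only) of how the `L` suffix changes the stored type:

```r
a <- 10      # a bare number is stored as a double ("numeric")
b <- 10L     # the L suffix requests an integer

class(a)         # "numeric"
class(b)         # "integer"
typeof(a)        # "double"
is.numeric(b)    # TRUE: integers also count as numeric
identical(a, b)  # FALSE: same value, different storage type
```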

2.2 Type Coercion

When a vector mixes types, R converts all elements to the most flexible type present, following the order logical < integer < numeric < character.

v <- c(1, "a", 3)
v      # all elements become character strings
class(v)

Use explicit coercion:

as.numeric("3.5")
as.character(42)
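
A brief sketch of how coercion plays out in practice, including what happens when an explicit conversion fails. The names `flags`, `n_true`, and `bad` are illustrative, not part of the course material:

```r
# Implicit coercion picks the most flexible type in the vector
class(c(TRUE, 2L))      # "integer"
class(c(2L, 3.5))       # "numeric"
class(c(3.5, "a"))      # "character"

# Logical values count as 1 and 0 in arithmetic
flags <- c(TRUE, FALSE, TRUE)
n_true <- sum(flags)    # 2

# Text that is not a number becomes NA (R also emits a warning,
# suppressed here to keep the output clean)
bad <- suppressWarnings(as.numeric("abc"))
is.na(bad)              # TRUE
```

Checking for NA after a conversion is a simple way to catch values that could not be parsed.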

2.3 Variables Are Vectors of Length 1

Even single values are stored as vectors of length 1.

x <- 5
length(x)   # 1

3. Vectors & Recycling

Vectors are the fundamental data structure in R. Even a single value is a vector of length 1. Many R functions operate on vectors.

3.1 Creating vectors

v1 <- c(1, 2, 3)
v2 <- 5:10
v3 <- seq(0, 1, by=0.2)
v4 <- rep(4, times=5)
v1; v2; v3; v4

3.2 Named vectors

Named vectors allow meaningful labels.

heights <- c(Alex=180, Maria=165, John=175)
heights
heights["Maria"]

3.3 Vector indexing

x <- c(10, 20, 30, 40)
x[2]
x[c(1, 3)]
x[-2]      # exclude index 2

Logical (TRUE/FALSE) indexing

Logical indexing is powerful for filtering.

z <- c(5, 10, 15, 20)
idx <- z > 10
idx      # FALSE FALSE TRUE TRUE
z[idx]   # 15 20
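
Conditions can be combined with `&` (and) and `|` (or), and `which()` reports the positions of the matches instead of the values. A minimal sketch using the same vector:

```r
z <- c(5, 10, 15, 20)

z[z > 5 & z < 20]    # both conditions hold: 10 15
z[z < 10 | z > 15]   # either condition holds: 5 20
which(z > 10)        # positions of the matches: 3 4
sum(z > 10)          # number of matches: 2
```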

3.4 Recycling rule (important!)

When two vectors of different lengths are combined in an operation, R repeats (recycles) the shorter one until it matches the length of the longer one.

a <- c(1, 2, 3, 4)
b <- c(10, 20)
a + b    # b becomes 10,20,10,20

Be careful: when the lengths match up evenly, R recycles without any message, so an accidental length mismatch can go unnoticed.
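
A short sketch contrasting the silent case with the case where R warns. The final length check is an optional defensive habit, shown here as one possible safeguard:

```r
a <- c(1, 2, 3, 4)

# Length 2 divides length 4 evenly: recycled silently
a + c(10, 20)          # 11 22 13 24

# Length 3 does not divide length 4: R still recycles, but warns.
# tryCatch captures the warning text instead of printing it.
res <- tryCatch(
  a + c(10, 20, 30),
  warning = function(w) conditionMessage(w)
)
res

# A defensive check before combining two vectors
b <- c(10, 20)
stopifnot(length(a) %% length(b) == 0)
```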

3.5 Vectorized operations

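Most R operations are vectorized: they act on every element of a vector in a single call, which is both more concise and faster than writing an explicit loop.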

x <- 1:5
x * 2
x + 5
sin(x)
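
To make the contrast concrete, the sketch below computes the same squares with an explicit loop and with a vectorized expression. The names `squares_loop` and `squares_vec` are illustrative:

```r
x <- 1:5

# Explicit loop: one element at a time
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# Vectorized: the whole vector in one expression
squares_vec <- x^2

identical(squares_loop, squares_vec)   # TRUE
```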

3.6 Lists

Lists can store objects of different types and sizes.

lst <- list(
  numbers = c(1,2,3),
  name = "oak",
  temp = 22.5,
  misc = list(inner = 5)
)
lst
lst$numbers
lst[["name"]]
lst$misc$inner

Lists are extremely flexible and often appear in results from statistical models.
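
As an illustration of a model result being a list, the sketch below fits a linear model to simulated data (the values and the names `sim` and `fit` are made up for this example) and accesses its components with the same `$` and `[[ ]]` syntax used above:

```r
set.seed(1)                       # make the simulated data reproducible
sim <- data.frame(x = 1:20)
sim$y <- 2 * sim$x + rnorm(20)    # invented linear relationship plus noise

fit <- lm(y ~ x, data = sim)

is.list(fit)              # TRUE: a fitted model is a list underneath
names(fit)[1:4]           # first few components
fit$coefficients          # access a component with $
fit[["residuals"]][1:3]   # or with [[ ]]
```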

3.7 Exercises

  1. Create a named vector of 5 species abundance values.
  2. Filter values greater than 10 using logical indexing.
  3. Create a list containing a vector, a matrix, and another list.

4. Matrices

Matrices are two‑dimensional rectangular data structures containing elements all of the same type. They are useful in many biological analyses (e.g., gene expression matrices, distance matrices).

4.1 Creating matrices

m <- matrix(1:12, nrow = 3, ncol = 4)
m

R fills matrices column‑wise by default. Set byrow = TRUE to fill row by row instead.

m2 <- matrix(1:12, nrow = 3, byrow = TRUE)
m2

4.2 Naming rows/columns

rownames(m) <- paste0("R", 1:3)
colnames(m) <- paste0("C", 1:4)
m

4.3 Indexing matrices

m[1, 2]      # row 1, column 2
m[ , 3]      # whole column
m[2, ]       # whole row
m[1:2, 2:3]

4.4 Matrix operations

m * 2        # multiply every element by 2
m + m2       # element-wise addition (both are 3 x 4)
t(m)         # transpose
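
Note that `*` multiplies element by element; the matrix product from linear algebra uses the `%*%` operator. A small sketch with two illustrative 2 x 2 matrices `A` and `B`:

```r
A <- matrix(1:4, nrow = 2)   # columns are (1, 2) and (3, 4)
B <- matrix(5:8, nrow = 2)   # columns are (5, 6) and (7, 8)

A * B       # element-wise product
A %*% B     # matrix product (rows of A times columns of B)

dim(t(A) %*% A)   # 2 2
```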

4.5 Biological example: gene expression matrix

Imagine rows are genes and columns are samples.

expr <- matrix(rnorm(500), nrow=50, ncol=10)
rownames(expr) <- paste0("Gene_", 1:50)
colnames(expr) <- paste0("Sample_", 1:10)
expr[1:5, 1:5]

Find genes with high variance:

vars <- apply(expr, 1, var)
head(sort(vars, decreasing=TRUE))
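
Because `rnorm()` draws new random values on every run, the ranking changes each time. The sketch below fixes the seed (42 is an arbitrary choice) and then keeps only the five most variable genes; `top_genes` and `expr_top` are illustrative names:

```r
set.seed(42)    # fix the random seed so the simulated values repeat
expr <- matrix(rnorm(500), nrow = 50, ncol = 10)
rownames(expr) <- paste0("Gene_", 1:50)

vars <- apply(expr, 1, var)          # one variance per row (gene)

# Names of the five most variable genes
top_genes <- names(sort(vars, decreasing = TRUE))[1:5]

# Subset the matrix to those genes only
expr_top <- expr[top_genes, ]
dim(expr_top)                        # 5 10
```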

4.6 Exercises

  1. Create a 5×5 matrix and multiply it by 3.
  2. Extract rows 2–4 and columns 1–3.
  3. Build a fake gene expression matrix and find genes with mean > 1.

5. Data Frames & Reading Files

Data frames are one of the most important structures in R. They are used to store tabular data, where columns may be of different types (numeric, character, logical, etc.). This makes them ideal for biological and environmental datasets where variables can include quantitative measurements (temperature, pH) alongside categorical values (species, region).

5.1 What is a data frame?

A data frame is similar to a spreadsheet:

  • Each row is an observation (e.g., one sampling site)
  • Each column is a variable (e.g., temperature)
  • Different columns can store different data types

df <- data.frame(
  id = 1:3,
  temp = c(20.1, 19.5, 22.3),
  species = c("oak", "pine", "elm")
)
head(df)

5.2 Inspecting data frames

head(df)
tail(df)
str(df)
dim(df)
summary(df)

5.3 Accessing columns

df$temp

Equivalent:

df[["temp"]]
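
Single brackets behave differently from `$` and `[[ ]]`: they return a data frame rather than a vector. A short sketch using the same example table, which also selects rows by condition:

```r
df <- data.frame(
  id = 1:3,
  temp = c(20.1, 19.5, 22.3),
  species = c("oak", "pine", "elm")
)

class(df$temp)        # "numeric": a plain vector
class(df[["temp"]])   # "numeric": the same vector
class(df["temp"])     # "data.frame": single brackets keep the table structure

# Rows where temp exceeds 20, keeping two columns by name
warm <- df[df$temp > 20, c("id", "species")]
warm
```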

5.4 Adding/removing columns

df$region <- c("north", "south", "east")   # add a column
df$region <- NULL                          # remove it again

5.5 Reading external files

We often import data from .csv (comma‑separated) or .tsv (tab‑separated) files, using read.csv() and read.delim() respectively.

env <- read.csv("environmental_measurements.csv")
str(env)
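
If the course file is not at hand, the reading functions can still be tried on temporary files. The sketch below writes a tiny made-up table (`demo`, with invented values) as both CSV and TSV, then reads each back:

```r
# Write a tiny example table to temporary files (made-up values)
demo <- data.frame(site = c("A", "B"), temperature = c(21.4, 18.9))
csv_path <- tempfile(fileext = ".csv")
tsv_path <- tempfile(fileext = ".tsv")
write.csv(demo, csv_path, row.names = FALSE)
write.table(demo, tsv_path, sep = "\t", row.names = FALSE, quote = FALSE)

from_csv <- read.csv(csv_path)      # comma-separated
from_tsv <- read.delim(tsv_path)    # tab-separated

str(from_csv)
all.equal(from_csv, from_tsv)       # both files hold the same table
```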

5.6 Data cleaning and exploration

Check for missing values:

sum(is.na(env))
colSums(is.na(env))

Filter rows:

subset(env, species == "oak")

Sort values:

env[order(env$temperature), ]
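
Missing values also affect summaries and sorting. The sketch below uses a small invented table (`obs`) to show how NA propagates through `mean()`, how to drop incomplete rows, and where NA ends up when sorting:

```r
# Small made-up table with missing values
obs <- data.frame(
  species = c("oak", "pine", "oak", "elm"),
  temperature = c(21.0, NA, 19.5, 23.1),
  ph = c(6.5, 7.0, NA, 6.8)
)

colSums(is.na(obs))                       # species 0, temperature 1, ph 1
mean(obs$temperature)                     # NA: missing values propagate
mean(obs$temperature, na.rm = TRUE)       # 21.2

complete <- obs[complete.cases(obs), ]    # keep rows with no NA at all
nrow(complete)                            # 2

obs[order(obs$temperature, decreasing = TRUE), ]   # NA rows sort last
```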

5.7 Conceptual discussion

Data frames allow us to combine diverse measurements in one structure. This is essential in environmental biology, where datasets include many types of variables: climate (temp, humidity), chemical (pH), biological (species), and geographic (location).

They form the basis for:

  • Visualization
  • Statistical analysis
  • Modeling
  • Ecological inference

Understanding how to manipulate them is crucial.

5.8 Practical example: average temperature by species

tapply(env$temperature, env$species, mean)
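
`tapply()` returns a named vector. When a data frame is more convenient, `aggregate()` gives the same summary in tabular form. A sketch on invented values (`obs` and `by_species` are illustrative names):

```r
obs <- data.frame(
  species = c("oak", "pine", "oak", "pine", "elm"),
  temperature = c(20, 18, 22, 16, 25)
)

# Named vector: one mean per species
tapply(obs$temperature, obs$species, mean)

# Same summary returned as a data frame
by_species <- aggregate(temperature ~ species, data = obs, FUN = mean)
by_species
```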

5.9 Exercises

  1. Import a CSV file and inspect its structure.
  2. Extract all rows where temperature > 25.
  3. Compute average pH per location.
  4. Create a new column for temperature in Kelvin.

6. Visualization

Base R

The examples in this section use the env data frame loaded in Section 5.5.

Scatter

plot(env$temperature, env$humidity)

Histogram

hist(env$temperature)

Boxplot

boxplot(temperature ~ species, data=env)
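
Plots are easier to read with axis labels and a legend, and they can be written to a file rather than the screen. The sketch below uses simulated data (`sim_env`, with invented values and an invented relationship) and saves a labeled scatter plot to a temporary PDF:

```r
set.seed(7)
# Simulated stand-in for an environmental table (values are made up)
sim_env <- data.frame(
  species = rep(c("oak", "pine", "elm"), each = 10),
  temperature = rnorm(30, mean = 20, sd = 3)
)
sim_env$humidity <- 90 - 1.5 * sim_env$temperature + rnorm(30, sd = 4)

sp <- as.factor(sim_env$species)   # levels: elm, oak, pine

out_file <- tempfile(fileext = ".pdf")
pdf(out_file)                      # open a PDF graphics device
plot(sim_env$temperature, sim_env$humidity,
     xlab = "Temperature", ylab = "Humidity",
     main = "Humidity vs. temperature",
     col = as.integer(sp), pch = 19)
legend("topright", legend = levels(sp), col = seq_along(levels(sp)), pch = 19)
dev.off()                          # close the device to finish the file

file.exists(out_file)              # TRUE
```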

ggplot2

library(ggplot2)
ggplot(env, aes(temperature, humidity, color=species)) + geom_point()

7. Intro to Data Analysis

Summary Stats

summary(env)

Group Means

tapply(env$temperature, env$species, mean)

Correlation

cor(env$temperature, env$humidity)

8. Evolutionary Toy Examples

evo <- read.csv("evolutionary_traits.csv")
boxplot(trait ~ pop, data=evo)

t-test

t.test(trait ~ pop, data=evo)

9. Biological Analyses with Simulated Data

In this section, we will explore realistic biological questions using the provided datasets. We introduce each problem, explain its relevance, reference related research when appropriate, and comment on the results.

9.1 Do environmental factors differ among species?

Research often investigates how different species occupy varying environmental niches.

Related concept: Niche differentiation (Hutchinson, 1957).

env <- read.csv("environmental_measurements.csv")
boxplot(temperature ~ species, data=env)
tapply(env$temperature, env$species, mean)

If the boxplots and group means differ clearly among species, this is consistent with niche specialization; a formal test such as ANOVA is needed to confirm the difference.

9.2 Correlation between pH and humidity

Soil pH and humidity can influence plant community composition.

Reference: Brady & Weil, “The Nature and Properties of Soils”.

cor(env$ph, env$humidity)

A moderate correlation may indicate linked environmental gradients.
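
`cor()` returns only the coefficient. To judge whether it could have arisen by chance, `cor.test()` adds a p-value and a confidence interval. The sketch below uses simulated values (the relationship between `ph` and `humidity` here is invented for illustration):

```r
set.seed(3)
# Simulated values only; the relationship below is made up
ph <- rnorm(40, mean = 6.5, sd = 0.5)
humidity <- 50 + 8 * ph + rnorm(40, sd = 5)

r <- cor(ph, humidity)             # Pearson correlation coefficient
ct <- cor.test(ph, humidity)       # adds a p-value and confidence interval
ct$estimate
ct$p.value
ct$conf.int

# With missing values, restrict the calculation to complete pairs
ph[1] <- NA
r_complete <- cor(ph, humidity, use = "complete.obs")
r_complete
```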

9.3 Population trait comparison

Trait divergence between populations is important in evolutionary biology.

Reference: Lande (1976) on quantitative genetic models.

t.test(trait ~ pop, data=evo)

A significant difference (conventionally p < 0.05) is consistent with population divergence.
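
The object returned by `t.test()` is a list whose components can be extracted individually. A sketch on two simulated populations (`sim_evo`, with invented means and spread):

```r
set.seed(11)
# Two simulated populations with different true means (made-up values)
sim_evo <- data.frame(
  pop = rep(c("A", "B"), each = 25),
  trait = c(rnorm(25, mean = 10, sd = 2), rnorm(25, mean = 12, sd = 2))
)

tt <- t.test(trait ~ pop, data = sim_evo)   # Welch two-sample t-test by default
tt$estimate     # mean of each group
tt$p.value      # p-value for the null hypothesis of equal means
tt$conf.int     # confidence interval for the difference in means
```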

9.4 PCA to explore multivariate structure

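Principal Component Analysis (PCA) reduces dimensionality by re-expressing correlated variables as a smaller set of uncorrelated components, ordered by the amount of variance each one explains.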

Reference: Jolliffe, “Principal Component Analysis”.

env_scaled <- scale(env[, c("temperature", "humidity", "ph")])  # center and scale each variable
pc <- prcomp(env_scaled)
summary(pc)   # variance explained by each component
plot(pc$x[,1], pc$x[,2], col=as.factor(env$species), xlab="PC1", ylab="PC2")

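If the points cluster by species, the species occupy distinct environmental conditions, which is consistent with (but does not by itself demonstrate) environmental adaptation.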
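
The proportion of variance reported by `summary(pc)` can be computed directly from the component standard deviations. The sketch below does this on simulated measurements (all values are made up, and humidity is deliberately built to depend on temperature):

```r
set.seed(5)
# Simulated measurements (made up); humidity depends on temperature by design
temperature <- rnorm(60, mean = 20, sd = 3)
humidity <- 90 - 1.5 * temperature + rnorm(60, sd = 2)
ph <- rnorm(60, mean = 6.5, sd = 0.4)
sim <- data.frame(temperature, humidity, ph)

pc <- prcomp(sim, scale. = TRUE)       # center and scale inside prcomp

# Proportion of variance explained by each component
var_explained <- pc$sdev^2 / sum(pc$sdev^2)
round(var_explained, 2)
cumsum(var_explained)                  # the last value is 1

pc$rotation[, 1]                       # loadings: variable weights on PC1
```

Passing `scale. = TRUE` to `prcomp()` is an alternative to calling `scale()` beforehand; both standardize the variables so that none dominates because of its units.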

9.5 Exercises

  1. Test if pH differs by location.
  2. Perform ANOVA on temperature by species.
  3. Visualize PCA results with ggplot2.