📁 1. Importing Your Data

Let's assume your dataset is saved as trainingData.xlsx in your working directory.

Load necessary packages

#install.packages("readxl")   # Run only once
#install.packages("dplyr")

library(readxl)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Read in the Excel data

plants <- read_excel("trainingData.xlsx")

Here the special symbol <- (the assignment operator) is read as "gets". It is used for creating objects, in this case a data frame, which is R-speak for a spreadsheet.
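As a small illustration of assignment, the sketch below builds a toy data frame by hand. The object names (n_plants, toy) and the values are made up for the example and do not come from trainingData.xlsx.

```r
# Assign a single number to an object: n_plants "gets" 100
n_plants <- 100

# Assign a small, hand-made data frame (hypothetical values)
toy <- data.frame(
  id      = c(1, 2, 3),
  size_t1 = c(4.8, 3.7, 4.4)
)

# Typing an object's name prints its contents
toy

# class() reports what kind of object it is
class(toy)
```

An object created by read_excel() can be used in the same way; it prints as a tibble, which is a data frame with a tidier print method.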


👀 2. Inspect the Data

Before analysis, it's good to explore the dataset.

Summary statistics

summary(plants)
##        id           size_t1           fruit          survived   
##  Min.   :  3.0   Min.   : 1.195   Min.   : 1.00   Min.   :0.00  
##  1st Qu.:134.5   1st Qu.: 3.366   1st Qu.: 4.75   1st Qu.:0.00  
##  Median :296.5   Median : 4.638   Median : 8.00   Median :0.00  
##  Mean   :271.3   Mean   : 5.152   Mean   :10.08   Mean   :0.24  
##  3rd Qu.:402.8   3rd Qu.: 6.423   3rd Qu.:13.00   3rd Qu.:0.00  
##  Max.   :500.0   Max.   :16.212   Max.   :55.00   Max.   :1.00  
##                                                                 
##     n_seeds         size_t2        sizeClass_t1   sizeClass_t2  
##  Min.   : 1.00   Min.   : 5.005   Min.   :1.00   Min.   :2.000  
##  1st Qu.: 4.75   1st Qu.: 7.272   1st Qu.:1.00   1st Qu.:2.000  
##  Median : 8.00   Median : 8.870   Median :1.00   Median :2.000  
##  Mean   :10.08   Mean   :10.015   Mean   :1.52   Mean   :2.417  
##  3rd Qu.:13.00   3rd Qu.:11.844   3rd Qu.:2.00   3rd Qu.:3.000  
##  Max.   :55.00   Max.   :20.244   Max.   :3.00   Max.   :3.000  
##                  NA's   :76                      NA's   :76
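The NA's row in the summary shows that size_t2 and sizeClass_t2 are missing for 76 plants. With a mean of 0.24 for survived, that would be consistent with the missing sizes belonging to plants that did not survive. A few other base R functions help with this kind of check. The sketch below uses a small made-up data frame that borrows the column names, so the values are illustrative only.

```r
# Toy stand-in for `plants` (hypothetical values, same column names)
toy <- data.frame(
  id       = 1:5,
  size_t1  = c(4.8, 3.7, 2.1, 6.0, 1.9),
  survived = c(1, 1, 0, 0, 0),
  size_t2  = c(6.2, 5.3, NA, NA, NA)
)

head(toy, 3)   # first three rows
str(toy)       # column types at a glance

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
na_counts

# Cross-tabulate survival against missing size_t2
missing_by_survival <- table(
  survived        = toy$survived,
  size_t2_missing = is.na(toy$size_t2)
)
missing_by_survival
```

Running the same calls on plants would show whether the missing values line up with survival in the real data.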

Histograms

hist(plants$size_t1,
     main = "Histogram of Initial Size (size_t1)",
     xlab = "Size at Time 1",
     col = "lightblue", border = "white")

hist(plants$n_seeds,
     main = "Histogram of Seed Count (n_seeds)",
     xlab = "Number of Seeds",
     col = "lightgreen", border = "white")

Scatter plots

plot(plants$size_t1, plants$n_seeds,
     main = "Size-seed relationship",
     xlab = "Size", ylab = "Number of Seeds")


🔗 3. A Quick Note About Pipes (%>%)

The pipe operator (%>%) passes the result of one function to the next as its first argument. This makes a long series of commands easier to read.
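To see the rule in isolation, here is a minimal sketch on a plain numeric vector. The numbers and object names are made up for illustration.

```r
library(dplyr)  # makes %>% available

sizes <- c(1, 4, 9)   # made-up numbers

# Nested form: read from the inside out
nested <- sum(sqrt(sizes))

# Piped form: read from left to right
piped <- sizes %>% sqrt() %>% sum()

# x %>% f(y) is the same as f(x, y)
rounded <- 3.14159 %>% round(2)

nested
piped
rounded
```

Both forms compute the same value; the pipe only changes the order in which the steps are written.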

Example 1

summarise(plants, n = n())
## # A tibble: 1 × 1
##       n
##   <int>
## 1   100

# In pipe form
plants %>%
  summarise(n = n())
## # A tibble: 1 × 1
##       n
##   <int>
## 1   100

This was simple, but it could be more complicated:

Example 2

# Without pipes
summarise(group_by(plants, sizeClass_t1), n = n())
## # A tibble: 3 × 2
##   sizeClass_t1     n
##          <dbl> <int>
## 1            1    54
## 2            2    40
## 3            3     6

# With pipes
plants %>%
  group_by(sizeClass_t1) %>%
  summarise(n = n())
## # A tibble: 3 × 2
##   sizeClass_t1     n
##          <dbl> <int>
## 1            1    54
## 2            2    40
## 3            3     6

The piped version is easier to read: it works like a list of step-by-step instructions.


🔍 4. Filter the Data

We'll look at plants that survived (survived == 1) and were in size class 1 at time 1 (sizeClass_t1 == 1):

survivors <- plants %>%
  filter(survived == 1) %>%  
  filter(sizeClass_t1 == 1)

survivors
## # A tibble: 4 × 8
##      id size_t1 fruit survived n_seeds size_t2 sizeClass_t1 sizeClass_t2
##   <dbl>   <dbl> <dbl>    <dbl>   <dbl>   <dbl>        <dbl>        <dbl>
## 1     5    4.78    10        1      10    6.19            1            2
## 2   158    3.72     2        1       2    5.25            1            2
## 3   114    4.36     5        1       5    5.89            1            2
## 4   186    4.06     8        1       8    5.00            1            2

You can also put both conditions in a single filter() call, separated by a comma. The comma means "and":

survivors <- plants %>%
  filter(survived == 1, sizeClass_t1 == 1)

survivors
## # A tibble: 4 × 8
##      id size_t1 fruit survived n_seeds size_t2 sizeClass_t1 sizeClass_t2
##   <dbl>   <dbl> <dbl>    <dbl>   <dbl>   <dbl>        <dbl>        <dbl>
## 1     5    4.78    10        1      10    6.19            1            2
## 2   158    3.72     2        1       2    5.25            1            2
## 3   114    4.36     5        1       5    5.89            1            2
## 4   186    4.06     8        1       8    5.00            1            2
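The sketch below shows how conditions combine in filter(), using a small made-up tibble with the same column names. The comma and & both mean "and", while | means "or". The object names and values are hypothetical.

```r
library(dplyr)

# Toy stand-in for `plants` (hypothetical values)
toy <- tibble(
  id           = 1:6,
  survived     = c(1, 0, 1, 1, 0, 1),
  sizeClass_t1 = c(1, 1, 2, 1, 3, 3)
)

# Comma: keep rows where BOTH conditions are TRUE
both_comma <- toy %>% filter(survived == 1, sizeClass_t1 == 1)

# & does the same thing inside a single condition
both_and <- toy %>% filter(survived == 1 & sizeClass_t1 == 1)

# | keeps rows where AT LEAST ONE condition is TRUE
either <- toy %>% filter(survived == 1 | sizeClass_t1 == 1)

both_comma
either
```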

Challenge 1: Can you count the survivors using nrow()?


📊 5. Summarise by Group

Group the data by sizeClass_t1 using group_by(), then calculate the number of plants (n) and the total number of seeds (total_seeds) in each group:

summary_stats <- plants %>%
  group_by(sizeClass_t1) %>%
  summarise(
    n = n(),
    total_seeds = sum(n_seeds, na.rm = TRUE))

summary_stats
## # A tibble: 3 × 3
##   sizeClass_t1     n total_seeds
##          <dbl> <int>       <dbl>
## 1            1    54         285
## 2            2    40         499
## 3            3     6         224
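The na.rm = TRUE argument tells sum() to skip missing values. The summary in section 2 reports no missing n_seeds, so it makes no difference here, but it matters for columns that do contain NA. The following sketch uses a made-up tibble with one deliberately missing value; the column total_strict is an extra added for comparison.

```r
library(dplyr)

# Toy data with one missing seed count (hypothetical values)
toy <- tibble(
  sizeClass_t1 = c(1, 1, 2, 2, 3),
  n_seeds      = c(2, 5, 10, NA, 30)
)

toy_stats <- toy %>%
  group_by(sizeClass_t1) %>%
  summarise(
    n            = n(),                       # rows per group
    total_seeds  = sum(n_seeds, na.rm = TRUE), # NA is skipped
    total_strict = sum(n_seeds)                # NA makes the group total NA
  )

toy_stats
```

Note that n() counts rows, so a row with a missing n_seeds still counts towards n.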

➕ 6. Add New Columns with mutate()

Let's calculate each size class's share of the total seed count:

summary_stats <- summary_stats %>%
  mutate(
    proportion_seeds = total_seeds / sum(total_seeds)
  )

summary_stats
## # A tibble: 3 × 4
##   sizeClass_t1     n total_seeds proportion_seeds
##          <dbl> <int>       <dbl>            <dbl>
## 1            1    54         285            0.283
## 2            2    40         499            0.495
## 3            3     6         224            0.222
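A quick sanity check is that the proportions add up to 1. The sketch below rebuilds the table from the printed summary_stats values so that it runs on its own; the object name stats and the seeds_per_plant column are additions for illustration.

```r
library(dplyr)

# Values copied from the printed summary_stats table
stats <- tibble(
  sizeClass_t1 = c(1, 2, 3),
  n            = c(54L, 40L, 6L),
  total_seeds  = c(285, 499, 224)
)

stats <- stats %>%
  mutate(
    # sum(total_seeds) is the grand total across all rows
    proportion_seeds = total_seeds / sum(total_seeds),
    # mutate() can add several columns at once (extra, for illustration)
    seeds_per_plant  = total_seeds / n
  )

stats

# The shares should add up to 1
sum(stats$proportion_seeds)
```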

✅ Recap

Task                     Function(s) used
Import Excel             read_excel()
Inspect data             summary(), hist()
Filter rows              filter()
Summarise by group       group_by(), summarise()
Add calculated columns   mutate()
Calculations             n(), sum()
Chain steps              %>% (pipe)

Now you are ready to tackle some "real" data.