From Excel to Summary: A Quick R Tutorial

📁 1. Importing Your Data

Let’s assume you have your dataset saved as trainingData.xlsx.

Load necessary packages

#install.packages("readxl")   # Run only once
#install.packages("dplyr")

library(readxl)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Read in the Excel data

plants <- read_excel("trainingData.xlsx")

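Here the assignment operator <- is read as “gets”. It creates an object by giving a value a name: in this case plants, a data frame, which is R-speak for a spreadsheet-like table. (read_excel() returns a tibble, which is a type of data frame.)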
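To try out <- and data frames without the Excel file, here is a minimal sketch. The object toy and all of its values are made up for illustration; only the column names echo the dataset used in this tutorial.

```r
# Hypothetical toy data (values invented for illustration only)
toy <- data.frame(
  id      = c(1, 2, 3),
  size_t1 = c(2.5, 4.0, 6.1),
  n_seeds = c(3, 8, 12)
)

class(toy)   # "data.frame"
dim(toy)     # rows, then columns: 3 3
names(toy)   # the column names
toy$n_seeds  # the $ operator pulls out one column as a vector
```

The same functions work on plants once it has been read in.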

👀 2. Inspect the Data

Before analysis, it’s good to explore the dataset.

Summary statistics

summary(plants)

##        id           size_t1           fruit          survived   
##  Min.   :  3.0   Min.   : 1.195   Min.   : 1.00   Min.   :0.00  
##  1st Qu.:134.5   1st Qu.: 3.366   1st Qu.: 4.75   1st Qu.:0.00  
##  Median :296.5   Median : 4.638   Median : 8.00   Median :0.00  
##  Mean   :271.3   Mean   : 5.152   Mean   :10.08   Mean   :0.24  
##  3rd Qu.:402.8   3rd Qu.: 6.423   3rd Qu.:13.00   3rd Qu.:0.00  
##  Max.   :500.0   Max.   :16.212   Max.   :55.00   Max.   :1.00  
##                                                                 
##     n_seeds         size_t2        sizeClass_t1   sizeClass_t2  
##  Min.   : 1.00   Min.   : 5.005   Min.   :1.00   Min.   :2.000  
##  1st Qu.: 4.75   1st Qu.: 7.272   1st Qu.:1.00   1st Qu.:2.000  
##  Median : 8.00   Median : 8.870   Median :1.00   Median :2.000  
##  Mean   :10.08   Mean   :10.015   Mean   :1.52   Mean   :2.417  
##  3rd Qu.:13.00   3rd Qu.:11.844   3rd Qu.:2.00   3rd Qu.:3.000  
##  Max.   :55.00   Max.   :20.244   Max.   :3.00   Max.   :3.000  
##                  NA's   :76                      NA's   :76
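The NA's row at the bottom of the summary counts missing values: size_t2 and sizeClass_t2 each have 76. One plausible reading (an assumption, since the output does not say so) is that these are the plants that did not survive to time 2 and so have no second measurement. The sketch below shows how you might count missing values and check that idea, using a made-up toy data frame rather than the real plants data.

```r
# Hypothetical toy data: two plants died, so they have no size at time 2
toy <- data.frame(
  survived = c(1, 0, 0, 1),
  size_t2  = c(6.2, NA, NA, 5.3)
)

sum(is.na(toy$size_t2))   # missing values in one column: 2
colSums(is.na(toy))       # missing values in every column at once

# Cross-tabulate survival against missingness to check the assumption
table(survived = toy$survived, missing_size = is.na(toy$size_t2))
```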

Histograms

hist(plants$size_t1,
     main = "Histogram of Initial Size (size_t1)",
     xlab = "Size at Time 1",
     col = "lightblue", border = "white")

hist(plants$n_seeds,
     main = "Histogram of Seed Count (n_seeds)",
     xlab = "Number of Seeds",
     col = "lightgreen", border = "white")

Scatter plots

plot(plants$size_t1, plants$n_seeds,
     main = "Size-seed relationship",
     xlab = "Size", ylab = "Number of Seeds")

🔗 3. A Quick Note About Pipes (`%>%`)

The pipe operator (%>%) takes the result on its left and passes it as the first argument to the function on its right. This lets you write a series of commands in the order they run, which makes them easier to read.

Example 1

summarise(plants, n = n())

## # A tibble: 1 × 1
##       n
##   <int>
## 1   100

# In pipe form

plants %>% 
  summarise(n = n())

## # A tibble: 1 × 1
##       n
##   <int>
## 1   100

That example was simple, but pipes help more as the number of steps grows:

Example 2

# Without pipes
summarise(group_by(plants, sizeClass_t1), n = n())

## # A tibble: 3 × 2
##   sizeClass_t1     n
##          <dbl> <int>
## 1            1    54
## 2            2    40
## 3            3     6

# With pipes
plants %>%
  group_by(sizeClass_t1) %>%
  summarise(n = n())

## # A tibble: 3 × 2
##   sizeClass_t1     n
##          <dbl> <int>
## 1            1    54
## 2            2    40
## 3            3     6

The piped version is easier to read: like step-by-step instructions, it runs from top to bottom.
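Pipes are not limited to data frames. As a minimal sketch, the vector x below is made up for illustration, and the two versions compute the same thing:

```r
library(dplyr)  # provides %>% (re-exported from the magrittr package)

x <- c(1, 4, 9)

nested <- sum(sqrt(x))              # read inside-out
piped  <- x %>% sqrt() %>% sum()    # read left to right

identical(nested, piped)  # TRUE: both take square roots first, then sum

# Extra arguments go after the piped value:
# x %>% round(1) is the same as round(x, 1)
```

Since R 4.1 there is also a built-in pipe, |>, which behaves the same way in simple cases like these. This tutorial sticks with %>%.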

🔍 4. Filter the Data

We’ll look at plants that:

Survived to time 2
Were in size class 1 at time 1

survivors <- plants %>%
  filter(survived == 1) %>%  
  filter(sizeClass_t1 == 1)

survivors

## # A tibble: 4 × 8
##      id size_t1 fruit survived n_seeds size_t2 sizeClass_t1 sizeClass_t2
##   <dbl>   <dbl> <dbl>    <dbl>   <dbl>   <dbl>        <dbl>        <dbl>
## 1     5    4.78    10        1      10    6.19            1            2
## 2   158    3.72     2        1       2    5.25            1            2
## 3   114    4.36     5        1       5    5.89            1            2
## 4   186    4.06     8        1       8    5.00            1            2

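You can also pass both conditions to a single filter() call. Conditions separated by commas are combined with AND, so a row is kept only if it meets all of them: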

survivors <- plants %>%
  filter(survived == 1, sizeClass_t1 == 1)

survivors

## # A tibble: 4 × 8
##      id size_t1 fruit survived n_seeds size_t2 sizeClass_t1 sizeClass_t2
##   <dbl>   <dbl> <dbl>    <dbl>   <dbl>   <dbl>        <dbl>        <dbl>
## 1     5    4.78    10        1      10    6.19            1            2
## 2   158    3.72     2        1       2    5.25            1            2
## 3   114    4.36     5        1       5    5.89            1            2
## 4   186    4.06     8        1       8    5.00            1            2
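To see how conditions combine inside filter(), here is a small sketch on a made-up toy data frame (the values are invented; only the column names match the tutorial data). A comma and & both mean AND, while | means OR.

```r
library(dplyr)

# Hypothetical toy data (values invented for illustration only)
toy <- data.frame(
  id           = 1:4,
  survived     = c(1, 1, 0, 0),
  sizeClass_t1 = c(1, 2, 1, 3)
)

both_comma <- toy %>% filter(survived == 1, sizeClass_t1 == 1)
both_amp   <- toy %>% filter(survived == 1 & sizeClass_t1 == 1)
either     <- toy %>% filter(survived == 1 | sizeClass_t1 == 1)

both_comma$id  # 1
both_amp$id    # 1 (same rows as the comma form)
either$id      # 1 2 3
```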

Challenge 1: Can you count the survivors using nrow()?

📊 5. Summarise by Group

Group the data by sizeClass_t1 using group_by(), then use summarise() to calculate for each group:

Number of individuals
Total number of seeds

summary_stats <- plants %>%
  group_by(sizeClass_t1) %>%
  summarise(
    n = n(),
    total_seeds = sum(n_seeds, na.rm = TRUE))

summary_stats

## # A tibble: 3 × 3
##   sizeClass_t1     n total_seeds
##          <dbl> <int>       <dbl>
## 1            1    54         285
## 2            2    40         499
## 3            3     6         224
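The same pattern extends to other summaries. The sketch below uses a made-up toy data frame (values invented, including one deliberate NA) to show that n() counts every row in a group, while na.rm = TRUE drops missing values before sum() or mean() is computed. The mean_seeds column is an extra added here for illustration.

```r
library(dplyr)

# Hypothetical toy data with one missing seed count
toy <- data.frame(
  sizeClass_t1 = c(1, 1, 2, 2, 2),
  n_seeds      = c(2, 4, 10, NA, 6)
)

toy_stats <- toy %>%
  group_by(sizeClass_t1) %>%
  summarise(
    n           = n(),                          # rows per group, NA row included
    total_seeds = sum(n_seeds, na.rm = TRUE),   # NA ignored
    mean_seeds  = mean(n_seeds, na.rm = TRUE)   # mean of the non-missing values
  )

toy_stats

# count() is a shortcut for group_by() followed by summarise(n = n())
count(toy, sizeClass_t1)
```

Without na.rm = TRUE, any group containing an NA would get NA as its sum and mean.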

➕ 6. Add New Columns with `mutate()`

Let’s calculate each size class’s share of the total seed count:

summary_stats <- summary_stats %>%
  mutate(
    proportion_seeds = total_seeds / sum(total_seeds)
  )

summary_stats

## # A tibble: 3 × 4
##   sizeClass_t1     n total_seeds proportion_seeds
##          <dbl> <int>       <dbl>            <dbl>
## 1            1    54         285            0.283
## 2            2    40         499            0.495
## 3            3     6         224            0.222
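A quick way to check a proportion column is to confirm that it adds up to 1. The sketch below rebuilds the table by hand from the numbers printed above, so it runs without the Excel file. The object name stats and the seeds_per_plant column are additions for illustration only.

```r
library(dplyr)

# Values copied from the summary_stats output above
stats <- data.frame(
  sizeClass_t1 = c(1, 2, 3),
  n            = c(54L, 40L, 6L),
  total_seeds  = c(285, 499, 224)
)

stats <- stats %>%
  mutate(
    proportion_seeds = total_seeds / sum(total_seeds),
    seeds_per_plant  = total_seeds / n   # hypothetical extra column
  )

sum(stats$total_seeds)        # 1008
sum(stats$proportion_seeds)   # proportions should add up to 1
round(stats$seeds_per_plant, 2)
```

Inside mutate(), sum(total_seeds) adds up the whole column because the table is no longer grouped: summarise() drops the last level of grouping, and there was only one here.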

✅ Recap

Task	Function(s) used
Import Excel	`read_excel()`
Inspect data	`summary()`, `hist()`, `plot()`
Filter rows	`filter()`
Summarise by group	`group_by()`, `summarise()`
Add calculated columns	`mutate()`
Calculations	`n()`, `sum()`
Chain steps	`%>%` (pipe)

Now you are ready to tackle some “real” data.