Let's assume you have your dataset saved as
trainingData.xlsx.
#install.packages("readxl") # Run only once
#install.packages("dplyr")
library(readxl)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
plants <- read_excel("trainingData.xlsx")
Here the special symbol <- is read as
"gets". It is used for creating objects, in this case a data
frame, which is R-speak for a spreadsheet.
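As a small illustration of assignment, the sketch below creates a number and a tiny data frame by hand. The names `x` and `small_df` and all the values are made up for this example; they are not part of the training data.

```r
# Assign a single number to an object called x ("x gets 5")
x <- 5

# Assign a small, made-up data frame to an object called small_df
small_df <- data.frame(
  id   = 1:3,
  size = c(2.1, 3.4, 5.0)
)

# Typing an object's name prints its contents
x
small_df

# str() shows the structure: one row per observation, one column per variable
str(small_df)
```

Nothing is printed when an object is created; R only shows the contents when you ask for them by name.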
Before analysis, it's good practice to explore the dataset.
summary(plants)
## id size_t1 fruit survived
## Min. : 3.0 Min. : 1.195 Min. : 1.00 Min. :0.00
## 1st Qu.:134.5 1st Qu.: 3.366 1st Qu.: 4.75 1st Qu.:0.00
## Median :296.5 Median : 4.638 Median : 8.00 Median :0.00
## Mean :271.3 Mean : 5.152 Mean :10.08 Mean :0.24
## 3rd Qu.:402.8 3rd Qu.: 6.423 3rd Qu.:13.00 3rd Qu.:0.00
## Max. :500.0 Max. :16.212 Max. :55.00 Max. :1.00
##
## n_seeds size_t2 sizeClass_t1 sizeClass_t2
## Min. : 1.00 Min. : 5.005 Min. :1.00 Min. :2.000
## 1st Qu.: 4.75 1st Qu.: 7.272 1st Qu.:1.00 1st Qu.:2.000
## Median : 8.00 Median : 8.870 Median :1.00 Median :2.000
## Mean :10.08 Mean :10.015 Mean :1.52 Mean :2.417
## 3rd Qu.:13.00 3rd Qu.:11.844 3rd Qu.:2.00 3rd Qu.:3.000
## Max. :55.00 Max. :20.244 Max. :3.00 Max. :3.000
## NA's :76 NA's :76
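The NA's entries at the bottom of the summary report missing values, here in size_t2 and sizeClass_t2. One way to count missing values per column is sketched below on a made-up data frame (`toy_plants` and its values are hypothetical; only the column names mimic the tutorial's data).

```r
# Hypothetical toy data: two plants have no recorded size at time 2
toy_plants <- data.frame(
  size_t1  = c(2.1, 3.4, 5.0, 4.2),
  survived = c(1, 0, 0, 1),
  size_t2  = c(6.2, NA, NA, 5.9)
)

# is.na() marks each missing cell as TRUE; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(toy_plants))
na_counts
```

Running the same two functions on plants should reproduce the NA counts that summary() reports.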
hist(plants$size_t1,
main = "Histogram of Initial Size (size_t1)",
xlab = "Size at Time 1",
col = "lightblue", border = "white")
hist(plants$n_seeds,
main = "Histogram of Seed Count (n_seeds)",
xlab = "Number of Seeds",
col = "lightgreen", border = "white")
plot(plants$size_t1, plants$n_seeds,
main = "Size-seed relationship",
xlab = "Size", ylab = "Number of Seeds")
The pipe operator (%>%) passes the
result of one function to the next function as its first argument. It makes a complex series of commands
easier to read.
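As a minimal sketch of the idea, the example below applies two base R functions to a made-up numeric vector, first as a nested call and then as a pipe. The vector and the object names `nested` and `piped` are invented for illustration.

```r
library(dplyr)  # dplyr makes the %>% operator available

values <- c(1, 4, 9)  # made-up numbers

# Nested form: read from the inside out
nested <- sum(sqrt(values))

# Piped form: read from left to right
piped <- values %>%
  sqrt() %>%
  sum()

# Both forms give the same result
identical(nested, piped)
```

The piped form reads in the order the steps happen: take the values, take their square roots, then add them up.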
summarise(plants, n = n())
## # A tibble: 1 × 1
## n
## <int>
## 1 100
# in pipe form
plants %>%
summarise(n = n())
## # A tibble: 1 × 1
## n
## <int>
## 1 100
That was a simple case, but nested calls quickly become harder to read:
# Without pipes
summarise(group_by(plants, sizeClass_t1), n = n())
## # A tibble: 3 × 2
## sizeClass_t1 n
## <dbl> <int>
## 1 1 54
## 2 2 40
## 3 3 6
# With pipes
plants %>%
group_by(sizeClass_t1) %>%
summarise(n = n())
## # A tibble: 3 × 2
## sizeClass_t1 n
## <dbl> <int>
## 1 1 54
## 2 2 40
## 3 3 6
The piped version is easier to read, like giving step-by-step instructions.
We'll look at plants that survived (survived == 1) and were in size class 1 at time 1 (sizeClass_t1 == 1):
survivors <- plants %>%
filter(survived == 1) %>%
filter(sizeClass_t1 == 1)
survivors
## # A tibble: 4 × 8
## id size_t1 fruit survived n_seeds size_t2 sizeClass_t1 sizeClass_t2
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 5 4.78 10 1 10 6.19 1 2
## 2 158 3.72 2 1 2 5.25 1 2
## 3 114 4.36 5 1 5 5.89 1 2
## 4 186 4.06 8 1 8 5.00 1 2
You can also write this as a single filter() call with both conditions separated by a comma:
survivors <- plants %>%
filter(survived == 1, sizeClass_t1 == 1)
survivors
## # A tibble: 4 × 8
## id size_t1 fruit survived n_seeds size_t2 sizeClass_t1 sizeClass_t2
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 5 4.78 10 1 10 6.19 1 2
## 2 158 3.72 2 1 2 5.25 1 2
## 3 114 4.36 5 1 5 5.89 1 2
## 4 186 4.06 8 1 8 5.00 1 2
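The two versions return the same rows because filter() combines comma-separated conditions with a logical AND. The sketch below checks this on a made-up data frame (`toy_plants` and its values are hypothetical, not the training data).

```r
library(dplyr)

# Hypothetical toy data; values are made up for illustration
toy_plants <- data.frame(
  id           = 1:6,
  survived     = c(1, 0, 1, 1, 0, 1),
  sizeClass_t1 = c(1, 1, 2, 1, 3, 2)
)

# Two filter() steps in sequence
two_steps <- toy_plants %>%
  filter(survived == 1) %>%
  filter(sizeClass_t1 == 1)

# One filter() call with comma-separated conditions
one_call <- toy_plants %>%
  filter(survived == 1, sizeClass_t1 == 1)

# The same conditions joined explicitly with &
with_and <- toy_plants %>%
  filter(survived == 1 & sizeClass_t1 == 1)

# All three approaches keep the same rows
identical(two_steps$id, one_call$id)
identical(one_call$id, with_and$id)
```

To keep rows that match either condition instead of both, the conditions would be joined with | inside a single filter() call.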
Challenge 1: Can you count the survivors using
nrow()?
Group the data by sizeClass_t1 using
group_by(), then calculate the number of plants (n) and the total number of seeds (total_seeds) in each class:
summary_stats <- plants %>%
group_by(sizeClass_t1) %>%
summarise(
n = n(),
total_seeds = sum(n_seeds, na.rm = TRUE))
summary_stats
## # A tibble: 3 × 3
## sizeClass_t1 n total_seeds
## <dbl> <int> <dbl>
## 1 1 54 285
## 2 2 40 499
## 3 3 6 224
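The na.rm = TRUE argument in the call to sum() tells R to drop missing values before adding. A short sketch with a made-up vector (`seeds` is hypothetical) shows what happens with and without it.

```r
# Made-up seed counts with one missing value
seeds <- c(1, NA, 3)

# Without na.rm, a single NA makes the whole sum NA
total_with_na <- sum(seeds)
total_with_na

# With na.rm = TRUE, the missing value is dropped first
total_without_na <- sum(seeds, na.rm = TRUE)
total_without_na
```

Other summary functions such as mean() accept the same argument.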
Let's use mutate() to calculate the proportion of all seeds produced by each size class:
summary_stats <- summary_stats %>%
mutate(
proportion_seeds = total_seeds / sum(total_seeds)
)
summary_stats
## # A tibble: 3 × 4
## sizeClass_t1 n total_seeds proportion_seeds
## <dbl> <int> <dbl> <dbl>
## 1 1 54 285 0.283
## 2 2 40 499 0.495
## 3 3 6 224 0.222
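The arithmetic behind proportion_seeds can be checked by hand. The sketch below retypes the three total_seeds values from the table above into a plain vector (the vector is typed in for illustration rather than read from plants).

```r
# Totals per size class, copied from the summary table
total_seeds <- c(285, 499, 224)

# Each class's share is its total divided by the grand total
grand_total <- sum(total_seeds)
proportion_seeds <- total_seeds / grand_total

grand_total
round(proportion_seeds, 3)

# Proportions of a whole should add up to 1
sum(proportion_seeds)
```

Checking that the proportions sum to 1 is a quick way to confirm that the denominator covers every group.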
| Task | Function(s) used |
|---|---|
| Import Excel | read_excel() |
| Inspect data | summary(), hist(), plot() |
| Filter rows | filter() |
| Summarise by group | group_by(), summarise() |
| Add calculated columns | mutate() |
| Calculations | n(), sum() |
| Chain steps | %>% (pipe) |
Now you are ready to tackle some "real" data.