This is an R Markdown document that shows you how to use one of my favorite tools in R, the tidyverse. The tidyverse is a collection of packages that load together. Once you have loaded it, you can use dplyr to manipulate data (picture pliers ripping apart your spreadsheet), ggplot2 to make beautiful graphs, and tidyr to reshape dataframes so that each row holds only one observation. This long, tidy format is not how you would usually set up a spreadsheet, but it is the best way to organize your data for running statistical tests and making plots.
#Tutorial to learn about tools in the tidyverse
#Based on suggested steps in https://www.datacamp.com/cheat-sheet/tidyverse-cheat-sheet-for-beginners
#Step 1: load the tidyverse
install.packages("tidyverse") #Only needed once per R installation; skip this line on later runs
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
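The two conflict lines above mean that, once the tidyverse is attached, a bare call to filter() or lag() resolves to the dplyr version instead of the one in stats. Here is a minimal sketch of how to ask for either version explicitly with the `::` operator. The vector `x` and the object names are made up for illustration.

```r
library(dplyr)

# A bare filter() now means dplyr::filter(), which keeps rows of a data frame
setosa_rows <- filter(iris, Species == "setosa")

# The masked stats version is still reachable through its namespace.
# stats::filter() applies a linear filter to a series; here, a 3-point moving average
x <- c(1, 2, 3, 4, 5) # toy data, for illustration only
moving_avg <- stats::filter(x, rep(1 / 3, 3))
```

Writing the package name in front of the function is also a handy way to make it obvious to a reader which filter() you meant.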
#Step 2: load example data (iris, a built-in dataset like mtcars)
library(datasets) #Load the datasets package
#library(gapminder) #Load the gapminder package
data(iris) #Load the iris data; attach() is not needed because every call below names iris explicitly
str(iris) #Use str to print structure of data to console
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
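If you would like a tidyverse-flavored view of the same information, here is a small sketch using as_tibble() and glimpse(). The name `iris_tbl` is one I made up for this example.

```r
library(tibble)
library(dplyr)

# Convert the base data frame to a tibble, which prints more compactly
iris_tbl <- as_tibble(iris)

# glimpse() is similar to str(): one line per column, with its type and first values
glimpse(iris_tbl)
```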
#Step 3: use filter() to make smaller dataframes and str() to see new size
onlyvirginica <- iris %>%
  filter(Species == "virginica") #Select iris data of species "virginica"
str(onlyvirginica)
## 'data.frame': 50 obs. of 5 variables:
## $ Sepal.Length: num 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 ...
## $ Sepal.Width : num 3.3 2.7 3 2.9 3 3 2.5 2.9 2.5 3.6 ...
## $ Petal.Length: num 6 5.1 5.9 5.6 5.8 6.6 4.5 6.3 5.8 6.1 ...
## $ Petal.Width : num 2.5 1.9 2.1 1.8 2.2 2.1 1.7 1.8 1.8 2.5 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 3 3 3 3 3 3 3 3 3 3 ...
onlylongsepals <- iris %>%
  filter(Sepal.Length > 6) #Select iris data that has sepal length > 6
str(onlylongsepals)
## 'data.frame': 61 obs. of 5 variables:
## $ Sepal.Length: num 7 6.4 6.9 6.5 6.3 6.6 6.1 6.7 6.2 6.1 ...
## $ Sepal.Width : num 3.2 3.2 3.1 2.8 3.3 2.9 2.9 3.1 2.2 2.8 ...
## $ Petal.Length: num 4.7 4.5 4.9 4.6 4.7 4.6 4.7 4.4 4.5 4 ...
## $ Petal.Width : num 1.4 1.5 1.5 1.5 1.6 1.3 1.4 1.4 1.5 1.3 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 2 2 2 2 2 2 2 2 2 2 ...
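filter() can also take more than one condition at a time. The sketch below combines the two filters from this step and shows `%in%` for matching several values. The object names are made up for illustration.

```r
library(dplyr)

# Conditions separated by a comma are combined with AND
long_virginica <- iris %>%
  filter(Species == "virginica", Sepal.Length > 6)

# %in% keeps rows whose value matches any entry in the vector
not_setosa <- iris %>%
  filter(Species %in% c("versicolor", "virginica"))

# nrow() is a quick way to check how many rows survived each filter
nrow(long_virginica)
nrow(not_setosa)
```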
#Step 4: use arrange() to sort a dataframe
sorted_iris <- iris %>%
  arrange(desc(Petal.Length)) #Sort in descending order of petal length
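Step 4 does not print anything, so here is a short sketch for checking the result and for sorting by more than one column. The name `by_species_then_petal` is made up for this example.

```r
library(dplyr)

sorted_iris <- iris %>%
  arrange(desc(Petal.Length))

# head() shows the first rows, so the longest petals appear at the top
head(sorted_iris, 3)

# Several columns can be given; later columns break ties in earlier ones
by_species_then_petal <- iris %>%
  arrange(Species, desc(Petal.Length))
head(by_species_then_petal, 3)
```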
#Step 5: tidy the iris data set using gather() and str() to see new size
# The arguments to gather():
# - data: name of wide dataframe
# - key: Name of new key column (made from names of data columns)
# - value: Name of new value column
# - ...: Names of source columns that contain values
# - factor_key: Treat the new key column as a factor (instead of character vector)
tidy_iris <- gather(iris, TypeofMeasurement, Measurement, Sepal.Length:Petal.Width, factor_key=TRUE)
str(tidy_iris)
## 'data.frame': 600 obs. of 3 variables:
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ TypeofMeasurement: Factor w/ 4 levels "Sepal.Length",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Measurement : num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
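The tidyr documentation now recommends pivot_longer() over gather() for new code, so here is a sketch of the same reshaping written that way, followed by one grouped summary to show why the long format is handy. The names `tidy_iris2`, `measurement_means`, and `MeanMeasurement` are made up for illustration. Two differences from the gather() result are worth knowing: pivot_longer() returns a tibble, and by default the new key column is a character vector rather than a factor.

```r
library(dplyr)
library(tidyr)

# Same reshaping as the gather() call, written with pivot_longer()
tidy_iris2 <- iris %>%
  pivot_longer(
    cols = Sepal.Length:Petal.Width,  # the four measurement columns
    names_to = "TypeofMeasurement",   # new key column
    values_to = "Measurement"         # new value column
  )

# In long format, a single grouped summary covers every measurement type
measurement_means <- tidy_iris2 %>%
  group_by(Species, TypeofMeasurement) %>%
  summarise(MeanMeasurement = mean(Measurement), .groups = "drop")

measurement_means
```

The rows of `tidy_iris2` come out in a different order than in `tidy_iris` (each flower's four measurements sit together), but the contents are the same 600 observations.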