Microbiome Course - R workshop

typeof command

typeof(2)

typeof("micro")

typeof(1+1i)

typeof(TRUE)

Data Structures

vector - Everything in the vector must be the same basic data type

The c() function creates a vector in which all elements have the same type. In the first example below the elements are numeric, and in the second they are characters. If numeric and character values are mixed in one call, the numeric values are “coerced” to characters.

c(1,2,3)
## [1] 1 2 3

concatenate command c()

c("a","b","c")
## [1] "a" "b" "c"
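Coercion is easiest to see by mixing types in a single c() call. The following is a minimal sketch; the variable names mixed and flags are illustrative and not part of the workshop material.

```r
# Numeric and character values mixed: everything becomes character
mixed <- c(1, "a", 3)
typeof(mixed)   # "character"
mixed           # "1" "a" "3"

# Logical and numeric values mixed: TRUE/FALSE become 1/0
flags <- c(TRUE, 2, FALSE)
typeof(flags)   # "double"
flags           # 1 2 0
```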

concatenate command c() and assign

vector_example <- c(10, 11, 15, 19) # concatenate command c()
vector_example
## [1] 10 11 15 19

find out the class of the vector

class(vector_example)
## [1] "numeric"

subsetting vector data - get first element

vector_example[1]
## [1] 10

subsetting vector data - get all but first element

vector_example[-1]
## [1] 11 15 19

subsetting vector data - get elements 2 to 4

vector_example[2:4]
## [1] 11 15 19

Matrices

A matrix is a two-dimensional collection of data:

matrix_example <- matrix(1:12, nrow=3, ncol=4) # define a matrix with 3 rows and 4 columns
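The order in which matrix() places the values 1:12 is not obvious from the call alone. As a small sketch (by_col and by_row are illustrative names), the values go in column by column unless byrow = TRUE is given:

```r
# Default: fill column by column
by_col <- matrix(1:12, nrow = 3, ncol = 4)
by_col[1, ]   # first row: 1 4 7 10
by_col[, 1]   # first column: 1 2 3

# byrow = TRUE: fill row by row instead
by_row <- matrix(1:12, nrow = 3, ncol = 4, byrow = TRUE)
by_row[1, ]   # first row: 1 2 3 4
```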

We can declare a matrix full of zeros:

matrix_example <- matrix(0, ncol=6, nrow=3)
matrix_example
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    0    0    0    0    0    0
## [2,]    0    0    0    0    0    0
## [3,]    0    0    0    0    0    0

And similar to other data structures, we can ask things about our matrix:

class(matrix_example)
typeof(matrix_example)
str(matrix_example)
dim(matrix_example)
nrow(matrix_example)
ncol(matrix_example)

matrix - select a row and column

matrix_example[2, ]
## [1] 0 0 0 0 0 0
matrix_example[,2]
## [1] 0 0 0

matrix - select an element

matrix_example[2,3]
## [1] 0

transpose a matrix

t(matrix_example)
##      [,1] [,2] [,3]
## [1,]    0    0    0
## [2,]    0    0    0
## [3,]    0    0    0
## [4,]    0    0    0
## [5,]    0    0    0
## [6,]    0    0    0

List

A list is an ordered data structure, and another one you’ll want in your bag of tricks. A list is simpler in some ways than the other types, because you can put anything you want in it:

list_example <- list(1, "a", TRUE, 1+4i)

list subset

list_example[1:2]
## [[1]]
## [1] 1
## 
## [[2]]
## [1] "a"

list subset - double bracket - single element

list_example[[2]]
## [1] "a"

list subset - single bracket - returns list

list_example[2]
## [[1]]
## [1] "a"
class(list_example[2])
## [1] "list"

another_list <- list(title = "Numbers", numbers = 1:10, data = TRUE )
another_list
## $title
## [1] "Numbers"
## 
## $numbers
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $data
## [1] TRUE
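Elements of a named list can also be pulled out by name. A brief sketch using the same another_list as above:

```r
another_list <- list(title = "Numbers", numbers = 1:10, data = TRUE)

names(another_list)              # "title" "numbers" "data"
another_list$title               # "Numbers"
another_list[["numbers"]]        # the integer vector 1:10
class(another_list["numbers"])   # "list": single brackets return a list
```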

Arrays

Arrays are similar to matrices, but they can have more than two dimensions (a matrix is the two-dimensional case).

z <- array(1:24, dim=c(2,3,4))
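Indexing an array takes one index per dimension. A short sketch with the same z:

```r
z <- array(1:24, dim = c(2, 3, 4))

dim(z)          # 2 3 4
z[1, 1, 1]      # 1, the first value
z[2, 3, 4]      # 24, the last value
dim(z[, , 1])   # 2 3: fixing the third index leaves a 2 x 3 matrix
```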

Dataframes

A data frame is used for storing data tables. It is a list of vectors of equal length, so each column of a data frame is a vector.
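The "list of equal-length vectors" description can be checked directly. This is a sketch with a hypothetical data frame df, invented only for illustration:

```r
# Hypothetical data frame used only for illustration
df <- data.frame(id = 1:3, label = c("a", "b", "c"))

is.list(df)          # TRUE: a data frame is a list underneath
length(df)           # 2: one list element per column
sapply(df, length)   # every column has length 3
```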

Write to file

cats <- data.frame(coat = c("calico", "black", "tabby"), 
                    weight = c(2.1, 5.0,3.2), 
                    likes_string = c(1, 0, 1))
write.csv(x = cats, file = "/home/ubuntu/r-tutorial/gapminder/data/feline-data.csv", row.names = FALSE)

read from file

cats <- read.csv(file = "/home/ubuntu/r-tutorial/gapminder/data/feline-data.csv")
cats
##     coat weight likes_string
## 1 calico    2.1            1
## 2  black    5.0            0
## 3  tabby    3.2            1
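The path above only exists on the workshop machine. As a portable sketch, the same round trip can be tried with a temporary file; csv_path and cats_again are illustrative names, and tempfile() is used here as an assumption in place of the workshop directory.

```r
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1))

# Write to a temporary file instead of a fixed directory
csv_path <- tempfile(fileext = ".csv")
write.csv(x = cats, file = csv_path, row.names = FALSE)

# Read it back and check that the shape survived the round trip
cats_again <- read.csv(file = csv_path)
names(cats_again)   # "coat" "weight" "likes_string"
nrow(cats_again)    # 3

unlink(csv_path)    # remove the temporary file
```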

We said that columns in data.frames were vectors:

str(cats$weight)

str(cats$likes_string)

str(cats$coat)

class command

class(cats)
## [1] "data.frame"

Accessing columns using the $ operator

cats$weight
cats$likes_string
cats$coat

[[ will act to extract a single column, given its name or position:

cats[["weight"]]
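The difference between double and single brackets on a data frame mirrors the list case. A minimal sketch, rebuilding cats so it runs on its own:

```r
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1))

cats[["weight"]]        # numeric vector: 2.1 5.0 3.2
cats[[2]]               # the same column, selected by position
class(cats["weight"])   # "data.frame": single brackets keep the structure
```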

Adding columns and rows in data frames

cbind

We already learned that the columns of a data frame are vectors, so that our data are consistent in type throughout the columns. As such, if we want to add a new column, we can start by making a new vector:

age <- c(2, 3, 5)
cats
##     coat weight likes_string
## 1 calico    2.1            1
## 2  black    5.0            0
## 3  tabby    3.2            1
cbind(cats, age)
##     coat weight likes_string age
## 1 calico    2.1            1   2
## 2  black    5.0            0   3
## 3  tabby    3.2            1   5

Note that if we tried to add a vector of ages with a different number of entries than the number of rows in the dataframe, it would fail:

age <- c(2, 3, 5, 12)
cbind(cats, age)
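To see the failure without stopping a script, the error can be captured with tryCatch(). This is a sketch; too_long and msg are illustrative names. Note also that cbind() returns a new data frame, so the result has to be assigned if the column should be kept.

```r
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1))

# Four ages for three rows: capture the error message instead of stopping
too_long <- c(2, 3, 5, 12)
msg <- tryCatch(cbind(cats, age = too_long),
                error = function(e) conditionMessage(e))
msg

# A vector of matching length works; assign the result to keep the column
cats <- cbind(cats, age = c(2, 3, 5))
names(cats)   # "coat" "weight" "likes_string" "age"
```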

rbind

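Now how about adding rows? Each row of a data frame is a list, because it can hold values of different types, and a new row needs one entry per column. The warning below appears when coat has been read in as a factor (the default for read.csv before R 4.0): “tortoiseshell” is not one of its existing levels, so the value becomes NA:

newRow <- list("tortoiseshell", 3.3, TRUE)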
cats <- rbind(cats, newRow)
## Warning in `[<-.factor`(`*tmp*`, ri, value = "tortoiseshell"): invalid
## factor level, NA generated
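When coat is stored as character rather than as a factor, the new value is kept instead of turning into NA. A hedged sketch of that case: stringsAsFactors = FALSE is set explicitly so the behaviour does not depend on the R version, and new_row is an illustrative name.

```r
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1),
                   stringsAsFactors = FALSE)

# Build the new row as a one-row data frame with matching column names
new_row <- data.frame(coat = "tortoiseshell",
                      weight = 3.3,
                      likes_string = 1,
                      stringsAsFactors = FALSE)

cats <- rbind(cats, new_row)
nrow(cats)     # 4
cats$coat[4]   # "tortoiseshell"
```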

We can also glue two data frames together with rbind:

cats <- rbind(cats, cats)
cats
##     coat weight likes_string
## 1 calico    2.1            1
## 2  black    5.0            0
## 3  tabby    3.2            1
## 4   <NA>    3.3            1
## 5 calico    2.1            1
## 6  black    5.0            0
## 7  tabby    3.2            1
## 8   <NA>    3.3            1

Dataframe Manipulation

realistic example

gapminder <- read.csv("/home/ubuntu/r-tutorial/gapminder/data/gapminder_data.csv")

We can compute per-continent statistics with base R, but we have to repeat the same expression for each continent:

mean(gapminder[gapminder$continent == "Africa", "gdpPercap"])
## [1] 2193.755
mean(gapminder[gapminder$continent == "Americas", "gdpPercap"])
## [1] 7136.11
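The expression above combines a logical row filter with a column name. The sketch below breaks it into steps on a tiny made-up data frame (toy and its values are invented for illustration and are not gapminder data):

```r
# Made-up values, not taken from the gapminder file
toy <- data.frame(continent = c("Africa", "Africa", "Americas", "Americas"),
                  gdpPercap = c(1000, 3000, 6000, 8000))

# Step 1: a logical vector marking the rows to keep
is_africa <- toy$continent == "Africa"
is_africa   # TRUE TRUE FALSE FALSE

# Step 2: use it as the row index, pick one column, then average
mean(toy[is_africa, "gdpPercap"])   # 2000
```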

The dplyr package

Luckily, the dplyr package provides a number of very useful functions for manipulating dataframes in a way that will reduce the above repetition, reduce the probability of making errors, and probably even save you some typing. As an added bonus, you might even find the dplyr grammar easier to read.

Here we’re going to cover 5 of the most commonly used functions, as well as using pipes (%>%) to combine them:

select(), filter(), group_by(), summarize(), mutate()

If you have not installed this package earlier, please do so:

install.packages('dplyr')

Now let’s load the package

library("dplyr")
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Using select()

If, for example, we wanted to move forward with only a few of the variables in our dataframe we could use the select() function. This will keep only the variables you select.

year_country_gdp <- select(gapminder,year,country,gdpPercap)

using pipes

year_country_gdp <- gapminder %>% select(year,country,gdpPercap)
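The pipe passes the object on its left as the first argument of the function on its right, so both forms above give the same result. A small sketch on a made-up data frame (toy, direct and piped are illustrative names), assuming dplyr is installed:

```r
library(dplyr)

# Made-up values, not taken from the gapminder file
toy <- data.frame(country = c("A", "B"),
                  year = c(2002, 2007),
                  gdpPercap = c(1000, 2000))

direct <- select(toy, year, country)      # function-call form
piped  <- toy %>% select(year, country)   # pipe form

identical(direct, piped)   # TRUE
names(piped)               # "year" "country"
```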

Using filter()

filter() keeps only the rows that match a condition. We can combine select() and filter() in one pipeline:

year_country_gdp_euro <- gapminder %>%
    filter(continent=="Europe") %>%
    select(year,country,gdpPercap)
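The same pipeline can be tried on a made-up data frame small enough to check by eye. This is a sketch; toy and europe are illustrative names, and the values are invented.

```r
library(dplyr)

# Made-up values, not taken from the gapminder file
toy <- data.frame(continent = c("Europe", "Europe", "Asia"),
                  country = c("A", "B", "C"),
                  year = c(2007, 2007, 2007),
                  gdpPercap = c(30000, 25000, 5000))

europe <- toy %>%
    filter(continent == "Europe") %>%     # keep matching rows
    select(year, country, gdpPercap)      # keep chosen columns

nrow(europe)    # 2
names(europe)   # "year" "country" "gdpPercap"
```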

Using summarize()

On its own, group_by() is a bit on the uneventful side, but it is much more exciting in conjunction with summarize(). Together they allow us to create new variable(s) by using functions that repeat for each of the continent-specific data frames. That is to say, using the group_by() function, we split our original dataframe into multiple pieces, then we can run functions (e.g. mean() or sd()) on each piece within summarize().

gdp_bycontinents <- gapminder %>%
    group_by(continent) %>%
    summarize(mean_gdpPercap=mean(gdpPercap))
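On a made-up data frame the split-then-summarize behaviour is easy to verify by hand. A sketch, with toy and by_continent as illustrative names and invented values:

```r
library(dplyr)

# Made-up values, not taken from the gapminder file
toy <- data.frame(continent = c("Africa", "Africa", "Americas", "Americas"),
                  gdpPercap = c(1000, 3000, 6000, 8000))

by_continent <- toy %>%
    group_by(continent) %>%
    summarize(mean_gdpPercap = mean(gdpPercap))

by_continent$continent        # "Africa" "Americas"
by_continent$mean_gdpPercap   # 2000 7000: one mean per group
```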

Using mutate()

We can also create new variables prior to (or even after) summarizing information using mutate().

gdp_pop_bycontinents_byyear <- gapminder %>%
    mutate(gdp_billion=gdpPercap*pop/10^9) %>%
    group_by(continent,year) %>%
    summarize(mean_gdpPercap=mean(gdpPercap),
              sd_gdpPercap=sd(gdpPercap),
              mean_pop=mean(pop),
              sd_pop=sd(pop),
              mean_gdp_billion=mean(gdp_billion),
              sd_gdp_billion=sd(gdp_billion))
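The mutate() step on its own adds a column computed row by row from existing columns. A minimal sketch on made-up numbers (toy and with_gdp are illustrative names):

```r
library(dplyr)

# Made-up values, not taken from the gapminder file
toy <- data.frame(country = c("A", "B"),
                  gdpPercap = c(1000, 20000),
                  pop = c(2e6, 5e7))

# Same formula as above: total GDP in billions
with_gdp <- toy %>%
    mutate(gdp_billion = gdpPercap * pop / 10^9)

names(with_gdp)        # "country" "gdpPercap" "pop" "gdp_billion"
with_gdp$gdp_billion   # 2 1000
```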