A few years ago, the New York Times published an article about the “janitor work” of data science. The article noted: “Data scientists… spend from 50 to 80 percent of their time mired in the mundane labor of collecting and preparing data, before it can be explored for useful information.” Before you can begin searching your datasets for answers to your interesting biological questions, your data must be clean and accessible. This involves much more than cut-and-paste techniques in Microsoft Excel. R has numerous handy packages for this, such as dplyr, tidyr, and reshape2. Together with ggplot2 for visualization, these data cleaning and organization packages are often referred to as the tidyverse: a collection of R packages designed to work together as a data analysis pipeline. Many of these topics are discussed in Hadley Wickham’s book R for Data Science, which is available online for free!
Another great resource for the data wrangling packages, and for using RStudio in general, is the collection of RStudio cheatsheets! These are awesome quick references for base R, R Markdown, the RStudio IDE, and data import, transformation, and visualization. Many of the examples below come from the cheatsheets and RStudio blogs.
To use R packages:
install.packages("package")
library(package)
This will install the specified package if you haven’t already. In each new R session, you load the package with the library function, usually at the top of your code. Let’s load the data wrangling packages:
library(dplyr)
library(tidyr)
library(reshape2)
Next, load the dataset you want to work with and inspect it a little bit:
data(iris) #Load the built-in iris dataset
head(iris, 10)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
colnames(iris)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## [5] "Species"
dim(iris)
## [1] 150 5
When tidying your data, the first thing to do is to convert it to the tbl class, which is easier to examine than a plain data frame: R will display only what fits on the screen. (tbl_df() comes from dplyr; in newer releases it is superseded by as_tibble().)
tbl_df(iris)
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fctr>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # ... with 140 more rows
iris_df <- tbl_df(iris)
Checking out the data some more:
glimpse(iris)
## Observations: 150
## Variables: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9,...
## $ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1,...
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5,...
## $ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1,...
## $ Species <fctr> setosa, setosa, setosa, setosa, setosa, setosa, ...
You can also use the View() function to view the dataset in a spreadsheet-like display.
The pipe %>% passes the object on its left as the first argument of the function on its right. Using pipes makes our code more readable, such as:
iris %>% group_by(Species) %>% summarize(avg=mean(Sepal.Width)) %>% arrange(avg)
## # A tibble: 3 × 2
## Species avg
## <fctr> <dbl>
## 1 versicolor 2.770
## 2 virginica 2.974
## 3 setosa 3.428
This is an example of using dplyr where pipes make our lives much easier.
To subset rows that match logical criteria, use filter:
filter(iris_df, Sepal.Length > 7)
## # A tibble: 12 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fctr>
## 1 7.1 3.0 5.9 2.1 virginica
## 2 7.6 3.0 6.6 2.1 virginica
## 3 7.3 2.9 6.3 1.8 virginica
## 4 7.2 3.6 6.1 2.5 virginica
## 5 7.7 3.8 6.7 2.2 virginica
## 6 7.7 2.6 6.9 2.3 virginica
## 7 7.7 2.8 6.7 2.0 virginica
## 8 7.2 3.2 6.0 1.8 virginica
## 9 7.2 3.0 5.8 1.6 virginica
## 10 7.4 2.8 6.1 1.9 virginica
## 11 7.9 3.8 6.4 2.0 virginica
## 12 7.7 3.0 6.1 2.3 virginica
This gives us rows that contain a sepal length that is greater than 7.
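filter can also take multiple criteria at once: comma-separated conditions are combined with AND, and logical operators handle OR. A quick sketch:
filter(iris_df, Sepal.Length > 7, Species == "virginica") #Long sepals AND virginica
filter(iris_df, Sepal.Length > 7.5 | Sepal.Length < 4.5) #Very long OR very short sepals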
To remove duplicate rows:
distinct(iris_df)
## # A tibble: 149 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fctr>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # ... with 139 more rows
To randomly select a fraction of rows (here 50%, sampled with replacement):
sample_frac(iris_df, 0.5, replace=TRUE)
## # A tibble: 75 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fctr>
## 1 5.4 3.4 1.5 0.4 setosa
## 2 6.5 3.2 5.1 2.0 virginica
## 3 5.8 4.0 1.2 0.2 setosa
## 4 6.5 3.0 5.8 2.2 virginica
## 5 5.6 2.8 4.9 2.0 virginica
## 6 5.7 4.4 1.5 0.4 setosa
## 7 6.7 3.1 4.7 1.5 versicolor
## 8 6.8 3.2 5.9 2.3 virginica
## 9 6.3 2.3 4.4 1.3 versicolor
## 10 6.1 2.6 5.6 1.4 virginica
## # ... with 65 more rows
To randomly select a fixed number of rows (here 10, sampled with replacement):
sample_n(iris_df, 10, replace=TRUE)
## # A tibble: 10 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fctr>
## 1 6.2 2.9 4.3 1.3 versicolor
## 2 6.9 3.1 4.9 1.5 versicolor
## 3 4.9 2.5 4.5 1.7 virginica
## 4 6.0 2.9 4.5 1.5 versicolor
## 5 6.3 3.3 6.0 2.5 virginica
## 6 5.7 2.9 4.2 1.3 versicolor
## 7 6.5 3.0 5.8 2.2 virginica
## 8 5.1 3.5 1.4 0.2 setosa
## 9 6.5 2.8 4.6 1.5 versicolor
## 10 5.1 3.8 1.9 0.4 setosa
To select rows by their position:
slice(iris_df, 10:15)
## # A tibble: 6 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fctr>
## 1 4.9 3.1 1.5 0.1 setosa
## 2 5.4 3.7 1.5 0.2 setosa
## 3 4.8 3.4 1.6 0.2 setosa
## 4 4.8 3.0 1.4 0.1 setosa
## 5 4.3 3.0 1.1 0.1 setosa
## 6 5.8 4.0 1.2 0.2 setosa
This is “slicing and dicing” the data into a particular subset.
The select function is used for subsetting by column name, i.e. picking out particular variables. Recall the column names of the iris dataset:
colnames(iris)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## [5] "Species"
If we want a subsetted dataset with only the columns Sepal.Width, Petal.Length, and Species, without changing the original dataframe:
select(iris_df, Sepal.Width, Petal.Length, Species)
## # A tibble: 150 × 3
## Sepal.Width Petal.Length Species
## <dbl> <dbl> <fctr>
## 1 3.5 1.4 setosa
## 2 3.0 1.4 setosa
## 3 3.2 1.3 setosa
## 4 3.1 1.5 setosa
## 5 3.6 1.4 setosa
## 6 3.9 1.7 setosa
## 7 3.4 1.4 setosa
## 8 3.4 1.5 setosa
## 9 2.9 1.4 setosa
## 10 3.1 1.5 setosa
## # ... with 140 more rows
This is a good place to use pipes:
iris_df %>% select(Sepal.Width, Petal.Length, Species)
## # A tibble: 150 × 3
## Sepal.Width Petal.Length Species
## <dbl> <dbl> <fctr>
## 1 3.5 1.4 setosa
## 2 3.0 1.4 setosa
## 3 3.2 1.3 setosa
## 4 3.1 1.5 setosa
## 5 3.6 1.4 setosa
## 6 3.9 1.7 setosa
## 7 3.4 1.4 setosa
## 8 3.4 1.5 setosa
## 9 2.9 1.4 setosa
## 10 3.1 1.5 setosa
## # ... with 140 more rows
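select also understands helper functions for matching column names, which saves a lot of typing on wide dataframes. A couple of quick sketches:
iris_df %>% select(starts_with("Petal")) #All columns whose names start with "Petal"
iris_df %>% select(-Species) #Every column except Species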
You can also summarize data with different functions that take a vector of values and return a single value.
summarize(iris, avg = mean(Sepal.Length))
## avg
## 1 5.843333
#Using pipes
iris %>% summarize(avg=mean(Sepal.Length))
## avg
## 1 5.843333
This summarizes the data into a single row of values. To do so for each column:
summarize_each(iris, funs(mean))
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.843333 3.057333 3.758 1.199333 NA
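Note that summarize_each has since been superseded in newer releases of dplyr; on dplyr 1.0 or later, the equivalent uses across(). A minimal sketch:
iris %>% summarize(across(where(is.numeric), mean)) #Mean of each numeric column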
The summarize function can be used with many different single-value functions:
iris %>% summarize(first(Sepal.Length))
## first(Sepal.Length)
## 1 5.1
iris %>% summarize(min(Sepal.Length))
## min(Sepal.Length)
## 1 4.3
iris %>% summarize(n_distinct(Sepal.Length))
## n_distinct(Sepal.Length)
## 1 35
iris %>% summarize(IQR(Sepal.Length))
## IQR(Sepal.Length)
## 1 1.3
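These single-value summaries become much more informative when combined with group_by, which computes them per group. A sketch (the summary column names are my own choices):
iris %>%
  group_by(Species) %>%
  summarize(n = n(), #Rows per species
            min_sl = min(Sepal.Length), #Smallest sepal length
            max_sl = max(Sepal.Length), #Largest sepal length
            iqr_sl = IQR(Sepal.Length)) #Interquartile range of sepal length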
Use the mutate function to make new variables or apply window functions. These are vectorized functions that take a vector as input and return a vector of the same length as output.
mutate(iris_df, sepal = Sepal.Length + Sepal.Width)
## # A tibble: 150 × 6
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species sepal
## <dbl> <dbl> <dbl> <dbl> <fctr> <dbl>
## 1 5.1 3.5 1.4 0.2 setosa 8.6
## 2 4.9 3.0 1.4 0.2 setosa 7.9
## 3 4.7 3.2 1.3 0.2 setosa 7.9
## 4 4.6 3.1 1.5 0.2 setosa 7.7
## 5 5.0 3.6 1.4 0.2 setosa 8.6
## 6 5.4 3.9 1.7 0.4 setosa 9.3
## 7 4.6 3.4 1.4 0.3 setosa 8.0
## 8 5.0 3.4 1.5 0.2 setosa 8.4
## 9 4.4 2.9 1.4 0.2 setosa 7.3
## 10 4.9 3.1 1.5 0.1 setosa 8.0
## # ... with 140 more rows
iris_df %>% mutate(sepal = Sepal.Length + Sepal.Width)
## # A tibble: 150 × 6
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species sepal
## <dbl> <dbl> <dbl> <dbl> <fctr> <dbl>
## 1 5.1 3.5 1.4 0.2 setosa 8.6
## 2 4.9 3.0 1.4 0.2 setosa 7.9
## 3 4.7 3.2 1.3 0.2 setosa 7.9
## 4 4.6 3.1 1.5 0.2 setosa 7.7
## 5 5.0 3.6 1.4 0.2 setosa 8.6
## 6 5.4 3.9 1.7 0.4 setosa 9.3
## 7 4.6 3.4 1.4 0.3 setosa 8.0
## 8 5.0 3.4 1.5 0.2 setosa 8.4
## 9 4.4 2.9 1.4 0.2 setosa 7.3
## 10 4.9 3.1 1.5 0.1 setosa 8.0
## # ... with 140 more rows
This added a new column containing the sum of each row’s sepal length and sepal width.
To create a new column with an applied function and drop the original columns, use transmute:
iris_df %>% transmute(sepal = Sepal.Length + Sepal.Width)
## # A tibble: 150 × 1
## sepal
## <dbl>
## 1 8.6
## 2 7.9
## 3 7.9
## 4 7.7
## 5 8.6
## 6 9.3
## 7 8.0
## 8 8.4
## 9 7.3
## 10 8.0
## # ... with 140 more rows
mutate_all and mutate_each apply a function to every column; window functions are wrapped in funs():
iris_df %>% mutate_each(funs(min_rank)) #Ranks where ties get minimum rank
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <int> <int> <int> <int> <int>
## 1 33 126 12 6 1
## 2 17 58 12 6 1
## 3 10 95 5 6 1
## 4 6 84 25 6 1
## 5 23 132 12 6 1
## 6 47 145 45 42 1
## 7 6 114 12 35 1
## 8 23 114 25 6 1
## 9 2 48 12 6 1
## 10 17 84 25 1 1
## # ... with 140 more rows
iris_df %>% mutate_all(funs(lead)) #Values shifted by 1
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fctr>
## 1 4.9 3.0 1.4 0.2 setosa
## 2 4.7 3.2 1.3 0.2 setosa
## 3 4.6 3.1 1.5 0.2 setosa
## 4 5.0 3.6 1.4 0.2 setosa
## 5 5.4 3.9 1.7 0.4 setosa
## 6 4.6 3.4 1.4 0.3 setosa
## 7 5.0 3.4 1.5 0.2 setosa
## 8 4.4 2.9 1.4 0.2 setosa
## 9 4.9 3.1 1.5 0.1 setosa
## 10 5.4 3.7 1.5 0.2 setosa
## # ... with 140 more rows
iris_df %>% mutate_all(funs(row_number)) #Like min_rank, but ties broken by row position
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <int> <int> <int> <int> <int>
## 1 33 126 12 6 1
## 2 17 58 13 7 2
## 3 10 95 5 8 3
## 4 6 84 25 9 4
## 5 23 132 14 10 5
## 6 47 145 45 42 6
## 7 7 114 15 35 7
## 8 24 115 26 11 8
## 9 2 48 16 12 9
## 10 18 85 27 1 10
## # ... with 140 more rows
iris_df %>% mutate_all(funs(cume_dist)) #Proportion of values less than or equal to each value
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.27333333 0.8733333 0.16000000 0.22666667 0.3333333
## 2 0.14666667 0.5533333 0.16000000 0.22666667 0.3333333
## 3 0.07333333 0.7133333 0.07333333 0.22666667 0.3333333
## 4 0.06000000 0.6266667 0.24666667 0.22666667 0.3333333
## 5 0.21333333 0.9000000 0.16000000 0.22666667 0.3333333
## 6 0.34666667 0.9733333 0.32000000 0.32000000 0.3333333
## 7 0.06000000 0.8333333 0.16000000 0.27333333 0.3333333
## 8 0.21333333 0.8333333 0.24666667 0.22666667 0.3333333
## 9 0.02666667 0.3800000 0.16000000 0.22666667 0.3333333
## 10 0.14666667 0.6266667 0.24666667 0.03333333 0.3333333
## # ... with 140 more rows
You can add new columns using window functions and keep the original dataset or create a new one.
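For example, a minimal sketch that stores the ranked version as a new object while leaving iris_df untouched (and, on dplyr 1.0 or later where mutate_all/funs are superseded, the across() equivalent):
ranked_iris <- iris_df %>% mutate_all(funs(min_rank)) #Older dplyr
ranked_iris <- iris_df %>% mutate(across(everything(), min_rank)) #Newer dplyr equivalent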
This is probably one of my favorite features of dplyr. Combining datasets through a copy-paste mechanism in Excel can get messy very quickly when you have extremely large dataframes; dplyr handles this through numerous join functions.
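Before turning to my real data, here is a toy sketch of how the main join verbs behave (df_a and df_b are made-up dataframes for illustration):
df_a <- data.frame(OTU = c("Otu0001", "Otu0002", "Otu0003"), count = c(29, 8, 2))
df_b <- data.frame(OTU = c("Otu0001", "Otu0002", "Otu0004"),
                   Phylum = c("Verrucomicrobia", "Proteobacteria", "Planctomycetes"))
left_join(df_a, df_b, by = "OTU") #All rows of df_a; NA where df_b has no match
inner_join(df_a, df_b, by = "OTU") #Only OTUs present in both dataframes
anti_join(df_a, df_b, by = "OTU") #Rows of df_a with no match in df_b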
I have two csv files containing OTU information from the Bogs in Northern Wisconsin. The OTU dataframe contains each OTU along with the counts from numerous sampling dates. The reclass dataframe contains each OTU along with classification information: kingdom, phylum, class, etc. The dplyr join functions depend on the presence of an identical column in the two dataframes you want to join. If a shared column doesn’t already exist, you can create one with the mutate functions shown above. I’ve converted each dataframe to the tbl class so the display won’t be the entire, very large dataframe. Checking out the two dataframes, you can tell which column will be used to join them together:
head(OTU_df, 20)
## # A tibble: 20 × 1,388
## X CBU01JUL05 CBU02AUG05 CBU03JUN05 CBU05AUG05 CBU05JUL05
## <fctr> <int> <int> <int> <int> <int>
## 1 Otu0001 29 11 7 6 20
## 2 Otu0002 8 14 11 23 5
## 3 Otu0003 2 9 2 7 1
## 4 Otu0004 0 0 0 0 0
## 5 Otu0005 117 5 0 6 58
## 6 Otu0006 0 0 0 0 0
## 7 Otu0007 0 8 0 3 0
## 8 Otu0008 0 1 0 0 0
## 9 Otu0009 0 4 0 0 0
## 10 Otu0010 0 0 0 0 0
## 11 Otu0011 0 2 0 0 0
## 12 Otu0012 0 0 0 0 0
## 13 Otu0013 0 1 0 4 0
## 14 Otu0014 4 12 0 14 0
## 15 Otu0015 29 0 11 2 14
## 16 Otu0016 0 0 0 0 0
## 17 Otu0017 0 0 0 0 0
## 18 Otu0018 3 3 1 1 3
## 19 Otu0019 0 0 0 0 0
## 20 Otu0020 0 0 0 0 0
## # ... with 1382 more variables: CBU07JUN05 <int>, CBU08JUL05 <int>,
## # CBU10AUG05 <int>, CBU10JUN05 <int>, CBU12JUL05 <int>,
## # CBU12JUN05 <int>, CBU14JUL05 <int>, CBU14JUN05 <int>,
## # CBU17JUN05 <int>, CBU19JUL05 <int>, CBU21JUN05 <int>,
## # CBU24JUN05 <int>, CBU24MAY05 <int>, CBU27MAY05 <int>,
## # CBU28JUL05 <int>, CBU28JUN05 <int>, CBU30AUG05 <int>,
## # CBU31MAY05 <int>, CBE01OCT07.R1 <int>, CBE01OCT07.R2 <int>,
## # CBE02AUG07.R1 <int>, CBE02AUG07.R2 <int>, CBE02AUG09 <int>,
## # CBE02JUL07.R1 <int>, CBE02JUL07.R2 <int>, CBE03JUN09 <int>,
## # CBE05JUL07.R1 <int>, CBE05JUL07.R2 <int>, CBE06AUG07 <int>,
## # CBE06NOV07.R1 <int>, CBE06NOV07.R2 <int>, CBE07JUL09 <int>,
## # CBE08JUN09 <int>, CBE09AUG07.R1 <int>, CBE09AUG07.R2 <int>,
## # CBE10JUL07.R1 <int>, CBE10JUL07.R2 <int>, CBE10SEP07.R1 <int>,
## # CBE10SEP07.R2 <int>, CBE12AUG09 <int>, CBE12JUL07.R1 <int>,
## # CBE12JUL07.R2 <int>, CBE13JUL09 <int>, CBE15JUN09 <int>,
## # CBE16JUL07.R1 <int>, CBE16JUL07.R2 <int>, CBE16OCT07.R1 <int>,
## # CBE16OCT07.R2 <int>, CBE17SEP07.R1 <int>, CBE17SEP07.R2 <int>,
## # CBE18AUG09 <int>, CBE18JUL07.R1 <int>, CBE18JUL07.R2 <int>,
## # CBE19JUN07.R1 <int>, CBE19JUN07.R2 <int>, CBE20AUG07 <int>,
## # CBE21JUL09 <int>, CBE21JUN07.R1 <int>, CBE21JUN07.R2 <int>,
## # CBE23AUG07.R1 <int>, CBE23AUG07.R2 <int>, CBE23JUN09 <int>,
## # CBE23OCT07.R1 <int>, CBE23OCT07.R2 <int>, CBE24AUG09 <int>,
## # CBE25JUL07.R1 <int>, CBE25JUL07.R2 <int>, CBE25SEP07.R1 <int>,
## # CBE25SEP07.R2 <int>, CBE27AUG07.R1 <int>, CBE27AUG07.R2 <int>,
## # CBE27JUL07.R1 <int>, CBE27JUL07.R2 <int>, CBE27JUL09 <int>,
## # CBE27JUN07.R1 <int>, CBE27JUN07.R2 <int>, CBE29JUN07.R1 <int>,
## # CBE29JUN07.R2 <int>, CBE29JUN09 <int>, CBE29MAY09 <int>,
## # CBE29OCT07 <int>, CBE31JUL07.R1 <int>, CBE31JUL07.R2 <int>,
## # CBH01OCT07.R1 <int>, CBH01OCT07.R2 <int>, CBH02AUG09 <int>,
## # CBH02JUL07.R1 <int>, CBH02JUL07.R2 <int>, CBH03JUN09 <int>,
## # CBH05JUL07.R1 <int>, CBH05JUL07.R2 <int>, CBH06AUG07.R1 <int>,
## # CBH06AUG07.R2 <int>, CBH06NOV07.R1 <int>, CBH06NOV07.R2 <int>,
## # CBH07JUL09 <int>, CBH08JUN09 <int>, CBH09AUG07.R1 <int>,
## # CBH09AUG07.R2 <int>, CBH10JUL07.R1 <int>, ...
head(reclass_df, 20)
## # A tibble: 20 × 8
## OTU Kingdom Phylum
## <fctr> <fctr> <fctr>
## 1 Otu0001 k__Bacteria(100) p__Verrucomicrobia(100)
## 2 Otu0002 k__Bacteria(100) p__Proteobacteria(100)
## 3 Otu0003 k__Bacteria(100) p__Verrucomicrobia(100)
## 4 Otu0004 k__Bacteria(100) p__Planctomycetes(100)
## 5 Otu0005 k__Bacteria(100) p__Proteobacteria(100)
## 6 Otu0006 k__Bacteria(100) p__OP3(100)
## 7 Otu0007 k__Bacteria(100) p__Bacteroidetes(99)
## 8 Otu0008 k__Bacteria(100) p__SR1(100)
## 9 Otu0009 k__Bacteria(100) p__OP3(100)
## 10 Otu0010 k__Bacteria(97) p__Actinobacteria(89)
## 11 Otu0011 k__Bacteria(100) p__Verrucomicrobia(78)
## 12 Otu0012 k__Bacteria(100) p__Proteobacteria(100)
## 13 Otu0013 k__Bacteria(100) p__OP3(100)
## 14 Otu0014 k__Bacteria(100) p__Verrucomicrobia(99)
## 15 Otu0015 k__Bacteria(100) p__Bacteroidetes(100)
## 16 Otu0016 k__Bacteria(100) p__Bacteroidetes(100)
## 17 Otu0017 k__Bacteria(100) p__Verrucomicrobia(99)
## 18 Otu0018 k__Bacteria(100) p__OP3(100)
## 19 Otu0019 k__Bacteria(100) p__Firmicutes(95)
## 20 Otu0020 k__Bacteria(100) p__Verrucomicrobia(100)
## # ... with 5 more variables: Class <fctr>, Order <fctr>, Lineage <fctr>,
## # Clade <fctr>, Tribe <fctr>
From this, you can see that the shared column is “OTU.” However, the OTU counts dataframe wasn’t given a name for its first column, so it shows up as “X.” I could probably very easily fix this in Excel and then reload the dataframe, but you could also give the column a new name directly in R:
colnames(OTU_df)[1] <- 'OTU'
head(OTU_df, 20)
## # A tibble: 20 × 1,388
## OTU CBU01JUL05 CBU02AUG05 CBU03JUN05 CBU05AUG05 CBU05JUL05
## <fctr> <int> <int> <int> <int> <int>
## 1 Otu0001 29 11 7 6 20
## 2 Otu0002 8 14 11 23 5
## 3 Otu0003 2 9 2 7 1
## 4 Otu0004 0 0 0 0 0
## 5 Otu0005 117 5 0 6 58
## 6 Otu0006 0 0 0 0 0
## 7 Otu0007 0 8 0 3 0
## 8 Otu0008 0 1 0 0 0
## 9 Otu0009 0 4 0 0 0
## 10 Otu0010 0 0 0 0 0
## 11 Otu0011 0 2 0 0 0
## 12 Otu0012 0 0 0 0 0
## 13 Otu0013 0 1 0 4 0
## 14 Otu0014 4 12 0 14 0
## 15 Otu0015 29 0 11 2 14
## 16 Otu0016 0 0 0 0 0
## 17 Otu0017 0 0 0 0 0
## 18 Otu0018 3 3 1 1 3
## 19 Otu0019 0 0 0 0 0
## 20 Otu0020 0 0 0 0 0
## # ... with 1382 more variables: CBU07JUN05 <int>, CBU08JUL05 <int>,
## # CBU10AUG05 <int>, CBU10JUN05 <int>, CBU12JUL05 <int>,
## # CBU12JUN05 <int>, CBU14JUL05 <int>, CBU14JUN05 <int>,
## # CBU17JUN05 <int>, CBU19JUL05 <int>, CBU21JUN05 <int>,
## # CBU24JUN05 <int>, CBU24MAY05 <int>, CBU27MAY05 <int>,
## # CBU28JUL05 <int>, CBU28JUN05 <int>, CBU30AUG05 <int>,
## # CBU31MAY05 <int>, CBE01OCT07.R1 <int>, CBE01OCT07.R2 <int>,
## # CBE02AUG07.R1 <int>, CBE02AUG07.R2 <int>, CBE02AUG09 <int>,
## # CBE02JUL07.R1 <int>, CBE02JUL07.R2 <int>, CBE03JUN09 <int>,
## # CBE05JUL07.R1 <int>, CBE05JUL07.R2 <int>, CBE06AUG07 <int>,
## # CBE06NOV07.R1 <int>, CBE06NOV07.R2 <int>, CBE07JUL09 <int>,
## # CBE08JUN09 <int>, CBE09AUG07.R1 <int>, CBE09AUG07.R2 <int>,
## # CBE10JUL07.R1 <int>, CBE10JUL07.R2 <int>, CBE10SEP07.R1 <int>,
## # CBE10SEP07.R2 <int>, CBE12AUG09 <int>, CBE12JUL07.R1 <int>,
## # CBE12JUL07.R2 <int>, CBE13JUL09 <int>, CBE15JUN09 <int>,
## # CBE16JUL07.R1 <int>, CBE16JUL07.R2 <int>, CBE16OCT07.R1 <int>,
## # CBE16OCT07.R2 <int>, CBE17SEP07.R1 <int>, CBE17SEP07.R2 <int>,
## # CBE18AUG09 <int>, CBE18JUL07.R1 <int>, CBE18JUL07.R2 <int>,
## # CBE19JUN07.R1 <int>, CBE19JUN07.R2 <int>, CBE20AUG07 <int>,
## # CBE21JUL09 <int>, CBE21JUN07.R1 <int>, CBE21JUN07.R2 <int>,
## # CBE23AUG07.R1 <int>, CBE23AUG07.R2 <int>, CBE23JUN09 <int>,
## # CBE23OCT07.R1 <int>, CBE23OCT07.R2 <int>, CBE24AUG09 <int>,
## # CBE25JUL07.R1 <int>, CBE25JUL07.R2 <int>, CBE25SEP07.R1 <int>,
## # CBE25SEP07.R2 <int>, CBE27AUG07.R1 <int>, CBE27AUG07.R2 <int>,
## # CBE27JUL07.R1 <int>, CBE27JUL07.R2 <int>, CBE27JUL09 <int>,
## # CBE27JUN07.R1 <int>, CBE27JUN07.R2 <int>, CBE29JUN07.R1 <int>,
## # CBE29JUN07.R2 <int>, CBE29JUN09 <int>, CBE29MAY09 <int>,
## # CBE29OCT07 <int>, CBE31JUL07.R1 <int>, CBE31JUL07.R2 <int>,
## # CBH01OCT07.R1 <int>, CBH01OCT07.R2 <int>, CBH02AUG09 <int>,
## # CBH02JUL07.R1 <int>, CBH02JUL07.R2 <int>, CBH03JUN09 <int>,
## # CBH05JUL07.R1 <int>, CBH05JUL07.R2 <int>, CBH06AUG07.R1 <int>,
## # CBH06AUG07.R2 <int>, CBH06NOV07.R1 <int>, CBH06NOV07.R2 <int>,
## # CBH07JUL09 <int>, CBH08JUN09 <int>, CBH09AUG07.R1 <int>,
## # CBH09AUG07.R2 <int>, CBH10JUL07.R1 <int>, ...
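Alternatively, the fix can stay inside a dplyr pipeline with rename(), which takes new_name = old_name pairs. A one-line sketch equivalent to the colnames() assignment above:
OTU_df <- OTU_df %>% rename(OTU = X)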
Now that both dataframes have a matching OTU column, they can be joined and the result written out to a file:
joined_OTU <- left_join(reclass_df, OTU_df, by = "OTU", copy=FALSE)
joined_OTU
## # A tibble: 6,208 × 1,395
## OTU Kingdom Phylum
## <fctr> <fctr> <fctr>
## 1 Otu0001 k__Bacteria(100) p__Verrucomicrobia(100)
## 2 Otu0002 k__Bacteria(100) p__Proteobacteria(100)
## 3 Otu0003 k__Bacteria(100) p__Verrucomicrobia(100)
## 4 Otu0004 k__Bacteria(100) p__Planctomycetes(100)
## 5 Otu0005 k__Bacteria(100) p__Proteobacteria(100)
## 6 Otu0006 k__Bacteria(100) p__OP3(100)
## 7 Otu0007 k__Bacteria(100) p__Bacteroidetes(99)
## 8 Otu0008 k__Bacteria(100) p__SR1(100)
## 9 Otu0009 k__Bacteria(100) p__OP3(100)
## 10 Otu0010 k__Bacteria(97) p__Actinobacteria(89)
## # ... with 6,198 more rows, and 1392 more variables: Class <fctr>,
## # Order <fctr>, Lineage <fctr>, Clade <fctr>, Tribe <fctr>,
## # CBU01JUL05 <int>, CBU02AUG05 <int>, CBU03JUN05 <int>,
## # CBU05AUG05 <int>, CBU05JUL05 <int>, CBU07JUN05 <int>,
## # CBU08JUL05 <int>, CBU10AUG05 <int>, CBU10JUN05 <int>,
## # CBU12JUL05 <int>, CBU12JUN05 <int>, CBU14JUL05 <int>,
## # CBU14JUN05 <int>, CBU17JUN05 <int>, CBU19JUL05 <int>,
## # CBU21JUN05 <int>, CBU24JUN05 <int>, CBU24MAY05 <int>,
## # CBU27MAY05 <int>, CBU28JUL05 <int>, CBU28JUN05 <int>,
## # CBU30AUG05 <int>, CBU31MAY05 <int>, CBE01OCT07.R1 <int>,
## # CBE01OCT07.R2 <int>, CBE02AUG07.R1 <int>, CBE02AUG07.R2 <int>,
## # CBE02AUG09 <int>, CBE02JUL07.R1 <int>, CBE02JUL07.R2 <int>,
## # CBE03JUN09 <int>, CBE05JUL07.R1 <int>, CBE05JUL07.R2 <int>,
## # CBE06AUG07 <int>, CBE06NOV07.R1 <int>, CBE06NOV07.R2 <int>,
## # CBE07JUL09 <int>, CBE08JUN09 <int>, CBE09AUG07.R1 <int>,
## # CBE09AUG07.R2 <int>, CBE10JUL07.R1 <int>, CBE10JUL07.R2 <int>,
## # CBE10SEP07.R1 <int>, CBE10SEP07.R2 <int>, CBE12AUG09 <int>,
## # CBE12JUL07.R1 <int>, CBE12JUL07.R2 <int>, CBE13JUL09 <int>,
## # CBE15JUN09 <int>, CBE16JUL07.R1 <int>, CBE16JUL07.R2 <int>,
## # CBE16OCT07.R1 <int>, CBE16OCT07.R2 <int>, CBE17SEP07.R1 <int>,
## # CBE17SEP07.R2 <int>, CBE18AUG09 <int>, CBE18JUL07.R1 <int>,
## # CBE18JUL07.R2 <int>, CBE19JUN07.R1 <int>, CBE19JUN07.R2 <int>,
## # CBE20AUG07 <int>, CBE21JUL09 <int>, CBE21JUN07.R1 <int>,
## # CBE21JUN07.R2 <int>, CBE23AUG07.R1 <int>, CBE23AUG07.R2 <int>,
## # CBE23JUN09 <int>, CBE23OCT07.R1 <int>, CBE23OCT07.R2 <int>,
## # CBE24AUG09 <int>, CBE25JUL07.R1 <int>, CBE25JUL07.R2 <int>,
## # CBE25SEP07.R1 <int>, CBE25SEP07.R2 <int>, CBE27AUG07.R1 <int>,
## # CBE27AUG07.R2 <int>, CBE27JUL07.R1 <int>, CBE27JUL07.R2 <int>,
## # CBE27JUL09 <int>, CBE27JUN07.R1 <int>, CBE27JUN07.R2 <int>,
## # CBE29JUN07.R1 <int>, CBE29JUN07.R2 <int>, CBE29JUN09 <int>,
## # CBE29MAY09 <int>, CBE29OCT07 <int>, CBE31JUL07.R1 <int>,
## # CBE31JUL07.R2 <int>, CBH01OCT07.R1 <int>, CBH01OCT07.R2 <int>,
## # CBH02AUG09 <int>, CBH02JUL07.R1 <int>, CBH02JUL07.R2 <int>,
## # CBH03JUN09 <int>, CBH05JUL07.R1 <int>, ...
write.csv(joined_OTU, "~/Desktop/OTU-samples-and-classifications-joined.csv")
To check that your dataframes merged correctly, look at the dimensions of the originals and the joined dataframe:
dim(OTU_df)
## [1] 6208 1388
dim(reclass_df)
## [1] 6208 8
dim(joined_OTU)
## [1] 6208 1395
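The numbers line up: the joined dataframe keeps all 6208 OTUs, and its 1395 columns are the 8 classification columns plus the 1387 count columns (the shared OTU column is counted only once). For a more direct check than eyeballing dimensions, anti_join returns the rows of one dataframe that have no partner in the other, so a zero-row result means every OTU matched. A quick sketch:
anti_join(reclass_df, OTU_df, by = "OTU") #Should return a tibble with 0 rows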