Lab 7

Tidyverse

“Tidy” Data

Let’s say we have a file that looks like the following:

df <- read.csv("lab7.csv")
head(df)
  predictor treatment_1 treatment_2
1         1   -3.167200   37.455712
2         2    4.496217   18.670957
3         3   -1.351417   24.815985
4         4    7.995896  -32.981477
5         5    9.169120   45.364987
6         6    9.034098    4.445134

And we want to find the effect of two variables:

  1. predictor
  2. treatment_1 vs. treatment_2

(Where are our outcome values located in this table?)

“Tidy” Data (cont.)

We could attempt to create two separate linear models:

model_1 <- lm(treatment_1 ~ predictor, df)
model_2 <- lm(treatment_2 ~ predictor, df)
summary(model_1)

Call:
lm(formula = treatment_1 ~ predictor, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-5.148 -2.863  1.618  2.611  4.941 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.04237    1.34205   0.777    0.445    
predictor    0.91795    0.09028  10.168 5.57e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.255 on 23 degrees of freedom
Multiple R-squared:  0.818, Adjusted R-squared:  0.8101 
F-statistic: 103.4 on 1 and 23 DF,  p-value: 5.575e-10
summary(model_2)

Call:
lm(formula = treatment_2 ~ predictor, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-51.14 -25.44   0.90  19.88  46.59 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  17.3799    11.2528   1.544    0.136
predictor     0.1956     0.7569   0.258    0.798

Residual standard error: 27.29 on 23 degrees of freedom
Multiple R-squared:  0.002893,  Adjusted R-squared:  -0.04046 
F-statistic: 0.06674 on 1 and 23 DF,  p-value: 0.7984
  • This tells us the effect of the predictor within each group (treatment 1 and treatment 2), but not its effect across the data as a whole.

  • We also don’t know the effect of switching between treatment 1 and 2.

“Tidy” Data (cont.)

These models need a bit of work, but let’s first try to visualize the data.

library(ggplot2)
p <- ggplot(df, aes(x = predictor))
p + geom_point(aes(y = treatment_1))

p + geom_point(aes(y = treatment_2))

We can create a separate graph for each condition, but it is awkward to plot both on the same graph.

“Tidy” Data (cont.)

Data comes in many formats in real life, which can make it difficult to deal with during data analysis.

When data is all in the same format, with columns as variables, rows as observations, and cells as values, we call this tidy data.

Tidyverse is a collection of packages built around the philosophy of “tidy” data.

  • We’ve already used one of these packages, ggplot2.

If you want to read more about the philosophy behind tidy data, feel free to read this article.

Wide vs. Long Data
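
As a minimal sketch of the distinction (the toy values and the names `wide` and `long` are invented for illustration and are not taken from lab7.csv), the same four measurements can be stored in either layout:

```r
# Wide layout: one row per predictor value, one column per treatment.
wide <- data.frame(
  predictor   = c(1, 2),
  treatment_1 = c(10, 20),
  treatment_2 = c(30, 40)
)

# Long ("tidy") layout: one row per observation, with the treatment
# stored as a variable in its own column.
long <- data.frame(
  predictor = c(1, 1, 2, 2),
  treatment = c("1", "2", "1", "2"),
  value     = c(10, 30, 20, 40)
)

print(wide)
print(long)
```

Both objects hold the same information; only the long version has a single column of outcome values that a model or plot can point to.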

Installing Packages

Let’s get started with our first tidyverse function (besides ggplot2). If you haven’t already, install the tidyverse package.

# install.packages("tidyverse")
library(tidyverse)

Loading tidyverse attaches all of its core packages at once. You can also load only the specific package you need.

library(tidyr)

pivot_longer

To fix the problem we were encountering with our earlier dataframe, let’s transform the dataframe to have the following three columns.

  • predictor: first independent variable

  • treatment: second independent variable

  • value: dependent variable

colnames(df) <- c("predictor", "1", "2")
df_long <- pivot_longer(df, c("1", "2"), names_to = "treatment", values_to = "value")
head(df_long)
# A tibble: 6 × 3
  predictor treatment value
      <int> <chr>     <dbl>
1         1 1         -3.17
2         1 2         37.5 
3         2 1          4.50
4         2 2         18.7 
5         3 1         -1.35
6         3 2         24.8 

Let’s compare that to our previous dataframe:

head(df)
  predictor         1          2
1         1 -3.167200  37.455712
2         2  4.496217  18.670957
3         3 -1.351417  24.815985
4         4  7.995896 -32.981477
5         5  9.169120  45.364987
6         6  9.034098   4.445134
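
Renaming the columns by hand is one option. As a possible alternative, here is a sketch that leaves the original column names alone and lets pivot_longer strip the shared prefix through its `names_prefix` argument. The data frame `toy` is a hypothetical stand-in for the lab file, with invented values.

```r
library(tidyr)

# Hypothetical stand-in for the lab data (values invented).
toy <- data.frame(
  predictor   = 1:3,
  treatment_1 = c(-3.2, 4.5, -1.4),
  treatment_2 = c(37.5, 18.7, 24.8)
)

toy_long <- pivot_longer(
  toy,
  cols         = starts_with("treatment_"),  # columns to stack
  names_prefix = "treatment_",               # strip this prefix from the labels
  names_to     = "treatment",
  values_to    = "value"
)

toy_long
```

Each wide row becomes two long rows, so the result has twice as many rows as `toy`.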

Linear model

Now, we can run a linear model using both a continuous and a categorical predictor.

model_long <- lm(value ~ predictor + treatment, df_long)
summary(model_long)

Call:
lm(formula = value ~ predictor + treatment, data = df_long)

Residuals:
    Min      1Q  Median      3Q     Max 
-47.893  -8.534   0.622   7.856  44.425 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.7380     6.2909   0.912    0.366
predictor     0.5568     0.3807   1.462    0.150
treatment2    6.9463     5.4911   1.265    0.212

Residual standard error: 19.41 on 47 degrees of freedom
Multiple R-squared:  0.07368,   Adjusted R-squared:  0.03426 
F-statistic: 1.869 on 2 and 47 DF,  p-value: 0.1655
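
To help read the `treatment2` row, here is a small sketch on invented data (the data frame `toy_long` and its values are assumptions, not the lab data). R dummy-codes the character column, using treatment "1" as the reference level, so the coefficient is the estimated shift when moving from treatment 1 to treatment 2. Because this toy design is balanced (every predictor value appears once per treatment), the coefficient matches the raw difference in group means.

```r
# Invented, balanced toy data: each predictor value appears in both treatments.
toy_long <- data.frame(
  predictor = rep(1:5, times = 2),
  treatment = rep(c("1", "2"), each = 5),
  value     = c(1.2, 2.1, 2.9, 4.2, 5.1, 3.0, 1.5, 4.4, 2.2, 6.1)
)

fit <- lm(value ~ predictor + treatment, data = toy_long)

# Raw difference in mean outcome between the two treatments.
group_means <- tapply(toy_long$value, toy_long$treatment, mean)
mean_diff   <- unname(group_means["2"] - group_means["1"])

coef(fit)["treatment2"]
mean_diff
```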

Line graph

We can also visualize both treatments simultaneously as separate lines using the color parameter in aes().

p2 <- ggplot(df_long, aes(predictor, value, color=treatment)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
p2

This automatically creates a separate color for each group for both the points and the line.

reshape - base R equivalent

You can also do the same operation without tidyverse by using the base R function reshape.

reshape(df, direction = "long", idvar = "predictor", varying = c("1", "2"), v.names = "value", timevar = "treatment", times = c("1", "2"))
     predictor treatment      value
1.1          1         1  -3.167200
2.1          2         1   4.496217
3.1          3         1  -1.351417
4.1          4         1   7.995896
5.1          5         1   9.169120
6.1          6         1   9.034098
7.1          7         1   4.638596
8.1          8         1  12.187389
9.1          9         1   6.223722
10.1        10         1   8.266001
11.1        11         1   9.861950
12.1        12         1  14.668833
13.1        13         1  15.507465
14.1        14         1  15.805155
15.1        15         1  17.028237
16.1        16         1  17.418002
17.1        17         1  12.918129
18.1        18         1  20.344925
19.1        19         1  14.766201
20.1        20         1  16.874053
21.1        21         1  23.195124
22.1        22         1  20.954226
23.1        23         1  27.096226
24.1        24         1  20.210125
25.1        25         1  20.252922
1.2          1         2  37.455712
2.2          2         2  18.670957
3.2          3         2  24.815985
4.2          4         2 -32.981477
5.2          5         2  45.364987
6.2          6         2   4.445134
7.2          7         2  -6.692952
8.2          8         2  38.300509
9.2          9         2   9.839616
10.2        10         2  31.588467
11.2        11         2  -6.713653
12.2        12         2  38.469053
13.2        13         2  54.091338
14.2        14         2   7.832762
15.2        15         2   9.296931
16.2        16         2  -5.198206
17.2        17         2  33.393716
18.2        18         2  57.304227
19.2        19         2  67.687201
20.2        20         2  43.646847
21.2        21         2  -5.477597
22.2        22         2 -21.679956
23.2        23         2  -7.867105
24.2        24         2   9.542462
25.2        25         2  52.916863

Most things that can be done in Tidyverse can also be done in base R, so which one you use is up to you. We will go over a few more tidyverse equivalents of base R functions today.

dplyr

pivot_longer comes from the tidyr portion of Tidyverse. We will now go over the dplyr portion, which contains many other functions for dataframe manipulation.

Data

Let’s once again use the iris dataset.

data(iris)
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

The dplyr package is installed along with tidyverse and is attached when you run library(tidyverse).

select()

Equivalent to column indexing in base R.

base_select <- iris[c("Sepal.Length", "Petal.Length")]
head(base_select)
  Sepal.Length Petal.Length
1          5.1          1.4
2          4.9          1.4
3          4.7          1.3
4          4.6          1.5
5          5.0          1.4
6          5.4          1.7
dplyr_select <- select(iris, c("Sepal.Length", "Petal.Length"))
head(dplyr_select)
  Sepal.Length Petal.Length
1          5.1          1.4
2          4.9          1.4
3          4.7          1.3
4          4.6          1.5
5          5.0          1.4
6          5.4          1.7
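
Beyond listing columns by name, select() also accepts helper functions and negation. A short sketch (the object names `length_cols` and `no_species` are chosen here for illustration):

```r
library(dplyr)

# Select by name pattern instead of listing each column.
length_cols <- select(iris, ends_with("Length"))

# Drop a column with a minus sign and keep everything else.
no_species <- select(iris, -Species)

names(length_cols)
names(no_species)
```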

filter()

Equivalent to row indexing in base R.

base_filter <- iris[iris$Species == "setosa", ]
head(base_filter)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
dplyr_filter <- filter(iris, Species == "setosa")
head(dplyr_filter)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
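
The same comparison extends to more than one condition. In the sketch below (object names are illustrative), base R combines conditions with `&`, while filter() treats comma-separated conditions as AND. Row labels may differ between the two results, so the check compares values only.

```r
library(dplyr)

# Base R: combine conditions with & inside the row index.
base_two <- iris[iris$Species == "setosa" & iris$Sepal.Length > 5, ]

# dplyr: conditions separated by commas are combined with AND.
dplyr_two <- filter(iris, Species == "setosa", Sepal.Length > 5)

# Compare the values only, ignoring row labels and other attributes.
all.equal(base_two, dplyr_two, check.attributes = FALSE)
```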

mutate()

Equivalent to creating a new column in base R.

base_mutate <- iris # creating a copy of original dataframe
base_mutate$Average.Length <- (base_mutate$Sepal.Length + base_mutate$Petal.Length) / 2
head(base_mutate)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species Average.Length
1          5.1         3.5          1.4         0.2  setosa           3.25
2          4.9         3.0          1.4         0.2  setosa           3.15
3          4.7         3.2          1.3         0.2  setosa           3.00
4          4.6         3.1          1.5         0.2  setosa           3.05
5          5.0         3.6          1.4         0.2  setosa           3.20
6          5.4         3.9          1.7         0.4  setosa           3.55
dplyr_mutate <- mutate(iris, Average.Length = (Sepal.Length + Petal.Length) / 2)
head(dplyr_mutate)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species Average.Length
1          5.1         3.5          1.4         0.2  setosa           3.25
2          4.9         3.0          1.4         0.2  setosa           3.15
3          4.7         3.2          1.3         0.2  setosa           3.00
4          4.6         3.1          1.5         0.2  setosa           3.05
5          5.0         3.6          1.4         0.2  setosa           3.20
6          5.4         3.9          1.7         0.4  setosa           3.55
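
mutate() can also create several columns in one call, and a later column may use one created earlier in the same call. A minimal sketch (the column `Above.Average` is invented for this example):

```r
library(dplyr)

iris_more <- mutate(
  iris,
  Average.Length = (Sepal.Length + Petal.Length) / 2,
  # Average.Length was created on the line above and is already usable here.
  Above.Average  = Average.Length > mean(Average.Length)
)

head(iris_more)
```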

Pipelines: %>%

dplyr also supports chaining multiple operations on the same dataframe with the pipe operator, %>%. Let’s say you wanted to:

  1. Filter for only setosas
  2. Select for only sepal and petal lengths
  3. Create a new average length column
iris_1 <- filter(iris, Species == "setosa")
iris_2 <- select(iris_1, c("Sepal.Length", "Petal.Length"))
iris_3 <- mutate(iris_2, Average.Length = (Sepal.Length + Petal.Length) / 2)
head(iris_3)
  Sepal.Length Petal.Length Average.Length
1          5.1          1.4           3.25
2          4.9          1.4           3.15
3          4.7          1.3           3.00
4          4.6          1.5           3.05
5          5.0          1.4           3.20
6          5.4          1.7           3.55

vs.

iris %>%
  filter(Species == "setosa") %>%
  select(c("Sepal.Length", "Petal.Length")) %>%
  mutate(Average.Length = (Sepal.Length + Petal.Length) / 2) %>%
  head()
  Sepal.Length Petal.Length Average.Length
1          5.1          1.4           3.25
2          4.9          1.4           3.15
3          4.7          1.3           3.00
4          4.6          1.5           3.05
5          5.0          1.4           3.20
6          5.4          1.7           3.55
  • At each line, the %>% symbol passes the result of the previous step into the next function as its first argument.

  • Both examples produce the same output, so which format you use is a matter of preference.
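
To make the pipe less mysterious, the sketch below shows that `x %>% f(y)` is evaluated as `f(x, y)`, first for a single step and then for the three steps listed above written as nested calls. The object names are chosen for illustration.

```r
library(dplyr)

# x %>% f(y) is evaluated as f(x, y).
piped  <- iris %>% filter(Species == "setosa")
nested <- filter(iris, Species == "setosa")
identical(piped, nested)

# The three steps as a pipeline (read top to bottom) ...
piped_all <- iris %>%
  filter(Species == "setosa") %>%
  select(c("Sepal.Length", "Petal.Length")) %>%
  mutate(Average.Length = (Sepal.Length + Petal.Length) / 2) %>%
  head()

# ... and as nested calls (read inside-out).
nested_all <- head(
  mutate(
    select(filter(iris, Species == "setosa"), c("Sepal.Length", "Petal.Length")),
    Average.Length = (Sepal.Length + Petal.Length) / 2
  )
)

all.equal(piped_all, nested_all)
```

The nested form avoids intermediate variables too, but it has to be read from the innermost call outward, which is the main readability argument for the pipe.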

group_by() and summarise()

Sometimes you are interested in applying a function separately across each group, rather than to the dataframe as a whole.

  • Let’s say you want to find the mean of each group.
iris %>%
  group_by(Species) %>%
  summarise(mean_sepal_length = mean(Sepal.Length))
# A tibble: 3 × 2
  Species    mean_sepal_length
  <fct>                  <dbl>
1 setosa                  5.01
2 versicolor              5.94
3 virginica               6.59
  • group_by chooses which column to group by (typically a categorical variable)

  • summarise defines which operation to apply to which column (in this case, taking the mean of sepal length and storing it in a new column)
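
summarise() can compute several statistics per group in one call. The sketch below adds a row count and a standard deviation, and, in keeping with the base R comparisons earlier, shows one possible base R equivalent for the group means using aggregate(). The object names `by_species` and `base_means` are illustrative.

```r
library(dplyr)

by_species <- iris %>%
  group_by(Species) %>%
  summarise(
    n                 = n(),                 # rows per group
    mean_sepal_length = mean(Sepal.Length),
    sd_sepal_length   = sd(Sepal.Length)
  )
by_species

# Base R equivalent for the group means.
base_means <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
base_means
```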