Data “Wrangling”

Often put off until later…
But as graduate students dealing with data, need this sooner
See Chap 13 of Mosaic Student Guide (in Google Drive)
and R for Data Science (Chap 5, and more): http://r4ds.had.co.nz/

see what data is like

Can view in editor (but not edit) by clicking on it in environment; can also click on arrow to see structure.

Other helpful commands:

d <- read_csv("data04-ex1.csv")
d

## # A tibble: 6 x 2
##   Group Response
##   <dbl>    <dbl>
## 1     1      3.5
## 2     1      3.7
## 3     1      3.8
## 4     2      3.6
## 5     2      3.2
## 6     2      2.9

str(d)

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 6 obs. of  2 variables:
##  $ Group   : num  1 1 1 2 2 2
##  $ Response: num  3.5 3.7 3.8 3.6 3.2 2.9
##  - attr(*, "spec")=
##   .. cols(
##   ..   Group = col_double(),
##   ..   Response = col_double()
##   .. )

inspect(d)

## 
## quantitative variables:  
##       name   class min    Q1 median    Q3 max mean        sd n missing
## 1    Group numeric 1.0 1.000   1.50 2.000 2.0 1.50 0.5477226 6       0
## 2 Response numeric 2.9 3.275   3.55 3.675 3.8 3.45 0.3391165 6       0

head(d)

## # A tibble: 6 x 2
##   Group Response
##   <dbl>    <dbl>
## 1     1      3.5
## 2     1      3.7
## 3     1      3.8
## 4     2      3.6
## 5     2      3.2
## 6     2      2.9

tail(d)

## # A tibble: 6 x 2
##   Group Response
##   <dbl>    <dbl>
## 1     1      3.5
## 2     1      3.7
## 3     1      3.8
## 4     2      3.6
## 5     2      3.2
## 6     2      2.9

factors

A factor is a categorical variable; basically, a text variable where the levels are defined and given an order.

Here I’ve coded Group as a numeric, 1 or 2; I need to make Group a factor!

Here I assign them to new variable names for demonstration; I could use the same name as well.

d$Group_v2 <- factor(d$Group)
d$Group_v3 <- factor(d$Group, levels=c(1,2), labels=c("Trt", "Con"))
inspect(d)

## 
## categorical variables:  
##       name  class levels n missing
## 1 Group_v2 factor      2 6       0
## 2 Group_v3 factor      2 6       0
##                                    distribution
## 1 1 (50%), 2 (50%)                             
## 2 Trt (50%), Con (50%)                         
## 
## quantitative variables:  
##       name   class min    Q1 median    Q3 max mean        sd n missing
## 1    Group numeric 1.0 1.000   1.50 2.000 2.0 1.50 0.5477226 6       0
## 2 Response numeric 2.9 3.275   3.55 3.675 3.8 3.45 0.3391165 6       0

Using tidyverse “verbs”

A better way is to use functions that specify the data set, rather than the $.

Back to the original version of the data:

d <- read_csv("data04-ex1.csv")
d

## # A tibble: 6 x 2
##   Group Response
##   <dbl>    <dbl>
## 1     1      3.5
## 2     1      3.7
## 3     1      3.8
## 4     2      3.6
## 5     2      3.2
## 6     2      2.9

mutate

mutate adds (or changes) variables in the data set, using existing variables

d <- mutate(d, 
            Group_v2=factor(Group),
            Group_v3=factor(Group, levels=c(1,2), labels=c("Trt", "Con"))
            )
inspect(d)

## 
## categorical variables:  
##       name  class levels n missing
## 1 Group_v2 factor      2 6       0
## 2 Group_v3 factor      2 6       0
##                                    distribution
## 1 1 (50%), 2 (50%)                             
## 2 Trt (50%), Con (50%)                         
## 
## quantitative variables:  
##       name   class min    Q1 median    Q3 max mean        sd n missing
## 1    Group numeric 1.0 1.000   1.50 2.000 2.0 1.50 0.5477226 6       0
## 2 Response numeric 2.9 3.275   3.55 3.675 3.8 3.45 0.3391165 6       0

could log transform (log, log10, log2)

d <- mutate(d, logResponse=log2(Response))
d

## # A tibble: 6 x 5
##   Group Response Group_v2 Group_v3 logResponse
##   <dbl>    <dbl> <fct>    <fct>          <dbl>
## 1     1      3.5 1        Trt             1.81
## 2     1      3.7 1        Trt             1.89
## 3     1      3.8 1        Trt             1.93
## 4     2      3.6 2        Con             1.85
## 5     2      3.2 2        Con             1.68
## 6     2      2.9 2        Con             1.54

select

select(d, Group, Response)

## # A tibble: 6 x 2
##   Group Response
##   <dbl>    <dbl>
## 1     1      3.5
## 2     1      3.7
## 3     1      3.8
## 4     2      3.6
## 5     2      3.2
## 6     2      2.9

select(d, -Response, -Group)

## # A tibble: 6 x 3
##   Group_v2 Group_v3 logResponse
##   <fct>    <fct>          <dbl>
## 1 1        Trt             1.81
## 2 1        Trt             1.89
## 3 1        Trt             1.93
## 4 2        Con             1.85
## 5 2        Con             1.68
## 6 2        Con             1.54

filter

test for equality: ==, !=, %in%
test numeric: <, <=, >, >=
combine with AND/OR: & |

filter(d, Group == 1)

## # A tibble: 3 x 5
##   Group Response Group_v2 Group_v3 logResponse
##   <dbl>    <dbl> <fct>    <fct>          <dbl>
## 1     1      3.5 1        Trt             1.81
## 2     1      3.7 1        Trt             1.89
## 3     1      3.8 1        Trt             1.93

filter(d, Group != 1)

## # A tibble: 3 x 5
##   Group Response Group_v2 Group_v3 logResponse
##   <dbl>    <dbl> <fct>    <fct>          <dbl>
## 1     2      3.6 2        Con             1.85
## 2     2      3.2 2        Con             1.68
## 3     2      2.9 2        Con             1.54

filter(d, Group %in% c(1, 2))

## # A tibble: 6 x 5
##   Group Response Group_v2 Group_v3 logResponse
##   <dbl>    <dbl> <fct>    <fct>          <dbl>
## 1     1      3.5 1        Trt             1.81
## 2     1      3.7 1        Trt             1.89
## 3     1      3.8 1        Trt             1.93
## 4     2      3.6 2        Con             1.85
## 5     2      3.2 2        Con             1.68
## 6     2      2.9 2        Con             1.54

rename

rename(d, Treatment = "Group")

## # A tibble: 6 x 5
##   Treatment Response Group_v2 Group_v3 logResponse
##       <dbl>    <dbl> <fct>    <fct>          <dbl>
## 1         1      3.5 1        Trt             1.81
## 2         1      3.7 1        Trt             1.89
## 3         1      3.8 1        Trt             1.93
## 4         2      3.6 2        Con             1.85
## 5         2      3.2 2        Con             1.68
## 6         2      2.9 2        Con             1.54

arrange

arrange(d, Response)

## # A tibble: 6 x 5
##   Group Response Group_v2 Group_v3 logResponse
##   <dbl>    <dbl> <fct>    <fct>          <dbl>
## 1     2      2.9 2        Con             1.54
## 2     2      3.2 2        Con             1.68
## 3     1      3.5 1        Trt             1.81
## 4     2      3.6 2        Con             1.85
## 5     1      3.7 1        Trt             1.89
## 6     1      3.8 1        Trt             1.93

arrange(d, Group, Response)

## # A tibble: 6 x 5
##   Group Response Group_v2 Group_v3 logResponse
##   <dbl>    <dbl> <fct>    <fct>          <dbl>
## 1     1      3.5 1        Trt             1.81
## 2     1      3.7 1        Trt             1.89
## 3     1      3.8 1        Trt             1.93
## 4     2      2.9 2        Con             1.54
## 5     2      3.2 2        Con             1.68
## 6     2      3.6 2        Con             1.85

piping

These two commands are equivalent.

mutate(d, logResponse=log10(Response))

## # A tibble: 6 x 5
##   Group Response Group_v2 Group_v3 logResponse
##   <dbl>    <dbl> <fct>    <fct>          <dbl>
## 1     1      3.5 1        Trt            0.544
## 2     1      3.7 1        Trt            0.568
## 3     1      3.8 1        Trt            0.580
## 4     2      3.6 2        Con            0.556
## 5     2      3.2 2        Con            0.505
## 6     2      2.9 2        Con            0.462

d %>% mutate(logResponse=log10(Response))

## # A tibble: 6 x 5
##   Group Response Group_v2 Group_v3 logResponse
##   <dbl>    <dbl> <fct>    <fct>          <dbl>
## 1     1      3.5 1        Trt            0.544
## 2     1      3.7 1        Trt            0.568
## 3     1      3.8 1        Trt            0.580
## 4     2      3.6 2        Con            0.556
## 5     2      3.2 2        Con            0.505
## 6     2      2.9 2        Con            0.462

This lets us chain together verbs for easier writing and reading.

d %>% mutate(logResponse=log10(Response)) %>%
  select(Group, logResponse) %>%
  filter(Group==1) %>%
  rename(Treatment="Group")

## # A tibble: 3 x 2
##   Treatment logResponse
##       <dbl>       <dbl>
## 1         1       0.544
## 2         1       0.568
## 3         1       0.580

two more verbs: group_by and summarize

d %>% group_by(Group) %>% 
  summarize(m=mean(Response),
            s=sd(Response))

## # A tibble: 2 x 3
##   Group     m     s
##   <dbl> <dbl> <dbl>
## 1     1  3.67 0.153
## 2     2  3.23 0.351

Can use these to plot mean +/- SD

dx <- d %>% mutate(Group=factor(Group)) %>% 
  group_by(Group) %>% 
  summarize(m = mean(Response),
            s = sd(Response)) %>%
  mutate(lwr = m - s, upr = m + s)
gf_pointrange(m + lwr + upr ~ Group, data=dx)

Or with error bars: piping lets us add points for mean (at larger size), error bars, and all data points to same plot

gf_point(m ~ Group, data=dx, ylab="mean (SD)", size=3) %>%
  gf_errorbar(lwr + upr ~ Group, data=dx, width=0.5) %>%
  gf_point(Response ~ Group, data=d)

and reshaping

Some more generic data; this has four columns, representing two variables (a and b) measured at two times (1 and 2)

d <- read_csv("data04-ex2.csv")
d

## # A tibble: 10 x 5
##       id   a_1   a_2   b_1   b_2
##    <dbl> <dbl> <dbl> <dbl> <dbl>
##  1     1     0     7     6     4
##  2     2     7     1     6     6
##  3     3    -1     0     7    10
##  4     4     3     3     5     7
##  5     5     8     0     5     7
##  6     6     1     3     2     6
##  7     7     2     1     7     0
##  8     8     1    -4     7    -3
##  9     9     2     4     1    -2
## 10    10     3     2     0     3

gather goes from wide to long

Takes a selection of variables (either by inclusion or exclusion) and “stacks” the values of those variables, with the result of two columns, one with the name of the variable it came from and the other with the value.

Here “variable” is the new variable name for the name of the variable, and “value” is the new variable for the values; you can change these. The other parameters are the selection of the variables to “gather”; first by exclusion (everything but id), then by inclusion; they have the same result.

gather(d, "variable", "value", -id)

gather(d, "variable", "value", a_1, a_2, b_1, b_2)

## # A tibble: 40 x 3
##       id variable value
##    <dbl> <chr>    <dbl>
##  1     1 a_1          0
##  2     2 a_1          7
##  3     3 a_1         -1
##  4     4 a_1          3
##  5     5 a_1          8
##  6     6 a_1          1
##  7     7 a_1          2
##  8     8 a_1          1
##  9     9 a_1          2
## 10    10 a_1          3
## # … with 30 more rows

separate

So now we need to separate the “variable” variable into the two parts, a/b and 1/2. This separate verb does that, here I separate them into new variables called “var” and “time”. By default, it separates by non-alphanumeric.

dx <- gather(d, "variable", "value", -id) %>%
  separate(variable, c("var", "time"))
dx

## # A tibble: 40 x 4
##       id var   time  value
##    <dbl> <chr> <chr> <dbl>
##  1     1 a     1         0
##  2     2 a     1         7
##  3     3 a     1        -1
##  4     4 a     1         3
##  5     5 a     1         8
##  6     6 a     1         1
##  7     7 a     1         2
##  8     8 a     1         1
##  9     9 a     1         2
## 10    10 a     1         3
## # … with 30 more rows

“facetting” by multiple variables

Also see use of the bins parameter; binwidth may also be of interest.

gf_histogram(~value | var ~ time, data=dx, bins=5)

spread goes from “long” to “wide”

Can reverse the “gather” procedure by “spreading”, like this

dx2 <- dx %>% spread(var, value)
dx2

## # A tibble: 20 x 4
##       id time      a     b
##    <dbl> <chr> <dbl> <dbl>
##  1     1 1         0     6
##  2     1 2         7     4
##  3     2 1         7     6
##  4     2 2         1     6
##  5     3 1        -1     7
##  6     3 2         0    10
##  7     4 1         3     5
##  8     4 2         3     7
##  9     5 1         8     5
## 10     5 2         0     7
## 11     6 1         1     2
## 12     6 2         3     6
## 13     7 1         2     7
## 14     7 2         1     0
## 15     8 1         1     7
## 16     8 2        -4    -3
## 17     9 1         2     1
## 18     9 2         4    -2
## 19    10 1         3     0
## 20    10 2         2     3

then could scatterplot and separate by color

gf_point(a ~ b, color = ~time, data=dx2, title="By color")

Data Wrangling

Aaron Rendahl, PhD

Fall 2019