Our primary focus in this Section is on the Tidy phase.
The important functions come from the tidyr package:
pivot_longer()
pivot_wider()
Create this data table for an imaginary study on the effects of two days of fasting.
Fasting <- data.frame(name = c("April", "Bishma", "Carl"), before = c(120, 175, 182), after = c(115, 167, 178))
Fasting
name | before | after |
---|---|---|
April | 120 | 115 |
Bishma | 175 | 167 |
Carl | 182 | 178 |
Fasting is in “wide” form: few cases, more variables for each case.
This is convenient for some purposes, for example:
How much did the weights change?
Fasting %>% mutate(diff = after - before)
##     name before after diff
## 1  April    120   115   -5
## 2 Bishma    175   167   -8
## 3   Carl    182   178   -4
A graph that connects each person’s “before” weight to their “after” weight is difficult to make from the wide data table.
We will use the tidyr package (loaded when we attach the tidyverse).
Use tidyr’s pivot_longer() data verb:
Fasting_narrow <- Fasting %>% pivot_longer(cols = -name, names_to = "when", values_to = "weight")
## # A tibble: 6 × 3
##   name   when   weight
##   <chr>  <chr>   <dbl>
## 1 April  before    120
## 2 April  after     115
## 3 Bishma before    175
## 4 Bishma after     167
## 5 Carl   before    182
## 6 Carl   after     178
In pivot_longer(cols = -name, names_to = "when", values_to = "weight"):
data = Fasting (provided by %>% in our code)
cols tells you which columns to gather together
names_to says what to call the column that will say whether the weight is a “before” or “after” weight
values_to says what to call the column that will hold the weights
You can also list out the columns you want to gather:
Fasting %>% pivot_longer( cols = c(before, after), names_to = "when", values_to = "weight" )
## # A tibble: 6 × 3
##   name   when   weight
##   <chr>  <chr>   <dbl>
## 1 April  before    120
## 2 April  after     115
## 3 Bishma before    175
## 4 Bishma after     167
## 5 Carl   before    182
## 6 Carl   after     178
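As a quick check that the two column selections shown above really pick out the same columns, the sketch below builds the narrow table both ways and compares the results. It also tries a third selection, the range before:after, which is not used in these notes; it is included here only as an illustration of tidyselect range syntax.

```r
library(tidyr)

Fasting <- data.frame(
  name = c("April", "Bishma", "Carl"),
  before = c(120, 175, 182),
  after = c(115, 167, 178)
)

# Select the columns to gather by excluding name
by_exclusion <- pivot_longer(Fasting, cols = -name,
                             names_to = "when", values_to = "weight")

# Select the columns to gather by listing them
by_listing <- pivot_longer(Fasting, cols = c(before, after),
                           names_to = "when", values_to = "weight")

# Select the columns to gather as a range (not used elsewhere in these notes)
by_range <- pivot_longer(Fasting, cols = before:after,
                         names_to = "when", values_to = "weight")

# Compare the three results
all.equal(by_exclusion, by_listing)
all.equal(by_listing, by_range)
```

Excluding name is the shortest choice here, but listing the columns is safer if the table might later gain other columns that should not be gathered.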
Fasting_narrow is in “narrow” form: there are more cases.
Each case is now a single act of weighing a person.
ggplot(Fasting_narrow, aes(x = when, y = weight)) + geom_point(aes(color = name)) + geom_line(aes(color = name))
Fasting_narrow %>% str()
## tibble [6 × 3] (S3: tbl_df/tbl/data.frame)
##  $ name  : chr [1:6] "April" "April" "Bishma" "Bishma" ...
##  $ when  : chr [1:6] "before" "after" "before" "after" ...
##  $ weight: num [1:6] 120 115 175 167 182 178
when is a categorical variable.
geom_line() does not know which groups to connect.
ggplot() accepts a “group” aesthetic. Set it to name.
ggplot(Fasting_narrow, aes(x = when, y = weight, group = name)) + geom_point(aes(color = name)) + geom_line(aes(color = name))
We need better labels for the axes.
ggplot(Fasting_narrow, aes(x = when, y = weight, group = name)) + geom_point(aes(color = name)) + geom_line(aes(color = name)) + labs(x = "time when weight was recorded", y = "weight (pounds)")
Just so you know that it can be done …
ggplot(Fasting_narrow, aes(x = when, y = weight, group = name)) + geom_point(aes(color = name)) + geom_line(aes(color = name)) + labs(x = "time when weight was recorded", y = "weight (pounds)") + coord_flip()
Spreading converts data from narrow to wide form.
Fasting_narrow %>% pivot_wider(names_from = when, values_from = weight)
## # A tibble: 3 × 3
##   name   before after
##   <chr>   <dbl> <dbl>
## 1 April     120   115
## 2 Bishma    175   167
## 3 Carl      182   178
This takes you back to the original Fasting data.
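One way to convince yourself of the round trip is to pivot longer, pivot wider again, and compare with the table you started from. This is a minimal sketch; the name round_trip is made up for the illustration, and the conversion with as.data.frame() is needed only because pivot_wider() returns a tibble while Fasting was created as a plain data frame.

```r
library(dplyr)
library(tidyr)

Fasting <- data.frame(
  name = c("April", "Bishma", "Carl"),
  before = c(120, 175, 182),
  after = c(115, 167, 178)
)

# Wide -> narrow -> wide again
round_trip <- Fasting %>%
  pivot_longer(cols = -name, names_to = "when", values_to = "weight") %>%
  pivot_wider(names_from = when, values_from = weight)

# Convert the tibble to a data frame before comparing with the original
all.equal(as.data.frame(round_trip), Fasting)
```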
Sometimes you want to help the reader understand what the new variables mean. You can alter their names a bit with the names_prefix parameter:
Fasting_narrow %>% pivot_wider( names_from = when, values_from = weight, names_prefix = "weight_" )
## # A tibble: 3 × 3
##   name   weight_before weight_after
##   <chr>          <dbl>        <dbl>
## 1 April            120          115
## 2 Bishma           175          167
## 3 Carl             182          178
babynames Again
Research Question:
Which common babynames since the year 2000 are the most gender-neutral?
First, let’s total the babies born since 2000 for each name and sex. We will restrict this to names common among both sexes further on.
RecentBabies <- babynames %>% filter(year >= 2000) %>% group_by(name, sex) %>% summarise(total = sum(n))
name | sex | total |
---|---|---|
Aaban | M | 107 |
Aabha | F | 35 |
Aabid | M | 10 |
Aabir | M | 5 |
Aabriella | F | 32 |
It would be easier to have male and female counts side-by-side.
RecentBabiesWide <- RecentBabies %>% pivot_wider(names_from = sex, values_from = total)
name | M | F |
---|---|---|
Aaban | 107 | NA |
Aabha | NA | 35 |
Aabid | 10 | NA |
Aabir | 5 | NA |
Aabriella | NA | 32 |
We get NA when there were no babies for a given name and sex. We would prefer to have counts of 0.
Use the values_fill parameter:
RecentBabiesWide <- RecentBabies %>% pivot_wider( names_from = sex, values_from = total, values_fill = list(total = 0) )
name | M | F |
---|---|---|
Aaban | 107 | 0 |
Aabha | 0 | 35 |
Aabid | 10 | 0 |
Aabir | 5 | 0 |
Aabriella | 0 | 32 |
Much better.
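To see values_fill at work on something small enough to check by eye, here is a sketch with an invented table. The names and counts in Counts are made up for the illustration and are not taken from babynames.

```r
library(dplyr)
library(tidyr)

# Invented counts: "Ash" has no row for sex F, and "Bea" has no row for sex M
Counts <- data.frame(
  name = c("Ash", "Bea", "Cal", "Cal"),
  sex = c("M", "F", "M", "F"),
  total = c(12, 7, 30, 25)
)

# Without values_fill, the missing name/sex combinations become NA
CountsWithNA <- Counts %>%
  pivot_wider(names_from = sex, values_from = total)

# With values_fill, the same combinations become 0
CountsWide <- Counts %>%
  pivot_wider(
    names_from = sex,
    values_from = total,
    values_fill = list(total = 0)
  )

CountsWithNA
CountsWide
```

Filling with 0 is reasonable here because a missing row really does mean "no babies counted". For other data, a missing row might mean "not measured", and NA would be the more honest value.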
Let’s consider only names given to more than 1000 babies of each sex.
Common <- RecentBabiesWide %>% filter(M > 1000, F > 1000)
For each name, take the smaller of the two ratios of counts:
\[\min\left(\frac{M}{F}, \frac{F}{M}\right)\]
The closer this is to 1, the more gender-neutral the name is!
Leslie <- Common %>% filter(name == "Leslie")
name | M | F |
---|---|---|
Leslie | 1359 | 39613 |
Use R to compute \(\min(M/F, F/M)\). What do you get?
Robin <- Common %>% filter(name == "Robin")
name | M | F |
---|---|---|
Robin | 2466 | 4223 |
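Here is one way to do the computation for these two names, using the counts from the Leslie and Robin tables above. The variable names are made up for the illustration.

```r
# Counts copied from the Leslie table
leslie_M <- 1359
leslie_F <- 39613

# Counts copied from the Robin table
robin_M <- 2466
robin_F <- 4223

# The smaller of the two ratios for each name
leslie_measure <- min(leslie_M / leslie_F, leslie_F / leslie_M)
robin_measure <- min(robin_M / robin_F, robin_F / robin_M)

leslie_measure
robin_measure
```

Leslie comes out at roughly 0.03 and Robin at roughly 0.58, so by this measure Robin is much closer to gender-neutral than Leslie.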
Common %>% mutate(gnMeasure = pmin(M / F, F / M)) %>% arrange(desc(gnMeasure))
name | M | F | gnMeasure |
---|---|---|---|
Gentry | 1224 | 1215 | 0.9926471 |
Justice | 11267 | 10947 | 0.9715985 |
Baby | 1639 | 1573 | 0.9597315 |
Jules | 1166 | 1215 | 0.9596708 |
Marion | 2092 | 2189 | 0.9556875 |
pmin()
numbers1 <- c(8, 10, 3, -2, 0)
numbers2 <- c(7, 11, 3, -5, 4)
pmin(numbers1, numbers2)
## [1] 7 10 3 -5 0
pmin() finds the minimum value of each corresponding pair of numbers.
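It may help to contrast pmin() with the ordinary min() on the same two vectors. In this sketch the names overall and pairwise are made up for the illustration.

```r
numbers1 <- c(8, 10, 3, -2, 0)
numbers2 <- c(7, 11, 3, -5, 4)

# min() pools all ten values and returns a single number
overall <- min(numbers1, numbers2)

# pmin() compares position by position and returns one number per pair
pairwise <- pmin(numbers1, numbers2)

overall
pairwise
```

Inside mutate() we want one value per row, which is why the gender-neutrality measure is computed with pmin() rather than min().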
M / F gave the male-to-female ratio of counts, for each name.
F / M gave the female-to-male ratio of counts, for each name.
pmin(M / F, F / M) found the minimum ratio for each name.
Here’s a more detailed look:
Common %>% mutate(MtoF = M / F, FtoM = F / M) %>% mutate(gnMeasure = pmin(MtoF, FtoM)) %>% select(name, MtoF, FtoM, gnMeasure)
name | MtoF | FtoM | gnMeasure |
---|---|---|---|
Addison | 0.0451761 | 22.1355765 | 0.0451761 |
Adrian | 43.2266523 | 0.0231339 | 0.0231339 |
Aidan | 52.8602784 | 0.0189178 | 0.0189178 |
Aiden | 114.3429923 | 0.0087456 | 0.0087456 |
Alex | 26.3813927 | 0.0379055 | 0.0379055 |
Alexis | 0.2250332 | 4.4437892 | 0.2250332 |
Ali | 3.4644784 | 0.2886437 | 0.2886437 |
Fasting
##     name before after
## 1  April    120   115
## 2 Bishma    175   167
## 3   Carl    182   178
This hypothetical study was a repeated-measure study. Each subject was weighed more than once.
Narrow format is especially convenient when working with repeated-measure data.
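One reason is that an extra measurement adds rows rather than columns, so code written for the narrow table keeps working. The sketch below assumes a hypothetical third weighing called "followup". Its weights are invented for the illustration and are not part of the fasting example.

```r
library(dplyr)

# Narrow table with an invented third weighing per person
Fasting_narrow3 <- data.frame(
  name = rep(c("April", "Bishma", "Carl"), each = 3),
  when = rep(c("before", "after", "followup"), times = 3),
  weight = c(120, 115, 118, 175, 167, 170, 182, 178, 180)
)

# The same grouped summary works however many weighings each person has
PerPerson <- Fasting_narrow3 %>%
  group_by(name) %>%
  summarise(
    weighings = n(),
    lowest = min(weight),
    highest = max(weight)
  )

PerPerson
```

In wide form, the third weighing would be a new column, and any code that named before and after explicitly would have to be edited.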
Access the labels data from the tigerstats package and learn about it:
data("labels", package = "tigerstats")
?tigerstats::labels
Research Questions:
head(labels, n = 6)
##   jiffrating greatvaluerating    sex
## 1          8                5 female
## 2         10                7 female
## 3          8                6 female
## 4          7                5 female
## 5          9                5 female
## 6          8                9 female
labels_narrow <- labels %>% pivot_longer(cols = c(jiffrating, greatvaluerating), names_to = "which", values_to = "rating") %>% mutate(better = recode(which, 'jiffrating' = 'jiff', 'greatvaluerating' = 'greatvalue'))
sex | which | rating | better |
---|---|---|---|
female | jiffrating | 8 | jiff |
female | greatvaluerating | 5 | greatvalue |
female | jiffrating | 10 | jiff |
female | greatvaluerating | 7 | greatvalue |
female | jiffrating | 8 | jiff |
Oops! No way to tell which two cases go with a given person.
Add an “id” variable:
labels_narrow <- labels %>% mutate(id = 1:nrow(labels)) %>% pivot_longer(cols = c(jiffrating, greatvaluerating), names_to = "which", values_to = "rating") %>% mutate(better = recode(which, 'jiffrating' = 'jiff', 'greatvaluerating' = 'greatvalue')) %>% arrange(id)
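A quick sanity check on the id variable is that every id appears exactly twice, once for each jar label. The sketch below uses a three-row stand-in, labels_small, copied from the first rows printed by head(labels), so that it runs without the tigerstats package. The stand-in and the use of row_number() in place of 1:nrow(labels) are choices made for this illustration.

```r
library(dplyr)
library(tidyr)

# Stand-in for tigerstats::labels: the first three rows shown by head(labels)
labels_small <- data.frame(
  jiffrating = c(8, 10, 8),
  greatvaluerating = c(5, 7, 6),
  sex = c("female", "female", "female")
)

labels_small_narrow <- labels_small %>%
  mutate(id = row_number()) %>%
  pivot_longer(
    cols = c(jiffrating, greatvaluerating),
    names_to = "which",
    values_to = "rating"
  )

# Count the rows for each id: there should be two per person
IdCounts <- labels_small_narrow %>% count(id)
IdCounts
```

The id must be added before pivoting. Once the rows have been gathered, there is no reliable way to tell which two ratings came from the same person.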
Replace the VARs below with the choices that will produce the target graph.
ggplot(labels_narrow, aes(x = VAR1, y = VAR2, group = VAR3)) + geom_point() + geom_line() + facet_grid(. ~ VAR4) + labs(x = "label on jar")
?ChickWeight
View(ChickWeight)
Research Question:
Which diet produces the most growth, over time?
Mean weights of each diet-group, over time.
Fill in VARs to make the target graph.
ChickWeight %>% group_by(VAR1, VAR2) %>% summarize(meanWeight = mean(VAR3, na.rm = TRUE)) %>% ggplot(aes(x = VAR4, y = VAR5)) + geom_point(aes(color = VAR6), size = 3) + geom_line(aes(color = VAR7))