Our primary focus in this Section is on the Tidy phase.
The important functions come from the tidyr package:
- pivot_longer()
- pivot_wider()

Create this data table for an imaginary study on the effects of two days of fasting.
Fasting <- data.frame(name = c("April", "Bishma", "Carl"),
before = c(120, 175, 182),
after = c(115, 167, 178))
Fasting Table

| name | before | after |
|---|---|---|
| April | 120 | 115 |
| Bishma | 175 | 167 |
| Carl | 182 | 178 |
Fasting is in “wide” form: few cases, more variables for each case.
This is convenient for some purposes, for example:
How much did the weights change?
Fasting %>% mutate(diff = after - before)
##     name before after diff
## 1  April    120   115   -5
## 2 Bishma    175   167   -8
## 3   Carl    182   178   -4
This graph is difficult to make from the wide data table.
We will use the tidyr package (loaded when we attach the tidyverse).
Use tidyr’s pivot_longer() data verb:
Fasting_narrow <- Fasting %>% pivot_longer(cols = -name, names_to = "when", values_to = "weight")
## # A tibble: 6 × 3
##   name   when   weight
##   <chr>  <chr>   <dbl>
## 1 April  before    120
## 2 April  after     115
## 3 Bishma before    175
## 4 Bishma after     167
## 5 Carl   before    182
## 6 Carl   after     178
In pivot_longer(cols = -name, names_to = "when", values_to = "weight"):
- data = Fasting (provided by %>% in our code)
- cols tells you which columns to gather together
- names_to says what to call the column that will say whether the weight is a “before” or “after” weight
- values_to says what to call the column that will hold the weights

You can also list out the columns you want to gather:
Fasting %>%
pivot_longer(
cols = c(before, after),
names_to = "when",
values_to = "weight"
)
## # A tibble: 6 × 3
##   name   when   weight
##   <chr>  <chr>   <dbl>
## 1 April  before    120
## 2 April  after     115
## 3 Bishma before    175
## 4 Bishma after     167
## 5 Carl   before    182
## 6 Carl   after     178
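As a quick check that the two ways of choosing columns agree, here is a minimal sketch that rebuilds Fasting and compares the results. The names Narrow_excluding and Narrow_listing are introduced here for illustration only.

```r
library(dplyr)
library(tidyr)

Fasting <- data.frame(name = c("April", "Bishma", "Carl"),
                      before = c(120, 175, 182),
                      after = c(115, 167, 178))

# Gather every column except name
Narrow_excluding <- Fasting %>%
  pivot_longer(cols = -name, names_to = "when", values_to = "weight")

# Gather the columns that are listed by name
Narrow_listing <- Fasting %>%
  pivot_longer(cols = c(before, after), names_to = "when", values_to = "weight")

# all.equal() reports TRUE when the two tables match
all.equal(Narrow_excluding, Narrow_listing)
```

Excluding columns is handy when there are many columns to gather; listing them is clearer when there are only a few.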
Fasting_narrow is in “narrow” form: there are more cases.
Each case is now a single act of weighing a person.
ggplot(Fasting_narrow, aes(x = when, y = weight)) +
geom_point(aes(color = name)) +
geom_line(aes(color = name))
Fasting_narrow %>% str()
## tibble [6 × 3] (S3: tbl_df/tbl/data.frame)
##  $ name  : chr [1:6] "April" "April" "Bishma" "Bishma" ...
##  $ when  : chr [1:6] "before" "after" "before" "after" ...
##  $ weight: num [1:6] 120 115 175 167 182 178
- when is a categorical variable.
- geom_line() does not know which groups to connect.
- ggplot() accepts a “group” aesthetic. Set it to name.
ggplot(Fasting_narrow, aes(x = when, y = weight, group = name)) +
geom_point(aes(color = name)) +
geom_line(aes(color = name))
We need better labels for the axes.
ggplot(Fasting_narrow, aes(x = when, y = weight, group = name)) +
geom_point(aes(color = name)) +
geom_line(aes(color = name)) +
labs(x = "time when weight was recorded",
y = "weight (pounds)")
Just so you know that it can be done …
ggplot(Fasting_narrow, aes(x = when, y = weight, group = name)) +
geom_point(aes(color = name)) +
geom_line(aes(color = name)) +
labs(x = "time when weight was recorded",
y = "weight (pounds)") +
coord_flip()
Spreading converts data from narrow to wide form.
Fasting_narrow %>% pivot_wider(names_from = when, values_from = weight)
## # A tibble: 3 × 3
##   name   before after
##   <chr>   <dbl> <dbl>
## 1 April     120   115
## 2 Bishma    175   167
## 3 Carl      182   178
This takes you back to the original Fasting data.
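The round trip can be checked directly. The sketch below is only an illustration: it rebuilds both tables and compares the widened result with the original. The name Fasting_wide is made up for this example.

```r
library(dplyr)
library(tidyr)

Fasting <- data.frame(name = c("April", "Bishma", "Carl"),
                      before = c(120, 175, 182),
                      after = c(115, 167, 178))

# Wide to narrow ...
Fasting_narrow <- Fasting %>%
  pivot_longer(cols = -name, names_to = "when", values_to = "weight")

# ... and back to wide again
Fasting_wide <- Fasting_narrow %>%
  pivot_wider(names_from = when, values_from = weight)

# pivot_wider() returns a tibble, so convert it before comparing
# with the original data frame
all.equal(as.data.frame(Fasting_wide), Fasting)
```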
Sometimes you want to help the reader understand what the new variables mean. You can alter their names a bit with the names_prefix parameter:
Fasting_narrow %>%
pivot_wider(
names_from = when,
values_from = weight,
names_prefix = "weight_"
)
## # A tibble: 3 × 3
##   name   weight_before weight_after
##   <chr>          <dbl>        <dbl>
## 1 April            120          115
## 2 Bishma           175          167
## 3 Carl             182          178
babynames Again

Research Question:
Which common babynames since the year 2000 are the most gender-neutral?
Let’s get the total number of babies born since 2000 for each name and sex. Later we will keep only the names that are common among both sexes.
RecentBabies <- babynames %>% filter(year >= 2000) %>% group_by(name, sex) %>% summarise(total = sum(n))
| name | sex | total |
|---|---|---|
| Aaban | M | 107 |
| Aabha | F | 35 |
| Aabid | M | 10 |
| Aabir | M | 5 |
| Aabriella | F | 32 |
It would be easier to have male and female counts side-by-side.
RecentBabiesWide <- RecentBabies %>% pivot_wider(names_from = sex, values_from = total)
| name | M | F |
|---|---|---|
| Aaban | 107 | NA |
| Aabha | NA | 35 |
| Aabid | 10 | NA |
| Aabir | 5 | NA |
| Aabriella | NA | 32 |
We get NA when there were no babies for a given name and sex. We would prefer to have counts of 0.
Use the values_fill parameter:
RecentBabiesWide <-
RecentBabies %>%
pivot_wider(
names_from = sex,
values_from = total,
values_fill = list(total = 0)
)
| name | M | F |
|---|---|---|
| Aaban | 107 | 0 |
| Aabha | 0 | 35 |
| Aabid | 10 | 0 |
| Aabir | 5 | 0 |
| Aabriella | 0 | 32 |
Much better.
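To see the effect of values_fill without loading babynames, here is a small sketch on made-up counts (the Counts table and its names are hypothetical). It also assumes a reasonably recent version of tidyr, where values_fill should accept a single value as well as a named list.

```r
library(dplyr)
library(tidyr)

# Hypothetical counts: "Ari" has no F row and "Bea" has no M row
Counts <- data.frame(name  = c("Ari", "Bea", "Cam", "Cam"),
                     sex   = c("M", "F", "M", "F"),
                     total = c(12, 30, 7, 9))

# By default, missing name/sex combinations become NA
CountsNA <- Counts %>%
  pivot_wider(names_from = sex, values_from = total)

# With values_fill, the same cells are filled with 0 instead
CountsWide <- Counts %>%
  pivot_wider(names_from = sex, values_from = total, values_fill = 0)

CountsWide
```

The named-list form used above, list(total = 0), is useful when several value columns need different fill values.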
Let’s consider only names where more than 1000 babies of each sex have that name.
Common <- RecentBabiesWide %>% filter(M > 1000, F > 1000)
\[\min(\frac{M}{F}, \frac{F}{M})\]
The closer this is to 1, the more gender-neutral the name is!
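To see how the measure behaves, take a hypothetical name (counts invented purely for illustration) given to 200 boys and 800 girls:

\[\min\left(\frac{200}{800}, \frac{800}{200}\right) = \min(0.25, 4) = 0.25\]

Swapping the two counts gives the same value, so the measure does not care which sex is more common. A name with equal counts for both sexes scores exactly 1.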
Leslie <- Common %>% filter(name == "Leslie")
| name | M | F |
|---|---|---|
| Leslie | 1359 | 39613 |
Use R to compute \(\min(M/F, F/M)\). What do you get?
Robin <- Common %>% filter(name == "Robin")
| name | M | F |
|---|---|---|
| Robin | 2466 | 4223 |
Common %>% mutate(gnMeasure = pmin(M / F, F / M)) %>% arrange(desc(gnMeasure))
| name | M | F | gnMeasure |
|---|---|---|---|
| Gentry | 1224 | 1215 | 0.9926471 |
| Justice | 11267 | 10947 | 0.9715985 |
| Baby | 1639 | 1573 | 0.9597315 |
| Jules | 1166 | 1215 | 0.9596708 |
| Marion | 2092 | 2189 | 0.9556875 |
pmin()

numbers1 <- c(8, 10, 3, -2, 0)
numbers2 <- c(7, 11, 3, -5, 4)
pmin(numbers1, numbers2)
## [1] 7 10 3 -5 0
pmin() finds the minimum value of each corresponding pair of numbers.
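It may help to contrast pmin() with the ordinary min(). The sketch below reuses the same two vectors. The names overall and pairwise are introduced only for this illustration.

```r
numbers1 <- c(8, 10, 3, -2, 0)
numbers2 <- c(7, 11, 3, -5, 4)

# min() pools all ten numbers and returns a single value
overall <- min(numbers1, numbers2)

# pmin() compares position by position and returns one value per pair
pairwise <- pmin(numbers1, numbers2)

length(overall)   # one value
length(pairwise)  # five values, one for each pair
```

In an ungrouped table, min() inside mutate() would put that one overall value in every row, which is why a row-by-row comparison calls for pmin().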
- M / F gave the male-to-female ratio of counts, for each name.
- F / M gave the female-to-male ratio of counts, for each name.
- pmin(M / F, F / M) found the minimum ratio for each name.

Here’s a more detailed look:
Common %>% mutate(MtoF = M / F, FtoM = F / M) %>% mutate(gnMeasure = pmin(MtoF, FtoM)) %>% select(name, MtoF, FtoM, gnMeasure)
| name | MtoF | FtoM | gnMeasure |
|---|---|---|---|
| Addison | 0.0451761 | 22.1355765 | 0.0451761 |
| Adrian | 43.2266523 | 0.0231339 | 0.0231339 |
| Aidan | 52.8602784 | 0.0189178 | 0.0189178 |
| Aiden | 114.3429923 | 0.0087456 | 0.0087456 |
| Alex | 26.3813927 | 0.0379055 | 0.0379055 |
| Alexis | 0.2250332 | 4.4437892 | 0.2250332 |
| Ali | 3.4644784 | 0.2886437 | 0.2886437 |
Fasting
##     name before after
## 1  April    120   115
## 2 Bishma    175   167
## 3   Carl    182   178
This hypothetical study was a repeated-measure study. Each subject was weighed more than once.
Narrow format is especially convenient when working with repeated-measure data.
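As one illustration of that convenience, a summary for each measurement occasion takes a single group_by() on the narrow table. This is a minimal sketch. The names MeanByWhen and mean_weight are made up for the example.

```r
library(dplyr)
library(tidyr)

Fasting <- data.frame(name = c("April", "Bishma", "Carl"),
                      before = c(120, 175, 182),
                      after = c(115, 167, 178))

Fasting_narrow <- Fasting %>%
  pivot_longer(cols = -name, names_to = "when", values_to = "weight")

# One row per measurement occasion, holding the mean weight at that occasion
MeanByWhen <- Fasting_narrow %>%
  group_by(when) %>%
  summarise(mean_weight = mean(weight))

MeanByWhen
```

The same code would work unchanged if the study added a third or fourth weighing, whereas the wide table would need a new column and new code for each one.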
Access the labels data from the tigerstats package and learn about it:
data("labels", package = "tigerstats")
?tigerstats::labels
Research Questions:
head(labels, n = 6)
##   jiffrating greatvaluerating    sex
## 1          8                5 female
## 2         10                7 female
## 3          8                6 female
## 4          7                5 female
## 5          9                5 female
## 6          8                9 female
labels_narrow <-
labels %>%
pivot_longer(cols = c(jiffrating, greatvaluerating),
names_to = "which",
values_to = "rating") %>%
mutate(better = recode(which, 'jiffrating' = 'jiff',
'greatvaluerating' = 'greatvalue'))
| sex | which | rating | better |
|---|---|---|---|
| female | jiffrating | 8 | jiff |
| female | greatvaluerating | 5 | greatvalue |
| female | jiffrating | 10 | jiff |
| female | greatvaluerating | 7 | greatvalue |
| female | jiffrating | 8 | jiff |
Oops! No way to tell which two cases go with a given person.
Add an “id” variable:
labels_narrow <-
labels %>%
mutate(id = 1:nrow(labels)) %>%
pivot_longer(cols = c(jiffrating, greatvaluerating),
names_to = "which",
values_to = "rating") %>%
mutate(better = recode(which, 'jiffrating' = 'jiff',
'greatvaluerating' = 'greatvalue')) %>%
arrange(id)
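Here is a small sketch of why the id matters, using made-up ratings rather than the labels data (Ratings, brandA and brandB are hypothetical, and row_number() is used as one alternative to 1:nrow()). Because every narrow row carries its id, the two ratings from one person can be matched up again.

```r
library(dplyr)
library(tidyr)

# Hypothetical ratings with no column that identifies the person
Ratings <- data.frame(brandA = c(8, 10, 7),
                      brandB = c(5, 7, 5))

# Number the people before pivoting, so each narrow row keeps its owner
Ratings_narrow <- Ratings %>%
  mutate(id = row_number()) %>%
  pivot_longer(cols = c(brandA, brandB),
               names_to = "which",
               values_to = "rating")

# With id present, the pairs can be put back side by side
Ratings_back <- Ratings_narrow %>%
  pivot_wider(names_from = which, values_from = rating)

Ratings_back
```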
Replace the VARs below with the choices that will produce the target graph.
ggplot(labels_narrow, aes(x = VAR1, y = VAR2, group = VAR3)) +
geom_point() +
geom_line() +
facet_grid(. ~ VAR4) +
labs(x = "label on jar")
?ChickWeight
View(ChickWeight)
Research Question:
Which diet produces the most growth, over time?
Mean weights of each diet-group, over time.
Fill in VARs to make the target graph.
ChickWeight %>%
group_by(VAR1, VAR2) %>%
summarize(meanWeight = mean(VAR3, na.rm = TRUE)) %>%
ggplot(aes(x = VAR4, y = VAR5)) +
geom_point(aes(color = VAR6), size = 3) +
geom_line(aes(color = VAR7))