Source file ⇒ lec11.Rmd
You can separate one column into multiple columns using the data verb separate()
iris_narrow %>% head()
Species | key | Value |
---|---|---|
setosa | Sepal.Length | 5.1 |
setosa | Sepal.Length | 4.9 |
setosa | Sepal.Length | 4.7 |
setosa | Sepal.Length | 4.6 |
setosa | Sepal.Length | 5.0 |
setosa | Sepal.Length | 5.4 |
iris_narrow %>% separate(key, into=c("Part", "Measure"), sep="\\.")
Species | Part | Measure | Value |
---|---|---|---|
setosa | Sepal | Length | 5.1 |
setosa | Sepal | Length | 4.9 |
setosa | Sepal | Length | 4.7 |
setosa | Sepal | Length | 4.6 |
setosa | Sepal | Length | 5.0 |
setosa | Sepal | Length | 5.4 |
Here is a data frame with a column x that we wish to split into three columns:
df <- data.frame(x=c("1-2-3", "a-b-c"), y=c(1,2))
df
x | y |
---|---|
1-2-3 | 1 |
a-b-c | 2 |
Do this using into=c("a","b","c") and sep="-"
df %>% separate(x,into=c("a","b","c"), sep="-")
a | b | c | y |
---|---|---|---|
1 | 2 | 3 | 1 |
a | b | c | 2 |
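The sep argument of separate() is interpreted as a regular expression, which is why the iris example escapes the dot as "\\." while the dash here needs no escaping. A minimal sketch of the escaped form, assuming the dplyr and tidyr packages are installed; the data frame df_dot and the result df_split are hypothetical names used only for illustration:

```r
library(dplyr)
library(tidyr)

# Hypothetical data frame whose x column uses "." as the delimiter
df_dot <- data.frame(x = c("1.2.3", "a.b.c"), y = c(1, 2),
                     stringsAsFactors = FALSE)

# "\\." matches a literal dot; an unescaped "." is a regex wildcard
# that matches any single character
df_split <- df_dot %>% separate(x, into = c("a", "b", "c"), sep = "\\.")
df_split
```

The new columns are character by default, even when the pieces look like numbers.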
A data table can be presented in wide or narrow format. Each has its own advantages.
Wide format makes it easier to compute the difference between the before and after values of a test for each patient.
BP_wide
subject | before | after |
---|---|---|
BHO | 120 | 160 |
GWB | 115 | 135 |
WJC | 105 | 145 |
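As a small illustration of that advantage, the before/after difference is a single column subtraction in wide format. This is a sketch, assuming the dplyr package is installed; it rebuilds BP_wide from the table above, and the names BP_change and change are hypothetical:

```r
library(dplyr)

# Rebuild BP_wide from the table shown above
BP_wide <- data.frame(
  subject = c("BHO", "GWB", "WJC"),
  before  = c(120, 115, 105),
  after   = c(160, 135, 145)
)

# Each patient is one row, so the change is a row-wise subtraction
BP_change <- BP_wide %>% mutate(change = after - before)
BP_change
```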
Narrow format makes it easier to add cases for a patient who is tested on additional days. A table in narrow format is sometimes called a tidy data table.
BP_narrow
subject | when | sbp |
---|---|---|
BHO | before | 160 |
GWB | before | 115 |
WJC | after | 145 |
GWB | after | 135 |
WJC | before | 105 |
BHO | after | 160 |
The data verbs spread() and gather() convert between these formats.
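gather() transforms BP_wide into BP_narrow. The key argument names the new variable in the narrow format that holds the gathered column names (here when); the value argument names the new variable that holds their values (here sbp).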
BP_narrow1 <- BP_wide %>%
gather(key= when, value = sbp, before, after)
BP_narrow1
subject | when | sbp |
---|---|---|
BHO | before | 120 |
GWB | before | 115 |
WJC | before | 105 |
BHO | after | 160 |
GWB | after | 135 |
WJC | after | 145 |
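spread() transforms BP_narrow into BP_wide. The key argument names the existing variable in the narrow format whose values become the new column names (here when); the value argument names the variable whose values fill those columns (here sbp).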
BP_wide1 <- BP_narrow %>%
spread(key= when, value = sbp)
BP_wide1
subject | after | before |
---|---|---|
BHO | 160 | 160 |
GWB | 135 | 115 |
WJC | 145 | 105 |
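Newer versions of tidyr (1.0.0 and later) also provide pivot_longer() and pivot_wider(), which perform the same kind of reshaping as gather() and spread(). The following is a sketch of the equivalent calls, not the approach used in this lecture; it rebuilds BP_wide from the table shown earlier, and the names BP_long and BP_back are hypothetical. The rows may come out in a different order than with gather().

```r
library(dplyr)
library(tidyr)  # pivot_longer() and pivot_wider() need tidyr >= 1.0.0

# Rebuild BP_wide from the table shown earlier
BP_wide <- data.frame(
  subject = c("BHO", "GWB", "WJC"),
  before  = c(120, 115, 105),
  after   = c(160, 135, 145)
)

# Counterpart of gather(key = when, value = sbp, before, after)
BP_long <- BP_wide %>%
  pivot_longer(cols = c(before, after), names_to = "when", values_to = "sbp")

# Counterpart of spread(key = when, value = sbp)
BP_back <- BP_long %>%
  pivot_wider(names_from = when, values_from = sbp)

BP_long
BP_back
```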
Is the following data set narrow or wide? Convert it to the other data table format.
Baby_narrow <- BabyNames %>%
filter(name == "Sue") %>%
group_by(name,sex) %>%
summarise(total=sum(count))
Baby_wide <- Baby_narrow %>% spread(key=sex, value= total )
Baby_wide
name | F | M |
---|---|---|
Sue | 144410 | 519 |
Baby_wide %>% gather(key=sex, value=value, F, M)
name | sex | value |
---|---|---|
Sue | F | 144410 |
Sue | M | 519 |
Note that a narrow table is tidy as we defined it on the first day of class. No values are stored in the column names, as they are in the wide format.
Let's examine the wide iris data table:
head(iris)
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa |
4.9 | 3.0 | 1.4 | 0.2 | setosa |
4.7 | 3.2 | 1.3 | 0.2 | setosa |
4.6 | 3.1 | 1.5 | 0.2 | setosa |
5.0 | 3.6 | 1.4 | 0.2 | setosa |
5.4 | 3.9 | 1.7 | 0.4 | setosa |
Suppose you want to make the following plot:
The data table iris isn't glyph-ready. Here is the glyph-ready table:
Species | Part | Measure | Value |
---|---|---|---|
setosa | Sepal | Length | 5.1 |
setosa | Sepal | Length | 4.9 |
setosa | Sepal | Length | 4.7 |
setosa | Sepal | Length | 4.6 |
setosa | Sepal | Length | 5.0 |
setosa | Sepal | Length | 5.4 |
step 1: Use the data verb gather()
iris_narrow <- iris %>%
  gather(key, Value, -Species) # here -Species means all columns except Species
iris_narrow %>% head()
Species | key | Value |
---|---|---|
setosa | Sepal.Length | 5.1 |
setosa | Sepal.Length | 4.9 |
setosa | Sepal.Length | 4.7 |
setosa | Sepal.Length | 4.6 |
setosa | Sepal.Length | 5.0 |
setosa | Sepal.Length | 5.4 |
step 2: Use the data verb separate()
iris_narrow_sep <- iris_narrow %>% separate(key, into=c("Part", "Measure"), sep="\\.")
head(iris_narrow_sep)
Species | Part | Measure | Value |
---|---|---|---|
setosa | Sepal | Length | 5.1 |
setosa | Sepal | Length | 4.9 |
setosa | Sepal | Length | 4.7 |
setosa | Sepal | Length | 4.6 |
setosa | Sepal | Length | 5.0 |
setosa | Sepal | Length | 5.4 |
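With the table in this form, each of Species, Part, and Measure can be mapped to a graphical property. The sketch below shows one possible plot, assuming the dplyr, tidyr, and ggplot2 packages are installed; it is not necessarily the plot the lecture had in mind, and the names iris_glyph and p are hypothetical.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Steps 1 and 2 in one pipeline, starting from the built-in iris data
iris_glyph <- datasets::iris %>%
  gather(key, Value, -Species) %>%
  separate(key, into = c("Part", "Measure"), sep = "\\.")

# One possible glyph-ready use: a boxplot per Species, filled by Part,
# with one panel for each Measure (Length, Width)
p <- ggplot(iris_glyph, aes(x = Species, y = Value, fill = Part)) +
  geom_boxplot() +
  facet_wrap(~ Measure)
p
```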
Aesthetics are properties of the graph that we map to a variable (for example, col=sex in the BabyNames data set).
Attributes are properties of the graph that we set equal to a fixed value (for example, col="red").
mtcars %>% ggplot(aes(x=wt,y=mpg)) + geom_point(aes(col=as.factor(cyl)))
mtcars %>% ggplot(aes(x=wt,y=mpg)) + geom_point(col="red")
Note: attributes don't have a legend since they take only a fixed value.
unloadNamespace('printr')
iris<- iris %>% mutate(Flower=1:nrow(iris))
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Flower
## 1 5.1 3.5 1.4 0.2 setosa 1
## 2 4.9 3.0 1.4 0.2 setosa 2
## 3 4.7 3.2 1.3 0.2 setosa 3
## 4 4.6 3.1 1.5 0.2 setosa 4
## 5 5.0 3.6 1.4 0.2 setosa 5
## 6 5.4 3.9 1.7 0.4 setosa 6
iris.wide <- iris %>%
gather(key, value, -Species, -Flower) %>%
separate(key, c("Part", "Measure"), "\\.") %>%
spread(Measure, value)
iris.wide %>% head()
## Species Flower Part Length Width
## 1 setosa 1 Petal 1.4 0.2
## 2 setosa 1 Sepal 5.1 3.5
## 3 setosa 2 Petal 1.4 0.2
## 4 setosa 2 Sepal 4.9 3.0
## 5 setosa 3 Petal 1.3 0.2
## 6 setosa 3 Sepal 4.7 3.2