Suppose you’re a fan of ggplot, and suppose you have some data that you want to visualize using lines. Easy piecy so far. Now you want to maximally exploit the power of good visualizations and want to use some of the line attributes to identify that something is active or not, or the category of something - within the same line.
If this sounds abstract, look at this example of plotting the capacity of some nuclear power plants while at the same time indicating whether they’re active or not.
Disclaimer: I’m not a nuclear physicist and the data is totally fake.
So lets first generate some data:
plantdata <- data.table(day = rep(1:365,2))
plantdata[, date := now()-(365-days(day))]
plantdata[, Factory := c(rep("Plant A", 365), rep("Plant B", 365))]
plantdata[Factory == "Plant A", Capacity := (day/2+150)*10e5*(runif(19, min=0.8, max=1.3)[day %/% 20 + 1])]
plantdata[Factory == "Plant B", Capacity := (day/4+100)*10e5*(runif(19, min=0.7, max=1.2)[day %/% 20 + 1])]
plantdata[, IsActive := c("Active", "Inactive")[1+(runif(10) < 0.5)][seq(.N) %/% 80 + 1]]
Plotting the plant capacity over time is trivial:
ggplot(plantdata, aes(date, Capacity, color=Factory)) +
geom_line(size=1)+
ggtitle("Nuclear Power Plant Capacity")
Now, how to indicate whether or not both plants were active in a certain period? You may want to try
ggplot(plantdata, aes(date, Capacity, color=Factory, linetype=IsActive)) +
geom_line(size=1)+
ggtitle("Nuclear Power Plant Capacity")
Hmmm. Doesnt look too good. What we’re missing is to identify ranges in the data where the plant is active or not active. Data.table to the rescue! It provides a method to label consecutive ranges of data with a unique ID with the rleid function.
plantdata[, sequenceID := rleid(IsActive), by=Factory]
plantdata[158:163]
## day date Factory Capacity IsActive sequenceID
## 1: 158 2020-01-14 22:47:53 Plant A 278629889 Inactive 2
## 2: 159 2020-01-15 22:47:53 Plant A 279238252 Inactive 2
## 3: 160 2020-01-16 22:47:53 Plant A 237822129 Active 3
## 4: 161 2020-01-17 22:47:53 Plant A 238339134 Active 3
## 5: 162 2020-01-18 22:47:53 Plant A 238856139 Active 3
## 6: 163 2020-01-19 22:47:53 Plant A 239373143 Active 3
Now we can use this sequence ID to help ggplot to identify line segments that have the same activity status.
ggplot(plantdata, aes(date, Capacity, color=Factory, linetype=IsActive, group=paste(Factory,sequenceID))) +
geom_line(size = 1)+
ggtitle("Nuclear Power Plant Capacity")
That’s a lot better already. But there’s still something wrong: the lines are not continuous. There are sometimes gaps when the status changes from active to inactive. We can fix that by inserting an extra row of data whenever the activity status changes. So we first identify the rows where the change occurs, and for those rows, we insert a copy of that row before with the same Capacity value (or whatever your are plotting) but with the activity status of the previous row, and also the sequence ID of the previous row.
Thanks to data.table this is rather easy.
plantdata[, statusChanged := sequenceID != sequenceID[shift(seq(.N), type = "lag", fill = 1)], by=Factory]
plantdata[158:163]
## day date Factory Capacity IsActive sequenceID
## 1: 158 2020-01-14 22:47:53 Plant A 278629889 Inactive 2
## 2: 159 2020-01-15 22:47:53 Plant A 279238252 Inactive 2
## 3: 160 2020-01-16 22:47:53 Plant A 237822129 Active 3
## 4: 161 2020-01-17 22:47:53 Plant A 238339134 Active 3
## 5: 162 2020-01-18 22:47:53 Plant A 238856139 Active 3
## 6: 163 2020-01-19 22:47:53 Plant A 239373143 Active 3
## statusChanged
## 1: FALSE
## 2: FALSE
## 3: TRUE
## 4: FALSE
## 5: FALSE
## 6: FALSE
extrarows <- copy(plantdata[(statusChanged)])
extrarows[, sequenceID := sequenceID - 1]
extrarows[, IsActive := ifelse(IsActive == "Active", "Inactive", "Active")]
extrarows
## day date Factory Capacity IsActive sequenceID
## 1: 80 2019-10-28 22:47:53 Plant A 241664737 Active 1
## 2: 160 2020-01-16 22:47:53 Plant A 237822129 Inactive 2
## 3: 240 2020-04-05 22:47:53 Plant A 318669297 Active 3
## 4: 115 2019-12-02 22:47:53 Plant B 112480651 Inactive 1
## 5: 355 2020-07-29 22:47:53 Plant B 211817987 Active 2
## statusChanged
## 1: TRUE
## 2: TRUE
## 3: TRUE
## 4: TRUE
## 5: TRUE
We can now add these extra rows to the plot data
ggplot(rbind(plantdata, extrarows), aes(date, Capacity, color=Factory, linetype=IsActive, group=paste(Factory,sequenceID))) +
geom_line(size = 1)+
ggtitle("Nuclear Power Plant Capacity")
Success at last! In addition to the capacity, the lines now nicely indicate when the power plant was active.
The same trick can be applied to other attributes of course. Suppose you have data about the number of downloads of some albums, plus how they were rated over time, you might want to show that in one plot. Combining all the tricks we’ve used before, this would be something like.
genalbumdata <- function(title, releasedate)
{
diff <- as.integer(difftime(as.Date(now()), ymd(releasedate),units = "days"))
highpoint <- 0.2*diff
data <- data.table(day = seq(diff),
date = now() - days(diff - seq(diff)),
dailysales = 1000*trunc(runif(diff)*5*(1-abs(highpoint - seq(diff))/(0.8*diff))),
Album = title)
data[, Sales := cumsum(dailysales)]
data[, Rating := c("Low", "Medium", "High")[1+(runif(.N %/% 500 + 1,0,9) %/% 3)][seq(.N) %/% 500 + 1]]
return(data)
}
albumdata <- rbindlist(list(genalbumdata("Hotel California", "19760101"),
genalbumdata("Led Zeppelin IV", "19710101"),
genalbumdata("21", "20110101"),
genalbumdata("Thriller", "19820101")))
albumdata[, sequenceID := rleid(Rating), by=Album]
albumdata[, statusChanged := sequenceID != sequenceID[shift(seq(.N), type = "lag", fill = 1)], by=Album]
albumdata[, prevRating := Rating[shift(seq(.N), type = "lag", fill = 1)], by=Album]
extrarows <- copy(albumdata[(statusChanged)])
extrarows[, sequenceID := sequenceID - 1]
extrarows[, Rating := prevRating]
ggplot(rbind(albumdata, extrarows), aes(date, Sales, color=Rating, group=paste(Album, sequenceID))) +
geom_line(size=1)+
geom_label(data=albumdata[,.(date = last(date), Sales = last(Sales), sequenceID = last(sequenceID), Rating = last(Rating)),by="Album"],
mapping = aes(label = Album), hjust=1, colour="blue") +
ggtitle("Album Popularity") + ylab("Total Sales") + scale_color_manual(values = c("Red","Yellow","Green"))