I have recently come across Ryan Hafen’s fascinating trelliscope R package, which provides a framework for detailed visualization of large, complex data.

As a first dip in the water, I have used it to visualize some baseball data easily available from the Lahman package. For a much more detailed explanation of the process, please refer to the trelliscope tutorial on house prices.

First, load the required packages and make a connection to the visualization database:

# Load the packages

# These are the ones needed for the specific subject of interest; you may need to install them first
#devtools::install_github("tesseradata/datadr")
#devtools::install_github("tesseradata/trelliscope")

library(datadr)
library(trelliscope)

# some packages for manipulation and display
library(dplyr)
library(ggplot2)
library(stringr)

# and the data
library(Lahman)

conn <- vdbConn("vdbLahman", name = "baseball")

For this example, I am calculating the WHIP (walks plus hits per inning pitched) for each MLB pitcher by year and by team. For this, I need to link two data.frames: Pitching (which has the details) and Master (which includes names and the IDs used to link to more detailed pages).

# limit data to the major leagues (AL and NL) from 1901 onwards
# join data.frames
# calculate and select the data required
pitching <- tbl_df(Pitching %>%
  filter(yearID >= 1901 & lgID %in% c("AL", "NL")) %>%
  left_join(Master, by = "playerID") %>%
  mutate(
    WHIP = round((H + BB) * 3 / IPouts, 2),
    name = paste(nameFirst, nameLast, sep = " "),
    bbrefURL = paste0("http://www.baseball-reference.com/players/",
                      str_sub(bbrefID, 1, 1), "/", bbrefID, ".shtml")
  ) %>%
  select(playerID, teamID, name, WHIP, yearID, IPouts, bbrefURL))

glimpse(pitching)
## Observations: 40,460
## Variables: 7
## $ playerID (chr) "bakerbo01", "bakerbo01", "bernhbi01", "bevilbe01", "...
## $ teamID   (fctr) CLE, PHA, PHA, BOS, CLE, CLE, CLE, SLN, BLA, SLN, CH...
## $ name     (chr) "Bock Baker", "Bock Baker", "Bill Bernhard", "Ben Bev...
## $ WHIP     (dbl) 3.62, 2.00, 1.47, 1.89, 1.68, 4.00, 1.84, 2.53, 2.33,...
## $ yearID   (int) 1901, 1901, 1901, 1901, 1901, 1901, 1901, 1901, 1901,...
## $ IPouts   (int) 24, 18, 771, 27, 300, 3, 96, 45, 18, 3, 646, 972, 21,...
## $ bbrefURL (chr) "http://www.baseball-reference.com/players/b/bakerbo0...
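To make the WHIP formula concrete, here is a minimal base R sketch using an invented season line (the numbers are hypothetical, not taken from the Lahman data). It also shows why a row with no outs recorded can produce a non-finite WHIP, which you may want to filter out.

```r
# Hypothetical season line for one pitcher (values invented for illustration)
season <- data.frame(H = 180, BB = 60, IPouts = 600)

# IPouts counts outs recorded, so innings pitched = IPouts / 3
innings <- season$IPouts / 3

# WHIP = (walks + hits) / innings, which is the same as (H + BB) * 3 / IPouts
whip <- round((season$H + season$BB) * 3 / season$IPouts, 2)
whip  # 1.2

# With zero outs recorded the division gives Inf, not a usable WHIP
zero_outs <- (5 + 2) * 3 / 0
is.finite(zero_outs)  # FALSE
```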

We now use the divide function from the datadr package to create key-value pairs. From my limited experience, the package appears to work best with charts created in lattice, with ggplot a second favourite, but others (e.g. rCharts, ggvis) are also possible.

Here I have created a ggplot of WHIP by year, colored by the team played for and sized by innings pitched. The trelliscope package can then apply that in a panel for each player.

byPlayer <- divide(pitching,
                   by = c("playerID", "name"))

## here is what results
byPlayer
## 
## Distributed data frame backed by 'kvMemory' connection
## 
##  attribute      | value
## ----------------+-----------------------------------------------------------
##  names          | teamID(cha), WHIP(num), yearID(int), IPouts(int), and 1 more
##  nrow           | 40460
##  size (stored)  | 23.18 MB
##  size (object)  | 23.18 MB
##  # subsets      | 8097
## 
## * Other attributes: getKeys()
## * Missing attributes: splitSizeDistn, splitRowDistn, summary
## * Conditioning variables: playerID, name
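Conceptually, dividing by player yields one subset per player, where the key identifies the subset and the value holds its rows. The base R sketch below mimics that idea with split() on a tiny made-up data frame; it is only an analogy and makes no claim about how datadr stores things internally.

```r
# Tiny made-up data frame standing in for the pitching data
toy <- data.frame(
  playerID = c("aaa01", "aaa01", "bbb01"),
  yearID   = c(2001L, 2002L, 2001L),
  WHIP     = c(1.10, 1.25, 1.40),
  stringsAsFactors = FALSE
)

# split() returns a named list: the names act as keys, the data frames as values
subsets <- split(toy, toy$playerID)

names(subsets)            # the keys, one per player
nrow(subsets[["aaa01"]])  # number of rows held for the first player
```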

## create the graph within a function

panel_ggplot <- function(x)
  x %>% 
  mutate(IP=IPouts/3) %>% 
  ggplot(aes(x=yearID,y=WHIP,color=teamID,size=IP))+
  geom_point() +
  ylab("WHIP") +
  xlab("") +
  theme_bw()


## test
panel_ggplot(byPlayer[[1]]$value)

Trelliscope offers the ability to sort and filter panels using what are termed cognostics. So here, I might want to reduce the output based on career innings pitched, or on when a pitcher started or stopped playing.

## create cognostics, including filters and the URL link
  whipCog <- function(x) {
    list(
      totInns=sum(x$IPouts)/3,
      minYear=min(x$yearID),
      maxYear=as.integer(max(x$yearID)),
      bbhRef = cogHref(unique(x$bbrefURL),desc="Baseball Reference link")
    )
  }

  makeDisplay(byPlayer,
              panelFn = panel_ggplot,
              cogFn   = whipCog,
              name    = "WHIP by Team by Year",
              desc    = "Pitcher Average Walks and Hits per Innings")

  ## If you uncomment, the display should appear in a browser  
 # view()

This opens a browser with a list of possible displays. Here there is just the one, which, if clicked on, will replicate the example we viewed earlier. However, using the Table Sort/Filter tab, we can manipulate what is shown. For instance, we could sort by most innings pitched and filter to just those with more than 2000 innings who started pitching from 1990 onwards.
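The sort and filter just described happens interactively in the viewer, but the logic is easy to picture as ordinary data frame operations on the cognostics. A rough base R sketch follows, using made-up pitchers and values; the column names mirror totInns and minYear from whipCog, and everything else is hypothetical.

```r
# Made-up cognostics for three hypothetical pitchers
cogs <- data.frame(
  name    = c("Pitcher A", "Pitcher B", "Pitcher C"),
  totInns = c(2500, 1800, 3100),
  minYear = c(1992L, 1995L, 1984L),
  stringsAsFactors = FALSE
)

# Keep pitchers with more than 2000 career innings who started in 1990 or later
kept <- cogs[cogs$totInns > 2000 & cogs$minYear >= 1990, ]

# Sort by innings pitched, most first
kept <- kept[order(-kept$totInns), ]
kept$name
```

Only "Pitcher A" survives here: "Pitcher B" falls short on innings and "Pitcher C" started before 1990.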


You can also set the layout to see several charts at once and label each with the player's name and a link.


Have fun playing around with this, either by running the code or here, and constructing your own examples.

In addition to the tutorial, there is also an introductory video on YouTube.

If interested, you can get an intro to some of my shinydashboards here.