When we think about data visualization, we often think of a static plot with some colors and points. New tools (spearheaded by Hans Rosling’s Gapminder project) are constantly being developed that let us interact dynamically with data visualizations. I’ll discuss a variety of tools that attempt to make data come to life!

All too often statistics is thought of as a boring subject with boring plots. New software packages and tools are being developed to help us better understand the relationships between variables. I’ll demonstrate many of these tools and packages using the R computing language. We’ll go through examples from multiple fields to better understand anomalies and trends in data.

pkg <- c("dplyr", "ggplot2", "knitr", "readr", 
         "xts", "maps", "googleVis", "DT", "rmarkdown")

# compare against installed package names only, not every cell of the matrix
new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]

if (length(new.pkg)) {
  install.packages(new.pkg, repos = "http://cran.rstudio.com")
}

if(!require(revealjs)){
  devtools::install_github("jjallaire/revealjs", ref = "a4854c017eac44d969a216103551c21d66329a74")
}

if(!require(plotly)){
  devtools::install_github("ropensci/plotly")
}

if(!require(pnwflights14)){
  devtools::install_github("ismayc/pnwflights14")
}

if(!require(dygraphs)){
  devtools::install_github("rstudio/dygraphs", ref = "778acdaeb91b754412d928ea824632bceae3078b")
}

library(dplyr)
library(ggplot2)
library(revealjs)
library(knitr)
library(readr)
library(plotly)
library(dygraphs)
library(pnwflights14)
library(xts)
library(maps)
library(googleVis)
library(DT)

options(width = 100, scipen = 99)

The Iris flower data set

Source: Wikipedia


Scatterplots


Traditional (boring) plot

with(iris, plot(x = Petal.Width, y = Sepal.Length))


Prettier (not quite as boring) plot

qplot(Petal.Width, Sepal.Length, data = iris)


Interactive plot using plotly

ggiris <- qplot(Petal.Width, Sepal.Length, data = iris)
ggplotly(ggiris)


Prettier interactive plot using plotly

ggiris_colored <- qplot(Petal.Width, Sepal.Length, data = iris, 
  color = Species)
ggplotly(ggiris_colored)


Another interactive plot

iris %>% plot_ly(x = Petal.Width, y = Sepal.Length,
  type = "scatter", color = Species, mode = "markers")


Scatterplots (Part Deux)


Reed College majors vs. total faculty FTE by department

  • Based on an analysis done by Rich Majerus in 2014 using the googleVis package

  • Data does not include 143 interdisciplinary majors and 9 undecided majors.

  • Majors like Bio/Chem are split between the two departments

  • General Lit/Lit majors are included with English

  • Dance majors and faculty are included with Theatre


major_data %>% ggplot(aes(x = Majors, y = FTE)) +
  geom_point() +
  ggtitle("Reed College Majors and FTE by Department")


# make a new data frame with only the two columns to plot
keep <- c('Majors', 'FTE')
data2 <- major_data[keep]

# add department names as the tooltip column
data2$pop.html.tooltip <- major_data$Departments

# create interactive scatter plot using googleVis
Scatter1 <- gvisScatterChart(data2,
                            options=list(tooltip="{isHtml:'True'}",              # define tooltip
                              legend="none", lineWidth=0, pointSize=5,
                              vAxis="{title:'Faculty (Total FTE)'}",             # y-axis label
                              hAxis="{title:'Majors (declared and intended)'}",  # x-axis label
                              width=900, height=600))                            # plot dimensions
print(Scatter1, 'chart')

Left-click and drag to select an area of the chart to zoom on. Right-click to zoom back out.

Scatter2 <- gvisScatterChart(data2,
                            options=list(
                              explorer="{actions: ['dragToZoom', 
                                          'rightClickToReset'],
                                           maxZoomIn:0.05}",
                              #chartArea="{width:'85%',height:'80%'}",
                              tooltip="{isHtml:'True'}",
                              crosshair="{trigger:'both'}",
                              legend="none", lineWidth=0, pointSize=5,
                              vAxis="{title:'Faculty (Total FTE)'}",
                              hAxis="{title:'Majors (declared and intended)'}",
                              width=900, height=600))
print(Scatter2, 'chart')

Alaska Airlines departure delays in the PNW

  • The pnwflights14 package contains information about all flights that departed from SEA in Seattle and PDX in Portland in 2014: 162,049 flights in total.

  • We can use this data and the dplyr package to look at daily maximum departure delays throughout the year for Alaska Airlines.
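
The daily-maximum idea can be tried on a small made-up data frame before touching the real flights data. This is only a sketch: toy_flights and its values are hypothetical stand-ins, not real delays, and the column names are assumed to match those used for the flights data later in the talk.

```r
library(dplyr)

# Hypothetical toy data: two flights on each of three days
toy_flights <- data.frame(
  month     = c(1, 1, 1, 1, 2, 2),
  day       = c(1, 1, 2, 2, 3, 3),
  dep_delay = c(5, 42, NA, 8, -3, 17)
)

toy_daily_max <- toy_flights %>%
  # build a Date from the month and day columns
  mutate(date2014 = as.Date(paste0("2014/", month, "/", day))) %>%
  # one group per calendar day
  group_by(date2014) %>%
  # largest delay per day, ignoring missing values
  summarize(max_dep_delay = max(dep_delay, na.rm = TRUE))

print(toy_daily_max)
```

Note that na.rm = TRUE only helps when a day has at least one recorded delay; a day whose delays are all missing would come back as -Inf with a warning.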


Time series/line graphs


data("flights", package = "pnwflights14")
alaskan <- flights %>% 
  filter(carrier %in% c("AS")) %>%
  mutate(date2014 = as.Date(paste0("2014/", month, "/", day))) %>%
  group_by(date2014) %>%
  summarize(max_dep_delay = max(dep_delay, na.rm=TRUE))
alaskan %>% ggplot(aes(x = date2014, y = max_dep_delay)) +
  geom_line() +
  scale_x_date(date_breaks = "1 month", date_labels = "%b %y") +
  xlab("Date") +
  ylab("Maximum Departure Delay")


ggplotly()


Plotting the time series using dygraphs

(Converting to time series format using xts)

alaskan_ts <- xts(alaskan$max_dep_delay, alaskan$date2014)
colnames(alaskan_ts) <- "Max Departure Delay"
dygraph(alaskan_ts) %>% dyRangeSelector()
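
The conversion step matters because dygraph() expects time series data (an xts object or something convertible to one) rather than a plain data frame. A minimal sketch of what the conversion produces, using hypothetical toy values in place of the real delays:

```r
library(xts)

# Hypothetical toy values standing in for daily maximum delays
toy_dates  <- as.Date(c("2014-01-01", "2014-01-02", "2014-01-03"))
toy_delays <- c(42, 8, 17)

# xts pairs each value with its date; order.by supplies the time index
toy_ts <- xts(toy_delays, order.by = toy_dates)
colnames(toy_ts) <- "Max Departure Delay"

# Rows can now be looked up by date string
jan2 <- as.numeric(toy_ts["2014-01-02"])
print(toy_ts)
```

Once the data is in this form, the column name set with colnames() is what dygraphs shows as the series label.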


Canadian and US population and geography

  • Canada is an extremely large land mass (2nd largest country in the world), but is only the 37th largest country in terms of population

  • The US ranks 4th highest in land mass and 3rd highest in population

  • We can use data in the maps package to better visualize why these rankings exist


Maps


data(canada.cities, package = "maps")
canada_plot <- ggplot(canada.cities, aes(x = long, y = lat)) +
  coord_equal() +
  geom_point(aes(size=pop, text = paste0(name, ", ",
    "Pop: ", prettyNum(pop, big.mark = ",", scientific = FALSE))), 
    colour = "red", alpha = 1/2) +
  borders(regions="canada")
canada_plot


ggplotly(canada_plot)


data(us.cities, package = "maps")
us_plot <- ggplot(us.cities, aes(x = long, y = lat)) +
  coord_equal() +
  geom_point(aes(size=pop, text = paste0(name, ", ",
    "Pop: ", prettyNum(pop, big.mark = ",", scientific = FALSE))), 
    colour = "red", alpha = 1/2) +
  borders(regions="usa", xlim = c(-200, -60), ylim = c(20, 80))
us_plot


ggplotly(us_plot)


3D objects


Maunga Whau (Mt Eden), a volcano in Auckland, New Zealand

plot_ly(z = volcano, type = "surface")
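
The reason volcano can be passed straight in as z is that it is a plain numeric matrix of elevations (the built-in dataset for Maunga Whau, per its help page). A quick sketch for inspecting it, with a static base-graphics view for comparison; the viewing angles and colour here are arbitrary choices, not anything from the talk:

```r
# volcano ships with R in the datasets package
vol_dim <- dim(volcano)   # 87 rows by 61 columns of elevations
print(vol_dim)
print(is.matrix(volcano))

# Static (non-interactive) counterpart using base graphics
persp(volcano, theta = 30, phi = 30, expand = 0.5, col = "lightblue")
```

The static view is fixed at one angle, which is exactly what the interactive surface plot lets you change by dragging.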


Interactive Data Tables


datatable(iris, options = list(pageLength = 5))


Another data table example

RA Duty Scheduling


What can I help you with?

  • Data analysis
  • Data wrangling/cleaning
  • Data visualization
  • Data tidying/manipulating
  • Reproducible research

When am I available?

  • Email me at cismay@reed.edu or chester.ismay@reed.edu to schedule a time to meet if office hours don’t work
  • Tentative Spring 2016 office hours (ETC 223)
    • Mondays (10 AM to 11 AM)
    • Tuesdays (2 PM to 3 PM)
    • Wednesdays (1:30 PM to 2:30 PM)
  • Sometimes available for virtual office hours via Google Hangouts (email me for details)

Thanks!


cismay@reed.edu



sessionInfo()
## R version 3.2.3 (2015-12-10)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.2 (El Capitan)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] DT_0.1                  googleVis_0.5.8         maps_2.3-9              xts_0.9-7              
##  [5] zoo_1.7-12              readr_0.1.1             knitr_1.12.3            dplyr_0.4.2            
##  [9] dygraphs_0.6            pnwflights14_0.1.0.9000 plotly_2.3.0            ggplot2_2.0.0          
## [13] revealjs_0.5           
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.1        RColorBrewer_1.1-2 formatR_1.2.1      plyr_1.8.3         base64enc_0.1-3   
##  [6] viridis_0.3.3      tools_3.2.3        digest_0.6.8       jsonlite_0.9.17    evaluate_0.8      
## [11] gtable_0.1.2       lattice_0.20-33    DBI_0.3.1          rstudioapi_0.3.1   yaml_2.1.13       
## [16] parallel_3.2.3     gridExtra_2.0.0    httr_1.0.0         stringr_1.0.0      htmlwidgets_0.5   
## [21] grid_3.2.3         R6_2.1.1           rmarkdown_0.8.1    RJSONIO_1.3-0      magrittr_1.5      
## [26] scales_0.3.0.9000  htmltools_0.2.6    assertthat_0.1     colorspace_1.2-6   labeling_0.3      
## [31] stringi_1.0-1      lazyeval_0.1.10    munsell_0.4.2