When we think about data visualization, we often think of a static plot with some colors and points. New tools (spearheaded by Hans Rosling’s Gapminder project) are constantly being developed to allow us to interact dynamically with data visualizations. I’ll discuss a variety of different tools that attempt to make data come to life!
All too often statistics is thought of as a boring subject with boring plots. New software packages and tools are being developed to better understand the relationships between variables. I’ll demonstrate a lot of these different tools and packages using the R computing language. We’ll go through a variety of different examples from multiple fields to better understand anomalies and trends in data.
pkg <- c("dplyr", "ggplot2", "knitr", "readr",
"xts", "maps", "googleVis", "DT", "rmarkdown")
new.pkg <- pkg[!(pkg %in% rownames(installed.packages()))]
if (length(new.pkg)) {
install.packages(new.pkg, repos = "http://cran.rstudio.com")
}
if(!require(revealjs)){
devtools::install_github("jjallaire/revealjs", ref = "a4854c017eac44d969a216103551c21d66329a74")
}
if(!require(plotly)){
devtools::install_github("ropensci/plotly")
}
if(!require(pnwflights14)){
devtools::install_github("ismayc/pnwflights14")
}
if(!require(dygraphs)){
devtools::install_github("rstudio/dygraphs", ref = "778acdaeb91b754412d928ea824632bceae3078b")
}
library(dplyr)
library(ggplot2)
library(revealjs)
library(knitr)
library(readr)
library(plotly)
library(dygraphs)
library(pnwflights14)
library(xts)
library(maps)
library(googleVis)
library(DT)
options(width = 100, scipen = 99)
Introduced by Ronald Fisher in 1936
The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor).
Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres. Based on the combination of these four features, Fisher developed a model to distinguish the species from each other.
Source: Wikipedia
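The structure described above can be checked directly against the `iris` data frame that ships with base R. This is only a minimal sketch; the object names `species_counts` and `measure_cols` are illustrative and not part of the original slides.

```r
# iris is in the datasets package, which is attached by default in R
species_counts <- table(iris$Species)   # number of samples per species
species_counts

# The four measured features are the numeric columns (in centimetres)
measure_cols <- names(iris)[sapply(iris, is.numeric)]
measure_cols

# Mean of each feature within each species, as a quick numeric summary
aggregate(. ~ Species, data = iris, FUN = mean)
```

Comparing the per-species means gives a rough sense of which features separate the species before any plot is drawn.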
with(iris, plot(x = Petal.Width, y = Sepal.Length))
qplot(Petal.Width, Sepal.Length, data = iris)
plotly
ggiris <- qplot(Petal.Width, Sepal.Length, data = iris)
ggplotly(ggiris)
plotly
ggiris_colored <- qplot(Petal.Width, Sepal.Length, data = iris,
                        color = Species)
ggplotly(ggiris_colored)
iris %>% plot_ly(x = Petal.Width, y = Sepal.Length,
type = "scatter", color = Species, mode = "markers")
Based on analysis done by Rich Majerus in 2014 using the googleVis package
Data does not include 143 interdisciplinary majors and 9 undecided majors.
Majors like Bio/Chem are split between the two departments
General Lit/Lit majors are included with English
Dance majors and faculty are included with Theatre
major_data %>% ggplot(aes(x = Majors, y = FTE)) +
geom_point() +
ggtitle("Reed College Majors and FTE by Department")
# make a new data frame with only two columns to scatter plot
keep <- c('Majors', 'FTE')
data2 <- major_data[keep]
# add names to new data frame as factor
data2$pop.html.tooltip=major_data$Departments
# create interactive scatter plot using googleVis
Scatter1 <- gvisScatterChart(data2,
options=list(tooltip="{isHtml:'True'}", # Define tooltip
legend="none", lineWidth=0, pointSize=5,
vAxis="{title:'Faculty (Total FTE)'}", # y-axis label
hAxis="{title:'Majors (declared and intended)'}", # x-axis label
width=900, height=600)) # plot dimensions
print(Scatter1, 'chart')
Left-click and drag to select an area of the chart to zoom on. Right-click to zoom back out.
Scatter2 <- gvisScatterChart(data2,
options=list(
explorer="{actions: ['dragToZoom',
'rightClickToReset'],
maxZoomIn:0.05}",
#chartArea="{width:'85%',height:'80%'}",
tooltip="{isHtml:'True'}",
crosshair="{trigger:'both'}",
legend="none", lineWidth=0, pointSize=5,
vAxis="{title:'Faculty (Total FTE)'}",
hAxis="{title:'Majors (declared and intended)'}",
width=900, height=600))
print(Scatter2, 'chart')
The pnwflights14 package contains information about all flights that departed from SEA in Seattle and PDX in Portland in 2014: 162,049 flights in total.
We can use this data and the dplyr package to look at daily maximum departure delays throughout the year for Alaska Airlines.
data("flights", package = "pnwflights14")
alaskan <- flights %>%
filter(carrier %in% c("AS")) %>%
mutate(date2014 = as.Date(paste0("2014/", month, "/", day))) %>%
group_by(date2014) %>%
summarize(max_dep_delay = max(dep_delay, na.rm=TRUE))
alaskan %>% ggplot(aes(x = date2014, y = max_dep_delay)) +
geom_line() +
scale_x_date(date_breaks = "1 month", date_labels = "%b %y") +
xlab("Date") +
ylab("Maximum Departure Delay")
ggplotly()
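The date construction and the daily maximum used above can be tried on a few rows without loading any packages. This is a rough base R sketch: the `toy` data frame holds made-up values standing in for flights (it is not pnwflights14 data), and `aggregate()` is used here as a stand-in for the `group_by()` and `summarize()` steps.

```r
# Hypothetical toy rows standing in for a few flights (not real data)
toy <- data.frame(
  month     = c(1, 1, 1, 2),
  day       = c(1, 1, 2, 14),
  dep_delay = c(5, NA, 30, -3)
)

# Build a Date from the month and day columns, fixing the year at 2014
toy$date2014 <- as.Date(paste0("2014/", toy$month, "/", toy$day))

# Daily maximum departure delay; rows with a missing delay are dropped
# by the formula interface before max() is applied
daily_max <- aggregate(dep_delay ~ date2014, data = toy, FUN = max)
daily_max
```

Each date ends up with a single row, which is the shape both `geom_line()` and a time series object expect.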
dygraph (converting to time series format using xts)
alaskan_ts <- xts(alaskan$max_dep_delay, alaskan$date2014)
colnames(alaskan_ts) <- "Max Departure Delay"
dygraph(alaskan_ts) %>% dyRangeSelector()
Canada has an extremely large land mass (it is the 2nd largest country in the world), but it ranks only 37th in population
The US ranks 4th highest in land mass and 3rd highest in population
We can use data in the maps package to better visualize why these rankings exist
data(canada.cities, package = "maps")
canada_plot <- ggplot(canada.cities, aes(x = long, y = lat)) +
coord_equal() +
geom_point(aes(size=pop, text = paste0(name, ",",
"Pop: ", prettyNum(pop, big.mark = ",", scientific = FALSE))),
colour = "red", alpha = 1/2) +
borders(regions="canada")
canada_plot
ggplotly(canada_plot)
data(us.cities, package = "maps")
us_plot <- ggplot(us.cities, aes(x = long, y = lat)) +
coord_equal() +
geom_point(aes(size=pop, text = paste0(name, ",",
"Pop: ", prettyNum(pop, big.mark = ",", scientific = FALSE))),
colour = "red", alpha = 1/2) +
borders(regions="usa", xlim = c(-200, -60), ylim = c(20, 80))
us_plot
ggplotly(us_plot)
plot_ly(z = volcano, type = "surface")
datatable(iris, options = list(pageLength = 5))
Plotting maps in R with ggplot2
Gapminder (its Trendalyzer software was acquired by Google in 2007)
Hans Rosling’s TED talk - “The Best Stats You’ve Ever Seen”
- Data analysis
- Data wrangling/cleaning
- Data visualization
- Data tidying/manipulating
- Reproducible research
- Email me at cismay@reed.edu or chester.ismay@reed.edu to schedule a time to meet if office hours don’t work
- Tentative Spring 2016 office hours (ETC 223)
- Mondays (10 AM to 11 AM)
- Tuesdays (2 PM to 3 PM)
- Wednesdays (1:30 PM to 2:30 PM)
- Sometimes available for virtual office hours via Google Hangouts (email me for details)
- Code for slide creation on my GitHub page
- Slides available here
sessionInfo()
## R version 3.2.3 (2015-12-10)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.2 (El Capitan)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] DT_0.1 googleVis_0.5.8 maps_2.3-9 xts_0.9-7
## [5] zoo_1.7-12 readr_0.1.1 knitr_1.12.3 dplyr_0.4.2
## [9] dygraphs_0.6 pnwflights14_0.1.0.9000 plotly_2.3.0 ggplot2_2.0.0
## [13] revealjs_0.5
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.1 RColorBrewer_1.1-2 formatR_1.2.1 plyr_1.8.3 base64enc_0.1-3
## [6] viridis_0.3.3 tools_3.2.3 digest_0.6.8 jsonlite_0.9.17 evaluate_0.8
## [11] gtable_0.1.2 lattice_0.20-33 DBI_0.3.1 rstudioapi_0.3.1 yaml_2.1.13
## [16] parallel_3.2.3 gridExtra_2.0.0 httr_1.0.0 stringr_1.0.0 htmlwidgets_0.5
## [21] grid_3.2.3 R6_2.1.1 rmarkdown_0.8.1 RJSONIO_1.3-0 magrittr_1.5
## [26] scales_0.3.0.9000 htmltools_0.2.6 assertthat_0.1 colorspace_1.2-6 labeling_0.3
## [31] stringi_1.0-1 lazyeval_0.1.10 munsell_0.4.2