Overview

This file implements Programming Assignment 1: Visualize Data Using a Chart for the Coursera Data Visualization Class.

Load the Data

The original data comes from NASA GISTEMP Table Data: Global and Hemispheric Monthly Means and Zonal Annual Means, but the course instructors supplied a modified dataset which I downloaded.

The data is explained more fully in Global surface temperature change (Hansen, Ruedy, Sato, and Lo, 2010).

# setwd("~/Education/Coursera - Data Visualization/Projects/PA1")

gistemp <- read.csv("data/ExcelFormattedGISTEMPData2CSV.csv")
#str(gistemp)
#summary(gistemp)

The data is described as “Annual mean Land-Ocean Temperature Index in .01 degrees Celsius - selected zonal means”.
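Because the values are stored as integers in hundredths of a degree, a value of -19 corresponds to -0.19 degrees Celsius. The following is a minimal sketch of the conversion, using only the first five Glob values as they appear in the str(gistemp) output later in this document. The names glob_excerpt and Glob_C are chosen here for illustration and are not part of the dataset.

```r
# Excerpt of the Glob column (1880-1884), copied from the str(gistemp)
# output shown later; values are in units of 0.01 degrees Celsius.
glob_excerpt <- data.frame(
  Year = 1880:1884,
  Glob = c(-19L, -10L, -9L, -19L, -27L)
)

# Convert from hundredths of a degree to degrees Celsius.
# Glob_C is a hypothetical column name used only in this sketch.
glob_excerpt$Glob_C <- glob_excerpt$Glob / 100

print(glob_excerpt)
```

The same division could be applied to any of the temperature columns of the full gistemp data frame.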

Normally I would do further exploratory data analysis here, but that would be confusing since we are only submitting a single visualization.
See More About the Data below.

Submission Visualization

The assignment should be graded on the contents of this section.

Suggested coverage:
What are your X and Y axes?
Did you use a subset of the data? If so, what was it?
Are there any particular aspects of your visualization to which you would like to bring attention?
What do you think the data, and your visualization, shows?

I decided to visualize this dataset using a heatmap of the zonal temperatures by latitude. The dataset contains several granularities of latitude: global, by hemisphere, three broad bands (90S-24S, 24S-24N, 24N-90N), and a further subdivision into eight zones. I chose to use the finest grained data.

The X axis is the latitude zone while the Y axis is the year. Having latitude as the X axis is somewhat nonintuitive, but was necessary to present the visualization in a portrait format for inclusion in this document.

I chose a red/blue colormap in order to be colorblind friendly, although I think the red/green version works better for me (see the Heatmap section below to compare). The colormap is nonlinear to better show variation in the center range (-100 to 100), which contains most of the data (see histogram below).

This visualization shows how non-uniform (by latitude) the temperature change has been since 1880. We can clearly see that the greatest increase over time has been near the North Pole (64N-90N), while the South Pole region shows significant random variation (e.g. see 1908).

The overall warming trend is also clearly visible (compare the top and bottom of the plot), but I believe this is better shown by the basic line plot of the global mean below.

library(gplots) # For heatmap.2

# Finest-grained latitude zones, ordered south to north
cols <- rev(c("X64N.90N", "X44N.64N", "X24N.44N", "EQU.24N", "X24S.EQU", "X44S.24S", 
              "X64S.44S", "X90S.64S"))
hm_data <- gistemp[, cols]

# Use a nonlinear color mapping so we can better see variation in the center of the range.
# color.palette() is not part of base R or gplots; it must be defined before this runs
# (see the link in the Heatmap section below).
# Red/green alternative:
# steps <- c(colorRampPalette(c("red", "black", "green"))(n = 9)[c(1,2,5,8,9)])
steps <- colorRampPalette(c("red", "black", "blue"))(n = 9)[c(1,2,5,8,9)]
my_palette <- color.palette(steps, c(200, 50, 50, 200), space="rgb")

# Fix title spacing problem (why is this ugly hack needed?!)
# A row of zeros adds a thin empty row above the usual 2x2 heatmap.2 layout
lmat <- rbind(c(0,0),c(4,3),c(2,1))
lhei <- c(0.1, 1, 5)
lwid <- c(1.5, 4)

heatmap.2(as.matrix(hm_data), # Show year as Y and latitude as X
        Rowv = NULL, Colv = NULL, dendrogram = "none", # Suppress dendrograms and reordering
        symkey = TRUE, col = my_palette, scale = "none",
        main = "Global Mean Temp (0.01C) by Latitude Zones and Year",
        ylab="Year", xlab = "Latitude Zone",
        lhei = lhei, lwid = lwid, lmat = lmat,
        margins = c(8,5),
        labRow = gistemp$Year, labCol = rev(c("64N-90N", "44N-64N", "24N-44N", "EQU-24N",
                                              "24S-EQU", "44S-24S", "64S-44S", "90S-64S"))
)

Additional Data Analysis

More About the Data

str(gistemp)
## 'data.frame':    135 obs. of  15 variables:
##  $ Year    : int  1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 ...
##  $ Glob    : int  -19 -10 -9 -19 -27 -31 -30 -33 -20 -11 ...
##  $ NHem    : int  -33 -18 -17 -30 -42 -41 -39 -37 -22 -16 ...
##  $ SHem    : int  -5 -2 -1 -8 -12 -21 -21 -28 -17 -6 ...
##  $ X24N.90N: int  -38 -27 -21 -34 -56 -61 -49 -46 -42 -25 ...
##  $ X24S.24N: int  -16 -2 -10 -22 -17 -17 -24 -27 7 4 ...
##  $ X90S.24S: int  -5 -5 4 -2 -11 -20 -20 -26 -33 -17 ...
##  $ X64N.90N: int  -89 -54 -125 -28 -127 -119 -124 -158 -141 -82 ...
##  $ X44N.64N: int  -54 -40 -20 -57 -58 -70 -43 -52 -43 -13 ...
##  $ X24N.44N: int  -22 -14 -3 -20 -41 -43 -38 -21 -22 -21 ...
##  $ EQU.24N : int  -26 -5 -12 -25 -21 -11 -24 -24 7 -3 ...
##  $ X24S.EQU: int  -5 2 -8 -19 -14 -23 -24 -31 8 11 ...
##  $ X44S.24S: int  -2 -6 3 -1 -15 -27 -18 -24 -30 -16 ...
##  $ X64S.44S: int  -8 -3 8 0 -5 -7 -21 -29 -38 -17 ...
##  $ X90S.64S: int  39 37 42 37 40 38 28 21 16 19 ...
summary(gistemp)
##       Year           Glob             NHem              SHem          
##  Min.   :1880   Min.   :-47.00   Min.   :-52.000   Min.   :-47.00000  
##  1st Qu.:1914   1st Qu.:-20.00   1st Qu.:-21.500   1st Qu.:-22.50000  
##  Median :1947   Median : -8.00   Median : -2.000   Median : -9.00000  
##  Mean   :1947   Mean   :  1.63   Mean   :  3.326   Mean   : -0.07407  
##  3rd Qu.:1980   3rd Qu.: 17.50   3rd Qu.: 16.000   3rd Qu.: 25.00000  
##  Max.   :2014   Max.   : 75.00   Max.   : 91.000   Max.   : 59.00000  
##     X24N.90N          X24S.24N          X90S.24S          X64N.90N       
##  Min.   :-61.000   Min.   :-61.000   Min.   :-48.000   Min.   :-158.000  
##  1st Qu.:-26.000   1st Qu.:-22.000   1st Qu.:-26.000   1st Qu.: -47.500  
##  Median :  2.000   Median : -3.000   Median :-11.000   Median :   3.000  
##  Mean   :  5.415   Mean   :  1.926   Mean   : -2.704   Mean   :   9.022  
##  3rd Qu.: 21.000   3rd Qu.: 23.000   3rd Qu.: 21.000   3rd Qu.:  58.000  
##  Max.   :110.000   Max.   : 72.000   Max.   : 58.000   Max.   : 211.000  
##     X44N.64N          X24N.44N           EQU.24N         
##  Min.   :-70.000   Min.   :-57.0000   Min.   :-70.00000  
##  1st Qu.:-27.000   1st Qu.:-19.5000   1st Qu.:-23.50000  
##  Median :  0.000   Median : -8.0000   Median : -3.00000  
##  Mean   :  9.163   Mean   :  0.7111   Mean   :  0.08148  
##  3rd Qu.: 34.500   3rd Qu.: 13.5000   3rd Qu.: 19.50000  
##  Max.   :129.000   Max.   : 77.0000   Max.   : 72.00000  
##     X24S.EQU          X44S.24S           X64S.44S          X90S.64S       
##  Min.   :-55.000   Min.   :-43.0000   Min.   :-62.000   Min.   :-237.000  
##  1st Qu.:-22.000   1st Qu.:-23.0000   1st Qu.:-27.500   1st Qu.: -41.000  
##  Median : -3.000   Median : -9.0000   Median : -9.000   Median :   5.000  
##  Mean   :  3.748   Mean   :  0.7926   Mean   : -7.593   Mean   :  -5.119  
##  3rd Qu.: 29.500   3rd Qu.: 22.0000   3rd Qu.: 16.000   3rd Qu.:  37.500  
##  Max.   : 81.000   Max.   : 76.0000   Max.   : 38.000   Max.   : 136.000

Basic Line Plot

A simple and clear presentation of the global change.
But notice how this obscures the zonal variation highlighted by the heatmap!

library(zoo)     # For zoo, rollmean, coredata
library(ggplot2) # For ggplot

# Make zoo object of data
Glob.zoo <- zoo(gistemp$Glob, gistemp$Year)

# Calculate a centered 3-year moving average; pad the first and last values with NA
# so the result has the same length as the original series
m.av <- rollmean(Glob.zoo, 3, fill = list(NA, NULL, NA))

# Add calculated moving averages to existing data frame
gistemp$Glob.av <- coredata(m.av)

# Plot the annual values in black and the moving average in red
ggplot(gistemp, aes(Year, Glob)) + geom_line() + 
  geom_line(aes(Year, Glob.av), color="red") + 
  xlab("Year") + ylab("Temperature (0.01 C)") +
  ggtitle("Global Annual Mean Temperature Difference and 3 Year MA")
## Warning: Removed 2 rows containing missing values (geom_path).
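The warning above comes from the two NA values at the ends of the moving average: a centered 3-year window has no left neighbor in the first year and no right neighbor in the last. As a rough illustration that does not depend on zoo, the sketch below computes the same kind of centered average with base R's stats::filter on the first ten Glob values copied from the str(gistemp) output. The variable names glob and ma3 are chosen here for illustration.

```r
# First ten Glob values (1880-1889), copied from the str(gistemp) output.
glob <- c(-19, -10, -9, -19, -27, -31, -30, -33, -20, -11)

# Centered 3-point moving average: equal weights of 1/3, and sides = 2
# centers the window on each point. The first and last entries are NA
# because the window runs off the end of the series there.
ma3 <- as.numeric(stats::filter(glob, rep(1 / 3, 3), sides = 2))

print(round(ma3, 2))
```

Each interior value is simply the mean of a year and its two neighbors, which is why exactly two points are dropped from the red line in the plot.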

Histogram

This is primarily used to help determine the color mapping for the heatmap below.

cols <- rev(c("X64N.90N", "X44N.64N", "X24N.44N", "EQU.24N", "X24S.EQU", "X44S.24S", 
              "X64S.44S", "X90S.64S"))
hist(as.matrix(gistemp[, cols]), breaks=100, xlab="Temperature (0.01C)",
     main="Histogram of Zonal Temperature Values")
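The histogram can be complemented with a simple count of how many values fall inside the center range that the color mapping emphasizes. The sketch below is only an illustration: it uses the first ten values of three zones copied from the str(gistemp) output, so the resulting share says nothing about the full dataset. The names zones_excerpt, center_range, and share_in_center are chosen here.

```r
# First ten values (1880-1889) of three zones, copied from the
# str(gistemp) output; units are 0.01 degrees Celsius.
zones_excerpt <- data.frame(
  X64N.90N = c(-89, -54, -125, -28, -127, -119, -124, -158, -141, -82),
  EQU.24N  = c(-26, -5, -12, -25, -21, -11, -24, -24, 7, -3),
  X90S.64S = c(39, 37, 42, 37, 40, 38, 28, 21, 16, 19)
)

# Range treated as the "center" of the color scale.
center_range <- c(-100, 100)

# Share of all zonal values that fall inside the center range.
values <- as.matrix(zones_excerpt)
share_in_center <- mean(values >= center_range[1] & values <= center_range[2])

print(share_in_center)
```

With the full data loaded, replacing zones_excerpt with gistemp[, cols] would give the share for all eight zones and all years.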

Heatmap

Create a heatmap of the temperature differences by year and latitude. Use the finest grained latitude zones.

Presentation of the year as the Y axis is less intuitive, but necessary for formatting the heatmap in a reasonable width.

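It may be better to use a nonlinear color mapping because the temperature values tend to be close to zero (see histogram above). I have yet to find a good built-in way to do this; the scales package may be worth a look. The color.palette function used below is not part of base R or gplots; its call signature matches the helper defined in this post:
http://menugget.blogspot.com/2011/11/define-color-steps-for-colorramppalette.html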
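For readers who do not have that helper available, the idea of a stepped, nonlinear palette can be approximated with base R alone. The sketch below is a stand-in written for this document, not the code from the linked post, and its output may differ slightly from color.palette. The function name stepped_palette and its arguments are hypothetical. It places many interpolated colors between the outer anchors and few between the inner ones, so the color changes fastest near the center of the scale.

```r
# Hypothetical helper: build a color vector with n_between[i] interpolated
# colors between anchors[i] and anchors[i + 1].
stepped_palette <- function(anchors, n_between) {
  stopifnot(length(n_between) == length(anchors) - 1)
  segments <- lapply(seq_along(n_between), function(i) {
    # Two endpoints plus the requested number of intermediate colors
    ramp <- colorRampPalette(anchors[i:(i + 1)], space = "rgb")(n_between[i] + 2)
    # Drop the shared endpoint on all but the last segment to avoid duplicates
    if (i < length(n_between)) ramp[-length(ramp)] else ramp
  })
  unlist(segments)
}

# Same anchors and step counts as the red/blue heatmap in this document
anchors <- colorRampPalette(c("red", "black", "blue"))(n = 9)[c(1, 2, 5, 8, 9)]
pal <- stepped_palette(anchors, c(200, 50, 50, 200))

length(pal)
```

The result is a plain character vector of colors, which heatmap.2 accepts for its col argument.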

Changing the heatmap.2 layout
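As background for the layout values used below: heatmap.2 arranges its panels with base R's layout(), numbering them 1 for the heatmap image, 2 for the row dendrogram, 3 for the column dendrogram, and 4 for the color key, with 0 marking an empty cell. The following sketch, which assumes nothing beyond base graphics, outlines the panels for the lmat, lhei, and lwid values used in this document and prints the share of the figure height each row receives. The names row_share and n_panels are chosen here.

```r
# Layout values used for the heatmaps in this document
lmat <- rbind(c(0, 0), c(4, 3), c(2, 1))
lhei <- c(0.1, 1, 5)
lwid <- c(1.5, 4)

# Share of the total figure height given to each layout row
row_share <- round(lhei / sum(lhei), 3)
names(row_share) <- c("empty row", "key row", "heatmap row")
print(row_share)

# Draw numbered outlines of the panels as base graphics will allocate them
n_panels <- layout(lmat, widths = lwid, heights = lhei)
layout.show(n_panels)
```

Seen this way, the extra row of zeros only reserves a thin strip at the top of the figure, while the heatmap row keeps most of the height.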

require(gplots) # For heatmap.2

cols <- rev(c("X64N.90N", "X44N.64N", "X24N.44N", "EQU.24N", "X24S.EQU", "X44S.24S", 
              "X64S.44S", "X90S.64S"))
hm_data <- gistemp[, cols]

# heatmap(t(as.matrix(hm_data)), # Show year as X and latitude as Y
#         Rowv = NA, Colv = NA, # Suppress dendrograms and reordering
#         main = "Global Mean Temperatures by Latitude Zones",
#         xlab="Year", ylab = "Latitude",
#         labCol = gistemp$Year, labRow = rev(c("64N-90N", "44N-64N", "24N-44N", "EQU-24N",
#                                               "24S-EQU", "44S-24S", "64S-44S", "90S-64S"))
# )

# my_palette <- colorRampPalette(c("red", "yellow", "green"))(n = 299)
# my_palette <- colorRampPalette(c("red", "black", "green"))(n = 299)

# steps <- c("blue4", "cyan", "white", "yellow", "red4")
# Use half of the range to show coarse variation and half to show fine
# steps <- c(colorRampPalette(c("red", "black", "green"))(n = 5))
# my_palette <- color.palette(steps, c(200, 20, 20, 200), space="rgb")
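# Earlier attempt with 7 steps (superseded by the 9-step version below):
# steps <- c(colorRampPalette(c("red", "black", "green"))(n = 7)[c(1,2,4,6,7)])
# my_palette <- color.palette(steps, c(200, 50, 50, 200), space="rgb")
# Red/green version, for comparison with the red/blue submission heatmap
steps <- colorRampPalette(c("red", "black", "green"))(n = 9)[c(1,2,5,8,9)]
my_palette <- color.palette(steps, c(200, 50, 50, 200), space="rgb")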

# # Try using breaks instead
# temp.range <- range(as.matrix(hm_data))
# breaks <- c(seq(temp.range[1], -100, 10), seq(-100, 100, 2), seq(102, temp.range[2]+2, 10))
# my_palette <- colorRampPalette(c("red", "black", "green"))(n = length(breaks)+1)

# Fix title spacing problem (why is this ugly hack needed?!)
lmat <- rbind(c(0,0),c(4,3),c(2,1))
lhei <- c(0.1, 1, 5)
lwid <- c(1.5, 4)

heatmap.2(as.matrix(hm_data), # Show year as Y and latitude as X
        Rowv = NULL, Colv = NULL, dendrogram = "none", # Suppress dendrograms and reordering
        symkey = TRUE, col = my_palette, scale = "none",
        main = "Global Mean Temp (0.01C) by Latitude Zones and Year",
        ylab="Year", xlab = "Latitude Zone",
        #lhei = c(1, 5), # lwid = c(1, 5),
        #lhei = c(2, 10), # lwid = c(1, 5),
        #lhei = c(lcm(5), lcm(30)), # overlaps even with absolute heights
        #lhei = c(1.5, 8), lwid = c(1.5, 5),
        lhei = lhei, lwid = lwid, lmat = lmat,
        margins = c(8,5),
        labRow = gistemp$Year, labCol = rev(c("64N-90N", "44N-64N", "24N-44N", "EQU-24N",
                                              "24S-EQU", "44S-24S", "64S-44S", "90S-64S"))
)

#title("Global Mean Temperatures (0.01C) by Latitude Zones and Year", line=4)

# heatmap.2(as.matrix(hm_data), col=redgreen(75), key=T, keysize=1.5,
#           symm=F,symkey=F,symbreaks=T, scale="none", # Use asymmetric color key
#           density.info="none", trace="none",cexCol=0.9,
#           Colv=hm_colDend,
#           Rowv=hm_rowDend,
#           #labRow=client_data$test_date_text,
#           margins=c(5,8))

Streamgraph

R has a nice streamgraph package. See "Introducing the streamgraph htmlwidget R Package" for an overview, and the more recent blog post "Streamgraphs in R".

First load the streamgraph package, installing it if necessary.

if (!require(streamgraph)) {
  # Note that this required some extra work to install Rcpp first (dependency failed)
  # install.packages("Rcpp")
  require(devtools)
  devtools::install_github("hrbrmstr/streamgraph")
  require(streamgraph)
}

This is the streamgraph example from the link above. I plan to disable it once I have a working GISTEMP version. Open questions:
How do I make the Y axis numbers meaningful?
How do I create titles and axis labels?

More examples at http://rpubs.com/hrbrmstr/streamgraph04

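library(dplyr) # For %>%, select, group_by, tally

# Note: in ggplot2 2.0.0 and later the movies dataset lives in the ggplot2movies package
ggplot2::movies %>%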
  select(year, Action, Animation, Comedy, Drama, Documentary, Romance, Short) %>%
  tidyr::gather(genre, value, -year) %>%
  group_by(year, genre) %>%
  tally(wt=value) %>%
  # streamgraph("genre", "n", "year") %>%
  streamgraph("genre", "n", "year", interactive=TRUE) %>%
  sg_axis_x(20) %>%
  # sg_colors("PuOr") %>% # obsolete, replaced by next line
  sg_fill_brewer("PuOr") %>%
  sg_legend(show=TRUE, label="Genres: ")

  # RES additions (how to do this?)
  # Titles require special handling.
  # See https://github.com/hrbrmstr/metricsgraphics/issues/25
  # ?htmltools::tags
#   ggtitle("Movie Count by Year and Genre") %>%
#   labs(x="Year",y="Number of Movies") 

Note that the interactive feature is not working properly in RStudio, local knitr, or Rpubs. Compare to the blog post above.

Note that it does work in the package author’s Rpubs at http://rpubs.com/hrbrmstr/streamgraph04

Try his version (did not work for me either):

ggplot2::movies %>%
  select(year, Action, Animation, Comedy, Drama, Documentary, Romance, Short) %>%
  tidyr::gather(genre, value, -year) %>%
  group_by(year, genre) %>%
  tally(wt=value) %>%
  ungroup -> dat

streamgraph(dat, "genre", "n", "year", interactive=TRUE) %>%
  sg_axis_x(20, "year", "%Y") %>%
  sg_colors("PuOr")

I’m not seeing a good way to use a Streamgraph with this data.

Color Blindness

I originally chose to use a red/green color map, but changed to red/blue in consideration of colorblindness. Vischeck was helpful for checking my graphic.

Some useful sites:
http://gis.stackexchange.com/questions/2887/how-to-account-for-colour-blindness-when-designing-maps
http://www.vischeck.com/ - simulate colorblindness on image

File originally created: Saturday, August 1, 2015
File knitted: Sun Aug 02 11:40:31 2015

Bibliography