This file implements Programming Assignment 1: Visualize Data Using a Chart for the Coursera Data Visualization Class.
The original data comes from NASA GISTEMP Table Data: Global and Hemispheric Monthly Means and Zonal Annual Means, but the course instructors supplied a modified dataset which I downloaded.
The data is explained more fully in Global surface temperature change (Hansen, Ruedy, Sato, and Lo, 2010).
# setwd("~/Education/Coursera - Data Visualization/Projects/PA1")
gistemp <- read.csv("data/ExcelFormattedGISTEMPData2CSV.csv")
#str(gistemp)
#summary(gistemp)
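The zone column names used throughout this file (for example X64N.90N) are produced by read.csv, which by default passes the header row through make.names. A minimal sketch of that renaming, assuming the raw CSV headers look like the latitude labels used for the heatmap (the headers vector here is an assumption, not copied from the file):

```r
# Assumed raw CSV headers; the real file may differ
headers <- c("Year", "64N-90N", "EQU-24N", "90S-64S")

# make.names() replaces invalid characters with "." and prepends "X"
# to names that start with a digit
fixed <- make.names(headers)
print(fixed)
```

Passing check.names = FALSE to read.csv would keep the original headers, at the cost of needing backticks or [[ ]] to access those columns.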
The data is described as “Annual mean Land-Ocean Temperature Index in .01 degrees Celsius - selected zonal means”, so a value of 100 corresponds to 1 degree Celsius.
Normally I would do further exploratory data analysis here, but that would be confusing since we are only submitting a single visualization.
See More About the Data below.
The assignment should be graded on the contents of this section.
Suggested coverage:
What are your X and Y axes?
Did you use a subset of the data? If so, what was it?
Are there any particular aspects of your visualization to which you would like to bring attention?
What do you think the data, and your visualization, shows?
I decided to visualize this dataset using a heatmap of the zonal temperatures by latitude. The dataset contains several granularities of latitude: global, by hemisphere, three broad bands (90S-24S, 24S-24N, 24N-90N), and a further subdivision into eight zones. I chose to use the finest-grained data.
The X axis is the latitude zone while the Y axis is the year. Having latitude as the X axis is somewhat nonintuitive, but was necessary to present the visualization in a portrait format for inclusion in this document.
I chose a red/blue colormap in order to be colorblind friendly (I think the red/green version works better for me; see below to compare). The colormap is nonlinear to better show variation in the center range (-100 to 100), which contains most of the data (see histogram below).
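A nonlinear palette of this kind can be sketched in base R. The helper name stepped_palette is hypothetical and is not the color.palette function called in this file; it only illustrates the idea of placing an uneven number of interpolated colors between anchor colors, so that most of the visible color change is concentrated in the middle of the range.

```r
# Hypothetical helper: interpolate n_between[i] colors from colors[i]
# up to (but not including) colors[i + 1], then append the final color.
stepped_palette <- function(colors, n_between) {
  stopifnot(length(n_between) == length(colors) - 1)
  segments <- lapply(seq_along(n_between), function(i) {
    ramp <- colorRampPalette(colors[c(i, i + 1)])(n_between[i] + 1)
    ramp[-length(ramp)]  # drop the end color; the next segment starts with it
  })
  last <- rgb(t(col2rgb(colors[length(colors)])), maxColorValue = 255)
  c(unlist(segments), last)
}

# Anchor colors: red, slightly darker red, black, slightly darker blue, blue
steps <- colorRampPalette(c("red", "black", "blue"))(9)[c(1, 2, 5, 8, 9)]

# The outer 200-step segments barely change color; the two 50-step
# segments in the middle carry most of the red-black-blue transition
pal <- stepped_palette(steps, c(200, 50, 50, 200))
length(pal)
```

Because palette entries are spread evenly over the data range, the short central segments give values near zero a much steeper color gradient than values in the tails.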
This visualization shows how non-uniform (by latitude) the temperature change has been over the last 130 years. We can clearly see the greatest increase over time has been near the North Pole (64N-90N) while the South Pole region shows significant random variation (e.g. see 1908).
The overall warming trend is also clearly visible (compare the top and bottom of the plot), but I believe this is better shown by the basic line plot of the global mean below.
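One rough way to put numbers on the zonal differences described above is a least-squares slope per zone. This is only a sketch under stated assumptions: zone_trends is a hypothetical helper, and the demo data frame holds just the first ten values per column as printed in the str() output later in this file, so its result says nothing about the long-term trend.

```r
# Hypothetical helper: linear trend per zone, in data units per decade
# (the data are in 0.01 degrees Celsius)
zone_trends <- function(df, zone_cols, year_col = "Year") {
  sapply(zone_cols, function(z) {
    fit <- lm(df[[z]] ~ df[[year_col]])
    unname(coef(fit)[2]) * 10  # slope per year, scaled to per decade
  })
}

# First ten rows of two zones, copied from the str() output in this file
demo <- data.frame(
  Year     = 1880:1889,
  X64N.90N = c(-89, -54, -125, -28, -127, -119, -124, -158, -141, -82),
  X90S.64S = c(39, 37, 42, 37, 40, 38, 28, 21, 16, 19)
)

trends <- zone_trends(demo, c("X64N.90N", "X90S.64S"))
print(trends)
```

With the full dataset loaded, the same call would be zone_trends(gistemp, cols), where cols is the vector of zone column names used for the heatmap.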
cols <- rev(c("X64N.90N", "X44N.64N", "X24N.44N", "EQU.24N", "X24S.EQU", "X44S.24S",
"X64S.44S", "X90S.64S"))
hm_data <- gistemp[, cols]
# color.palette() is not part of base R; it is the helper from the menugget blog post
# linked below and must be defined (or sourced) before this chunk runs.
# Use a nonlinear color mapping so we can better see variation in the center of the range
# steps <- c(colorRampPalette(c("red", "black", "green"))(n = 9)[c(1,2,5,8,9)])
steps <- c(colorRampPalette(c("red", "black", "blue"))(n = 9)[c(1,2,5,8,9)])
my_palette <- color.palette(steps, c(200, 50, 50, 200), space="rgb")
require(gplots) # For heatmap.2
# Fix title spacing problem (why is this ugly hack needed?!)
lmat <- rbind(c(0,0),c(4,3),c(2,1))
lhei <- c(0.1, 1, 5)
lwid <- c(1.5, 4)
heatmap.2(as.matrix(hm_data), # Show year as Y and latitude as X
Rowv = NULL, Colv = NULL, dendrogram = "none", # Suppress dendrograms and reordering
symkey = TRUE, col = my_palette, scale = "none",
main = "Global Mean Temp (0.01C) by Latitude Zones and Year",
ylab="Year", xlab = "Latitude Zone",
lhei = lhei, lwid = lwid, lmat = lmat,
margins = c(8,5),
labRow = gistemp$Year, labCol = rev(c("64N-90N", "44N-64N", "24N-44N", "EQU-24N",
"24S-EQU", "44S-24S", "64S-44S", "90S-64S"))
)
str(gistemp)
## 'data.frame': 135 obs. of 15 variables:
## $ Year : int 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 ...
## $ Glob : int -19 -10 -9 -19 -27 -31 -30 -33 -20 -11 ...
## $ NHem : int -33 -18 -17 -30 -42 -41 -39 -37 -22 -16 ...
## $ SHem : int -5 -2 -1 -8 -12 -21 -21 -28 -17 -6 ...
## $ X24N.90N: int -38 -27 -21 -34 -56 -61 -49 -46 -42 -25 ...
## $ X24S.24N: int -16 -2 -10 -22 -17 -17 -24 -27 7 4 ...
## $ X90S.24S: int -5 -5 4 -2 -11 -20 -20 -26 -33 -17 ...
## $ X64N.90N: int -89 -54 -125 -28 -127 -119 -124 -158 -141 -82 ...
## $ X44N.64N: int -54 -40 -20 -57 -58 -70 -43 -52 -43 -13 ...
## $ X24N.44N: int -22 -14 -3 -20 -41 -43 -38 -21 -22 -21 ...
## $ EQU.24N : int -26 -5 -12 -25 -21 -11 -24 -24 7 -3 ...
## $ X24S.EQU: int -5 2 -8 -19 -14 -23 -24 -31 8 11 ...
## $ X44S.24S: int -2 -6 3 -1 -15 -27 -18 -24 -30 -16 ...
## $ X64S.44S: int -8 -3 8 0 -5 -7 -21 -29 -38 -17 ...
## $ X90S.64S: int 39 37 42 37 40 38 28 21 16 19 ...
summary(gistemp)
## Year Glob NHem SHem
## Min. :1880 Min. :-47.00 Min. :-52.000 Min. :-47.00000
## 1st Qu.:1914 1st Qu.:-20.00 1st Qu.:-21.500 1st Qu.:-22.50000
## Median :1947 Median : -8.00 Median : -2.000 Median : -9.00000
## Mean :1947 Mean : 1.63 Mean : 3.326 Mean : -0.07407
## 3rd Qu.:1980 3rd Qu.: 17.50 3rd Qu.: 16.000 3rd Qu.: 25.00000
## Max. :2014 Max. : 75.00 Max. : 91.000 Max. : 59.00000
## X24N.90N X24S.24N X90S.24S X64N.90N
## Min. :-61.000 Min. :-61.000 Min. :-48.000 Min. :-158.000
## 1st Qu.:-26.000 1st Qu.:-22.000 1st Qu.:-26.000 1st Qu.: -47.500
## Median : 2.000 Median : -3.000 Median :-11.000 Median : 3.000
## Mean : 5.415 Mean : 1.926 Mean : -2.704 Mean : 9.022
## 3rd Qu.: 21.000 3rd Qu.: 23.000 3rd Qu.: 21.000 3rd Qu.: 58.000
## Max. :110.000 Max. : 72.000 Max. : 58.000 Max. : 211.000
## X44N.64N X24N.44N EQU.24N
## Min. :-70.000 Min. :-57.0000 Min. :-70.00000
## 1st Qu.:-27.000 1st Qu.:-19.5000 1st Qu.:-23.50000
## Median : 0.000 Median : -8.0000 Median : -3.00000
## Mean : 9.163 Mean : 0.7111 Mean : 0.08148
## 3rd Qu.: 34.500 3rd Qu.: 13.5000 3rd Qu.: 19.50000
## Max. :129.000 Max. : 77.0000 Max. : 72.00000
## X24S.EQU X44S.24S X64S.44S X90S.64S
## Min. :-55.000 Min. :-43.0000 Min. :-62.000 Min. :-237.000
## 1st Qu.:-22.000 1st Qu.:-23.0000 1st Qu.:-27.500 1st Qu.: -41.000
## Median : -3.000 Median : -9.0000 Median : -9.000 Median : 5.000
## Mean : 3.748 Mean : 0.7926 Mean : -7.593 Mean : -5.119
## 3rd Qu.: 29.500 3rd Qu.: 22.0000 3rd Qu.: 16.000 3rd Qu.: 37.500
## Max. : 81.000 Max. : 76.0000 Max. : 38.000 Max. : 136.000
A simple and clear presentation of the global change.
But notice how this obscures the zonal variation highlighted by the heatmap!
require(zoo)     # For zoo() and rollmean()
require(ggplot2) # For ggplot()
# Make zoo object of data
Glob.zoo <- zoo(gistemp$Glob, gistemp$Year)
# Calculate moving average with window 3 and make first and last value as NA (to ensure identical length of vectors)
m.av <- rollmean(Glob.zoo, 3,fill = list(NA, NULL, NA))
# Add calculated moving averages to existing data frame
gistemp$Glob.av <- coredata(m.av)
# Add additional line for moving average in red
ggplot(gistemp, aes(Year, Glob)) + geom_line() +
geom_line(aes(Year,Glob.av), color="red") +
#scale_x_datetime(breaks = date_breaks("5 min"),labels=date_format("%H:%M")) +
xlab("Year") + ylab("Temperature (0.01 C)")+
ggtitle("Global Annual Mean Temperature Difference and 3 Year MA")
## Warning: Removed 2 rows containing missing values (geom_path).
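The centered three-year moving average computed above with rollmean can also be reproduced in base R, which makes the source of the two missing values in the warning explicit. This is a sketch using only the first ten Glob values shown in the str() output of this file; the variable names glob and ma3 are illustrative.

```r
# First ten global means (0.01 C), copied from the str() output
glob <- c(-19, -10, -9, -19, -27, -31, -30, -33, -20, -11)

# Centered moving average: each value is the mean of itself and its
# two neighbours; sides = 2 centers the window on the current point
ma3 <- as.numeric(stats::filter(glob, rep(1 / 3, 3), sides = 2))

# The first and last positions have no complete window, so they are NA
print(ma3)
```

Those two NA endpoints are the rows that geom_line drops when it reports removing 2 rows with missing values.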
This histogram is primarily used to help determine the color mapping for the heatmap below.
cols <- rev(c("X64N.90N", "X44N.64N", "X24N.44N", "EQU.24N", "X24S.EQU", "X44S.24S",
"X64S.44S", "X90S.64S"))
hist(as.matrix(gistemp[, cols]), breaks=100, xlab="Temperature (0.01C)",
main="Histogram of Zonal Temperature Values")
Create a heatmap of the temperature differences by year and latitude. Use the finest grained latitude zones.
Presentation of the year as the Y axis is less intuitive, but necessary for formatting the heatmap in a reasonable width.
It may be better to use a nonlinear color mapping because the temperature values tend to be close to zero (see histogram above). I have yet to find a good way of doing this; the scales package may help.
See http://menugget.blogspot.com/2011/11/define-color-steps-for-colorramppalette.html
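One alternative way to get a nonlinear mapping is to keep a plain linear palette and instead make the break points uneven: fine bins near zero, coarse bins in the tails. The sketch below builds such breaks in base R. The limit of 240 is an assumption chosen to cover the zonal range reported in the summary output (-237 to 211), and the bin widths of 2 and 10 are illustrative.

```r
lim <- 240  # assumed symmetric limit covering the observed range

# Coarse 10-unit bins in the tails, fine 2-unit bins from -100 to 100
breaks <- c(seq(-lim, -100, by = 10),
            seq(-98, 98, by = 2),
            seq(100, lim, by = 10))

# One color per bin, so one fewer color than break points
pal <- colorRampPalette(c("red", "black", "blue"))(length(breaks) - 1)

c(n_breaks = length(breaks), n_colors = length(pal))
```

These could then be passed to heatmap.2 as breaks = breaks and col = pal. Because the breaks are symmetric around zero, the black midpoint of the palette lines up with a zero anomaly.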
require(gplots) # For heatmap.2
cols <- rev(c("X64N.90N", "X44N.64N", "X24N.44N", "EQU.24N", "X24S.EQU", "X44S.24S",
"X64S.44S", "X90S.64S"))
hm_data <- gistemp[, cols]
# heatmap(t(as.matrix(hm_data)), # Show year as X and latitude as Y
# Rowv = NA, Colv = NA, # Suppress dendrograms and reordering
# main = "Global Mean Temperatures by Latitude Zones",
# xlab="Year", ylab = "Latitude",
# labCol = gistemp$Year, labRow = rev(c("64N-90N", "44N-64N", "24N-44N", "EQU-24N",
# "24S-EQU", "44S-24S", "64S-44S", "90S-64S"))
# )
# my_palette <- colorRampPalette(c("red", "yellow", "green"))(n = 299)
# my_palette <- colorRampPalette(c("red", "black", "green"))(n = 299)
# steps <- c("blue4", "cyan", "white", "yellow", "red4")
# Use half of the range to show coarse variation and half to show fine
# steps <- c(colorRampPalette(c("red", "black", "green"))(n = 5))
# my_palette <- color.palette(steps, c(200, 20, 20, 200), space="rgb")
# steps <- c(colorRampPalette(c("red", "black", "green"))(n = 7)[c(1,2,4,6,7)])
# color.palette() comes from the blog post linked above; it is not part of base R
steps <- c(colorRampPalette(c("red", "black", "green"))(n = 9)[c(1,2,5,8,9)])
my_palette <- color.palette(steps, c(200, 50, 50, 200), space="rgb")
# # Try using breaks instead
# temp.range <- range(as.matrix(hm_data))
# breaks <- c(seq(temp.range[1], -100, 10), seq(-100, 100, 2), seq(102, temp.range[2]+2, 10))
# my_palette <- colorRampPalette(c("red", "black", "green"))(n = length(breaks)-1) # one fewer color than breaks
# Fix title spacing problem (why is this ugly hack needed?!)
lmat <- rbind(c(0,0),c(4,3),c(2,1))
lhei <- c(0.1, 1, 5)
lwid <- c(1.5, 4)
heatmap.2(as.matrix(hm_data), # Show year as Y and latitude as X
Rowv = NULL, Colv = NULL, dendrogram = "none", # Suppress dendrograms and reordering
symkey = TRUE, col = my_palette, scale = "none",
main = "Global Mean Temp (0.01C) by Latitude Zones and Year",
ylab="Year", xlab = "Latitude Zone",
#lhei = c(1, 5), # lwid = c(1, 5),
#lhei = c(2, 10), # lwid = c(1, 5),
#lhei = c(lcm(5), lcm(30)), # overlaps even with absolute heights
#lhei = c(1.5, 8), lwid = c(1.5, 5),
lhei = lhei, lwid = lwid, lmat = lmat,
margins = c(8,5),
labRow = gistemp$Year, labCol = rev(c("64N-90N", "44N-64N", "24N-44N", "EQU-24N",
"24S-EQU", "44S-24S", "64S-44S", "90S-64S"))
)
#title("Global Mean Temperatures (0.01C) by Latitude Zones and Year", line=4)
# heatmap.2(as.matrix(hm_data), col=redgreen(75), key=T, keysize=1.5,
# symm=F,symkey=F,symbreaks=T, scale="none", # Use asymmetric color key
# density.info="none", trace="none",cexCol=0.9,
# Colv=hm_colDend,
# Rowv=hm_rowDend,
# #labRow=client_data$test_date_text,
# margins=c(5,8))
R has a nice streamgraph package. See "Introducing the streamgraph htmlwidget R Package" for an overview, and the more recent blog post "Streamgraphs in R".
First load the streamgraph package, installing it if necessary.
if (!require(streamgraph)) {
# Note that this required some extra work to install Rcpp first (dependency failed)
# install.packages("Rcpp")
require(devtools)
devtools::install_github("hrbrmstr/streamgraph")
require(streamgraph)
}
Streamgraph example from the link above. To be disabled once I have a working GISTEMP version.
How do I make the Y axis numbers meaningful?
How do I create titles and axis labels?
More examples at http://rpubs.com/hrbrmstr/streamgraph04
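require(dplyr) # For %>%, select, group_by, tally
# Note: the movies dataset moved to the ggplot2movies package as of ggplot2 2.0.0
ggplot2::movies %>%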
select(year, Action, Animation, Comedy, Drama, Documentary, Romance, Short) %>%
tidyr::gather(genre, value, -year) %>%
group_by(year, genre) %>%
tally(wt=value) %>%
# streamgraph("genre", "n", "year") %>%
streamgraph("genre", "n", "year", interactive=TRUE) %>%
sg_axis_x(20) %>%
# sg_colors("PuOr") %>% # obsolete, replaced by next line
sg_fill_brewer("PuOr") %>%
sg_legend(show=TRUE, label="Genres: ")
# RES additions (how to do this?)
# Titles require special handling.
# See https://github.com/hrbrmstr/metricsgraphics/issues/25
# ?htmltools::tags
# ggtitle("Movie Count by Year and Genre") %>%
# labs(x="Year",y="Number of Movies")
Note that the interactive feature is not working properly in RStudio, local knitr, or RPubs. Compare to the blog post above.
It does work in the package author’s RPubs document at http://rpubs.com/hrbrmstr/streamgraph04
Try his version (did not work for me either):
ggplot2::movies %>%
select(year, Action, Animation, Comedy, Drama, Documentary, Romance, Short) %>%
tidyr::gather(genre, value, -year) %>%
group_by(year, genre) %>%
tally(wt=value) %>%
ungroup -> dat
streamgraph(dat, "genre", "n", "year", interactive=TRUE) %>%
sg_axis_x(20, "year", "%Y") %>%
sg_colors("PuOr")
I’m not seeing a good way to use a Streamgraph with this data.
I originally chose to use a red/green color map, but changed to red/blue in consideration of colorblindness. Vischeck was helpful for checking my graphic.
Some useful sites:
http://gis.stackexchange.com/questions/2887/how-to-account-for-colour-blindness-when-designing-maps
http://www.vischeck.com/ - simulate colorblindness on image
File originally created: Saturday, August 1, 2015
File knitted: Sun Aug 02 11:40:31 2015