We will learn the details of building an excellent visualization by recreating a classic. One of the best examples of data visualization is the following plot of weather data, which appeared in the New York Times on January 4th, 2004:
This image is from Edward Tufte’s blog post, which also contains a couple of close-ups:
We are going to create a similar plot in R using ggplot2, but we will be using our own data. In particular, we will be using the weather data from the Minneapolis-St. Paul airport for 2012, which is available from wunderground.com. The data is spread across hundreds of web pages, but we will see later how to obtain the data by writing a Python script. For now, you will have access to an RData file that contains the relevant data in a data frame.
Here is an example of one of the pages that the data comes from.
Your overall project is to use these data to create a plot that is substantially similar to the top panel of the New York Times plot. Creating the plot is technically challenging in itself, but it is equally important to understand the content of the data so that you can interpret and present the information in the best way possible. Minnesota weather in 2012 had many interesting and record-setting events, and the purpose of this project is to present all of these data so that the viewer can see the significance of the temperature trends at a glance.
Read in the data frame and take a glimpse at its structure:
load(url('http://www.stolaf.edu/people/olaf/cs125/MSP2012.RData'))
head(MSP2012)
## month day observedLo observedHi normalLo normalHi recordLo yearLo recordHi yearHi precip snow
## 1 1 1 19 34 8 24 -30 1974 48 1897 T T
## 2 1 2 11 20 8 24 -36 1885 45 1998 T T
## 3 1 3 10 29 8 24 -26 1887 46 1880 0.00 0.00
## 4 1 4 23 37 8 24 -32 1884 41 1898 0.00 0.00
## 5 1 5 24 45 8 24 -28 1924 47 1885 0.00 0.00
## 6 1 6 37 46 8 23 -27 1912 49 1900 0.00 0.00
#These color codes will be useful for building your graphics:
colors<-c(background="#e4dcd5", record="#cbc3aa", normal="#9f9786", daily="#763f59", axis="#a59d8b")
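As a small illustration of how this named vector can be used, the sketch below looks entries up by name rather than retyping the hex codes. It redefines `colors` so that it runs on its own; `rectFill` is a hypothetical variable name, not part of the assignment.

```r
# Same palette as above, repeated so this snippet is self-contained
colors <- c(background = "#e4dcd5", record = "#cbc3aa", normal = "#9f9786",
            daily = "#763f59", axis = "#a59d8b")

# [[ ]] returns the bare hex string, without the name attached
rectFill <- colors[["record"]]
rectFill

# Several entries at once; unname() drops the names
unname(colors[c("record", "normal", "daily")])
```

A value such as `colors[["record"]]` can then be passed anywhere a fill or color is expected.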
Now add a column that gives the day of the year (1 2 … 366). Then using geom_rect, plot the daily record temps, the normal high and low, and the actual daily high and low.
# your code here!
library(ggplot2)
# One row per day, so the row index is the day of the year (1 ... 366)
MSP2012$Days <- seq_len(nrow(MSP2012))
# Each day is a rectangle one unit wide; layers are drawn back to front:
# record range, then normal range, then the observed range on top
mspplot <- ggplot(MSP2012) +
  geom_rect(aes(xmin = Days, xmax = Days + 1, ymin = recordLo, ymax = recordHi), fill = "#cbc3aa") +
  geom_rect(aes(xmin = Days, xmax = Days + 1, ymin = normalLo, ymax = normalHi), fill = "#9f9786") +
  # fill (not color) so the bar itself takes the daily color
  geom_rect(aes(xmin = Days, xmax = Days + 1, ymin = observedLo, ymax = observedHi), fill = "#763f59") +
  coord_cartesian(xlim = c(0, 366)) +
  theme(panel.background = element_rect(fill = "#e4dcd5"), axis.line = element_blank(),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.text = element_blank(), axis.ticks = element_blank())
mspplot
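Numbering the rows works because the data frame has exactly one row per day, in order. As a hedged alternative, the day of the year can also be derived from the month and day columns themselves. The sketch below uses a made-up four-row data frame `toy` (not the real MSP2012 data) around the 2012 leap day.

```r
# Toy stand-in for the month and day columns (values chosen for illustration)
toy <- data.frame(month = c(2, 2, 3, 12), day = c(28, 29, 1, 31))

# Build proper dates for 2012, then ask for the day of the year ("%j")
dates <- as.Date(sprintf("%d-%02d-%02d", 2012, toy$month, toy$day))
toy$dayOfYear <- as.integer(format(dates, "%j"))
toy$dayOfYear
```

This version does not depend on row order, which may be helpful if the data are ever sorted or filtered before plotting.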
Turn off the default axes and grid lines. One way to do this is to modify the axis.line, panel.grid.major, and panel.grid.minor elements of theme().
Create a vector that has the temperature divisions you want to see (e.g. -30 -20 … 110).
Create a vector that gives the cumulative number of days at the end of each month (0 31 60 … 366). Note: please calculate these values using the data you have in the data frame!
Using geom_hline and geom_vline, create axes and gridlines.
Create a vector of labels for the x axis, and another vector of labels for the y axis.
Using geom_text, add the labels in the appropriate places. (It is possible to get lots of details right. For example, to get degree symbols, it may be helpful to learn about what parse=TRUE does with “60*degree”.)
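To make the parse=TRUE hint concrete, here is a minimal sketch. With parse = TRUE, each label string is interpreted as an R plotmath expression, so `60*degree` is drawn as 60 followed by a degree symbol. The names `temps`, `degreeLabels`, and `p` are hypothetical, and the x position is arbitrary.

```r
library(ggplot2)

# Temperature divisions and matching plotmath strings such as "60*degree"
temps <- seq(-40, 110, by = 10)
degreeLabels <- paste0(temps, "*degree")

# parse = TRUE makes the text layer treat each label as an expression
p <- ggplot() +
  annotate("text", x = 0, y = temps, label = degreeLabels, parse = TRUE, size = 3)
p
```

Typing the ° character directly into the label strings also works on most systems; the plotmath route simply avoids depending on the file encoding.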
# your code here!
# Temperature gridlines every 10 degrees, with matching labels
ynum <- seq(from = -40, to = 110, by = 10)
ylab <- paste0(ynum, "°")
# Month boundaries computed from the data. Day d is drawn from x = d to x = d + 1,
# so each month ends where the next month's first day begins; the year ends at 367.
xnum <- c(which(MSP2012$day == 1)[-1], nrow(MSP2012) + 1)
xlab <- month.name
mspplot <- ggplot(MSP2012) +
  geom_rect(aes(xmin = Days, xmax = Days + 1, ymin = recordLo, ymax = recordHi), fill = "#cbc3aa") +
  geom_rect(aes(xmin = Days, xmax = Days + 1, ymin = normalLo, ymax = normalHi), fill = "#9f9786") +
  geom_rect(aes(xmin = Days, xmax = Days + 1, ymin = observedLo, ymax = observedHi), fill = "#763f59") +
  # Solid lines at the start and end of the year, dotted lines between months
  annotate("segment", x = 1, xend = 1, y = -40, yend = 110, color = "#9f9786", size = 1) +
  annotate("segment", x = xnum, xend = xnum, y = -40, yend = 110, linetype = "dotted") +
  annotate("segment", x = 367, xend = 367, y = -40, yend = 110, color = "#9f9786", size = 1) +
  # Horizontal gridlines in the background color, drawn over the bars
  geom_hline(yintercept = ynum, color = "#e4dcd5") +
  # Month names above the plot, temperature labels on both sides
  annotate("text", x = xnum - 15, y = 120, label = xlab, size = 2) +
  annotate("text", x = -10, y = ynum, label = ylab, size = 2) +
  annotate("text", x = 376, y = ynum, label = ylab, size = 2) +
  theme(panel.background = element_rect(fill = "#e4dcd5"), axis.line = element_blank(),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.text = element_blank(), axis.ticks = element_blank(), axis.title = element_blank()) +
  # A plot can have only one coordinate system, so only coord_fixed is used here
  coord_fixed(ratio = .6)
mspplot
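The cumulative month-end vector (0 31 60 … 366) asked for above can also be computed by counting rows per month and taking a running sum. This is a sketch under the assumption that the data frame has one row per day and a numeric month column; `toy`, `daysPerMonth`, and `monthEnds` are hypothetical names, and `toy` is generated here rather than loaded from the RData file.

```r
# Toy stand-in: one row per day of 2012, with only a month column
dates <- seq(as.Date("2012-01-01"), as.Date("2012-12-31"), by = "day")
toy <- data.frame(month = as.integer(format(dates, "%m")))

# Count the rows in each month, then accumulate
daysPerMonth <- as.vector(table(toy$month))
monthEnds <- c(0, cumsum(daysPerMonth))
monthEnds
```

If each day's rectangle runs from Days to Days + 1, as in the code above, adding 1 to these values gives the x positions where the month boundaries fall.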
Read the help page for annotate.
Build the legend using annotate("segment", …) and annotate("text", …). Here are some useful coordinates you might want to use. For the segments, you will want the normal range color to extend for 10 units up from (198, -25), the actual temperature color from (198, -22) up 11 units, and the record temperature color from (198, -32) up 24 units. For the labels, “Normal Range” is at (180, -20), “Record High” is at (180, -8) with “Record Low” 24 units below it, and “Actual High” is at (214, -11) with “Actual Low” 10 units below it.
Add an overall title and other text that are not tied to the data. It might be helpful to place the title at (95, 125).
# your code here!
# Average temperature for the year, quoted in the explanatory note below
(mean(MSP2012$observedLo)+mean(MSP2012$observedHi))/2
## [1] 50.85109
mspplot3 <- mspplot +
  # Blank out the bars behind the legend
  annotate("rect", xmin = 165, xmax = 225, ymin = -40, ymax = 35, fill = "#e4dcd5") +
  # Legend bars: record range, normal range, actual range
  annotate("segment", x = 198, xend = 198, y = -32, yend = -8, color = "#cbc3aa", size = 2) +
  annotate("segment", x = 198, xend = 198, y = -25, yend = -15, color = "#9f9786", size = 1.5) +
  annotate("segment", x = 198, xend = 198, y = -22, yend = -11, color = "#763f59", size = 1) +
  # Legend labels
  annotate("text", x = 180, y = -20, label = "Normal Range", size = 2) +
  annotate("text", x = 180, y = -8, label = "Record High", size = 2) +
  annotate("text", x = 180, y = -32, label = "Record Low", size = 2) +
  annotate("text", x = 214, y = -11, label = "Actual High", size = 2) +
  annotate("text", x = 214, y = -21, label = "Actual Low", size = 2) +
  # Title and explanatory note, not tied to the data
  annotate("text", x = 95, y = 140, label = "Minneapolis-St. Paul's Weather in 2012") +
  annotate("rect", xmin = 10, xmax = 55, ymin = 60, ymax = 110, fill = "#e4dcd5") +
  annotate("text", x = 20, y = 108, label = "Temperature", size = 2) +
  annotate("text", x = 10, y = 90, label = "Bars represent range between the\ndaily high and low. Average temperature\nfor the year was 50.9°", size = 1, hjust = 0)
mspplot3
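The legend coordinates are easy to get wrong by a few units, so it may help to let R do the arithmetic. The sketch below works out the segment endpoints and label positions from the "start point plus offset" description given earlier. The data frame names `legendBars` and `legendLabels` are hypothetical.

```r
# Each bar starts at (198, yStart) and extends `height` units upward
legendBars <- data.frame(
  what   = c("record", "normal", "actual"),
  yStart = c(-32, -25, -22),
  height = c(24, 10, 11)
)
legendBars$yEnd <- legendBars$yStart + legendBars$height

# "Record Low" sits 24 units below "Record High",
# "Actual Low" sits 10 units below "Actual High"
legendLabels <- data.frame(
  label = c("Normal Range", "Record High", "Record Low", "Actual High", "Actual Low"),
  x     = c(180, 180, 180, 214, 214),
  y     = c(-20, -8, -8 - 24, -11, -11 - 10)
)
legendBars
legendLabels
```

Columns such as `legendBars$yStart` and `legendBars$yEnd` can then be passed as vectors to a single annotate("segment", …) call, since annotate accepts vectors of positions.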
Create a data frame that contains just the rows where the record low temperature was attained (and similarly, a data frame for highs).
For each month, if one or more record high or record low temperatures were attained, use geom_segment and geom_text to add small comments to the highest record high and the lowest record low.
# your code here!
# Rows where the observed temperature matched the record high (and, similarly, the record low)
recordHighs <- MSP2012[which(MSP2012$observedHi == MSP2012$recordHi), ]
recordLows <- MSP2012[which(MSP2012$observedLo == MSP2012$recordLo), ]
# Day numbers of the record highs picked for annotation (hard-coded here)
recDays <- c(10, 77, 139, 186, 315)
recTemps <- MSP2012$observedHi[recDays]
mspplot4 <- mspplot3 +
  # Short vertical marker above each annotated record
  annotate("segment", x = recDays, xend = recDays, y = recTemps, yend = recTemps + 10) +
  # One comment per annotated record, placed at the top of its marker
  annotate("text", x = recDays[1], y = recTemps[1] + 10, label = "Record High: 52°", size = 1.5, hjust = -.05) +
  annotate("text", x = recDays[2], y = recTemps[2] + 10, label = "Record High: 80°", size = 1.5, hjust = -.05) +
  annotate("text", x = recDays[3], y = recTemps[3] + 10, label = "Record High: 93°", size = 1.5, hjust = -.1) +
  annotate("text", x = recDays[4], y = recTemps[4] + 10, label = "Record High: 101°", size = 1.5, hjust = -.1) +
  annotate("text", x = recDays[5], y = recTemps[5] + 10, label = "Record High: 69°", size = 1.5, hjust = -.1)
mspplot4
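Rather than reading the day numbers off by hand, the highest record high in each month can be selected programmatically. The following is a minimal sketch using base R on a made-up data frame `toy` whose column names mirror MSP2012; the rows are invented for illustration and are not the real 2012 observations. `recordHighs`, `byMonth`, and `topHighs` are hypothetical names.

```r
# Toy rows with the same column names as MSP2012 (values invented)
toy <- data.frame(
  month      = c(1, 1, 1, 3, 3, 7),
  Days       = c(6, 10, 11, 77, 78, 186),
  observedHi = c(46, 52, 50, 80, 79, 101),
  recordHi   = c(49, 52, 50, 80, 83, 101)
)

# Keep only the days on which the record high was attained
recordHighs <- toy[toy$observedHi == toy$recordHi, ]

# Within each month, keep the single hottest record day
byMonth  <- split(recordHighs, recordHighs$month)
topHighs <- do.call(rbind, lapply(byMonth, function(d) d[which.max(d$observedHi), ]))

# Build the comment text from the data instead of typing it
topHighs$label <- paste0("Record High: ", topHighs$observedHi, "°")
topHighs
```

The same pattern with which.min and the observedLo and recordLo columns would give the lowest record low per month. The resulting data frame can be handed to geom_segment and geom_text through their data argument.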
This part is truly optional! (So do Part Two first…)
Build the lower part of the graph, showing precipitation.
Look up on the web how to combine your two plots into a single visualization.
# your code here!
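One detail to handle before plotting precipitation: the precip column shown earlier contains the value T (a trace amount) alongside numbers, so it cannot be numeric as loaded. The sketch below treats a trace as 0, which is an assumption rather than a rule, and computes a running total within each month as one possible quantity to plot. The data frame `toy` is made up, and `precipNum` and `monthTotal` are hypothetical column names. If the real column is a factor, wrap it in as.character() first.

```r
# Toy stand-in for the month and precip columns (values invented)
toy <- data.frame(
  month  = c(1, 1, 1, 2, 2),
  precip = c("T", "0.00", "0.12", "T", "0.30"),
  stringsAsFactors = FALSE
)

# Assumption: count a trace ("T") as zero precipitation
cleaned <- toy$precip
cleaned[cleaned == "T"] <- "0"
toy$precipNum <- as.numeric(cleaned)

# Running total that restarts at the beginning of each month
toy$monthTotal <- ave(toy$precipNum, toy$month, FUN = cumsum)
toy
```

For stacking the temperature and precipitation plots, gridExtra::grid.arrange() and the patchwork package are two commonly used options.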