Statistics 4868/6610 Data Visualization
Prof. Eric A. Suess
2/24/2016
Today we will discuss the end of Chapter 6, Comparison.
python for data manipulation.
We will discuss the Histogram Matrix and small multiples.
Forecasting Time Series
Hiereachical and Grouped Time Series
parallel processing
Start Chapter 8
For those of you that are interested in learning python for data manipulation and analysis, there are lots of good resources.
I would first suggest the video series on python from TheNewBoston.
Or you could start with the interactive course from codecademy.
I would second suggest the book by Norm Matloff, UC Davis, Fast Lane to Python.
And I would third suggest the book Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython by Wes McKinney.
A nice install of python is the Enthought Canopy Express.
The author presents a nice python program to wrangle the transform-birth-rate data file. It takes rows, skipping the first column, stacking the data into two columns, the year and the rate.
import csv
reader = csv.reader(open('birth-rate.csv',
'r'), delimiter=",")
writer = csv.writer(open('birth-rate-
yearly5.csv', 'w'), delimiter=",")
rows_so_far = 0
data = ["year", "rate"]
writer.writerow(data)
for row in reader:
if rows_so_far == 0:
header = row
rows_so_far += 1
else:
for i in range(len(row)):
if i > 0 and row[i]:
data = [header[i], row[i]]
writer.writerow(data)
rows_so_far += 1
To run the code at a command line.
$ python transform-birth-rate.py >
birth-rate-yearly.csv
Or try the dplyr package in R written by Hadley Wickham.
Or try the newer tidyr package in R written by Hadley Wickham.
My second favorite plot is the Histogram Matrix using the lattice library in R.
There are three examples in the book.
According to the author,…
The technique of putting a bunch of small graphs together in a single graphic is commonly referred to as small multiples.
It encourages readers to make comparisons across groups and categories, and within them.
Plus, you can fit a lot in one space if your graphic is organized.
While we are discussing Correlation and Regresson and Comparison, we can discuss Forecasting and Hierarchical Time Series.
And add to it parallel processing, which is available in R as part of base R through the parallel package that is now part of base R. Many libraries have function that are giving parallel options in functions now.
There is an excellent OTexts Forecasting book available online, Forecasting: principles and practice.
library(fpp)
Last time we discuss Linear Regression and here is some discussion about the use of Dummy Variables in a time series setting.
We have already discussed Time Series Decomposition, you might check out X-12-ARIMA and STL.
Other topics to consider that can be useful when trying to visualize time series data.
Chapter 9 in fpp describes Hierarchical or grouped time series. These ideas are very similar to the ideas presented in our book for Comparison.
But since this is for time series data, forecasting can be used to project into the future.
Rob J. Hyndman, one of the authors of fpp, recently gave a talk about Visualization and forecasting of big time series data
He maintains the Time Series Data Library now part of DataMarket.
He has worked on the R libraries
library(forecast)
library(hts)
Read Example 9.7 in fpp and try the R code for the Australian tourism hierachy.
library(fpp)
library(hts)
y <- hts(vn, nodes=list(4,c(2,2,2,2)))
allf <- forecast(y, h=8, parallel=TRUE,
num.cores=4)
plot(allf)
# Example 1
# The hierarchical structure looks like
# 2 child nodes associated with level 1,
# which are followed by 3 and 2 sub-child
# nodes respectively at level 2.
nodes <- list(2, c(3, 2))
abc <- ts(5 + matrix(sort(rnorm(500)),
ncol = 5, nrow = 100))
x <- hts(abc, nodes)
# etc
fc <- forecast(x, h=10, fmethod="ets",
parallel=TRUE, num.cores=2)
plot(fc)
# arima
fc <- forecast(x, h=10, fmethod="arima",
parallel=TRUE, num.cores=4)
plot(fc)
# Example 2
# Suppose we've got the bottom names that
# can be useful for constructing the node
# structure and the labels at higher levels.
# We need to specify how to split them
# in the argument "characters".
abc <- ts(5 + matrix(sort(rnorm(1000)),
ncol = 10, nrow = 100))
colnames(abc) <- c("A10A", "A10B", "A10C",
"A20A", "A20B", "B30A","B30B", "B40A",
"B40B", "B40C")
y <- hts(abc, characters = c(1, 2, 1))
# etc
fc <- forecast(y, h=10, fmethod="ets",
parallel=TRUE, num.cores=2)
plot(fc)
# arima
fc <- forecast(y, h=10, fmethod="arima",
parallel=TRUE, num.cores=8)
plot(fc)