Data Visualization with R: A Light Intro

Bodong Chen, University of Toronto
KNAER Data Visualization Workshop
April 3, 2013

What is R?

  • A statistical programming language
  • Free software for statistical computing and graphics (and a lot more)
  • Widely used among statisticians for data analysis
  • Popularity has increased substantially in recent years



Why R?

  • A huge number of packages
  • Extensive documentation and a vibrant user community
  • Reproducible analysis and research
  • It's free!

Demo: Visualize census data

  • A subset of a student census dataset
load("./data/data_workshop.Rda") # load data
str(census) # check data structure
'data.frame':   1000 obs. of  7 variables:
 $ gender  : Factor w/ 2 levels "F","M": 1 1 1 1 2 1 1 1 2 2 ...
 $ race    : Factor w/ 9 levels "Black   ","E Asian ",..: 2 7 1 1 7 7 4 2 6 4 ...
 $ program : Factor w/ 3 levels "Academic","Applied",..: 1 1 2 1 1 1 1 2 1 1 ...
 $ progress: Factor w/ 4 levels "Having Difficulty",..: 2 2 1 3 2 3 2 3 2 2 ...
 $ mark    : num  90 85 41 67 58 67 88 52 76 90 ...
 $ mark9   : num  94 78 41 53 67 69 90 50 85 90 ...
 $ absence : num  1 2 8 1 3 0 1 6 4 3 ...
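
The workshop data file is not included here, so the following is only a rough stand-in for trying the later commands: a synthetic data frame with a few columns named like those above. The name toy_census, the two program levels, and all values are assumptions made for illustration; nothing in it comes from the real census data.

```r
# Synthetic stand-in for the workshop data: random values, not real records.
set.seed(42)  # make the random draws repeatable
n <- 200
toy_census <- data.frame(
  gender  = factor(sample(c("F", "M"), n, replace = TRUE)),
  program = factor(sample(c("Academic", "Applied"), n, replace = TRUE)),
  mark9   = round(runif(n, min = 40, max = 100)),
  absence = rpois(n, lambda = 3)
)
# Let 'mark' loosely follow 'mark9', then clamp it to the 0-100 range.
toy_census$mark <- pmin(100, pmax(0, round(toy_census$mark9 + rnorm(n, sd = 8))))
str(toy_census)  # check data structure
```

Passing data=toy_census instead of data=census should let the plotting calls that use only these columns run.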

Bar graph: count one categorical variable, 'race'

library(ggplot2) # load ggplot2 library
qplot(race, data=census)

Bar graph: set fill color to darkblue

qplot(race, data=census, fill=I("darkblue"))

Bar graph: specify fill color, by 'progress'

qplot(race, data=census, fill=progress)

Bar graph: too colorful? use grey scale

qplot(race, data=census, fill=progress) + scale_fill_grey()

Bar graph: don't like "stack"? use "dodge"

qplot(race, data=census, fill=progress, position="dodge")

Bar graph: care about percentage? "fill" space

qplot(race, data=census, fill=progress, position="fill", ylab="percentage")
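
With position="fill", each bar is rescaled so that its segments add up to 1. The arithmetic behind that can be sketched in base R as row-wise proportions of a two-way count table. The toy data frame and its labels are made up for illustration.

```r
# Hypothetical toy data: two categorical columns, like 'race' and 'progress'.
toy <- data.frame(
  group    = c("A", "A", "A", "A", "B", "B"),
  progress = c("Good", "Good", "Good", "Poor", "Good", "Poor")
)
counts <- table(toy$group, toy$progress)  # rows: group, columns: progress
shares <- prop.table(counts, margin = 1)  # divide each row by its row total
print(shares)
```

Each row of shares sums to 1, which is why every bar in the filled chart has the same height.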

Bar graph: flip coordinates

qplot(race, data=census, fill=progress, position="fill", ylab="percentage") + coord_flip()

Histogram: count one continuous variable, 'mark'

qplot(mark, data=census)
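
A histogram is a count of values per bin. As a small sketch of that idea, base R's hist() can return the counts without drawing anything. The ten marks are the first ten values shown in the str() output above; the 10-point bin width is an assumption chosen for readability, not what qplot picks by default.

```r
# First ten 'mark' values from the str() output above.
marks <- c(90, 85, 41, 67, 58, 67, 88, 52, 76, 90)
# Bins are (40,50], (50,60], ..., (90,100]; plot = FALSE returns the counts only.
bins <- hist(marks, breaks = seq(40, 100, by = 10), plot = FALSE)
print(bins$counts)
```

Changing the breaks changes the shape of the histogram, so it is worth trying more than one bin width.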

Histogram: split by 'progress'

qplot(mark, data = census, facets = progress ~ .)

Histogram: split by 'progress' and 'gender'

qplot(mark, data = census, facets = progress ~ gender)

Plot a continuous variable against a categorical one

qplot(progress, mark, data=census) # results not good

Jitter plot

qplot(progress, mark, data=census, geom="jitter")

Jitter plot: deal with overplotting

qplot(progress, mark, data=census, geom="jitter", alpha=I(1/3))

Box plot

qplot(progress, mark, data=census, geom="boxplot")
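
A box plot summarizes each group by its quartiles: the box spans the 25% and 75% values, and the line inside marks the median. As a rough sketch, the same numbers can be computed per group in base R. The group labels and marks below are made up for illustration.

```r
# Hypothetical toy data: five marks in each of two groups.
toy <- data.frame(
  progress = rep(c("Group 1", "Group 2"), each = 5),
  mark     = c(50, 60, 70, 80, 90, 40, 45, 50, 55, 100)
)
# Quartiles per group; the result holds one vector of three values per group.
box_stats <- tapply(toy$mark, toy$progress, quantile, probs = c(0.25, 0.5, 0.75))
print(box_stats)
```

The value 100 in the second group sits far above that group's box, the kind of point a box plot would typically draw separately as an outlier.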

Combine box plot with jitter plot

qplot(progress, mark, data=census, geom=c("boxplot", "jitter"), alpha=I(1/5))

Density curve - Another angle on the same data

qplot(mark, data=census, fill=progress, geom="density")

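Bar graph: normalized density (ndensity)

# ..ndensity.. scales the bar heights so the tallest bar is 1; it comes from
# the binning stat that geom_bar() used in older ggplot2 releases
ggplot(census, aes(x=factor(mark))) +
  geom_bar(aes(y=..ndensity.., group=progress, fill=progress)) +
  xlab("Mark") +
  ylab("Frequency of Students by Progress") +
  ggtitle("Distribution of Marks by Progress")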

Density curve: make transparent

qplot(mark, data=census, fill=progress, geom="density", alpha=I(1/2))

Density curve: stack together

qplot(mark, data=census, fill=progress, geom="density", position="stack")

Density curve: fill y axis

qplot(mark, data=census, fill=progress, geom="density", position="fill")

Scatterplot: two continuous variables

qplot(mark9, mark, data=census)

Scatterplot: change shape of points

qplot(mark9, mark, data=census, shape=I(1))

Scatterplot: colorize points by 'program'

qplot(mark9, mark, data=census, shape=I(1), colour=program)

Scatterplot: define size of points by 'absence'

qplot(mark9, mark, data=census, shape=I(1), colour=program, size=absence)

Scatterplot: add a linear regression line

qplot(mark9, mark, data=census, geom=c("point", "smooth"), method="lm")
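
The smooth line with method="lm" is a linear regression fit. To see the numbers behind such a line, the same kind of model can be fitted directly with lm(). This is a minimal sketch on made-up values: the toy marks lie exactly on a known line so the result is easy to check, and they are not taken from the census data.

```r
# Hypothetical toy data: 'mark' lies exactly on the line 5 + 0.9 * mark9.
toy <- data.frame(mark9 = c(50, 60, 70, 80, 90))
toy$mark <- 5 + 0.9 * toy$mark9
fit <- lm(mark ~ mark9, data = toy)  # regress mark on mark9
print(coef(fit))                     # intercept and slope of the fitted line
```

On real data the points scatter around the line, and summary(fit) reports how well the line describes them.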

Scatterplot: split by 'program'

qplot(mark9, mark, data=census, geom=c("point", "smooth"), method="lm", 
      facets= . ~ program)

To Harness the Computational Power of R

# Filter data: keep only students with non-zero marks
census.sub <- subset(census, mark > 0 & mark9 > 0)
qplot(mark9, mark, data=census.sub, geom=c("point", "smooth"), method="lm")
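
To check what a filter like this removes, it helps to try it on a tiny table first. A small sketch, with made-up marks:

```r
# Hypothetical toy data: two of the four rows contain a zero mark.
toy <- data.frame(mark = c(0, 55, 70, 88), mark9 = c(60, 0, 65, 90))
# Keep a row only if both marks are above zero.
toy_sub <- subset(toy, mark > 0 & mark9 > 0)
print(toy_sub)
```

A row is dropped when either mark is zero, so only the last two rows survive.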

Summarize data, and heatmap

library(plyr) # load library plyr
hm.df <- ddply(census, .(race, program), summarize, absence=mean(absence))
ggplot(hm.df, aes(race, program, fill=absence)) + geom_tile() + 
  scale_fill_gradient2(high="red", low="white") # plot a heatmap
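
The summarizing step computes one mean per combination of race and program. If plyr is not installed, base R's aggregate() gives a comparable table. This sketch swaps in aggregate() for ddply() and uses a made-up toy data frame, so the group labels and values are assumptions.

```r
# Hypothetical toy data with the same three column names as the summary above.
toy <- data.frame(
  race    = c("A", "A", "B", "B"),
  program = c("Academic", "Academic", "Academic", "Applied"),
  absence = c(2, 4, 1, 5)
)
# Mean absence for every race/program combination that occurs in the data.
hm_toy <- aggregate(absence ~ race + program, data = toy, FUN = mean)
print(hm_toy)
```

Combinations that never occur (here A with Applied) produce no row, which would show up as an empty tile in a heatmap.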

Maps

library(maps) # load maps library
crime.map <- read.csv("./data/crime.map.csv") # read data
ggplot(crime.map, aes(x=long, y=lat, group=group, fill=Murder)) +
  geom_polygon(colour="black")

Export visualizations

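ggsave(filename="plot.pdf") # save the most recent plot as a PDF
ggsave(filename="plot.jpeg", dpi=72) # low-resolution JPEG
# 'htmap' must be a saved plot object, e.g. the heatmap assigned to a variable
ggsave(filename="plot.svg", plot=htmap, width=10, height=5)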

Shiny: interactive web applications

  • Turn analysis into web applications that can be accessed by everyone

knitr: Reproducible Report Writing

  • Reproducible analysis: update data and recompile reports
  • Dynamic results embedding
  • Various output formats: PDF, Word, HTML
  • Flexible formatting: Report, Article, Book, Slides, Webpage


Additional resources

Thanks!

Questions?