Visual display of information

Eitan Tzelgov

Where we are and where we’re going

Why are we discussing this?

“Graphical excellence is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space” ― Edward R. Tufte, The Visual Display of Quantitative Information

Should we use tables or graphs?

Table vs. Figure

Let’s look at the movies data

Table

Movie rating by Genre
Genre        Minimum  First Quartile  Median  Third Quartile  Maximum
Action           1.0            4.20     5.4             6.4      9.8
Animation        1.0            6.00     6.7             7.3      9.8
Comedy           1.0            5.00     6.0             6.9     10.0
Drama            1.0            5.85     6.9             7.8      9.9
Documentary      1.0            5.40     6.3             7.1     10.0
Romance          1.5            5.20     6.2             7.1      9.6
Short            1.0            5.30     6.4             7.5     10.0
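
A table like the one above can be built by grouping on genre and summarising each group. The sketch below is only an illustration: `toy_movies` and its values are made up, and the slides do not show how the original table was computed.

```r
library(dplyr)

# Hypothetical stand-in for the movies data (made-up ratings)
toy_movies <- data.frame(
  genre  = c("Action", "Action", "Action", "Drama", "Drama", "Drama"),
  rating = c(4, 5, 6, 6, 7, 8)
)

# One row per genre, with the five summary statistics as columns
rating_summary <- toy_movies %>%
  group_by(genre) %>%
  summarise(
    minimum        = min(rating),
    first_quartile = quantile(rating, 0.25),
    median         = median(rating),
    third_quartile = quantile(rating, 0.75),
    maximum        = max(rating)
  )

rating_summary
```

With the real data, the same pipeline would presumably start from `movies_clean` instead of `toy_movies`.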

How would you improve that table?

Model table features

Picture - prep

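library(tidyverse)      # dplyr, tidyr (drop_na), ggplot2
library(ggplot2movies)  # assumption: source of the `movies` data; not stated on the slide

# Assign each movie the first genre flag that equals 1; drop movies with no genre
movies_clean <- movies %>%
  mutate(genre = case_when(Action == 1 ~ "Action",
                           Animation == 1 ~ "Animation",
                           Comedy == 1 ~ "Comedy",
                           Drama == 1 ~ "Drama",
                           Documentary == 1 ~ "Documentary",
                           Romance == 1 ~ "Romance",
                           Short == 1 ~ "Short")) %>%
  drop_na(genre)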


movies_plot<-ggplot(data = movies_clean) +
  geom_pointrange(mapping = aes(x = genre, y = rating),
                  stat = "summary",
                  fun.min = function(z) { quantile(z,0.25) },
                  fun.max = function(z) { quantile(z,0.75) },
                  fun = median)+
  coord_flip()+
  theme_minimal()+
  ggtitle("Movie rating by Genre", subtitle = "Lines show interquartile range")

movies_plot # call it

picture

How would you improve that picture?

So…

Let’s talk about figures

Just One variable?

One categorical variable

Movies example, again:

## # A tibble: 7 x 2
##   genre           n
##   <chr>       <int>
## 1 Action       4688
## 2 Animation    3606
## 3 Comedy      14269
## 4 Documentary  3183
## 5 Drama       16952
## 6 Romance       580
## 7 Short        2724
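
Counts like these come from tallying the rows in each category. A minimal sketch, using a made-up `toy_movies` data frame rather than the real data:

```r
library(dplyr)

# Hypothetical stand-in for movies_clean (made-up rows)
toy_movies <- data.frame(
  genre = c("Drama", "Comedy", "Drama", "Action", "Drama", "Comedy")
)

# count() returns one row per genre, with the number of rows in column n
genre_counts <- toy_movies %>%
  count(genre)

genre_counts
```

On the slide data the call would presumably be `movies_clean %>% count(genre)`; that is an assumption, since the code behind the output is not shown.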

And plug into table:

Genre        Frequency
Action            4688
Animation         3606
Comedy           14269
Documentary       3183
Drama            16952
Romance            580
Short             2724

A dot plot?

The logic of plotting with ggplot

Lots of helpful material online; see, e.g., this ggplot intro

What you print

movie_freq <- ggplot(data = movies_clean, aes(x = genre)) +
  geom_point(stat = "count") +
  theme_minimal() + ## controls the overall look; many themes are available
  scale_x_discrete("Genre") +
  scale_y_continuous("Frequency") + ## axis label; there are many other ways to set it
  ggtitle("Movie Frequency by Genre",
          subtitle = "Source: IMDB") +
  coord_flip() ## optional

movie_freq # call it

One continuous variable? A histogram?

What you print

movie_hist<-ggplot(movies,aes(x=rating)) + 
  geom_histogram(binwidth = .1) + ## binwidth sets the width of each bin (here 0.1 rating points)
  scale_x_continuous("Rating") +
  theme_minimal()+
  ggtitle("Distribution of movies' ratings",
          subtitle = "Source: IMDB")

movie_hist # call it

The histogram

Approximates the distribution of our variable by grouping observations into bins (intervals of values) and counting how many fall in each. See below
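
To make the binning concrete, here is a small base R sketch that does by hand what a histogram does before drawing bars. The ratings and the bin width of 1 are made up for illustration.

```r
# Hypothetical ratings (made-up values)
ratings <- c(5.0, 5.2, 5.4, 6.1, 6.3, 6.8, 7.5)

# Bin edges at 5, 6, 7, 8, i.e. bins of width 1
breaks <- seq(5, 8, by = 1)

# Assign each rating to a bin; right = FALSE gives intervals [5,6), [6,7), [7,8)
bins <- cut(ratings, breaks = breaks, right = FALSE)

# The height of each histogram bar is the count of observations in that bin
bin_counts <- table(bins)
bin_counts
```

A narrower bin width shows more detail but gives noisier counts; a wider one smooths the shape.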

Two variables

Continuous

What you print

fish_plot<-ggplot(dat,aes(x=GDP90LGN, y = POLITY))+ 
  geom_point() +
  scale_x_continuous("log GDP in 1990 dollars") +
  scale_y_continuous("Polity score")+
  theme_minimal()+
  ggtitle("GDP and Level of Democracy",
          subtitle = "Source: Fish 2012")

fish_plot # call it

One categorical and one continuous variable? A Boxplot?

The boxplot

Captures quartiles and outliers

A boxplot. Source: https://www.earthdatascience.org/courses/earth-analytics/document-your-science/add-images-to-rmarkdown-report/
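
The numbers behind a boxplot can be computed directly. The sketch below uses made-up scores and the common 1.5 × IQR rule for flagging outliers, which is also the default in `geom_boxplot()`.

```r
# Hypothetical scores (made-up values), with one unusually large observation
scores <- c(2, 4, 5, 6, 7, 8, 30)

# The box spans the first to third quartile; the line inside is the median
q1  <- unname(quantile(scores, 0.25))
med <- median(scores)
q3  <- unname(quantile(scores, 0.75))
iqr <- q3 - q1

# Points more than 1.5 * IQR beyond the box are drawn as outliers
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
outliers <- scores[scores < lower_fence | scores > upper_fence]

c(q1 = q1, median = med, q3 = q3)
outliers
```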

What you print

opec<- ggplot(data=dat_opec,aes(x=OPEC_member, y = POLITY))+ 
  geom_boxplot()+
  scale_x_discrete("OPEC membership") +
  scale_y_continuous("Polity score") +
  theme_minimal()+
  ggtitle("OPEC membership and Level of Democracy",
          subtitle = "Source: Fish 2012")

opec # call it

Two categorical variables

Maybe a table:

table(dat_mus$OPEC_member, dat_mus$Muslim) ## produces a contingency table (here 2x2)
##               
##                Muslim Majority No Muslim majority
##   Member                    10                  1
##   Not a Member              38                108

More than two variables

Mapping into an aesthetic

What you print

gdp_dem_mus <- ggplot(dat_mus, aes(x = GDP90LGN, y = POLITY,
                                   color = Muslim)) + ## a third variable mapped to color
  geom_point() +
  scale_x_continuous("log GDP in 1990 dollars") +
  scale_y_continuous("Polity score") +
  ggtitle("GDP and Level of Democracy by Muslim Majority Status",
          subtitle = "Source: Fish 2012") +
  theme_minimal() ## one complete theme is enough; a later theme overrides an earlier one

gdp_dem_mus # call it

Small multiples

What you print

mult <- ggplot(movies_clean, aes(x = length, y = rating)) +
  geom_point() +
  scale_x_continuous("Length") +
  scale_y_continuous("Rating") +
  facet_wrap(~genre) + ## one panel per genre
  theme_minimal() +
  ggtitle("Movie Length and Rating by Genre",
          subtitle = "Source: IMDB")

mult

Principles of Design

Tell the story well