Eitan Tzelgov
You might be slightly confused at this point, but…
Today
“Graphical excellence is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space” ― Edward R. Tufte, The Visual Display of Quantitative Information
Summarize information in a clear and concise way
show the distribution of variables
show relationships between variables
show trends over time
show differences between groups
and much more…
Data summary is an important part of telling a story in a research paper
| Genre | Minimum | First Quartile | Median | Third Quartile | Maximum |
|---|---|---|---|---|---|
| Action | 1.0 | 4.20 | 5.4 | 6.4 | 9.8 |
| Animation | 1.0 | 6.00 | 6.7 | 7.3 | 9.8 |
| Comedy | 1.0 | 5.00 | 6.0 | 6.9 | 10.0 |
| Drama | 1.0 | 5.85 | 6.9 | 7.8 | 9.9 |
| Documentary | 1.0 | 5.40 | 6.3 | 7.1 | 10.0 |
| Romance | 1.5 | 5.20 | 6.2 | 7.1 | 9.6 |
| Short | 1.0 | 5.30 | 6.4 | 7.5 | 10.0 |
How would you improve that table?
Numbered caption describing table
Only three horizontal rules
Formatting for column headers
No excessive decimal places
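The checklist above can be sketched in R. This is only a sketch: it assumes the knitr package is installed, uses the built-in iris data as a stand-in for the movies data, and the caption text is made up for illustration. With `booktabs = TRUE`, LaTeX output typically gets just three horizontal rules, and LaTeX numbers the caption when the document is rendered.

```r
library(knitr)

# Stand-in data (an assumption): built-in iris, not the movies data
quartiles <- t(sapply(split(iris$Sepal.Length, iris$Species),
                      quantile, probs = c(0, 0.25, 0.5, 0.75, 1)))

# Readable column headers instead of "0%", "25%", ...
colnames(quartiles) <- c("Minimum", "First Quartile", "Median",
                         "Third Quartile", "Maximum")

# One decimal place is plenty here
quartiles <- round(quartiles, 1)

# Caption plus booktabs rules (top, below the header, bottom) for LaTeX output
quartile_table <- kable(quartiles, format = "latex", booktabs = TRUE,
                        caption = "Sepal length by species")
quartile_table
```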
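# Assumes the tidyverse is installed and that movies is the IMDB data
# shipped with the ggplot2movies package
library(tidyverse)
library(ggplot2movies)

# Give each film the first genre flag that is set; drop films with no flag
movies_clean <- movies %>%
  mutate(genre = case_when(Action == 1 ~ "Action",
                           Animation == 1 ~ "Animation",
                           Comedy == 1 ~ "Comedy",
                           Drama == 1 ~ "Drama",
                           Documentary == 1 ~ "Documentary",
                           Romance == 1 ~ "Romance",
                           Short == 1 ~ "Short")) %>%
  drop_na(genre)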
# Point = median rating; line = first to third quartile
movies_plot <- ggplot(data = movies_clean) +
  geom_pointrange(mapping = aes(x = genre, y = rating),
                  stat = "summary",
                  fun.min = function(z) { quantile(z, 0.25) },
                  fun.max = function(z) { quantile(z, 0.75) },
                  fun = median) +
  coord_flip() +
  theme_minimal() +
  ggtitle("Movie rating by Genre", subtitle = "Lines show interquartile range")
How would you improve that picture?
Which one is better?
A table, or a graph?
I would argue that graphs in general are a better means of communicating your results, summarizing information, and making an argument.
They require a bit more work, initially.
We must remember our discussion of variable types
Nominal, ordinal (also known as categorical)
Interval, ratio (also known as continuous)
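As a reminder of how these types usually show up in R, here is a small sketch in base R. The values are hypothetical toy data, not taken from any dataset in these slides.

```r
# Hypothetical toy values, for illustration only
genre <- factor(c("Drama", "Comedy", "Drama"))        # nominal: labels, no order

interest <- factor(c("low", "high", "medium"),
                   levels = c("low", "medium", "high"),
                   ordered = TRUE)                    # ordinal: labels with an order

rating <- c(6.9, 6.0, 7.8)                            # continuous: arithmetic is meaningful

table(genre)        # counting is what makes sense for nominal data
interest < "high"   # comparisons make sense once the levels are ordered
median(rating)      # numeric summaries make sense for continuous data
```

The variable type decides which summaries and which plots are appropriate, which is why it comes first.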
A table, showing frequencies
Or, a dot plot
## # A tibble: 7 x 2
## genre n
## <chr> <int>
## 1 Action 4688
## 2 Animation 3606
## 3 Comedy 14269
## 4 Documentary 3183
## 5 Drama 16952
## 6 Romance 580
## 7 Short 2724
| Genre | Frequency |
|---|---|
| Action | 4688 |
| Animation | 3606 |
| Comedy | 14269 |
| Documentary | 3183 |
| Drama | 16952 |
| Romance | 580 |
| Short | 2724 |
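A frequency table like this one has the shape that dplyr's `count()` returns. A minimal sketch, assuming dplyr is installed and using the built-in mtcars data (with `cyl` as the categorical variable) as a stand-in for the movies data:

```r
library(dplyr)

# Stand-in data (an assumption): built-in mtcars, not the movies data
cyl_freq <- mtcars %>%
  count(cyl)   # one row per category, with the frequency in column n

cyl_freq

# Base R alternative that needs no packages
table(mtcars$cyl)
```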
the data being plotted
a set of mappings from variables in the data to the aesthetics (appearance) of the geometric objects
the geometric objects (circles, lines, etc.) that appear on the plot
a coordinate system used to organize the geometric objects
the facets or groups of data shown in different plots
possibly, a theme
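Each item in the list above corresponds to one piece of a ggplot2 call. A minimal sketch, assuming ggplot2 is installed and using the built-in mtcars data as a stand-in:

```r
library(ggplot2)

# Stand-in data (an assumption): built-in mtcars
components <- ggplot(data = mtcars,                    # the data
                     mapping = aes(x = wt, y = mpg)) + # mappings from variables to aesthetics
  geom_point() +                                       # the geometric objects
  coord_cartesian() +                                  # the coordinate system (the default, spelled out)
  facet_wrap(~ cyl) +                                  # facets: one panel per group
  theme_minimal()                                      # a theme

components
```

Leaving out `coord_cartesian()`, `facet_wrap()`, or the theme still gives a valid plot; only the data, mappings, and at least one geom are needed to draw something.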
movie_freq <- ggplot(data = movies_clean, aes(x = genre)) +
  geom_point(stat = "count") +        # one dot per genre, placed at its count
  theme_minimal() +                   # controls the look; many themes exist
  scale_x_discrete("Genre") +
  scale_y_continuous("Frequency") +   # axis labels; there are other ways to set them
  ggtitle("Movie Frequency by Genre",
          subtitle = "Source: IMDB") +
  coord_flip()                        # optional
movie_freq  # print the plot
Histogram: approximates the distribution of our variable by grouping observations into bins. See below
Box plot: captures quartiles and outliers
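Both plot types for a continuous variable can be sketched with ggplot2. This assumes ggplot2 is installed and uses the built-in mtcars data as a stand-in; the bin width of 5 is an arbitrary choice for illustration.

```r
library(ggplot2)

# Stand-in data (an assumption): built-in mtcars
# Binning: count how many cars fall into each 5-mpg-wide interval
hist_plot <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 5) +
  theme_minimal()

# Quartiles and outliers: one box per number of cylinders
box_plot <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot() +
  theme_minimal()

hist_plot
box_plot
```

Changing `binwidth` can change the apparent shape of the distribution quite a bit, so it is worth trying a few values.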
##
## Muslim Majority No Muslim majority
## Member 10 1
## Not a Member 38 108
Two options:
Map one variable into an aesthetic
Use small multiples (more than one picture)
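The second option, small multiples, can be sketched with `facet_wrap()`. This assumes ggplot2 is installed and uses the built-in mtcars data as a stand-in, with transmission type (`am`) playing the role of the third variable.

```r
library(ggplot2)

# Stand-in data (an assumption): built-in mtcars
# One panel per value of the third variable instead of one color per value
small_multiples <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_wrap(~ am, labeller = label_both) +
  theme_minimal()

small_multiples
```

Small multiples tend to work better than color when the groups overlap heavily in a single panel.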
# Third variable (Muslim) is mapped to the color aesthetic
gdp_dem_mus <- ggplot(dat_mus, aes(x = GDP90LGN, y = POLITY,
                                   color = Muslim)) +
  geom_point() +
  scale_x_continuous("log GDP in 1990 dollars") +
  scale_y_continuous("Polity score") +
  ggtitle("GDP and Level of Democracy by Muslim Majority Status",
          subtitle = "Source: Fish 2012") +
  theme_minimal()
Give your axes proper labels
Put the independent variable on the horizontal axis
Think about what the proper type of plot is
Don’t use pie charts!
Don’t use 3D!
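The first two tips can be sketched in a few lines. This assumes ggplot2 is installed and uses the built-in mtcars data as a stand-in, treating weight as the independent variable.

```r
library(ggplot2)

# Stand-in data (an assumption): built-in mtcars
# Independent variable (weight) on the horizontal axis, outcome (mpg) on the vertical
labelled <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(x = "Weight (1,000 lbs)", y = "Miles per gallon") +  # labels with units, not raw column names
  theme_minimal()

labelled
```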