The Good, the Bad, and the Ugly

Great Power, Great Responsibility

  • There are many ways to visualize the same data.

  • You have just seen how to make quite attractive visualizations with ggplot2, which has good default settings, but judgement is still required: for example, should I vary the size or the color?

  • Excel, etc. can also be used to make perfectly acceptable visualizations - or terrible ones.
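
As a minimal sketch of the size-versus-color choice mentioned above, the snippet below encodes the same third variable both ways. It uses R's built-in mtcars data set purely as a stand-in (an assumption; it is not the data used in these slides), and the names p_size and p_color are illustrative.

```r
library(ggplot2)

# mtcars ships with R; it stands in for any data set with a third variable
# Option 1: encode horsepower as point size
p_size <- ggplot(mtcars, aes(x = wt, y = mpg, size = hp)) +
  geom_point()

# Option 2: encode horsepower as point color
p_color <- ggplot(mtcars, aes(x = wt, y = mpg, color = hp)) +
  geom_point()

print(p_size)
print(p_color)
```

Neither version is right by default; which one reads better depends on the message the plot is meant to carry.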

What's the difference?

  • Good visualizations...

Clearly and accurately convey the key messages in the data

  • Bad visualizations...

Obfuscate the data (either through ignorance, or malice!)

What does this mean?

  • Visualizations can be used by an analyst for their own consumption, to gain insights

  • Visualizations can also be used to provide information to a decision maker, and/or to convince someone

  • Bad visualizations hide patterns that could give insight, or mislead decision makers

Bad Visualizations

Bad Visualization?

  • Not all points can be labeled, so data is lost

  • Colors are meaningless and close enough to be confusing, but are still needed to make the chart readable

  • 3D adds nothing; the visible volume is larger than the true share

Better Visualization

  • All data is visible

  • Don’t lose small regions

  • Can easily compare relative sizes

  • Something to consider: for some people and applications, being less “visually exciting” is a drawback

On a World Map?

  • Possible with this data, but still a bit tedious to create: we need to determine which countries lie in which region

  • Shading all countries in a region the same color is misleading - countries in, e.g. Latin America, will send students at different rates

  • We have access to per country data - we will plot it on a world map and see if it is effective.

Bad Scales

  • “Caucasian” bar is truncated

  • Every bar has its own scale - compare “Native American” to “African American”

  • The only useful thing is the numbers

  • Minor: mixed precision, unclear rounding applied

Two Rights Make A Wrong

  • Different units suggest a (non-existent) crossover in 1995

  • Transformation shows true moments of change

Family Matters

  • If we are interested in shares within a year, it's good

  • If we want to see rates of change, it is pretty much unusable!

The Good, the Bad, and the Ugly in R

Bar Charts

# Load ggplot library
library(ggplot2)
# Load our data, which lives in intl.csv
intl = read.csv("intl.csv")
str(intl)
## 'data.frame':    8 obs. of  2 variables:
##  $ Region       : Factor w/ 8 levels "Africa","Asia",..: 2 3 6 4 5 1 7 8
##  $ PercentOfIntl: num  0.531 0.201 0.098 0.09 0.054 0.02 0.015 0.002
# We want to make a bar plot with Region on the x-axis
# and PercentOfIntl on the y-axis.
ggplot(intl, aes(x=Region, y=PercentOfIntl)) +
  geom_bar(stat="identity") +
  geom_text(aes(label=PercentOfIntl))

# Reorder the levels of Region by decreasing PercentOfIntl
# We can do this with the reorder and transform commands.
intl = transform(intl, Region = reorder(Region, -PercentOfIntl))
# Look at the structure
str(intl)
## 'data.frame':    8 obs. of  2 variables:
##  $ Region       : Factor w/ 8 levels "Asia","Europe",..: 1 2 3 4 5 6 7 8
##   ..- attr(*, "scores")= num [1:8(1d)] -0.02 -0.531 -0.201 -0.09 -0.054 -0.098 -0.015 -0.002
##   .. ..- attr(*, "dimnames")=List of 1
##   .. .. ..$ : chr  "Africa" "Asia" "Europe" "Latin Am. & Caribbean" ...
##  $ PercentOfIntl: num  0.531 0.201 0.098 0.09 0.054 0.02 0.015 0.002
# Make the percentages out of 100 instead of fractions
intl$PercentOfIntl = intl$PercentOfIntl * 100
# Make the plot
ggplot(intl, aes(x=Region, y=PercentOfIntl)) +
  geom_bar(stat="identity", fill="dark blue") +
  geom_text(aes(label=PercentOfIntl), vjust=-0.4) +
  ylab("Percent of International Students") +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_text(angle = 45, hjust = 1))
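
The reordering step above can be seen in isolation with a small sketch. The data frame intl_small is a hand-built four-row subset of intl, using only the region and value pairs that can be read off the str() output; the name and the subset are assumptions for illustration.

```r
# A four-row subset of intl, rebuilt by hand from the str() output above
intl_small <- data.frame(
  Region = factor(c("Asia", "Europe", "Latin Am. & Caribbean", "Africa")),
  PercentOfIntl = c(0.531, 0.201, 0.090, 0.020)
)

# factor() sorts its levels alphabetically; ggplot2 uses that order on the axis
print(levels(intl_small$Region))

# reorder() sorts the levels by increasing score, so negating the
# percentage puts the largest region first
intl_small$Region <- reorder(intl_small$Region, -intl_small$PercentOfIntl)
print(levels(intl_small$Region))
```

The rows of the data frame do not move; only the level order of the factor changes, and that is what ggplot2 reads when it lays out the bars.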

World map

# Load the ggmap package
library(ggmap)
# Load in the international student data
intlall = read.csv("intlall.csv",stringsAsFactors=FALSE)
# Let's look at the first few rows
head(intlall)
##           Citizenship UG  G SpecialUG SpecialG ExhangeVisiting Total
## 1             Albania  3  1         0        0               0     4
## 2 Antigua and Barbuda NA NA        NA        1              NA     1
## 3           Argentina NA 19        NA       NA              NA    19
## 4             Armenia  3  2        NA       NA              NA     5
## 5           Australia  6 32        NA       NA               1    39
## 6             Austria NA 11        NA       NA               5    16
# Those NAs are really 0s, and we can replace them easily
intlall[is.na(intlall)] = 0
# Now let's look again
head(intlall)
##           Citizenship UG  G SpecialUG SpecialG ExhangeVisiting Total
## 1             Albania  3  1         0        0               0     4
## 2 Antigua and Barbuda  0  0         0        1               0     1
## 3           Argentina  0 19         0        0               0    19
## 4             Armenia  3  2         0        0               0     5
## 5           Australia  6 32         0        0               1    39
## 6             Austria  0 11         0        0               5    16
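
The NA replacement above relies on indexing a data frame with a logical matrix. A minimal sketch on a two-row toy copy of the data (the name toy is an assumption; the values are taken from the Albania and Argentina rows shown above):

```r
# Two rows copied from the head() output above
toy <- data.frame(
  Citizenship = c("Albania", "Argentina"),
  UG = c(3, NA),
  G  = c(1, 19),
  stringsAsFactors = FALSE
)

# is.na() on a data frame returns a logical matrix of the same shape
print(is.na(toy))

# Assigning through that matrix overwrites only the cells that are TRUE
toy[is.na(toy)] <- 0
print(toy)
```

This is only safe when a missing value really does mean zero, as the comment in the original code asserts for this data set.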
# Load the world map
world_map = map_data("world")
str(world_map)
## 'data.frame':    99338 obs. of  6 variables:
##  $ long     : num  -69.9 -69.9 -69.9 -70 -70.1 ...
##  $ lat      : num  12.5 12.4 12.4 12.5 12.5 ...
##  $ group    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ order    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ region   : chr  "Aruba" "Aruba" "Aruba" "Aruba" ...
##  $ subregion: chr  NA NA NA NA ...
# Let's merge intlall into world_map using the merge command
world_map = merge(world_map, intlall, by.x ="region", by.y = "Citizenship")
str(world_map)
## 'data.frame':    63634 obs. of  12 variables:
##  $ region         : chr  "Albania" "Albania" "Albania" "Albania" ...
##  $ long           : num  20.5 20.4 19.5 20.5 20.4 ...
##  $ lat            : num  41.3 39.8 42.5 40.1 41.5 ...
##  $ group          : num  6 6 6 6 6 6 6 6 6 6 ...
##  $ order          : int  789 822 870 815 786 821 818 779 879 795 ...
##  $ subregion      : chr  NA NA NA NA ...
##  $ UG             : num  3 3 3 3 3 3 3 3 3 3 ...
##  $ G              : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ SpecialUG      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SpecialG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ExhangeVisiting: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Total          : int  4 4 4 4 4 4 4 4 4 4 ...
# Plot the map
ggplot(world_map, aes(x=long, y=lat, group=group)) +
  geom_polygon(fill="white", color="black") +
  coord_map("mercator")

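# merge() changed the row order, so the polygon vertices are no longer in drawing order.
# Reorder the data by group, then by order within each group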
world_map = world_map[order(world_map$group, world_map$order),]
# Redo the plot
ggplot(world_map, aes(x=long, y=lat, group=group)) +
  geom_polygon(fill="white", color="black") +
  coord_map("mercator")
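
Why the row order matters can be sketched with a single polygon. geom_polygon connects vertices in the order the rows appear, so scrambled rows give a scrambled outline. The unit square below is a hypothetical stand-in for one country's outline; all names are illustrative.

```r
library(ggplot2)

# Hypothetical unit square: four corners listed in drawing order
square <- data.frame(
  long  = c(0, 1, 1, 0),
  lat   = c(0, 0, 1, 1),
  group = 1,
  order = 1:4
)

# The same rows in a scrambled order, as can happen after merge()
scrambled <- square[c(1, 3, 2, 4), ]

# Sorting by group, then by order, restores the drawing order
restored <- scrambled[order(scrambled$group, scrambled$order), ]

# Scrambled rows draw a bowtie; restored rows draw the square
p_scrambled <- ggplot(scrambled, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "white", color = "black")
p_restored <- ggplot(restored, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "white", color = "black")

print(p_scrambled)
print(p_restored)
```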

# Let's look for China in the citizenship names
table(intlall$Citizenship) 
## 
##                      Albania          Antigua and Barbuda                    Argentina                      Armenia                    Australia                      Austria 
##                            1                            1                            1                            1                            1                            1 
##                      Bahrain                   Bangladesh                      Belarus                      Belgium                      Bolivia           Bosnia-Hercegovina 
##                            1                            1                            1                            1                            1                            1 
##                       Brazil                     Bulgaria                     Cambodia                     Cameroon                       Canada                        Chile 
##                            1                            1                            1                            1                            1                            1 
## China (People's Republic Of)                     Colombia                   Costa Rica                Cote d'Ivoire                      Croatia                       Cyprus 
##                            1                            1                            1                            1                            1                            1 
##               Czech Republic                      Denmark                      Ecuador                        Egypt                  El Salvador                      Estonia 
##                            1                            1                            1                            1                            1                            1 
##                     Ethiopia                      Finland                       France                      Georgia                      Germany                        Ghana 
##                            1                            1                            1                            1                            1                            1 
##                       Greece                    Guatemala                        Haiti                    Hong Kong                      Hungary                      Iceland 
##                            1                            1                            1                            1                            1                            1 
##                        India                    Indonesia                         Iran                         Iraq                      Ireland                       Israel 
##                            1                            1                            1                            1                            1                            1 
##                        Italy                      Jamaica                        Japan                       Jordan                   Kazakhstan                        Kenya 
##                            1                            1                            1                            1                            1                            1 
##                 Korea, South                       Kuwait                       Latvia                      Lebanon                    Lithuania                    Macedonia 
##                            1                            1                            1                            1                            1                            1 
##                     Malaysia                    Mauritius                       Mexico                      Moldova                     Mongolia                   Montenegro 
##                            1                            1                            1                            1                            1                            1 
##                      Morocco                        Nepal                  Netherlands                  New Zealand                      Nigeria                       Norway 
##                            1                            1                            1                            1                            1                            1 
##                     Pakistan                     Paraguay                         Peru                  Philippines                       Poland                     Portugal 
##                            1                            1                            1                            1                            1                            1 
##                        Qatar                      Romania                       Russia                       Rwanda                 Saudi Arabia                       Serbia 
##                            1                            1                            1                            1                            1                            1 
##                 Sierra Leone                    Singapore                     Slovakia                      Somalia                 South Africa                        Spain 
##                            1                            1                            1                            1                            1                            1 
##                    Sri Lanka                    St. Lucia St. Vincent & The Grenadines                        Sudan                       Sweden                  Switzerland 
##                            1                            1                            1                            1                            1                            1 
##                        Syria                       Taiwan                     Tanzania                     Thailand            Trinidad & Tobago                      Tunisia 
##                            1                            1                            1                            1                            1                            1 
##                       Turkey                       Uganda                      Ukraine         United Arab Emirates               United Kingdom                      Unknown 
##                            1                            1                            1                            1                            1                            1 
##                      Uruguay                    Venezuela                      Vietnam                    West Bank                       Zambia                     Zimbabwe 
##                            1                            1                            1                            1                            1                            1
# Let's "fix" that in the intlall dataset so the name matches the map's "China"
intlall$Citizenship[intlall$Citizenship=="China (People's Republic Of)"] = "China"
# We'll repeat our merge and order from before
world_map = merge(map_data("world"), intlall, 
                  by.x ="region",
                  by.y = "Citizenship")
world_map = world_map[order(world_map$group, world_map$order),]
ggplot(world_map, aes(x=long, y=lat, group=group)) +
  geom_polygon(aes(fill=Total), color="black") +
  coord_map("mercator")
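
The rename above fixes one mismatched name. As a sketch of how other mismatches could be found, setdiff() lists the names in one vector that are absent from another. The vector map_regions below is a hypothetical stand-in for unique(map_data("world")$region), and the helper columns in_students and on_map are illustrative only.

```r
# Names as they appear in the student data (all three occur in the table above)
citizenship <- c("Albania", "China (People's Republic Of)", "Unknown")
# Hypothetical stand-in for unique(map_data("world")$region)
map_regions <- c("Albania", "China")

# Names in the student data with no counterpart on the map
unmatched <- setdiff(citizenship, map_regions)
print(unmatched)

# merge() keeps only matching keys by default (all = FALSE),
# so rows without a match silently disappear
students <- data.frame(Citizenship = citizenship, in_students = TRUE,
                       stringsAsFactors = FALSE)
map_toy  <- data.frame(region = map_regions, on_map = TRUE,
                       stringsAsFactors = FALSE)
merged   <- merge(map_toy, students, by.x = "region", by.y = "Citizenship")
print(merged$region)
```

The same default explains why the merged world_map has fewer rows than the original map data: regions with no student record are dropped as well.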

# We can try other projections - this one is visually interesting
ggplot(world_map, aes(x=long, y=lat, group=group)) +
  geom_polygon(aes(fill=Total), color="black") +
  coord_map("ortho", orientation=c(20, 30, 0))

# The same projection from a different viewpoint
ggplot(world_map, aes(x=long, y=lat, group=group)) +
  geom_polygon(aes(fill=Total), color="black") +
  coord_map("ortho", orientation=c(-37, 175, 0))

Line Charts

# First, let's make sure we have ggplot2 loaded
library(ggplot2)
# Now let's load our data frame
households = read.csv("households.csv")
str(households)
## 'data.frame':    8 obs. of  7 variables:
##  $ Year          : int  1970 1980 1990 1995 2000 2005 2010 2012
##  $ MarriedWChild : num  40.3 30.9 26.3 25.5 24.1 22.9 20.9 19.6
##  $ MarriedWOChild: num  30.3 29.9 29.8 28.9 28.7 28.3 28.8 29.1
##  $ OtherFamily   : num  10.6 12.9 14.8 15.6 16 16.7 17.4 17.8
##  $ MenAlone      : num  5.6 8.6 9.7 10.2 10.7 11.3 11.9 12.3
##  $ WomenAlone    : num  11.5 14 14.9 14.7 14.8 15.3 14.8 15.2
##  $ OtherNonfamily: num  1.7 3.6 4.6 5 5.7 5.6 6.2 6.1
# Load reshape2
library(reshape2)
# Let's look at the first two columns of our households data frame
households[,1:2]
##   Year MarriedWChild
## 1 1970          40.3
## 2 1980          30.9
## 3 1990          26.3
## 4 1995          25.5
## 5 2000          24.1
## 6 2005          22.9
## 7 2010          20.9
## 8 2012          19.6
# First few rows of our melted households data frame
head(melt(households, id="Year"))
##   Year      variable value
## 1 1970 MarriedWChild  40.3
## 2 1980 MarriedWChild  30.9
## 3 1990 MarriedWChild  26.3
## 4 1995 MarriedWChild  25.5
## 5 2000 MarriedWChild  24.1
## 6 2005 MarriedWChild  22.9
# Now the first three columns, for comparison with the melted version
households[,1:3]
##   Year MarriedWChild MarriedWOChild
## 1 1970          40.3           30.3
## 2 1980          30.9           29.9
## 3 1990          26.3           29.8
## 4 1995          25.5           28.9
## 5 2000          24.1           28.7
## 6 2005          22.9           28.3
## 7 2010          20.9           28.8
## 8 2012          19.6           29.1
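# The first ten melted values: all eight MarriedWChild values,
# then the first two MarriedWOChild values
melt(households, id="Year")[1:10,3]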
##  [1] 40.3 30.9 26.3 25.5 24.1 22.9 20.9 19.6 30.3 29.9
# The same ten rows with all three columns
melt(households, id="Year")[1:10,]
##    Year       variable value
## 1  1970  MarriedWChild  40.3
## 2  1980  MarriedWChild  30.9
## 3  1990  MarriedWChild  26.3
## 4  1995  MarriedWChild  25.5
## 5  2000  MarriedWChild  24.1
## 6  2005  MarriedWChild  22.9
## 7  2010  MarriedWChild  20.9
## 8  2012  MarriedWChild  19.6
## 9  1970 MarriedWOChild  30.3
## 10 1980 MarriedWOChild  29.9
# Plot it
ggplot(melt(households, id="Year"),       
       aes(x=Year, y=value, color=variable)) +
  geom_line(size=2) + geom_point(size=5) +  
  ylab("Percentage of Households")
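
For contrast with the line chart, the same melted data can be drawn as stacked bars, the form the "Family Matters" slide describes as good for shares within a year but poor for rates of change. This is a sketch only: households_rebuilt is typed in by hand from the str() output above so that the snippet runs without households.csv, and the object names are assumptions.

```r
library(ggplot2)
library(reshape2)

# Rebuilt by hand from the str(households) output above
households_rebuilt <- data.frame(
  Year           = c(1970, 1980, 1990, 1995, 2000, 2005, 2010, 2012),
  MarriedWChild  = c(40.3, 30.9, 26.3, 25.5, 24.1, 22.9, 20.9, 19.6),
  MarriedWOChild = c(30.3, 29.9, 29.8, 28.9, 28.7, 28.3, 28.8, 29.1),
  OtherFamily    = c(10.6, 12.9, 14.8, 15.6, 16.0, 16.7, 17.4, 17.8),
  MenAlone       = c(5.6, 8.6, 9.7, 10.2, 10.7, 11.3, 11.9, 12.3),
  WomenAlone     = c(11.5, 14.0, 14.9, 14.7, 14.8, 15.3, 14.8, 15.2),
  OtherNonfamily = c(1.7, 3.6, 4.6, 5.0, 5.7, 5.6, 6.2, 6.1)
)

# Wide to long: one row per (Year, household type) pair
long <- melt(households_rebuilt, id.vars = "Year")

# One bar per year; each segment is one household type
p_stacked <- ggplot(long, aes(x = factor(Year), y = value, fill = variable)) +
  geom_bar(stat = "identity") +
  xlab("Year") +
  ylab("Percentage of Households")
print(p_stacked)
```

In the stacked form only the bottom segment sits on a common baseline, which is why trends in the other categories are hard to read there and easier to follow in the line chart.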