There are many ways to visualize the same data.
You have just seen how to make quite attractive visualizations with ggplot2, which has good default settings, but judgement is still required, e.g. do I vary the size, or do I vary the color?
Excel, etc. can also be used to make perfectly acceptable visualizations - or terrible ones.
A good visualization will clearly and accurately convey the key messages in the data
A bad visualization will obfuscate the data (either through ignorance or malice!)
Visualizations can be used by an analyst for their own consumption, to gain insights
Visualizations can also be used to provide information to a decision maker, and/or to convince someone
Bad visualizations hide patterns that could give insight, or mislead decision makers
Not all points can be labeled, so data is lost
Colors are meaningless and close enough to be confusing, but are still needed to make it readable
3D adds nothing, visible volume is larger than true share
All data is visible
Don’t lose small regions
Can easily compare relative sizes
Something to consider: for some people and applications, being less “visually exciting” is a negative
Possible with this data, but still a bit tedious to create: we need to determine which countries lie in which region
Shading all countries in a region the same color is misleading - countries in, e.g. Latin America, will send students at different rates
We have access to per country data - we will plot it on a world map and see if it is effective.
“Caucasian” bar is truncated
Every bar has its own scale - compare “Native American” to “African American”
The only useful thing is the numbers
Minor: mixed precision, unclear rounding applied
Different units suggest a (non-existent) crossover in 1995
Transformation shows true moments of change
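The notes do not say which transformation was applied. One common choice, sketched here under that assumption, is to index each series to its own first value so that both share a unitless scale. The data frame `trend`, its columns, and the helper `to_index` are hypothetical and used only for illustration.

```r
# Hypothetical values, for illustration only: two series in different units.
trend = data.frame(
  Year    = c(1990, 1995, 2000),
  SeriesA = c(200, 260, 300),   # e.g. counts in thousands
  SeriesB = c(4.0, 4.4, 4.2)    # e.g. counts in millions
)

# Express each series relative to its own first value (first year = 100).
to_index = function(x) 100 * x / x[1]
trend$IndexA = to_index(trend$SeriesA)
trend$IndexB = to_index(trend$SeriesB)
trend
```

Plotting `IndexA` and `IndexB` on one axis compares growth since the first year, so any crossing of the lines reflects the data rather than the choice of units.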
If we are interested in shares within a year, it's good
If we want to see rates of change, it is pretty much unusable!
# Load ggplot library
library(ggplot2)
# Load our data, which lives in intl.csv
intl = read.csv("intl.csv")
str(intl)
## 'data.frame': 8 obs. of 2 variables:
## $ Region : Factor w/ 8 levels "Africa","Asia",..: 2 3 6 4 5 1 7 8
## $ PercentOfIntl: num 0.531 0.201 0.098 0.09 0.054 0.02 0.015 0.002
# We want to make a bar plot with Region on the x-axis
# and percentage on the y-axis.
ggplot(intl, aes(x=Region, y=PercentOfIntl)) +
  geom_bar(stat="identity") +
  geom_text(aes(label=PercentOfIntl))
# Order the levels of the Region factor by decreasing percentage
# We can do this with the reorder command and the transform command.
intl = transform(intl, Region = reorder(Region, -PercentOfIntl))
# Look at the structure
str(intl)
## 'data.frame': 8 obs. of 2 variables:
## $ Region : Factor w/ 8 levels "Asia","Europe",..: 1 2 3 4 5 6 7 8
## ..- attr(*, "scores")= num [1:8(1d)] -0.02 -0.531 -0.201 -0.09 -0.054 -0.098 -0.015 -0.002
## .. ..- attr(*, "dimnames")=List of 1
## .. .. ..$ : chr "Africa" "Asia" "Europe" "Latin Am. & Caribbean" ...
## $ PercentOfIntl: num 0.531 0.201 0.098 0.09 0.054 0.02 0.015 0.002
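To see why negating the percentage gives a largest-first ordering, here is a minimal sketch on a toy data frame. The names `toy` and `Share` and the values are hypothetical, not taken from intl.csv. `reorder` sorts the factor levels by increasing score, so the most negative score (the largest share) comes first.

```r
# Hypothetical toy data, not the intl.csv values
toy = data.frame(Region = factor(c("B", "A", "C")),
                 Share  = c(0.2, 0.5, 0.3))

# Levels start out alphabetical: "A" "B" "C"
levels(toy$Region)

# Reorder the levels by the negated share, so the largest share comes first
toy = transform(toy, Region = reorder(Region, -Share))
levels(toy$Region)
```

ggplot2 places the bars in level order, which is why the plot after this step runs from the largest region to the smallest.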
# Make the percentages out of 100 instead of fractions
intl$PercentOfIntl = intl$PercentOfIntl * 100
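Printing the raw numbers as labels can give mixed precision (for example 53.1 next to 9). A small sketch of one way to get uniform labels, using base R's `sprintf`; the vector `pct` holds toy values typed in by hand and is an assumption for illustration.

```r
# Toy values on the 0-100 scale (hypothetical, not read from intl.csv)
pct = c(53.1, 9, 0.2)

# One decimal place for every label, followed by a percent sign
labels = sprintf("%.1f%%", pct)
labels
```

The same expression could be used inside the label aesthetic, e.g. `geom_text(aes(label = sprintf("%.1f%%", PercentOfIntl)))`, if uniform labels are wanted on the bar plot.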
# Make the plot
ggplot(intl, aes(x=Region, y=PercentOfIntl)) +
  geom_bar(stat="identity", fill="dark blue") +
  geom_text(aes(label=PercentOfIntl), vjust=-0.4) +
  ylab("Percent of International Students") +
  theme(axis.title.x = element_blank(), axis.text.x = element_text(angle = 45, hjust = 1))
# Load the ggmap package
library(ggmap)
# Load in the international student data
intlall = read.csv("intlall.csv",stringsAsFactors=FALSE)
# Let's look at the first few rows
head(intlall)
## Citizenship UG G SpecialUG SpecialG ExhangeVisiting Total
## 1 Albania 3 1 0 0 0 4
## 2 Antigua and Barbuda NA NA NA 1 NA 1
## 3 Argentina NA 19 NA NA NA 19
## 4 Armenia 3 2 NA NA NA 5
## 5 Australia 6 32 NA NA 1 39
## 6 Austria NA 11 NA NA 5 16
# Those NAs are really 0s, and we can replace them easily
intlall[is.na(intlall)] = 0
# Now let's look again
head(intlall)
## Citizenship UG G SpecialUG SpecialG ExhangeVisiting Total
## 1 Albania 3 1 0 0 0 4
## 2 Antigua and Barbuda 0 0 0 1 0 1
## 3 Argentina 0 19 0 0 0 19
## 4 Armenia 3 2 0 0 0 5
## 5 Australia 6 32 0 0 1 39
## 6 Austria 0 11 0 0 5 16
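The replacement above works because `is.na()` applied to a data frame returns a logical matrix of the same shape, and assigning through that matrix overwrites only the missing cells. A minimal sketch on a hypothetical toy data frame (`toy` and its values are made up for illustration):

```r
# Hypothetical toy data frame with missing counts
toy = data.frame(Citizenship = c("A", "B"),
                 UG = c(3, NA),
                 G  = c(NA, 19),
                 stringsAsFactors = FALSE)

# is.na(toy) is a logical matrix with the same shape as toy;
# assigning through it overwrites only the NA cells.
toy[is.na(toy)] = 0
toy
```

Reading the file with `stringsAsFactors=FALSE`, as done above, keeps the text column as plain character data, so the assignment does not run into factor-level problems.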
# Load the world map
world_map = map_data("world")
str(world_map)
## 'data.frame': 99338 obs. of 6 variables:
## $ long : num -69.9 -69.9 -69.9 -70 -70.1 ...
## $ lat : num 12.5 12.4 12.4 12.5 12.5 ...
## $ group : num 1 1 1 1 1 1 1 1 1 1 ...
## $ order : int 1 2 3 4 5 6 7 8 9 10 ...
## $ region : chr "Aruba" "Aruba" "Aruba" "Aruba" ...
## $ subregion: chr NA NA NA NA ...
# Let's merge intlall into world_map using the merge command
world_map = merge(world_map, intlall, by.x ="region", by.y = "Citizenship")
str(world_map)
## 'data.frame': 63634 obs. of 12 variables:
## $ region : chr "Albania" "Albania" "Albania" "Albania" ...
## $ long : num 20.5 20.4 19.5 20.5 20.4 ...
## $ lat : num 41.3 39.8 42.5 40.1 41.5 ...
## $ group : num 6 6 6 6 6 6 6 6 6 6 ...
## $ order : int 789 822 870 815 786 821 818 779 879 795 ...
## $ subregion : chr NA NA NA NA ...
## $ UG : num 3 3 3 3 3 3 3 3 3 3 ...
## $ G : num 1 1 1 1 1 1 1 1 1 1 ...
## $ SpecialUG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SpecialG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ExhangeVisiting: num 0 0 0 0 0 0 0 0 0 0 ...
## $ Total : int 4 4 4 4 4 4 4 4 4 4 ...
# Plot the map
ggplot(world_map, aes(x=long, y=lat, group=group)) +
  geom_polygon(fill="white", color="black") +
  coord_map("mercator")
# Reorder the data: merge() changed the row order, and polygons are drawn in row order
world_map = world_map[order(world_map$group, world_map$order),]
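A small sketch of what this reordering does, on a hypothetical four-point outline (`pts` and its coordinates are made up). `geom_polygon` connects points in the order the rows appear, so the rows need to be sorted by polygon (`group`) and then by position within the polygon (`order`).

```r
# Hypothetical outline: one polygon (group 1) whose points arrived shuffled
pts = data.frame(group = c(1, 1, 1, 1),
                 order = c(3, 1, 4, 2),
                 long  = c(1, 0, 0, 1),
                 lat   = c(1, 0, 1, 0))

# Sort by polygon first, then by drawing position within the polygon
pts = pts[order(pts$group, pts$order), ]
pts$order
```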
# Redo the plot
ggplot(world_map, aes(x=long, y=lat, group=group)) +
  geom_polygon(fill="white", color="black") +
  coord_map("mercator")
# Let's look for China
table(intlall$Citizenship)
##
## Albania Antigua and Barbuda Argentina Armenia Australia Austria
## 1 1 1 1 1 1
## Bahrain Bangladesh Belarus Belgium Bolivia Bosnia-Hercegovina
## 1 1 1 1 1 1
## Brazil Bulgaria Cambodia Cameroon Canada Chile
## 1 1 1 1 1 1
## China (People's Republic Of) Colombia Costa Rica Cote d'Ivoire Croatia Cyprus
## 1 1 1 1 1 1
## Czech Republic Denmark Ecuador Egypt El Salvador Estonia
## 1 1 1 1 1 1
## Ethiopia Finland France Georgia Germany Ghana
## 1 1 1 1 1 1
## Greece Guatemala Haiti Hong Kong Hungary Iceland
## 1 1 1 1 1 1
## India Indonesia Iran Iraq Ireland Israel
## 1 1 1 1 1 1
## Italy Jamaica Japan Jordan Kazakhstan Kenya
## 1 1 1 1 1 1
## Korea, South Kuwait Latvia Lebanon Lithuania Macedonia
## 1 1 1 1 1 1
## Malaysia Mauritius Mexico Moldova Mongolia Montenegro
## 1 1 1 1 1 1
## Morocco Nepal Netherlands New Zealand Nigeria Norway
## 1 1 1 1 1 1
## Pakistan Paraguay Peru Philippines Poland Portugal
## 1 1 1 1 1 1
## Qatar Romania Russia Rwanda Saudi Arabia Serbia
## 1 1 1 1 1 1
## Sierra Leone Singapore Slovakia Somalia South Africa Spain
## 1 1 1 1 1 1
## Sri Lanka St. Lucia St. Vincent & The Grenadines Sudan Sweden Switzerland
## 1 1 1 1 1 1
## Syria Taiwan Tanzania Thailand Trinidad & Tobago Tunisia
## 1 1 1 1 1 1
## Turkey Uganda Ukraine United Arab Emirates United Kingdom Unknown
## 1 1 1 1 1 1
## Uruguay Venezuela Vietnam West Bank Zambia Zimbabwe
## 1 1 1 1 1 1
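Scanning the table by eye works, but the mismatched names can also be listed directly. By default `merge` keeps only rows whose keys match exactly in both data frames, so any citizenship name without an identical map region is silently dropped. A minimal sketch with hypothetical stand-in vectors (`citizenship` and `map_regions` are short made-up samples, not the full columns):

```r
# Hypothetical stand-ins for intlall$Citizenship and the map's region column
citizenship = c("Albania", "China (People's Republic Of)", "Japan")
map_regions = c("Albania", "China", "Japan", "Aruba")

# Names in the student data with no exact match in the map data
unmatched = setdiff(citizenship, map_regions)
unmatched
```

On the real data, the analogous call would compare `intlall$Citizenship` against the `region` column of `map_data("world")`.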
# Let's "fix" that in the intlall dataset
intlall$Citizenship[intlall$Citizenship=="China (People's Republic Of)"] = "China"
# We'll repeat our merge and order from before
world_map = merge(map_data("world"), intlall,
                  by.x = "region",
                  by.y = "Citizenship")
world_map = world_map[order(world_map$group, world_map$order),]
ggplot(world_map, aes(x=long, y=lat, group=group)) +
  geom_polygon(aes(fill=Total), color="black") +
  coord_map("mercator")
# We can try other projections - this one is visually interesting
ggplot(world_map, aes(x=long, y=lat, group=group)) +
  geom_polygon(aes(fill=Total), color="black") +
  coord_map("ortho", orientation=c(20, 30, 0))
ggplot(world_map, aes(x=long, y=lat, group=group)) +
  geom_polygon(aes(fill=Total), color="black") +
  coord_map("ortho", orientation=c(-37, 175, 0))
# First, let's make sure we have ggplot2 loaded
library(ggplot2)
# Now let's load our dataframe
households = read.csv("households.csv")
str(households)
## 'data.frame': 8 obs. of 7 variables:
## $ Year : int 1970 1980 1990 1995 2000 2005 2010 2012
## $ MarriedWChild : num 40.3 30.9 26.3 25.5 24.1 22.9 20.9 19.6
## $ MarriedWOChild: num 30.3 29.9 29.8 28.9 28.7 28.3 28.8 29.1
## $ OtherFamily : num 10.6 12.9 14.8 15.6 16 16.7 17.4 17.8
## $ MenAlone : num 5.6 8.6 9.7 10.2 10.7 11.3 11.9 12.3
## $ WomenAlone : num 11.5 14 14.9 14.7 14.8 15.3 14.8 15.2
## $ OtherNonfamily: num 1.7 3.6 4.6 5 5.7 5.6 6.2 6.1
# Load reshape2
library(reshape2)
# Let's look at the first two columns of our households dataframe
households[,1:2]
## Year MarriedWChild
## 1 1970 40.3
## 2 1980 30.9
## 3 1990 26.3
## 4 1995 25.5
## 5 2000 24.1
## 6 2005 22.9
## 7 2010 20.9
## 8 2012 19.6
# First few rows of our melted households dataframe
head(melt(households, id="Year"))
## Year variable value
## 1 1970 MarriedWChild 40.3
## 2 1980 MarriedWChild 30.9
## 3 1990 MarriedWChild 26.3
## 4 1995 MarriedWChild 25.5
## 5 2000 MarriedWChild 24.1
## 6 2005 MarriedWChild 22.9
households[,1:3]
## Year MarriedWChild MarriedWOChild
## 1 1970 40.3 30.3
## 2 1980 30.9 29.9
## 3 1990 26.3 29.8
## 4 1995 25.5 28.9
## 5 2000 24.1 28.7
## 6 2005 22.9 28.3
## 7 2010 20.9 28.8
## 8 2012 19.6 29.1
melt(households, id="Year")[1:10,3]
## [1] 40.3 30.9 26.3 25.5 24.1 22.9 20.9 19.6 30.3 29.9
melt(households, id="Year")[1:10,]
## Year variable value
## 1 1970 MarriedWChild 40.3
## 2 1980 MarriedWChild 30.9
## 3 1990 MarriedWChild 26.3
## 4 1995 MarriedWChild 25.5
## 5 2000 MarriedWChild 24.1
## 6 2005 MarriedWChild 22.9
## 7 2010 MarriedWChild 20.9
## 8 2012 MarriedWChild 19.6
## 9 1970 MarriedWOChild 30.3
## 10 1980 MarriedWOChild 29.9
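The melted frame uses the default column names `variable` and `value`. As a sketch, `melt` from reshape2 also accepts `variable.name` and `value.name` to set more descriptive names. The data frame `wide` below retypes the first two rows shown above, and the names `HouseholdType` and `Percent` are hypothetical choices.

```r
library(reshape2)

# First two rows and three columns of households, typed in by hand
wide = data.frame(Year = c(1970, 1980),
                  MarriedWChild  = c(40.3, 30.9),
                  MarriedWOChild = c(30.3, 29.9))

# variable.name and value.name replace the default column names
# "variable" and "value" in the melted result
long = melt(wide, id.vars = "Year",
            variable.name = "HouseholdType",
            value.name = "Percent")
long
```

Each non-id column of the wide frame becomes a block of rows in the long frame, which is the shape ggplot2 needs to map the household type to color.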
# Plot it
ggplot(melt(households, id="Year"),
       aes(x=Year, y=value, color=variable)) +
  geom_line(size=2) + geom_point(size=5) +
  ylab("Percentage of Households")