Analytics Edge: Unit 7 - An Introduction to Visualization

Visualizing the World

Why Visualization?

“The picture-examining eye is the best finder we have of the wholly unanticipated” - John Tukey
Visualizing data allows us to discern relationships, structures, distributions, outlines, patterns, behaviors, dependencies, and outcomes
Useful for initial data exploration, for interpreting your model, and for communicating your results

Initial Exploration Shows a Relationship

Explore Further: Color by Factor

Plot the Regression Line

Add Geographical Data to a Map

Show Relationships in a Heatmap

Make Histograms - Explore Categories

Color a Map According to Data

The Power of Visualizations

We will see how visualizations can be used to
- Better understand data
- Communicate information to the public
- Show the results of analytical models

The World Health Organization

“WHO is the authority for health within the United Nations system. It is responsible for providing leadership on global health matters, shaping the health research agenda, setting norms and standards, articulating evidence-based policy options, providing technical support to countries and monitoring and assessing health trends.”

The World Health Report

WHO communicates information about global health in order to inform citizens, donors, policymakers, and organizations across the world
Its primary publication is the “World Health Report”
Each issue focuses on a specific aspect of global health, and includes statistics and experts’ assessments

Online Data Repository

WHO also maintains an open, online repository of global health data
WHO provides some data visualizations, which help it communicate more effectively with the public

What is a Data Visualization?

A mapping of data properties to visual properties
Data properties are usually numerical or categorical
Visual properties can be (x,y) coordinates, colors, sizes, shapes and heights

Anscombe’s Quartet

Mean of X : 9.0
Variance of X : 11.0
Mean of Y : 7.50
Variance of Y : 4.12
Correlation between X and Y : 0.816
Regression Equation : Y = 3.00 + 0.500X
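
The summary statistics above can be checked directly, since base R ships the quartet as the built-in data frame `anscombe` (columns `x1`..`x4` and `y1`..`y4`). The following is a minimal sketch; the name `quartetStats` is only illustrative. It computes the same statistics for each of the four data sets and then plots them, which is where the differences become visible.

```r
# Anscombe's quartet is available in base R as the data frame `anscombe`
quartetStats = sapply(1:4, function(i) {
  x = anscombe[[paste0("x", i)]]
  y = anscombe[[paste0("y", i)]]
  fit = lm(y ~ x)
  c(mean_x    = mean(x),
    var_x     = var(x),
    mean_y    = mean(y),
    var_y     = var(y),
    cor_xy    = cor(x, y),
    intercept = unname(coef(fit)[1]),
    slope     = unname(coef(fit)[2]))
})
colnames(quartetStats) = paste0("set", 1:4)

# The four columns are nearly identical after rounding
print(round(quartetStats, 2))

# Plot the four data sets side by side, each with its fitted line
op = par(mfrow = c(2, 2))
for (i in 1:4) {
  x = anscombe[[paste0("x", i)]]
  y = anscombe[[paste0("y", i)]]
  plot(x, y, main = paste("Data set", i))
  abline(lm(y ~ x), col = "red")
}
par(op)
```

The table of statistics gives no hint that the four data sets differ; the plots do.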

ggplot

“ggplot2 is a plotting system for R, based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts. It takes care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as providing a powerful model of graphics that makes it easy to produce complex multi-layered graphics.” - Hadley Wickham, creator, www.ggplot2.org

Graphics in Base R vs ggplot

In base R, each mapping of data properties to visual properties is its own special case
- Graphics composed of simple elements like points, lines
- Difficult to add elements to existing plots
In ggplot, the mapping of data properties to visual properties is done by adding layers to the plot
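
To make the contrast concrete, here is a small sketch that assumes the ggplot2 package is installed and uses the built-in `mtcars` data as a stand-in (it is not part of the WHO example). In base R the drawing commands paint onto the graphics device one after another, and the result is not stored as an object that can be revised later. In ggplot the plot is an object, and each `+` adds a layer to it.

```r
library(ggplot2)  # assumes ggplot2 is installed

# Base R: draw the points, then paint a line on top with a separate call.
# Nothing is stored; changing the plot means re-running the commands.
plot(mtcars$wt, mtcars$mpg, xlab = "wt", ylab = "mpg")
abline(lm(mpg ~ wt, data = mtcars), col = "red")

# ggplot2: start from the data and the aesthetic mapping (no layers yet)
p = ggplot(mtcars, aes(x = wt, y = mpg))

# Each + adds one layer to the stored plot object
pPoints = p + geom_point()
pBoth = pPoints + geom_smooth(method = "lm", formula = y ~ x)

# Nothing is drawn until the object is printed
print(pBoth)
```

Because `p`, `pPoints`, and `pBoth` are ordinary objects, any of them can be extended or printed again later.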

Grammar of Graphics

ggplot graphics consist of at least 3 elements:
- Data, in a data frame
- Aesthetic mapping describing how variables in the data frame are mapped to graphical attributes
  - Color, shape, scale, x-y axes, subsets
- Geometric objects determine how values are rendered graphically
  - Points, lines, boxplots, bars, polygons
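
The three elements can be seen as separate pieces in a short sketch. The data frame `df` and its values below are made up purely for illustration, and the sketch assumes ggplot2 is installed.

```r
library(ggplot2)  # assumes ggplot2 is installed

# 1. Data: a small, made-up data frame (values are illustrative only)
df = data.frame(
  income = c(1, 2, 4, 8, 16, 32),
  rate   = c(6.0, 5.1, 4.0, 3.2, 2.1, 1.8),
  group  = factor(c("A", "A", "B", "B", "C", "C"))
)

# 2. Aesthetic mapping: which variables drive which visual properties
mapping = aes(x = income, y = rate, color = group)

# 3. Geometric object: how the mapped values are rendered
p = ggplot(data = df, mapping = mapping) + geom_point(size = 3)
print(p)

# Same data and mapping, different geometry
pLine = ggplot(data = df, mapping = mapping) + geom_line()
print(pLine)
```

Only the geometric object changes between the two plots; the data and the aesthetic mapping are reused as they are.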

The Analytics Edge

WHO’s online data repository of global health information is used by citizens, policymakers, and organizations across the world.
Visualizing the data facilitates the understanding and communication of global health trends at a glance

ggplot in R lets you visualize for exploration, modeling, and sharing results

WHO Visualizations in R

Basic Scatterplot

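# Read in data (in R 4.0 and later, pass stringsAsFactors = TRUE to read.csv to get the Factor columns shown below)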
WHO = read.csv("WHO.csv")
# Output structure
str(WHO)
## 'data.frame':    194 obs. of  13 variables:
##  $ Country                      : Factor w/ 194 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Region                       : Factor w/ 6 levels "Africa","Americas",..: 3 4 1 4 1 2 2 4 6 4 ...
##  $ Population                   : int  29825 3162 38482 78 20821 89 41087 2969 23050 8464 ...
##  $ Under15                      : num  47.4 21.3 27.4 15.2 47.6 ...
##  $ Over60                       : num  3.82 14.93 7.17 22.86 3.84 ...
##  $ FertilityRate                : num  5.4 1.75 2.83 NA 6.1 2.12 2.2 1.74 1.89 1.44 ...
##  $ LifeExpectancy               : int  60 74 73 82 51 75 76 71 82 81 ...
##  $ ChildMortality               : num  98.5 16.7 20 3.2 163.5 ...
##  $ CellularSubscribers          : num  54.3 96.4 99 75.5 48.4 ...
##  $ LiteracyRate                 : num  NA NA NA NA 70.1 99 97.8 99.6 NA NA ...
##  $ GNI                          : num  1140 8820 8310 NA 5230 ...
##  $ PrimarySchoolEnrollmentMale  : num  NA NA 98.2 78.4 93.1 91.1 NA NA 96.9 NA ...
##  $ PrimarySchoolEnrollmentFemale: num  NA NA 96.4 79.4 78.2 84.5 NA NA 97.5 NA ...
# Plot 
plot(WHO$GNI, WHO$FertilityRate)
# Let's redo this using ggplot 
# Load the ggplot2 library (run install.packages("ggplot2") first if it is not installed):
library(ggplot2)

# Create the ggplot object with the data and the aesthetic mapping:
scatterplot = ggplot(WHO, aes(x = GNI, y = FertilityRate))
# Add the geom_point geometry
scatterplot + geom_point()

# Make a line graph instead:
scatterplot + geom_line()

# Switch back to our points:
scatterplot + geom_point()

# Redo the plot with blue triangles instead of circles:
scatterplot + geom_point(color = "blue", size = 3, shape = 17)

# Another option:
scatterplot + geom_point(color = "darkred", size = 3, shape = 8)

# Add a title to the plot:
scatterplot + geom_point(colour = "blue", size = 3, shape = 17) + ggtitle("Fertility Rate vs. Gross National Income")

# Save our plot:
fertilityGNIplot = scatterplot + geom_point(colour = "blue", size = 3, shape = 17) + ggtitle("Fertility Rate vs. Gross National Income")
pdf("MyPlot.pdf")
print(fertilityGNIplot)
dev.off()
## png 
##   2
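
As an alternative to the pdf() / print() / dev.off() sequence above, ggplot2 provides `ggsave()`, which picks the output format from the file extension. The sketch below is self-contained, so it builds a stand-in plot from the built-in `mtcars` data rather than from WHO.csv, and it writes to a temporary directory. The names `examplePlot` and `outFile` and the chosen width and height are illustrative assumptions.

```r
library(ggplot2)  # assumes ggplot2 is installed

# Stand-in plot from built-in data so the example runs without WHO.csv
examplePlot = ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(colour = "blue", size = 3, shape = 17) +
  ggtitle("Miles per Gallon vs. Weight")

# Write to a temporary directory; the .pdf extension selects the PDF device
outFile = file.path(tempdir(), "MyPlot.pdf")
ggsave(outFile, plot = examplePlot, width = 7, height = 5)  # size in inches

# Confirm the file was written
file.exists(outFile)
```

With the WHO plot from above, the equivalent call would be ggsave("MyPlot.pdf", plot = fertilityGNIplot).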

More Advanced Scatterplots

# Color the points by region: 
ggplot(WHO, aes(x = GNI, y = FertilityRate, color = Region)) + geom_point()

# Color the points according to life expectancy:
ggplot(WHO, aes(x = GNI, y = FertilityRate, color = LifeExpectancy)) + geom_point()


# Is the fertility rate of a country a good predictor of the percentage of the population under 15?
ggplot(WHO, aes(x = FertilityRate, y = Under15)) + geom_point()

# Let's try a log transformation:
ggplot(WHO, aes(x = log(FertilityRate), y = Under15)) + geom_point()

# Simple linear regression model to predict the percentage of the population under 15, using the log of the fertility rate:
mod = lm(Under15 ~ log(FertilityRate), data = WHO)
summary(mod)
## 
## Call:
## lm(formula = Under15 ~ log(FertilityRate), data = WHO)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.3131  -1.7742   0.0446   1.7440   7.7174 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          7.6540     0.4478   17.09   <2e-16 ***
## log(FertilityRate)  22.0547     0.4175   52.82   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 181 degrees of freedom
##   (11 observations deleted due to missingness)
## Multiple R-squared:  0.9391, Adjusted R-squared:  0.9387 
## F-statistic:  2790 on 1 and 181 DF,  p-value: < 2.2e-16
# Add this regression line to our plot:
ggplot(WHO, aes(x = log(FertilityRate), y = Under15)) + geom_point() + stat_smooth(method = "lm")

# 99% confidence interval
ggplot(WHO, aes(x = log(FertilityRate), y = Under15)) + geom_point() + stat_smooth(method = "lm", level = 0.99)

# No confidence interval in the plot
ggplot(WHO, aes(x = log(FertilityRate), y = Under15)) + geom_point() + stat_smooth(method = "lm", se = FALSE)

# Change the color of the regression line:
ggplot(WHO, aes(x = log(FertilityRate), y = Under15)) + geom_point() + stat_smooth(method = "lm", colour = "orange")
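
To read the fitted line as numbers, the coefficients from the regression summary above can be plugged into the model equation Under15 = 7.6540 + 22.0547 * log(FertilityRate). The sketch below hard-codes those two estimates so that it runs without WHO.csv; the helper name `predictUnder15` is hypothetical.

```r
# Coefficients copied from the summary(mod) output above
intercept = 7.6540
slope = 22.0547

# Hypothetical helper: predicted percentage of the population under 15
# for a given fertility rate, using the natural log as in the model
predictUnder15 = function(fertilityRate) {
  intercept + slope * log(fertilityRate)
}

# A fertility rate of 2 gives a prediction of about 22.9 percent
predictUnder15(2)

# Because the predictor is logged, doubling the fertility rate always adds
# slope * log(2) (about 15.3 percentage points), whatever the starting value
predictUnder15(4) - predictUnder15(2)
slope * log(2)
```

With the fitted model object available, predict(mod, newdata = data.frame(FertilityRate = 2)) should give the same value up to rounding of the coefficients.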