“The picture-examining eye is the best finder we have of the wholly unanticipated” - John Tukey
Visualizing data allows us to discern relationships, structures, distributions, outlines, patterns, behaviors, dependencies, and outcomes
Useful for initial data exploration, for interpreting your model, and for communicating your results
“WHO is the authority for health within the United Nations system. It is responsible for providing leadership on global health matters, shaping the health research agenda, setting norms and standards, articulating evidence-based policy options, providing technical support to countries and monitoring and assessing health trends.”
WHO communicates information about global health in order to inform citizens, donors, policymakers, and organizations across the world
Its primary publication is the “World Health Report”
Each issue focuses on a specific aspect of global health, and includes statistics and experts’ assessments
WHO also maintains an open, online repository of global health data
WHO provides some data visualizations, which help it communicate more effectively with the public
A data visualization is a mapping of data properties to visual properties
Data properties are usually numerical or categorical
Visual properties can be (x,y) coordinates, colors, sizes, shapes and heights
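ggplot graphics consist of at least 3 elements: the data (a data frame), an aesthetic mapping from variables to visual properties, and a geometric object (points, lines, bars, and so on)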
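As a rough sketch of how data, aesthetic mapping, and geometry fit together, the following uses a small made-up data frame. The `toy` data and its column names are assumptions for illustration only and are not part of the WHO data.

```r
library(ggplot2)

# Hypothetical toy data, made up purely for illustration (not from WHO.csv)
toy = data.frame(
  income    = c(1, 2, 4, 8, 16),
  fertility = c(6.0, 4.5, 3.2, 2.4, 1.8),
  group     = c("A", "A", "B", "B", "B")
)

# Elements 1 and 2: the data, plus an aesthetic mapping that ties
# columns to the x position, y position, and color
p = ggplot(toy, aes(x = income, y = fertility, color = group))

# Element 3: a geometric object, here one point per row
p_points = p + geom_point(size = 3)

print(p_points)
```

Without the `geom_point()` layer, `p` would render only an empty set of axes, which is why a geometry is needed in addition to the data and the mapping.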
WHO’s online data repository of global health information is used by citizens, policymakers, and organizations across the world.
Visualizing the data facilitates the understanding and communication of global health trends at a glance
ggplot in R lets you visualize for exploration, modeling, and sharing results
# Read in data
WHO = read.csv("WHO.csv")
# Output structure
str(WHO)
## 'data.frame': 194 obs. of 13 variables:
## $ Country : Factor w/ 194 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Region : Factor w/ 6 levels "Africa","Americas",..: 3 4 1 4 1 2 2 4 6 4 ...
## $ Population : int 29825 3162 38482 78 20821 89 41087 2969 23050 8464 ...
## $ Under15 : num 47.4 21.3 27.4 15.2 47.6 ...
## $ Over60 : num 3.82 14.93 7.17 22.86 3.84 ...
## $ FertilityRate : num 5.4 1.75 2.83 NA 6.1 2.12 2.2 1.74 1.89 1.44 ...
## $ LifeExpectancy : int 60 74 73 82 51 75 76 71 82 81 ...
## $ ChildMortality : num 98.5 16.7 20 3.2 163.5 ...
## $ CellularSubscribers : num 54.3 96.4 99 75.5 48.4 ...
## $ LiteracyRate : num NA NA NA NA 70.1 99 97.8 99.6 NA NA ...
## $ GNI : num 1140 8820 8310 NA 5230 ...
## $ PrimarySchoolEnrollmentMale : num NA NA 98.2 78.4 93.1 91.1 NA NA 96.9 NA ...
## $ PrimarySchoolEnrollmentFemale: num NA NA 96.4 79.4 78.2 84.5 NA NA 97.5 NA ...
# Plot
plot(WHO$GNI, WHO$FertilityRate)
# Let's redo this using ggplot
# Install (once, if needed) and load the ggplot2 library:
# install.packages("ggplot2")
library(ggplot2)
# Create the ggplot object with the data and the aesthetic mapping:
scatterplot = ggplot(WHO, aes(x = GNI, y = FertilityRate))
# Add the geom_point geometry
scatterplot + geom_point()
# Make a line graph instead:
scatterplot + geom_line()
# Switch back to our points:
scatterplot + geom_point()
# Redo the plot with blue triangles instead of circles:
scatterplot + geom_point(color = "blue", size = 3, shape = 17)
# Another option:
scatterplot + geom_point(color = "darkred", size = 3, shape = 8)
# Add a title to the plot:
scatterplot + geom_point(colour = "blue", size = 3, shape = 17) + ggtitle("Fertility Rate vs. Gross National Income")
# Save our plot:
fertilityGNIplot = scatterplot + geom_point(colour = "blue", size = 3, shape = 17) + ggtitle("Fertility Rate vs. Gross National Income")
pdf("MyPlot.pdf")
print(fertilityGNIplot)
dev.off()
## png
## 2
# Color the points by region:
ggplot(WHO, aes(x = GNI, y = FertilityRate, color = Region)) + geom_point()
# Color the points according to life expectancy:
ggplot(WHO, aes(x = GNI, y = FertilityRate, color = LifeExpectancy)) + geom_point()
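The two color mappings behave differently because Region is categorical and LifeExpectancy is numeric. A minimal sketch of that difference, using a made-up data frame (the `toy` data and the `region` and `score` columns are assumptions for illustration, not WHO data):

```r
library(ggplot2)

# Hypothetical toy data, made up for illustration (not from WHO.csv)
toy = data.frame(
  x      = 1:6,
  y      = c(2, 4, 3, 6, 5, 7),
  region = factor(c("North", "North", "South", "South", "East", "East")),
  score  = c(50, 60, 70, 75, 80, 85)
)

# Categorical variable: ggplot2 assigns one distinct color per factor level
by_region = ggplot(toy, aes(x = x, y = y, color = region)) + geom_point(size = 3)

# Numeric variable: ggplot2 maps values onto a continuous color gradient
by_score = ggplot(toy, aes(x = x, y = y, color = score)) + geom_point(size = 3)

# Count the distinct colors actually assigned to the points in each plot
n_region_colors = length(unique(layer_data(by_region)$colour))
n_score_colors  = length(unique(layer_data(by_score)$colour))
print(c(region = n_region_colors, score = n_score_colors))
```

With a factor, the legend lists each level separately; with a numeric column, the legend is a color bar, so nearby values get similar shades.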
# Is the fertility rate of a country a good predictor of the percentage of the population under 15?
ggplot(WHO, aes(x = FertilityRate, y = Under15)) + geom_point()
# Let's try a log transformation:
ggplot(WHO, aes(x = log(FertilityRate), y = Under15)) + geom_point()
# Simple linear regression model to predict the percentage of the population under 15, using the log of the fertility rate:
mod = lm(Under15 ~ log(FertilityRate), data = WHO)
summary(mod)
##
## Call:
## lm(formula = Under15 ~ log(FertilityRate), data = WHO)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.3131 -1.7742 0.0446 1.7440 7.7174
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.6540 0.4478 17.09 <2e-16 ***
## log(FertilityRate) 22.0547 0.4175 52.82 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.65 on 181 degrees of freedom
## (11 observations deleted due to missingness)
## Multiple R-squared: 0.9391, Adjusted R-squared: 0.9387
## F-statistic: 2790 on 1 and 181 DF, p-value: < 2.2e-16
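To make the fitted coefficients concrete, here is a small worked example that plugs them into the model equation Under15 = intercept + slope * log(FertilityRate). The helper `predict_under15` is a hypothetical name introduced for illustration; with the real `mod` object, `predict()` would do the same job.

```r
# Coefficients copied from the summary(mod) output above
intercept = 7.6540
slope     = 22.0547

# Hypothetical helper (not part of the original analysis):
# predicted percentage of the population under 15 for a given fertility rate
predict_under15 = function(fertility_rate) {
  intercept + slope * log(fertility_rate)
}

# Prediction for a fertility rate of 2 children per woman (about 22.9)
pred_at_2 = predict_under15(2)

# Because the predictor is log(FertilityRate), doubling the fertility rate
# always adds the same amount, slope * log(2) (about 15.3 percentage points)
doubling_effect = slope * log(2)

print(c(pred_at_2 = pred_at_2, doubling_effect = doubling_effect))
```

The log transformation means the model describes the effect of proportional changes in fertility rate, not absolute ones.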
# Add this regression line to our plot:
ggplot(WHO, aes(x = log(FertilityRate), y = Under15)) + geom_point() + stat_smooth(method = "lm")
# 99% confidence interval
ggplot(WHO, aes(x = log(FertilityRate), y = Under15)) + geom_point() + stat_smooth(method = "lm", level = 0.99)
# No confidence interval in the plot
ggplot(WHO, aes(x = log(FertilityRate), y = Under15)) + geom_point() + stat_smooth(method = "lm", se = FALSE)
# Change the color of the regression line:
ggplot(WHO, aes(x = log(FertilityRate), y = Under15)) + geom_point() + stat_smooth(method = "lm", colour = "orange")
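The shaded band controlled by `level` can be approximated numerically with base R. The sketch below assumes the band corresponds to the pointwise confidence interval of the fitted line, and it uses simulated data (the `toy` data frame is made up for illustration, not WHO data) to show that a 99% interval is wider than a 95% one.

```r
# Hypothetical simulated data, for illustration only (not from WHO.csv)
set.seed(1)
toy = data.frame(x = seq(1, 10, length.out = 40))
toy$y = 3 + 2 * toy$x + rnorm(40, sd = 1.5)

fit = lm(y ~ x, data = toy)

# Pointwise confidence intervals for the fitted mean at three x values
new_x = data.frame(x = c(2, 5, 8))
ci95 = predict(fit, newdata = new_x, interval = "confidence", level = 0.95)
ci99 = predict(fit, newdata = new_x, interval = "confidence", level = 0.99)

# Width of each interval (upper bound minus lower bound)
width95 = ci95[, "upr"] - ci95[, "lwr"]
width99 = ci99[, "upr"] - ci99[, "lwr"]
print(cbind(new_x, width95, width99))
```

Raising the level widens the band at every x value, which matches what the `level = 0.99` plot shows compared with the default.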