Introduction to Data Visualization

“A picture is worth a thousand words.” Because of the way the human brain processes information, using charts or graphs to visualize large amounts of complex data is easier than poring over spreadsheets or reports. Data visualization is a quick, easy way to convey concepts in a universal manner – and you can experiment with different scenarios by making slight adjustments.

Purpose of Data Visualization

  • The principal purpose of a graph is to answer questions about data.
  • The purpose of all statistical procedures is to make data understandable.
  • Whether you are reading a data report at work or conducting a cancer research project, it is a good idea to be able to interpret large amounts of numerical data.
  • The ultimate purpose of a graph is to make a “bunch of numbers” (the data) understandable.

What is ggplot2?

  • ggplot2 is a data visualization package for the statistical programming language R.
  • Created by Hadley Wickham in 2005, ggplot2 is an implementation of Leland Wilkinson’s Grammar of Graphics—a general scheme for data visualization which breaks up graphs into semantic components such as scales and layers.

Types of Visualization

In statistics, we generally have two kinds of visualization:

  • Exploratory data visualization: Exploring the data visually to find patterns among the data entities.
  • Explanatory data visualization: Showcasing the identified patterns using simple graphs.

Why Visualization?

“A picture is worth a thousand words”

  • Data visualizations make big and small data easier for the human brain to understand, and visualization also makes it easier to detect patterns, trends, and outliers in groups of data.
  • Good data visualizations should place meaning into complicated datasets so that their message is clear and concise.

Grammar of Graphics

  • Data - The dataset being plotted
  • Aesthetics - The scales onto which we plot our data
  • Geometry - The visual elements used for our data
  • Facet - Groups by which we divide the data
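
As a rough sketch of how these four components map onto ggplot2 code, the example below uses R's built-in mtcars dataset as a stand-in (it is not the dataset used later in this course). The object name p is arbitrary.

```r
library(ggplot2)

# Data: the built-in mtcars data frame (a stand-in for illustration)
p <- ggplot(mtcars) +
  # Aesthetics: map car weight to the x-axis and fuel economy to the y-axis
  aes(x = wt, y = mpg) +
  # Geometry: draw each car as a point
  geom_point() +
  # Facet: one panel per number of cylinders
  facet_wrap(~cyl)

p
```

Each component is added with +, so any one of them can be changed without touching the others.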

Definition of Key Terms and Concepts

Variables

  • Variable—A characteristic that describes some physical or mental aspect of the individual, group, or inanimate object. The key point is that variables can vary and can be expressed as a particular numerical value or as falling in a unique category. The following are some examples:

    • Gender as the observed variable could vary in two ways: male or female (discrete).
    • Mental state could be a variable and vary in five ways: very anxious, anxious, neutral, relaxed, and very relaxed (discrete).
    • Number of errors on a written test can be considered as a variable. If there were 10 questions, then such a variable could vary from 0 to 10 errors (discrete).
    • Make of automobile owned or leased can be considered as a variable (discrete).
    • Weight of an individual can be considered as a variable (continuous).
    • Speed of an airplane can be considered as a variable (continuous).
    • How high one can jump can be considered as a variable (continuous).

Discrete Variable

  • Discrete Variable—A characteristic that describes some physical or mental aspect of the individual, group, or inanimate object that has been observed.

  • The term discrete describes how the variable is measured or counted.

  • Discrete variables vary in a manner so that the characteristics being measured fall in unique categories.

  • Such categories must be mutually exclusive, which means that any observation must fall in one and only one category.

  • The categories must also be inclusive, which means that there must be a category for every possible observation.

  • Examples of discrete variables from the list above are gender, mental state, number of errors, and make of automobile.

Continuous Variable

  • Continuous Variable—A characteristic that describes some physical or mental aspect of the individual, group, or inanimate object that has been observed.

  • The term continuous describes how the variable is measured. Continuous variables vary by taking on any one of a large number of measures (often infinite).

  • Examples of continuous variables from the list above are weight, speed, and how high one can jump.

Category

  • Category—A natural grouping of the characteristics of a discrete variable.

  • Think of the type of automobile as a variable having categories such as Porsche, Ferrari, and Maserati.

  • The automobile possessing “Porsche” characteristics is counted or placed in a category titled Porsche.

Continuous Distribution

  • Continuous Distribution—Plainly speaking, a continuous distribution is just a “bunch of numbers” that resulted when something was measured at the continuous level.

  • We use various types of statistical analysis, including graphs, to make sense of such distributions.

  • Histograms and line graphs are the most commonly used graphing techniques to describe continuous distributions.

Discrete Distribution

  • Discrete Distribution—Could also consist of a bunch of numbers, but the numbers take on a different meaning. Using the gender variable as an example, we could assign the value label of 1 to the male category and 2 to the female category.

  • This being the case, you now have a column of numbers consisting of 1s and 2s that you are trying to understand.

  • You could then produce a bar graph of the two categories, for example with geom_bar() in ggplot2.
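
A minimal sketch of this coding scheme in R, using made-up responses (the vector gender_code is hypothetical and does not come from any dataset in this course):

```r
library(ggplot2)

# Hypothetical coded responses: 1 = male, 2 = female
gender_code <- c(1, 2, 2, 1, 2, 2)

# Attach value labels so the codes are treated as categories, not quantities
gender <- factor(gender_code, levels = c(1, 2), labels = c("Male", "Female"))

# Frequency of each category: these are the heights of the bars
counts <- table(gender)
counts

# Bar graph of the same frequencies
p <- ggplot(data.frame(gender)) +
  aes(x = gender) +
  geom_bar()
p
```

Converting the codes to a factor matters: left as plain numbers, the 1s and 2s would be treated as a continuous scale.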

Levels of Measurement

  • Levels of Measurement—A classification method used to define how variables are measured. For most statistical work, there are four possible ways to measure variables:
  1. nominal,
  2. ordinal,
  3. interval, and
  4. ratio
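
One way to picture the four levels is by how each would typically be stored in R. The values below are made up for illustration, and the mapping to R types is a common convention rather than a rule.

```r
# Hypothetical values, for illustration only

# Nominal: categories with no inherent order
make <- factor(c("Porsche", "Ferrari", "Maserati"))

# Ordinal: categories with a meaningful order
state <- factor(
  c("relaxed", "anxious", "neutral"),
  levels = c("very anxious", "anxious", "neutral", "relaxed", "very relaxed"),
  ordered = TRUE
)
state[1] > state[2]  # comparisons are defined for ordered factors

# Interval: differences are meaningful, but zero is arbitrary (degrees Celsius)
temp_c <- c(18.5, 21.0, 25.5)
temp_c[3] - temp_c[1]

# Ratio: a true zero, so ratios are meaningful (weight in kilograms)
weight_kg <- c(57, 84, 62)
weight_kg[2] / weight_kg[1]
```

ggplot2 uses this distinction when it picks scales: factors get discrete scales and numeric vectors get continuous ones.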

Independent Variable

  • Independent Variable—The independent variable is manipulated and has the freedom to take on different values.

  • It is the presumed cause of change in the dependent variable in experimental work.

  • In observational-type studies, it is often referred to as the predictor variable.

  • This definition hinges on the idea that knowledge of the predictor variable will facilitate the successful estimation of the value for the dependent variable.

Dependent Variable

  • Dependent Variable—This variable can take on different values; however, these values are said to “depend” on the value of the independent variable.

  • In experimental work, we test whether the manipulation of the independent variable results in a significant change in the dependent variable.

  • In observational studies, the value of the dependent variable can be better predicted by knowledge of the value of the independent variable.


Horizontal Axis

  • Horizontal Axis—In this course, the term horizontal axis is used infrequently, as the preferred terminology is the x-axis.

  • Both these terms will always refer to the horizontal axis of the chart you are building.

  • During certain ggplot2 operations, you will find that the vertical axis is referred to as the x-axis. When this happens, we revert to the term horizontal axis so as to avoid confusion.
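
One case where this happens is coord_flip(), which swaps the two axes after the aesthetics have been mapped. In the sketch below (built-in mtcars dataset as a stand-in), the variable mapped to x ends up on the vertical axis:

```r
library(ggplot2)

p <- ggplot(mtcars) +
  aes(x = factor(cyl)) +  # cyl is mapped to the x aesthetic...
  geom_bar() +
  coord_flip()            # ...but is drawn on the vertical axis after flipping

p
```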

Vertical Axis

  • Vertical Axis—The vertical axis rises perpendicular to the x-axis (horizontal) and is most frequently referred to as the y-axis.

Basic Principles of {ggplot2}

The {ggplot2} package is based on the principles of “The Grammar of Graphics” (hence “gg” in the name of {ggplot2}), that is, a coherent system for describing and building graphs. The main idea is to design a graphic as a succession of layers.

The main layers are:

  1. The dataset that contains the variables that we want to represent. This is done with the ggplot() function and comes first.
  2. The variable(s) to represent on the x and/or y-axis, and the aesthetic elements (such as color, size, fill, shape and transparency) of the objects to be represented. This is done with the aes() function (abbreviation of aesthetic).
  3. The type of graphical representation (scatter plot, line plot, barplot, histogram, boxplot, etc.). This is done with the functions geom_point(), geom_line(), geom_bar(), geom_histogram(), geom_boxplot(), etc.
  4. If needed, additional layers (such as labels, annotations, scales, axis ticks, legends, themes, facets, etc.) can be added to personalize the plot.
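
The four kinds of layers can be sketched in a single plot. The example uses the built-in mtcars dataset as a stand-in, and the title and axis labels are illustrative choices.

```r
library(ggplot2)

p <- ggplot(mtcars) +                          # 1. dataset
  aes(x = wt, y = mpg, color = factor(cyl)) +  # 2. variables and aesthetics
  geom_point() +                               # 3. type of plot
  labs(                                        # 4. optional layers: labels...
    title = "Fuel economy by weight",
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    color = "Cylinders"
  ) +
  theme_minimal()                              # ...and a theme

p
```

Only the first three layers are needed to get a plot; labs() and theme_minimal() change how it looks, not what it shows.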

To create a plot, we thus first need to specify the data in the ggplot() function and then add the required layers such as the variables, the aesthetic elements and the type of plot:

ggplot(data) +
  aes(x = var_x, y = var_y) +
  geom_x()

  • data in ggplot() is the name of the data frame which contains the variables var_x and var_y.
  • The + symbol adds each layer to the plot. Write the + at the end of a line of code, not at the beginning of the next line; otherwise R throws an error.
  • The layer aes() indicates what variables will be used in the plot and more generally, the aesthetic elements of the plot.
  • Finally, x in geom_x() represents the type of plot.
  • Other layers are usually not required unless we want to personalize the plot further.

Note that it is a good practice to write one line of code per layer to improve code readability.

Loading Packages and Exploring Data

Load Packages

library(tidyverse) 
library(gt)
library(gridExtra) 

Exploring Data

data <- read.csv("../data/pulse_data.csv")

# examine first few rows 
data %>% 
  head() %>% 
  gt()

Height  Weight  Age  Gender  Smokes  Alcohol  Exercise  Ran  Pulse1  Pulse2  BMI       BMICat
1.73    57      18   Female  No      Yes      Moderate  No   86      88      19.04507  Underweight
1.79    58      19   Female  No      Yes      Moderate  Yes  82      150     18.10181  Underweight
1.67    62      18   Female  No      Yes      High      Yes  96      176     22.23099  Normal
1.95    84      18   Male    No      Yes      High      No   71      73      22.09073  Normal
1.73    64      18   Female  No      Yes      Low       No   90      88      21.38394  Normal
1.84    74      22   Male    No      Yes      Low       Yes  78      141     21.85728  Normal
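
Beyond head(), a few summary calls give a quick feel for a dataset. The sketch below rebuilds a small stand-in from the six rows printed above (four of the columns only), so it runs without the CSV file. The name pulse_head is chosen here for illustration.

```r
library(dplyr)

# Stand-in for pulse_data.csv: the six rows shown above, four columns only
pulse_head <- data.frame(
  Height = c(1.73, 1.79, 1.67, 1.95, 1.73, 1.84),
  Weight = c(57, 58, 62, 84, 64, 74),
  Gender = c("Female", "Female", "Female", "Male", "Female", "Male"),
  BMICat = c("Underweight", "Underweight", "Normal", "Normal", "Normal", "Normal")
)

glimpse(pulse_head)         # column names, types, and first values
summary(pulse_head$Weight)  # minimum, quartiles, mean, and maximum
bmi_counts <- count(pulse_head, BMICat)  # observations per category
bmi_counts
```

The same calls work on the full data frame once it has been read with read.csv().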

Working with ggplot2

  1. We start by specifying the data:

ggplot(data) # data

  2. Then we add the variables to be represented with the aes() function:

ggplot(data) + # data
  aes(x = Ran) # variables

  3. Finally, we indicate the type of plot:

ggplot(data) + # data
  aes(x = Ran) + # variables
  geom_bar() # type of plot

You will also sometimes see the aesthetic elements (aes() with the variables) inside the ggplot() function in addition to the dataset:

ggplot(data, aes(x = Ran)) +
  geom_bar()

Simple Bar Charts

  • A barplot shows the relationship between a numeric and a categorical variable. Each level of the categorical variable is represented as a bar, and the length of the bar represents its numeric value.
  • The barplot is sometimes described as a boring way to visualize information. However, it is probably the most efficient way to show this kind of data. Ordering the bars and providing good annotation are often necessary.

Purpose of Simple Bar Charts

  • To describe the number of observations in each category of the discrete variable
  • To visualize estimated error for discrete variables
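
As a rough sketch of the second purpose, the mean of a numeric variable can be drawn per category with geom_col() and its estimated error with geom_errorbar(). The example uses the built-in mtcars dataset as a stand-in. Using one standard error of the mean for the bar length is an assumption made here, not something this course prescribes.

```r
library(dplyr)
library(ggplot2)

# Mean and standard error of mpg for each number of cylinders
mpg_summary <- mtcars %>%
  group_by(cyl = factor(cyl)) %>%
  summarise(
    mean_mpg = mean(mpg),
    se_mpg   = sd(mpg) / sqrt(n())
  )

p <- ggplot(mpg_summary) +
  aes(x = cyl, y = mean_mpg) +
  geom_col(fill = "#97B3C6") +  # bar height = group mean
  geom_errorbar(                # error bar = mean +/- 1 standard error
    aes(ymin = mean_mpg - se_mpg, ymax = mean_mpg + se_mpg),
    width = 0.2
  )

p
```

geom_col() is used instead of geom_bar() because the bar heights are already computed; geom_bar() counts rows by default.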

Simple Bar Graph (Vertical Orientation)

# Single categorical variable
data %>%
  ggplot(aes(x = BMICat)) +
  geom_bar(fill = "#97B3C6")

# Same chart with bars sorted by frequency (most common category first)
data %>%
  ggplot(aes(x = fct_infreq(BMICat))) +
  geom_bar(fill = "#97B3C6")

Simple Bar Graph (Horizontal Orientation)

# Sorted bar chart, flipped to horizontal orientation
data %>%
  ggplot(aes(x = fct_infreq(BMICat))) +
  geom_bar(fill = "#97B3C6") +
  coord_flip()

Line plot

Line plots, particularly useful for time series and financial data, can be created in the same way using geom_line():

data %>%
  ggplot(aes(x = Age, y = Weight)) +
  geom_line()

Combination of line and points

data %>% 
  ggplot(aes(x = Age, y = Weight)) +
  geom_point()+
  geom_line() # add line

Histogram

ggplot(data) +
  aes(x = Age) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

By default, the number of bins is equal to 30. You can change this value using the bins argument inside the geom_histogram() function:

ggplot(data) +
  aes(x = Age) +
  geom_histogram(bins = round(sqrt(nrow(data)))) # bins should be a whole number
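
The message also mentions binwidth, which sets the width of each bin in the units of the variable instead of fixing how many bins there are. A minimal sketch, using the built-in faithful dataset as a stand-in and arbitrary values for both arguments:

```r
library(ggplot2)

# Old Faithful waiting times (minutes) as a stand-in continuous variable
p_bins <- ggplot(faithful) +
  aes(x = waiting) +
  geom_histogram(bins = 15)     # fix the number of bins

p_width <- ggplot(faithful) +
  aes(x = waiting) +
  geom_histogram(binwidth = 5)  # fix the bin width: 5 minutes per bar

p_bins
p_width
```

Setting either argument explicitly also silences the stat_bin() message.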

Density plot

ggplot(data) +
  aes(x = Age) +
  geom_density()

Combination of histogram and densities

ggplot(data) +
  aes(x = Age, y = after_stat(density)) + # after_stat() replaces the deprecated ..density..
  geom_histogram() +
  geom_density()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Or superimpose several densities:

ggplot(data) +
  aes(x = Age, color = Gender, fill = Gender) +
  geom_density(alpha = 0.25) # add transparency

Boxplot

# Boxplot for one variable
ggplot(data) +
  aes(x = "", y = Pulse1) +
  geom_boxplot()
Warning: Removed 1 rows containing non-finite values (stat_boxplot).

# Boxplot by factor
ggplot(data) +
  aes(x = Gender, y = Pulse1) +
  geom_boxplot()
Warning: Removed 1 rows containing non-finite values (stat_boxplot).
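
The warning indicates that one row has a missing (or otherwise non-finite) Pulse1 value and was dropped before plotting. One way to make this explicit is to filter such rows yourself. The sketch below uses a small made-up data frame; pulse_demo and its values are hypothetical, not the course dataset.

```r
library(dplyr)
library(ggplot2)

# Hypothetical data with one missing pulse measurement
pulse_demo <- data.frame(
  Gender = c("Female", "Female", "Female", "Male", "Male", "Male"),
  Pulse1 = c(86, 82, NA, 71, 78, 90)
)

# Drop rows with a missing Pulse1 so the plot is built without the warning
pulse_complete <- pulse_demo %>%
  filter(!is.na(Pulse1))

p <- ggplot(pulse_complete) +
  aes(x = Gender, y = Pulse1) +
  geom_boxplot()

p
```

Filtering first also makes it easy to report how many observations were excluded, for example with nrow(pulse_demo) - nrow(pulse_complete).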