Introduction to Data Visualization

“A picture is worth a thousand words.” Because of the way the human brain processes information, reading charts or graphs of large, complex datasets is easier than poring over spreadsheets or reports. Data visualization is a quick way to convey concepts in a widely understood form, and you can experiment with different scenarios by making slight adjustments.

Purpose of Data Visualization

The principal purpose of a graph is to answer questions about data.
The purpose of all statistical procedures is to make data understandable.
Whether you are reading a data report at work or conducting a cancer research project, it is a good idea to be able to interpret large amounts of numerical data.
The ultimate purpose of a graph is to make a “bunch of numbers” (the data) understandable.

What is ggplot2?

ggplot2 is a data visualization package for the statistical programming language R.
Created by Hadley Wickham in 2005, ggplot2 is an implementation of Leland Wilkinson’s Grammar of Graphics, a general scheme for data visualization that breaks graphs up into semantic components such as scales and layers.

Types of Visualization

In statistics, we generally have two kinds of visualization:

Exploratory data visualization: Exploring the data visually to find patterns among the data entities.
Explanatory data visualization: Showcasing the identified patterns using simple graphs.

Why Visualization?

Data visualizations make big and small data easier for the human brain to understand, and visualization also makes it easier to detect patterns, trends, and outliers in groups of data.
Good data visualizations should place meaning into complicated datasets so that their message is clear and concise.

Grammar of Graphics

Data - The dataset being plotted
Aesthetics - The scales onto which we plot our data
Geometry - The visual elements used for our data
Facet - Groups by which we divide the data
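As a rough sketch of how these four components map onto ggplot2 code, the example below uses the built-in mtcars data frame as a stand-in dataset. The choice of variables (wt, mpg, cyl) is an assumption made purely for illustration.

```r
library(ggplot2)

# mtcars ships with R; it stands in for any data frame here.
p <- ggplot(mtcars) +        # Data: the dataset being plotted
  aes(x = wt, y = mpg) +     # Aesthetics: variables mapped onto the x and y scales
  geom_point() +             # Geometry: points as the visual element
  facet_wrap(vars(cyl))      # Facet: one panel per number of cylinders

p
```

Each component is added with +, so a component can be swapped out without touching the others.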

Definition of Key Terms and Concepts

Variables

Variable—A characteristic that describes some physical or mental aspect of the individual, group, or inanimate object. The key point is that variables can vary and can be expressed as a particular numerical value or as falling in a unique category. The following are some examples:
- Gender as the observed variable could vary in two ways: male or female (discrete).
- Mental state could be a variable and vary in five ways: very anxious, anxious, neutral, relaxed, and very relaxed (discrete).
- Number of errors on a written test can be considered as a variable. If there were 10 questions, then such a variable could vary from 0 to 10 errors (discrete).
- Make of automobile owned or leased can be considered as a variable (discrete).
- Weight of an individual can be considered as a variable (continuous).
- Speed of an airplane can be considered as a variable (continuous).
- How high one can jump can be considered as a variable (continuous).

Discrete Variable

Discrete Variable—A characteristic that describes some physical or mental aspect of the individual, group, or inanimate object that has been observed.
The term discrete describes how the variable is measured or counted.
Discrete variables vary in a manner so that the characteristics being measured fall in unique categories.
Such categories must be mutually exclusive, which means that any observation must fall in one and only one category.
The categories must also be inclusive, which means that there must be a category for every possible observation.
Examples of discrete variables from the list above are gender, mental state, number of errors, and make of automobile.

Continuous Variable

Continuous Variable—A characteristic that describes some physical or mental aspect of the individual, group, or inanimate object that has been observed.
The term continuous describes how the variable is measured. Continuous variables vary by taking on any one of a large number of measures (often infinite).
Examples of continuous variables from the list above are weight, speed, and how high one can jump.
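In R, the discrete/continuous distinction usually shows up in how a variable is stored. The sketch below uses invented values (not taken from any dataset in this course) for three of the examples listed above.

```r
# Hypothetical observations, invented for illustration only.
make   <- factor(c("Toyota", "Honda", "Toyota", "Ford"))  # discrete: unique categories
errors <- c(0L, 3L, 10L, 2L)                              # discrete: whole-number counts
weight <- c(57.0, 62.5, 84.2, 64.1)                       # continuous: any value in a range

levels(make)        # the mutually exclusive categories
table(make)         # number of observations falling in each category
is.integer(errors)  # counts are stored as integers
is.double(weight)   # measurements are stored as doubles
```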

Continuous Distribution

Continuous Distribution—Plainly speaking, a continuous distribution is just a “bunch of numbers” that resulted when something was measured at the continuous level.
We use various types of statistical analysis, including graphs, to make sense of such distributions.
Histograms and line graphs are the most commonly used graphing techniques to describe continuous distributions.

Discrete Distribution

Discrete Distribution—Could also consist of a bunch of numbers, but the numbers take on a different meaning. Using the gender variable as an example, we could assign the value label of 1 to the male category and 2 to the female category.
This being the case, you now have a column of numbers consisting of 1s and 2s that you are trying to understand.
You could produce a bar graph of these counts, for example with geom_bar() in ggplot2 (or with a package such as SPSS).
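The coding described above can be sketched in R as follows. The vector of 1s and 2s is invented for illustration; factor() attaches the value labels so that the counts and the bar graph show category names rather than bare numbers.

```r
library(ggplot2)

# Hypothetical column of codes: 1 = male, 2 = female.
gender_code <- c(1, 2, 2, 1, 2)

# Attach the value labels to the numeric codes.
gender <- factor(gender_code, levels = c(1, 2), labels = c("Male", "Female"))

table(gender)  # counts per category

# Bar graph of the discrete distribution.
ggplot(data.frame(gender), aes(x = gender)) +
  geom_bar()
```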

Levels of Measurement

Levels of Measurement—A classification method used to define how variables are measured. For most statistical work, there are four possible ways to measure variables:

nominal,
ordinal,
interval, and
ratio
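The four levels are only named here. As a hedged sketch of how they are commonly represented in R (all variable names and values below are invented for illustration): nominal data have unordered categories, ordinal data have ordered categories, interval data have meaningful differences, and ratio data also have a true zero, so ratios are meaningful.

```r
# Nominal: unordered categories.
make <- factor(c("Toyota", "Honda"))

# Ordinal: ordered categories (the mental-state scale from the Variables section).
state <- factor(
  c("anxious", "relaxed", "neutral"),
  levels = c("very anxious", "anxious", "neutral", "relaxed", "very relaxed"),
  ordered = TRUE
)
state[1] < state[2]  # comparisons follow the stated order of the levels

# Interval: differences are meaningful, ratios are not (0 degrees C is not "no temperature").
temp_c <- c(10, 20)
diff(temp_c)

# Ratio: a true zero makes ratios meaningful.
weight_kg <- c(50, 100)
weight_kg[2] / weight_kg[1]
```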

Independent Variable

Independent Variable—The independent variable is manipulated and has the freedom to take on different values.
It is the presumed cause of change in the dependent variable in experimental work.
In observational-type studies, it is often referred to as the predictor variable.
This definition hinges on the idea that knowledge of the predictor variable will facilitate the successful estimation of the value for the dependent variable.

Dependent Variable

Dependent Variable—This variable can take on different values; however, these values are said to “depend” on the value of the independent variable.
In experimental work, we test whether the manipulation of the independent variable results in a significant change in the dependent variable.
In observational studies, the value of the dependent variable can be better predicted by knowledge of the value of the independent variable.

Horizontal Axis

Horizontal Axis—In this course, the term horizontal axis is used infrequently, as the preferred terminology is the x-axis.
Both these terms will always refer to the horizontal axis of the chart you are building.
During certain ggplot2 operations (for example, after coord_flip() swaps the axes), you will find that the vertical axis is referred to as the x-axis. When this happens, we revert to the term horizontal axis to avoid confusion.
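One such operation is coord_flip(). In the sketch below (built-in mtcars data, chosen only for illustration), the variable is still mapped to the x aesthetic in the code, but it is drawn along the vertical axis.

```r
library(ggplot2)

# cyl is mapped to the x aesthetic ...
upright <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar()

# ... and stays the "x" variable in code, even though coord_flip()
# draws it on the vertical axis.
flipped <- upright + coord_flip()

upright
flipped
```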

Vertical Axis

Vertical Axis—The vertical axis rises perpendicular to the x-axis (horizontal) and is most frequently referred to as the y-axis.

Basic Principles of `{ggplot2}`

The {ggplot2} package is based on the principles of “The Grammar of Graphics” (hence “gg” in the name of {ggplot2}), that is, a coherent system for describing and building graphs. The main idea is to design a graphic as a succession of layers.

The main layers are:

The dataset that contains the variables that we want to represent. This is done with the ggplot() function and comes first.
The variable(s) to represent on the x and/or y-axis, and the aesthetic elements (such as color, size, fill, shape and transparency) of the objects to be represented. This is done with the aes() function (abbreviation of aesthetic).
The type of graphical representation (scatter plot, line plot, barplot, histogram, boxplot, etc.). This is done with the functions geom_point(), geom_line(), geom_bar(), geom_histogram(), geom_boxplot(), etc.
If needed, additional layers (such as labels, annotations, scales, axis ticks, legends, themes, facets, etc.) can be added to personalize the plot.

To create a plot, we thus first need to specify the data in the ggplot() function and then add the required layers such as the variables, the aesthetic elements and the type of plot:

ggplot(data) +
  aes(x = var_x, y = var_y) +
  geom_x()

data in ggplot() is the name of the data frame which contains the variables var_x and var_y.
The + symbol is used to indicate the different layers that will be added to the plot. Make sure to write the + symbol at the end of the line of code and not at the beginning of the line, otherwise R throws an error.
The layer aes() indicates what variables will be used in the plot and more generally, the aesthetic elements of the plot.
Finally, x in geom_x() represents the type of plot.
Other layers are usually not required unless we want to personalize the plot further.

Note that it is a good practice to write one line of code per layer to improve code readability.
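Because each layer is added with +, a partial plot can be stored and reused. A minimal sketch, assuming the built-in mtcars data as a stand-in for data, with wt and mpg playing the roles of var_x and var_y:

```r
library(ggplot2)

# Data and variables only; no geometry has been added yet.
base <- ggplot(mtcars) +
  aes(x = wt, y = mpg)

# The same base with two different choices of geom_x().
scatter   <- base + geom_point()  # x = point
line_plot <- base + geom_line()   # x = line

scatter
line_plot
```

Only the geom_x() layer differs between the two plots, which is what makes swapping plot types cheap.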

Loading Packages and Exploring Data

Load Packages

library(tidyverse) 
library(gt)
library(gridExtra)

Exploring Data

data <- read.csv("../data/pulse_data.csv")

# examine first few rows 
data %>% 
  head() %>% 
  gt()

Height	Weight	Age	Gender	Smokes	Alcohol	Exercise	Ran	Pulse1	Pulse2	BMI	BMICat
1.73	57	18	Female	No	Yes	Moderate	No	86	88	19.04507	Underweight
1.79	58	19	Female	No	Yes	Moderate	Yes	82	150	18.10181	Underweight
1.67	62	18	Female	No	Yes	High	Yes	96	176	22.23099	Normal
1.95	84	18	Male	No	Yes	High	No	71	73	22.09073	Normal
1.73	64	18	Female	No	Yes	Low	No	90	88	21.38394	Normal
1.84	74	22	Male	No	Yes	Low	Yes	78	141	21.85728	Normal
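For readers without access to the CSV file, the six rows printed above can be retyped as a small data frame and explored with base R. The name pulse_preview is hypothetical, and only a subset of the columns is reproduced. The last step checks an observation that follows from the printed values: BMI equals weight divided by height squared.

```r
# The six preview rows shown above, retyped by hand (subset of columns).
pulse_preview <- data.frame(
  Height = c(1.73, 1.79, 1.67, 1.95, 1.73, 1.84),
  Weight = c(57, 58, 62, 84, 64, 74),
  Age    = c(18, 19, 18, 18, 18, 22),
  Gender = c("Female", "Female", "Female", "Male", "Female", "Male"),
  Ran    = c("No", "Yes", "Yes", "No", "No", "Yes"),
  Pulse1 = c(86, 82, 96, 71, 90, 78),
  Pulse2 = c(88, 150, 176, 73, 88, 141)
)

str(pulse_preview)              # column types
summary(pulse_preview$Pulse1)   # five-number summary plus mean
table(pulse_preview$Gender)     # counts per category

# BMI recomputed from the printed Weight and Height columns.
bmi <- pulse_preview$Weight / pulse_preview$Height^2
round(bmi, 5)
```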

Working with ggplot2

We start by specifying the data:

ggplot(data) # data

Then we add the variables to be represented with the aes() function:

ggplot(data) + # data
  aes(x = Ran) # variables

Finally, we indicate the type of plot:

ggplot(data) + # data
  aes(x = Ran) + # variables
  geom_bar() # type of plot

You will also sometimes see the aesthetic elements (aes() with the variables) inside the ggplot() function in addition to the dataset:

ggplot(data, aes(x = Ran)) +
  geom_bar()

Simple Bar Charts

A barplot shows the relationship between a numeric and a categorical variable. Each level of the categorical variable is represented as a bar, and the size of the bar represents its numeric value.
Barplots are sometimes described as a boring way to visualize information. However, they are probably the most efficient way to show this kind of data. Ordering the bars and providing good annotation are often necessary.

Purpose of Simple Bar Charts

To describe the number of observations in each category of the discrete variable
To visualize estimated error for discrete variables
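The second purpose, showing estimated error, is not illustrated in the slides that follow. One possible sketch is a bar per category with the mean as its height and error bars of plus or minus one standard error. The built-in mtcars data and the choice of standard error as the error measure are assumptions made for illustration.

```r
library(ggplot2)

# Mean and standard error of mpg within each cylinder category.
groups <- split(mtcars$mpg, mtcars$cyl)
stats <- data.frame(
  cyl  = factor(names(groups)),
  mean = sapply(groups, mean),
  se   = sapply(groups, function(v) sd(v) / sqrt(length(v)))
)

ggplot(stats) +
  aes(x = cyl, y = mean) +
  geom_col(fill = "#97B3C6") +                      # bar height = group mean
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se),
                width = 0.2)                        # mean +/- one standard error
```

geom_col() is used instead of geom_bar() because the bar heights are already computed rather than counted.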

Simple Bar Graph (Vertical Orientation)

# Single Categorical Variable 
data %>% 
  ggplot(aes(x = BMICat))+
  geom_bar(fill = "#97B3C6")

# Sorting Bar Chart 
data %>% 
  ggplot(aes(x = fct_infreq(BMICat)))+
  geom_bar(fill = "#97B3C6")
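To see what fct_infreq() does on its own, the sketch below applies it to a small invented factor: the levels are reordered from most to least frequent, which is why the bars come out sorted.

```r
library(forcats)

# Hypothetical categories: "c" occurs 3 times, "b" twice, "a" once.
x <- factor(c("a", "b", "b", "c", "c", "c"))

levels(x)              # default order: alphabetical
levels(fct_infreq(x))  # reordered by decreasing frequency
```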

Simple Bar Graph (Horizontal Orientation)

# Sorting Bar Chart 
data %>% 
  ggplot(aes(x = fct_infreq(BMICat)))+
  geom_bar(fill = "#97B3C6")+
  coord_flip()

Line plot

Line plots, particularly useful in time series or finance, can be created similarly but by using geom_line().

data %>% 
  ggplot(aes(x = Age, y = Weight)) +
  geom_line() # connect observations in order of x

Combination of line and points

data %>% 
  ggplot(aes(x = Age, y = Weight)) +
  geom_point()+
  geom_line() # add line

Histogram

ggplot(data) +
  aes(x = Age) +
  geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

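By default, the number of bins is 30. You can change this value with the bins argument of geom_histogram(); a common rule of thumb is the square root of the number of observations, rounded to a whole number:

ggplot(data) +
  aes(x = Age) +
  geom_histogram(bins = ceiling(sqrt(nrow(data)))) # square-root rule, rounded up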
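The message printed earlier suggests binwidth as an alternative to bins. A minimal sketch, using the built-in mtcars data as a stand-in for the course data (the width of 2.5 is an arbitrary value chosen for illustration):

```r
library(ggplot2)

# Square-root rule for the number of bins: 32 rows give 6 bins.
n_bins <- ceiling(sqrt(nrow(mtcars)))

by_count <- ggplot(mtcars) +
  aes(x = mpg) +
  geom_histogram(bins = n_bins)     # fix the number of bins

by_width <- ggplot(mtcars) +
  aes(x = mpg) +
  geom_histogram(binwidth = 2.5)    # fix the width of each bin instead

by_count
by_width
```

Setting binwidth is often easier to interpret, because each bar then covers a known range of the variable.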

Density plot

ggplot(data) +
  aes(x = Age) +
  geom_density()

Combination of histogram and densities

ggplot(data) +
  aes(x = Age, y = after_stat(density)) + # after_stat() replaces the deprecated ..density.. notation
  geom_histogram() +
  geom_density()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Or superimpose several densities:

ggplot(data) +
  aes(x = Age, color = Gender, fill = Gender) +
  geom_density(alpha = 0.25) # add transparency

Boxplot

# Boxplot for one variable
ggplot(data) +
  aes(x = "", y = Pulse1) +
  geom_boxplot()

Warning: Removed 1 rows containing non-finite values (stat_boxplot).

# Boxplot by factor
ggplot(data) +
  aes(x = Gender, y = Pulse1) +
  geom_boxplot()

Warning: Removed 1 rows containing non-finite values (stat_boxplot).
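The warning indicates that one value of the plotted variable is missing or non-finite. A hedged sketch of how to check for this and how to drop such values silently, using an invented vector rather than the course data:

```r
library(ggplot2)

# Hypothetical pulse readings with one missing value.
pulse <- c(86, 82, NA, 71, 90, 78)

sum(is.na(pulse))  # how many values would be dropped

# na.rm = TRUE removes missing values without the warning.
ggplot(data.frame(pulse)) +
  aes(x = "", y = pulse) +
  geom_boxplot(na.rm = TRUE)
```

It is usually worth counting the missing values first, so that they are dropped deliberately rather than by accident.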

Data Visualization using R

Jubayer Hossain, Founder & Instructor, CHIRAL Bangladesh

10 May 2022
