“A picture is worth a thousand words.” Because of the way the human brain processes information, using charts or graphs to visualize large amounts of complex data is easier than poring over spreadsheets or reports. Data visualization is a quick, easy way to convey concepts in a universal manner, and you can experiment with different scenarios by making slight adjustments.
In statistics, we generally have two kinds of visualization:
Variable—A characteristic that describes some physical or mental aspect of the individual, group, or inanimate object. The key point is that variables can vary and can be expressed as a particular numerical value or as falling in a unique category. The following are some examples:
Discrete Variable—A characteristic that describes some physical or mental aspect of the individual, group, or inanimate object that has been observed.
The term discrete describes how the variable is measured or counted.
Discrete variables vary in a manner so that the characteristics being measured fall in unique categories.
Such categories must be mutually exclusive, which means that any observation must fall in one and only one category.
The categories must also be inclusive, which means that there must be a category for every possible observation.
Examples of discrete variables from the list above are gender, mental state, number of errors, and make of automobile.
Continuous Variable—A characteristic that describes some physical or mental aspect of the individual, group, or inanimate object that has been observed.
The term continuous describes how the variable is measured. Continuous variables vary by taking on any one of a large number of measures (often infinite).
Examples of continuous variables from the list above are weight, speed, and how high one can jump.
Category—A natural grouping of the characteristics of a discrete variable.
Think of the type of automobile as a variable having categories such as Porsche, Ferrari, and Maserati.
The automobile possessing “Porsche” characteristics is counted or placed in a category titled Porsche.
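As a minimal sketch of how categories work in practice, the snippet below stores a discrete variable as an R factor and counts the observations in each category. The `make` vector is hypothetical data invented for illustration, not part of the data set used later.

```r
# Hypothetical observations of a discrete variable (make of automobile)
make <- c("Porsche", "Ferrari", "Porsche", "Maserati", "Ferrari", "Porsche")

# A factor stores the variable together with its set of categories (levels)
make_f <- factor(make, levels = c("Porsche", "Ferrari", "Maserati"))

# Each observation is counted in exactly one category
counts <- table(make_f)
print(counts)

# Mutually exclusive and inclusive: the counts add up to the number of
# observations, and no observation is left without a category
stopifnot(sum(counts) == length(make), !anyNA(make_f))
```

If a value outside the declared levels appeared in `make`, `factor()` would turn it into `NA`, which is one way to spot a category scheme that is not inclusive.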
Continuous Distribution—Plainly speaking, a continuous distribution is just a “bunch of numbers” that resulted when something was measured at the continuous level.
We use various types of statistical analysis, including graphs, to make sense of such distributions.
Histograms and line graphs are the most commonly used graphing techniques to describe continuous distributions.
Discrete Distribution—Could also consist of a bunch of numbers, but the numbers take on a different meaning. Using the gender variable as an example, we could assign the value label of 1 to the male category and 2 to the female category.
This being the case, you now have a column of numbers consisting of 1s and 2s that you are trying to understand.
You could use SPSS to produce a bar graph.
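The same idea can be sketched in R rather than SPSS. The example below assumes a hypothetical column of 1s and 2s (`gender_code`), attaches the value labels, and draws the bar graph with ggplot2; the data are invented for illustration.

```r
library(ggplot2)

# Hypothetical column of codes: 1 = male, 2 = female
gender_code <- c(1, 2, 2, 1, 2, 1, 2, 2)

# Attach the value labels so the codes are treated as categories, not quantities
gender <- factor(gender_code, levels = c(1, 2), labels = c("Male", "Female"))

# Frequency of each category
gender_counts <- table(gender)
print(gender_counts)

# Bar graph of the discrete distribution
p_gender <- ggplot(data.frame(gender = gender)) +
  aes(x = gender) +
  geom_bar()
print(p_gender)
```

Converting the codes to a factor first matters: left as numbers, the 1s and 2s would be treated as a continuous variable.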
Independent Variable—The independent variable is manipulated and has the freedom to take on different values.
It is the presumed cause of change in the dependent variable in experimental work.
In observational-type studies, it is often referred to as the predictor variable.
This definition hinges on the idea that knowledge of the predictor variable will facilitate the successful estimation of the value for the dependent variable.
Dependent Variable—This variable can take on different values; however, these values are said to “depend” on the value of the independent variable.
In experimental work, we test whether the manipulation of the independent variable results in a significant change in the dependent variable.
In observational studies, the value of the dependent variable can be better predicted by knowledge of the value of the independent variable.
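To make the two roles concrete, here is a small sketch with a made-up experiment: `ran` plays the independent variable and `pulse` the dependent variable. The numbers are hypothetical and chosen only to illustrate the comparison.

```r
# Hypothetical experiment: 'ran' is the manipulated (independent) variable,
# 'pulse' is the measured (dependent) variable. All values are made up.
experiment <- data.frame(
  ran   = factor(c("No", "No", "No", "Yes", "Yes", "Yes")),
  pulse = c(70, 74, 72, 120, 130, 125)
)

# Mean of the dependent variable at each level of the independent variable
group_means <- tapply(experiment$pulse, experiment$ran, mean)
print(group_means)
```

Knowing the value of `ran` narrows down the likely value of `pulse`, which is the sense in which the independent variable acts as a predictor.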
Horizontal Axis—In this course, the term horizontal axis is used infrequently, as the preferred terminology is the x-axis.
Both these terms will always refer to the horizontal axis of the chart you are building.
During certain ggplot2 operations, you will find that the vertical axis is referred to as the x-axis. When this happens, we revert to the term horizontal axis so as to avoid confusion.
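One operation where this happens is `coord_flip()`. The sketch below uses the `mpg` data set that ships with ggplot2 (an assumption made for illustration, not the data used later): the variable is still mapped to the x aesthetic, but it is drawn along the vertical axis.

```r
library(ggplot2)

# mpg ships with ggplot2; 'class' is mapped to the x aesthetic
p <- ggplot(mpg) +
  aes(x = class) +
  geom_bar()

# After coord_flip(), the x aesthetic is drawn along the vertical axis
p_flipped <- p + coord_flip()
print(p_flipped)
```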
{ggplot2}
The {ggplot2} package is based on the principles of “The Grammar of Graphics” (hence “gg” in the name of {ggplot2}), that is, a coherent system for describing and building graphs. The main idea is to design a graphic as a succession of layers.
The main layers are:
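- The dataset that contains the variables to plot. This is set with the ggplot() function and comes first.
- The variables to represent and the aesthetic elements of the plot. These are set with the aes() function (abbreviation of aesthetic).
- The type of plot. This is set with functions such as geom_point(), geom_line(), geom_bar(), geom_histogram(), geom_boxplot(), etc.
To create a plot, we thus first need to specify the data in the ggplot() function and then add the required layers such as the variables, the aesthetic elements and the type of plot: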
ggplot(data) +
  aes(x = var_x, y = var_y) +
  geom_x()
- data in ggplot() is the name of the data frame which contains the variables var_x and var_y.
- The + symbol is used to indicate the different layers that will be added to the plot. Make sure to write the + symbol at the end of the line of code and not at the beginning of the line, otherwise R throws an error.
- aes() indicates what variables will be used in the plot and, more generally, the aesthetic elements of the plot.
- x in geom_x() represents the type of plot.
Note that it is good practice to write one line of code per layer to improve code readability.
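The template can be filled in as follows. This is a sketch that assumes the built-in `mtcars` data frame purely for illustration, with `wt` and `mpg` standing in for var_x and var_y and `geom_point()` standing in for geom_x().

```r
library(ggplot2)

# One layer per line, with the + at the end of each line
p <- ggplot(mtcars) +      # data
  aes(x = wt, y = mpg) +   # variables
  geom_point()             # type of plot
print(p)

# A leading + starts a new statement, so R fails on the second line:
# ggplot(mtcars)
# + geom_point()
```

Storing the plot in an object such as `p` is optional, but it makes it easy to add further layers later with `p + ...`.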
library(tidyverse)
library(gt)
library(gridExtra)
data <- read.csv("../data/pulse_data.csv")
# examine first few rows
data %>%
  head() %>%
  gt()
| Height | Weight | Age | Gender | Smokes | Alcohol | Exercise | Ran | Pulse1 | Pulse2 | BMI | BMICat |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.73 | 57 | 18 | Female | No | Yes | Moderate | No | 86 | 88 | 19.04507 | Underweight |
| 1.79 | 58 | 19 | Female | No | Yes | Moderate | Yes | 82 | 150 | 18.10181 | Underweight |
| 1.67 | 62 | 18 | Female | No | Yes | High | Yes | 96 | 176 | 22.23099 | Normal |
| 1.95 | 84 | 18 | Male | No | Yes | High | No | 71 | 73 | 22.09073 | Normal |
| 1.73 | 64 | 18 | Female | No | Yes | Low | No | 90 | 88 | 21.38394 | Normal |
| 1.84 | 74 | 22 | Male | No | Yes | Low | Yes | 78 | 141 | 21.85728 | Normal |
ggplot(data) # data
We then add the variables to plot with the aes() function:
ggplot(data) + # data
  aes(x = Ran) # variables
Finally, we add the type of plot:
ggplot(data) + # data
  aes(x = Ran) + # variables
  geom_bar() # type of plot
You will also sometimes see the aesthetic elements (aes() with the variables) inside the ggplot() function in addition to the dataset:
ggplot(data, aes(x = Ran)) +
  geom_bar()
# Single Categorical Variable
data %>%
  ggplot(aes(x = BMICat)) +
  geom_bar(fill = "#97B3C6")
# Sorting Bar Chart (most frequent category first)
data %>%
  ggplot(aes(x = fct_infreq(BMICat))) +
  geom_bar(fill = "#97B3C6")
# Sorting Bar Chart, flipped to horizontal bars
data %>%
  ggplot(aes(x = fct_infreq(BMICat))) +
  geom_bar(fill = "#97B3C6") +
  coord_flip()
Scatter plots are created with geom_point(). Line plots, particularly useful in time series or finance, can be created similarly by using geom_line(), either on its own or added on top of the points:
data %>%
  ggplot(aes(x = Age, y = Weight)) +
  geom_point()
data %>%
  ggplot(aes(x = Age, y = Weight)) +
  geom_point() +
  geom_line() # add line
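For the time series case mentioned above, a minimal sketch could look like this. It assumes the `economics` data set that ships with ggplot2 (monthly US economic indicators) rather than the pulse data, which has no time variable.

```r
library(ggplot2)

# economics ships with ggplot2; date is a Date column, unemploy a numeric one
p_line <- ggplot(economics) +
  aes(x = date, y = unemploy) +
  geom_line()
print(p_line)
```

Because geom_line() connects observations in order of the x variable, it reads most naturally when x is a date or another ordered quantity.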
ggplot(data) +
  aes(x = Age) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
By default, the number of bins is equal to 30. You can change this value using the bins argument inside the geom_histogram() function:
ggplot(data) +
  aes(x = Age) +
  geom_histogram(bins = sqrt(nrow(data)))
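The message printed earlier suggests `binwidth` as an alternative to `bins`. The sketch below shows both options on a small hypothetical `ages` data frame (invented values, used only so the example is self-contained), and rounds the square-root rule up to a whole number of bins.

```r
library(ggplot2)

# Hypothetical ages, for illustration only
ages <- data.frame(
  Age = c(18, 18, 19, 19, 19, 20, 20, 21, 22, 23, 25, 27, 30, 34, 45)
)

# Square-root rule, rounded up to a whole number of bins
n_bins <- ceiling(sqrt(nrow(ages)))

# Option 1: fix the number of bins
p_bins <- ggplot(ages) +
  aes(x = Age) +
  geom_histogram(bins = n_bins)

# Option 2: fix the width of each bin (here 5 years) instead of the count
p_width <- ggplot(ages) +
  aes(x = Age) +
  geom_histogram(binwidth = 5)

print(p_bins)
print(p_width)
```

Setting `binwidth` is often easier to interpret, since each bar then covers a known range of the variable.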
ggplot(data) +
  aes(x = Age) +
  geom_density()
ggplot(data) +
  aes(x = Age, y = after_stat(density)) + # after_stat() replaces the deprecated ..density..
  geom_histogram() +
  geom_density()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Or superimpose several densities:
ggplot(data) +
  aes(x = Age, color = Gender, fill = Gender) +
  geom_density(alpha = 0.25) # add transparency
# Boxplot for one variable
ggplot(data) +
  aes(x = "", y = Pulse1) +
  geom_boxplot()
Warning: Removed 1 rows containing non-finite values (stat_boxplot).
# Boxplot by factor
ggplot(data) +
  aes(x = Gender, y = Pulse1) +
  geom_boxplot()
Warning: Removed 1 rows containing non-finite values (stat_boxplot).
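The warning indicates that one Pulse1 value is missing or otherwise non-finite and was dropped before plotting. A possible way to check for this and handle it explicitly is sketched below on a small hypothetical `pulse` data frame (invented values, not the real data set).

```r
library(ggplot2)

# Hypothetical pulse data with one missing measurement
pulse <- data.frame(
  Gender = c("Female", "Female", "Male", "Male", "Male"),
  Pulse1 = c(86, 82, NA, 71, 78)
)

# Count the missing values that would trigger the warning
n_missing <- sum(is.na(pulse$Pulse1))
print(n_missing)

# Drop incomplete rows explicitly before plotting
pulse_complete <- pulse[!is.na(pulse$Pulse1), ]

p_box <- ggplot(pulse_complete) +
  aes(x = Gender, y = Pulse1) +
  geom_boxplot()
print(p_box)
```

Alternatively, `geom_boxplot(na.rm = TRUE)` removes missing values silently, at the cost of hiding how many observations were dropped.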