library(tidyverse)
#> ── Attaching core tidyverse packages ───────────────────── tidyverse 2.0.0 ──
#> ✔ dplyr 1.1.4 ✔ readr 2.1.5
#> ✔ forcats 1.0.0 ✔ stringr 1.5.1
#> ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
#> ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
#> ✔ purrr 1.0.2
#> ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
R Tutorial 5: Data Visualization
MKT 410: Marketing Analytics
Learning Objectives
Now that we have learned how to import, tidy, and transform datasets, we will cover how to visualize data so that we can better understand it. We will use ggplot2, a popular graphing package that implements the grammar of graphics, a system for describing and building graphs.
In this tutorial, we will cover:
- The basic steps for visualizing your data using ggplot2
- How to visualize distributions of single variables
- How to visualize relationships between two or more variables
- How to save your plots
Where relevant, we will also show how to recreate the plots in Tableau, a data visualization tool used in many business settings. Tableau is easier to operate than R, although it is less powerful.
Prerequisites
We’ll again use the tidyverse, which contains the ggplot2 package.
In addition to tidyverse, we will need to load the palmerpenguins package, which includes the penguins dataset that contains body measurements for penguins on three islands in the Palmer Archipelago.
# install.packages("palmerpenguins") ## install if needed
library(palmerpenguins)
Lastly, we’ll load the ggthemes package, which offers a colorblind safe color palette.
# install.packages("ggthemes") ## install if needed
library(ggthemes)
Getting Started
The penguins Dataset
Do penguins with longer flippers weigh more or less than penguins with shorter flippers? What does the relationship between flipper length and body mass look like?
- Is it positive?
- Negative?
- Linear?
- Nonlinear?
- Does the relationship vary by the species of the penguin?
- How about by the island where the penguin lives?
We can create visualizations that we can use to answer these questions.
The penguins dataset contains relevant information to help us.
glimpse(penguins)
#> Rows: 344
#> Columns: 8
#> $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, A…
#> $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torge…
#> $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.…
#> $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.…
#> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, …
#> $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 347…
#> $ sex <fct> male, female, female, NA, female, male, female, m…
#> $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2…
In this dataset, we have some key variables:
- species: a penguin’s species (Adelie, Chinstrap, or Gentoo)
- flipper_length_mm: length of a penguin’s flipper, in millimeters
- body_mass_g: body mass of a penguin, in grams
To learn more about penguins, open its help page by running ?penguins.
Ultimate Goal
The ultimate goal of this tutorial is to recreate the following visualization displaying the relationship between flipper lengths and body masses of these penguins, taking into consideration the species of the penguin.
Creating a ggplot
The steps to creating a plot in ggplot2 are:
- Define a plot object
- Add layers to it
1. Define a plot object
Let’s create a plot object using the ggplot() function and dataset penguins:
ggplot(data = penguins)
Since we haven’t told ggplot() how to visualize the data, there’s nothing to display, so we get a completely blank “plot”.
We now need to tell ggplot() how to visually represent our data. We need to define how variables in your dataset are mapped to visual properties, or “aesthetics”, of your plot.
Let’s start from the very first visual components of any plot: the x-axis and y-axis. For our penguins data, we’ll specify:
- x-axis: flipper length
- y-axis: body mass
To do this, we will use the aes() function for the mapping argument in ggplot():
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)
)
We’ve now created our base plot object. It knows what our data is, and which variables are the x and y axes.
Next, we will want to actually fill in the plot with the penguins data.
2. Adding our first layer
We will now articulate how to represent the observations from our dataset onto the plot.
To do so, we need to define a geom: the geometrical object that a plot uses to represent data. These geometrical objects are made available in ggplot2 with functions that start with geom_. There are several types of geometrical objects (“geoms”) that we can use to describe data, such as:
- geom_bar(): bar charts using bar geoms
- geom_line(): line charts using line geoms
- geom_boxplot(): boxplots using boxplot geoms
- geom_point(): scatterplots using point geoms
Let’s add our first layer using the geom_point() function, which adds a layer of points to your plot. This layer of points creates a scatterplot. Note that we simply add the layer using the + symbol (we’re not piping %>% the plot object into the geom function).
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
geom_point()
#> Warning: Removed 2 rows containing missing values or values outside the scale range
#> (`geom_point()`).
While this plot doesn’t show all the information that we might be interested in (such as the relationship between body mass and flipper length for each species), we can already start answering the overall question: “What is the relationship between flipper length and body mass?”
Notice that there was a warning message. This message states that there are two observations with missing values. That is, two penguins in our dataset have missing body mass and/or flipper length values. There’s no way to plot their data onto this plot, but R will still inform you of the missing data.
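To see which rows trigger this warning, one option is to filter for penguins that are missing either plotted variable. The sketch below assumes the dplyr, ggplot2, and palmerpenguins packages are installed; the object names missing_rows and penguins_complete are illustrative and not part of the tutorial itself.

```r
library(dplyr)
library(ggplot2)
library(palmerpenguins)

# Rows where either plotted variable is missing (the rows ggplot2 drops)
missing_rows <- penguins %>%
  filter(is.na(flipper_length_mm) | is.na(body_mass_g))
nrow(missing_rows)

# Optional: drop those rows up front so the scatterplot renders without the warning
penguins_complete <- penguins %>%
  filter(!is.na(flipper_length_mm), !is.na(body_mass_g))

ggplot(penguins_complete, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()
```

Removing rows before plotting is a judgment call: it silences the warning, but the warning is also a useful reminder that some observations are not shown.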
Adding Aesthetics and Layers
Adding an Aesthetic: Color by Species
While the above plot shows a positive relationship between flipper length and body mass, we should always be curious if any other variables might explain or change the nature of this apparent relationship. Does this relationship differ by species?
We can examine this question by representing species with different colored points. To achieve this, should we use an aesthetic or a geom? The correct answer is to modify the aesthetic mapping, since it is a visual property rather than a geometric object. (Don’t worry if this is confusing for now, as we will go through additional examples later on).
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)
) +
geom_point()
#> Warning: Removed 2 rows containing missing values or values outside the scale range
#> (`geom_point()`).
When a categorical variable (in this case species) is mapped to an aesthetic (in this case color), ggplot2 will automatically assign a unique value of the aesthetic to each unique level of the variable. It will also add a legend that explains which values correspond to which levels.
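It can help to contrast mapping a variable to color with setting a single fixed color. The sketch below is an illustration under our own assumptions: the object names p_mapped and p_fixed and the color "steelblue" are arbitrary choices, not part of the tutorial.

```r
library(ggplot2)
library(palmerpenguins)

# Mapping: color depends on a variable, so it goes inside aes() and produces a legend
p_mapped <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)
) +
  geom_point()

# Setting: one fixed color for every point, so it goes outside aes() and has no legend
p_fixed <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(color = "steelblue")

p_mapped
p_fixed
```

A rule of thumb: if the visual property should change with the data, map it inside aes(); if it should be the same for every observation, set it as an argument to the geom.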
Adding a Layer: Line of Best Fit
Next, let’s add a new layer displaying the relationship between flipper length and body mass. We can do this by using a smooth curve via the geom_smooth() function. This is a new geometric object representing our data. In geom_smooth(), we also need to specify how we want to draw the curve; here, we’ll specify a linear model using method = "lm".
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)
) +
geom_point() +
geom_smooth(method = "lm")
#> `geom_smooth()` using formula = 'y ~ x'
#> Warning: Removed 2 rows containing non-finite outside the scale range
#> (`stat_smooth()`).
#> Warning: Removed 2 rows containing missing values or values outside the scale range
#> (`geom_point()`).
Global vs. Local Aesthetics
While the plot above does add lines to our plot, notice that there is a different line for each of the penguin species.
This is because aesthetic mappings defined within the initial ggplot() function are defined at the global level, and are passed down to each of the subsequent geom layers of the plot. However, we can also specify aesthetic mappings at the local level within each geom_ function via its own mapping argument.
- Global aesthetics:
  - Defined in the initial ggplot() function call
  - Applied to all geom layers of the plot
- Local aesthetics:
  - Defined within a specific geom_...() function call
  - Applied to the specific geom layer only
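One way to see the distinction summarized above is to look inside the plot object for the per-species plot. This is a rough sketch that relies on how ggplot2 currently stores mappings (the mapping and layers fields of the plot object); these are internal details that may change between versions, and the object name p is our own.

```r
library(ggplot2)
library(palmerpenguins)

# Same plot as before: color is mapped globally, inside ggplot()
p <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)
) +
  geom_point() +
  geom_smooth(method = "lm")

# Global aesthetics: inherited by every layer
# (ggplot2 stores color under its British spelling, colour)
names(p$mapping)

# Local aesthetics of each layer: neither layer defines any of its own,
# so both geom_point() and geom_smooth() inherit color = species
names(p$layers[[1]]$mapping)
names(p$layers[[2]]$mapping)
```

Because geom_smooth() inherits the global color mapping, it fits one line per species rather than a single line for the whole dataset.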
Let’s modify our plot and have the line be applied to the whole dataset, while keeping colors different by species.
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g) ## global aesthetics
) +
geom_point(mapping = aes(color = species)) + ## local aesthetic
geom_smooth(method = "lm")
#> `geom_smooth()` using formula = 'y ~ x'
#> Warning: Removed 2 rows containing non-finite outside the scale range
#> (`stat_smooth()`).
#> Warning: Removed 2 rows containing missing values or values outside the scale range
#> (`geom_point()`).
Adding an Aesthetic: Shape by Species
It is generally a good idea to represent groups of points not just with different colors but also with different shapes. The shape again is a visual property of the points, so we’ll add that to the aes() function in geom_point().
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
geom_point(mapping = aes(color = species, shape = species)) +
geom_smooth(method = "lm")
#> `geom_smooth()` using formula = 'y ~ x'
#> Warning: Removed 2 rows containing non-finite outside the scale range
#> (`stat_smooth()`).
#> Warning: Removed 2 rows containing missing values or values outside the scale range
#> (`geom_point()`).
Adding Labels
We can also make our labels “prettier” by adding a new layer labs() that specifies how different text within the plot can be displayed. Lastly, we can improve the color palette to be colorblind safe with the scale_color_colorblind() function from the ggthemes package.
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
geom_point(aes(color = species, shape = species)) +
geom_smooth(method = "lm") +
# Add cleaner looking labels
labs(
title = "Body mass and flipper length",
subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
x = "Flipper length (mm)", y = "Body mass (g)",
color = "Species", shape = "Species"
) +
# Use a colorblind safe color palette
scale_color_colorblind()
#> `geom_smooth()` using formula = 'y ~ x'
#> Warning: Removed 2 rows containing non-finite outside the scale range
#> (`stat_smooth()`).
#> Warning: Removed 2 rows containing missing values or values outside the scale range
#> (`geom_point()`).
At last, we have recreated our “ultimate goal”!
Exercises
- Make a scatterplot of bill_depth_mm vs. bill_length_mm. Describe the relationship between these two variables.
- What happens if you make a scatterplot of species vs. bill_depth_mm? What might be a better choice of geom?
- Add the following caption to the plot you made in the previous exercise: “Data come from the palmerpenguins package.” Hint: Take a look at the documentation for labs().
- Try to recreate the following visualization. What aesthetic should bill_depth_mm be mapped to? Should it be mapped at the global level or at the local (geom) level?
- Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = island)
) +
  geom_point() +
  geom_smooth(se = FALSE)
Recreating in Tableau
Let’s pause briefly to see how to recreate this plot in Tableau. A CSV file for the penguins dataset is included in Laulima under “Resources > data > penguins.csv”. (Note that this version of the penguins data also includes an “id” column, which is helpful for us here.)
- Open Tableau Desktop.
- In the top left corner under “Data”, click “Connect to Data”.
- Under “To a file”, select “Text file” and navigate to the “penguins.csv” file and click “open”.
- In the bottom left, click “Sheet 1”.
- In the left hand menu, right-click on “id” and select “Dimension” to convert the “id” variable from a measure to a dimension.
- In the left hand menu, hold Ctrl/Cmd and select both “Body Mass G” and “Flipper Length Mm”.
- In the top right corner, click “Show Me” and select the scatterplot option.
- From the left hand menu, double-click on “id” to add it to the “Marks” section.
- From the left hand menu, drag “Species” over the “Color” option in the “Marks” section.
- From the left hand menu, drag “Species” over the “Shape” option in the “Marks” section.
- At the top of the left hand menu, select “Analytics” to switch the menu to the Analytics menu.
- Under “Model” in the new left hand menu, select “Trend Line”.
- Click on any one of the trend lines in the plot and select “Edit” to open the Trend Lines Options.
- Under “Factors”, un-check the “Species” option.
- (Optional) Under “Options”, check the “Allow confidence bands” option.
- Click “OK” to exit out of the Trend Lines Options.
- Double-click the y-axis area and uncheck “Include zero”, then click the “X” to exit out of the menu.
- Double-click the x-axis area and uncheck “Include zero”, then click the “X” to exit out of the menu.
- At the top of the graph, double-click on “Sheet 1” and rename the sheet title to “Body mass and flipper length”
The resulting graph should look something like this:
Visualizing Distributions
Categorical Variables
We can visualize the distribution of the categorical variable species with a bar chart, using geom_bar().
ggplot(penguins, aes(x = species)) +
geom_bar()
We may want to reorder the levels of the categorical (or “factor”) variable from the largest count to the smallest. We can do this using the fct_infreq() function.
ggplot(penguins, aes(x = fct_infreq(species))) +
geom_bar()
Other options are fct_inorder(), which orders the levels by the order in which they first appear in the data, and fct_inseq(), which orders the levels by their numeric value (for factors whose levels are numbers).
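To check what fct_infreq() does without drawing a plot, we can compare the factor levels before and after reordering. This is a small sketch assuming the forcats and palmerpenguins packages are loaded; the name species_by_count is our own.

```r
library(forcats)
library(palmerpenguins)

# Original level order (alphabetical in this dataset)
levels(penguins$species)

# Number of penguins per species
table(penguins$species)

# Levels reordered from most to least frequent: Adelie, Gentoo, Chinstrap
species_by_count <- fct_infreq(penguins$species)
levels(species_by_count)
```

Only the order of the levels changes; the underlying values stay the same, which is why the bars move but their heights do not.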
Numerical Variables
Histogram
We can visualize the distribution of the numerical variable body_mass_g with a histogram, using geom_histogram().
ggplot(penguins, aes(x = body_mass_g)) +
geom_histogram(binwidth = 200)
#> Warning: Removed 2 rows containing non-finite outside the scale range
#> (`stat_bin()`).
You can set the width of the intervals in a histogram with the binwidth argument, which is measured in the units of the x variable. You should always explore a variety of binwidths when working with histograms, as different binwidths can reveal different patterns.
ggplot(penguins, aes(x = body_mass_g)) +
geom_histogram(binwidth = 20)
ggplot(penguins, aes(x = body_mass_g)) +
geom_histogram(binwidth = 2000)
Density Plot
An alternative visualization for the distribution of a numerical variable is a density plot, which is a smoothed-out version of a histogram.
ggplot(penguins, aes(x = body_mass_g)) +
geom_density()
#> Warning: Removed 2 rows containing non-finite outside the scale range
#> (`stat_density()`).
Exercises
- Make a bar plot of species of penguins, where you assign species to the y aesthetic. How is this plot different?
- How are the following two plots different? Which aesthetic, color or fill, is more useful for changing the color of bars?
- What does the bins argument in geom_histogram() do?
Recreating in Tableau
Bar Chart
Create a new Sheet in Tableau.
- From the left hand column, drag “Species” to the “Columns” section at the top of the window.
- From the left hand column, drag “id” to the “Rows” section at the top of the window.
- In the “Rows” section, click the drop-down menu next to “id” and select “Measure > Count”.
- In the “Columns” section, click the drop-down menu next to “Species” and select “Sort…”.
- Under “Sort By”, select “Field”.
- Under “Sort Order”, select Descending.
- Click the “X” to exit out of the sort menu.
Histogram
Create a new Sheet in Tableau.
- From the left hand menu, drag “Body Mass G” to the “Columns” section at the top of the window.
- In the top right corner, click “Show Me” and select the histogram option.
- In the left hand menu, right-click on “Body Mass G (bin)” and select “Edit”.
- Adjust the “Size of bins” value to change the number of bins (smaller size = larger number of bins).
Unfortunately, Tableau does not have a native option to create a one-dimensional density plot.
Visualizing Relationships
A Numerical and A Categorical Variable
Box Plots
To visualize the relationship between a numerical and a categorical variable, we can use side-by-side box plots. A box plot is a type of visual shorthand for measures of position (percentiles) that describe a distribution. It consists of:
- A box that indicates the middle half of the data (25th to 75th percentile), or what’s called the interquartile range (IQR). The line in the middle of the box displays the median (50th percentile).
- A line (or whisker) that extends from each end of the box and goes to the farthest non-outlier point in the distribution.
- Visual points that display outliers, which are observations that fall more than 1.5 times the IQR from either edge of the box.
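The components listed above can also be computed by hand, which makes it clearer what the box, whiskers, and outlier points represent. The sketch below assumes the dplyr and palmerpenguins packages are loaded; the name box_stats and its column names are illustrative choices of ours.

```r
library(dplyr)
library(palmerpenguins)

box_stats <- penguins %>%
  filter(!is.na(body_mass_g)) %>%          # drop penguins with missing body mass
  group_by(species) %>%
  summarise(
    q25 = quantile(body_mass_g, 0.25),     # bottom edge of the box
    med = median(body_mass_g),             # line inside the box
    q75 = quantile(body_mass_g, 0.75),     # top edge of the box
    iqr = q75 - q25,                       # height of the box
    lower_fence = q25 - 1.5 * iqr,         # points below this are outliers
    upper_fence = q75 + 1.5 * iqr,         # points above this are outliers
    n_outliers = sum(body_mass_g < lower_fence | body_mass_g > upper_fence)
  )

box_stats
```

The whiskers are not drawn at the fences themselves; they stop at the most extreme observation that still lies inside the fences.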
We can plot boxplots using geom_boxplot():
ggplot(penguins, aes(x = species, y = body_mass_g)) +
geom_boxplot()
#> Warning: Removed 2 rows containing non-finite outside the scale range
#> (`stat_boxplot()`).
Density Plots by Species
Alternatively, we can use geom_density() plots again, this time specifying the color aesthetic to be species. The linewidth argument in geom_density() specifies how thick each density line is.
ggplot(penguins, aes(x = body_mass_g, color = species)) +
geom_density(linewidth = 0.75)
#> Warning: Removed 2 rows containing non-finite outside the scale range
#> (`stat_density()`).
We can also map the fill aesthetic to species and use the alpha argument to set the opacity of the filled-in color of the density curves.
ggplot(penguins, aes(x = body_mass_g, color = species, fill = species)) +
geom_density(linewidth = 0.75, alpha = 0.5)
#> Warning: Removed 2 rows containing non-finite outside the scale range
#> (`stat_density()`).
Two Categorical Variables
Stacked Bar Plots
We can use stacked bar plots to visualize the relationship between two categorical variables, for example between island and species.
ggplot(penguins, aes(x = island, fill = species)) +
geom_bar()
To get a better sense of the percentage balance within each island, we can use the position argument in the geom_bar() function. Using position = "fill", we specify that each bar should “fill” the entire available space, so that its segments show proportions rather than counts. (Note that the default value of position is "stack", which is what is used in the plot above.)
ggplot(penguins, aes(x = island, fill = species)) +
geom_bar(position = "fill")
Two Numerical Variables
A pair of numerical variables lends itself to a scatterplot, as we’ve shown already.
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point()
#> Warning: Removed 2 rows containing missing values or values outside the scale range
#> (`geom_point()`).
Three or More Variables
Facets (or Subplots)
Let’s say we want to analyze the relationship between a lot of variables, such as:
- Flipper length
- Body mass
- Species
- Island
A graph will quickly become way too cluttered when we try and map everything onto one plot.
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point(aes(color = species, shape = island))
#> Warning: Removed 2 rows containing missing values or values outside the scale range
#> (`geom_point()`).
One way to visualize the relationship between multiple variables is to use facets, or subplots that each display one subset of the data. Each facet (or subplot) will show the subset of the data that corresponds to one of the values in a categorical variable.
The function to specify your facets is facet_wrap(). The first argument of facet_wrap() is a formula, which you create with ~ followed by a variable name. Note that the variable should be a categorical (factor) variable.
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point(aes(color = species, shape = species)) +
facet_wrap(~island)
#> Warning: Removed 2 rows containing missing values or values outside the scale range
#> (`geom_point()`).
You can specify multiple categorical variables. The below example uses combinations of both island and sex to create the facets by specifying ~island + sex as the argument in facet_wrap().
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point(aes(color = species, shape = species)) +
facet_wrap(~island + sex)
#> Warning: Removed 2 rows containing missing values or values outside the scale range
#> (`geom_point()`).
Correlations/Covariations
Let’s say we have more than two numerical variables whose relationships we’d like to analyze:
- Flipper length
- Body mass
- Bill length
- Bill depth
One way to look at the relationship between all of these variables together is to use a correlation matrix. This is a matrix that shows the correlation between each pair of variables in a list of variables.
To construct a correlation matrix in ggplot2, we’ll need to install and load another library: ggcorrplot.
# install.packages("ggcorrplot") ## install if needed
library(ggcorrplot)
First, we’ll need to compute the correlation matrix itself for these four variables. The cor() function (part of base R) is used to calculate the correlations between column variables provided to it. Since there are some missing values in our data, we’ll also need to specify which rows in the dataset to use. In the following example, we’ll use pairwise-complete observations by specifying use = "pairwise.complete.obs".
corr <- penguins %>%
# Select variables
select(
flipper_length_mm,
body_mass_g,
bill_length_mm,
bill_depth_mm
) %>%
# Compute correlations
cor(use = "pairwise.complete.obs")
corr
#> flipper_length_mm body_mass_g bill_length_mm bill_depth_mm
#> flipper_length_mm 1.0000000 0.8712018 0.6561813 -0.5838512
#> body_mass_g 0.8712018 1.0000000 0.5951098 -0.4719156
#> bill_length_mm 0.6561813 0.5951098 1.0000000 -0.2350529
#> bill_depth_mm -0.5838512 -0.4719156 -0.2350529 1.0000000
Next, we’ll simply plot these correlations using ggcorrplot().
ggcorrplot(corr)
We see positive correlations among flipper length, body mass, and bill length, the strongest being between flipper length and body mass (0.87). We also find that bill depth is moderately negatively correlated with flipper length (-0.58) and body mass (-0.47); that is, larger penguins tend to have flatter bills.
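Returning to the use argument of cor(), it can be instructive to compare the pairwise-complete approach with the stricter use = "complete.obs", which keeps only rows with no missing value in any selected column. The sketch below assumes the dplyr and palmerpenguins packages are loaded; the object names measurements, corr_pairwise, and corr_complete are our own.

```r
library(dplyr)
library(palmerpenguins)

measurements <- penguins %>%
  select(flipper_length_mm, body_mass_g, bill_length_mm, bill_depth_mm)

# Pairwise: each correlation uses every row where that particular pair is present
corr_pairwise <- cor(measurements, use = "pairwise.complete.obs")

# Complete: only rows with no missing value in any of the four columns are used
corr_complete <- cor(measurements, use = "complete.obs")

# Largest absolute difference between the two approaches
max(abs(corr_pairwise - corr_complete))

# Rounded matrix for easier reading
round(corr_pairwise, 2)
```

When the same rows are missing for every variable, the two options give the same matrix; they diverge only when different variables are missing for different rows.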
Exercises
- Using penguins, make a scatterplot of bill_depth_mm vs. bill_length_mm. What does this plot show about the relationship between bill depth and bill length?
- Now make a scatterplot of bill_depth_mm vs. bill_length_mm and color the points by species. What does adding coloring by species reveal about the relationship between these two variables? What about faceting by species?
- The mpg dataframe included with the ggplot2 package consists of 234 observations collected by the EPA on 38 car models. Make a scatterplot of hwy vs. displ using the mpg dataframe. Next, map a third, numerical variable (feel free to choose any of the numerical variables available in the dataset; use glimpse(mpg) or ?mpg to find out more about the dataset) to color, then size, then both color and size, then shape. How do these aesthetics behave differently for categorical vs. numerical variables?
Recreating in Tableau
Box Plot
Create a new Sheet in Tableau.
- From the left hand menu, drag “Species” to the “Columns” section at the top of the window.
- From the left hand menu, drag “Body Mass G” to the “Rows” section at the top of the window.
- From the left hand menu, drag “id” to the “Marks” section.
- In the top right corner, click “Show Me” and select the “box and whisker” option.
- Hover over any of the resulting box plots and right-click and select “Edit”.
- Check the “Hide underlying marks (except outliers)” option, then click “OK”.
Stacked Bar Plot
Create a new Sheet in Tableau.
- From the left hand column, drag “Island” to the “Columns” section at the top of the window.
- From the left hand column, drag “id” to the “Rows” section at the top of the window.
- In the “Rows” section, click the drop-down menu next to “id” and select “Measure > Count”.
- From the left hand column, drag “Species” to the “Color” option in the “Marks” section.
Facets
Create a new Sheet in Tableau.
- From the left hand menu, drag “Flipper Length Mm” to the “Columns” section at the top of the window.
- From the left hand menu, drag “Body Mass G” to the “Rows” section at the top of the window.
- From the left hand menu, drag “id” to the “Marks” section.
- From the left hand menu, drag “Island” to the “Columns” section.
- From the left hand menu, drag “Sex” to the “Rows” section.
Saving Your Plots
The ggsave() function allows you to save the most recent plot generated to disk.
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point()
ggsave(filename = "output/penguin-plot.png")
You can also use ggsave() to save a specific plot that you’ve assigned to an object.
p <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point()
ggsave(p, filename = "output/penguin-plot.png")
Summary
We’ve now covered four important steps of data analysis:
- Importing data
- Tidying data
- Transforming data
- Visualizing data
In the rest of this course, we will focus on the modeling portion of data analysis in the context of marketing problems and how to use the insights from those analyses to make important marketing decisions.