1. Introduction

Data visualization is a powerful tool for understanding and communicating patterns, trends, and insights within datasets. Among the numerous data visualization libraries available, ggplot2 stands out as one of the most versatile and widely-used tools for creating high-quality visualizations in the R programming language. Developed by Hadley Wickham, ggplot2 is based on the grammar of graphics and provides a structured and flexible approach to creating visualizations.

What is ggplot2?

ggplot2 is an R package that follows the philosophy of the “Grammar of Graphics.” This philosophy is centered around the idea that every data visualization can be constructed by combining a limited set of fundamental components:

  1. Data: The dataset you want to visualize.
  2. Aesthetics: How you map variables in your data to visual elements, such as colors, shapes, and sizes.
  3. Geometric Objects (geoms): The type of visual representation you want to use, such as points, lines, bars, or polygons.
  4. Statistical Transformations (stats): Any data transformations or statistical summaries you want to apply to your data before visualization.
  5. Coordinate System (coord): The system that defines how data points are mapped to the plot’s physical space.
  6. Faceting: How you split the data into subsets and create separate plots for each subset.
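As a rough illustration of how these six components fit together, the sketch below builds one plot from R’s built-in mtcars data set. That data set is an assumption made here only because it needs no extra package; it is not part of the examples in this tutorial.

```r
library(ggplot2)

# mtcars ships with base R and is used purely for illustration.
p <- ggplot(data = mtcars,                    # 1. data
            aes(x = wt, y = mpg)) +           # 2. aesthetics: weight -> x, mpg -> y
  geom_point(aes(colour = factor(cyl))) +     # 3. geom: one point per car
  stat_smooth(method = "lm", se = FALSE) +    # 4. stat: fitted linear trend
  coord_cartesian(xlim = c(1, 6)) +           # 5. coord: Cartesian axes, fixed x range
  facet_wrap(~ am)                            # 6. faceting: one panel per transmission type

p
```

Each `+` adds one component, so any line can be removed or swapped without touching the others.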

Key Features and Benefits of ggplot2

  1. Elegant Syntax: ggplot2’s syntax is both intuitive and expressive, making it easy to create complex visualizations with minimal code.

  2. Layered Approach: You can add layers to a plot, allowing you to build up visualizations step by step. This layered approach makes it easy to customize and modify plots as needed.

  3. Customization: ggplot2 offers extensive options for customization, including control over colors, themes, labels, and annotations.

  4. Data Exploration: ggplot2 is not just for creating final visualizations; it’s also a valuable tool for exploring your data. You can quickly generate multiple views of your data to gain insights.

  5. Publication-Quality Graphics: The default aesthetics and themes in ggplot2 are designed to create visually appealing, publication-ready graphics.

Common Use Cases

  1. Scatterplots: Visualize the relationship between two continuous variables using points, allowing you to identify correlations, clusters, or outliers.

  2. Bar Charts and Histograms: Display the distribution of a single variable or compare multiple categories using bars or histograms.

  3. Line Charts: Illustrate trends or changes in data over time or across ordered categories.

  4. Box Plots and Violin Plots: Visualize the distribution of a variable’s values, including summary statistics like medians, quartiles, and outliers.

  5. Heatmaps: Display complex relationships between two variables using color intensity.

  6. Faceted Plots: Split your data into subsets based on one or more categorical variables, creating multiple plots for comparison.

  7. Geospatial Visualizations: Create maps and spatial plots by mapping data points or polygons to geographic coordinates.

2. Tools for Displaying Single Variables

Histogram

A histogram is a graphical representation of the distribution of a dataset. It’s a type of bar chart that displays the frequencies or counts of data points falling into specific intervals or “bins” along the horizontal axis, with the vertical axis representing the frequency of observations in each bin. Histograms are particularly useful for understanding the underlying structure of a dataset, revealing patterns, trends, and the shape of the data’s distribution.

Here are some key characteristics and components of a histogram:

  1. Bins or Intervals: The horizontal axis is divided into contiguous intervals or bins, which are typically of equal width. Each bin represents a range of values within which data points are grouped.

  2. Frequency or Count: The vertical axis represents the frequency or count of data points that fall into each bin. This is a measure of how many data points belong to each interval.

  3. Bars: Bars or rectangles are drawn above each bin, with the height of each bar indicating the frequency of data points in that interval. The width of the bars is usually determined by the width of the bins.

  4. No Gaps: There are no gaps between the bars in a histogram because the bins are contiguous.

Histograms are commonly used in data analysis and visualization for various purposes, including:

  • Data Distribution: Histograms provide insights into the distribution of data, helping to identify patterns, central tendencies (e.g., mean or median), and variations.

  • Skewness and Symmetry: They help determine whether a distribution is symmetric or skewed (positively or negatively).

  • Outliers: Outliers, which are data points significantly different from the majority, often appear in a histogram as short, isolated bars separated from the main body of the distribution by empty bins.

  • Data Transformation: Histograms can guide decisions about data transformations, such as log or square root transformations, to make the data more suitable for certain statistical analyses.

  • Choosing Bin Width: The choice of bin width can affect the appearance of the histogram and the insights it provides. Careful selection of bin width is important for accurate interpretation.

Histograms are a fundamental tool in exploratory data analysis (EDA) and are often used alongside other visualizations and statistical techniques to gain a deeper understanding of datasets. They are particularly useful for continuous data but can also be adapted for discrete data by appropriately defining bins.
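To make the idea of bins and counts concrete, here is a minimal sketch using simulated right-skewed data (the data and the two bin widths are assumptions chosen only for illustration). Calling hist() with plot = FALSE returns the bin edges and counts without drawing anything, which makes the effect of bin width easy to inspect.

```r
set.seed(1)                      # reproducible simulated data
x <- rexp(500, rate = 1 / 50)    # right-skewed values, purely illustrative

# plot = FALSE returns the binning instead of drawing it
h_wide   <- hist(x, breaks = seq(0, max(x) + 50, by = 50), plot = FALSE)
h_narrow <- hist(x, breaks = seq(0, max(x) + 10, by = 10), plot = FALSE)

h_wide$breaks    # bin edges for the 50-unit bins
h_wide$counts    # number of observations in each bin

# Every observation falls in exactly one bin, whatever the bin width
sum(h_wide$counts) == length(x)
sum(h_narrow$counts) == length(x)
```

Wider bins smooth over detail, while narrower bins show more structure but also more noise.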

To demonstrate the use of histograms, we will use the Wage data set from the ISLR package.

library(ISLR)  # provides the Wage data set
View(Wage)
with(Wage, hist(wage, nclass=20, col="grey", border="navy",
main="", xlab="Wage", cex=1.2))
title(main = "Distribution of Wage", cex=1.2, col.main="navy",
font.main=4)

The above R code chunk creates a histogram of the “wage” variable in a dataset called “Wage.” Let’s break down the code and explain the output it produces:

  1. View(Wage): This command is not directly related to the histogram creation but is used to open a data viewer for the “Wage” dataset. It allows you to interactively explore the contents of the dataset in a separate viewer window.

  2. with(Wage, hist(wage, nclass=20, col="grey", border="navy", main="", xlab="Wage", cex=1.2)):

    • with(Wage, ...): This function is used to specify that the variables and data should be taken from the “Wage” dataset.

    • hist(wage, nclass=20, col="grey", border="navy", main="", xlab="Wage", cex=1.2): Within the context of the “Wage” dataset, this command creates a histogram of the “wage” variable with the following arguments:

      • wage: This is the variable from the dataset that will be used to create the histogram.

      • nclass=20: It suggests that the histogram should have about 20 bins or intervals. R treats this number as a hint and picks “pretty” equally spaced break points, so the actual number of bins can differ slightly. The frequency of wage values in each interval is then counted and plotted as bars.

      • col="grey": Sets the fill color of the histogram bars to grey.

      • border="navy": Sets the border color of the histogram bars to navy blue.

      • main="": Sets an empty main title for the plot. You will add a custom title separately.

      • xlab="Wage": Sets the label for the x-axis to “Wage.”

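      • cex=1.2: Requests text and symbols scaled by a factor of 1.2. Axis labels and tick annotations have their own parameters (cex.lab and cex.axis), which are the more reliable way to resize them.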

  3. title(main = "Distribution of Wage", cex=1.2, col.main="navy", font.main=4): This line of code adds a custom title to the histogram plot:

    • main = "Distribution of Wage": Sets the main title to “Distribution of Wage.”

    • cex=1.2: Requests text scaled by a factor of 1.2. The size of the main title is governed by cex.main, so cex.main=1.2 is the more reliable way to enlarge it.

    • col.main="navy": Sets the color of the main title text to navy blue.

    • font.main=4: Specifies that the main title should be displayed in bold italic (font 2 is plain bold).

By running the code, the output is a histogram plot of the “wage” variable from the “Wage” dataset. The histogram has roughly 20 bins, grey bars with navy blue borders, and a custom title that reads “Distribution of Wage.” It provides a visual representation of the distribution of wage values in the dataset, showing how wages are distributed and whether there are any notable patterns or characteristics in the data. As can be observed, most wages are concentrated toward the left of the plot, between about 50 and 150, with a peak around 100 and a long tail toward higher wages.
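Because nclass is only a suggestion, it can be worth checking how many bins hist() actually used. The sketch below does this on simulated data; the variable wage_like is a hypothetical stand-in, not the real Wage$wage column.

```r
set.seed(42)
# Hypothetical stand-in for a wage variable: right-skewed, centred near 100
wage_like <- rlnorm(3000, meanlog = log(100), sdlog = 0.35)

h <- hist(wage_like, nclass = 20, plot = FALSE)

length(h$counts)   # number of bins actually used (may differ from 20)
diff(h$breaks)     # bin widths: all equal, at a rounded "pretty" value
```

If an exact bin layout matters, passing an explicit vector of break points through the breaks argument removes the guesswork.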

library(ggplot2)  # provides ggplot() and the layers used below
p <- ggplot(data = Wage, aes(x=wage))
p <- p + geom_histogram(binwidth=25, aes(fill=race))
p <- p + scale_fill_brewer(palette="Set1")
p <- p + facet_wrap( ~ race, ncol=2)
p <- p + labs(x="Wage", y="Frequency") +
  theme(axis.title = element_text(color="black", face="bold"))
p <- p + ggtitle("Histogram of Wage by Race") +
  theme(plot.title = element_text(color="black", face="bold", size=16))
p

We can use the ggplot2 library to create a faceted histogram plot to visualize the distribution of wages by race in the “Wage” dataset. Let’s break down the code step by step and interpret its components:

p <- ggplot(data = Wage, aes(x=wage))
  • p <- ggplot(data = Wage, aes(x=wage)): This line initializes a ggplot object named “p.” It specifies the data source as the “Wage” dataset and defines the aesthetic mapping (aes) for the x-axis (horizontal axis) to the “wage” variable. This means that the “wage” variable will be plotted on the x-axis.
p <- p + geom_histogram(binwidth=25, aes(fill=race))
  • p <- p + geom_histogram(binwidth=25, aes(fill=race)): This line adds a histogram layer to the ggplot object “p.” It creates a histogram of the “wage” variable with several settings:
    • binwidth=25: Sets the width of each histogram bin to 25 units. This determines the range of values that will be grouped together in each bin.
    • aes(fill=race): Maps the “race” variable to the fill aesthetic, which means that the bars in the histogram will be colored based on the “race” variable. Each race category will have a different color.
p <- p + scale_fill_brewer(palette="Set1")
  • p <- p + scale_fill_brewer(palette="Set1"): This line sets the fill colors of the bars in the histogram using the “Set1” color palette from the Brewer color scales. This step customizes the appearance of the histogram bars.
p <- p + facet_wrap( ~ race, ncol=2)
  • p <- p + facet_wrap( ~ race, ncol=2): This line uses the facet_wrap function to create a faceted plot, which means that separate histograms will be created for each unique value of the “race” variable. The ncol=2 argument specifies that the facet panels should be arranged in two columns.
p <- p + labs(x="Wage", y="Frequency")+ theme(axis.title = element_text(color="black", face="bold"))
  • p <- p + labs(x="Wage", y="Frequency")+ theme(axis.title = element_text(color="black", face="bold")): This line adds axis labels to the plot, with “Wage” as the x-axis label and “Frequency” as the y-axis label. It also customizes the appearance of the axis titles, making them bold and setting their text color to black.
p <- p + ggtitle("Histogram of Wage by Race") + theme(plot.title = element_text(color="black", face="bold", size=16))
  • p <- p + ggtitle("Histogram of Wage by Race") + theme(plot.title = element_text(color="black", face="bold", size=16)): This line adds a main title to the plot, which reads “Histogram of Wage by Race.” It also customizes the appearance of the main title by making it bold, setting its text color to black, and increasing its font size.

Finally, p contains the fully customized ggplot object with all the plot layers, aesthetics, labels, and titles. When we evaluate p, it will generate the faceted histogram plot visualizing the distribution of wages by race in the “Wage” dataset. Each facet (subplot) represents a different race category, and the histogram bars are colored according to the “race” variable. This plot helps in comparing the wage distributions across different racial groups.

Boxplot

A box plot, also known as a box-and-whisker plot, is a graphical representation of the distribution of a dataset. It provides a summary of key statistical measures and visually displays the spread, central tendency, and potential outliers within the data. Box plots are especially useful for comparing the distribution of multiple datasets or variables.

Components and characteristics of a box plot

  1. Box: The central rectangular box represents the interquartile range (IQR) of the data, which contains the middle 50% of the observations. The box spans from the first quartile (Q1) to the third quartile (Q3). The length of the box (also known as the “box height”) indicates the spread or variability of the middle half of the data.

  2. Line Inside the Box: A line inside the box (horizontal in a vertical box plot, vertical in a horizontal one) represents the median (Q2), which is the middle value when the data is ordered.

  3. Whiskers: Lines, called whiskers, extend from the edges of the box to the most extreme data points that lie within a specified distance of the box, conventionally 1.5 times the IQR beyond Q1 and Q3. Whiskers give a sense of the data’s range and help flag potential outliers.

  4. Outliers: Data points outside the whiskers are considered potential outliers and are often individually marked. Outliers are values that are significantly different from the majority of the data points and may be of particular interest for further investigation.

Box plots can be created horizontally (with the box extending horizontally) or vertically (with the box extending vertically), depending on the orientation of the plot.
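The quantities behind a box plot can be computed by hand. The following sketch uses a small made-up vector (an assumption for illustration) and the common 1.5 × IQR rule for the whisker limits, then compares the result with R’s own boxplot.stats().

```r
# Toy data; 60 is a deliberately extreme value
x <- c(12, 15, 17, 18, 20, 21, 23, 24, 26, 60)

q   <- quantile(x, c(0.25, 0.5, 0.75))   # Q1, median, Q3
iqr <- IQR(x)                            # Q3 - Q1

# Conventional whisker limits: 1.5 * IQR beyond the quartiles
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[3]] + 1.5 * iqr

# Points outside the fences are flagged as potential outliers
outliers <- x[x < lower_fence | x > upper_fence]
outliers

# boxplot() relies on boxplot.stats(), which uses Tukey's hinges rather than
# quantile(), so its box edges can differ slightly from q above
boxplot.stats(x)$out
```

On this vector both approaches flag the same single value, even though the hinges and the quantiles are not identical.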

Box plots are valuable for several purposes:

  • Comparing Distributions: Box plots make it easy to visually compare the distributions of multiple datasets or variables side by side.

  • Identifying Skewness: Skewness in the data can often be observed by examining the asymmetry of the box and whiskers.

  • Detecting Outliers: Outliers are readily visible in a box plot, helping to identify data points that may warrant further investigation.

  • Summarizing Data: Box plots provide a concise summary of the central tendency and spread of a dataset without displaying all data points.

  • Data Exploration: They are useful for exploratory data analysis (EDA) and understanding the basic characteristics of a dataset.

Here’s a simplified example of how to create a box plot in R using the base graphics boxplot() function:

with(Wage,boxplot(wage,col="grey", border="navy", main="",
xlab="Wage",pch = 19, cex=0.8))
title(main = "Distribution of Wage", cex=1.2, col.main="navy",
font.main=4)

In this example, the box plot visualizes the distribution of wages in the “Wage” dataset. Let’s break down the code and explain its components step by step:

with(Wage, boxplot(wage, col="grey", border="navy", main="", xlab="Wage", pch=19, cex=0.8))
  1. with(Wage, ...): This function is used to specify that the variables and data should be taken from the “Wage” dataset. It sets the context for the following code.

  2. boxplot(wage, col="grey", border="navy", main="", xlab="Wage", pch=19, cex=0.8): Within the context of the “Wage” dataset, this code creates a box plot of the “wage” variable with the following settings:

    • wage: This is the variable from the dataset that will be plotted as a box plot. The box plot summarizes the distribution of wage values.

    • col="grey": Sets the color of the boxes in the box plot to grey.

    • border="navy": Sets the border color of the boxes to navy blue.

    • main="": Sets an empty main title for the plot. You will add a custom title separately.

    • xlab="Wage": Sets the label for the x-axis to “Wage.”

    • pch=19: Sets the plotting character used for the points that the box plot draws individually, namely the outliers beyond the whiskers. Character 19 is a solid circle.

    • cex=0.8: Scales those point characters to 80% of the default size.

The box plot visually represents the distribution of wage values, including key statistics such as the median, quartiles (Q1 and Q3), and potential outliers. Only the outliers are drawn as individual solid circles; the remaining observations are summarized by the box and whiskers.

title(main = "Distribution of Wage", cex=1.2, col.main="navy", font.main=4)
  1. title(main = "Distribution of Wage", cex=1.2, col.main="navy", font.main=4): This line of code adds a custom main title to the plot:

    • main = "Distribution of Wage": Sets the main title to “Distribution of Wage.”

    • cex=1.2: Requests text scaled by a factor of 1.2. The size of the main title is governed by cex.main, so cex.main=1.2 is the more reliable way to enlarge it.

    • col.main="navy": Sets the color of the main title text to navy blue.

    • font.main=4: Specifies that the main title should be displayed in a bold italic font (font 2 is plain bold).

As stated earlier, box plots are especially useful for comparing distributions across groups. The ggplot2 code that follows this breakdown draws box plots of wage by race in the “Wage” dataset. Let’s break down each part of the code and explain what it does:

  1. p1 <- ggplot(Wage, aes(x=race, y=wage)) + geom_boxplot():
    • p1 is initialized as a ggplot object.
    • Wage is specified as the dataset.
    • aes(x=race, y=wage) defines the aesthetics for the plot, mapping the “race” variable to the x-axis and the “wage” variable to the y-axis.
    • geom_boxplot() adds a boxplot layer to the ggplot object, creating individual boxplots for each race category, displaying the distribution of wages.
  2. p2 <- p1 + labs(x="Race", y="Wage") + theme(axis.title = element_text(color="black", face="bold", size = 12)):
    • p2 is created by building upon the existing p1.
    • labs(x="Race", y="Wage") sets the x-axis label to “Race” and the y-axis label to “Wage.”
    • theme(axis.title = element_text(color="black", face="bold", size = 12)) customizes the appearance of axis titles by setting their text color to black, making them bold, and adjusting the font size to 12.
  3. p3 <- p2 + ggtitle("Boxplot of Wage by Race") + theme(plot.title = element_text(color="black", face="bold", size=16)):
    • p3 is created by building upon the existing p2.
    • ggtitle("Boxplot of Wage by Race") adds a main title to the plot, which reads “Boxplot of Wage by Race.”
    • theme(plot.title = element_text(color="black", face="bold", size=16)) customizes the appearance of the main title by setting its text color to black, making it bold, and adjusting the font size to 16.
  4. p3:
    • The code simply evaluates p3, which combines all the layers and settings specified in the previous steps.
    • As a result, it generates a boxplot visualization of wage distributions, with individual boxplots for each race category.
    • The x-axis is labeled “Race,” and the y-axis is labeled “Wage.”
    • The plot includes a main title, “Boxplot of Wage by Race,” displayed at the top.
    • Axis titles, main title, and plot text are customized with specific colors, font styles, and sizes for better readability and aesthetics.

p1 <- ggplot(Wage, aes(x=race, y=wage)) + geom_boxplot()
p2 <- p1 + labs(x="Race", y="Wage") +
  theme(axis.title = element_text(color="black", face="bold", size = 12))
p3 <- p2 + ggtitle("Boxplot of Wage by Race") +
  theme(plot.title = element_text(color="black", face="bold", size=16))
p3

The above code creates a well-customized boxplot visualization that allows you to compare the wage distributions across different race categories in the “Wage” dataset. It provides insights into the central tendencies, spread, and potential outliers in wage data for each race group.

3. Tools for Displaying Relationships Between Two Variables

Scatterplot

A scatter plot is a type of data visualization that displays individual data points as dots or markers on a two-dimensional plane, typically with one variable plotted on the x-axis (horizontal) and another variable plotted on the y-axis (vertical). Each data point represents a unique observation or data record, and its position on the plot corresponds to the values of the two variables it represents. Scatter plots are useful for visually examining the relationship or association between two continuous variables.

Key characteristics and uses of scatter plots include

  1. Identification of Patterns: Scatter plots help reveal patterns, trends, or relationships between two variables. Depending on the pattern observed, you can infer whether there is a positive, negative, or no association between the variables.

  2. Correlation Assessment: Scatter plots are often used to assess the strength and direction of correlation between two variables. In a positive correlation, as one variable increases, the other tends to increase as well. In a negative correlation, as one variable increases, the other tends to decrease.

  3. Outlier Detection: Outliers, which are data points significantly different from the majority, can often be identified as points far from the main cluster in a scatter plot. Outliers can be important for understanding data quality and potential anomalies.

  4. Data Clustering: Scatter plots can reveal the presence of clusters or groups within the data, especially when there are distinct concentrations of data points.

  5. Data Distribution: They provide a visual representation of the distribution of data points in the two-dimensional space, showing where most data points are concentrated.

  6. Visualizing Multivariate Data: Scatter plots can also be extended to display multivariate data by using color, size, or shape to represent additional variables. When marker size encodes a third variable, the result is usually called a bubble chart; a 3D scatter plot instead adds a third spatial axis.
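The direction and strength of a linear association seen in a scatter plot can be summarized numerically with cor(). The sketch below uses simulated variables (all names and values are assumptions for illustration) to contrast positive, negative, and absent correlation.

```r
set.seed(7)
x <- runif(200, min = 0, max = 10)

y_pos  <-  2 * x + rnorm(200, sd = 2)   # tends to rise with x
y_neg  <- -2 * x + rnorm(200, sd = 2)   # tends to fall as x rises
y_none <- rnorm(200)                    # generated independently of x

cor(x, y_pos)    # close to +1
cor(x, y_neg)    # close to -1
cor(x, y_none)   # close to 0

# The matching scatter plot for the positive case
plot(x, y_pos, pch = 19, cex = 0.6, main = "Positive association")
```

A coefficient near zero only rules out a linear relationship, so it is still worth looking at the plot itself for curved patterns or clusters.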

Here’s a simple example of creating a scatter plot in R with the base graphics plot() function and the Wage dataset:

with(Wage, plot(age, wage, pch = 19, cex=0.6))
title(main = "Relationship between Age and Wage")

This example generates a scatter plot to visualize the relationship between two variables, “age” and “wage,” from the “Wage” dataset. Here’s a breakdown of the code and its components:

  1. with(Wage, ...): This function sets the context for the subsequent code. It specifies that the variables and data should be taken from the “Wage” dataset.

  2. plot(age, wage, pch = 19, cex = 0.6):

    • plot(age, wage, pch = 19, cex = 0.6) creates a scatter plot of two variables, “age” and “wage,” from the “Wage” dataset.
    • age is specified as the x-axis variable, and wage is specified as the y-axis variable.
    • pch = 19 sets the point character (marker) type to 19, which corresponds to a solid circle. This determines how the data points will be represented in the scatter plot.
    • cex = 0.6 adjusts the size of the point characters to be 60% of the default size. This controls the size of the circles representing data points in the plot.
  3. title(main = "Relationship between Age and Wage"):

    • title(main = "Relationship between Age and Wage") adds a main title to the plot, which reads “Relationship between Age and Wage.”

Hence, this code creates a scatter plot where the x-axis represents “age,” the y-axis represents “wage,” and individual data points are displayed as small solid circles. The plot helps visualize the relationship between age and wage in the “Wage” dataset. You can use this scatter plot to assess whether there is any apparent correlation or pattern between these two variables, which may provide insights into how age relates to wages in the dataset.

The scatter plot can also be used to visualise more than two variables. The example code generates a scatter plot that visualizes the relationship between two variables, “age” and “wage,” from the “Wage” dataset, while also distinguishing data points by race using different colors. Let’s give a breakdown of the code and its components:

  1. with(Wage, ...): This function sets the context for the subsequent code. It specifies that the variables and data should be taken from the “Wage” dataset.

  2. plot(age, wage, col = c("lightgreen","navy","mediumvioletred","red")[race], pch = 19, cex = 0.6):

    • plot(age, wage, ...) creates a scatter plot of two variables, “age” and “wage,” from the “Wage” dataset.
    • age is specified as the x-axis variable, and wage is specified as the y-axis variable.
    • col = c("lightgreen","navy","mediumvioletred","red")[race] assigns colors to data points based on the “race” variable. Because “race” is a factor, indexing the color vector with it uses each observation’s underlying integer level code (1 for the first level, 2 for the second, and so on), so every data point receives the color in the position that matches its race level.
    • pch = 19 sets the point character (marker) type to 19, which corresponds to a solid circle.
    • cex = 0.6 adjusts the size of the point characters to be 60% of the default size.
  3. legend(70, 310, ...): This function adds a legend to the plot to explain the colors used for different races.

    • 70 and 310 specify the coordinates (x, y) where the legend will be positioned within the plot.
    • legend=levels(Wage$race) provides the labels for the legend based on the unique levels of the “race” variable in the dataset.
    • col=c("lightgreen","navy","mediumvioletred", "red") specifies the colors corresponding to each race category.
    • bty="n" removes the box around the legend.
    • cex=0.7 adjusts the size of the legend text to be 70% of the default size.
    • pch=19 sets the legend symbols to be solid circles to match the data points.
  4. title(main = "Relationship between Age and Wage by Race"):

    • title(main = "Relationship between Age and Wage by Race") adds a main title to the plot, which reads “Relationship between Age and Wage by Race.”

with(Wage, plot(age, wage,
     col = c("lightgreen","navy","mediumvioletred","red")[race],
     pch = 19, cex=0.6))
legend(70, 310, legend=levels(Wage$race),
       col=c("lightgreen","navy","mediumvioletred", "red"),
       bty="n", cex=0.7, pch=19)
title(main = "Relationship between Age and Wage by Race")

Therefore, we have successfully generated a scatter plot that shows the relationship between age and wage, with data points colored according to race. The legend explains the color-coding used for different race categories. This visualization allows you to examine how age and wage are distributed across different racial groups in the “Wage” dataset, making it easier to identify patterns or differences in the data.
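The color assignment relies on indexing a vector with a factor, which may not be obvious at first sight. A minimal sketch with a made-up three-level factor (grp and palette_vec are hypothetical names) shows the mechanism in isolation.

```r
# Hypothetical grouping factor with explicitly ordered levels
grp <- factor(c("b", "a", "c", "a"), levels = c("a", "b", "c"))

# One color per level, in level order
palette_vec <- c("lightgreen", "navy", "red")

as.integer(grp)     # underlying level codes: 2 1 3 1
palette_vec[grp]    # "navy" "lightgreen" "red" "lightgreen"

# levels(grp) is in the same order as palette_vec, which is why a legend
# built from levels() lines up with the colors
levels(grp)
```

The color vector must list colors in the same order as the factor levels, otherwise points and legend entries will disagree.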

Contour Plot

A contour plot, also known as a contour chart, is a two-dimensional data visualization technique used to display the three-dimensional surface of a function or dataset. When the surface being drawn is an estimated density of the data, the result is often called a 2D density plot. It is particularly useful for visualizing relationships between two continuous variables and showing how their values change across a range of input values. Contour plots are commonly used in scientific, engineering, and data analysis contexts to visualize data and understand complex relationships.

Key characteristics and uses of contour plots include

  1. Representation of a Surface: Contour plots represent a three-dimensional surface in a two-dimensional space. The contour lines, curves, or regions on the plot correspond to different values of a continuous variable or a function of two variables.

  2. Data Density Visualization: Contour plots can be used to visualize the density or concentration of data points in a scatter plot. Regions with denser data points have more contour lines or shading, while sparser regions have fewer or no contour lines.

  3. Heatmaps: Contour plots are often combined with heatmaps to represent the values of a third variable. Color shading within the contour lines can be used to indicate the magnitude of the third variable.

  4. Interpolation: Contour plots can interpolate between data points to estimate values for intermediate points on the plot, providing a smooth representation of the surface.

  5. Identifying Critical Points: Contour plots can help identify critical points on the surface, such as maxima, minima, and saddle points, where the gradient of the function is zero.

  6. Parameter Tuning: In scientific and engineering applications, contour plots are often used to explore how changing parameters affect the behavior of a system or function.

Here’s an example of how to create a contour plot in R using ggplot2:

d0 <- ggplot(Wage,aes(age, wage))+ stat_density2d()
d0 <- d0 +labs(x="Age", y="Wage")+ theme(axis.title =
element_text(color="black", face="bold"))
d0 + ggtitle("Contour Plot of Age and Wage") + theme(plot.title
= element_text(color="black", face="bold", size=16))

In this example, we use the ggplot2 library to create a contour plot that visualizes the joint distribution and density of age and wage values in the “Wage” dataset. Here’s a breakdown of the code and its components:

d0 <- ggplot(Wage, aes(age, wage)) + stat_density2d()
  1. d0 <- ggplot(Wage, aes(age, wage)): This line initializes a ggplot object named “d0” and specifies the “Wage” dataset as the data source. It also defines the aesthetic mapping (aes) for the plot, with “age” mapped to the x-axis and “wage” mapped to the y-axis. This sets up the basic structure of the plot.

  2. + stat_density2d(): This line adds a stat_density2d layer to the ggplot object. The stat_density2d function computes a two-dimensional kernel density estimate of the joint distribution of age and wage and draws it as contour lines. Each contour connects (age, wage) combinations with the same estimated density, so the innermost contours mark the regions where observations are most concentrated.

d0 <- d0 + labs(x="Age", y="Wage") + theme(axis.title = element_text(color="black", face="bold"))
  1. d0 <- d0 + labs(x="Age", y="Wage"): This line adds axis labels to the plot. It sets the x-axis label to “Age” and the y-axis label to “Wage.”

  2. + theme(axis.title = element_text(color="black", face="bold")): This line customizes the appearance of axis titles. It sets the text color of the axis titles to black and makes them bold for better readability.

d0 + ggtitle("Contour Plot of Age and Wage") + theme(plot.title = element_text(color="black", face="bold", size=16))
  1. d0 + ggtitle("Contour Plot of Age and Wage"): This line adds a main title to the plot, which reads “Contour Plot of Age and Wage.”

  2. + theme(plot.title = element_text(color="black", face="bold", size=16)): This line customizes the appearance of the main title. It sets the text color of the main title to black, makes it bold, and increases its font size to 16 for emphasis.

In summary, this code creates a contour plot using ggplot2 to visualize the density of wage values across different age values in the “Wage” dataset. It provides insights into how wages are distributed with respect to age and allows you to identify areas of higher or lower wage density. The plot includes axis labels and a main title, along with customizations for the appearance of the text elements.
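To see what such a density layer computes, the estimate can be produced directly. The sketch below uses MASS::kde2d, which, to the best of my knowledge, is the estimator ggplot2 relies on internally, together with base R’s contour(). The simulated age_like and wage_like variables are hypothetical stand-ins for the real columns.

```r
library(MASS)   # kde2d(); MASS is distributed with standard R installations

set.seed(3)
# Hypothetical stand-ins for age and wage
age_like  <- rnorm(1000, mean = 42, sd = 11)
wage_like <- 60 + 1.2 * age_like + rnorm(1000, sd = 25)

# Two-dimensional kernel density estimate on a 50 x 50 grid
dens <- kde2d(age_like, wage_like, n = 50)

dim(dens$z)     # 50 50: one density value per grid cell

# Draw contour lines of equal estimated density
contour(dens, xlab = "Age", ylab = "Wage",
        main = "Contours of a 2D kernel density estimate")
```

The grid size n trades smoothness of the drawn contours against computation time; 50 is an arbitrary choice here.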

4. Tools for Displaying More Than Two Variables

Scatterplot Matrix

A scatterplot matrix, also known as a pairs plot or SPLOM, is a grid of scatterplots that displays pairwise relationships between multiple variables in a dataset. Each scatterplot in the matrix represents the relationship between two different variables. Scatterplot matrices are a valuable tool for exploratory data analysis (EDA) and are commonly used in statistics and data visualization to gain insights into the relationships, patterns, and distributions of variables in a multivariate dataset.

Key characteristics and uses of scatterplot matrices include

  1. Pairwise Comparison: Each scatterplot in the matrix compares two variables, allowing you to quickly visualize how one variable relates to another.

  2. Diagonal Elements: The diagonal of the matrix typically includes univariate plots, such as histograms, density plots, or boxplots, for each variable. This provides information about the distribution of individual variables.

  3. Multivariate Exploration: Scatterplot matrices are particularly useful for examining how variables interact when considered together, helping you identify potential correlations, clusters, or outliers.

  4. Identifying Trends and Patterns: Scatterplot matrices allow you to identify linear or nonlinear trends, clusters, and other patterns in the data. These visual patterns can guide further data analysis.

  5. Outlier Detection: Outliers or unusual data points may be apparent in scatterplots as data points that deviate from the overall pattern.

  6. Dimensionality Reduction: By visualizing relationships in the data, scatterplot matrices can help you decide which variables are most relevant for further analysis and modeling.

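Here’s a simple example of a scatterplot matrix. Note that it is drawn with the scatterplotMatrix function from the “car” package, which uses base R graphics rather than ggplot2.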

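library(ISLR)  # provides the College dataset
library(car)   # provides scatterplotMatrix()
attach(College)
X <- cbind(Apps, Accept, Enroll, Room.Board, Books)
# Note: these argument names follow older car releases; newer releases may
# expect regLine=, smooth= and diagonal=list(method="boxplot") instead.
scatterplotMatrix(X, diagonal=c("boxplot"), reg.line=F,
                  smoother=F, pch=19, cex=0.6, col="blue")
title(main="Scatterplot Matrix of College Attributes",
      col.main="navy", font.main=4, line = 3)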

The provided code uses the “College” dataset to create a scatterplot matrix using the scatterplotMatrix function from the “car” package. Here’s a breakdown of the code and its components:

  1. attach(College): This line attaches the “College” dataset to the R environment. It allows you to refer to the variables in the dataset directly without specifying the dataset name each time. However, note that using attach is not always recommended because it can lead to unintended behavior in complex code. It’s often better to use the dataset explicitly.

  2. library(car): This line loads the “car” package, which provides the scatterplotMatrix function for creating scatterplot matrices.

  3. X <- cbind(Apps, Accept, Enroll, Room.Board, Books): This line creates a new numeric matrix X by binding together the columns “Apps,” “Accept,” “Enroll,” “Room.Board,” and “Books” from the “College” dataset. These variables will be used for the scatterplot matrix.

  4. scatterplotMatrix(X, diagonal=c("boxplot"), reg.line=F, smoother=F, pch=19, cex=0.6, col="blue"):

    • scatterplotMatrix(X, ...) generates a scatterplot matrix using the variables (columns) in the matrix X.
    • diagonal=c("boxplot") specifies that diagonal elements of the scatterplot matrix should be boxplots for each variable, showing their distributions.
    • reg.line=F disables the display of regression lines on scatterplots.
    • smoother=F disables the display of smoothing lines on scatterplots.
    • pch=19 sets the type of point character (marker) to 19, which corresponds to solid circles.
    • cex=0.6 adjusts the size of the point characters to be 60% of the default size.
    • col="blue" sets the color of the points in the scatterplots to blue.
  5. title(main="Scatterplot Matrix of College Attributes", col.main="navy", font.main=4, line = 3):

    • title(main=...) adds a main title to the scatterplot matrix. The main title is “Scatterplot Matrix of College Attributes.”
    • col.main="navy" sets the color of the main title text to navy blue.
    • font.main=4 specifies that the main title should be displayed in a bold italic font (in base R graphics, 2 is bold, 3 is italic, and 4 is bold italic).
    • line=3 places the main title three margin lines out from the edge of the plot region, which moves it further above the plot.

The above code attaches the “College” dataset, creates a scatterplot matrix of selected variables, and customizes the appearance of the plot with titles, colors, and point styles. The scatterplot matrix provides a visual representation of the relationships between these variables and their distributions. It is a useful tool for exploring and understanding the data’s multivariate characteristics.
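As a minimal sketch of the same idea without attach(), the columns can be selected explicitly and plotted with base R’s pairs() function. This is an illustration under stated assumptions, not the code used above: the built-in mtcars data stands in for the “College” dataset, panel_hist is a hypothetical helper name, and histograms replace the boxplots on the diagonal.

```r
# Built-in mtcars stands in for the College data here (an assumption);
# with College, `vars` would be c("Apps", "Accept", "Enroll", "Room.Board", "Books").
dat <- mtcars
vars <- c("mpg", "disp", "hp", "wt", "qsec")

# Explicit column selection: no attach() needed, and the result is a matrix.
X <- as.matrix(dat[, vars])

# Hypothetical helper: draws a histogram, rescaled to fit, in each diagonal panel.
# Extra arguments forwarded by pairs() (pch, col, ...) are deliberately ignored.
panel_hist <- function(x, ...) {
  usr <- par("usr")
  old <- par(usr = c(usr[1:2], 0, 1.5))
  on.exit(par(old))
  h <- hist(x, plot = FALSE)
  n <- length(h$breaks)
  rect(h$breaks[-n], 0, h$breaks[-1], h$counts / max(h$counts), col = "grey80")
}

# Draw to a temporary PDF so the sketch also runs non-interactively.
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)
pairs(X, diag.panel = panel_hist, pch = 19, cex = 0.6, col = "blue",
      main = "Scatterplot Matrix (base R sketch)")
invisible(dev.off())
```

Selecting columns by name keeps the code independent of whatever happens to be attached to the search path, which is the concern raised about attach() above.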

Parallel Coordinates

A parallel coordinates plot (also known as a parallel coordinates chart or parallel coordinate plot) is a data visualization technique used to display multivariate data. It is particularly useful for exploring the relationships and patterns in datasets with many numerical variables. A parallel coordinates plot represents each data point as a polyline (a series of connected line segments) that spans parallel axes, with each axis corresponding to a different variable.

Key characteristics and uses of parallel coordinates plots include:

  1. Multivariate Visualization: Parallel coordinates plots are effective for visualizing multiple variables simultaneously, making them suitable for exploring complex datasets with many dimensions.

  2. Data Comparison: They allow for the comparison of data points along each axis, enabling you to identify trends, clusters, and patterns across variables.

  3. Interactivity: Interactive versions of parallel coordinates plots allow users to brush or highlight data points to focus on specific subsets of the data.

  4. Normalization: Data can be normalized or standardized before plotting to ensure that variables with different scales do not dominate the visualization.
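The normalization step in the list above can be sketched in base R as follows. This is only an illustration: the built-in mtcars data is an assumed stand-in, rescale01 is a hypothetical helper name, and min-max rescaling is just one of several possible normalizations.

```r
# Built-in mtcars stands in for the document's data (an assumption).
vars <- c("cyl", "wt", "hp", "disp", "qsec", "mpg")
Y <- as.matrix(mtcars[, vars])

# Hypothetical helper: min-max rescaling of one variable to the range [0, 1].
rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))
Y01 <- apply(Y, 2, rescale01)

# One polyline per observation: after transposing, each column of t(Y01)
# holds one car's scaled values across the six variables.
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)
matplot(t(Y01), type = "l", lty = 1, col = "grey40",
        xaxt = "n", xlab = "", ylab = "Scaled value")
axis(1, at = seq_along(vars), labels = vars)
invisible(dev.off())
```

After rescaling, every axis runs from 0 to 1, so a variable such as displacement (hundreds of units) no longer visually dominates one such as cylinders (single digits).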

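Here’s an example of how to create a parallel coordinates plot in R. Note that it uses the parcoord function from the “MASS” package, which draws with base R graphics rather than ggplot2: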

# Using the Auto data from ISLR: string-match car names on "toyota" and "ford"
# and work with the corresponding data subset. A Make variable must also be created.
library(ISLR)  # provides the Auto dataset
library(MASS)  # provides parcoord()
Comp1 = Auto[c(grep("toyota", Auto$name), grep("ford", Auto$name)), ]
# The counts below assume 25 Toyota rows followed by 48 Ford rows in Comp1
Comp1$Make = c(rep("Toyota", 25), rep("Ford", 48))
Y = with(Comp1, cbind(cylinders, weight, horsepower, displacement, acceleration, mpg))
# Colors by condition:
car.colors = ifelse(test = Comp1$Make=="Ford", yes = "blue", no = "magenta")
# Line type by condition:
car.lty = ifelse(test = Comp1$year < 75, yes = "dotted", no = "longdash")
parcoord(Y, col = car.colors, lty = car.lty, var.label=T)
mtext("Profile Plot of Toyota and Ford Cars", line = 2)

This example performs several tasks related to data manipulation and visualization using the “Auto” dataset from the “ISLR” library and the parcoord function from the “MASS” library. Let’s break down the code and interpret each part:

  1. Loading Libraries:
    • library(ISLR): This line loads the “ISLR” library, which contains datasets and functions for the book “An Introduction to Statistical Learning with Applications in R.”
    • library(MASS): This line loads the “MASS” library, which contains functions and datasets from the book “Modern Applied Statistics with S.”
  2. Data Subsetting and Variable Creation:
    • Comp1 = Auto[c(grep("toyota", Auto$name), grep("ford", Auto$name)), ]: This code creates a new data frame named “Comp1” by subsetting the “Auto” dataset. It selects rows where the “name” column contains either “toyota” or “ford” using regular expressions.
    • Comp1$Make = c(rep("Toyota", 25), rep("Ford", 48)): It creates a new variable “Make” in the “Comp1” dataset and assigns values “Toyota” for the first 25 rows and “Ford” for the remaining 48 rows.
  3. Data Transformation:
    • Y = with(Comp1, cbind(cylinders, weight, horsepower, displacement, acceleration, mpg)): This line creates a new dataset “Y” by selecting specific columns (cylinders, weight, horsepower, displacement, acceleration, mpg) from the “Comp1” dataset and combining them into a matrix.
  4. Color and Line Type Assignment:
    • car.colors = ifelse(test = Comp1$Make=="Ford", yes = "blue", no = "magenta"): This code creates a vector “car.colors” that assigns colors based on the “Make” variable. Cars made by “Ford” are assigned the color “blue,” while others are assigned “magenta.”
    • car.lty = ifelse(test = Comp1$year < 75, yes = "dotted", no = "longdash"): It creates a vector “car.lty” that assigns line types based on the “year” variable. Cars with a year before 1975 are assigned a “dotted” line, and others are assigned a “longdash” line.
  5. Visualization:
    • parcoord(Y, col = car.colors, lty = car.lty, var.label = T): This line creates a parallel coordinates plot (“parcoord”) using the matrix “Y.” The “col” argument specifies colors based on “car.colors,” and the “lty” argument specifies line types based on “car.lty.” Setting var.label = T labels each axis with the minimum and maximum values of its variable.
    • mtext("Profile Plot of Toyota and Ford Cars", line = 2): It writes the text “Profile Plot of Toyota and Ford Cars” in the top margin of the plot (the default side for mtext), two margin lines above the plot region, where it serves as the title.
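The “Make” variable above is built from hard-coded row counts (25 and 48), which only works if the subset has exactly that composition and order. A minimal sketch of a count-free alternative derives the label from the name itself; the car names below are hypothetical stand-ins for Auto$name, not values taken from the dataset.

```r
# Hypothetical car names standing in for Auto$name (an assumption).
all_names <- c("toyota corolla", "chevrolet impala", "ford pinto",
               "ford maverick", "toyota corona", "ford torino")

# Keep only names that mention either make.
keep <- grepl("toyota|ford", all_names)
car_names <- all_names[keep]

# Derive the make from each name instead of relying on row counts or row order.
make <- ifelse(grepl("toyota", car_names), "Toyota", "Ford")

# Colors follow the same rule as in the example above.
car_colors <- ifelse(make == "Ford", "blue", "magenta")
```

Because each label is computed from its own row, the result stays correct if rows are reordered or the dataset changes.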

As can be seen, the code subsets and transforms data from the “Auto” dataset, assigns colors and line types based on specific conditions, and creates a parallel coordinates plot to visualize the relationships between various automotive attributes for Toyota and Ford cars. The plot helps in comparing these two car brands based on their attributes.
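For readers who want a ggplot2 version, a parallel coordinates plot can be approximated with geom_line() on data reshaped to long format. The following is a rough sketch under stated assumptions, not the method used above: the built-in mtcars data and its transmission variable stand in for the Toyota and Ford subset, and the variables are rescaled to [0, 1] by hand.

```r
library(ggplot2)

# Built-in mtcars stands in for the Toyota/Ford subset (an assumption).
vars <- c("cyl", "wt", "hp", "disp", "qsec", "mpg")
scaled <- as.data.frame(lapply(mtcars[vars], function(v) (v - min(v)) / (max(v) - min(v))))
n <- nrow(scaled)

# Long format: one row per (car, variable) pair, built with base R only.
long <- data.frame(
  id       = rep(seq_len(n), times = length(vars)),
  group    = rep(factor(mtcars$am, labels = c("automatic", "manual")), times = length(vars)),
  variable = factor(rep(vars, each = n), levels = vars),
  value    = unlist(scaled[vars], use.names = FALSE)
)

# One polyline per car, colored by group.
p <- ggplot(long, aes(x = variable, y = value, group = id, colour = group)) +
  geom_line(alpha = 0.6) +
  labs(title = "Parallel Coordinates Sketch", x = NULL, y = "Scaled value")
```

Printing p draws the plot. The group aesthetic is what makes ggplot2 connect the points of each car into a single line rather than one line per color.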

Star Plot

A star plot, also known as a spider plot or radar chart, is a data visualization tool used to display multivariate data in a two-dimensional space. It is particularly useful for comparing and visualizing the relationships between multiple variables across different categories or groups. A star plot represents data as a set of radiating axes, with each axis corresponding to a different variable or attribute. Data points are plotted along these axes, and connecting the data points creates a star-shaped or polygonal pattern.

Key characteristics and uses of star plots include:

  1. Multivariate Comparison: Star plots are effective for comparing the values of multiple variables for different categories or groups. Each category or group is represented as a separate polygon in the plot.

  2. Visualizing Profiles: Star plots can be used to visualize profiles or patterns of variables for different entities, such as individuals, products, or organizations. Each entity’s profile is represented by a polygon.

  3. Normalization: It is common to normalize the data before creating a star plot, especially if the variables have different scales. Normalization ensures that each variable contributes equally to the plot.

  4. Highlighting Differences: Star plots make it easy to see differences in the values of variables across categories. Differences in the shapes of the polygons can indicate variations in the data.

  5. Limitations: Star plots are most suitable for small to moderate numbers of variables (typically less than ten) due to the complexity of visualizing many axes. When the number of variables is large, the plot can become cluttered and difficult to interpret.

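Here’s an example of how to create such a comparison plot in R. Note that the code calls faces(), which is not a ggplot2 function and draws Chernoff faces (one face per college) rather than stars. Base R’s stars() function draws a conventional star plot, and you can also create one using the fmsb library.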

library(ISLR)           # provides the College dataset
library(TeachingDemos)  # assumption: faces() is taken from this package

plot_path <- "C:/Users/HP/OneDrive/Documents/STA832DATAMINING/my solutions/starplot.pdf"

# Save the plot to the specified path
pdf(plot_path)
# Or use png() for PNG format: png(plot_path)

# Keep only colleges with 100 or fewer enrolled students
CollegeSmall = College[College$Enroll <= 100,]
ShortLabels = c("Alaska Pacific", "Capitol", "Centenary", "Central Wesleyan", "Chatham", "Christendom", "Notre Dame", "St. Joseph", "Lesley", "McPherson", "Mount Saint Clare", "Saint Francis IN", "Saint Joseph", "Saint Mary-of-the-Woods", "Southwestern", "Spalding", "St. Martin's", "Tennessee Wesleyan", "Trinity DC", "Trinity VT", "Ursuline", "Webber", "Wilson", "Wisconsin Lutheran")
# Draw one glyph per college in a 6 x 4 grid
faces(CollegeSmall[,-c(1:4)], scale=T, nrow=6, ncol=4, labels=ShortLabels)
mtext("Comparison of Selected Colleges and Universities", line=2)
# Close the graphics device
dev.off()
## png 
##   2

For the sake of clarity, see the appendix for the plot. The star plot compares selected colleges and universities. Here’s a step-by-step interpretation of what the code does:

  1. plot_path <- "C:/Users/HP/OneDrive/Documents/STA832DATAMINING/my solutions/starplot.pdf": This line defines a file path where the resulting plot will be saved. The plot will be saved in PDF format with the specified file path.

  2. pdf(plot_path): This function opens a PDF graphics device, setting it up so that any subsequent plots will be written to the specified PDF file.

  3. CollegeSmall = College[College$Enroll <= 100,]: This line creates a subset of the College data frame. It selects rows from the College data frame where the Enroll column has a value less than or equal to 100, and assigns this subset to the CollegeSmall variable. This keeps only the colleges with small enrollments, so that the number of glyphs in the plot stays manageable.

  4. ShortLabels: This is a character vector containing short labels for the colleges and universities that will be plotted on the star plot.

  5. faces(CollegeSmall[,-c(1:4)], scale=T, nrow=6, ncol=4, labels=ShortLabels): This is the call that generates the plot. Because faces() is used, each college is drawn as a Chernoff face whose features encode the variables. Here’s what each argument does:

    • CollegeSmall[,-c(1:4)]: This selects all columns of the CollegeSmall data frame except the first four (Private, Apps, Accept, and Enroll), leaving only numeric variables.
    • scale=T: It rescales the variables to a common range so that variables measured in different units are comparable.
    • nrow=6 and ncol=4: These parameters specify the number of rows and columns for arranging the individual plots for each college in the grid.
    • labels=ShortLabels: It provides the short labels for each college to be displayed on the plot.
  6. mtext("Comparison of Selected Colleges and Universities", line=2): This line writes the given text in the top margin of the plot, two margin lines above the plot region, where it serves as the title.

  7. dev.off(): This function closes the PDF graphics device that was opened with pdf(plot_path). It finalizes the PDF file and saves the star plot to the specified file path.
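For comparison, a conventional star plot can be sketched with base R’s stars() function. The following is an illustration under stated assumptions rather than a reproduction of the plot above: the built-in mtcars data stands in for the College subset, and the output goes to a temporary directory instead of a fixed path on one machine.

```r
# Built-in mtcars stands in for the College subset (an assumption).
dat <- mtcars[1:9, c("mpg", "disp", "hp", "drat", "wt", "qsec")]

# A path inside the session's temporary directory, so the sketch runs anywhere.
plot_path <- file.path(tempdir(), "starplot.pdf")

pdf(plot_path)
# One star per row, arranged in a 3 x 3 grid; scale = TRUE rescales each
# variable so that all rays are comparable.
stars(dat, scale = TRUE, nrow = 3, ncol = 3,
      labels = rownames(dat), main = "Star Plot of Selected Cars")
invisible(dev.off())
```

Each ray of a star corresponds to one variable, so rows with similar profiles produce similarly shaped polygons.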

5. Conclusion

ggplot2 is a powerful and flexible tool for data visualization in R. Its grammar of graphics approach allows for the creation of a wide range of visualizations while maintaining a high level of customization and control. Whether you are a data scientist, analyst, or researcher, ggplot2 can help you explore and communicate insights effectively through visually engaging plots and charts. Learning ggplot2 is a valuable skill for anyone working with data and seeking to unlock its hidden stories.

Several plots in this document are produced using ggplot2. A typical example is the box plot of wage values, which also displays individual data points and adds custom titles and formatting to enhance the plot’s appearance. The box plot provides a visual summary of the distribution of wages in the “Wage” dataset, making it easy to identify central tendencies, variations, and potential outliers.

A scatter plot visualization allows you to examine how age and wage are distributed across different groups in a dataset, making it easier to identify patterns or differences in the data. Similarly, a contour plot created with ggplot2 visualizes the density of data values across different variable values in the dataset. It provides insights into how variables are distributed with respect to the factors under consideration and allows you to identify areas of higher or lower density. The plot can include axis labels and a main title, along with customizations for the appearance of the text elements.

When displaying relationships between more than two variables, the scatterplot matrix, parallel coordinates plot, and star plot are examples of tools one can use. In this document they are drawn with functions from other packages, such as “car” and “MASS,” rather than with ggplot2 itself. A scatterplot matrix is a grid of scatterplots that shows pairwise relationships between multiple variables. It provides a visual summary of the relationships between these attributes, making it easier to understand the data and explore potential patterns and dependencies. A parallel coordinates plot draws one line per data point, tracing that point’s values across several variables. Finally, the star plot example filters a dataset of colleges and universities to include only those with enrollments of 100 or fewer students, then draws one glyph per institution so that their characteristics can be compared. The resulting plot is saved as a PDF file at the specified file path. For a detailed study of the use of ggplot2 in data visualization, see the following text.

Appendix

Appendix A: College Comparison Plot