Before applying formal multivariate techniques, it is important to first explore the data using graphical methods. These visualisations provide an intuitive understanding of the structure of the data and help identify patterns, relationships, and potential issues such as outliers or unusual observations.
A key aspect of this exploratory step is assessing whether there are correlations between variables. Many multivariate methods, such as PCA, Exploratory Factor Analysis (EFA), and Discriminant Analysis, rely on the presence of relationships between variables to extract meaningful structure. Graphical tools allow us to quickly evaluate whether such relationships exist and whether these methods are appropriate.
Common graphical displays include scatterplots, pairwise scatterplot matrices, and correlation heatmaps. These plots help reveal trends, clusters, and dependencies between variables, and provide a first indication of the underlying dimensionality of the data.
In the following section, we will use these graphical techniques to explore the data before moving on to more formal methods for modelling and interpreting multivariate relationships.
First, we need to read the data into R. Throughout this example we
will use the wine data. These data are the results of a
chemical analysis of wines grown in the same region in Italy but derived
from three different cultivars. The analysis determined the quantities
of 13 constituents found in each of the three types of wine.
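The attributes are: Alcohol, Malic acid, Ash, Alcalinity of ash, Magnesium, Total phenols, Flavanoids, Nonflavanoid phenols, Proanthocyanins, Color intensity, Hue, OD280/OD315 of diluted wines, and Proline.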
The wine data are stored as a comma-separated text file without a
header row, so we can read them with the read.table() function in R,
setting sep="," and assigning the column names afterwards.
wine <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data", sep=",")
colnames(wine) <- c("Cultivar","Alcohol","Malic.acid","Ash","Alcalinity.of.ash","Magnesium","Total.phenols","Flavanoids","Nonflavanoid phenols","Proanthocyanins","Color.intensity","Hue","OD280/OD315.of.diluted.wines","Proline")
dim(wine)
#> [1] 178 14
head(wine, 5)
#> Cultivar Alcohol Malic.acid Ash Alcalinity.of.ash Magnesium Total.phenols
#> 1 1 14.23 1.71 2.43 15.6 127 2.80
#> 2 1 13.20 1.78 2.14 11.2 100 2.65
#> 3 1 13.16 2.36 2.67 18.6 101 2.80
#> 4 1 14.37 1.95 2.50 16.8 113 3.85
#> 5 1 13.24 2.59 2.87 21.0 118 2.80
#> Flavanoids Nonflavanoid phenols Proanthocyanins Color.intensity Hue
#> 1 3.06 0.28 2.29 5.64 1.04
#> 2 2.76 0.26 1.28 4.38 1.05
#> 3 3.24 0.30 2.81 5.68 1.03
#> 4 3.49 0.24 2.18 7.80 0.86
#> 5 2.69 0.39 1.82 4.32 1.04
#> OD280/OD315.of.diluted.wines Proline
#> 1 3.92 1065
#> 2 3.40 1050
#> 3 3.17 1185
#> 4 3.45 1480
#> 5 2.93 735
The wine dataset contains 178 observations of 14
variables, including the 13 measured quantities of chemicals and the
variable Cultivar, which indicates the type of grape from which the wine
was produced.
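To see what sep="," does in read.table(), the following minimal sketch reads two comma-separated records from an in-memory string instead of the URL. The values are the first three fields of the first two rows printed above; the object names raw_lines and mini, and the shortened set of columns, are illustrative only.

```r
# Two records in the same comma-separated layout as wine.data,
# truncated to the first three fields (values copied from the output above).
raw_lines <- "1,14.23,1.71
1,13.20,1.78"

# No header row is detected, so the first line is treated as data
# and the columns are initially named V1, V2, V3.
mini <- read.table(text = raw_lines, sep = ",")

# Assign meaningful names, as done for the full data set.
colnames(mini) <- c("Cultivar", "Alcohol", "Malic.acid")

str(mini)
```

The same pattern applies to the full file: read.table() splits each line at the commas, and the column names are added in a second step.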
A simple and effective way to begin exploring relationships between variables is through scatterplots. These plots display the relationship between two variables at a time and allow us to visually assess patterns such as linear relationships, clusters, or outliers.
In R, scatterplots can be created using the ggplot2 package, which provides a flexible and consistent framework for data visualisation. The geom_point() function is used to create scatterplots by plotting individual observations as points.
library(ggplot2)
#> Warning: package 'ggplot2' was built under R version 4.5.3
ggplot(data=wine, aes(x=Alcohol, y=Malic.acid)) + geom_point()
ggplot(data=wine, aes(x=Ash, y=Alcalinity.of.ash)) + geom_point()
ggplot(data=wine, aes(x=Hue, y=Color.intensity)) + geom_point()
By creating multiple pairwise scatterplots for different combinations of variables, we can begin to identify potential correlations and structures in the data. This provides an important first step before applying more formal multivariate techniques.
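With many variables it can help to decide numerically which pairs to plot first. The sketch below is one possible approach, not part of the original analysis: a hypothetical helper, rank_pairs(), lists every pair of variables once and sorts the pairs by absolute Pearson correlation. It is demonstrated on simulated stand-in data so that it runs without the download.

```r
# Hypothetical helper: list all variable pairs of a numeric data frame,
# sorted from the strongest to the weakest absolute Pearson correlation.
rank_pairs <- function(df) {
  r <- cor(df)
  idx <- which(upper.tri(r), arr.ind = TRUE)  # each pair once, no diagonal
  out <- data.frame(
    var1 = rownames(r)[idx[, "row"]],
    var2 = colnames(r)[idx[, "col"]],
    r    = r[idx]
  )
  out[order(-abs(out$r)), ]
}

# Simulated stand-in data: b is built from a, c is unrelated noise.
set.seed(1)
x <- rnorm(100)
demo <- data.frame(a = x, b = x + rnorm(100, sd = 0.1), c = rnorm(100))

ranked <- rank_pairs(demo)
ranked
```

With the wine data loaded, the equivalent call would be rank_pairs(wine[, 2:14]); the pairs at the top of the list are natural candidates for a scatterplot.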
In addition to basic scatterplots, it is often useful to incorporate known group information into the visualisation. In this dataset, the variable Cultivar indicates the class membership of each observation. By colouring the points according to this variable, we can assess whether the groups show distinct patterns or separation in the data.
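This can be done in ggplot2 by mapping the color aesthetic to the Cultivar variable. Because Cultivar is stored as an integer code, it is wrapped in factor() so that ggplot2 uses a distinct color for each group rather than a continuous gradient:
ggplot(data=wine, aes(x=Hue, y=Color.intensity, col=factor(Cultivar))) + geom_point()
ggplot(data=wine, aes(x=Alcohol, y=Malic.acid, col=factor(Cultivar), size=Magnesium)) + geom_point()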
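A numerical companion to the coloured scatterplots is a table of group means. The following sketch assumes a grouping variable stored as an integer code, as Cultivar is, and uses simulated stand-in values (the means 1.1, 1.0 and 0.7 are invented for illustration and are not taken from the wine data).

```r
# Simulated stand-in with a three-level integer group code, like Cultivar.
set.seed(2)
demo <- data.frame(
  Cultivar = rep(1:3, each = 20),
  Hue      = rnorm(60, mean = rep(c(1.1, 1.0, 0.7), each = 20), sd = 0.1)
)

# Treat the integer code as a categorical variable, not as a number.
demo$Cultivar <- factor(demo$Cultivar)

# Mean of Hue within each group.
group_means <- aggregate(Hue ~ Cultivar, data = demo, FUN = mean)
group_means
```

Clearly different group means for a variable suggest that the corresponding colours should separate along that axis of the scatterplot.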
Another useful way to visualise relationships between multiple
variables is through the scatterplotMatrix() function from
the car package. This function creates an enhanced
scatterplot matrix with additional features compared to the base
pairs() function.
library(car)
#> Loading required package: carData
scatterplotMatrix(wine[,2:14])
scatterplotMatrix(wine[,2:14],groups=wine$Cultivar)
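Compared to the base pairs() function, scatterplotMatrix() adds, by
default, a density estimate of each variable on the diagonal and a
fitted regression line and a smoother in each off-diagonal panel. When
the groups argument is supplied, the points are drawn in a different
colour and symbol for each group.
For the interpretation of the plots, you can use the diagonal panels to
judge the shape of each distribution, and the off-diagonal panels to
judge the direction, strength and linearity of each pairwise
relationship and to spot outliers.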
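For comparison, the base pairs() function mentioned above can be extended by hand with custom panel functions. The sketch below, which uses simulated stand-in data and a hypothetical panel function named panel_cor, prints the Pearson correlation in the upper panels while keeping the scatterplots in the lower panels.

```r
# Hypothetical upper-panel function: print the Pearson correlation
# of the two variables in the middle of the panel.
panel_cor <- function(x, y, ...) {
  op <- par(usr = c(0, 1, 0, 1))   # unit coordinates inside the panel
  on.exit(par(op))                 # restore the previous coordinates on exit
  text(0.5, 0.5, sprintf("%.2f", cor(x, y)))
}

# Simulated stand-in data: b depends on a, c is unrelated noise.
set.seed(3)
x <- rnorm(50)
demo <- data.frame(a = x, b = x + rnorm(50), c = rnorm(50))

pairs(demo, upper.panel = panel_cor)
```

With the wine data loaded, pairs(wine[, 2:14], upper.panel = panel_cor) would give a comparable display, although with 13 variables the panels become small.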
A similar type of visualisation can be obtained using the
pairs.panels() function from the psych
package. It provides an enhanced scatterplot matrix that combines
several types of information into one display.
library(psych)
#> Warning: package 'psych' was built under R version 4.5.3
#>
#> Attaching package: 'psych'
#> The following object is masked from 'package:car':
#>
#> logit
#> The following objects are masked from 'package:ggplot2':
#>
#> %+%, alpha
pairs.panels(wine[,2:14])
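This function produces a matrix with multiple layers of information: by default, histograms of each variable on the diagonal, bivariate scatterplots with a smoothed fit below the diagonal, and the pairwise correlation coefficients above the diagonal.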
Some additional options are:
library(psych)
pairs.panels(wine[,2:14],
             method = "pearson",
             hist.col = "lightblue",
             density = TRUE,
             ellipses = TRUE)
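Where: method sets the type of correlation coefficient ("pearson", "spearman" or "kendall"), hist.col sets the fill colour of the histograms, density = TRUE overlays a density curve on each histogram, and ellipses = TRUE draws correlation ellipses on the scatterplots.
For the interpretation: the diagonal shows the distribution of each variable, the panels below it show the form of each pairwise relationship, and the coefficients above it give the strength and direction of the association.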
pairs.panels() is a powerful exploratory tool because it
combines scatterplots, correlations, and distributions into a single
figure. This makes it particularly useful for quickly assessing the
overall structure of multivariate data before applying formal methods
such as PCA or factor analysis.
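The choice of method in the pairs.panels() call matters when a relationship is monotone but not linear. The following minimal sketch uses base R's cor() on a small constructed example (the vectors x and y are illustrative, not wine variables) to show how the Pearson and Spearman coefficients can differ.

```r
x <- 1:10
y <- x^3   # strictly increasing, but clearly non-linear

pearson  <- cor(x, y, method = "pearson")   # measures linear association
spearman <- cor(x, y, method = "spearman")  # measures monotone association (ranks)

c(pearson = pearson, spearman = spearman)
```

Because y increases whenever x does, the Spearman coefficient equals 1, whereas the Pearson coefficient is below 1 because the points do not lie on a straight line.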
A correlation plot is a compact way to visualise the correlation
matrix and quickly assess relationships between many variables at once.
The corrplot package provides a flexible function,
corrplot(), with several display options, controlled by the
method argument.
library(corrplot)
#> Warning: package 'corrplot' was built under R version 4.5.3
#> corrplot 0.95 loaded
corrplot(cor(wine[,2:14]),method='color')
corrplot(cor(wine[,2:14]),method='circle')
corrplot(cor(wine[,2:14]),method='number')
corrplot(cor(wine[,2:14]),method='shade')
You can improve readability with extra arguments:
corrplot(cor(wine[,2:14]), method='color', type='upper', addCoef.col='black',
         tl.col='black', tl.srt=45, number.cex=0.8, number.digits=1)
The corrplot() function provides a flexible and visually
intuitive way to explore correlation structures. By switching between
methods such as color, circle, and number, you can emphasize either
visual patterns or exact values, depending on the goal of the
analysis.
In addition to functions such as corrplot(), correlation
matrices can also be visualized using ggplot2, which offers
greater flexibility and customization. This approach involves reshaping
the correlation matrix into a tidy format and then displaying it as a
heatmap.
First, the correlation matrix is computed and converted into a long (tidy) format:
library(tidyverse)
#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
#> ✔ dplyr 1.1.4 ✔ readr 2.1.5
#> ✔ forcats 1.0.0 ✔ stringr 1.5.1
#> ✔ lubridate 1.9.4 ✔ tibble 3.3.0
#> ✔ purrr 1.1.0 ✔ tidyr 1.3.1
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ psych::%+%() masks ggplot2::%+%()
#> ✖ psych::alpha() masks ggplot2::alpha()
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
#> ✖ dplyr::recode() masks car::recode()
#> ✖ purrr::some() masks car::some()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
wine_cormat <- cor(wine[,2:14]) %>%
  as.data.frame() %>%
  rownames_to_column() %>%
  pivot_longer(-rowname)
This transformation results in a dataset where each row represents a pair of variables and their corresponding correlation value.
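The same wide-to-long reshaping can be sketched in base R, which may help to see what the tidyverse pipeline does. The example below uses simulated stand-in data; the column names rowname, name and value are chosen here only to mirror the output of the pipeline above.

```r
# Simulated stand-in data with three numeric variables.
set.seed(4)
demo <- data.frame(a = rnorm(30), b = rnorm(30), c = rnorm(30))
cormat <- cor(demo)

# as.table() followed by as.data.frame() gives one row per matrix cell.
long <- as.data.frame(as.table(cormat), responseName = "value")
names(long)[1:2] <- c("rowname", "name")

head(long)
```

A 3 x 3 correlation matrix becomes 9 rows, one per ordered pair of variables, including the pairs of each variable with itself (correlation 1).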
The correlation heatmap can then be constructed using
ggplot2:
wine_cormat %>%
  ggplot(aes(x=rowname, y=name, fill=value)) +
  geom_tile() +
  geom_text(aes(label=round(value,2)), color="white") +
  scale_fill_gradient2(low = "red",
                       high = "darkgreen",
                       mid = "white",
                       midpoint = 0,
                       limits = c(-1,1),
                       name = "Pearson\nCorrelation")
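Interpretation:
Each square represents the correlation between two variables.
The color indicates the strength and direction of the relationship: red for negative correlations, dark green for positive correlations, and white for correlations close to zero, with more intense colors indicating stronger relationships.
The numbers inside the squares show the exact correlation values, rounded to two decimal places.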
In some cases, relationships between variables may not be fully
captured in two dimensions. A 3D scatterplot allows us to visualise the
relationship between three variables simultaneously. In R, this can be
done using the scatterplot3d package.
library(scatterplot3d)
#> Warning: package 'scatterplot3d' was built under R version 4.5.2
scatterplot3d(wine$Magnesium, wine$Flavanoids, wine$Hue,
              main="3D Scatterplot",
              xlab="Magnesium", ylab="Flavanoids", zlab="Hue",
              pch=16, color=wine$Cultivar)
In a 3D scatterplot, each axis represents one variable, each point is an observation and the spatial arrangement shows how the three variables relate. The colors of the points indicate group membership (Cultivar).
Clusters of points may indicate group structure, while separation between colors suggests that groups differ across variables. Overlapping points indicate weaker separation, and patterns (e.g. planes or trends) may suggest relationships between variables.
Static 3D plots can be harder to interpret because perspective distorts
distances and overlapping points may obscure patterns. Rotating the plot
interactively is often helpful, but this is not available in basic scatterplot3d.
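One static workaround, sketched below under the assumption that the scatterplot3d package is installed, is to draw the same points from several viewpoints by varying the angle argument. The data are simulated stand-ins and the three angles are arbitrary choices.

```r
# Simulated stand-in for three numeric variables and an integer group code.
set.seed(5)
demo <- data.frame(x = rnorm(60), y = rnorm(60), z = rnorm(60),
                   g = rep(1:3, each = 20))

# Draw one plot per viewing angle; skipped if the package is not installed.
if (requireNamespace("scatterplot3d", quietly = TRUE)) {
  for (a in c(20, 55, 120)) {
    scatterplot3d::scatterplot3d(demo$x, demo$y, demo$z,
                                 angle = a, pch = 16, color = demo$g,
                                 main = paste("angle =", a))
  }
}
```

Comparing the views side by side gives some of the benefit of rotation: a separation between groups that is hidden from one angle may be visible from another.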
The scatterplot3d() function provides a simple way to extend scatterplots
into three dimensions, allowing for a richer visual exploration of
relationships between variables. However, it is often best used
alongside 2D visualisations for clearer interpretation.