2025-09-05
Why do we need a visual representation of the data?
Data often comes to us as a large matrix containing many observations over many variables. It’s hard to make sense of the relationships between variables when you’re only looking at a bunch of numbers.
Sometimes a high-level view of the data can reveal structure that wouldn’t be obvious any other way.
The type of data you’re working with depends on the project and can range from gene counts from an omics sequencing assay to clinical measurements extracted from a patient’s electronic health records.
It’s the collection of all the information you have available and it takes many forms.
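R’s built-in iris dataset is a classic example of rectangular data: one row per observation, one column per variable. A call like the one below is presumably what produced the table that follows.
head(iris)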
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Ideally, your data already arrives in a rectangular form like this. If not, you usually have to do some preprocessing (data wrangling) to get it into that form.
What are some of the forms your data might arrive in?
Delimited text files are among the most common: plain-text tables whose columns are separated by a character such as '\t' (tab, a .tsv file) or ',' (comma, a .csv file). Every programming/statistical environment has ways of working with this type of data.
Data can also arrive as flat key-value records, one field per line:
patient_id: P001
name: Alice Smith
age: 45
diagnosis: Hypertension
medication_1: Lisinopril
medication_2: Hydrochlorothiazide
blood_pressure: 140/90
heart_rate: 78
oxygen_saturation: 97
The same kind of information might instead come as nested JSON:
[
{
"patient_id": "P001",
"name": "Alice Smith",
"age": 45,
"diagnosis": "Hypertension",
"medications": ["Lisinopril", "Hydrochlorothiazide"],
"vitals": {
"blood_pressure": "140/90",
"heart_rate": 78,
"oxygen_saturation": 97
}
},
{
"patient_id": "P002",
"name": "Bob Johnson",
"age": 60,
"diagnosis": "Type 2 Diabetes",
"medications": ["Metformin"],
"vitals": {
"blood_pressure": "130/85",
"heart_rate": 72,
"oxygen_saturation": 96
}
}
]
In R, a data.frame is the typical way of working with rectangular data.
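As a rough sketch of getting each format above into a data.frame (the file names here are hypothetical; adjust them to wherever your data actually lives). Base R handles delimited text directly, and the jsonlite add-on package is one common way to read JSON.
clinical_csv <- read.csv("patients.csv")     # comma-separated values
clinical_tsv <- read.delim("patients.tsv")   # tab-separated values

# JSON usually needs an add-on package such as jsonlite
library(jsonlite)
patients <- fromJSON("patients.json")        # simplifies nested records into a data.frame where possible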
It’s hard to make sense of your data if you’re just looking at it as a bunch of numbers.
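The iris data.frame has 150 observations of 5 variables; the output below matches what dim() reports:
dim(iris)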
## [1] 150 5
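Printing a single column as a raw vector (the likely source of the output below) makes the point:
iris$Sepal.Length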
## [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
## [19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
## [37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
## [55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
## [73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
## [91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
## [109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2
## [127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
## [145] 6.7 6.7 6.3 6.5 6.2 5.9
Here are some of the most common and useful things to look for when trying to understand your data.
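For a single numeric variable, the built-in summary() gives the five-number summary plus the mean; the output below matches this call:
summary(iris$Sepal.Length)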
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.300 5.100 5.800 5.843 6.400 7.900
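Applied to a whole data.frame, summary() summarizes every column at once and counts the levels of factors (again matching the output below):
summary(iris)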
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
# You can extend the capabilities of the built-in summary()
mysummary <- function(x) {
  c(summary(x), n = length(x), numNA = sum(is.na(x)))
}
mysummary(iris$Sepal.Length)
## Min. 1st Qu. Median Mean 3rd Qu. Max. n
## 4.300000 5.100000 5.800000 5.843333 6.400000 7.900000 150.000000
## numNA
## 0.000000
Summaries are a good place to start, but they don’t tell the whole story.
The built-in “anscombe” dataset (Anscombe’s quartet) shows this off nicely. It’s a set of four small x-y datasets that have nearly identical means, variances, and correlations, but very different relationships between x and y.
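The data.frame has 11 rows and 8 columns (x1-x4 and y1-y4); the output below is what dim() reports:
dim(anscombe)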
## [1] 11 8
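And summary() shows how similar the columns look (presumably the source of the table below):
summary(anscombe)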
## x1 x2 x3 x4 y1
## Min. : 4.0 Min. : 4.0 Min. : 4.0 Min. : 8 Min. : 4.260
## 1st Qu.: 6.5 1st Qu.: 6.5 1st Qu.: 6.5 1st Qu.: 8 1st Qu.: 6.315
## Median : 9.0 Median : 9.0 Median : 9.0 Median : 8 Median : 7.580
## Mean : 9.0 Mean : 9.0 Mean : 9.0 Mean : 9 Mean : 7.501
## 3rd Qu.:11.5 3rd Qu.:11.5 3rd Qu.:11.5 3rd Qu.: 8 3rd Qu.: 8.570
## Max. :14.0 Max. :14.0 Max. :14.0 Max. :19 Max. :10.840
## y2 y3 y4
## Min. :3.100 Min. : 5.39 Min. : 5.250
## 1st Qu.:6.695 1st Qu.: 6.25 1st Qu.: 6.170
## Median :8.140 Median : 7.11 Median : 7.040
## Mean :7.501 Mean : 7.50 Mean : 7.501
## 3rd Qu.:8.950 3rd Qu.: 7.98 3rd Qu.: 8.190
## Max. :9.260 Max. :12.74 Max. :12.500
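The four x-y correlations are likewise nearly identical, around 0.816; the outputs below line up with calls like these:
cor(anscombe$x1, anscombe$y1)
cor(anscombe$x2, anscombe$y2)
cor(anscombe$x3, anscombe$y3)
cor(anscombe$x4, anscombe$y4)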
## [1] 0.8164205
## [1] 0.8162365
## [1] 0.8162867
## [1] 0.8165214
# Plot each of the 4 datasets
par(mfrow = c(2, 2))  # 2x2 layout
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  plot(x, y, main = paste("Dataset", i),
       xlim = c(4, 20), ylim = c(2, 14),
       pch = 19, col = "blue")
  abline(lm(y ~ x), col = "red")  # linear fit
}
There’s a lot more going on here than can be explained with simple summaries.
As another example, we can create two datasets drawn from two different distributions (one unimodal, one bimodal) that have the same mean and variance. A plot of the two distributions shows off their differences.
n <- 1000
bimodal <- c(rnorm(n/2, mean = 4, sd = 1),
             rnorm(n/2, mean = 8, sd = 1))
# Unimodal: Single normal with same mean and SD
target_mean <- mean(bimodal)
target_sd <- sd(bimodal)
unimodal <- rnorm(n, mean = target_mean, sd = target_sd)
# Compare summaries
mean(bimodal)
## [1] 5.974974
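# The remaining outputs below presumably come from the matching calls
mean(unimodal)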
## [1] 6.010529
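var(bimodal)   # variance of the bimodal sample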
## [1] 5.029983
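var(unimodal)  # variance of the matched unimodal sample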
## [1] 5.036837
# Visual comparison
par(mfrow = c(1,2))
hist(bimodal, breaks = 40, col = "skyblue", main = "Bimodal Distribution")
hist(unimodal, breaks = 40, col = "salmon", main = "Unimodal Distribution")
If you only looked at the means (5.9749743, 6.0105295) and variances (5.0299829, 5.0368372), you wouldn’t pick up the differences that a simple plot easily shows.
Let’s go over common ways to visualize data, starting with a single variable.
If the variable is numeric and continuous, it’s a good idea to start with a histogram.
This breaks up the observed values into ‘bins’ and counts the number of occurrences falling in each one. By default, R tries to pick a sensible number of bins, but this can be changed by the user.
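A minimal sketch using the Sepal.Length measurements from above; the 'breaks' argument suggests how many bins to use:
hist(iris$Sepal.Length)               # R picks the number of bins
hist(iris$Sepal.Length, breaks = 30)  # ask for roughly 30 bins instead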
Another way is to plot a ‘smoothed’ version of this histogram called a density plot. Instead of ‘bins’, you specify a bandwidth parameter to control how ‘smooth’ you want the plot to be.
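A sketch of the density version; density()’s ‘bw’ argument controls the bandwidth:
plot(density(iris$Sepal.Length))            # default bandwidth
plot(density(iris$Sepal.Length, bw = 0.1))  # smaller bandwidth, bumpier curve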
If you’re dealing with discrete numbers, it’s sometimes better to plot the actual counts rather than binning or smoothing. The built-in ‘table’ function counts the occurrences of each level of a categorical variable.
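The counts below are consistent with tabulating the carburetor counts in the built-in mtcars data:
table(mtcars$carb)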
##
## 1 2 3 4 6 8
## 7 10 3 10 1 1
Another common way is to show the data as a boxplot. This combines the typical summary stats with potential ‘outliers’ in a single simple chart.
Just like the numeric summaries don’t show the full picture, visual representations of these summaries have the same problem. You can address this by showing the actual data points along with the boxplot summaries with a stripchart. The ‘add’ parameter of the stripchart function makes the dots appear on top of the previous plot instead of creating a new one.
# Create single boxplot
boxplot(mtcars$mpg)
# Overlay the raw data points
stripchart(mtcars$mpg,
           method = "jitter",
           pch = 21,
           vertical = TRUE,
           add = TRUE)
Most of these plots can still be used even if you have more than one variable.
If you want to show histograms over multiple levels of a grouping factor, you can draw them as separate plots, specifying which observations go into each one.
If you want to make a figure with more than one plot, you can use the par function with the mfrow or mfcol parameter. par is short for ‘parameters’, mfrow for ‘multi-figure, row-fill’, and mfcol for ‘multi-figure, column-fill’.
par(mfrow=c(3,1))
hist(iris$Sepal.Length[iris$Species == "setosa"])
hist(iris$Sepal.Length[iris$Species == "versicolor"])
hist(iris$Sepal.Length[iris$Species == "virginica"])
par(mfrow=c(1,3))
hist(iris$Sepal.Length[iris$Species == "setosa"])
hist(iris$Sepal.Length[iris$Species == "versicolor"])
hist(iris$Sepal.Length[iris$Species == "virginica"])
With some extra coding you can even show them all together on one plot, overlapping with each other.
# Set up colors
colors <- c("setosa" = rgb(1, 0, 0, 0.4),      # semi-transparent red
            "versicolor" = rgb(0, 1, 0, 0.4),  # semi-transparent green
            "virginica" = rgb(0, 0, 1, 0.4))   # semi-transparent blue

# Plot histogram for each species
hist(iris$Sepal.Length[iris$Species == "setosa"],
     col = colors["setosa"],
     xlim = range(iris$Sepal.Length),
     main = "Overlapping Histograms of Sepal.Length by Species",
     xlab = "Sepal Length",
     breaks = 20,
     freq = FALSE)

# Add others
hist(iris$Sepal.Length[iris$Species == "versicolor"],
     col = colors["versicolor"],
     add = TRUE,
     breaks = 20,
     freq = FALSE)

hist(iris$Sepal.Length[iris$Species == "virginica"],
     col = colors["virginica"],
     add = TRUE,
     breaks = 20,
     freq = FALSE)

legend("topright", legend = names(colors), fill = colors, border = NA)
Boxplots have a slightly easier syntax, using R’s “formula” notation: Y~X
boxplot(Sepal.Length ~ Species,
        data = iris,
        col = c("tomato", "skyblue", "palegreen"),
        main = "Sepal Length by Species",
        xlab = "Species", ylab = "Sepal Length",
        outline = FALSE)  # hide outlier symbols (since we’ll show all points)

# Add jittered points
stripchart(Sepal.Length ~ Species,
           data = iris,
           method = "jitter",
           pch = 21,
           bg = "gray80",
           col = "black",
           vertical = TRUE,
           add = TRUE)
For two numerical variables, the most common approach is to do a ‘scatterplot’. This maps the two values to the XY coordinates of a 2D plot.
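In its simplest form it’s a single call (a sketch using the iris measurements again):
plot(iris$Sepal.Length, iris$Sepal.Width,
     xlab = "Sepal Length", ylab = "Sepal Width",
     main = "Sepal Width vs. Sepal Length")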
These scatterplots can be customized to show more than one category in the same plot.
# Set up colors by species
species_colors <- c("setosa" = "tomato",
                    "versicolor" = "skyblue",
                    "virginica" = "palegreen")

# Plot empty plot area
plot(iris$Sepal.Length, iris$Sepal.Width,
     type = "n",  # don't plot points yet
     xlab = "Sepal Length", ylab = "Sepal Width",
     main = "Sepal Dimensions by Species (Base R)")

# Add points by species
for (sp in levels(iris$Species)) {
  points(iris$Sepal.Length[iris$Species == sp],
         iris$Sepal.Width[iris$Species == sp],
         col = species_colors[sp],
         pch = 19)
}

# Add a legend
legend("topright", legend = names(species_colors),
       col = species_colors, pch = 19, title = "Species")
Visualization is a useful tool, and taking a quick visual look at your data should be one of the first things you do after loading it.
Once you’ve run your analysis you’ll want to share your results with your collaborators.
This is where R Markdown and RStudio come in handy. You write your analysis in an .Rmd file instead of an .R script. The .Rmd will contain all the code you used as well as the writeup and conclusions you draw from interpreting the results.
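A minimal sketch of what such a file might look like (the title, text, and chunk contents below are just placeholders):

---
title: "My analysis"
output: html_document
---

A short writeup describing the question and the data goes here.

```{r}
summary(iris$Sepal.Length)
hist(iris$Sepal.Length)
```

Interpretation of the output and figures above goes here.

Knitting the .Rmd produces a report with the code, results, and figures woven together, which is easy to share with collaborators.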