Lecture 2: Visualization

2025-09-05

Visualization (intro)

Why do we need a visual representation of the data?

Data often comes to us as a large matrix containing many observations over many variables. It’s hard to make sense of the relationships between variables when you’re only looking at a bunch of numbers.

library("DT")
x = read.csv("s1.csv")
dim(x)
## [1] 190 208
datatable(x)

Sometimes a high-level view of the data can reveal structure that wouldn’t be obvious any other way.

what’s in this section?

  1. ways to read in data
  2. ways to summarize data
  3. ways to plot data
  4. ways to report findings

1. ways to read in data

what is data?

The type of data you’re working with depends on the project. It can range from gene counts from an omics sequencing assay to clinical measurements extracted from patients’ electronic health records.

It’s the collection of all the information you have available and it takes many forms.

#look at first few rows of one of the builtin datasets.
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

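If your data doesn’t already come in a rectangular form like this (one row per observation, one column per variable), you usually have to do some preprocessing (data wrangling) to get it into this form.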

get it into a data.frame

A data.frame is the typical way of working with rectangular data.
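
As a minimal sketch of getting a file into a data.frame: the snippet below writes a small CSV to a temporary file, so that it runs on its own, and reads it back with read.csv. The file contents and column names (sample_id, age, group) are made up for illustration.

```r
# Write a tiny, made-up CSV to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("sample_id,age,group",
             "s1,34,control",
             "s2,51,treated",
             "s3,NA,treated"),
           csv_path)

# read.csv returns a data.frame: one row per observation, one column per variable
patients <- read.csv(csv_path, stringsAsFactors = TRUE)

dim(patients)   # number of rows and columns
str(patients)   # type of each column
```

The literal text NA in the file is read as a missing value, which is worth checking for before summarizing anything.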

2. ways to summarize data

It’s hard to make sense of your data if you’re just looking at it as a bunch of numbers.

dim(iris)
## [1] 150   5
datatable(iris)
iris$Sepal.Length
##   [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
##  [19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
##  [37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
##  [55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
##  [73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
##  [91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
## [109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2
## [127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
## [145] 6.7 6.7 6.3 6.5 6.2 5.9

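Some of the most common and useful things to look for when trying to understand your data are its range (minimum and maximum), its center (mean and median), its spread (quartiles), and how many values are present or missing.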

How do you calculate these in R?

#summarize a single variable
summary(iris$Sepal.Length)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.300   5.100   5.800   5.843   6.400   7.900
#or do it all at once
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
#you can extend the capabilities of builtin summary
mysummary <- function(x){
    c(summary(x), n=length(x), numNA=sum(is.na(x)))
}
mysummary(iris$Sepal.Length)
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max.          n 
##   4.300000   5.100000   5.800000   5.843333   6.400000   7.900000 150.000000 
##      numNA 
##   0.000000
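
The same helper can be applied to several columns in one go. This is a sketch that assumes you only want the numeric columns; the names numeric_cols and all_summaries are made up for illustration.

```r
# Same helper as above, repeated so this block runs on its own
mysummary <- function(x) {
  c(summary(x), n = length(x), numNA = sum(is.na(x)))
}

# Keep only the numeric columns (Species is a factor)
numeric_cols <- iris[sapply(iris, is.numeric)]

# sapply applies mysummary to each column and binds the results into a matrix
all_summaries <- sapply(numeric_cols, mysummary)
round(all_summaries, 2)
```

Each column of the result is one variable and each row is one statistic, which makes it easy to compare variables side by side.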

3. ways to plot the data

Summaries are a good place to start, but they don’t tell the whole story.

The builtin “anscombe” dataset shows this off nicely. It contains 4 sets of 11 (x, y) pairs that all have very similar means, variances and correlations, but very different relationships between x and y.

dim(anscombe)
## [1] 11  8
datatable(anscombe)
summary(anscombe)
##        x1             x2             x3             x4           y1        
##  Min.   : 4.0   Min.   : 4.0   Min.   : 4.0   Min.   : 8   Min.   : 4.260  
##  1st Qu.: 6.5   1st Qu.: 6.5   1st Qu.: 6.5   1st Qu.: 8   1st Qu.: 6.315  
##  Median : 9.0   Median : 9.0   Median : 9.0   Median : 8   Median : 7.580  
##  Mean   : 9.0   Mean   : 9.0   Mean   : 9.0   Mean   : 9   Mean   : 7.501  
##  3rd Qu.:11.5   3rd Qu.:11.5   3rd Qu.:11.5   3rd Qu.: 8   3rd Qu.: 8.570  
##  Max.   :14.0   Max.   :14.0   Max.   :14.0   Max.   :19   Max.   :10.840  
##        y2              y3              y4        
##  Min.   :3.100   Min.   : 5.39   Min.   : 5.250  
##  1st Qu.:6.695   1st Qu.: 6.25   1st Qu.: 6.170  
##  Median :8.140   Median : 7.11   Median : 7.040  
##  Mean   :7.501   Mean   : 7.50   Mean   : 7.501  
##  3rd Qu.:8.950   3rd Qu.: 7.98   3rd Qu.: 8.190  
##  Max.   :9.260   Max.   :12.74   Max.   :12.500
cor(anscombe$x1, anscombe$y1)
## [1] 0.8164205
cor(anscombe$x2, anscombe$y2)
## [1] 0.8162365
cor(anscombe$x3, anscombe$y3)
## [1] 0.8162867
cor(anscombe$x4, anscombe$y4)
## [1] 0.8165214
# Plot each of the 4 datasets
par(mfrow = c(2,2))  # 2x2 layout

for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  plot(x, y, main = paste("Dataset", i),
       xlim = c(4, 20), ylim = c(2, 14),
       pch = 19, col = "blue")
  abline(lm(y ~ x), col = "red")  # linear fit
}
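
The summaries above cover the means and correlations. As a quick check of the other two claims, this sketch computes the variance of every column and the least-squares line for each pair; the name fits is made up for illustration.

```r
# Variance of every column: the x's match, and the y's nearly match
round(sapply(anscombe, var), 3)

# Intercept and slope of the least-squares line for each of the 4 pairs
fits <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  coef(lm(y ~ x))
})
colnames(fits) <- paste("Dataset", 1:4)
round(fits, 2)
```

All four fitted lines come out at roughly y = 3 + 0.5x, which is why the red lines in the four panels look the same even though the points do not.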

There’s a lot more going on here than can be explained with simple summaries.

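As another example, we can create two datasets that come from 2 different distributions (unimodal vs bimodal) but have the same mean and variance. An equal mix of two normals centered at 4 and 8, each with sd 1, has mean 6 and variance 1 + 2^2 = 5. A plot of the distributions shows off their differences.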

n <- 1000
bimodal <- c(rnorm(n/2, mean = 4, sd = 1),
             rnorm(n/2, mean = 8, sd = 1))

# Unimodal: Single normal with same mean and SD
target_mean <- mean(bimodal)
target_sd <- sd(bimodal)
unimodal <- rnorm(n, mean = target_mean, sd = target_sd)

# Compare summaries
mean(bimodal)
## [1] 5.974974
mean(unimodal)
## [1] 6.010529
var(bimodal)
## [1] 5.029983
var(unimodal)
## [1] 5.036837
# Visual comparison
par(mfrow = c(1,2))
hist(bimodal, breaks = 40, col = "skyblue", main = "Bimodal Distribution")
hist(unimodal, breaks = 40, col = "salmon", main = "Unimodal Distribution")

If you only looked at the means (5.9749743, 6.0105295) and variances (5.0299829, 5.0368372), you wouldn’t pick up the differences that a simple plot easily shows.
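
The same comparison can be drawn as two density curves on shared axes. This is only a sketch: it regenerates the data with an arbitrary seed, so the numbers will not match the run shown above, and the names d_bi and d_uni are made up for illustration.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

n <- 1000
bimodal <- c(rnorm(n / 2, mean = 4, sd = 1),
             rnorm(n / 2, mean = 8, sd = 1))
unimodal <- rnorm(n, mean = mean(bimodal), sd = sd(bimodal))

d_bi <- density(bimodal)
d_uni <- density(unimodal)

# Draw both curves on shared axes so the shapes can be compared directly
plot(d_bi, col = "skyblue", lwd = 2,
     xlim = range(d_bi$x, d_uni$x),
     ylim = range(0, d_bi$y, d_uni$y),
     main = "Bimodal vs unimodal density", xlab = "value")
lines(d_uni, col = "salmon", lwd = 2)
legend("topright", legend = c("bimodal", "unimodal"),
       col = c("skyblue", "salmon"), lwd = 2)
```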

Let’s go over common ways to visualize data, starting with a single variable.

If the variable is numeric and continuous, it’s a good idea to start with a histogram.

#hist() draws the plot itself, so there is no need to wrap it in plot()
hist(iris$Sepal.Length)

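This breaks up the observed values into ‘bins’ and counts the number of occurrences falling in each one. By default, R picks a reasonable number of bins, but you can change this with the ‘breaks’ parameter. A single number is treated as a suggestion, so R may use a nearby count that gives round bin edges.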

hist(iris$Sepal.Length, breaks=5)

hist(iris$Sepal.Length, breaks=20)
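
If you need exact control over the bins, one option is to pass a vector of bin edges instead of a single number. A sketch, where the 0.25 bin width and the name bin_edges are arbitrary choices for illustration:

```r
# Explicit bin edges every 0.25 units; they must cover the full range of the data
bin_edges <- seq(4, 8, by = 0.25)
h <- hist(iris$Sepal.Length, breaks = bin_edges,
          main = "Sepal.Length, bins of width 0.25", xlab = "Sepal Length")

# hist() also returns the binning, so it can be inspected
h$breaks   # the bin edges actually used
h$counts   # number of observations in each bin
```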

Another way is to plot a ‘smoothed’ version of this histogram, called a density plot.

plot(density(iris$Sepal.Length))

Instead of ‘bins’, you specify a bandwidth parameter (bw) to control how ‘smooth’ you want the plot to be. Larger values give a smoother curve.

plot(density(iris$Sepal.Length, bw=.1))

plot(density(iris$Sepal.Length, bw=.3))

plot(density(iris$Sepal.Length, bw=.5))
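
The effect of the bandwidth is easier to judge when the curves share one plot. A sketch that overlays the three bandwidths used above; the colors and the names bandwidths, line_cols and dens are arbitrary choices for illustration.

```r
bandwidths <- c(0.1, 0.3, 0.5)
line_cols <- c("tomato", "skyblue", "palegreen4")

# Estimate the density once per bandwidth
dens <- lapply(bandwidths, function(b) density(iris$Sepal.Length, bw = b))

# Plot the first curve, with ylim set so that every curve fits
plot(dens[[1]], col = line_cols[1], lwd = 2,
     ylim = range(0, sapply(dens, function(d) max(d$y))),
     main = "Effect of bandwidth on the density estimate",
     xlab = "Sepal Length")
for (i in 2:3) {
  lines(dens[[i]], col = line_cols[i], lwd = 2)
}
legend("topright", legend = paste("bw =", bandwidths),
       col = line_cols, lwd = 2)
```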

If you’re dealing with discrete numbers, it’s sometimes better to plot the actual counts rather than binning or smoothing. The builtin ‘table’ function counts how many times each distinct value occurs.

datatable(mtcars)
table(mtcars$carb)
## 
##  1  2  3  4  6  8 
##  7 10  3 10  1  1
plot(table(mtcars$carb))
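
A bar chart is another common way to show the same counts. This sketch uses barplot and prop.table; the axis labels and the name carb_counts are made up for illustration.

```r
carb_counts <- table(mtcars$carb)

# barplot draws one bar per distinct value; bar labels come from the table names
barplot(carb_counts,
        main = "Cars by number of carburetors",
        xlab = "Number of carburetors", ylab = "Number of cars",
        col = "skyblue")

# Proportions instead of raw counts
round(prop.table(carb_counts), 2)
```

Note that barplot spaces the bars evenly, so the gaps at 5 and 7 carburetors are not visible the way they are with plot(table(...)).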

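Another common way is to show the data as a boxplot. This combines the typical summary stats as well as ‘outliers’ in a simple chart: the box spans the lower to upper quartile with a line at the median, and by default the whiskers reach the most extreme points within 1.5 times the box height, with anything beyond that drawn as individual points.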

boxplot(mtcars$mpg)

boxplot(mtcars$hp)

#you can even do it all at once
boxplot(mtcars)
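
To see the numbers behind a boxplot, and which observations are being flagged as ‘outliers’, one option is the builtin boxplot.stats function. A sketch, where the name hp_stats is made up for illustration:

```r
# boxplot.stats returns the numbers that boxplot() draws
hp_stats <- boxplot.stats(mtcars$hp)

hp_stats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
hp_stats$out    # values drawn as individual 'outlier' points

# Which cars do the outlying values belong to?
rownames(mtcars)[mtcars$hp %in% hp_stats$out]
```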

Just as numeric summaries don’t show the full picture, neither do visual representations of those summaries. You can address this by drawing the actual data points on top of the boxplot with a stripchart. The ‘add’ parameter of the stripchart function makes the dots appear on top of the previous plot instead of creating a new one.

# Create single boxplot
boxplot(mtcars$mpg)

stripchart(mtcars$mpg,
           method = "jitter",
           pch = 21,
           vertical = TRUE,
           add = TRUE)

Most of these plots can still be used even if you have more than one variable.

If you want to show histograms over multiple levels of a grouping factor, you can plot them as separate plots specifying which observations you want in each plot.

hist(iris$Sepal.Length[iris$Species == "setosa"])

hist(iris$Sepal.Length[iris$Species == "versicolor"])

hist(iris$Sepal.Length[iris$Species == "virginica"])

If you want to make a figure with more than one plot, you can use the par function with the mfrow or mfcol parameter. Both take c(rows, columns); mfrow fills the grid row by row and mfcol fills it column by column.

par(mfrow=c(3,1))
hist(iris$Sepal.Length[iris$Species == "setosa"])
hist(iris$Sepal.Length[iris$Species == "versicolor"])
hist(iris$Sepal.Length[iris$Species == "virginica"])

par(mfrow=c(1,3))
hist(iris$Sepal.Length[iris$Species == "setosa"])
hist(iris$Sepal.Length[iris$Species == "versicolor"])
hist(iris$Sepal.Length[iris$Species == "virginica"])
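
The layout set by par stays in effect for later plots, so it helps to restore it when you’re done. A sketch that also uses the same bin edges and x-axis in every panel so the panels are directly comparable; the names old_par and shared_breaks, and the 0.25 bin width, are arbitrary choices for illustration.

```r
# par() returns the previous settings, so they can be restored afterwards
old_par <- par(mfrow = c(3, 1))

# Shared bin edges and x-axis limits make the three panels comparable
shared_breaks <- seq(4, 8, by = 0.25)
for (sp in levels(iris$Species)) {
  hist(iris$Sepal.Length[iris$Species == sp],
       breaks = shared_breaks, xlim = c(4, 8),
       main = sp, xlab = "Sepal Length")
}

par(old_par)  # restore the layout that was in place before
```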

With some extra coding you can even show them all together on one plot, overlapping with each other.

# Set up colors
colors <- c("setosa" = rgb(1, 0, 0, 0.4),      # semi-transparent red
            "versicolor" = rgb(0, 1, 0, 0.4),  # semi-transparent green
            "virginica" = rgb(0, 0, 1, 0.4))   # semi-transparent blue

# Plot histogram for each species
hist(iris$Sepal.Length[iris$Species == "setosa"],
     col = colors["setosa"],
     xlim = range(iris$Sepal.Length),
     main = "Overlapping Histograms of Sepal.Length by Species",
     xlab = "Sepal Length",
     breaks = 20,
     freq = FALSE)

# Add others
hist(iris$Sepal.Length[iris$Species == "versicolor"],
     col = colors["versicolor"],
     add = TRUE,
     breaks = 20,
     freq = FALSE)

hist(iris$Sepal.Length[iris$Species == "virginica"],
     col = colors["virginica"],
     add = TRUE,
     breaks = 20,
     freq = FALSE)

legend("topright", legend = names(colors), fill = colors, border = NA)
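
Overlapping bars can get hard to read once there are more than two or three groups. One alternative, sketched here under the assumption that smoothed curves are acceptable for your purpose, is to overlay one density line per species. The colors and the names line_cols and dens_by_species are made up for illustration.

```r
line_cols <- c(setosa = "red", versicolor = "darkgreen", virginica = "blue")

# One density estimate per species
dens_by_species <- lapply(split(iris$Sepal.Length, iris$Species), density)

# Axis ranges that hold all three curves
x_range <- range(sapply(dens_by_species, function(d) range(d$x)))
y_range <- c(0, max(sapply(dens_by_species, function(d) max(d$y))))

# Empty plot first, then one line per species
plot(x_range, y_range, type = "n",
     main = "Density of Sepal.Length by Species",
     xlab = "Sepal Length", ylab = "Density")
for (sp in names(dens_by_species)) {
  lines(dens_by_species[[sp]], col = line_cols[sp], lwd = 2)
}
legend("topright", legend = names(line_cols), col = line_cols, lwd = 2)
```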

Boxplots have a slightly easier syntax, using R’s “formula” notation, Y ~ X, which here reads as “Y grouped by X”.

boxplot(Sepal.Length ~ Species,
        data = iris,
        col = c("tomato", "skyblue", "palegreen"),
        main = "Sepal Length by Species",
        xlab = "Species", ylab = "Sepal Length",
        outline = FALSE)  # hides outlier symbols (since we’ll show all points)

# Add jittered points
stripchart(Sepal.Length ~ Species,
           data = iris,
           method = "jitter",
           pch = 21,
           bg = "gray80",
           col = "black",
           vertical = TRUE,
           add = TRUE)
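
The same formula notation works for numeric summaries by group, which is a handy companion to the grouped boxplot. A sketch using the builtin aggregate and tapply functions; the names group_medians and group_stats are made up for illustration.

```r
# Same 'response ~ grouping' formula as in the boxplot call
group_medians <- aggregate(Sepal.Length ~ Species, data = iris, FUN = median)
group_medians

# Several statistics at once: one tapply call per function
group_stats <- sapply(list(mean = mean, median = median, sd = sd), function(f) {
  tapply(iris$Sepal.Length, iris$Species, f)
})
round(group_stats, 2)
```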

For two numerical variables, the most common approach is a ‘scatterplot’. This maps the two values to the X and Y coordinates of a 2D plot.

plot(iris$Sepal.Length, iris$Sepal.Width)

These scatterplots can be customized to show more than one category in the same plot

# Set up colors by species
species_colors <- c("setosa" = "tomato", 
                    "versicolor" = "skyblue", 
                    "virginica" = "palegreen")

# Plot empty plot area
plot(iris$Sepal.Length, iris$Sepal.Width,
     type = "n",  # don't plot points yet
     xlab = "Sepal Length", ylab = "Sepal Width",
     main = "Sepal Dimensions by Species (Base R)")

# Add points by species
for (sp in levels(iris$Species)) {
  points(iris$Sepal.Length[iris$Species == sp],
         iris$Sepal.Width[iris$Species == sp],
         col = species_colors[sp],
         pch = 19)
}

# Add a legend
legend("topright", legend = names(species_colors), 
       col = species_colors, pch = 19, title = "Species")
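
The loop is not strictly needed. Because plot accepts one color per point, a shorter sketch is to look up each row’s color from the named vector first; the name point_cols is made up for illustration.

```r
species_colors <- c("setosa" = "tomato",
                    "versicolor" = "skyblue",
                    "virginica" = "palegreen")

# Look up one color per row by indexing the named vector with the species names
point_cols <- species_colors[as.character(iris$Species)]

plot(iris$Sepal.Length, iris$Sepal.Width,
     col = point_cols, pch = 19,
     xlab = "Sepal Length", ylab = "Sepal Width",
     main = "Sepal Dimensions by Species (vectorized colors)")
legend("topright", legend = names(species_colors),
       col = species_colors, pch = 19, title = "Species")
```

Converting the factor with as.character matters here: it makes the lookup go by name rather than by the factor’s underlying integer codes.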

Visualization is a useful tool, and it should be one of the first things you do after loading in your data.

4. ways to report findings

Once you’ve run your analysis you’ll want to share your results with your collaborators.

This is where R Markdown and RStudio come in handy. You write your analysis in an .Rmd file instead of an .R script. The .Rmd file will contain all the code you used as well as the writeup and the conclusions you draw from interpreting the results.
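
To give a rough idea of what such a file looks like, this sketch writes a minimal .Rmd to a temporary folder and renders it to HTML if the rmarkdown package and pandoc happen to be installed. The title, the sentence of prose and the file name report.Rmd are placeholders, not part of the lecture material.

```r
fence <- strrep("`", 3)  # three backticks, which delimit a code chunk

# A minimal .Rmd: a YAML header, one line of writeup, and one code chunk
rmd_lines <- c(
  "---",
  "title: \"Iris summary\"",
  "output: html_document",
  "---",
  "",
  "Sepal length differs between species, as the boxplot below shows.",
  "",
  paste0(fence, "{r}"),
  "boxplot(Sepal.Length ~ Species, data = iris)",
  fence
)

rmd_path <- file.path(tempdir(), "report.Rmd")
writeLines(rmd_lines, rmd_path)

# Render only if the rmarkdown package and pandoc are both available
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  rmarkdown::render(rmd_path, quiet = TRUE)
}
```

In RStudio you would normally open the .Rmd file and press the Knit button instead of calling render yourself.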