2025-09-05
Why do we need a visual representation of the data?
Data often comes to us as a large matrix containing many observations over many variables. It’s hard to make sense of the relationships between variables when you’re only looking at a bunch of numbers.
Sometimes a high-level view of the data can reveal structure that wouldn’t be obvious any other way.
The type of data you’re working with depends on the project and can range from gene counts from an omics sequencing assay to clinical measurements extracted from a patient’s electronic health records.
It’s the collection of all the information you have available and it takes many forms.
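R’s built-in iris dataset is a classic example of rectangular data: one row per observation, one column per variable. A call like the one below is presumably what produced the table that follows.
head(iris)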
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Ideally, your data already arrives in a rectangular form like this. If not, you usually have to do some preprocessing (data wrangling) to get it into that form.
What are some of the forms your data might arrive in?
Delimited text files are among the most common: plain-text tables whose columns are separated by a character such as '\t' (tab, a .tsv file) or ',' (comma, a .csv file). Every programming/statistical environment has ways of working with this type of data.
Data can also arrive as flat key-value records, one field per line:
patient_id: P001
name: Alice Smith
age: 45
diagnosis: Hypertension
medication_1: Lisinopril
medication_2: Hydrochlorothiazide
blood_pressure: 140/90
heart_rate: 78
oxygen_saturation: 97
The same kind of information might instead come as nested JSON:
[
{
"patient_id": "P001",
"name": "Alice Smith",
"age": 45,
"diagnosis": "Hypertension",
"medications": ["Lisinopril", "Hydrochlorothiazide"],
"vitals": {
"blood_pressure": "140/90",
"heart_rate": 78,
"oxygen_saturation": 97
}
},
{
"patient_id": "P002",
"name": "Bob Johnson",
"age": 60,
"diagnosis": "Type 2 Diabetes",
"medications": ["Metformin"],
"vitals": {
"blood_pressure": "130/85",
"heart_rate": 72,
"oxygen_saturation": 96
}
}
]
In R, a data.frame is the typical way of working with rectangular data.
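As a rough sketch of getting each format above into a data.frame (the file names here are hypothetical; adjust them to wherever your data actually lives). Base R handles delimited text directly, and the jsonlite add-on package is one common way to read JSON.
clinical_csv <- read.csv("patients.csv")     # comma-separated values
clinical_tsv <- read.delim("patients.tsv")   # tab-separated values

# JSON usually needs an add-on package such as jsonlite
library(jsonlite)
patients <- fromJSON("patients.json")        # simplifies nested records into a data.frame where possible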
It’s hard to make sense of your data if you’re just looking at it as a bunch of numbers.
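The iris data.frame has 150 observations of 5 variables; the output below matches what dim() reports:
dim(iris)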
## [1] 150 5
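Printing a single column as a raw vector (the likely source of the output below) makes the point:
iris$Sepal.Length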
## [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
## [19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
## [37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
## [55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
## [73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
## [91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
## [109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2
## [127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
## [145] 6.7 6.7 6.3 6.5 6.2 5.9
Here are some of the most common and useful things to look for when trying to understand your data.
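For a single numeric variable, the built-in summary() gives the five-number summary plus the mean; the output below matches this call:
summary(iris$Sepal.Length)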
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.300 5.100 5.800 5.843 6.400 7.900
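Applied to a whole data.frame, summary() summarizes every column at once and counts the levels of factors (again matching the output below):
summary(iris)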
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
# You can extend the capabilities of the built-in summary()
mysummary <- function(x) {
  c(summary(x), n = length(x), numNA = sum(is.na(x)))
}
mysummary(iris$Sepal.Length)
## Min. 1st Qu. Median Mean 3rd Qu. Max. n
## 4.300000 5.100000 5.800000 5.843333 6.400000 7.900000 150.000000
## numNA
## 0.000000
Summaries are a good place to start, but they don’t tell the whole story.
The built-in “anscombe” dataset (Anscombe’s quartet) shows this off nicely. It’s a set of four small x-y datasets that have nearly identical means, variances, and correlations, but very different relationships between x and y.
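The data.frame has 11 rows and 8 columns (x1-x4 and y1-y4); the output below is what dim() reports:
dim(anscombe)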
## [1] 11 8
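And summary() shows how similar the columns look (presumably the source of the table below):
summary(anscombe)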
## x1 x2 x3 x4 y1
## Min. : 4.0 Min. : 4.0 Min. : 4.0 Min. : 8 Min. : 4.260
## 1st Qu.: 6.5 1st Qu.: 6.5 1st Qu.: 6.5 1st Qu.: 8 1st Qu.: 6.315
## Median : 9.0 Median : 9.0 Median : 9.0 Median : 8 Median : 7.580
## Mean : 9.0 Mean : 9.0 Mean : 9.0 Mean : 9 Mean : 7.501
## 3rd Qu.:11.5 3rd Qu.:11.5 3rd Qu.:11.5 3rd Qu.: 8 3rd Qu.: 8.570
## Max. :14.0 Max. :14.0 Max. :14.0 Max. :19 Max. :10.840
## y2 y3 y4
## Min. :3.100 Min. : 5.39 Min. : 5.250
## 1st Qu.:6.695 1st Qu.: 6.25 1st Qu.: 6.170
## Median :8.140 Median : 7.11 Median : 7.040
## Mean :7.501 Mean : 7.50 Mean : 7.501
## 3rd Qu.:8.950 3rd Qu.: 7.98 3rd Qu.: 8.190
## Max. :9.260 Max. :12.74 Max. :12.500
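The four x-y correlations are likewise nearly identical, around 0.816; the outputs below line up with calls like these:
cor(anscombe$x1, anscombe$y1)
cor(anscombe$x2, anscombe$y2)
cor(anscombe$x3, anscombe$y3)
cor(anscombe$x4, anscombe$y4)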
## [1] 0.8164205
## [1] 0.8162365
## [1] 0.8162867
## [1] 0.8165214
# Plot each of the 4 datasets
par(mfrow = c(2, 2))  # 2x2 layout
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  plot(x, y, main = paste("Dataset", i),
       xlim = c(4, 20), ylim = c(2, 14),
       pch = 19, col = "blue")
  abline(lm(y ~ x), col = "red")  # linear fit
}
There’s a lot more going on here than can be explained with simple summaries.
As another example, we can create two datasets drawn from two different distributions (one unimodal, one bimodal) that have the same mean and variance. A plot of the two distributions shows off their differences.
n <- 1000
bimodal <- c(rnorm(n/2, mean = 4, sd = 1),
             rnorm(n/2, mean = 8, sd = 1))
# Unimodal: Single normal with same mean and SD
target_mean <- mean(bimodal)
target_sd <- sd(bimodal)
unimodal <- rnorm(n, mean = target_mean, sd = target_sd)
# Compare summaries
mean(bimodal)
## [1] 5.974974
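# The remaining outputs below presumably come from the matching calls
mean(unimodal)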
## [1] 6.010529
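var(bimodal)   # variance of the bimodal sample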
## [1] 5.029983
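var(unimodal)  # variance of the matched unimodal sample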
## [1] 5.036837
# Visual comparison
par(mfrow = c(1,2))
hist(bimodal, breaks = 40, col = "skyblue", main = "Bimodal Distribution")
hist(unimodal, breaks = 40, col = "salmon", main = "Unimodal Distribution")
If you only looked at the means (5.9749743, 6.0105295) and variances (5.0299829, 5.0368372), you wouldn’t pick up the differences that a simple plot easily shows.
Let’s go over common ways to visualize data, starting with a single variable.
If the variable is numeric and continuous, it’s a good idea to start with a histogram.
This breaks up the observed values into ‘bins’ and counts the number of occurrences falling in each one. By default, R tries to pick a sensible number of bins, but this can be changed by the user.
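A minimal sketch using the Sepal.Length measurements from above; the 'breaks' argument suggests how many bins to use:
hist(iris$Sepal.Length)               # R picks the number of bins
hist(iris$Sepal.Length, breaks = 30)  # ask for roughly 30 bins instead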
Another way is to plot a ‘smoothed’ version of this histogram called a density plot. Instead of ‘bins’, you specify a bandwidth parameter to control how ‘smooth’ you want the plot to be.
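A sketch of the density version; density()’s ‘bw’ argument controls the bandwidth:
plot(density(iris$Sepal.Length))            # default bandwidth
plot(density(iris$Sepal.Length, bw = 0.1))  # smaller bandwidth, bumpier curve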
If you’re dealing with discrete numbers, it’s sometimes better to plot the actual counts rather than binning or smoothing. The built-in ‘table’ function counts the occurrences of each level of a categorical variable.
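The counts below are consistent with tabulating the carburetor counts in the built-in mtcars data:
table(mtcars$carb)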
##
## 1 2 3 4 6 8
## 7 10 3 10 1 1
Another common way is to show the data as a boxplot. This combines the typical summary stats with potential ‘outliers’ in a single simple chart.
Just like the numeric summaries don’t show the full picture, visual representations of these summaries have the same problem. You can address this by showing the actual data points along with the boxplot summaries with a stripchart. The ‘add’ parameter of the stripchart function makes the dots appear on top of the previous plot instead of creating a new one.
# Create single boxplot
boxplot(mtcars$mpg)
# Overlay the raw data points
stripchart(mtcars$mpg,
           method = "jitter",
           pch = 21,
           vertical = TRUE,
           add = TRUE)
Most of these plots can still be used even if you have more than one variable.
If you want to show histograms over multiple levels of a grouping factor, you can draw them as separate plots, specifying which observations go into each one.
If you want to make a figure with more than one plot, you can use the par function with the mfrow or mfcol parameter. par is short for ‘parameters’, mfrow for ‘multi-figure, row-fill’, and mfcol for ‘multi-figure, column-fill’.
par(mfrow=c(3,1))
hist(iris$Sepal.Length[iris$Species == "setosa"])
hist(iris$Sepal.Length[iris$Species == "versicolor"])
hist(iris$Sepal.Length[iris$Species == "virginica"])
par(mfrow=c(1,3))
hist(iris$Sepal.Length[iris$Species == "setosa"])
hist(iris$Sepal.Length[iris$Species == "versicolor"])
hist(iris$Sepal.Length[iris$Species == "virginica"])
With some extra coding you can even show them all together on one plot, overlapping with each other.
# Set up colors
colors <- c("setosa" = rgb(1, 0, 0, 0.4),      # semi-transparent red
            "versicolor" = rgb(0, 1, 0, 0.4),  # semi-transparent green
            "virginica" = rgb(0, 0, 1, 0.4))   # semi-transparent blue

# Plot histogram for each species
hist(iris$Sepal.Length[iris$Species == "setosa"],
     col = colors["setosa"],
     xlim = range(iris$Sepal.Length),
     main = "Overlapping Histograms of Sepal.Length by Species",
     xlab = "Sepal Length",
     breaks = 20,
     freq = FALSE)

# Add others
hist(iris$Sepal.Length[iris$Species == "versicolor"],
     col = colors["versicolor"],
     add = TRUE,
     breaks = 20,
     freq = FALSE)

hist(iris$Sepal.Length[iris$Species == "virginica"],
     col = colors["virginica"],
     add = TRUE,
     breaks = 20,
     freq = FALSE)

legend("topright", legend = names(colors), fill = colors, border = NA)
Boxplots have a slightly easier syntax, using R’s “formula” notation: Y~X
boxplot(Sepal.Length ~ Species,
        data = iris,
        col = c("tomato", "skyblue", "palegreen"),
        main = "Sepal Length by Species",
        xlab = "Species", ylab = "Sepal Length",
        outline = FALSE)  # hide outlier symbols (since we’ll show all points)

# Add jittered points
stripchart(Sepal.Length ~ Species,
           data = iris,
           method = "jitter",
           pch = 21,
           bg = "gray80",
           col = "black",
           vertical = TRUE,
           add = TRUE)
For two numerical variables, the most common approach is to do a ‘scatterplot’. This maps the two values to the XY coordinates of a 2D plot.
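In its simplest form it’s a single call (a sketch using the iris measurements again):
plot(iris$Sepal.Length, iris$Sepal.Width,
     xlab = "Sepal Length", ylab = "Sepal Width",
     main = "Sepal Width vs. Sepal Length")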
These scatterplots can be customized to show more than one category in the same plot.
# Set up colors by species
species_colors <- c("setosa" = "tomato",
                    "versicolor" = "skyblue",
                    "virginica" = "palegreen")

# Plot empty plot area
plot(iris$Sepal.Length, iris$Sepal.Width,
     type = "n",  # don't plot points yet
     xlab = "Sepal Length", ylab = "Sepal Width",
     main = "Sepal Dimensions by Species (Base R)")

# Add points by species
for (sp in levels(iris$Species)) {
  points(iris$Sepal.Length[iris$Species == sp],
         iris$Sepal.Width[iris$Species == sp],
         col = species_colors[sp],
         pch = 19)
}

# Add a legend
legend("topright", legend = names(species_colors),
       col = species_colors, pch = 19, title = "Species")
Visualization is a useful tool, and taking a quick visual look at your data should be one of the first things you do after loading it.
Once you’ve run your analysis you’ll want to share your results with your collaborators.
This is where R Markdown and RStudio come in handy. You write your analysis in an .Rmd file instead of an .R script. The .Rmd will contain all the code you used as well as the writeup and conclusions you draw from interpreting the results.
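A minimal sketch of what such a file might look like (the title, text, and chunk contents below are just placeholders):

---
title: "My analysis"
output: html_document
---

A short writeup describing the question and the data goes here.

```{r}
summary(iris$Sepal.Length)
hist(iris$Sepal.Length)
```

Interpretation of the output and figures above goes here.

Knitting the .Rmd produces a report with the code, results, and figures woven together, which is easy to share with collaborators.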