Data Vis Final Takehome

Plots

Plot 1

# Option B
library(ggplot2)  # theme_minimal(), labs(), scale_x_discrete()
library(GGally)   # ggparcoord()

# data
data("iris")
head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

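# Parallel coordinates plot: one line per flower, colored by species
ggparcoord(iris,
    columns = 1:4,            # the four numeric measurements
    groupColumn = 5,          # Species
    order = "allClass",       # order axes by overall ANOVA F statistic across species
    scale = "globalminmax",   # no rescaling; all axes share the raw range
    showPoints = TRUE,
    alphaLines = 0.3
    ) +
  theme_minimal() +
  labs(title = "Parallel Coordinates Plot of Iris Flower Measurements",
       x = "Flower Measurement",
       y = "Value (cm)") +
  # Named labels are matched to the columns, so they stay correct in any axis order
  scale_x_discrete(labels = c(Petal.Length = "Petal Length", Petal.Width = "Petal Width",
                              Sepal.Length = "Sepal Length", Sepal.Width = "Sepal Width"))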

This parallel coordinates plot compares the four flower measurements in the iris dataset: sepal length, sepal width, petal length, and petal width. Each line represents one flower, and the line color shows the flower's species. This plot type is appropriate because it allows several numeric variables to be compared at once, which makes it useful for identifying overall patterns across species.
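
Because `order = "allClass"` rearranges the axes, it helps to know which order the plot actually uses. The sketch below is a rough check in base R, assuming (as the GGally documentation describes) that "allClass" sorts the variables by their one-way ANOVA F statistic against the grouping column, largest first. The names `measure_cols`, `f_stats`, and `axis_order` are illustrative and not part of the original code.

```r
data("iris")

# The four numeric measurement columns (same as columns = 1:4 in the plot)
measure_cols <- names(iris)[1:4]

# One-way ANOVA F statistic of each measurement against Species
f_stats <- sapply(measure_cols, function(col) {
  fit <- aov(iris[[col]] ~ iris$Species)
  summary(fit)[[1]][["F value"]][1]
})

# Assumed axis order: largest F statistic first
axis_order <- names(sort(f_stats, decreasing = TRUE))
print(round(sort(f_stats, decreasing = TRUE), 1))
print(axis_order)
```

Under that assumption, the petal measurements come first because they separate the species most strongly, followed by the sepal measurements.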

Plot 2

# Libraries
library(ggplot2)
library(dplyr)
library(maps)
library(ggiraph)

# data
arrests <- USArrests
arrests$state <- tolower(rownames(USArrests))

# state map data
states <- map_data("state")

# Join data
map_dat <- left_join(states, arrests, by = c("region" = "state"))

# Create tooltip
map_dat$tooltip <- paste(
  tools::toTitleCase(map_dat$region),
  "Murder arrests:", map_dat$Murder
)

# Make map
p <- ggplot(map_dat, aes(x = long, y = lat,
    group = group,
    fill = Murder,
    tooltip = tooltip,
    data_id = region)) +
  
  geom_polygon_interactive(color = "white") +
  coord_fixed(1.3) +
  scale_fill_gradient(
    low = "lightyellow",
    high = "red",
    na.value = "grey85",  # regions with no USArrests row (District of Columbia)
    name = "Murder arrests\n(per 100,000)") +
  
  labs(
    title = "Murder Arrest Rates Across U.S. States",
    caption = "Dataset: USArrests"
  ) +
  theme_void()

# Make interactive
girafe(ggobj = p)

This choropleth map shows murder arrest rates (arrests per 100,000 residents) across the contiguous U.S. states, using the built-in USArrests dataset. A choropleth is appropriate because each value belongs to a specific geographic region, so mapping the values makes regional patterns easier to see. Darker red states have higher murder arrest rates, and lighter yellow states have lower rates. Alaska and Hawaii are in the dataset but are not drawn, because the base state map covers only the contiguous states.
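
The map polygons and the arrest data are joined on lowercase state names, so a name that appears in only one table is either dropped or left as NA. The sketch below is one way to check for such mismatches, assuming the same `maps` "state" map used above. The names `arrest_states`, `map_regions`, `missing_from_map`, and `missing_from_data` are illustrative.

```r
library(ggplot2)  # map_data()
library(maps)     # supplies the "state" map

# State names on each side of the join, both in lowercase
arrest_states <- tolower(rownames(USArrests))
map_regions <- unique(map_data("state")$region)

# In USArrests but with no polygons to draw
missing_from_map <- setdiff(arrest_states, map_regions)

# Drawn on the map but with no USArrests row (fill will be NA)
missing_from_data <- setdiff(map_regions, arrest_states)

print(missing_from_map)
print(missing_from_data)
```

Any name printed by the second call will appear on the map in the scale's missing-value color rather than on the yellow-to-red gradient.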

Plot 3

# Library
library(tidyverse)
library(ggalluvial)

# data
data("Titanic")

# data frame
titanic_dat <- as.data.frame(Titanic)

# plot option A
ggplot(titanic_dat, aes(
    axis1 = Class,
    axis2 = Sex,
    axis3 = Age,
    axis4 = Survived,
    y = Freq)
) +
  geom_alluvium(aes(fill = Survived),
    alpha = 0.75,
    width = 1/12) +
  geom_stratum(width = 1/10,
    fill = "gray90",
    color = "black") +
  geom_text(stat = "stratum",
    aes(label = after_stat(stratum)),
    size = 3) +
  scale_x_discrete(limits = c("Class", "Sex", "Age", "Survived")) +
  scale_fill_manual(
    values = c("No" = "gray", "Yes" = "lightblue1")) +
  labs(
    title = "Titanic Survival Flow by Class, Sex, and Age",
    subtitle = "Flow width represents the number of passengers in each group",
    x = "Passenger Category",
    y = "Number of Passengers",
    fill = "Survived") +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold"),
    axis.text.y = element_blank(),
    panel.grid = element_blank()
  )

This alluvial plot shows how the people aboard the Titanic (passengers and crew) flow through class, sex, age group, and survival status. The width of each band represents the number of people on that pathway, so wider bands indicate larger groups. This plot type is appropriate because the Titanic dataset contains several connected categorical variables, and an alluvial (Sankey-style) plot makes it easier to see how combinations of class, sex, and age relate to survival.
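
Band widths show counts, which can make it hard to compare survival between groups of very different sizes. As a rough companion to the figure, the base R sketch below converts the same frequencies into survival rates by class. The names `class_by_survival` and `survival_rate` are illustrative and not part of the original code.

```r
data("Titanic")
titanic_dat <- as.data.frame(Titanic)

# Total people in each Class x Survived cell, summed over Sex and Age
class_by_survival <- xtabs(Freq ~ Class + Survived, data = titanic_dat)

# Share of each class that survived (row proportions, "Yes" column)
survival_rate <- prop.table(class_by_survival, margin = 1)[, "Yes"]

print(class_by_survival)
print(round(survival_rate, 3))
```

Changing the formula to `Freq ~ Sex + Survived` or `Freq ~ Age + Survived` gives the same comparison for the other axes.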

Plot 4

# Library
library(ggplot2)
library(ggdist)
library(dplyr)

# data
tooth_clean <- ToothGrowth %>%
  mutate(
    dose = factor(dose),
    supp = factor(
      supp,
      levels = c("OJ", "VC"),
      labels = c("Orange Juice", "Vitamin C"))
  )

# plot option A
ggplot(tooth_clean, aes(x = dose, y = len, fill = supp)) +
  stat_halfeye(
    adjust = 0.5,
    width = 0.5,
    justification = -0.25,
    .width = 0,
    point_colour = NA,
    alpha = 0.6
  ) +
  geom_boxplot(aes(color = supp),
    width = 0.12,
    outlier.shape = NA,
    alpha = 0.5,
    position = position_dodge(width = 0.2)
  ) +
  geom_jitter(
    position = position_jitterdodge(jitter.width = 0.08, dodge.width = 0.2)
  ) +
  scale_fill_manual(values = c("Orange Juice" = "darkorange2", "Vitamin C" = "blue2")) +
  scale_color_manual(values = c("Orange Juice" = "darkorange2", "Vitamin C" = "blue2")) +
  labs(
    title = "Raincloud Plot of Tooth Length by Dose and Supplement",
    x = "Dose (mg/day)",
    y = "Tooth Length",
    fill = "Supplement",
    color = "Supplement"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold"))

This raincloud plot compares tooth length across dose levels and supplement types in the ToothGrowth dataset. A raincloud plot is appropriate because it shows the distribution shape, summary statistics, and individual observations in one figure. The half-eye density shape shows where tooth length values are most concentrated, the boxplot shows the median and spread, and the jittered points show the raw data.
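
The quantities that the boxplot layer draws can also be tabulated directly. The base R sketch below computes the group size, median, and interquartile range for each supplement and dose combination. The name `tooth_summary` is illustrative and not part of the original code.

```r
# Group size, median, and interquartile range of tooth length
tooth_summary <- aggregate(
  len ~ supp + dose,
  data = ToothGrowth,
  FUN = function(x) c(n = length(x), median = median(x), iqr = IQR(x))
)

# aggregate() stores the three statistics in one matrix column; flatten it
tooth_summary <- do.call(data.frame, tooth_summary)
print(tooth_summary)
```

With only a handful of observations per group, the density shapes in the figure are best read as rough outlines rather than precise distributions.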

Student Portfolio

Scatterplot

How does car weight relate to fuel efficiency?

library(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 3) +
  labs(
    title = "Car Weight and Fuel Efficiency",
    x = "Weight (1,000 lbs)",
    y = "Miles Per Gallon",
    color = "Cylinders"
  ) +
  theme_minimal()

This scatterplot shows that heavier cars generally have lower miles per gallon. Cars with 4 cylinders tend to be lighter and more fuel efficient, while cars with 8 cylinders tend to be heavier and less fuel efficient.

A scatterplot is appropriate because both weight and miles per gallon are numeric variables. This plot makes it easy to see the relationship between the two variables and compare cylinder groups using color.
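
The strength of the relationship seen in the scatterplot can be put into numbers. The sketch below uses a Pearson correlation and a simple linear fit, which assumes the trend is roughly linear. The names `r` and `fit` are illustrative.

```r
# Pearson correlation between weight and fuel efficiency
r <- cor(mtcars$wt, mtcars$mpg)

# Simple linear fit: change in mpg per additional 1,000 lbs
fit <- lm(mpg ~ wt, data = mtcars)

print(round(r, 3))
print(round(coef(fit), 2))
```

A correlation near -1 and a negative slope are both consistent with heavier cars getting fewer miles per gallon.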

Boxplot

How does miles per gallon differ across cylinders?

library(ggplot2)

ggplot(mtcars, aes(x = factor(cyl), y = mpg, fill = factor(cyl))) +
  geom_boxplot(alpha = 0.7) +
  labs(
    title = "Fuel Efficiency by Cylinder Group",
    x = "Number of Cylinders",
    y = "Miles Per Gallon",
    fill = "Cylinders"
  ) +
  theme_minimal()

This boxplot shows that 4-cylinder cars have the highest miles per gallon overall. Eight-cylinder cars have the lowest miles per gallon, while 6-cylinder cars are in the middle.

A boxplot is appropriate because it compares the distribution of a numeric variable across categories. It shows the median, spread, and range of fuel efficiency for each cylinder group.
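
The center line of each box is the group median, which can be checked directly. A short sketch in base R follows; the name `mpg_by_cyl` is illustrative.

```r
# Median miles per gallon for each cylinder group
mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = median)
print(mpg_by_cyl)
```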

Bar plot

How many cars are in each cylinder group?

library(ggplot2)

ggplot(mtcars, aes(x = factor(cyl), fill = factor(cyl))) +
  geom_bar() +
  labs(
    title = "Number of Cars by Cylinder Group",
    x = "Number of Cylinders",
    y = "Count",
    fill = "Cylinders"
  ) +
  theme_minimal()

This bar plot shows the number of cars in each cylinder group. In this dataset, 8-cylinder cars are the most common (14 cars), followed by 4-cylinder cars (11) and 6-cylinder cars (7).

A bar plot is appropriate because cylinder group is a categorical variable and the quantity of interest is a simple count, so no summary statistic such as a mean, median, or error bar is needed. The plot clearly shows the count of cars in each category.
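
The bar heights are the counts that `geom_bar()` computes from the rows of the data. The same counts can be produced in base R as a quick check; the name `cyl_counts` is illustrative.

```r
# Number of cars in each cylinder group
cyl_counts <- table(mtcars$cyl)
print(cyl_counts)
```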

Density plot

How is horsepower distributed across cylinders?

library(ggplot2)

ggplot(mtcars, aes(x = hp, fill = factor(cyl), color = factor(cyl))) +
  geom_density(alpha = 0.35) +
  labs(title = "Horsepower Distribution by Cylinder Group",
    x = "Horsepower",
    y = "Density",
    fill = "Cylinders",
    color = "Cylinders") +
  theme_minimal()

This density plot shows that 4-cylinder cars usually have lower horsepower, while 8-cylinder cars tend to have higher horsepower. Six-cylinder cars are mostly in the middle range.

A density plot is appropriate because horsepower is a numeric variable and the goal is to compare its distribution across groups. The transparent fills make it easier to see where the groups overlap.
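
The overlap between groups can also be read from the horsepower range of each group. A minimal sketch in base R follows; the name `hp_range` is illustrative.

```r
# Minimum and maximum horsepower within each cylinder group
hp_range <- sapply(split(mtcars$hp, mtcars$cyl), range)
rownames(hp_range) <- c("min", "max")
print(hp_range)
```

Two groups overlap wherever the maximum of one is above the minimum of the next.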

Faceted scatterplot

How does horsepower relate to miles per gallon within each cylinder group?

library(ggplot2)

ggplot(mtcars, aes(x = hp, y = mpg)) + 
  geom_point(size = 3, color = "black") +
  facet_wrap(~ cyl) +
  labs(title = "Horsepower and Fuel Efficiency by Cylinder Group",
    x = "Horsepower",
    y = "Miles Per Gallon") + 
  theme_minimal()

This faceted scatterplot shows the relationship between horsepower and miles per gallon separately for each cylinder group. In general, cars with more horsepower tend to have lower miles per gallon. Faceting makes it easier to compare this relationship within 4-cylinder, 6-cylinder, and 8-cylinder cars.

A faceted scatterplot is appropriate because it shows the relationship between two numeric variables while separating the data by cylinder group. This keeps the figure simple and makes group comparisons easier.
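
Each facet can be summarized with its own correlation, which shows whether the overall trend also holds inside each cylinder group. The sketch below uses Pearson correlations and so assumes a roughly linear trend within each panel. The name `cor_by_cyl` is illustrative.

```r
# Correlation between horsepower and mpg within each cylinder group
cor_by_cyl <- sapply(
  split(mtcars, mtcars$cyl),
  function(group) cor(group$hp, group$mpg)
)
print(round(cor_by_cyl, 2))
```

The groups are small (7 to 14 cars each), so these within-group values are much less stable than the correlation across the full dataset.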