Final

library(ggplot2)
data("mtcars")
data("USArrests")
data("Titanic")
data("ToothGrowth")
data("iris")

Part 1

Parallel Coordinates Plot

This plot is good for showing many variables together. It can show clusters, outliers, and trends for each variable and can be used to compare them. For the iris dataset, it shows the different types of measurements and how they vary across species.

# Install the plotting helpers only if they are missing, then load them.
if (!requireNamespace("GGally", quietly = TRUE)) install.packages("GGally")
if (!requireNamespace("viridis", quietly = TRUE)) install.packages("viridis")

library(GGally)
library(viridis)

## Loading required package: viridisLite

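ggparcoord(iris,
    columns = 1:4, groupColumn = 5, order = "anyClass",
    scale = "globalminmax",
    showPoints = TRUE,
    title = "Iris Measurements by Species",
    alphaLines = 0.3
    ) +
  # Named vectors tie each color and label to its own level or axis,
  # so nothing is mislabeled if the drawing order changes
  scale_color_manual(
    values = c(setosa = "green", versicolor = "yellow", virginica = "purple3"),
    labels = c(setosa = "Setosa", versicolor = "Versicolor", virginica = "Virginica")
  ) +
  scale_x_discrete(labels = c(Petal.Length = "Petal Length", Petal.Width = "Petal Width",
                              Sepal.Length = "Sepal Length", Sepal.Width = "Sepal Width")) +
  theme_minimal() +
  xlab("Flower Measurement Category") +
  ylab("Measurement (cm)")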
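
The axis order chosen by order = "anyClass" can be checked by hand. The sketch below assumes, based on my reading of the GGally documentation, that this option ranks each variable by its largest one-class-versus-rest F-statistic. The helper max_one_vs_rest_f is a hypothetical name written for this check and is not part of GGally.

```r
# Largest one-vs-rest F-statistic for a single measurement column.
# (Hypothetical helper; assumed to mirror what order = "anyClass" does.)
max_one_vs_rest_f <- function(values, classes) {
  f_stats <- sapply(levels(classes), function(cls) {
    is_cls <- factor(classes == cls)            # this class vs. all others
    summary(aov(values ~ is_cls))[[1]][["F value"]][1]
  })
  max(f_stats)
}

# Score the four measurement columns, then sort from most to least separating
separation <- sapply(iris[1:4], max_one_vs_rest_f, classes = iris$Species)
axis_order <- names(sort(separation, decreasing = TRUE))
print(axis_order)
```

If the printed order differs from the axis labels on the plot, the labels should be matched to variables by name rather than by position.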

Part 2

Map

This type of map is good for showing geographic variation because it communicates spatial patterns through an easily read color scale applied to standardized data (here, rates per 100,000 residents). It is useful for the USArrests dataset because it shows how arrest rates change from state to state.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(maps)

## 
## Attaching package: 'maps'

## The following object is masked from 'package:viridis':
## 
##     unemp

library(tibble)

us_data <- USArrests %>%
  rownames_to_column(var = "state") %>%
  mutate(region = tolower(state)) 

us_map <- map_data("state")

map_df <- us_map %>%
  left_join(us_data, by = "region")

ggplot(map_df, aes(x = long, y = lat, group = group, fill = Murder)) +
  # linewidth replaces the size argument for lines (deprecated in ggplot2 3.4.0)
  geom_polygon(color = "white", linewidth = 0.2) +
  coord_fixed(1.3) +
  scale_fill_viridis_c(option = "rocket", direction = -1, name = "Murder Rate") +
  labs(
    title = "US Murder Rates by State",
    subtitle = "Murder arrests per 100,000 residents",
    caption = "Source: USArrests dataset"
  ) +
  theme_minimal() +
  theme(
    axis.text = element_blank(),
    axis.title = element_blank(),
    panel.grid = element_blank()
  )
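
Because the map is built with a left join on lowercase state names, it is worth checking which names fail to match. This is a minimal sketch using base R set operations; the variable names missing_from_map and missing_from_data are my own and only serve this check.

```r
library(ggplot2)  # map_data()
library(maps)     # state polygons used by map_data("state")

# Region names available on each side of the join
map_regions <- unique(map_data("state")$region)
arrest_regions <- tolower(rownames(USArrests))

# States with arrest data but no polygon, and polygons with no arrest data
missing_from_map <- setdiff(arrest_regions, map_regions)
missing_from_data <- setdiff(map_regions, arrest_regions)
print(missing_from_map)
print(missing_from_data)
```

Regions listed in missing_from_data are drawn with the NA fill color, and states listed in missing_from_map do not appear on the map at all.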

Part 3

Flow diagram

This diagram is useful for showing connections because it shows the structure of the data along with the volume that flows between steps. It is useful for the Titanic dataset because the dataset has multiple categorical variables (class, sex, age, and survival) that would otherwise be difficult to show together.

library(networkD3)
library(dplyr)

data("Titanic")

df <- as.data.frame(Titanic)

df <- df %>%
  group_by(Class, Sex, Age, Survived) %>%
  summarise(Freq = sum(Freq), .groups = "drop")

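# Convert the factors to character so node names are labels, not integer codes
nodes <- data.frame(
  name = unique(c(as.character(df$Class), as.character(df$Sex),
                  as.character(df$Age), as.character(df$Survived))),
  stringsAsFactors = FALSE
)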

get_id <- function(x) {
  match(x, nodes$name) - 1
}

links_class_sex <- df %>%
  group_by(Class, Sex) %>%
  summarise(value = sum(Freq), .groups = "drop") %>%
  mutate(source = get_id(Class),
         target = get_id(Sex))

links_sex_age <- df %>%
  group_by(Sex, Age) %>%
  summarise(value = sum(Freq), .groups = "drop") %>%
  mutate(source = get_id(Sex),
         target = get_id(Age))

links_age_surv <- df %>%
  group_by(Age, Survived) %>%
  summarise(value = sum(Freq), .groups = "drop") %>%
  mutate(source = get_id(Age),
         target = get_id(Survived))

links <- bind_rows(links_class_sex, links_sex_age, links_age_surv)

links <- links %>% select(source, target, value)

sankeyNetwork(
  Links = links,
  Nodes = nodes,
  Source = "source",
  Target = "target",
  Value = "value",
  NodeID = "name",
  fontSize = 12,
  nodeWidth = 30
)

## Links is a tbl_df. Converting to a plain data frame.
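
sankeyNetwork expects zero-based node indices, which is why get_id subtracts 1 from match(). The sketch below rebuilds only the first stage (Class to Sex) in base R to check that every index points at a real node and that no passengers are lost in the aggregation. The names node_names, stage, ids_in_range, and stage_total are my own.

```r
# Titanic as a long table with character labels instead of factors
df <- as.data.frame(Titanic, stringsAsFactors = FALSE)

# Node list and zero-based lookup, mirroring get_id() above
node_names <- unique(c(df$Class, df$Sex, df$Age, df$Survived))
get_id <- function(x) match(x, node_names) - 1L

# Link table for the first stage only (Class -> Sex)
stage <- aggregate(Freq ~ Class + Sex, data = df, FUN = sum)
stage$source <- get_id(stage$Class)
stage$target <- get_id(stage$Sex)

# Every index must fall in 0 .. (number of nodes - 1)
ids_in_range <- all(c(stage$source, stage$target) %in% (seq_along(node_names) - 1L))
# The stage total should equal the number of people in the dataset
stage_total <- sum(stage$Freq)
print(ids_in_range)
print(stage_total == sum(Titanic))
```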

Part 4

Raincloud Plot

This type of plot is useful because it combines several views of a distribution into one plot, allowing for easier interpretation. You can see the density, summary statistics, and raw points together. This is good for the ToothGrowth dataset because it allows for comparison and visualization across the factors that can influence tooth growth (supplement type and dose).

library(ggdist)

teeth_clean <- ToothGrowth %>%
  select(len, supp, dose) %>%
  na.omit() %>%
  # Treat dose as a discrete group so fill, color, and dodging work per dose level
  mutate(dose = factor(dose))

ggplot(teeth_clean, aes(x = supp, y = len, fill = dose)) +
  stat_halfeye(
    position = position_dodge(width = 0.75),
    adjust = 0.6,
    width = 0.55,
    .width = 0,
    justification = -0.2,
    point_colour = NA,
    alpha = 0.5
  ) +
  geom_boxplot(
    aes(color = dose),
    width = 0.12,
    position = position_dodge(width = 0.75),
    outlier.shape = NA,
    alpha = 0.65,
    linewidth = 0.5
  ) +
  geom_jitter(
    aes(color = dose),
    # Same dodge width as the other layers so points line up with their boxes
    position = position_jitterdodge(
      jitter.width = 0.1,
      dodge.width = 0.75
    ),
    size = 1.8,
    alpha = 0.25
  ) +
  labs(
    title = "Tooth Length by Supplement and Dose",
    subtitle = "Raincloud plot showing distribution, summary statistics, and individual observations",
    x = "Supplement Delivery Method",
    y = "Tooth Length",
    fill = "Vitamin C Dosage",
    color = "Vitamin C Dosage"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5),
    legend.position = "right"
  )
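
The grouping in this plot depends on dose being treated as three discrete levels rather than a continuous number. A quick check in base R, assuming that is the intended reading of the data (the object names are my own):

```r
# dose is stored as a number in ToothGrowth
dose_is_numeric <- is.numeric(ToothGrowth$dose)

# Converting to a factor gives one level per dose
dose_groups <- factor(ToothGrowth$dose)

# Number of observations in each supplement-by-dose cell
cell_counts <- table(supp = ToothGrowth$supp, dose = dose_groups)

# Median tooth length per cell, the value each boxplot centers on
cell_medians <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = median)

print(levels(dose_groups))
print(cell_counts)
print(cell_medians)
```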

Portfolio

library(ggplot2)
library(dplyr)
library(patchwork)

data(iris)

str(iris)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Main research question: How do the three iris species differ in their sepal and petal measurements, and which variables best separate the species?

Plot 1: Scatterplot

Is there a relationship between sepal length and petal length, and do species cluster separately?

ggplot(iris,
       aes(x = Sepal.Length,
           y = Petal.Length,
           color = Species)) +
  geom_point(size = 3, alpha = 0.8) +
  labs(title = "Sepal Length vs Petal Length",
       subtitle = "Species by flower dimensions",
       x = "Sepal Length",
       y = "Petal Length") +
  theme_minimal()

Petal length generally increases as sepal length increases. Iris setosa forms a separate cluster with smaller petal lengths, while versicolor and virginica overlap slightly. A scatterplot is appropriate because it shows the relationship between two continuous variables and helps identify clustering, trends, and overlap among species.
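
The pattern described above can be checked numerically. This is a rough sketch in base R, with object names of my own choosing: it computes the overall correlation and the petal length range of each species.

```r
# Strength of the overall linear relationship between the two lengths
overall_cor <- cor(iris$Sepal.Length, iris$Petal.Length)

# Minimum (row 1) and maximum (row 2) petal length for each species
petal_range <- sapply(split(iris$Petal.Length, iris$Species), range)

print(round(overall_cor, 2))
print(petal_range)
```

If the largest setosa value is below the smallest value of the other two species, setosa is fully separated on petal length, while overlapping ranges for versicolor and virginica correspond to the slight overlap seen in the scatterplot.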

Plot 2: Histogram

How is sepal width distributed within each species?

ggplot(iris,
       aes(x = Sepal.Width,
           fill = Species)) +
  geom_histogram(binwidth = 0.2,
                 color = "black",
                 alpha = 0.7,
                 position = "identity") +
  labs(title = "Distribution of Sepal Width",
       subtitle = "Comparing spread and overlap among species",
       x = "Sepal Width",
       y = "Count") +
  theme_minimal()

Setosa tends to have wider sepals overall, while versicolor and virginica show more overlap in their distributions. A histogram is useful for examining the distribution and frequency of a single numeric variable across groups.
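
A short numeric summary can back up what the histogram shows. The sketch below uses base R; mean_width and sd_width are names I chose for illustration.

```r
# Average and spread of sepal width within each species
mean_width <- tapply(iris$Sepal.Width, iris$Species, mean)
sd_width <- tapply(iris$Sepal.Width, iris$Species, sd)

print(round(mean_width, 2))
print(round(sd_width, 2))
```

Species whose means differ by less than about one standard deviation will overlap heavily in the histogram.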

Plot 3: Boxplot

Which species tends to have the largest petal width, and how much variation exists within species?

ggplot(iris,
       aes(x = Species,
           y = Petal.Width,
           fill = Species)) +
  geom_boxplot(alpha = 0.8) +
  labs(title = "Petal Width by Species",
       subtitle = "How petal width differs by species",
       x = "Species",
       y = "Petal Width") +
  theme_minimal()

Virginica has the largest petal widths on average, while setosa has the smallest. Variation is also greater in virginica compared with the other species. A boxplot is appropriate because it summarizes median, spread, and potential outliers for a numeric variable across categories.
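
The quantities a boxplot summarizes can also be computed directly. This is a small sketch in base R; petal_width_stats is a name of my own.

```r
# Median, interquartile range, and standard deviation of petal width per species
petal_width_stats <- do.call(rbind, lapply(
  split(iris$Petal.Width, iris$Species),
  function(x) c(median = median(x), iqr = IQR(x), sd = sd(x))
))

print(round(petal_width_stats, 2))
```

The median column corresponds to the line inside each box, and the iqr column to the height of the box.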

Plot 4: Density Plot

How do petal length distributions overlap among species?

ggplot(iris,
       aes(x = Petal.Length,
           fill = Species)) +
  geom_density(alpha = 0.5) +
  labs(title = "Density of Petal Length",
       subtitle = "How petal length differs by species",
       x = "Petal Length",
       y = "Density") +
  theme_minimal()

Setosa has a distinct petal length distribution, while versicolor and virginica overlap slightly but still differ in their central tendencies. Density plots are useful for comparing the shape and overlap of continuous distributions between groups.
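
One way to put a number on "overlap" is the area shared by two density curves. The sketch below is an approximation under my own assumptions: it uses base R's density() with its default bandwidth on a common grid, and overlap_area is a hypothetical helper written for this check.

```r
# Estimate each species' density on the same grid of petal lengths
grid_range <- range(iris$Petal.Length)
dens <- lapply(split(iris$Petal.Length, iris$Species), density,
               from = grid_range[1], to = grid_range[2], n = 512)

# Approximate shared area under two curves (0 = no overlap, near 1 = identical)
overlap_area <- function(a, b) {
  step <- diff(a$x[1:2])          # grid spacing, identical for both curves
  sum(pmin(a$y, b$y)) * step
}

setosa_vs_versicolor <- overlap_area(dens$setosa, dens$versicolor)
versicolor_vs_virginica <- overlap_area(dens$versicolor, dens$virginica)
print(setosa_vs_versicolor)
print(versicolor_vs_virginica)
```

A value close to zero for setosa against versicolor, and a clearly larger value for versicolor against virginica, would match what the density plot shows.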

Plot 5: Faceted Scatterplot

How does the relationship between sepal length and sepal width vary within each species individually?

ggplot(iris,
       aes(x = Sepal.Length,
           y = Sepal.Width,
           color = Species)) +
  geom_point(size = 2.5) +
  facet_wrap(~Species) +
  labs(title = "Sepal Dimensions by Species",
       subtitle = "Species-specific dimensions",
       x = "Sepal Length",
       y = "Sepal Width") +
  theme_minimal()

Each species shows a different spread and relationship between sepal length and width. Setosa appears more tightly clustered, while virginica shows greater variability. Faceting is appropriate because it separates each species into its own panel, making comparisons easier and reducing visual overlap.
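
Faceting matters here because the relationship within each species can differ from the relationship in the pooled data. A minimal sketch in base R, with object names of my own, compares the two:

```r
by_species <- split(iris, iris$Species)

# Correlation of sepal length and width inside each species
within_cor <- sapply(by_species, function(d) cor(d$Sepal.Length, d$Sepal.Width))

# The same correlation with all species pooled together
pooled_cor <- cor(iris$Sepal.Length, iris$Sepal.Width)

# Spread of sepal length inside each species
length_sd <- sapply(by_species, function(d) sd(d$Sepal.Length))

print(round(within_cor, 2))
print(round(pooled_cor, 2))
print(round(length_sd, 2))
```

When the within-species correlations are positive but the pooled correlation is not, a single combined scatterplot would hide the trend that the faceted panels reveal.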

The plots suggest that petal measurements are more effective than sepal measurements for distinguishing iris species. Iris setosa forms a distinct group, while versicolor and virginica show some overlap.

Final

Brooke Kopack Ware

2026-05-02
