Data Visualization

Sample Works

Author

Louise Oh

Code
# load package(s)
library(tidyverse)
library(lubridate)
library(janitor)
library(patchwork)
library(scales)
library(sf)
library(showtext)
library(ggrepel)
library(ggthemes)
font_add_google("Noto Serif", "notoserif")
showtext_auto()
showtext.opts(dpi = 300)

# load northwestern admissions data
admin_dat <- read_csv("data/NU_admission_data.csv") %>% 
  clean_names()

# load steph curry data
steph_curry <- read_delim("data/stephen_curry_shotdata_2014_15.txt", 
                          delim = "|") %>%
  clean_names()

# load ga election & map data
ga_dat <- read_csv("data/ga_election_data.csv") %>% 
  clean_names()
load("data/ga_map.rda")


1. Northwestern University Admission Statistics

Below is the Northwestern University Admission graphic that you can find on Northwestern’s official website [https://www.adminplan.northwestern.edu/ir/data-book/]. They overlaid two plots on one another by using dual y-axes. However, I found that this graph is particuarly difficult to read and less visually pleasing to the readers.

As a result, I decided to recreate two separate plots that display the same information instead of trying to put it all in one single plot. I obtained the NU_admission_data.csv data from Northwestern University. I used Northwestern colors and fonts indicated on their branding tool website [https://www.northwestern.edu/brand/visual-identity/color-palettes/].


A major error of the official Northwestern graphic

Stacking the bars is a major error they make with the bars in their graphic. This makes it hard to show the total applicants per year, and the bar adds up over the applicant numbers. Each variable (e.g. applications, admitted students, matriculants) is a subset of another one, and stacking them does not help visualize these clearly. In order to make a easily readable graph, it would be better to manipulate data and use shades to represent the subsets of the variables in the bar graph.


Using Single dual y-axes plot vs. Two separate plot

I find that the two separate plot approach communicates the information better. First, the units of the y variables for application, admitted students matriculants are in count numbers while that for admission and yield rates are in percentage. Therefore combining them together, it is confusing to read the y axis. Furthermore, the context of the two kinds of variables as mentioned above is quite different; looking at admission numbers and the rates separately would be easier to read and also make the two plots more visibly pleasant.


Code
# bar chart data
bar_data <- admin_dat %>% 
  mutate(cat_a = applications - admitted_students,
         cat_b = admitted_students - matriculants,
         cat_c = matriculants) %>% 
  # or list 'cat_a, cat_b, cat_c'
  select(year, contains("cat_")) %>% 
  # or list 'c("cat_a", "cat_b", "cat_c")'
  pivot_longer(cols = -year,
               names_to = "category")
# bar chart labels
bar_labels <- admin_dat %>% 
  select(-contains("rate")) %>% 
  pivot_longer(cols = -year,
               names_to = "category") %>% 
  mutate(col_labels = prettyNum(value, big.mar = ","))
# line chart data
line_data <- admin_dat %>% 
  select(year, admission_rate, yield_rate) %>% 
  pivot_longer(cols = -year,
               names_to = "category")
# line chart labels
line_labels <- line_data %>% 
  mutate(label = str_c(value, "%", sep = ""))
Code
# bar chart
ggplot(bar_data, aes(x = year, y = value)) +
  geom_col(aes(fill = category), width = 0.85) +
  theme_classic() +
  theme(legend.justification = c(1, 1),
        legend.position = c(0.7, 1),
        legend.direction = "horizontal",
        title = element_text(family = "notoserif", color = "#4E2A84"),
        axis.text.x = element_text(family = "notoserif"),
        axis.text.y = element_text(family = "notoserif"),
        legend.text = element_text(family = "notoserif")) +
  scale_x_continuous(breaks = 1999:2020,
                     expand = c(0, 0.3),
                     name = "Entering Year",
                     guide = guide_axis(angle = 50)) +
  scale_y_continuous(expand = c(0, 0),
                     name = "Applications",
                     labels = c("0", "10,000", "20,000", "30,000", "40,000")) +
  scale_fill_manual(name = NULL,
                    values = c("#B6ACD1", "#836EAA", "#4E2A84"),
                    labels = c("Applicants", 
                               "Admitted Students", 
                               "Matriculants")) +
  geom_text(data = bar_labels, 
            aes(x = year, y = value, label = col_labels),
            size = 1.9,
            vjust = 1,
            color = "white",
            nudge_y = -150,
            family = "notoserif") +
  ggtitle("Northwestern University Undergraduate Admisions\nApplications By Year 1999-2020")

Code
# line chart
ggplot(line_data, aes(x = year, y = value)) +
  geom_line(aes(color = category)) +
  geom_point(aes(color = category, shape = category),
             size = 2) +
  theme_classic() +
  theme(legend.justification = c(1, 1),
        legend.position = c(0.6, 1),
        legend.direction = "horizontal",
        title = element_text(family = "notoserif", color = "#4E2A84"),
        axis.text.x = element_text(family = "notoserif"),
        axis.text.y = element_text(family = "notoserif"),
        legend.text = element_text(family = "notoserif")) +
  scale_x_continuous(breaks = 1999:2020,
                     expand = c(0, 0.5),
                     name = "Entering Year",
                     guide = guide_axis(angle = 50)) +
  scale_y_continuous(expand = c(0, 0),
                     name = "Rate",
                     labels = scales::label_percent(scale = 1),
                     limits = c(0, 63),
                     n.break = 6) +
  scale_color_manual(name = NULL,
                     values = c("#836EAA", "#4E2A84"),
                     labels = c("Admission Rate", "Yield Rate")) +
  scale_shape_manual(name = NULL,
                     values = c(19, 18),
                     labels = c("Admission Rate", "Yield Rate")) +
  geom_text(data = line_labels, 
            aes(x = year, y = value, label = label),
            size = 1.9,
            nudge_y = 1.8,
            family = "notoserif") +
  ggtitle("Northwestern University Undergraduate Admisions\nRates By Year 1999-2020")


2. Stephen Curry Shot Performance

The dataset used for this analysis is stephen_curry_shotdata_2014_15.txt. This data is a record of Stephen Curry’s shots in the 2014-2015 season.

Here is a short description of the variables contained in stephen_curry_shotdata_2014_15.txt:

  • GAME_ID - Unique ID for each game during the season
  • PLAYER_ID - Unique player ID
  • PLAYER_NAME - Player’s name
  • TEAM_ID - Unique team ID
  • TEAM_NAME - Team name
  • PERIOD - Quarter or period of the game
  • MINUTES_REMAINING - Minutes remaining in quarter/period
  • SECONDS_REMAINING - Seconds remaining in quarter/period
  • EVENT_TYPE - Missed Shot or Made Shot
  • SHOT_DISTANCE - Shot distance in feet
  • LOC_X - X location of shot attempt according to tracking system
  • LOC_Y - Y location of shot attempt according to tracking system


Plot 1

Code
# data prep
steph_curry <- steph_curry %>%
  mutate(
    period = factor(
      period,
      levels = c(1, 2, 3, 4, 5),
      labels = c("Q1", "Q2", "Q3", "Q4", "OT")
      )
    )

# box plot
ggplot(steph_curry, aes(x = period, y = shot_distance)) +
  geom_boxplot(varwidth = TRUE) +
  facet_wrap(~ event_type) +
  ggtitle(label = "Stephen Curry",
          subtitle = "2014-2015") +
  theme_minimal() +
  theme(strip.text = element_text(face = "bold", size = 12), 
        axis.title = element_text(face = "bold", size = 12), 
        plot.subtitle = element_text(size = 12),
        plot.title = element_text(face = "bold", size = 14), 
        panel.grid.minor.y = element_blank(), 
        panel.grid.major.x = element_blank()) +
  scale_x_discrete(name = "Quarter/Period") +
  scale_y_continuous(name = NULL,
                     breaks = seq(0, 45, 10),
                     labels = c("0 ft", "10 ft", "20 ft", 
                                "30 ft", "40 ft")) 

Summary

The box plot shows that for most quarters, the median shots Curry made were around 20ft away from the hoop. Comparing the medians, he tended to miss the shots when he was nearer to 25ft and tended to make successful shots at a little closer than 20ft away from the hoop in general. Curry’s performance was moderately consistent over the first three quarters, where the median, interquartile ranges, and range for made and missed shots were similar. In quarter 3, he attempted some shots father away from the hoop, but he missed them. In quarter 4, the median of the distance from the hoop for his made shots were lower, being around 15ft. During overtime, shots he made were very close to the hoop and he makes less three-point shots.


Plot 2

Code
# density plot
ggplot(steph_curry, 
       aes(x = shot_distance, 
           color = event_type, 
           fill = event_type)) +
  geom_density(alpha = 0.25, show.legend = FALSE) +  
  ggtitle(label = "Stephen Curry",
          subtitle = "Shot Densities (2014-2015)") +
  theme_minimal() +
  theme(plot.subtitle = element_text(size = 12),
        plot.title = element_text(face = "bold", size = 14),
        panel.grid.minor.x = element_blank(), 
        panel.grid.major.x = element_blank(),
        panel.grid.minor.y = element_blank(), 
        panel.grid.major.y = element_blank()) +
  annotate(geom = "text",
           label = c("Made Shots", "Missed Shots"),
           color = c("#5D3A9B", "#E66100"),
           x = c(3, 27),
           y = c(0.04, 0.07),
           hjust = 0,
           vjust = 0) +
  scale_x_continuous(name = NULL,
                     labels = c("0 ft", "10 ft", "20 ft", 
                                "30 ft", "40 ft"),
                     expand = c(0, 0)) +
  scale_y_continuous(name = NULL,
                     expand = c(0, 0),
                     labels = NULL) +
  scale_color_manual(values = c("#5D3A9B", "#E66100")) +
  scale_fill_manual(values = c("#5D3A9B", "#E66100")) 

Summary

The density plot shows us that Curry attempted the most shots at around 25ft distance, and also at 2ft which is nearer to the hoop. At around 25ft distance, he had a little more misses than shots he makes. On the other hand, when he was nearer to the hoop at between 0 to 3ft distance, he was able to make almost as twice many shots than he missed.


Plot 3

Code
# importing image of NBA half court
court <- grid::rasterGrob(
  jpeg::readJPEG(
    source = "data/nbahalfcourt.jpg"),
  width = unit(1, "npc"), 
  height = unit(1, "npc")
)

# plot
ggplot() +
  annotation_custom(
    grob = court,
    xmin = -250, xmax = 250,
    ymin = -52, ymax = 418
  ) +
  coord_fixed() +
  xlim(250, -250) +
  ylim(-52, 418) +

# hexbins
  geom_hex(data = steph_curry, 
           aes(x = loc_x, y = loc_y),
           bins = 20, alpha = 0.7, color = "grey") +
  labs(title = "Stephen Curry",
       subtitle = "Shot Chart (2014-2015)",
       x = NULL, y = NULL,
       fill = "Shot\nAttempts") +
  theme_minimal() +
  theme(plot.subtitle = element_text(size = 12),
        plot.title = element_text(face = "bold", size = 14),
        panel.grid.minor.x = element_blank(), 
        panel.grid.major.x = element_blank(),
        panel.grid.minor.y = element_blank(), 
        panel.grid.major.y = element_blank(),
        axis.text.x = element_blank(),
        axis.text.y = element_blank()) +
  scale_fill_gradient(limits = c(0, 15),
                      low = "yellow",
                      high = "red",
                      na.value = "red",
                      label = c("0", "5", "10", "15+"))

Summary

According to the heat map, Curry attempted the most shots right by the hoop or around the three-point line especially nearer to the center of the court. In these areas he made more than 15 shot attempts. He also attempted some free-throw shots, as its area is colored more orange. He made less attempts in other areas. An interesting shot is near the upper right corner of the graphic; this seems like the missed shot he attempted in Q1, which we saw as an outlying value from over 40ft in the previous box plot.


3. 2020 Presidential Election - Biden vs. Trump

This data visualization uses the ga_election_data.csv dataset in conjunction with mapping data ga_map.rda. The ga_election_data.csv dataset contains the state of Georgia’s county level results for the 2020 US presidential election.

Here is a short description of the variables it contains:

  • County - name of county in Georgia
  • Candidate - name of candidate on the ballot,
  • Election Day Votes - number of votes cast on election day for a candidate within a county
  • Absentee by Mail Votes - number of votes cast absentee by mail, pre-election day, for a candidate within a county
  • Advanced Voting Votes - number of votes cast in-person, pre-election day, for a candidate within a county
  • Provisional Votes - number of votes cast on election day for a candidate within a county needing voter eligibility verification
  • Total Votes - total number of votes for a candidate within a county

I have also included the map data for Georgia (ga_map.rda) which was retrieved using tigris::counties().

Holding the 2020 US Presidential election during the COVID-19 pandemic was a massive logistical undertaking. Voter engagement was extremely high which produced a historical high voting rate. Voting operations, headed by states, ran very monthly and encountered few COVID-19 related issues. The state of Georgia did a particularly good job at this by encouraging their residents to use early voting. About 75% of the vote in a typical county voted early! Ignoring county boundaries, about 4 in every 5 voters, 80%, in Georgia voted early.

While it is clear that early voting was the preferred option for Georgia voters, I want to investigate whether or not one candidate/party utilized early voting more than the other – I am focusing on the two major candidates.


Map

Code
# data
ga_graph <- ga_dat %>% 
  # make new column for proportion of votes on pre-election day / early voting
  mutate(
    prop_pre_eday = (absentee_by_mail_votes + advanced_voting_votes) 
    / total_votes) %>% 
  # select all columns except those that include "_vote"
  select(-contains("_vote")) 

# biden map data
biden_map_data <- ga_map %>% 
  # return all rows for candidate Biden and all columns from x and y
  left_join(
    ga_graph %>% 
      filter(candidate == "Joseph R. Biden"),
    by = c("name" ="county")
  )
# trump map data
trump_map_data <- ga_map %>% 
  # return all rows for candidate Trump and all columns from x and y
  left_join(
    ga_graph %>% 
      filter(candidate == "Donald J. Trump"),
    by = c("name" ="county")
  )

# biden plot
biden <- ggplot(biden_map_data) +
  geom_sf(aes(fill = prop_pre_eday), 
          show.legend = FALSE) +
  ggthemes::theme_map() +
  theme(plot.subtitle = element_text(size = 12),
        plot.title = element_text(face = "bold", size = 14)) +
  labs(title = "Joseph R. Biden",
       subtitle = "Democratic Nominee") +
  scale_fill_gradient2(limits = c(0.5, 1),
                       low = "#1AFF1A", high = "#5D3A9B",
                       mid = "white", midpoint = 0.75)
# trump plot
trump <- ggplot(trump_map_data) +
  geom_sf(aes(fill = prop_pre_eday)) +
  ggthemes::theme_map() +
  theme(plot.subtitle = element_text(size = 12),
        plot.title = element_text(face = "bold", size = 14),
        legend.position = c(0.75, 1),
        legend.justification = c(0, 1),
        legend.text = element_text(size = 10)) +
  labs(title = "Donald J. Trump",
       subtitle = "Republican Nominee") +
  scale_fill_gradient2(limits = c(0.5, 1),
                       low = "#1AFF1A", high = "#5D3A9B",
                       mid = "white", midpoint = 0.75,
                       breaks = seq(0.5, 1.0, 0.25),
                       labels = c("50%", "75%", "100%"),
                       name = NULL)

# final plot
(biden + trump) +
  plot_annotation(
    title = "Percentage of votes from early voting",
    caption = "Georgia: 2020 US Presidential Election Results",
    theme = theme(plot.title = element_text(face = "bold", size = 24),
                  plot.caption = element_text(size = 10))
    )

Summary

This graphic compares the percentage of votes from early voting in Georgia for Joseph R. Biden and Donald J. Trump. The colors represent the early voting rates in th specific regions of Georgia. At first sight, we notice that there is more green in Trump’s map than Biden’s, which shows more white and purple. This means that more early voting percents are near 50% for Trump, and most votes are between 75% and 100% for Biden. In general, this map concisely shows us that the percentage of votes from early voting for Biden is higher than that for Trump in Georgia during the 2020 US Presidnetial Election.