STAT 436 - HW1

Coding #1 – Reshaping Practice

Part-(a): Regional murder rates

library(tidyverse)  # provides %>%, dplyr, tidyr, readr and ggplot2 used throughout

murders <- read.csv('murders.csv')
murders %>% head()

murder_rates <- read.csv('murder_rates.csv')
murder_rates

Reshaping the murders tibble as required:

murders_reshaped <- murders %>% 
  group_by(region) %>%
  summarize(
    murders = sum(total), 
    population = sum(population), 
    murder_rate = murders/population
    ) %>%
  arrange(murder_rate)

murders_reshaped
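
The summary above relies on summarize() evaluating its arguments in order, so that murder_rate is computed from the regional sums defined just before it rather than from the per-state columns. A minimal sketch of that behaviour, assuming a hypothetical toy table (not the real murders.csv):

```r
library(dplyr)

# Hypothetical toy data standing in for murders.csv
toy <- tibble(
  region     = c("A", "A", "B"),
  total      = c(10, 30, 5),
  population = c(1000, 3000, 1000)
)

toy_rates <- toy %>%
  group_by(region) %>%
  summarize(
    murders     = sum(total),            # regional murder count
    population  = sum(population),       # replaces population with the regional sum
    murder_rate = murders / population   # uses the two sums defined just above
  )

toy_rates
```

Region A ends up with 40 murders over a population of 4000, so its rate is the ratio of the two sums, not an average of per-row rates.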

Part-(b): Antibiotics and bacteria

antibiotic_wide <- read.csv('antibiotic_wide.csv')
antibiotic_wide %>% head()

antibiotic_tidy <- read.csv('antibiotic_tidy.csv')
antibiotic_tidy %>% head()

Reshaping the antibiotic_wide tibble as required:

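antibiotic_longer <- antibiotic_wide %>%
  # one row per (sample, species) pair; rows with missing values are dropped
  pivot_longer(cols = starts_with('Unc0'),
               names_to = 'species',
               values_to = 'value',
               values_drop_na = TRUE) %>%
  # split the sample ID after its first character into individual and time
  separate(sample, c('ind','time'), sep = 1, remove = FALSE) %>% 
  arrange(species)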

antibiotic_longer$time <- as.numeric(as.character(antibiotic_longer$time))

antibiotic_longer %>% head()
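
For illustration, here is how separate() with sep = 1 splits an ID after its first character. The sample IDs below are hypothetical and only assume the "letter followed by a number" shape that the code above implies. Passing convert = TRUE is one possible alternative to the separate as.numeric() step, since it asks tidyr to convert the new columns to a suitable type.

```r
library(tidyr)
library(tibble)

# Hypothetical sample IDs: one letter for the individual, then the time point
samples <- tibble(sample = c("D1", "D12", "E3"))

split_samples <- separate(
  samples, sample, c("ind", "time"),
  sep = 1,          # split after the first character
  remove = FALSE,   # keep the original sample column
  convert = TRUE    # convert the new columns to a suitable type
)

split_samples
```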

Coding #2 – City Temperatures

Loading the temperature tibbles

temperature <- 
  read_csv("https://raw.githubusercontent.com/krisrs1128/stat479_s22/main/data/temperatures.csv") %>% 
  group_by(city, month)

temperature$city <- factor(temperature$city, levels = c('Death Valley', 'Honolulu', 'Chicago', 'Barrow'))

Part-(a): Generate a line plot of temperature in all cities

fig23_line_plot <-
ggplot(temperature, aes(date, temperature, col=city)) +
  geom_line(linewidth=1.25) +
  scale_colour_manual(values = c('#FFC000', '#0096FF', '#097969', '#DA70D6')) +
  scale_x_date(date_labels = '%b')  +
  scale_y_continuous(breaks = seq(-20,100,20)) +
  labs(
    title = 'Daily temperature normals for four selected locations in the U.S.',
    x='month', 
    y='Temperature (°F)') +
  theme(
    panel.grid.major = element_line(color = "grey", linewidth = 0.5, linetype = 1), 
    plot.title = element_text(hjust = 0.5))

fig23_line_plot

Part-(b): Summarize mean temperatures across cities and months

temp_summary <- temperature %>%
  group_by(city, month) %>%
  summarise(across(temperature, mean))

temp_summary$month <- factor(
  month.abb[as.numeric(as.character(temp_summary$month))], 
  levels = month.abb)  # built-in vector 'Jan', ..., 'Dec' keeps calendar order

temp_summary %>% head()

Part-(c): Generate a heatmap of the summarized tibble

theme_set(theme_classic())
fig24_heat_map <- ggplot(temp_summary, aes(x=month, y=city, fill=temperature)) +
  geom_tile(colour="white", linewidth=1) +
  coord_equal() +
  scale_fill_viridis_c(option = "magma", breaks=seq(-20, 110,by=20)) +
  labs(title='Monthly normal mean temperatures for four locations in the US',
       x='month', 
       y='') +
  theme( plot.title = element_text(hjust = 0.5),
         axis.ticks = element_blank(),
         axis.line  = element_blank(),
         plot.margin=grid::unit(c(0,0,0,0), "mm"))

fig24_heat_map

Part-(d): Display and compare two plots

Although the line plot is drawn from daily data, what it mostly conveys is a roughly quarterly trend. It does let us compare the temperature ranges of the four cities, but the heat map (tile plot) does that job better: differences between colours on the scale are easier to compare approximately than distances between the curves in the line plot. The choice is therefore a trade-off between visual appeal and accurate comparison. As a layman, I would be more drawn to the heat map than to the line plot.

Coding #3 – Soccer Code Review

Review snippet #1

win_props <- 
  read_csv(
    "https://raw.githubusercontent.com/krisrs1128/stat479_s22/main/exercises/data/understat_per_game.csv"
    ) %>%
  group_by(team, year) %>%
  summarise(n_games = n(), wins = sum(wins) / n_games)

The above code is concise and needs no further optimization. Grouping the data by team and year makes it straightforward to compute each team's win proportion for each year.

Review snippet #2

best_teams_coll <- win_props %>%
  ungroup() %>%
  slice_max(wins, prop = 0.2) %>%
  pull(team)

My colleague did well to ungroup the data here, so that slice_max() picks the top 20% of team-year rows by win proportion across the whole table rather than within each group. However, a team that performed well in more than one year appears several times in the result. The output can be made more concise by keeping only the unique team names. This change is implemented below:

best_teams_me <- unique(
  win_props %>% 
    ungroup() %>%
    slice_max(wins, prop = 0.2) %>%
    pull(team))
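
The duplication issue can be reproduced on a small example. The table below is hypothetical (it is not the understat data) and only assumes the team, year and wins columns used above.

```r
library(dplyr)

# Hypothetical win proportions: team X is strong in two different years
toy_props <- tibble(
  team = c("X", "X", "Y", "Z"),
  year = c(2019, 2020, 2019, 2019),
  wins = c(0.9, 0.8, 0.5, 0.4)
)

# Top half of the team-year rows by win proportion
top_rows <- toy_props %>%
  slice_max(wins, prop = 0.5) %>%
  pull(team)

top_rows          # team X appears once per qualifying year
unique(top_rows)  # each team listed once
```

Because the ranking is over team-year rows, the same team can fill several of the top slots; unique() reduces the result to the set of teams.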

Review snippet #3

win_props %>%
  filter(team %in% best_teams_coll) %>%
  ggplot() + 
    geom_point(aes(year, team, size = n_games, alpha = wins))

Points of improvement for the plot:

We can change the layout by moving the win ratio to the y-axis and faceting the plot into one sub-plot per team. This makes it easier to compare the relative performances of the teams.
Grid lines can be added to the plots so that the relative positions of the points are easier to judge.
We could also add a colour component for the team dimension for a better viewing experience, and move the guide (legend) so that the plot gets more screen space.

These change proposals are implemented as below:

win_props %>%
  filter(team %in% best_teams_me) %>%
  ggplot(aes(year, wins)) +
  geom_point(aes(size = n_games), alpha = 0.6) +  # alpha is set on the layer; ggplot() ignores it
  geom_line(linewidth = 0.5, color='darkgrey')+
  scale_size(range=c(0.5,4)) +
  facet_wrap(. ~ team, ncol=5) +
  labs(y='Win ratio', x='') +
  guides(size = guide_legend(nrow = 1)) +  # lay out the size legend in a single row
  theme(
    panel.grid.major = element_line(color = "grey", linewidth = 0.5, linetype = 2),
    legend.position = c(0.8,0.035),
    legend.direction = 'horizontal',
    panel.spacing.x = unit(1, 'lines')
    )

Coding #4 – Visual Redesign

Part-(a)

Q. Identify one of your past visualizations for which you still have data. Include a screenshot of this past visualization.

Part-(b)

Q. Comment on the main takeaways from the visualization and the graphical relationships that lead to that conclusion. Is this takeaway consistent with the intended message? Are there important comparisons that you would like to highlight, but which are harder to make in the current design?

Ans: The main takeaways I got from the visualization are:

It shows which countries consume the most ramen and, among these countries, which type of packaging consumers prefer most. The height of each bar shows the number of ramens rated per country, and the colours within the stacked bar represent the number of ramens per packaging style. Arranging the bars in decreasing order makes the extremes easy to see.
The takeaway is partially consistent with the intended message. The only potential problem I can point to is that the exact number of ramens per packaging style is hard to read off.
The most important comparison I wanted to highlight with the visualization was which packaging style brands prefer most in these top 15 countries. With the current plot, it is difficult to answer that at a glance, unless the cumulative height of each colour across the bars is calculated explicitly.
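
The explicit calculation mentioned in the last point amounts to counting rows per packaging style. A minimal sketch, assuming a hypothetical handful of rows with the Country and Style columns of the ramen data (the values are made up for illustration):

```r
library(dplyr)

# Hypothetical rows standing in for the ramen ratings data
toy_ramen <- tibble(
  Country = c("Japan", "Japan", "USA", "USA", "USA"),
  Style   = c("Pack", "Cup", "Pack", "Pack", "Bowl")
)

# Total number of ramens per packaging style, most common first
style_totals <- toy_ramen %>% count(Style, sort = TRUE)

style_totals
```

This gives the per-style totals that a reader of the stacked bar chart would otherwise have to add up by eye across all bars.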

Part-(c)

Q. Comment on the legibility of the original visualization. Are there aspects of the visualization that are cluttered or difficult to read?

Ans: The original visualization is quite legible because it is uncluttered. The axes are properly labeled, the font is a simple sans-serif, and the colours are easy to distinguish. The only potential difficulty I see is reading off the exact number of ramens per packaging style: because of the stacked-bar layout, we have to take the difference between the start and the end of a colour segment along the y-axis. So, while the visualization is uncluttered and pleasing to the eye, drawing insights from it is harder.

Part-(d)

Q. Propose and implement an alternative design. What visual tasks do you prioritize in the new design? Did you have to make any trade-offs? Did you make any changes specifically to improve legibility.

I propose a scatter plot of ramen ratings against these 15 countries, with some jitter because many rating values overlap, faceted by the 5 most common packaging styles.

theme_set(theme_bw())
ramenratings <- read_csv("ramenratings.csv")

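# Return the `num` most frequent values of column `x` in `df`
extract_top <- function(df, x, num) {
  df %>% 
    count({{x}}, name = 'freq') %>%  # frequency of each value of x
    arrange(-freq) %>%               # most frequent first
    pull({{x}}) %>% 
    head(num)                        # num is a plain value, so no {{ }} is needed
}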

top_15_countries <- ramenratings %>% extract_top(Country, 15)

top_5_styles <- ramenratings %>% extract_top(Style, 5)

ramenratings  %>% 
  group_by(Country, Brand, Style) %>%
  # by = 0.5 matches the y-axis limits below; seq(3.5, 5.5) alone steps by 1 and skips 4 and 5
  filter(Country %in% top_15_countries & Style %in% top_5_styles & Stars %in% seq(3.5,5.5,0.5)) %>%
  ggplot(aes(Country, Stars)) + 
    geom_jitter(alpha=0.3) +
    facet_grid(factor(Style, levels=c('Box', 'Tray', 'Bowl', 'Cup', 'Pack')) ~ .) +
    scale_x_discrete(name ="", limits=top_15_countries) +
    scale_y_discrete(limits = seq(3.5,5.5,0.5)) + 
    labs(title = 'Ramen ratings for top 5 packaging styles in 15 most popular countries') +
    theme(
      plot.title = element_text(hjust = 0.5, size = 18),
      axis.text.x = element_text(angle = 45, hjust=1))

Discussion #1 – Antibiotics Comparison

Part-(a): Review on Approach #1

Well-suited visualization comparison: This plot clearly shows which ind value has the most impact on each species. It also helps compare svalue over time across species for a given ind.
Poorly-suited visualization comparison: Lower values of svalue are difficult to see when all the points are cluttered in one region of the plot.

Part-(b): Review on Approach #2

Well-suited visualization comparison: This plot shows the high extremes quite clearly, thanks to the gradient scale used for svalue.
Poorly-suited visualization comparison: On the other hand, because of the gradient colour scale, the lower svalue values are almost indistinguishable from one another.

Part-(c): Review on Approach #3

Well-suited visualization comparison: Comparing svalue for a specific species across the ind values.
Poorly-suited visualization comparison: Judging the relative differences in svalue between species within a subplot when all the svalue values are crowded into a small region.

Part-(d): Sketching - Implementing Approach #2

antibiotic <- read_csv("https://uwmadison.box.com/shared/static/5jmd9pku62291ek20lioevsw1c588ahx.csv")

theme_set(theme_grey())

ggplot(antibiotic, aes(x=time, y=ind, fill=value)) +
  geom_tile() +
  facet_grid(species ~ .) + 
  scale_fill_gradient(low = "#EEF3FF", high = "#2471A3") + 
  scale_x_continuous(breaks = seq(0, 50, 10), expand = c(0, 0)) + 
  theme(strip.text.y.right = element_text(angle = 0))

Discussion #2 – Homelessness

Chosen screenshot from the article:

Q. What do you think was the underlying data behind the current view? What were the rows, and what were the columns? What were the data types of each of the columns?

Ans: I expect the underlying data to contain the following columns at the very least:

Destination country: The rows would contain a character string of the name of the country the homeless were flown to
Destination city: The rows would contain a character string of the city in the destination country the homeless were flown to
Distance from NYC: The rows would contain a numeric value of the distance of the destination city from NYC in miles
Number: The rows would contain an integer number of homeless people flown to the destination city
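
As a rough sketch, the columns listed above could be declared as an empty tibble with the corresponding types. The column names are my own hypothetical choices, and no rows are filled in because the actual data behind the article is not available to me.

```r
library(tibble)

# Hypothetical schema for the data behind the view (zero rows, types only)
relocations <- tibble(
  destination_country = character(),  # name of the destination country
  destination_city    = character(),  # name of the destination city
  distance_from_nyc   = double(),     # distance from NYC in miles
  number              = integer()     # people flown to the destination city
)

str(relocations)
```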

Q. What encodings were used? How are properties of marks on the page derived from the underlying abstract data?

Ans:

Line encoding is used to represent the distance (in miles) of the destination city from NYC, which also determines the radius of the arc that connects the origin and the destination
Size encoding is used to visualize the number of people sent to the destination city
Label encoding is used to highlight a few notable countries the homeless were flown to

Q. Is multi-view composition being used? If so, how?

Ans: Multi-view composition is being used in this plot in the sense that two types of plots are merged into one. The first is a redesigned version of a line plot connecting the origin (NYC) with its destinations. The second is a scatter plot with a size encoding that shows how many people were moved to each destination.