library(tidyverse)
murders <- read.csv('murders.csv')
murders %>% head()
murder_rates <- read.csv('murder_rates.csv')
murder_rates
Reshaping the murders data frame as required:
murders_reshaped <- murders %>%
group_by(region) %>%
summarize(
murders = sum(total),
population = sum(population),
murder_rate = murders/population
) %>%
arrange(murder_rate)
murders_reshaped
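As a quick sanity check, the same pipeline can be run on a tiny table where the regional rates are easy to work out by hand. This is only a sketch: the `toy_murders` values below are invented for illustration and are not taken from `murders.csv`.

```r
library(dplyr)

# Toy stand-in for the murders data (values invented for illustration)
toy_murders <- data.frame(
  state      = c("A", "B", "C", "D"),
  region     = c("North", "North", "South", "South"),
  population = c(1000, 3000, 2000, 2000),
  total      = c(10, 10, 40, 40)
)

toy_reshaped <- toy_murders %>%
  group_by(region) %>%
  summarize(
    murders     = sum(total),          # regional murder count
    population  = sum(population),     # regional population
    murder_rate = murders / population # uses the two regional sums above
  ) %>%
  arrange(murder_rate)

toy_reshaped
```

Because the rate is computed from the regional totals, it is a population-weighted rate rather than the mean of the state-level rates (for "North" here, 20 / 4000 = 0.005).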
antibiotic_wide <- read.csv('antibiotic_wide.csv')
antibiotic_wide %>% head()
antibiotic_tidy <- read.csv('antibiotic_tidy.csv')
antibiotic_tidy %>% head()
Reshaping the antibiotic_wide data frame as required:
antibiotic_longer <- antibiotic_wide %>%
pivot_longer(cols = starts_with('Unc0'),
names_to = 'species',
values_to = 'value',
values_drop_na = TRUE) %>%
separate(sample, c('ind','time'), sep = 1, remove = FALSE) %>%
arrange(species)
antibiotic_longer$time <- as.numeric(as.character(antibiotic_longer$time))
antibiotic_longer %>% head()
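The two reshaping steps can be seen more clearly on a small example. This is a minimal sketch: the sample IDs and species columns in `toy_wide` are made up, and it assumes, as the pipeline above does, that the first character of `sample` is the individual and the rest is the time point.

```r
library(dplyr)
library(tidyr)

# Toy wide table; sample IDs and species names are invented for illustration
toy_wide <- data.frame(
  sample  = c("D1", "D12", "E3"),
  Unc0aaa = c(5, NA, 2),
  Unc0bbb = c(0, 7, NA)
)

toy_long <- toy_wide %>%
  # One row per (sample, species) pair; NA measurements are dropped
  pivot_longer(
    cols = starts_with("Unc0"),
    names_to = "species",
    values_to = "value",
    values_drop_na = TRUE
  ) %>%
  # sep = 1 splits after the first character: "D12" -> ind "D", time "12"
  separate(sample, c("ind", "time"), sep = 1, remove = FALSE) %>%
  mutate(time = as.integer(time)) %>%
  arrange(species)

toy_long
```

A positional `sep` only works while the individual label is exactly one character long; longer labels would need a regular-expression split instead.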
temperature <-
read_csv("https://raw.githubusercontent.com/krisrs1128/stat479_s22/main/data/temperatures.csv") %>%
group_by(city, month)
temperature$city <- factor(temperature$city, levels = c('Death Valley', 'Honolulu', 'Chicago', 'Barrow'))
fig23_line_plot <-
ggplot(temperature, aes(date, temperature, col=city)) +
geom_line(linewidth=1.25) +
scale_colour_manual(values = c('#FFC000', '#0096FF', '#097969', '#DA70D6')) +
scale_x_date(date_labels = '%b') +
scale_y_continuous(breaks = seq(-20,100,20)) +
labs(
title = 'Daily temperature normals for four selected locations in the U.S.',
x='month',
y='Temperature (°F)') +
theme(
panel.grid.major = element_line(color = "grey", linewidth = 0.5, linetype = 1),
plot.title = element_text(hjust = 0.5))
fig23_line_plot
### Part-(b): Summarize mean temperatures across cities and months
temp_summary <- temperature %>%
group_by(city, month) %>%
summarise(across(temperature, mean))
temp_summary$month <- factor(
month.abb[as.numeric(as.character(temp_summary$month))],
levels=c('Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'))
temp_summary %>% head()
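The month relabelling above can be looked at in isolation. The sketch below uses a few invented month numbers. It relies on the built-in `month.abb` vector, which holds the same twelve abbreviations as the hand-written `levels` vector.

```r
# month.abb is a built-in character vector: "Jan", "Feb", ..., "Dec"
month_num <- c(12, 1, 7)   # example month numbers, invented for illustration

# Using month.abb as the levels keeps the factor in calendar order
month_fct <- factor(month.abb[month_num], levels = month.abb)

sort(month_fct)     # calendar order, not alphabetical order
levels(month_fct)
```

Fixing the level order matters for the heat map, since ggplot2 lays out a discrete axis in level order; without it the months would be sorted alphabetically.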
theme_set(theme_classic())
fig24_heat_map <- ggplot(temp_summary, aes(x=month, y=city, fill=temperature)) +
geom_tile(colour="white", linewidth=1) +
coord_equal() +
scale_fill_viridis_c(option = "magma", breaks=seq(-20, 110,by=20)) +
labs(title='Monthly normal mean temperatures for four locations in the US',
x='month',
y='') +
theme( plot.title = element_text(hjust = 0.5),
axis.ticks = element_blank(),
axis.line = element_blank(),
plot.margin=grid::unit(c(0,0,0,0), "mm"))
fig24_heat_map
In the line plot, although we plotted daily data, the visualization mainly conveys a quarterly trend. We can compare the temperature ranges of the four cities in it, but the heat-map (tile plot) does a better job of that: the distance between colours on the scale is easier to use for approximate comparison than the distance between the curves in the line plot. The choice is therefore a trade-off between visual appeal and accurate comparison. As a layperson, I would be more drawn to the heat-map than to the line plot.
win_props <-
read_csv(
"https://raw.githubusercontent.com/krisrs1128/stat479_s22/main/exercises/data/understat_per_game.csv"
) %>%
group_by(team, year) %>%
summarise(n_games = n(), wins = sum(wins) / n_games)
The above code is concise and needs no further optimization. Grouping the data by team and year makes it easy to summarize each team's win proportion per year.
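One detail of this summary is worth seeing on a small example: after `summarise()` on two grouping variables, the result is still grouped by the first one. The sketch below uses an invented `toy_games` table, not the understat data.

```r
library(dplyr)

# Toy per-game table (values invented for illustration); wins is 0 or 1 per game
toy_games <- tibble(
  team = c("A", "A", "A", "B", "B", "B"),
  year = c(2014, 2014, 2015, 2014, 2014, 2015),
  wins = c(1, 0, 1, 0, 0, 1)
)

toy_props <- toy_games %>%
  group_by(team, year) %>%
  summarise(n_games = n(), wins = sum(wins) / n_games)

toy_props
group_vars(toy_props)   # only the last grouping variable (year) is dropped
```

Since the result remains grouped by `team`, any later row selection would run within each team unless the data is ungrouped first.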
best_teams_coll <- win_props %>%
ungroup() %>%
slice_max(wins, prop = 0.2) %>%
pull(team)
My colleague did well to ungroup the data here, so that the top 20 percent of team-year rows by win proportion are extracted across all teams. However, the result contains duplicates, because a team that performed well in more than one year appears once per year. The output can be made more concise, and use less memory, by keeping only the unique team names. This change is implemented below:
best_teams_me <- unique(
win_props %>%
ungroup() %>%
slice_max(wins, prop = 0.2) %>%
pull(team))
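The duplication is easy to reproduce on a small table. This is a sketch with invented values, where `toy_props` stands in for `win_props` and one team has the two best seasons.

```r
library(dplyr)

# Toy team-season table (values invented for illustration)
toy_props <- tibble(
  team = c("A", "A", "B", "B", "C", "C", "D", "D", "E", "E"),
  year = rep(c(2014, 2015), times = 5),
  wins = c(0.90, 0.85, 0.50, 0.40, 0.30, 0.35, 0.20, 0.25, 0.10, 0.15)
)

# prop = 0.2 keeps the top 20% of rows (team-seasons), here 2 of 10
top_rows <- toy_props %>%
  slice_max(wins, prop = 0.2) %>%
  pull(team)

top_rows           # the same team appears once per qualifying season
unique(top_rows)   # one entry per team
```

Note that the selection is over team-seasons, so the number of distinct teams returned can be well below 20 percent of all teams.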
win_props %>%
filter(team %in% best_teams_coll) %>%
ggplot() +
geom_point(aes(year, team, size = n_games, alpha = wins))
Points of improvement for the plot:
These change proposals are implemented as below:
win_props %>%
filter(team %in% best_teams_me) %>%
ggplot(aes(year, wins)) +
geom_point(aes(size = n_games), alpha = 0.6) +
geom_line(linewidth = 0.5, color='darkgrey')+
scale_size(range=c(0.5,4)) +
facet_wrap(. ~ team, ncol=5) +
labs(y='Win ratio', x='') +
guides(size = guide_legend(nrow = 1)) +
theme(
panel.grid.major = element_line(color = "grey", linewidth = 0.5, linetype = 2),
legend.position = c(0.8,0.035),
legend.direction = 'horizontal',
panel.spacing.x = unit(1, 'lines')
)
Q. Identify one of your past visualizations for which you still have data. Include a screenshot of this past visualization.
Q. Comment on the main takeaways from the visualization and the graphical relationships that lead to that conclusion. Is this takeaway consistent with the intended message? Are there important comparisons that you would like to highlight, but which are harder to make in the current design?
Ans: The main takeaways I got from the visualization are:
Q. Comment on the legibility of the original visualization. Are there aspects of the visualization that are cluttered or difficult to read?
Ans: The original visualization is quite legible because it is uncluttered. The axes are properly labeled, the font is a simple sans-serif, and the colours are easy to distinguish. The only potential difficulty I see is reading off the number of ramens per packaging style accurately. Because of the stacked bar layout, we have to take the difference between the start and end of each coloured segment along the y-axis. So, while the visualization is uncluttered and pleasing to the eye, drawing precise insights from it is harder.
Q. Propose and implement an alternative design. What visual tasks do you prioritize in the new design? Did you have to make any trade-offs? Did you make any changes specifically to improve legibility?
I propose a scatter plot (with some jitter, as there are overlapping rating values) of these 15 countries against the ramen ratings, faceted by the 5 ramen packaging styles.
theme_set(theme_bw())
ramenratings <- read_csv("ramenratings.csv")
# Return the `num` most frequent values of column `x` in `df`
extract_top <- function(df, x, num) {
  df %>%
    count({{x}}, name = "freq", sort = TRUE) %>%
    slice_head(n = num) %>%
    pull({{x}})
}
top_15_countries <- ramenratings %>% extract_top(Country, 15)
top_5_styles <- ramenratings %>% extract_top(Style, 5)
ramenratings %>%
group_by(Country, Brand, Style) %>%
filter(Country %in% top_15_countries & Style %in% top_5_styles & Stars %in% seq(3.5, 5.5, 0.5)) %>%
ggplot(aes(Country, Stars)) +
geom_jitter(alpha=0.3) +
facet_grid(factor(Style, levels=c('Box', 'Tray', 'Bowl', 'Cup', 'Pack')) ~ .) +
scale_x_discrete(name ="", limits=top_15_countries) +
scale_y_discrete(limits = seq(3.5,5.5,0.5)) +
labs(title = 'Ramen ratings for top 5 packaging styles in 15 most popular countries') +
theme(
plot.title = element_text(hjust = 0.5, size = 18),
axis.text.x = element_text(angle = 45, hjust=1))
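One detail of the rating filter worth checking is how `seq()` builds the set of allowed values: it steps by 1 unless `by` is supplied, while the y-axis of the plot uses half-star steps. The sketch below uses a few invented ratings to show the difference.

```r
# seq() steps by 1 unless `by` is given
whole_steps <- seq(3.5, 5.5)            # 3.5 4.5 5.5
half_steps  <- seq(3.5, 5.5, by = 0.5)  # 3.5 4.0 4.5 5.0 5.5

ratings <- c(3.5, 4, 4.25, 5)   # example ratings, invented for illustration

ratings %in% whole_steps   # TRUE FALSE FALSE FALSE
ratings %in% half_steps    # TRUE  TRUE FALSE  TRUE
ratings >= 3.5             # keeps quarter-star ratings such as 4.25 as well
```

Which filter is appropriate depends on whether quarter-star ratings should appear in the plot. Matching on half-star values keeps the points aligned with the axis breaks.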
antibiotic <- read_csv("https://uwmadison.box.com/shared/static/5jmd9pku62291ek20lioevsw1c588ahx.csv")
theme_set(theme_grey())
ggplot(antibiotic, aes(x=time, y=ind, fill=value)) +
geom_tile() +
facet_grid(species ~ .) +
scale_fill_gradient(low = "#EEF3FF", high = "#2471A3") +
scale_x_continuous(breaks = seq(0, 50, 10), expand = c(0, 0)) +
theme(strip.text.y.right = element_text(angle = 0))
Chosen screenshot from the article:
Q. What do you think was the underlying data behind the current view? What were the rows, and what were the columns? What were the data types of each of the columns?
Ans: I expect the underlying data to contain the following columns at the very least:
Q. What encodings were used? How are properties of marks on the page derived from the underlying abstract data?
Ans:
Q. Is multi-view composition being used? If so, how?
Ans: Multi-view composition is being used in this plot in the sense that two types of plots are merged into one. The first is a redesigned line plot connecting the origin (NYC) with its destination. The second is a scatter plot with a size encoding that shows how many people were moved to the destination.