The objective of this assignment is twofold. First, you are required to read and summarize the key aspects of both readings, Cleveland and McGill 1985 and Heer and Bostock 2010. This summary should explore the similarities and differences in the two studies' approaches and findings.
Cleveland and McGill provide a general theory of graphical perception that connects statistical graphics with how people perceive visualizations. They emphasize quantification. The study focuses on viewers' ability to decode the information that a graph encodes visually.
Heer and Bostock examine crowdsourcing, which lowers the cost of visualization and graphical perception experiments. Theirs is an empirical paper that validates the theory and extends it toward real-world applications such as online and web-based interfaces.
Cleveland and McGill 1985 and Heer and Bostock 2010
The purpose of these two articles is to describe and illustrate several graphical methods and to report theoretical and experimental investigations of graphical perception. They mainly investigate how visual variables such as position, length, area, angle, slope, shape, volume, and color affect the effectiveness of data visualizations.
When we look at a graph, quantitative and qualitative information is encoded chiefly through position, length, angle, size, color, and so on. It is important to know how strongly each of these visual variables affects the accuracy with which the data are decoded, because no matter how technologically impressive the encoding, a graph fails if the decoding process fails.
Cleveland and McGill combine experimental results with theoretical arguments to find out which tasks yield the most accurate graphical perception. A graph should rely on the tasks that viewers perform most accurately when they decode its information. They order these tasks by accuracy as follows: position along a common scale, position on identical but nonaligned scales, length, angle, slope (with θ not too close to 0, π/2, or π radians), area, volume, density, color saturation, color hue.
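The ordering above can be written down as a small lookup, which makes it easy to ask which of two encodings the ranking favors. This is only a sketch in base R: the vector name `tasks` and the helper `more_accurate()` are hypothetical names introduced here for illustration, and the flat order simply follows the list in the text.

```r
# Elementary perceptual tasks, most accurate first, in the order listed above
tasks <- c(
  "position_common_scale",
  "position_nonaligned_scales",
  "length",
  "angle",
  "slope",
  "area",
  "volume",
  "density",
  "color_saturation",
  "color_hue"
)

# Hypothetical helper: return whichever of two tasks comes earlier in the ordering
more_accurate <- function(a, b) {
  idx <- match(c(a, b), tasks)
  if (anyNA(idx)) stop("unknown task name")
  c(a, b)[which.min(idx)]
}

more_accurate("angle", "length")    # "length"
more_accurate("color_hue", "area")  # "area"
```

Under this ordering a bar chart, which is read through position and length, would be preferred over a pie chart, which is read through angle.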
The task of this graphic was to evaluate the average arrival delays of flights in and out of NYC. I was curious whether there was strong variability or a pattern across carriers.
Negative times represent early departures/arrivals. HA and AS appear to have much less arrival delay than the other carriers.
# setup
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(nycflights13)
# Manipulate data in prep for graphic
dat2 <- flights %>%
  select(carrier, arr_delay) %>%
  group_by(carrier) %>%
  summarise(delay = mean(arr_delay, na.rm = TRUE)) %>%
  as.data.frame()
# Create vis
ggplot(dat2, aes(factor(carrier), delay)) +
  geom_bar(stat = "identity") +
  xlab("Carrier") + ylab("Average Flight Arrival Delay (minutes)") +
  ggtitle("Average Arrival Delay by Carrier, NYC Flights 2013") +
  ggthemes::theme_tufte()
### My Vis 2
In this version I look at arrival delay without summarizing the data, plotting every data point for each carrier.
Interestingly, HA has the single event with the highest arrival delay.
This information was not visible in plot 1 because plot 1 showed only averages.
ggplot(flights, aes(carrier, arr_delay)) +
  geom_point(aes(colour = carrier), alpha = I(0.7), na.rm = TRUE) +
  ggtitle("Relation between arrival delay time and carrier")
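The outlier seen in the scatter plot can also be located directly in the data. The following is a minimal sketch, assuming the same `flights` table from nycflights13; the object name `worst_by_carrier` is introduced here only for illustration.

```r
library(dplyr)
library(nycflights13)

# For each carrier, keep the single flight with the largest arrival delay,
# then sort carriers by that worst-case delay
worst_by_carrier <- flights %>%
  filter(!is.na(arr_delay)) %>%
  group_by(carrier) %>%
  slice(which.max(arr_delay)) %>%
  ungroup() %>%
  select(carrier, month, day, flight, origin, dest, arr_delay) %>%
  arrange(desc(arr_delay))

head(worst_by_carrier, 3)
```

The first row gives the carrier, date, and route of the most delayed arrival, which is the kind of detail an average hides.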
In this one, I want to show arrival delay over the months for each carrier. But a typical x-y graph is too noisy to hold three or more variables at once, in this case month, carrier, and arrival delay.
Color does not help much either when a single graph already holds this much information.
ggplot(data = flights, aes(x = factor(month), y = arr_delay, colour = carrier, group = carrier)) +
  geom_line(linetype = "dotted", size = 1) +
  geom_point(size = 3, shape = 19, na.rm = TRUE) +
  ggtitle("Time Series Graph: Arrival Delay from Jan to Dec 2013")
## Warning: Removed 16 rows containing missing values (geom_path).
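One way to reduce the noise described above is to summarize before plotting, so that each carrier contributes one point per month instead of thousands. This is a sketch under that assumption, not the approach used in the plot above; the names `monthly` and `p_monthly` are hypothetical.

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

# One mean arrival delay per carrier and month; missing delays are dropped first
monthly <- flights %>%
  filter(!is.na(arr_delay)) %>%
  group_by(carrier, month) %>%
  summarise(delay = mean(arr_delay)) %>%
  ungroup()

# One line per carrier, with month kept numeric so the points connect in order
p_monthly <- ggplot(monthly, aes(x = month, y = delay, colour = carrier, group = carrier)) +
  geom_line() +
  geom_point(size = 1.5) +
  scale_x_continuous(breaks = 1:12) +
  xlab("Month") + ylab("Mean Arrival Delay (minutes)") +
  ggtitle("Mean Arrival Delay by Month and Carrier, 2013")

p_monthly
```

The lines become readable, although sixteen colors are still hard to tell apart, so the point about color made above continues to apply.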
### My Vis 4
If I draw a graph for each carrier separately, it is easier to interpret visually.
The rare delay event for HA happened in January.
In general, across all seasons, AS is the best carrier in terms of having little arrival delay.
ggplot(flights, aes(x = factor(month), y = arr_delay)) +
  geom_point(na.rm = TRUE) +
  facet_wrap(~ carrier) +
  guides(fill = FALSE) + # to remove the legend
  theme_bw() # for clean look overall
Here is an example that shows NOT ONLY THE ANSWER but also my process.
The task of this graphic was to evaluate the average delay time of flights in and out of NYC. I was curious whether there was strong variability across months due to the seasonality of weather in the Northeast.
# setup
library(dplyr)
library(ggplot2)
library(nycflights13)
# Load the data (abbreviate name for ease)
dat <- flights
# Manipulate data in prep for graphic
dat2 <- flights %>%
  select(month, dep_delay) %>%
  group_by(month) %>%
  summarise(delay = mean(dep_delay, na.rm = TRUE)) %>%
  as.data.frame()
# Create vis
ggplot(dat2, aes(factor(month), delay)) +
  geom_bar(stat = "identity") +
  xlab("Month") + ylab("Average Flight Delay (minutes)") +
  ggtitle("Annual Variability in NYC Flight Delays") +
  ggthemes::theme_tufte()
In this version I attempt to look at departure delay without summarizing the data, looking instead at a smooth of the data across 2013. To use a bivariate approach I must treat month as a continuous variable.
# Manipulate data in prep for graphic
dat3 <- flights %>%
  sample_n(1000) %>% # This dataset is large so I take a random sample
  select(month, dep_delay) %>%
  as.data.frame()
# Create vis
ggplot(dat3, aes(month, dep_delay)) +
  geom_point(na.rm = TRUE) +
  geom_smooth(na.rm = TRUE) +
  xlab("Month") + ylab("Flight Delay (minutes)") +
  ggtitle("Annual Variability in NYC Flight Delays")
## `geom_smooth()` using method = 'gam'
But because of the range of delay times I cannot really see the variability over time. I will try to correct that by removing geom_point().
# Create vis
ggplot(dat3, aes(month, dep_delay)) +
  # geom_point(na.rm = TRUE) +
  geom_smooth(na.rm = TRUE) +
  xlab("Month") + ylab("Flight Delay (minutes)") +
  ggtitle("Annual Variability in NYC Flight Delays")
## `geom_smooth()` using method = 'gam'
Finally, one more option lets me look at the variability in the raw data without treating month as a continuous variable.
# Create vis
ggplot(dat3, aes(factor(month), dep_delay)) +
  geom_boxplot() +
  xlab("Month") + ylab("Flight Delay (minutes)") +
  ylim(-10, 100) +
  ggtitle("Annual Variability in NYC Flight Delays")
## Warning: Removed 86 rows containing non-finite values (stat_boxplot).
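The warning above is worth a second look: `ylim()` sets scale limits, so rows outside -10 to 100 (along with any missing values) are dropped before the boxplot statistics are computed, which can shift the boxes themselves. A possible alternative, sketched here rather than taken from the example, is `coord_cartesian()`, which zooms the view without discarding data. The seed and the name `p_zoom` are assumptions added for repeatability and illustration.

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

set.seed(1)  # assumption: fixed seed so the random sample is repeatable
dat3 <- flights %>%
  sample_n(1000) %>%
  select(month, dep_delay) %>%
  as.data.frame()

# Same boxplot, but the y range is applied as a zoom on the coordinate system,
# so every non-missing delay still contributes to the medians and quartiles
p_zoom <- ggplot(dat3, aes(factor(month), dep_delay)) +
  geom_boxplot(na.rm = TRUE) +
  coord_cartesian(ylim = c(-10, 100)) +
  xlab("Month") + ylab("Flight Delay (minutes)") +
  ggtitle("Annual Variability in NYC Flight Delays")

p_zoom
```

With this version, outliers above 100 minutes are simply off screen instead of being removed from the calculation.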
Finally, the graphical primitive that I am applying in each of these examples is position along a common scale. In my opinion, the sheer size of this dataset makes it difficult to visualize the average differences between months without a summary statistic. Given that caveat, I think the most effective way to visualize mean departure delay by month was my first graphic, the bar chart. I added a Tufte-style theme from the ggthemes package to improve the overall aesthetics.