Chester Ismay
Office: ETC 223
Great data visualizations almost always
“The simple graph has brought more information to the data analyst’s mind than any other device.”
“A picture is not merely worth a thousand words, it is much more likely to be scrutinized than words are to be read.”
WTF Visualizations : http://viz.wtf/
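ggplot2 in R
library(dplyr)    # provides %>%, filter(), select(), mutate(), summarize()
library(ggplot2)  # provides ggplot() and the geom_*() layers
library(pnwflights14); data("flights", package = "pnwflights14")
pdx_flights <- flights %>% filter(origin == "PDX") %>%
  select(-year, -origin)
pdx_flights %>% str()
## Classes 'tbl_df', 'tbl' and 'data.frame': 53335 obs. of 14 variables: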
## $ month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_time : int 1 8 28 526 541 549 559 602 606 618 ...
## $ dep_delay: num 96 13 -2 -4 1 24 -1 -3 6 -2 ...
## $ arr_time : int 235 548 800 1148 911 907 916 1204 746 1135 ...
## $ arr_delay: num 70 -4 -23 15 4 12 -9 7 3 -30 ...
## $ carrier : chr "AS" "UA" "US" "UA" ...
## $ tailnum : chr "N508AS" "N37422" "N547UW" "N813UA" ...
## $ flight : int 145 1609 466 229 1569 649 796 1573 406 1650 ...
## $ dest : chr "ANC" "IAH" "CLT" "IAH" ...
## $ air_time : num 194 201 251 217 130 122 125 203 87 184 ...
## $ distance : num 1542 1825 2282 1825 991 ...
## $ hour : num 0 0 0 5 5 5 5 6 6 6 ...
## $ minute : num 1 8 28 26 41 49 59 2 6 18 ...
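The str() output above suggests that dep_time stores clock times as integers in HHMM form: the value 526 sits alongside hour 5 and minute 26. The following is a minimal sketch of that relationship, using only the first few dep_time values printed above. The assumption that every dep_time follows this encoding is an inference from the output, not something the slides state.

```r
# First five dep_time values copied from the str() output above
dep_time <- c(1L, 8L, 28L, 526L, 541L)

# Integer division and remainder split HHMM into its two parts
hour   <- dep_time %/% 100L
minute <- dep_time %% 100L

data.frame(dep_time, hour, minute)
```

The recovered values can be compared by eye with the hour and minute columns in the same str() output.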
# Keep complete rows for flights that arrived more than 30 minutes early
pdx_early_flights <- pdx_flights %>% na.omit() %>%
  filter(arr_delay < -30)

pdx_early_flights %>% ggplot(aes(x = distance, y = arr_delay)) +
  geom_point()

# Colour the points by whether the destination is a Hawaiian airport
pdx_early_flights %>%
  mutate(hawaii_dest = ifelse(dest %in% c("OGG", "KOA", "HNL", "LIH"),
                              "Hawaiian", "Not Hawaiian")) %>%
  ggplot(aes(x = distance, y = arr_delay)) +
  geom_point(aes(color = hawaii_dest))

# Same comparison as side-by-side boxplots
pdx_early_flights %>%
  mutate(hawaii_dest = ifelse(dest %in% c("OGG", "KOA", "HNL", "LIH"),
                              "Hawaiian", "Not Hawaiian")) %>%
  ggplot(aes(x = hawaii_dest, y = arr_delay)) +
  geom_boxplot()

# Build "2014-month-day" strings to parse into dates
date_string <- paste0("2014-",
                      pdx_early_flights$month, "-",
                      pdx_early_flights$day)
pdx_early_flights <- pdx_early_flights %>%
  mutate(day_of_year = lubridate::ymd(date_string))

pdx_early_flights %>% ggplot(aes(x = day_of_year, y = arr_delay)) +
  geom_point()

pdx_early_flights %>%
  mutate(hawaii_dest = ifelse(dest %in% c("OGG", "KOA", "HNL", "LIH"),
                              "Hawaiian", "Not Hawaiian")) %>%
  ggplot(aes(x = day_of_year, y = arr_delay)) +
  geom_point(aes(color = hawaii_dest))

flights %>% na.omit() %>%
  mutate(hawaii_dest = ifelse(dest %in% c("OGG", "KOA", "HNL", "LIH"),
                              "Hawaiian", "Not Hawaiian")) %>%
  group_by(hawaii_dest) %>%
  summarize(perc_early = sum(arr_delay < -30) / n() * 100)
## Source: local data frame [2 x 2]
##
## hawaii_dest perc_early
## (chr) (dbl)
## 1 Hawaiian 9.959855
## 2 Not Hawaiian 1.366410
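The perc_early column above relies on a logical comparison being summed as 0s and 1s within each group. Here is a small sketch of that arithmetic on a made-up data frame. The `toy` data, its `group` labels, and the extra `n_early` and `perc_mean` columns are hypothetical and are not part of the workshop data.

```r
library(dplyr)

# Hypothetical toy data, not taken from pnwflights14
toy <- data.frame(
  group     = c("a", "a", "a", "a", "b", "b", "b", "b"),
  arr_delay = c(-40, -10, -35, 5, -31, 0, 10, 20)
)

toy_summary <- toy %>%
  group_by(group) %>%
  summarize(
    n_early    = sum(arr_delay < -30),              # each TRUE counts as 1
    perc_early = sum(arr_delay < -30) / n() * 100,  # same form as above
    perc_mean  = mean(arr_delay < -30) * 100        # equivalent shortcut
  )

toy_summary
```

Group "a" has two of four flights more than 30 minutes early and group "b" has one of four, so the two percentage columns should agree within each group.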
dep_delay?
summary_sea_flights <- flights %>% na.omit() %>%
  filter(origin == "SEA") %>%
  group_by(carrier) %>%
  summarize(mean_dep_delay = mean(dep_delay))

# First attempt: by default geom_bar() computes bar heights itself, which
# conflicts with the y aesthetic supplied here, so this call is expected to fail
summary_sea_flights %>% ggplot(aes(x = carrier, y = mean_dep_delay)) +
  geom_bar()

# stat = "identity" plots the y values exactly as given
summary_sea_flights %>% ggplot(aes(x = carrier, y = mean_dep_delay)) +
  geom_bar(stat = "identity")

# colour sets the bar outline; reorder() sorts carriers by mean delay
summary_sea_flights %>% ggplot(aes(x = reorder(carrier, mean_dep_delay),
                                   y = mean_dep_delay)) +
  geom_bar(stat = "identity", colour = "red")

# fill sets the bar interior
summary_sea_flights %>% ggplot(aes(x = reorder(carrier, mean_dep_delay),
                                   y = mean_dep_delay)) +
  geom_bar(stat = "identity", fill = "red") +
  xlab("Airline Carrier") +
  ylab("Mean Departure Delay") +
  ggtitle("Seattle Departure Delays for 2014")

flights %>% filter(arr_delay < -30) %>%
  ggplot(aes(x = arr_delay)) +
  geom_histogram(stat = "bin", binwidth = 1, colour = "blue")

flights_seasons <- flights %>% filter(arr_delay < -30) %>%
  na.omit() %>%
  mutate(season = ifelse(month %in% 3:5, "spring",
                  ifelse(month %in% 6:8, "summer",
                  ifelse(month %in% 9:11, "autumn",
                         "winter"))))

# Seasons distinguished by fill colour within one histogram
flights_seasons %>%
  ggplot(aes(x = arr_delay, fill = season)) +
  geom_histogram(stat = "bin", binwidth = 1)

# One panel per season, arranged in columns
flights_seasons %>% ggplot(aes(x = arr_delay)) +
  geom_histogram(stat = "bin", binwidth = 1) +
  facet_grid(. ~ season)

# One panel per season, arranged in rows
flights_seasons %>% ggplot(aes(x = arr_delay)) +
  geom_histogram(stat = "bin", binwidth = 1) +
  facet_grid(season ~ .)

flights_seasons %>% ggplot(aes(x = season, y = arr_delay)) +
  geom_boxplot()

In brief, the grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (color, shape, size) of geometric objects (points, lines, bars).
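To make the grammar statement concrete, here is a minimal sketch that maps columns of a data frame to aesthetic attributes of points. It uses the built-in mtcars data set as a stand-in so that it runs without pnwflights14. The choice of columns (wt, mpg, cyl, hp) is purely illustrative.

```r
library(ggplot2)

# data: mtcars (ships with R)
# aesthetics: x and y position, colour, size
# geometric object: points
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = factor(cyl), size = hp))

# Same data and position mappings, different geometric object: a line
p_line <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_line()

print(p)
```

Swapping geom_point() for geom_line() changes only the geometric object. The data and the position mappings stay the same, which is the separation the grammar describes.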
data.frame in R whenever possible.
Your assignment : Right click on me and Save
Solutions (Rmd) : Right click and Save
Solutions (HTML) : Click away…after you’ve tried!
All workshops in ETC 211 from 4 - 5 PM
September 16 - Data analysis with Stata
September 23 - Data analysis with R
September 30 - Data visualization using R
Slides available at http://rpubs.com/cismay/dvur_workshop_2015