Data visualization is viewed by many disciplines as a modern equivalent of visual communication. It involves the creation and study of the visual representation of data.[1]
To communicate information clearly and efficiently, data visualization uses statistical graphics, plots, information graphics and other tools. Numerical data may be encoded using dots, lines, or bars to visually communicate a quantitative message.[2] Effective visualization helps users analyze and reason about data and evidence. It makes complex data more accessible, understandable and usable. Users may have particular analytical tasks, such as making comparisons or understanding causality, and the design principles of the graphic (i.e., showing comparisons or showing causality) follow the task. Tables are generally used where users will look up a specific measurement, while charts of various types are used to show patterns or relationships in the data for one or more variables.
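As a quick illustration of these encodings (a minimal sketch of my own, using a made-up three-row data frame rather than anything from the text), the same values can be shown as points or as bars with ggplot2:

library(ggplot2)

# hypothetical example data: three categories with one numeric value each
toy <- data.frame(category = c("A", "B", "C"), value = c(3, 7, 5))

# the same numbers encoded as points and as bars
ggplot(toy, aes(category, value)) + geom_point()
ggplot(toy, aes(category, value)) + geom_col()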
Data visualization is both an art and a science.[3] It is viewed as a branch of descriptive statistics by some, but also as a grounded theory development tool by others. Increased amounts of data created by Internet activity and an expanding number of sensors in the environment are referred to as “big data” or the Internet of things. Processing, analyzing and communicating these data present ethical and analytical challenges for data visualization.[4] The field of data science, and the practitioners called data scientists, has emerged to help address this challenge.[5]
If you’re like me, and want to know what happened between the 2nd century (the creation of the first table) and the 17th century (when Descartes introduced the graph), Michael Friendly’s 43-page e-book on the subject is guaranteed to fill a few knowledge gaps. Through storytelling and imagery, he organizes the history of data visualization into epochs, each of which he conveniently characterizes by its themes and accomplishments (statistical graphics, atlases, the introduction of geometric figures, etc.).
In his 1983 book The Visual Display of Quantitative Information, Edward Tufte defines ‘graphical displays’ and principles for effective graphical display: “Excellence in statistical graphics consists of complex ideas communicated with clarity, precision and efficiency.” Among other things, graphical displays should show the data, avoid distorting what the data have to say, present many numbers in a small space, make large data sets coherent, and encourage the eye to compare different pieces of data.
First, load the packages and data required for the analysis.
library(dslabs)     # murders, gapminder and polls_us_election_2016 datasets
library(tidyverse)  # ggplot2, dplyr, tidyr, purrr, readr, stringr, forcats
head(murders)
# overall US gun murder rate per million people
r <- murders %>%
  summarize(rate = sum(total) / sum(population) * 10^6) %>%
  pull(rate)
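As a follow-up sketch of my own (not part of the original analysis), the same dataset can be summarized per state and encoded as a sorted bar chart:

# per-state gun murder rate per million people, shown as a reordered bar chart
murders %>%
  mutate(rate = total / population * 10^6) %>%
  ggplot(aes(reorder(state, rate), rate)) +
  geom_col() +
  coord_flip() +
  labs(x = "", y = "gun murders per million")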
data("gapminder")
gapminder %>% filter(year %in% c(1962,2012)) %>%
ggplot(aes(fertility, life_expectancy,
color = continent)) +
geom_point() +
facet_grid(continent ~ year)
# repeat the comparison for a set of years, restricted to Europe and Asia
years <- c(1962, 1980, 1990, 2000, 2012)
continents <- c("Europe", "Asia")
gapminder %>%
  filter(year %in% years & continent %in% continents) %>%
  ggplot(aes(fertility, life_expectancy, color = continent)) +
  geom_point() +
  facet_wrap(~ year)
# GDP per person per day
gapminder <- gapminder %>% mutate(dollars_per_day = gdp / population / 365)
# income distribution (dollars per day) by region in 1970
gapminder %>%
  filter(year == 1970 & !is.na(gdp)) %>%
  mutate(region = reorder(region, dollars_per_day, FUN = median)) %>%
  ggplot(aes(region, dollars_per_day)) +
  geom_boxplot(aes(fill = region)) +
  labs(x = "region",
       y = "dollars per day",
       title = "Gapminder data") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        legend.position = "none")
# collapse the detailed regions into five broader groups
gapminder <- gapminder %>%
  mutate(group = case_when(
    region %in% c("Western Europe", "Northern Europe", "Southern Europe",
                  "Northern America", "Australia and New Zealand") ~ "West",
    region %in% c("Eastern Asia", "South-Eastern Asia") ~ "East Asia",
    region %in% c("Caribbean", "Central America", "South America") ~ "Latin America",
    continent == "Africa" & region != "Northern Africa" ~ "Sub-Saharan Africa",
    TRUE ~ "Others"))
# fix the order in which the groups appear in plots
gapminder <- gapminder %>%
  mutate(group = factor(group,
                        levels = c("Others", "Latin America", "East Asia", "Sub-Saharan Africa", "West")))
# income distribution by group, 1970 compared with 2000
gapminder %>% filter(year %in% c(1970, 2000) & !is.na(gdp)) %>%
  mutate(year = as.factor(year),
         group = reorder(group,
                         dollars_per_day,
                         FUN = median)) %>%
  ggplot(aes(group, dollars_per_day, fill = year)) +
  geom_boxplot() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(x = "",
       y = "dollars per day")
# standard error of a sample proportion p = 0.5 over a grid of sample sizes
N <- seq(100, 5000, len = 100)
p <- 0.5
se <- sqrt(p * (1 - p) / N)
plot(se)
# smallest N in the grid for which the standard error drops to 0.01 or below
e8 <- data.frame(N = N, se = se)
e8 %>% filter(se <= 0.01) %>% arrange(desc(se)) %>% .[1, ]
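The same cutoff can be obtained analytically (a sketch I am adding for context): solving sqrt(p*(1-p)/N) <= 0.01 for N gives N >= p*(1-p)/0.01^2, which is 2500 for p = 0.5.

# smallest N satisfying se <= 0.01, computed directly: N >= p*(1-p) / 0.01^2
p * (1 - p) / 0.01^2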
library(dslabs)
data("polls_us_election_2016")
names(polls_us_election_2016)
## [1] "state" "startdate" "enddate"
## [4] "pollster" "grade" "samplesize"
## [7] "population" "rawpoll_clinton" "rawpoll_trump"
## [10] "rawpoll_johnson" "rawpoll_mcmullin" "adjpoll_clinton"
## [13] "adjpoll_trump" "adjpoll_johnson" "adjpoll_mcmullin"
library(lubridate)
# keep polls that ended on or after 31 October 2016
polls <- polls_us_election_2016 %>%
  filter(enddate >= ymd(20161031))
nrow(polls)
## [1] 882
# count missing values in each column
colSums(is.na(polls_us_election_2016))
## state startdate enddate pollster
## 0 0 0 0
## grade samplesize population rawpoll_clinton
## 429 1 0 0
## rawpoll_trump rawpoll_johnson rawpoll_mcmullin adjpoll_clinton
## 0 1409 4178 0
## adjpoll_trump adjpoll_johnson adjpoll_mcmullin
## 0 1409 4178
# reshape the eight poll columns into long format: one row per poll per candidate measure
polls <- gather(polls_us_election_2016,
                'rawpoll_clinton', 'rawpoll_trump', 'rawpoll_johnson', 'rawpoll_mcmullin',
                'adjpoll_clinton', 'adjpoll_trump', 'adjpoll_johnson', 'adjpoll_mcmullin',
                key = 'candidate', value = 'pollratio')
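For reference (my own addition): in tidyr 1.0 and later the same reshape is usually written with pivot_longer(); a minimal sketch assuming the same column names:

# equivalent reshape with pivot_longer (tidyr >= 1.0)
polls_long <- polls_us_election_2016 %>%
  pivot_longer(cols = c(starts_with("rawpoll_"), starts_with("adjpoll_")),
               names_to = "candidate",
               values_to = "pollratio")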
ex <- c("rawpoll_trump","adjpoll_trump")
str_sub(ex,1,as.numeric(regexec("_",ex))-1)
## [1] "rawpoll" "adjpoll"
# check gsub's argument order
args(gsub)
## function (pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
## fixed = FALSE, useBytes = FALSE)
## NULL
gsub("\\w*_","",ex)
## [1] "trump" "trump"
# the first seven word characters give the poll type
str_extract(ex, "\\w{7}")
## [1] "rawpoll" "adjpoll"
# split candidate into the poll type (raw vs adjusted, stored in 'competion') and the candidate name
polls <- polls %>% mutate(competion = str_extract(candidate, "\\w{7}"))
polls <- polls %>% mutate(cand = gsub("\\w*_", "", candidate))
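As an alternative to the two mutate() calls above, tidyr::separate() does the same split in one step; a small sketch (my own addition) using the column names above:

# split 'candidate' (e.g. "rawpoll_clinton") into the poll type and the candidate name
polls <- polls %>%
  separate(candidate, into = c("competion", "cand"), sep = "_", remove = FALSE)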
names(polls)
## [1] "state" "startdate" "enddate" "pollster" "grade"
## [6] "samplesize" "population" "candidate" "pollratio" "competion"
## [11] "cand"
# distribution of poll ratios by candidate, split by raw vs adjusted polls
polls %>%
  mutate(cand = reorder(cand, pollratio, FUN = mean)) %>%
  ggplot(aes(cand, pollratio, fill = competion)) +
  geom_boxplot() +
  labs(x = "candidate",
       y = "poll ratio")
## Warning: Removed 11174 rows containing non-finite values (stat_boxplot).
# drop rows with missing poll ratios, then any remaining rows with missing values
polls_na <- drop_na(polls, pollratio)
polls_na <- na.omit(polls_na)
library(forecast)
# raw poll results by state, Clinton vs Trump, states sorted by median poll ratio
polls_na %>% filter(competion == "rawpoll", cand %in% c("clinton", "trump")) %>%
  mutate(state = reorder(state, pollratio, FUN = median)) %>%
  ggplot(aes(state, pollratio, fill = cand)) +
  geom_col(position = "dodge") +
  geom_hline(yintercept = 50, color = "grey", size = 0.5) +  # 50% reference line (poll ratios are percentages)
  coord_flip() +
  facet_grid(. ~ cand)
Look at the overall poll ratio of each candidate, weighted by sample size.
# overall poll ratio per candidate and poll type, weighted by sample size
polls_na %>%
  filter(cand %in% c("clinton", "trump")) %>%
  group_by(cand, competion) %>%
  summarize(overall_ratio = sum(pollratio * samplesize) / sum(samplesize))
# smoothed poll ratio over time for each candidate
polls_na %>% ggplot(aes(x = enddate, y = pollratio)) +
  geom_smooth(aes(color = cand))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
polls_na %>% filter(cand %in% c("clinton", "trump") & enddate >= "2016-10-31") %>% ggplot(aes(cand, pollratio, fill = competion)) +
geom_col(position = "dodge")
Obama election: simulate a set of final-week polls for a true spread of 3.9%.
library(dslabs)
# true spread d used for the simulation (3.9%)
d <- 0.039
# sample sizes for twelve simulated polls
Ns <- c(1298, 533, 1342, 897, 774, 254, 812, 324, 1291, 1056, 2172, 516)
# proportion implied by the spread: p = (d + 1) / 2
p <- (d + 1) / 2
# simulate one poll of size N: draw N voters, estimate the spread 2*x_hat - 1
# and a 95% confidence interval for it (note: this overwrites the earlier polls object)
polls <- map_df(Ns, function(N) {
  x <- sample(c(0, 1), size = N, replace = TRUE, prob = c(1 - p, p))
  x_hat <- mean(x)
  se_hat <- sqrt(x_hat * (1 - x_hat) / N)
  list(estimate = 2 * x_hat - 1,
       low = 2 * (x_hat - 1.96 * se_hat) - 1,
       high = 2 * (x_hat + 1.96 * se_hat) - 1,
       sample_size = N)
}) %>% mutate(poll = seq_along(Ns))
# each simulated poll's estimate with its 95% confidence interval;
# the dashed grey line marks the true spread
polls %>% ggplot(aes(x = poll, y = estimate, ymin = low, ymax = high)) +
  geom_pointrange(size = 0.5, color = "blue") +
  geom_errorbar(color = "blue") +
  geom_hline(yintercept = 0) +
  geom_hline(yintercept = 0.039, linetype = 2, color = "grey") +
  coord_flip() +
  labs(x = "poll",
       y = "estimate") +
  theme_classic() +
  theme(legend.position = "none")
sum(polls$sample_size)
## [1] 11269
Estimate the overall spread as the sample-size-weighted average of the poll estimates.
d_hat <- polls %>%
summarize(avg = sum(estimate*sample_size)/sum(sample_size)) %>%
pull(avg)
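For context (an addition of mine, under the same simple model): treating the combined polls as one large sample, the aggregated estimate comes with an approximate 95% margin of error.

# approximate 95% margin of error for the aggregated spread,
# treating all simulated polls as one sample of size sum(sample_size)
p_hat <- (d_hat + 1) / 2
moe <- 1.96 * 2 * sqrt(p_hat * (1 - p_hat) / sum(polls$sample_size))
moe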
library(ggridges)
# divide each poll ratio by the square root of its sample size and
# compare the resulting distributions across candidates as ridgeline densities
polls_na %>%
  mutate(clt = pollratio / sqrt(samplesize)) %>%
  mutate(cand = reorder(cand, clt, FUN = median)) %>%
  ggplot(aes(clt, cand, fill = cand)) +
  geom_density_ridges() +
  coord_cartesian(xlim = c(0, 4)) +
  scale_fill_brewer(palette = "Blues") +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(x = "",
       y = "",
       title = "Who's winning the popular vote")
## Picking joint bandwidth of 0.0751