Data visualization is viewed by many disciplines as a modern equivalent of visual communication. It involves the creation and study of the visual representation of data.[1]
To communicate information clearly and efficiently, data visualization uses statistical graphics, plots, information graphics and other tools. Numerical data may be encoded using dots, lines, or bars to visually communicate a quantitative message.[2] Effective visualization helps users analyze and reason about data and evidence. It makes complex data more accessible, understandable and usable. Users may have particular analytical tasks, such as making comparisons or understanding causality, and the design principles of the graphic (i.e., showing comparisons or showing causality) follow the task. Tables are generally used where users will look up a specific measurement, while charts of various types are used to show patterns or relationships in the data for one or more variables.
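As a quick illustration of these encodings (a minimal sketch of my own, using a made-up three-row data frame rather than anything from the text), the same values can be shown as points or as bars with ggplot2:

library(ggplot2)

# hypothetical example data: three categories with one numeric value each
toy <- data.frame(category = c("A", "B", "C"), value = c(3, 7, 5))

# the same numbers encoded as points and as bars
ggplot(toy, aes(category, value)) + geom_point()
ggplot(toy, aes(category, value)) + geom_col()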
Data visualization is both an art and a science.[3] It is viewed as a branch of descriptive statistics by some, but also as a grounded theory development tool by others. Increased amounts of data created by Internet activity and an expanding number of sensors in the environment are referred to as “big data” or the Internet of things. Processing, analyzing and communicating these data present ethical and analytical challenges for data visualization.[4] The field of data science, and the practitioners called data scientists, has emerged to help address this challenge.[5]
If you’re like me, and want to know what happened between the 2nd century (the creation of the first table) and the 17th century (when Descartes introduced the graph), Michael Friendly’s 43-page e-book on the subject is guaranteed to fill a few knowledge gaps. Through storytelling and imagery, he organizes the history of data visualization into epochs, each of which he conveniently characterizes by its themes and accomplishments (statistical graphics, atlases, the introduction of geometric figures, etc.).
In his 1983 book The Visual Display of Quantitative Information, Edward Tufte defines ‘graphical displays’ and principles for effective graphical display: “Excellence in statistical graphics consists of complex ideas communicated with clarity, precision and efficiency.” Among other things, graphical displays should show the data, avoid distorting what the data have to say, present many numbers in a small space, make large data sets coherent, and encourage the eye to compare different pieces of data.
First, load the packages and data required for the analysis.
library(dslabs)     # murders, gapminder and polls_us_election_2016 datasets
library(tidyverse)  # ggplot2, dplyr, tidyr, purrr, readr, stringr, forcats
head(murders)
# overall US gun murder rate per million people
r <- murders %>%
  summarize(rate = sum(total) / sum(population) * 10^6) %>%
  pull(rate)
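As a follow-up sketch of my own (not part of the original analysis), the same dataset can be summarized per state and encoded as a sorted bar chart:

# per-state gun murder rate per million people, shown as a reordered bar chart
murders %>%
  mutate(rate = total / population * 10^6) %>%
  ggplot(aes(reorder(state, rate), rate)) +
  geom_col() +
  coord_flip() +
  labs(x = "", y = "gun murders per million")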
data("gapminder")
gapminder %>% filter(year %in% c(1962,2012)) %>%
ggplot(aes(fertility, life_expectancy,
color = continent)) +
geom_point() +
facet_grid(continent ~ year)
# repeat the comparison for a set of years, restricted to Europe and Asia
years <- c(1962, 1980, 1990, 2000, 2012)
continents <- c("Europe", "Asia")
gapminder %>%
  filter(year %in% years & continent %in% continents) %>%
  ggplot(aes(fertility, life_expectancy, color = continent)) +
  geom_point() +
  facet_wrap(~ year)
# GDP per person per day
gapminder <- gapminder %>% mutate(dollars_per_day = gdp / population / 365)
# income distribution (dollars per day) by region in 1970
gapminder %>%
  filter(year == 1970 & !is.na(gdp)) %>%
  mutate(region = reorder(region, dollars_per_day, FUN = median)) %>%
  ggplot(aes(region, dollars_per_day)) +
  geom_boxplot(aes(fill = region)) +
  labs(x = "region",
       y = "dollars per day",
       title = "Gapminder data") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        legend.position = "none")
# collapse the detailed regions into five broader groups
gapminder <- gapminder %>%
  mutate(group = case_when(
    region %in% c("Western Europe", "Northern Europe", "Southern Europe",
                  "Northern America", "Australia and New Zealand") ~ "West",
    region %in% c("Eastern Asia", "South-Eastern Asia") ~ "East Asia",
    region %in% c("Caribbean", "Central America", "South America") ~ "Latin America",
    continent == "Africa" & region != "Northern Africa" ~ "Sub-Saharan Africa",
    TRUE ~ "Others"))
# fix the order in which the groups appear in plots
gapminder <- gapminder %>%
  mutate(group = factor(group,
                        levels = c("Others", "Latin America", "East Asia", "Sub-Saharan Africa", "West")))
# income distribution by group, 1970 compared with 2000
gapminder %>% filter(year %in% c(1970, 2000) & !is.na(gdp)) %>%
  mutate(year = as.factor(year),
         group = reorder(group,
                         dollars_per_day,
                         FUN = median)) %>%
  ggplot(aes(group, dollars_per_day, fill = year)) +
  geom_boxplot() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(x = "",
       y = "dollars per day")
# standard error of a sample proportion p = 0.5 over a grid of sample sizes
N <- seq(100, 5000, len = 100)
p <- 0.5
se <- sqrt(p * (1 - p) / N)
plot(se)
# smallest N in the grid for which the standard error drops to 0.01 or below
e8 <- data.frame(N = N, se = se)
e8 %>% filter(se <= 0.01) %>% arrange(desc(se)) %>% .[1, ]
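The same cutoff can be obtained analytically (a sketch I am adding for context): solving sqrt(p*(1-p)/N) <= 0.01 for N gives N >= p*(1-p)/0.01^2, which is 2500 for p = 0.5.

# smallest N satisfying se <= 0.01, computed directly: N >= p*(1-p) / 0.01^2
p * (1 - p) / 0.01^2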
library(dslabs)
data("polls_us_election_2016")
names(polls_us_election_2016)
## [1] "state" "startdate" "enddate"
## [4] "pollster" "grade" "samplesize"
## [7] "population" "rawpoll_clinton" "rawpoll_trump"
## [10] "rawpoll_johnson" "rawpoll_mcmullin" "adjpoll_clinton"
## [13] "adjpoll_trump" "adjpoll_johnson" "adjpoll_mcmullin"
library(lubridate)
# keep polls that ended on or after 31 October 2016
polls <- polls_us_election_2016 %>%
  filter(enddate >= ymd(20161031))
nrow(polls)
## [1] 882
# count missing values in each column
colSums(is.na(polls_us_election_2016))
## state startdate enddate pollster
## 0 0 0 0
## grade samplesize population rawpoll_clinton
## 429 1 0 0
## rawpoll_trump rawpoll_johnson rawpoll_mcmullin adjpoll_clinton
## 0 1409 4178 0
## adjpoll_trump adjpoll_johnson adjpoll_mcmullin
## 0 1409 4178
# reshape the eight poll columns into long format: one row per poll per candidate measure
polls <- gather(polls_us_election_2016,
                'rawpoll_clinton', 'rawpoll_trump', 'rawpoll_johnson', 'rawpoll_mcmullin',
                'adjpoll_clinton', 'adjpoll_trump', 'adjpoll_johnson', 'adjpoll_mcmullin',
                key = 'candidate', value = 'pollratio')
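For reference (my own addition): in tidyr 1.0 and later the same reshape is usually written with pivot_longer(); a minimal sketch assuming the same column names:

# equivalent reshape with pivot_longer (tidyr >= 1.0)
polls_long <- polls_us_election_2016 %>%
  pivot_longer(cols = c(starts_with("rawpoll_"), starts_with("adjpoll_")),
               names_to = "candidate",
               values_to = "pollratio")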
ex <- c("rawpoll_trump","adjpoll_trump")
str_sub(ex,1,as.numeric(regexec("_",ex))-1)
## [1] "rawpoll" "adjpoll"
# check gsub's argument order
args(gsub)
## function (pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
## fixed = FALSE, useBytes = FALSE)
## NULL
gsub("\\w*_","",ex)
## [1] "trump" "trump"
# the first seven word characters give the poll type
str_extract(ex, "\\w{7}")
## [1] "rawpoll" "adjpoll"
# split candidate into the poll type (raw vs adjusted, stored in 'competion') and the candidate name
polls <- polls %>% mutate(competion = str_extract(candidate, "\\w{7}"))
polls <- polls %>% mutate(cand = gsub("\\w*_", "", candidate))
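As an alternative to the two mutate() calls above, tidyr::separate() does the same split in one step; a small sketch (my own addition) using the column names above:

# split 'candidate' (e.g. "rawpoll_clinton") into the poll type and the candidate name
polls <- polls %>%
  separate(candidate, into = c("competion", "cand"), sep = "_", remove = FALSE)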
names(polls)
## [1] "state" "startdate" "enddate" "pollster" "grade"
## [6] "samplesize" "population" "candidate" "pollratio" "competion"
## [11] "cand"
# distribution of poll ratios by candidate, split by raw vs adjusted polls
polls %>%
  mutate(cand = reorder(cand, pollratio, FUN = mean)) %>%
  ggplot(aes(cand, pollratio, fill = competion)) +
  geom_boxplot() +
  labs(x = "candidate",
       y = "poll ratio")
## Warning: Removed 11174 rows containing non-finite values (stat_boxplot).
# drop rows with missing poll ratios, then any remaining rows with missing values
polls_na <- drop_na(polls, pollratio)
polls_na <- na.omit(polls_na)
library(forecast)
# raw poll results by state, Clinton vs Trump, states sorted by median poll ratio
polls_na %>% filter(competion == "rawpoll", cand %in% c("clinton", "trump")) %>%
  mutate(state = reorder(state, pollratio, FUN = median)) %>%
  ggplot(aes(state, pollratio, fill = cand)) +
  geom_col(position = "dodge") +
  geom_hline(yintercept = 50, color = "grey", size = 0.5) +  # 50% reference line (poll ratios are percentages)
  coord_flip() +
  facet_grid(. ~ cand)
Look at the overall poll ratio of each candidate, weighted by sample size.
# overall poll ratio per candidate and poll type, weighted by sample size
polls_na %>%
  filter(cand %in% c("clinton", "trump")) %>%
  group_by(cand, competion) %>%
  summarize(overall_ratio = sum(pollratio * samplesize) / sum(samplesize))
# smoothed poll ratio over time for each candidate
polls_na %>% ggplot(aes(x = enddate, y = pollratio)) +
  geom_smooth(aes(color = cand))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
polls_na %>% filter(cand %in% c("clinton", "trump") & enddate >= "2016-10-31") %>% ggplot(aes(cand, pollratio, fill = competion)) +
geom_col(position = "dodge")
Obama election: simulate a set of final-week polls for a true spread of 3.9%.
library(dslabs)
# true spread d used for the simulation (3.9%)
d <- 0.039
# sample sizes for twelve simulated polls
Ns <- c(1298, 533, 1342, 897, 774, 254, 812, 324, 1291, 1056, 2172, 516)
# proportion implied by the spread: p = (d + 1) / 2
p <- (d + 1) / 2
# simulate one poll of size N: draw N voters, estimate the spread 2*x_hat - 1
# and a 95% confidence interval for it (note: this overwrites the earlier polls object)
polls <- map_df(Ns, function(N) {
  x <- sample(c(0, 1), size = N, replace = TRUE, prob = c(1 - p, p))
  x_hat <- mean(x)
  se_hat <- sqrt(x_hat * (1 - x_hat) / N)
  list(estimate = 2 * x_hat - 1,
       low = 2 * (x_hat - 1.96 * se_hat) - 1,
       high = 2 * (x_hat + 1.96 * se_hat) - 1,
       sample_size = N)
}) %>% mutate(poll = seq_along(Ns))
# each simulated poll's estimate with its 95% confidence interval;
# the dashed grey line marks the true spread
polls %>% ggplot(aes(x = poll, y = estimate, ymin = low, ymax = high)) +
  geom_pointrange(size = 0.5, color = "blue") +
  geom_errorbar(color = "blue") +
  geom_hline(yintercept = 0) +
  geom_hline(yintercept = 0.039, linetype = 2, color = "grey") +
  coord_flip() +
  labs(x = "poll",
       y = "estimate") +
  theme_classic() +
  theme(legend.position = "none")
sum(polls$sample_size)
## [1] 11269
Estimate the overall spread as the sample-size-weighted average of the poll estimates.
d_hat <- polls %>%
summarize(avg = sum(estimate*sample_size)/sum(sample_size)) %>%
pull(avg)
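For context (an addition of mine, under the same simple model): treating the combined polls as one large sample, the aggregated estimate comes with an approximate 95% margin of error.

# approximate 95% margin of error for the aggregated spread,
# treating all simulated polls as one sample of size sum(sample_size)
p_hat <- (d_hat + 1) / 2
moe <- 1.96 * 2 * sqrt(p_hat * (1 - p_hat) / sum(polls$sample_size))
moe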
library(ggridges)
# divide each poll ratio by the square root of its sample size and
# compare the resulting distributions across candidates as ridgeline densities
polls_na %>%
  mutate(clt = pollratio / sqrt(samplesize)) %>%
  mutate(cand = reorder(cand, clt, FUN = median)) %>%
  ggplot(aes(clt, cand, fill = cand)) +
  geom_density_ridges() +
  coord_cartesian(xlim = c(0, 4)) +
  scale_fill_brewer(palette = "Blues") +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(x = "",
       y = "",
       title = "Who's winning the popular vote")
## Picking joint bandwidth of 0.0751