Visualization Best Practice in R

1. Proportions of a whole
2. Point data
3. Single distributions
4. Comparing distributions

This report is a summary of the lesson by Nicholas Strayer, DataCamp

1. Proportions of a whole

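Pie charts : useful for showing proportions within a single group, but hard to read accurately
Waffle charts : useful for proportions across three or more groups, and more accurate to read than a pie chart
Stacked bars : share a common x axis, which makes comparison between groups easier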

Pie chart

disease_counts <- who_disease %>% 
  mutate(disease = ifelse(disease %in% c("measles", "mumps"), disease, "others")) %>% 
  group_by(disease) %>% 
  summarise(total_cases = sum(cases))

ggplot(disease_counts, aes(x = 1, y = total_cases, fill = disease)) +
  geom_col() +
  coord_polar(theta = "y") +
  theme_void() +
  ggtitle("Proportion of diseases using Pie chart")

Waffle chart

## the percentages have to be computed separately
disease_counts <- who_disease %>% 
  group_by(disease) %>% 
  summarise(total_case = sum(cases)) %>% 
  mutate(percent = round(total_case / sum(total_case) * 100))

## build a named vector of percentages
case_counts <- disease_counts$percent
names(case_counts) <- disease_counts$disease

## waffle() from the waffle package
waffle(case_counts, title = "Proportion of disease using waffle chart")
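
As a minimal sketch of the named-vector step, the snippet below uses a small made-up table (the counts are hypothetical, not taken from who_disease) and base R only. It also checks whether the rounded percentages still sum to 100, which rounding does not guarantee in general.

```r
# Hypothetical totals; not real who_disease values
toy_counts <- data.frame(
  disease = c("measles", "mumps", "other"),
  total_case = c(620, 130, 250)
)

# Whole-number percentages, one per group
toy_counts$percent <- round(toy_counts$total_case / sum(toy_counts$total_case) * 100)

# Named numeric vector: the names label the groups, the values set the number of squares
toy_vec <- setNames(toy_counts$percent, toy_counts$disease)
toy_vec

# Rounding can push the total away from 100, so it is worth checking
sum(toy_vec)
```

If the total drifts from 100 on real data, the waffle grid will have a few squares too many or too few, so the check is cheap insurance before plotting.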

Stacked bars

disease_counts <- who_disease %>% 
  mutate(
    disease = ifelse(disease %in% c("measles", "mumps"), disease, "other") %>% 
    ## ordering stack for readability
    factor(levels = c("measles", "other", "mumps"))
  ) %>% 
  group_by(disease, year) %>% 
  summarise(total_cases = sum(cases))

ggplot(disease_counts, aes(year, total_cases, fill = disease)) +
  geom_col(position = "fill") +
  ggtitle("Proportion of disease using Stacked bar")
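
To make explicit what position = "fill" does to each bar, here is a rough base R sketch on made-up yearly totals (the data frame toy and its numbers are assumptions for illustration). Each value is divided by the total of its year, so every bar ends up with height 1.

```r
# Hypothetical yearly totals (made-up numbers)
toy <- data.frame(
  year = c(2000, 2000, 2001, 2001),
  disease = c("measles", "mumps", "measles", "mumps"),
  total_cases = c(30, 10, 20, 20)
)

# Share of each disease within its year: value / sum of that year's values
toy$share <- toy$total_cases / ave(toy$total_cases, toy$year, FUN = sum)
toy

# Every year's shares add up to 1
tapply(toy$share, toy$year, sum)
```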

2. Point data

Bars and dots

Bar charts are a useful visualization for most kinds of data, but they come with a caveat. Stacking rule: use them only for data that have some sort of accumulating property. They are therefore a poor fit for data such as profit or the probability of failure.

geom_col : the y aesthetic is mapped explicitly
geom_bar : y defaults to the row count, so mapping x alone is enough
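
The difference can be sketched without plotting. The snippet below, using a made-up records data frame (hypothetical, not who_disease), computes the per-category row count that geom_bar tallies by default; the resulting table is the kind of pre-aggregated input geom_col expects.

```r
# Hypothetical raw records, one row per observation
records <- data.frame(region = c("AFR", "AFR", "EUR", "AFR", "EUR", "SEAR"))

# What geom_bar() tallies by default: number of rows per x value
counts <- as.data.frame(table(region = records$region), responseName = "n")
counts
```

Passing counts to geom_col() with aes(x = region, y = n) should draw the same bars that geom_bar() draws from records with aes(x = region).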

who_disease %>% 
  filter(country == "India", year == 1980) %>% 
  ggplot(aes(x = disease, y = cases)) +
  geom_col() +
  ggtitle("Using geom_col")

who_disease %>% 
  filter(cases > 1000) %>% 
  ggplot(aes(x = region)) +
  geom_bar() +
  ggtitle("Using geom_bar")

Point charts

Suited to data that cannot be stacked
- percentiles, ratios, or sensor readings like temperature
- non-linear transformations (log, square root, exponentiation)

interestingCountries <- c("NGA", "SDN", "FRA", "NPL", "MYS", "TZA", "YEM", "UKR", "BGD", "VNM")

who_subset <- who_disease %>% 
  filter(countryCode %in% interestingCountries,
         disease == "measles",
         year %in% c(1992, 2002)
  ) %>% 
  mutate(year = paste0("cases_", year)) %>% 
  arrange(year, region) %>% 
  pivot_wider(names_from = year, values_from = cases)

## reorder() sorts the countries by cases_1992
ggplot(who_subset, aes(x = log10(cases_1992), y = reorder(country, cases_1992))) +
  geom_point() +
  ggtitle("Ordered point chart of cases_1992")

Adding visual anchors

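A log transformation is used to measure the rate of change between cases_1992 and cases_2002.
The plain ratio cases_2002 / cases_1992 is hard to compare visually because increases and decreases are asymmetric: a doubling gives 2, while a halving gives 0.5.

Applying log2 makes the values symmetric:

+1 means the count doubled
-1 means the count halved
0 means no change (a good reference point for the plot)
A log2 value above 0 is an increase and below 0 a decrease, which makes the data easier to read at a glance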
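
A tiny numeric check of that symmetry, using made-up before/after counts (the vectors before and after are hypothetical):

```r
# Hypothetical before/after case counts
before <- c(100, 100, 100)
after  <- c(200,  50, 100)

# Plain ratio: 2, 0.5, 1 (doubling and halving sit at unequal distances from 1)
ratio <- after / before
ratio

# log2 fold change: 1, -1, 0 (doubling and halving are equally far from 0)
log2(ratio)
```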

who_subset %>% 
  ## log2 transformation
  mutate(logFoldChange = log2(cases_2002 / cases_1992)) %>% 
  ggplot(aes(x = logFoldChange, y = reorder(country, logFoldChange))) +
  geom_point() +
  ## reference line at 0
  geom_vline(xintercept = 0) +
  ggtitle("log2 transformation 1992 vs. 2002")

who_subset %>% 
  mutate(logFoldChange = log2(cases_2002 / cases_1992)) %>% 
  ggplot(aes(x = logFoldChange, y = reorder(country, logFoldChange))) +
  geom_point() +
  geom_vline(xintercept = 0) +
  xlim(-6, 6) +
  facet_grid(region ~ ., scales = "free_y") +
  ggtitle("log2 transformation 1992 vs. 2002 plus facet_grid")

3. Single distributions

Standard plots

histogram : a single distribution
boxplot : several distributions

ggplot(diamonds, aes(x = carat)) +
  geom_histogram(alpha = 0.7) +
  theme_minimal()

ggplot(diamonds) +
  geom_histogram(aes(x = carat, y = after_stat(density)), 
                 alpha = 0.8) +
  theme_minimal()

Histogram nuances

Bin number best practices

If the data have more than 150 points (length(data$x) > 150), use bins = 100
Otherwise, play around with the bin number to get a good sense of the data

digit preference

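Digit preference is the tendency of people to favor certain numbers. It shows up mainly in data where values are rounded or where particular digits are recorded more often than others.
Blood pressure is a typical case: people prefer round values such as 120, 130, and 140, so a reading of 127 is often written down as 130.
As a result the data pile up on particular values more than they really should, which can distort a histogram.
Automatically collected data store the values as measured, so they give a much more accurate distribution than hand-recorded data.

The bin number has to be tuned to the data: too many bins slice the data so finely that the pattern is hard to see, while too few hide important patterns and can show a distorted distribution.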
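
The effect of digit preference can be simulated. The sketch below is only an illustration under stated assumptions: the readings are random draws (not real blood pressure data), and the hypothetical recorder always rounds to the nearest 10.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Hypothetical "true" systolic readings
true_bp <- rnorm(500, mean = 125, sd = 12)

# A recorder who always rounds to the nearest 10
recorded_bp <- round(true_bp / 10) * 10

# Same breaks for both series, so the two histograms are comparable bin by bin
breaks <- seq(floor(min(true_bp)) - 10, ceiling(max(true_bp)) + 10, by = 2)
h_true <- hist(true_bp, breaks = breaks, plot = FALSE)
h_rec  <- hist(recorded_bp, breaks = breaks, plot = FALSE)

# Number of occupied bins: the rounded series collapses onto a few spikes
sum(h_true$counts > 0)
sum(h_rec$counts > 0)
```

With narrow bins the rounded series shows isolated spikes at multiples of 10, while bins of width 10 would hide the problem entirely, which is one reason to try several bin settings.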

The kernel density estimator(KDE)

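An alternative to the histogram that removes the dependence on bins and binwidth.

In a KDE, each data point gets its own kernel function, and the kernels are combined to draw the overall distribution.
The kernel function can take many shapes; with a uniform kernel the result is essentially the familiar histogram.
The most commonly used one, however, is the Gaussian (normal) kernel.

geom_density makes this easy to draw. The parameter to think about is bw (standard dev. of kernel), the standard deviation of the kernel placed on each data point.
bw = "nrd0" selects it automatically.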
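
To show the idea rather than the internals of geom_density, here is a hand-rolled Gaussian KDE in base R on a made-up sample (x, grid, and the step size are assumptions for illustration). The estimate at each grid point is the average of one Gaussian bump per data point.

```r
# Hypothetical sample
x <- c(0.3, 0.31, 0.7, 1.0, 1.01, 1.5)
bw <- 0.08  # kernel standard deviation, matching the value used in these notes

# Evaluate the estimate on a grid: average of the Gaussian kernels centered on each point
step <- 0.001
grid <- seq(-1, 3, by = step)
kde <- sapply(grid, function(g) mean(dnorm(g, mean = x, sd = bw)))

# A density should integrate to about 1 (Riemann sum over the grid)
sum(kde) * step

# The rule-of-thumb bandwidth that bw = "nrd0" refers to
bw.nrd0(x)
```

A smaller bw makes each bump narrower and the curve spikier; a larger bw smooths neighboring points together.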

## x is mapped in ggplot() so that geom_rug() also receives it
ggplot(diamonds, aes(x = carat)) +
  geom_density(bw = 0.08, fill = "steelblue", alpha = 0.7) +
  geom_rug() +
  labs(title = "Gaussian kernel SD = 0.08")

4. Comparing distributions

Box plot

geom_boxplot : pairs well with geom_jitter, which also shows how many data points sit behind each box

diamonds %>% 
  filter(color == "E", cut %in% c("Fair", "Good")) %>% 
  ggplot(aes(x = cut, y = carat)) +
  geom_jitter(alpha = 0.3, color = "steelblue") +
  geom_boxplot(alpha = 0) +
  labs(title = "Carat of color E by cut")

diamonds %>% 
  filter(cut %in% c("Fair", "Good")) %>% 
  ggplot(aes(x = cut, y = carat)) +
  geom_jitter(alpha = 0.3, color = "steelblue") +
  geom_boxplot(alpha = 0) +
  facet_wrap(vars(color)) +
  coord_cartesian(ylim = c(0, 3)) +
  labs(title = "Carat of different color by cut")

Beeswarms and violins

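geom_beeswarm() in the ggbeeswarm package
- An upgrade of geom_jitter: unlike geom_jitter, the points do not overlap but are stacked
- Not suitable when there are too many data points
- Arbitrary stacking : the way the points stack is somewhat arbitrary, so the result needs checking
geom_violin
- Also suitable for large data
- Needs a bw setting, just like the KDE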

diamonds %>% 
  filter(color == "E", cut %in% c("Fair", "Good")) %>% 
  ggplot(aes(x = cut, y = carat)) +
  geom_beeswarm(alpha = 0.5, cex = 0.3, width = 1) +
  geom_boxplot(alpha = 0) +
  labs(title = "Using geom_beeswarm")

diamonds %>% 
  filter(color == "E", cut %in% c("Fair", "Good")) %>% 
  ggplot(aes(x = cut, y = carat)) +
  geom_violin(bw = 0.08) +
  geom_boxplot(alpha = 0, width = 0.3) +
  geom_point(alpha = 0.3, size = 0.5) +
  ggtitle("Using geom_violin")

diamonds %>% 
  filter(cut %in% c("Fair", "Good")) %>% 
  ggplot(aes(x = cut, y = carat)) +
  geom_violin(bw = 0.08, alpha = 0.3, fill = "steelblue") +
  geom_boxplot(alpha = 0, width = 0.3) +
  facet_wrap(vars(color)) +
  ggtitle("Using geom_violin with facet_wrap")

Spatially-related distributions

spatially connected axes: cases where the categories have an ordinal ordering, e.g. months of the year: Jan < Feb < Mar < …
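
For the axis to follow that ordering, the category has to carry it. A minimal base R sketch with made-up month labels (the vector m is hypothetical):

```r
# Hypothetical month labels in arbitrary data order
m <- c("Mar", "Jan", "Feb", "Jan")

# Without explicit levels, factor() sorts the labels alphabetically
levels(factor(m))

# Supplying levels keeps calendar order, which is what the category axis should follow
m_ord <- factor(m, levels = month.abb, ordered = TRUE)
levels(droplevels(m_ord))

# Ordered factors support comparisons: is "Mar" after "Jan"?
m_ord[1] > m_ord[2]
```

In diamonds, cut is already stored as an ordered factor, so no extra step is needed there.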

geom_density_ridges in the ggridges package

diamonds %>% 
  filter(color == "D") %>% 
  ggplot(aes(carat, cut)) +
  geom_point(alpha = 0.2, shape = "|", position = position_nudge(y = -0.05)) +
  geom_density_ridges(bandwidth = 0.08, alpha = 0.7) +
  scale_x_continuous(limits = c(0, 4), expand = c(0, 0)) +
  theme(axis.ticks.y = element_blank()) +
  labs(title = "Gaussian kernel SD = 0.08")
