VISUALIZING DATA VALUES

Fuel Economy Data Set

As an illustration, we use the mpg data set that ships with the ggplot2 package.

library(tidyverse)
data(mpg)
mpg
## # A tibble: 234 x 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto~ f        18    29 p     comp~
##  2 audi         a4           1.8  1999     4 manu~ f        21    29 p     comp~
##  3 audi         a4           2    2008     4 manu~ f        20    31 p     comp~
##  4 audi         a4           2    2008     4 auto~ f        21    30 p     comp~
##  5 audi         a4           2.8  1999     6 auto~ f        16    26 p     comp~
##  6 audi         a4           2.8  1999     6 manu~ f        18    26 p     comp~
##  7 audi         a4           3.1  2008     6 auto~ f        18    27 p     comp~
##  8 audi         a4 quattro   1.8  1999     4 manu~ 4        18    26 p     comp~
##  9 audi         a4 quattro   1.8  1999     4 auto~ 4        16    25 p     comp~
## 10 audi         a4 quattro   2    2008     4 manu~ 4        20    28 p     comp~
## # ... with 224 more rows

We can use the help() function to read the description of this data set.

help(mpg)

Description of mpg data set

The variables in this data set are as follows:

  • manufacturer : manufacturer name
  • model : model name
  • displ : engine displacement, in liters
  • year : year of manufacture
  • cyl : number of cylinders
  • trans : type of transmission
  • drv : type of drive train, where f = front-wheel drive, r = rear-wheel drive, 4 = 4wd
  • cty : city miles per gallon
  • hwy : highway miles per gallon
  • fl : fuel type
  • class : type of car

Suppose we want to know how many cars there are of each type. We can display the counts as a bar plot:

ggplot(data=mpg) +
  geom_bar(mapping=aes(x=class))
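
The bar heights in this plot are simply the number of rows per class. As a quick check, and assuming only that dplyr is available (it is part of the tidyverse loaded earlier), the same numbers can be tabulated directly. The name class_counts is chosen here for illustration.

```r
library(ggplot2)  # provides the mpg data set
library(dplyr)

# One row per vehicle class, with n = number of cars in that class,
# sorted from the most to the least frequent class.
class_counts <- mpg %>%
  count(class, sort = TRUE)

class_counts
```

The counts add up to the number of rows in mpg, so the table and the bar plot describe the same information.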

Bar plots are usually ordered from the highest frequency to the lowest. In addition, we can color the bars through the fill aesthetic and set the border color of each bar with the color argument, as in the following example.

mpg %>%
  count(class) %>%
  mutate(class = fct_reorder(class, n, .desc = TRUE)) %>%  # order classes by frequency
  ggplot(aes(x = class, y = n, fill = class)) +
  geom_bar(stat = "identity", color = "black")             # color sets the bar border

Bar plots are used to present categorical data. For numeric data, many kinds of visualization can show the data values, including dot plots, histograms, density plots, and so on. The dot plot below shows highway mileage, and the violin plot after it shows the distribution of city mileage within each class.

ggplot(mpg, aes(x = hwy)) +
  geom_dotplot(dotsize=0.4) +
  scale_y_continuous(NULL, breaks = NULL) +
  labs(x="Highway Miles per Gallon")
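
The histogram and density plot named in the paragraph above can be sketched in the same way for hwy. This is a minimal sketch: the bin width of 2 and the fill color are illustrative choices, and the object names p_hist and p_dens are introduced here only for the example.

```r
library(ggplot2)

# Histogram: number of cars in each 2-mpg-wide bin.
p_hist <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2, fill = "steelblue", color = "white") +
  labs(x = "Highway Miles per Gallon", y = "Number of cars")

# Density plot: a smoothed estimate of the same distribution.
p_dens <- ggplot(mpg, aes(x = hwy)) +
  geom_density(fill = "steelblue", alpha = 0.4) +
  labs(x = "Highway Miles per Gallon", y = "Density")

p_hist
p_dens
```

A smaller bin width shows more detail but also more noise, so it is worth trying a few values before settling on one.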

g <- ggplot(mpg, aes(class, cty))
g + geom_violin() + 
  labs(title="Violin plot", 
       subtitle="City Mileage vs Class of vehicle",
       caption="Source: mpg",
       x="Class of Vehicle",
       y="City Mileage")

Email Campaign Funnel Data Set

options(scipen = 999)  # turn off scientific notation such as 1e+40
# Read data
email_campaign_funnel <- read_csv("https://raw.githubusercontent.com/selva86/datasets/master/email_campaign_funnel.csv")
## Rows: 42 Columns: 3
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (2): Stage, Gender
## dbl (1): Users
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
email_campaign_funnel
## # A tibble: 42 x 3
##    Stage                                  Gender      Users
##    <chr>                                  <chr>       <dbl>
##  1 Stage 01: Browsers                     Male   -14927619.
##  2 Stage 02: Unbounced Users              Male   -12862663.
##  3 Stage 03: Email Signups                Male   -11361896.
##  4 Stage 04: Email Confirmed              Male    -9411708.
##  5 Stage 05: Campaign-Email Opens         Male    -8074317.
##  6 Stage 06: Campaign-Email Clickthroughs Male    -6958512.
##  7 Stage 07: Buy Button Page              Male    -6045363.
##  8 Stage 08: Buy Button Clickers          Male    -5029954.
##  9 Stage 09: Cart Confirmation Page       Male    -4008034.
## 10 Stage 10: Address Verification Page    Male    -3172555.
## # ... with 32 more rows
library(ggthemes)


# Breaks and labels for the Users axis (drawn horizontally after coord_flip())
brks <- seq(-15000000, 15000000, 5000000)
lbls <- paste0(as.character(c(seq(15, 0, -5), seq(5, 15, 5))), "m")

# Plot
ggplot(email_campaign_funnel, aes(x = Stage, y = Users, fill = Gender)) +  # fill by gender
  geom_bar(stat = "identity", width = .6) +           # draw the bars
  scale_y_continuous(breaks = brks, labels = lbls) +  # breaks and labels
  coord_flip() +                                      # flip axes
  labs(title = "Email Campaign Funnel") +
  theme_tufte() +                                     # Tufte theme from ggthemes
  theme(plot.title = element_text(hjust = .5),        # centre plot title
        axis.ticks = element_blank()) +               # hide axis ticks
  scale_fill_brewer(palette = "Dark2")                # color palette

VISUALIZING PROPORTIONS & COMPOSITION

Pie Chart

library(scales)                                      # automatically determining breaks/labels 
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
# Data preparation
plotdata <- mpg %>%
  count(class) %>%
  arrange(desc(class)) %>%
  mutate(prop = round(n*100/sum(n), 1),
         lab.ypos = cumsum(prop) - 0.5*prop)
# Create Pie chart
mycols <- 2:8
ggplot(plotdata, aes(x = "", y = prop, fill = class)) +
  geom_bar(width = 1, stat = "identity", color = "white") +
  coord_polar("y", start = 0)+
  geom_text(aes(y = lab.ypos, label = prop), color = "white")+
  scale_fill_manual(values = mycols) +
  theme_void()

Alternatively, the same data can be shown as the following donut chart.

ggplot(plotdata, aes(x = 2, y = prop, fill = class)) +
  geom_bar(stat = "identity", color = "white") +
  coord_polar(theta = "y", start = 0)+
  geom_text(aes(y = lab.ypos, label = prop), color = "white")+
  scale_fill_manual(values = mycols) +
  theme_void()+
  xlim(0.5, 2.5)

Stacked Bar & Density Chart

In some cases we need a chart that combines several variables to convey information about composition. Several approaches are available, including stacked bar plots, stacked histograms, pyramids, and so on.

For example, suppose we want to show the drive train types found in each class of car. First, we can use a stacked bar chart, as in the following example.

ggplot(mpg) + 
  # position = "fill" scales every bar to 100%
  geom_bar(aes(x = class, fill = drv), position = "fill") +
  # a single fill scale sets colors, legend title, and labels together;
  # a second scale_fill_*() call would replace the first
  scale_fill_manual(name = "Drive Train", values = 2:4,
                    labels = c("4wd", "front-wheel", "rear-wheel")) +
  labs(title = "Drive Train Types on Different Cars", y = "proportion")

Another way to show this is the segmented bar chart below.

library(scales)
# create a summary data set (data manipulation)
plotdata <- mpg %>%
  group_by(class, drv) %>%
  dplyr::summarize(n = n(), .groups = "drop_last") %>%  # stay grouped by class
  mutate(pct = n/sum(n),                                # share within each class
         lbl = scales::percent(pct))
# create segmented bar chart
# adding labels to each segment
ggplot(plotdata, 
       aes(x = factor(class),
           y = pct,
           fill = factor(drv))) +
  geom_bar(stat = "identity",
           position = "fill") +
  scale_y_continuous(breaks = seq(0, 1, .2), 
                     labels = percent) +
  geom_text(aes(label = lbl),
            size = 3,
            position = position_stack(vjust = 0.5)) +
  # one fill scale only: a later scale_fill_*() call would replace it
  scale_fill_brewer(palette = "Set2", name = "Drive Train",
                    labels = c("4wd", "front-wheel", "rear-wheel")) +
  labs(y = "Percent",
       x = "Class",
       title = "Automobile Drive by Class") +
  theme_minimal()                                    # use a minimal theme

# stacked density chart; adjust must be passed only once
ggplot(data = mpg, aes(x = displ, group = class, fill = class)) +
  geom_density(adjust = 1.5, position = "fill", alpha = .4)
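
The stacked histogram mentioned at the start of this section follows the same idea. A minimal sketch, assuming hwy split by drv as the variables of interest (this pairing and the bin width of 2 are illustrative choices, and p_stack is a name introduced here):

```r
library(ggplot2)

# Stacked histogram: within each 2-mpg bin the bar is split by drive
# train, so the total bar height is still the overall count in that bin.
p_stack <- ggplot(mpg, aes(x = hwy, fill = drv)) +
  geom_histogram(binwidth = 2, position = "stack", color = "white") +
  labs(x = "Highway Miles per Gallon", y = "Number of cars",
       fill = "Drive Train")

p_stack
```

Replacing position = "stack" with position = "fill" would show proportions per bin instead of counts, matching the 100% bar charts above.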

Mosaic Plots & Treemaps

According to Wilke (2019), a mosaic plot looks similar to a stacked bar chart at first glance, but in a mosaic plot both the widths and the heights of the tiles vary, in line with the frequency of each category. A treemap in turn looks similar to a mosaic plot. The difference is that a treemap shows subcategories nested inside other categories.

Mosaic Plot Illustration

df_bin=data.frame(Age=c('old','old','old','old','young','young','young','young'),
                    Favorite=c(rep('bubble gum',2),rep('coffee',2),rep('bubble gum',2),rep('coffee',2)),
                    Music=c(rep(c('classical','rock'),4)),
                    Freq=c(1,1,3,1,2,5,1,0))
df_unbin = data.frame(Age =c(rep("old",6), rep("young", 8)), 
                      Favorite = c(rep("bubble gum", 2),rep("coffee", 4), rep("bubble gum", 7), "coffee"),
                      Music = c("classical", "rock", rep("classical", 3), "rock", rep("classical", 2), rep("rock", 5), "classical"))
#install.packages("ggmosaic")
library(ggmosaic)
ggplot(data = df_unbin)+ 
  geom_mosaic(aes(x = product(Music, Age), fill = Music))+
  labs(x = "Age", y = "Music", title = "Mosaic Plot: Splitting on Age, then Music")+
  theme(plot.title = element_text(hjust = 0.5))
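
The two data frames above hold the same 14 observations in different shapes: df_bin stores one row per combination with a count in Freq, while df_unbin stores one row per observation. As a sketch of how one could be derived from the other, tidyr::uncount() expands the binned form. The name df_long is introduced here for illustration, and df_bin is rebuilt inside the block so that it runs on its own.

```r
library(tidyr)

# Binned version of the toy data: one row per combination, with a count.
df_bin <- data.frame(
  Age      = rep(c("old", "young"), each = 4),
  Favorite = rep(rep(c("bubble gum", "coffee"), each = 2), times = 2),
  Music    = rep(c("classical", "rock"), times = 4),
  Freq     = c(1, 1, 3, 1, 2, 5, 1, 0)
)

# uncount() repeats each row Freq times and drops the Freq column,
# giving one row per observation (rows with Freq = 0 disappear).
df_long <- uncount(df_bin, weights = Freq)

nrow(df_long)
```

The expanded data frame can be passed to geom_mosaic() in place of df_unbin.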

Treemap Illustration

#install.packages("treemap")
library(treemap)

# Load population data set from GitHub
df <- read.table("https://raw.githubusercontent.com/holtzy/data_to_viz/master/Example_dataset/11_SevCatOneNumNestedOneObsPerGroup.csv", header = TRUE, sep = ";")
df[which(df$value == -1), "value"] <- 1  # replace values of -1 with 1
colnames(df) <- c("Continent", "Region", "Country", "Population")
# type = "index": colors are determined by the index variables, so different
# branches of the hierarchical tree get different colors
treemap(df,                      # data frame
        index = c("Continent"),  # categorical hierarchy variable, Continent in this case
        vSize = "Population",    # quantitative variable, Population in this case
        type = "index",
        title = "Treemap: Population by Continents")

treemap(df,                                          # data frame
        index = c("Continent", "Region", "Country"), # categorical variables, from highest to lowest level of the hierarchy
        vSize = "Population",                        # quantitative variable
        type = "index",
        # Labels
        fontsize.labels = c(15, 8, 5),               # label size per level: group, subgroup, sub-subgroup
        fontcolor.labels = c("black", "orange", "white"),  # label colors
        fontface.labels = c(2, 1, 1),                # label font: 1 = normal, 2 = bold, 3 = italic, 4 = bold-italic
        bg.labels = c("transparent"),                # label background color
        align.labels = list(
          c("center", "center"), 
          c("right", "bottom"),
          c("left", "top")),                         # label placement within the rectangle, per level
        # overlap.labels: tolerance (0 to 1) for overlap between labels.
        # 0: lower-level labels are not printed if higher-level labels overlap;
        # 1: labels are always printed; in between (default 0.5): lower-level
        # labels are printed if other labels cover at most that share of their area.
        overlap.labels = 0.5,
        inflate.labels = FALSE,
        title = "Treemaps: Population by Continents, Regions and Countries")

VISUALIZING RELATIONSHIPS BETWEEN VARIABLES

As an illustration, we use data taken from Yau (2011).

crime<-read_csv('http://datasets.flowingdata.com/crimeRatesByState2005.csv')
## Rows: 52 Columns: 9
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): state
## dbl (8): murder, forcible_rape, robbery, aggravated_assault, burglary, larce...
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
crime
## # A tibble: 52 x 9
##    state    murder forcible_rape robbery aggravated_assa~ burglary larceny_theft
##    <chr>     <dbl>         <dbl>   <dbl>            <dbl>    <dbl>         <dbl>
##  1 United ~    5.6          31.7   141.              291.     727.         2286.
##  2 Alabama     8.2          34.3   141.              248.     954.         2650 
##  3 Alaska      4.8          81.1    80.9             465.     622.         2599.
##  4 Arizona     7.5          33.8   144.              327.     948.         2965.
##  5 Arkansas    6.7          42.9    91.1             387.    1085.         2711.
##  6 Califor~    6.9          26     176.              317.     693.         1916.
##  7 Colorado    3.7          43.4    84.6             265.     745.         2735.
##  8 Connect~    2.9          20     113               139.     437.         1824.
##  9 Delaware    4.4          44.7   155.              428.     689.         2144 
## 10 Distric~   35.4          30.2   672.              721.     650.         2695.
## # ... with 42 more rows, and 2 more variables: motor_vehicle_theft <dbl>,
## #   population <dbl>

Suppose we want to examine the relationship between murder and burglary rates. We can use a scatterplot, as in the following example.

ggplot(crime ) +
  geom_point(aes(x=murder, y=burglary))

The plot suggests a positive relationship between the two variables. However, the pattern is hard to see because of outlying observations.

To focus on the bulk of the data, we can drop the District of Columbia and the United States aggregate row, as in the following example.

crime %>% subset(state!="District of Columbia") %>%
  subset(state!="United States") %>%
  ggplot() +
  geom_point(aes(x=murder, y=burglary))

To help reveal the pattern in the data, we can also add a smoother with the geom_smooth() function.

crime %>% subset(state!="District of Columbia") %>%
  subset(state!="United States") %>%
  ggplot(aes(x=murder, y=burglary)) +
  geom_point() +
  geom_smooth(method='loess')
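
geom_smooth() is not limited to loess. As a sketch on the built-in mpg data (used here so that the block runs without downloading the crime file; the choice of displ and hwy is only for illustration), a straight-line fit with method = "lm" can be overlaid on the loess curve for comparison. The name p_smooth is introduced here for the example.

```r
library(ggplot2)

# Scatterplot with a loess curve (flexible) and a linear fit (rigid)
# drawn on top of each other for comparison.
p_smooth <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE, color = "#00AFBB") +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "#E7B800")

p_smooth
```

Where the two curves diverge, a straight line is probably too simple a summary of the relationship.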

To examine the relationships among several variables at once, we can proceed as follows.

library(GGally)
crime %>% subset(state!="District of Columbia") %>%
  subset(state!="United States") %>%
  ggpairs(columns=2:8,lower = list(continuous = "smooth"))

crime %>% subset(state!="District of Columbia") %>%
  subset(state!="United States") %>%
  ggcorr(label = TRUE, label_round = 2, size=2)

VISUALIZING TIME SERIES DATA & TRENDS

The following illustration is taken from Indra (2018) and uses the economics data set that ships with the ggplot2 package.

head(economics)
## # A tibble: 6 x 6
##   date         pce    pop psavert uempmed unemploy
##   <date>     <dbl>  <dbl>   <dbl>   <dbl>    <dbl>
## 1 1967-07-01  507. 198712    12.6     4.5     2944
## 2 1967-08-01  510. 198911    12.6     4.7     2945
## 3 1967-09-01  516. 199113    11.9     4.6     2958
## 4 1967-10-01  512. 199311    12.9     4.9     3143
## 5 1967-11-01  517. 199498    12.8     4.7     3066
## 6 1967-12-01  525. 199657    11.8     4.8     3018

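In current releases of ggplot2 this data set contains 574 monthly observations of 6 variables:

  • date : month of data collection
  • pce : personal consumption expenditures, in billions of dollars
  • pop : total population, in thousands
  • psavert : personal savings rate
  • uempmed : median duration of unemployment, in weeks
  • unemploy : number of unemployed, in thousands

Patterns in time series data can be explored with a line plot, using the geom_line() function.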

ggplot(data = economics, aes(x = date, y = pop))+ geom_line(color = "#00AFBB", size = 1) +
xlab("Date of data")  + ylab("Total population")

# Base plot with date axis
p <- ggplot(economics, aes(x = date, y = psavert)) + 
     geom_line(color = "#00AFBB", size = 1)
p

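# Set axis limits c(min, max); NA leaves the upper limit open
# (names other than min/max avoid masking the base R functions)
min_date <- as.Date("2002-1-1")
max_date <- NA
p + scale_x_date(limits = c(min_date, max_date))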
## Warning: Removed 414 row(s) containing missing values (geom_path).
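
The warning appears because scale limits discard the observations that fall outside them at plotting time. One alternative, sketched here under the assumption that only the later period is of interest, is to filter the data before plotting. The names recent and p_recent are introduced for illustration.

```r
library(ggplot2)
library(dplyr)

# Keep only observations from January 2002 onwards, so nothing has to
# be discarded (and no warning is raised) at plotting time.
recent <- economics %>%
  filter(date >= as.Date("2002-01-01"))

p_recent <- ggplot(recent, aes(x = date, y = psavert)) +
  geom_line(color = "#00AFBB")

p_recent
```

Filtering makes the subsetting explicit in the data pipeline, which is easier to audit than a silent row removal inside the scale.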

Suppose we want to compare the time series patterns of psavert and uempmed. We first reshape the data set so that both series sit in the same column.

data <- economics %>%
  select(date, psavert, uempmed) %>% gather(key = "variable", value = "value", -date)
# Multiple line plot
ggplot(data, aes(x = date, y = value)) + geom_line(aes(color = variable), size = 1) +
  scale_color_manual(values = c("#00AFBB", "#E7B800")) +  theme_minimal()

# Area plot
ggplot(data, aes(x = date, y = value)) + 
geom_area(aes(color = variable, fill = variable), alpha = 0.5, position = position_dodge(0.8)) +
  scale_color_manual(values = c("#00AFBB", "#E7B800")) + 
  scale_fill_manual(values = c("#00AFBB", "#E7B800"))
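
The reshaping step above uses gather(), which tidyr has superseded in favor of pivot_longer(). A minimal sketch of the equivalent call follows; the name long_data is introduced here for illustration.

```r
library(ggplot2)  # provides the economics data set
library(dplyr)
library(tidyr)

# Reshape from wide (one column per series) to long (one row per
# date-series pair), as gather() does in the code above.
long_data <- economics %>%
  select(date, psavert, uempmed) %>%
  pivot_longer(cols = -date, names_to = "variable", values_to = "value")

head(long_data)
```

The result has the same three columns (date, variable, value), so it can be passed to the line and area plots unchanged. Only the row order differs: pivot_longer() keeps the two series of each date together.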

Beyond the illustrations above, time series data can also be visualized with dumbbell plots and slope graphs, which are covered in Kabacoff (2018).

References

[CC19] Community contributions for EDAV fall 2019. (2019, December 13). https://jtr13.github.io/cc19/

Holtz, Y. (n.d.). The R Graph Gallery. https://www.r-graph-gallery.com/

Indra. (2018, May 28). Plot time series data using GGPlot. Kaggle: Your Machine Learning and Data Science Community. https://www.kaggle.com/indra90/plot-time-series-data-using-ggplot

Kabacoff, R. (2018, September 3). Data visualization with R · GitHub Pages. https://rkabacoff.github.io/datavis/Time.html#dummbbell-charts

Kliegman, J., Dombrowski, M. P., & Kavalar, M. (2019, May 2). Best ggplot visualizations. Nextjournal. https://nextjournal.com/jk/best-ggplot

Prabhakaran, S. (n.d.). Top 50 ggplot2 visualizations - The master list (With full R code). Tutorials on Advanced Stats and Machine Learning With R. https://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html

Wilke, C. O. (2019). Fundamentals of data visualization. Claus O. Wilke. https://clauswilke.com/dataviz/nested-proportions.html

Yau, N. (2011). Visualize this: the FlowingData guide to design, visualization, and statistics. John Wiley & Sons.


  1. Department of Statistics, IPB University