VISUALIZING DATA VALUES
Fuel Economy Data Set
As an illustration, we use the mpg data set, which is available in the ggplot2 package.
library(tidyverse)
data(mpg)
mpg
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto~ f 18 29 p comp~
## 2 audi a4 1.8 1999 4 manu~ f 21 29 p comp~
## 3 audi a4 2 2008 4 manu~ f 20 31 p comp~
## 4 audi a4 2 2008 4 auto~ f 21 30 p comp~
## 5 audi a4 2.8 1999 6 auto~ f 16 26 p comp~
## 6 audi a4 2.8 1999 6 manu~ f 18 26 p comp~
## 7 audi a4 3.1 2008 6 auto~ f 18 27 p comp~
## 8 audi a4 quattro 1.8 1999 4 manu~ 4 18 26 p comp~
## 9 audi a4 quattro 1.8 1999 4 auto~ 4 16 25 p comp~
## 10 audi a4 quattro 2 2008 4 manu~ 4 20 28 p comp~
## # ... with 224 more rows
We can use the help() function to view the description of this data set.
help(mpg)
Description of mpg data set
The variables in this data set are described below:
- manufacturer : manufacturer name
- model : model name
- displ : engine displacement, in litres
- year : year of manufacture
- cyl : number of cylinders
- trans : type of transmission
- drv : type of drive train, where f = front-wheel drive, r = rear-wheel drive, 4 = 4wd
- cty : city mileage (miles per gallon)
- hwy : highway mileage (miles per gallon)
- fl : fuel type
- class : type of car
Suppose we want to know how many cars there are in each class. We can display the counts as the following bar plot:
ggplot(data=mpg) +
geom_bar(mapping=aes(x=class))
Bar plots are usually displayed in descending order of frequency. In addition, we can color the bars by adding the fill argument and set the border color of each bar with the color argument, as in the following example.
mpg %>%
  count(class) %>%
  mutate(class = fct_reorder(class, n, .desc = TRUE)) %>%
  ggplot(aes(x = class, y = n, fill = class)) +
  geom_bar(stat = 'identity')
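The example above orders the bars and maps fill to class, but it does not set a border color. The following is a minimal sketch of the same idea, assuming that fct_infreq() from forcats (loaded with tidyverse) is an acceptable substitute for the count-and-reorder step. The fill and color values, and the name p_ordered, are chosen here purely for illustration.

```r
library(ggplot2)
library(forcats)

# fct_infreq() orders the levels of class by descending frequency,
# so geom_bar() can count the rows itself.
p_ordered <- ggplot(mpg, aes(x = fct_infreq(class))) +
  geom_bar(fill = "steelblue", color = "black") +  # fill = bar interior, color = bar border
  labs(x = "class", y = "count")

p_ordered
```

Setting fill and color outside aes() applies one fixed value to every bar, whereas fill = class inside aes() assigns a different color to each class.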
Bar plots are used to present categorical data. For numeric data, many kinds of visualization can be used to display data values, including the dot plot, histogram, and density plot.
ggplot(mpg, aes(x = hwy)) +
geom_dotplot(dotsize=0.4) +
scale_y_continuous(NULL, breaks = NULL) +
labs(x="Highway Miles per Gallon")
g <- ggplot(mpg, aes(class, cty))
g + geom_violin() +
  labs(title="Violin plot",
       subtitle="City Mileage vs Class of vehicle",
       caption="Source: mpg",
       x="Class of Vehicle",
       y="City Mileage")
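Two of the other numeric displays mentioned earlier, the histogram and the density plot, are not shown in the examples. A minimal sketch for the same hwy variable follows; the bin width of 2 and the grey fill are illustrative assumptions, not choices made in the original text.

```r
library(ggplot2)

# Histogram: counts of hwy values in bins of width 2 (an illustrative choice)
p_hist <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2, fill = "grey70", color = "white") +
  labs(x = "Highway Miles per Gallon")

# Density plot: a kernel-smoothed estimate of the same distribution
p_dens <- ggplot(mpg, aes(x = hwy)) +
  geom_density(fill = "grey70", alpha = 0.5) +
  labs(x = "Highway Miles per Gallon")

p_hist
p_dens
```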
Email Campaign Funnel Data Set
options(scipen = 999) # turns off scientific notation like 1e+40
# Read data
email_campaign_funnel <- read_csv("https://raw.githubusercontent.com/selva86/datasets/master/email_campaign_funnel.csv")
## Rows: 42 Columns: 3
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (2): Stage, Gender
## dbl (1): Users
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
email_campaign_funnel
## # A tibble: 42 x 3
## Stage Gender Users
## <chr> <chr> <dbl>
## 1 Stage 01: Browsers Male -14927619.
## 2 Stage 02: Unbounced Users Male -12862663.
## 3 Stage 03: Email Signups Male -11361896.
## 4 Stage 04: Email Confirmed Male -9411708.
## 5 Stage 05: Campaign-Email Opens Male -8074317.
## 6 Stage 06: Campaign-Email Clickthroughs Male -6958512.
## 7 Stage 07: Buy Button Page Male -6045363.
## 8 Stage 08: Buy Button Clickers Male -5029954.
## 9 Stage 09: Cart Confirmation Page Male -4008034.
## 10 Stage 10: Address Verification Page Male -3172555.
## # ... with 32 more rows
library(ggthemes)
# X Axis Breaks and Labels
brks <- seq(-15000000, 15000000, 5000000)
lbls <- paste0(as.character(c(seq(15, 0, -5), seq(5, 15, 5))), "m")
# Plot
ggplot(email_campaign_funnel, aes(x = Stage, y = Users, fill = Gender)) + # Fill column
geom_bar(stat = "identity", width = .6) + # draw the bars
scale_y_continuous(breaks = brks, # Breaks
labels = lbls) + # Labels
coord_flip() + # Flip axes
labs(title="Email Campaign Funnel") +
theme_tufte() + # Tufte theme from ggthemes
theme(plot.title = element_text(hjust = .5), # Centre plot title
axis.ticks = element_blank()) + # Remove axis ticks
scale_fill_brewer(palette = "Dark2") # Color palette
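In the printed tibble, the Users values for Male are negative, which is what makes the male bars extend to the left of zero. The axis labels are therefore written without a sign. As a sketch, the same labels could be derived directly from the breaks rather than typed out; this is an alternative construction, not the one used in the original code.

```r
# Breaks as in the example above: -15 million to 15 million in steps of 5 million
brks <- seq(-15000000, 15000000, 5000000)

# Derive the labels from the breaks: drop the sign, express in millions
lbls <- paste0(abs(brks) / 1e6, "m")
lbls
```

Deriving lbls from brks keeps the two vectors the same length if the breaks are changed later.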
VISUALIZING PROPORTIONS & COMPOSITION
Pie Chart
library(scales) # automatically determining breaks/labels
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
# Data preparation
plotdata <- mpg %>%
  count(class) %>%
  arrange(desc(class)) %>%
  mutate(prop = round(n*100/sum(n), 1),
         lab.ypos = cumsum(prop) - 0.5*prop)
# Create Pie chart
mycols <- 2:8
ggplot(plotdata, aes(x = "", y = prop, fill = class)) +
geom_bar(width = 1, stat = "identity", color = "white") +
coord_polar("y", start = 0)+
geom_text(aes(y = lab.ypos, label = prop), color = "white")+
scale_fill_manual(values = mycols) +
theme_void()
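The label positions in lab.ypos are the midpoints of the slices: the cumulative sum gives the end of each slice, and subtracting half the slice size moves back to its centre. A small worked example with hypothetical proportions (not taken from mpg):

```r
# Hypothetical slice sizes in percent (not taken from mpg)
prop <- c(50, 30, 20)

# End of each slice minus half its size = midpoint of the slice
lab.ypos <- cumsum(prop) - 0.5 * prop
lab.ypos
```

Here the slices end at 50, 80, and 100, so the labels sit at 25, 65, and 90. The data preparation step sorts plotdata in descending order of class before cumulating, presumably so that these midpoints line up with the order in which ggplot2 stacks the slices.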
Alternatively, the same data can be displayed as the following donut chart.
ggplot(plotdata, aes(x = 2, y = prop, fill = class)) +
geom_bar(stat = "identity", color = "white") +
coord_polar(theta = "y", start = 0)+
geom_text(aes(y = lab.ypos, label = prop), color = "white")+
scale_fill_manual(values = mycols) +
theme_void()+
xlim(0.5, 2.5)
Stacked Bar & Density Chart
In some cases, we need to plot several variables together to convey information about a composition. Several approaches can be used, including the stacked bar plot, stacked histogram, and pyramid plot.
For example, suppose we want to show the drive train types across different classes of car. First, we can use a stacked bar chart, as in the following example.
ggplot(mpg) +
  # position = "fill" scales each bar to 100%
  geom_bar(aes(x = class, fill = drv), position = "fill") +
  # a single fill scale: a second scale_fill_*() call would replace the first
  scale_fill_manual(values = 2:4, name = "Drive Train",
                    labels = c("4wd", "front-wheel", "rear-wheel")) +
  labs(title = "Drive Train Types on Different Cars", y = "proportion")
Another way to display this is with the following segmented bar chart.
library(scales)
# create a summary dataset (data manipulation)
plotdata <- mpg %>%
  group_by(class, drv) %>%
  dplyr::summarize(n = n()) %>%
  mutate(pct = n/sum(n),
         lbl = scales::percent(pct))
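The percentages in this summary are computed within each class, because summarize() removes only the last grouping variable (drv) and leaves the result grouped by class. The sketch below makes that behaviour explicit with the .groups argument and checks that each class sums to 1. The names pct_by_class and totals are introduced here for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the mpg data

pct_by_class <- mpg %>%
  group_by(class, drv) %>%
  summarize(n = n(), .groups = "drop_last") %>%  # result stays grouped by class
  mutate(pct = n / sum(n))                        # so sum(n) is the class total

# Each class should account for 100% across its drive trains
totals <- pct_by_class %>% summarize(total = sum(pct))
totals
```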
# create segmented bar chart
# adding labels to each segment
ggplot(plotdata,
       aes(x = factor(class),
           y = pct,
           fill = factor(drv))) +
  geom_bar(stat = "identity",
           position = "fill") +
  scale_y_continuous(breaks = seq(0, 1, .2),
                     labels = percent) +
  geom_text(aes(label = lbl),
            size = 3,
            position = position_stack(vjust = 0.5)) +
  # one fill scale carrying the palette, legend title, and labels
  scale_fill_brewer(palette = "Set2", name = "Drive Train",
                    labels = c("4wd", "front-wheel", "rear-wheel")) +
  theme_minimal() + # use a minimal theme
  labs(y = "Percent",
       x = "Class",
       title = "Automobile Drive by Class")
ggplot(data=mpg, aes(x=displ, group=class, fill=class)) +
  geom_density(adjust=1.5, position="fill", alpha=.4)
Mosaic Plots & Treemaps
According to Wilke (2019), a mosaic plot looks at first glance like a stacked bar chart, but in a mosaic plot both the widths and the heights of the tiles vary according to the frequency of each category. A treemap also resembles a mosaic plot. The difference is that a treemap shows subcategories nested within other categories.
Mosaic Plot Illustration
df_bin <- data.frame(Age=c('old','old','old','old','young','young','young','young'),
                     Favorite=c(rep('bubble gum',2),rep('coffee',2),rep('bubble gum',2),rep('coffee',2)),
                     Music=c(rep(c('classical','rock'),4)),
                     Freq=c(1,1,3,1,2,5,1,0))
df_unbin <- data.frame(Age = c(rep("old",6), rep("young", 8)),
                       Favorite = c(rep("bubble gum", 2),rep("coffee", 4), rep("bubble gum", 7), "coffee"),
                       Music = c("classical", "rock", rep("classical", 3), "rock", rep("classical", 2), rep("rock", 5), "classical"))
#install.packages("ggmosaic")
library(ggmosaic)
ggplot(data = df_unbin)+
geom_mosaic(aes(x = product(Music, Age), fill = Music))+
labs(x = "Age", y = "Music", title = "Mosaic Plot: Splitting on Age, then Music")+
theme(plot.title = element_text(hjust = 0.5))
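The mosaic plot above is drawn from df_unbin, which has one row per observation, while df_bin holds the same counts in aggregated form. As a minimal base R sketch, the aggregated table could be expanded into the one-row-per-observation form as follows. The names rows and df_expanded are introduced here, and df_bin is redefined so that the block runs on its own.

```r
df_bin <- data.frame(
  Age      = c(rep("old", 4), rep("young", 4)),
  Favorite = rep(c("bubble gum", "bubble gum", "coffee", "coffee"), 2),
  Music    = rep(c("classical", "rock"), 4),
  Freq     = c(1, 1, 3, 1, 2, 5, 1, 0)
)

# Repeat each row index Freq times, then drop the Freq column
rows <- rep(seq_len(nrow(df_bin)), times = df_bin$Freq)
df_expanded <- df_bin[rows, c("Age", "Favorite", "Music")]
rownames(df_expanded) <- NULL

nrow(df_expanded)                          # one row per observation
table(df_expanded$Age, df_expanded$Music)  # cross-tabulation of Age by Music
```

Rows with a frequency of zero simply disappear in the expanded form, which is why the young, coffee, rock combination has no rows.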
Treemap Illustration
#install.packages("treemap")
library(treemap)
#Load population dataset from Github
df <- read.table("https://raw.githubusercontent.com/holtzy/data_to_viz/master/Example_dataset/11_SevCatOneNumNestedOneObsPerGroup.csv", header=T, sep=";")
df[which(df$value == -1), "value"] <- 1
colnames(df) <- c("Continent", "Region", "Country", "Population")
treemap(df, #dataframe
index=c("Continent"), #categorical hierarchy variable, Continent in this case
vSize = "Population", #quantitative variable, Population in this case
type="index",
title = "Treemap: Population by Continents")#colors are determined by the index variables. Different branches in the hierarchical tree get different colors
treemap(df, #dataframe
index=c("Continent", "Region", "Country"), #categorical variables in the order of highest level of the hierarchy to lowest
vSize = "Population", #quantitative variable
type="index",
# Labels
fontsize.labels=c(15,8,5), # size of labels. Give the size per level of aggregation: size for group, size for subgroup, sub-subgroups...
fontcolor.labels=c("black","orange", "white"), # Color of labels
fontface.labels=c(2,1,1), # Font of labels: 1,2,3,4 for normal, bold, italic, bold-italic...
bg.labels=c("transparent"), # Background color of labels
align.labels=list(
c("center", "center"),
c("right", "bottom"),
c("left", "top")), # Where to place labels in the rectangle?
overlap.labels=0.5, # number between 0 and 1 that determines the tolerance of the overlap between labels. 0 means that labels of lower levels are not printed if higher level labels overlap, 1 means that labels are always printed. In-between values, for instance the default value .5, means that lower level labels are printed if other labels do not overlap with more than .5 times their area size.
inflate.labels=F,
title = "Treemaps: Population by Continents, Regions and Countries")
VISUALIZING RELATIONSHIPS BETWEEN VARIABLES
As an illustration, we use data taken from Yau (2011).
crime <- read_csv('http://datasets.flowingdata.com/crimeRatesByState2005.csv')
## Rows: 52 Columns: 9
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): state
## dbl (8): murder, forcible_rape, robbery, aggravated_assault, burglary, larce...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
crime
## # A tibble: 52 x 9
## state murder forcible_rape robbery aggravated_assa~ burglary larceny_theft
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 United ~ 5.6 31.7 141. 291. 727. 2286.
## 2 Alabama 8.2 34.3 141. 248. 954. 2650
## 3 Alaska 4.8 81.1 80.9 465. 622. 2599.
## 4 Arizona 7.5 33.8 144. 327. 948. 2965.
## 5 Arkansas 6.7 42.9 91.1 387. 1085. 2711.
## 6 Califor~ 6.9 26 176. 317. 693. 1916.
## 7 Colorado 3.7 43.4 84.6 265. 745. 2735.
## 8 Connect~ 2.9 20 113 139. 437. 1824.
## 9 Delaware 4.4 44.7 155. 428. 689. 2144
## 10 Distric~ 35.4 30.2 672. 721. 650. 2695.
## # ... with 42 more rows, and 2 more variables: motor_vehicle_theft <dbl>,
## # population <dbl>
Suppose we want to examine the relationship between the murder and burglary rates. We can use a scatterplot, as in the following example.
ggplot(crime ) +
geom_point(aes(x=murder, y=burglary))
The plot suggests a positive relationship between the two variables. However, the pattern is obscured by outlying observations.
If we then want to focus on only some of the rows, here by excluding the District of Columbia and the United States aggregate, we can do so as in the following example.
crime %>% subset(state!="District of Columbia") %>%
  subset(state!="United States") %>%
  ggplot() +
  geom_point(aes(x=murder, y=burglary))
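The two chained subset() calls can also be written as a single filter() with %in%. The sketch below uses a small toy tibble so that it runs without downloading the data. The values in crime_toy are copied from the rounded printout above and are for illustration only; crime_toy and crime_states are names introduced here.

```r
library(dplyr)

# Toy stand-in for the crime data; values are rounded from the printed tibble
crime_toy <- tibble(
  state    = c("United States", "Alabama", "Alaska", "District of Columbia"),
  murder   = c(5.6, 8.2, 4.8, 35.4),
  burglary = c(727, 954, 622, 650)
)

# Drop the national aggregate and the outlier in one step
crime_states <- crime_toy %>%
  filter(!state %in% c("United States", "District of Columbia"))

crime_states
```

Storing the filtered data under its own name would also avoid repeating the same two exclusions before every plot.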
To help reveal the pattern in the data, we can also add a smoother with the geom_smooth() function.
crime %>% subset(state!="District of Columbia") %>%
  subset(state!="United States") %>%
  ggplot(aes(x=murder, y=burglary)) +
  geom_point() +
  geom_smooth(method='loess')
To examine the relationships among several variables at once, we can proceed as follows.
library(GGally)
crime %>% subset(state!="District of Columbia") %>%
  subset(state!="United States") %>%
  ggpairs(columns=2:8,lower = list(continuous = "smooth"))
crime %>% subset(state!="District of Columbia") %>%
  subset(state!="United States") %>%
  ggcorr(label = TRUE, label_round = 2, size=2)
VISUALIZING TIME SERIES DATA & TRENDS
The following illustration is taken from Indra (2018) and uses the economics data set, which is available in the ggplot2 package.
head(economics)
## # A tibble: 6 x 6
## date pce pop psavert uempmed unemploy
## <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1967-07-01 507. 198712 12.6 4.5 2944
## 2 1967-08-01 510. 198911 12.6 4.7 2945
## 3 1967-09-01 516. 199113 11.9 4.6 2958
## 4 1967-10-01 512. 199311 12.9 4.9 3143
## 5 1967-11-01 517. 199498 12.8 4.7 3066
## 6 1967-12-01 525. 199657 11.8 4.8 3018
In current versions of ggplot2 this data set has 574 observations (older releases had 478) and 6 variables:
- date : month of data collection
- psavert : personal savings rate
- pce : personal consumption expenditures, in billions of dollars
- unemploy : number of unemployed, in thousands
- uempmed : median duration of unemployment, in weeks
- pop : total population, in thousands
Time series patterns can be explored with a line plot, using the geom_line() function.
ggplot(data = economics, aes(x = date, y = pop))+ geom_line(color = "#00AFBB", size = 1) +
xlab("Date of data") + ylab("Total population")
# Base plot with date axis
p <- ggplot(economics, aes(x = date, y = psavert)) +
  geom_line(color = "#00AFBB", size = 1)
p
# Set axis limits c(min, max)
min <- as.Date("2002-1-1")
max <- NA
p + scale_x_date(limits = c(min, max))
## Warning: Removed 414 row(s) containing missing values (geom_path).
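The warning appears because setting limits on the scale discards the rows that fall outside the range before the line is drawn. One way to avoid it, sketched below under the assumption that only the later period is of interest, is to filter the data first and plot the subset. The names recent and p_recent are introduced here.

```r
library(dplyr)
library(ggplot2)

# Keep only observations from 2002 onward, then plot the subset
recent <- economics %>%
  filter(date >= as.Date("2002-01-01"))

p_recent <- ggplot(recent, aes(x = date, y = psavert)) +
  geom_line(color = "#00AFBB")

p_recent
```

If the full series should stay in the plot object and only the view should change, coord_cartesian(xlim = ...) is another option, since it zooms without dropping rows.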
Suppose we want to compare the time series patterns of psavert and uempmed. We first reshape the data set so that both variables are in the same column.
data <- economics %>%
  select(date, psavert, uempmed) %>%
  gather(key = "variable", value = "value", -date)
# Multiple line plot
ggplot(data, aes(x = date, y = value)) + geom_line(aes(color = variable), size = 1) +
scale_color_manual(values = c("#00AFBB", "#E7B800")) + theme_minimal()
# Area plot
ggplot(data, aes(x = date, y = value)) +
geom_area(aes(color = variable, fill = variable), alpha = 0.5, position = position_dodge(0.8)) +
scale_color_manual(values = c("#00AFBB", "#E7B800")) +
scale_fill_manual(values = c("#00AFBB", "#E7B800"))
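The reshaping step above uses gather(), which still works but which tidyr documents as superseded by pivot_longer(). A sketch of the equivalent call is shown below. The name long_data is introduced here to avoid overwriting the data object used in the plots.

```r
library(dplyr)
library(tidyr)
library(ggplot2)  # provides the economics data

long_data <- economics %>%
  select(date, psavert, uempmed) %>%
  pivot_longer(cols = -date, names_to = "variable", values_to = "value")

head(long_data)
```

The rows come out in a different order than with gather() (the two variables alternate by date), but the plots are unaffected because both geoms group the observations by variable.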
Besides the illustrations above, time series data can also be visualized with dumbbell plots and slope graphs, which are shown in Kabacoff (2018).
References
[CC19] Community contributions for EDAV fall 2019. (2019, December 13). https://jtr13.github.io/cc19/
Holtz, Y. (n.d.). The R Graph Gallery. https://www.r-graph-gallery.com/
Indra. (2018, May 28). Plot time series data using GGPlot. Kaggle: Your Machine Learning and Data Science Community. https://www.kaggle.com/indra90/plot-time-series-data-using-ggplot
Kabacoff, R. (2018, September 3). Data visualization with R · GitHub Pages. https://rkabacoff.github.io/datavis/Time.html#dummbbell-charts
Kliegman, J., Dombrowski, M. P., & Kavalar, M. (2019, May 2). Best ggplot visualizations. Nextjournal. https://nextjournal.com/jk/best-ggplot
Prabhakaran, S. (n.d.). Top 50 ggplot2 visualizations - The master list (With full R code). Tutorials on Advanced Stats and Machine Learning With R. https://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html
Wilke, C. O. (2019). Fundamentals of data visualization. Claus O. Wilke. https://clauswilke.com/dataviz/nested-proportions.html
Yau, N. (2011). Visualize this: the FlowingData guide to design, visualization, and statistics. John Wiley & Sons.
Department of Statistics, IPB University