This material covers several packages that support data exploration and data visualization.
Data
The data come from a Portuguese bank and describe its telephone marketing campaigns. The dataset has 16 predictor variables used to predict whether a client will subscribe to a term deposit. The data are sourced from the UCI Machine Learning Repository.
Import Data
Import the data from a CSV file. The file can be downloaded from https://ipb.link/praktikum-visualisasidata
Bank <- read.csv("bank.csv")
head(Bank)
##   age         job marital education default balance housing loan  contact day
## 1  30  unemployed married   primary      no    1787      no   no cellular  19
## 2  33    services married secondary      no    4789     yes  yes cellular  11
## 3  35  management  single  tertiary      no    1350     yes   no cellular  16
## 4  30  management married  tertiary      no    1476     yes  yes  unknown   3
## 5  59 blue-collar married secondary      no       0     yes   no  unknown   5
## 6  35  management  single  tertiary      no     747      no   no cellular  23
##   month duration campaign pdays previous  y weight hight
## 1   oct       79        1    -1        0 no   63.2 160.9
## 2   may      220        1   339        4 no   68.5 165.3
## 3   apr      185        1   330        1 no   78.5 173.5
## 4   jun      199        4    -1        0 no   95.3 187.2
## 5   may      226        1    -1        0 no   60.1 158.3
## 6   feb      141        2   176        3 no   94.8 186.8
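Before plotting, it can help to confirm how R read each column. The sketch below is only an illustration: `bank_toy` is a hypothetical stand-in that copies a few values from the output above (so the code runs without `bank.csv`), and the checks use base R only.

```r
# Hypothetical stand-in for Bank: a few columns copied from the head() output
bank_toy <- data.frame(
  age     = c(30, 33, 35, 30, 59, 35),
  job     = c("unemployed", "services", "management",
              "management", "blue-collar", "management"),
  balance = c(1787, 4789, 1350, 1476, 0, 747),
  y       = c("no", "no", "no", "no", "no", "no")
)

# Number of rows and columns
print(dim(bank_toy))

# Class of each column, to see which ones are numeric
col_classes <- sapply(bank_toy, class)
print(col_classes)

# Missing values per column
n_missing <- colSums(is.na(bank_toy))
print(n_missing)

# Names of the numeric columns and their five-number summaries
num_cols <- names(bank_toy)[sapply(bank_toy, is.numeric)]
print(summary(bank_toy[num_cols]))
```

The same calls should work on `Bank` itself once the CSV has been read.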
1. DataExplorer
Automated data exploration process for analytic tasks and predictive modeling, so that users could focus on understanding data and extracting insights. The package scans and analyzes each variable, and visualizes them with typical graphical techniques. Common data processing methods are also available to treat and format data.
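As a rough illustration of the "scan each variable" idea, the sketch below calls `introduce()` and `plot_missing()` from DataExplorer on a small data frame. The data frame `toy` and its values are made up for this example; they are not the bank data.

```r
library(DataExplorer)

# Hypothetical toy data with one missing value in `age`
toy <- data.frame(
  age     = c(30, 33, 35, NA, 59, 35),
  job     = c("unemployed", "services", "management",
              "management", "blue-collar", "management"),
  balance = c(1787, 4789, 1350, 1476, 0, 747)
)

# One-row data frame of basic metadata (rows, columns, missing values, ...)
info <- introduce(toy)
print(info)

# Share of missing values per column, drawn as a bar chart
plot_missing(toy)
```

Replacing `toy` with `Bank` should give the same overview for the full dataset.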
#install.packages("DataExplorer")
library("DataExplorer")
Histogram
plot_histogram(Bank)
plot_density(Bank, geom_density_args = list(fill = "blue"))
BarPlot
plot_bar(Bank)
Scatterplot
plot_scatterplot(Bank, by = "age")
Plot Correlation
type is for . . .
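As far as the DataExplorer documentation describes it, the `type` argument of `plot_correlation()` selects which columns enter the correlation matrix: `"all"`, `"discrete"`, or `"continuous"`. A shortened value such as `"c"` is assumed here to be resolved to `"continuous"` by partial matching. The sketch uses a made-up data frame `toy`, not the bank data.

```r
library(DataExplorer)

set.seed(1)
n <- 100
# Hypothetical toy data: two numeric and two categorical columns
toy <- data.frame(
  age      = round(runif(n, 20, 70)),
  duration = round(runif(n, 50, 600)),
  job      = sample(c("services", "management", "blue-collar"), n, replace = TRUE),
  housing  = sample(c("yes", "no"), n, replace = TRUE)
)

# Only the numeric columns (age, duration) are correlated
plot_correlation(toy, type = "continuous")

# Only the categorical columns, which are turned into 0/1 indicators first
plot_correlation(toy, type = "discrete")

# For comparison: the plain correlation matrix of the numeric columns
num_cor <- cor(toy[sapply(toy, is.numeric)])
print(round(num_cor, 2))
```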
plot_correlation(Bank, type = "c")
2. corrplot
Package Description: R package corrplot provides a visual exploratory tool on correlation matrix that supports automatic variable reordering to help detect hidden patterns among variables.
corrplot is very easy to use and provides a rich array of plotting options in visualization method, graphic layout, color, legend, text labels, etc. It also provides p-values and confidence intervals to help users determine the statistical significance of the correlations.
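The reordering and significance features mentioned above can be sketched as follows. The data frame `toy` is simulated for this example (it is not the bank data), and the 0.05 threshold is an arbitrary choice for illustration.

```r
library(corrplot)

set.seed(42)
n <- 200
age <- rnorm(n, mean = 40, sd = 10)
# Hypothetical toy data: balance is built to depend on age, the rest is noise
toy <- data.frame(
  age      = age,
  balance  = 50 * age + rnorm(n, sd = 300),
  duration = rnorm(n, mean = 250, sd = 100),
  campaign = rpois(n, lambda = 3)
)

M <- cor(toy)

# Pairwise correlation tests: a list with p-values and confidence bounds
test_res <- cor.mtest(toy, conf.level = 0.95)

# Reorder variables by hierarchical clustering and blank out
# correlations that are not significant at the 5% level
corrplot(M, method = "circle", type = "lower", order = "hclust",
         p.mat = test_res$p, sig.level = 0.05, insig = "blank",
         diag = FALSE)
```

The `lowCI` and `uppCI` elements of `test_res` hold the confidence bounds for the same pairs.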
#install.packages("corrplot")
library("corrplot")
library("tidyverse") # needed for the %>% pipe operator
corrplot(Bank %>% select(where(is.numeric)) %>% cor,
         method = "pie", type = "lower",
         diag = FALSE)
3. ggpubr
Package Description: The 'ggplot2' package is excellent and flexible for elegant data visualization in R. However the default generated plots requires some formatting before we can send them for publication. Furthermore, to customize a 'ggplot', the syntax is opaque and this raises the level of difficulty for researchers with no advanced R programming skills. 'ggpubr' provides some easy-to-use functions for creating and customizing 'ggplot2'- based publication ready plots
#install.packages("ggpubr")
library("ggpubr")
Histogram
gghistogram(Bank, x = "age", fill = "pink") + scale_y_continuous(expand = c(0, 0))
## Warning: Using `bins = 30` by default. Pick better value with the argument
## `bins`.
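The warning above goes away once the binning is set explicitly. A minimal sketch, using a simulated `age` column in a hypothetical data frame `toy`; the values 15 and 5 are arbitrary and would need tuning for the real data:

```r
library(ggpubr)

set.seed(123)
# Hypothetical toy data: 300 simulated ages
toy <- data.frame(age = round(rnorm(300, mean = 41, sd = 10)))

# Fix the number of bins ...
p_bins <- gghistogram(toy, x = "age", bins = 15, fill = "pink") +
  scale_y_continuous(expand = c(0, 0))

# ... or fix the width of each bin (here 5 years)
p_width <- gghistogram(toy, x = "age", binwidth = 5, fill = "pink") +
  scale_y_continuous(expand = c(0, 0))

print(p_bins)
print(p_width)
```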
Density Plot
ggdensity(Bank, x = "age", fill = "pink") +
  scale_y_continuous(expand = c(0, 0)) +
  scale_x_continuous(expand = c(0, 0))
# Draw one density plot per numeric column and arrange them in a grid
names_cont <- colnames(Bank %>% select(where(is.numeric)))
p1 <- map(names_cont, ~
  ggdensity(Bank, x = .x, fill = "pink") +
    scale_y_continuous(expand = c(0, 0)) +
    scale_x_continuous(expand = c(0, 0))
)
ggarrange(plotlist = p1)
Scatterplot
ggscatter(Bank, x = "age", y = "duration", color = "pink")
ggscatter(Bank, x = "age", y = "duration", color = "pink",
          add = "reg.line",  # Add regression line
          conf.int = TRUE,   # Add confidence interval
          add.params = list(color = "blue",
                            fill = "lightgray")
)
## `geom_smooth()` using formula 'y ~ x'
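To relate the plotted line to numbers, the sketch below fits the same kind of straight line with `lm()` and asks `ggscatter()` to print the correlation coefficient via `cor.coef`. It assumes the regression line in the plot is an ordinary least-squares fit of `y ~ x`, as the message above suggests. The data frame `toy` is simulated and hypothetical.

```r
library(ggpubr)

set.seed(7)
n <- 150
age <- round(runif(n, 20, 70))
# Hypothetical toy data: duration rises by about 2 per year of age, plus noise
toy <- data.frame(age = age,
                  duration = 150 + 2 * age + rnorm(n, sd = 60))

# Least-squares line: intercept and slope
fit <- lm(duration ~ age, data = toy)
print(coef(fit))

# Scatterplot with regression line, confidence band, and Pearson correlation
p <- ggscatter(toy, x = "age", y = "duration", color = "pink",
               add = "reg.line", conf.int = TRUE,
               add.params = list(color = "blue", fill = "lightgray"),
               cor.coef = TRUE, cor.method = "pearson")
print(p)
```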
4. ggplot2
#install.packages("ggplot2")
library(ggplot2)
One variable
Continuous Data
a <- ggplot(Bank, aes(x = weight))
a + geom_area(stat = "bin", color = "blue", fill = "skyblue", size = 1) +
  labs(
    x = "x",
    y = "y",
    title = "Title",
    subtitle = "Subtitle",
    caption = "caption"
  )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
a + geom_histogram(binwidth = 10)
Discrete Data
b <- ggplot(Bank, aes(day))
b + geom_bar(color = "blue", fill = "pink") +
  labs(
    x = "x",
    y = "y",
    title = "Title",
    subtitle = "Subtitle",
    caption = "caption"
  )
Two variables
Continuous X, Continuous Y
f <- ggplot(Bank, aes(weight, hight))
f + geom_jitter(color = "pink", size = 2, shape = "o")
Discrete X, Continuous Y
g <- ggplot(Bank, aes(day, weight))
# stat = "identity" uses the weight values as bar heights, stacked within each day
g + geom_bar(stat = "identity", color = "blue") +
  labs(
    x = "x",
    y = "y",
    title = "Title",
    subtitle = "Subtitle",
    caption = "caption"
  )
g <- ggplot(Bank, aes(day, weight))
g + geom_boxplot(aes(group = day)) +
  labs(
    x = "x",
    y = "y",
    title = "Title",
    subtitle = "Subtitle",
    caption = "caption"
  )
g1 <- ggplot(Bank, aes(job, age))
g1 + geom_violin(scale = "area", color = "#993399", fill = "#993399") +
  labs(
    x = "x",
    y = "y",
    title = "Title",
    subtitle = "Subtitle",
    caption = "caption"
  )
Discrete X, Discrete Y
h <- ggplot(Bank, aes(education, job))
h + geom_jitter(colour = "red", size = 1) +
  labs(
    x = "x",
    y = "y",
    title = "Title",
    subtitle = "Subtitle",
    caption = "caption"
  )
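When both axes are discrete, many points fall on the same spot, which is why the plot above uses jitter. One possible alternative, sketched here on a hypothetical data frame `toy` rather than the bank data, is to count the observations per combination and map the count to point size with `geom_count()`.

```r
library(ggplot2)

# Hypothetical toy data: eight clients with education and job
toy <- data.frame(
  education = c("primary", "secondary", "tertiary", "tertiary",
                "secondary", "tertiary", "secondary", "primary"),
  job       = c("unemployed", "services", "management", "management",
                "blue-collar", "management", "services", "blue-collar")
)

# Cross-tabulation: how many clients fall in each education/job cell
counts <- table(toy$education, toy$job)
print(counts)

# Same information as a plot: point size is the number of observations
p <- ggplot(toy, aes(education, job)) +
  geom_count(colour = "red") +
  labs(x = "education", y = "job", size = "count")
print(p)
```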