Literacy Rate di Jawa Tengah
Package
library(graphics)
library(aplore3)## Warning: package 'aplore3' was built under R version 4.0.5
library(tidyverse)## Warning: package 'tidyverse' was built under R version 4.0.5
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.0 v dplyr 1.0.5
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.0.5
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(readxl)## Warning: package 'readxl' was built under R version 4.0.5
library(ggplot2)
library(PerformanceAnalytics)## Warning: package 'PerformanceAnalytics' was built under R version 4.0.5
## Loading required package: xts
## Warning: package 'xts' was built under R version 4.0.5
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.0.5
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##
## first, last
##
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
##
## legend
library(hrbrthemes)## Warning: package 'hrbrthemes' was built under R version 4.0.5
## NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.
## Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and
## if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow
library(viridis)## Warning: package 'viridis' was built under R version 4.0.5
## Loading required package: viridisLite
## Warning: package 'viridisLite' was built under R version 4.0.5
library(gridExtra)## Warning: package 'gridExtra' was built under R version 4.0.5
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(ggrepel)## Warning: package 'ggrepel' was built under R version 4.0.5
library(plotly)## Warning: package 'plotly' was built under R version 4.0.5
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(caret)## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
Data
Data yang digunakan merupakan data
Literacy Rate Kabupaten/Kota di Provinsi Jawa Tengah Th. 1996 - 2020
yang didapatkan dari Indonesia Database for Policy and Economic
Research. Literacy rate merupakan tingkat melek huruf yang
ditentukan oleh persentase suatu penduduk dari kelompok usia tertentu
yang dapat membaca dan menulis. Dalam data ini, literacy rate yang
digunakan merupakan persentase untuk kelompok usia 15 tahun keatas.
Data terdiri dari Literacy rata (tingkat melek huruf) di 35 kab/kota di Jawa Tengah dengan selang pengamatan selama 25 tahun dari 1996 - 2020.
Import Data
rate <- read_excel("D:/VISUALISASI DATA/Literacy.xlsx")
rateTipe Data
untuk mengetahui tipe data di setiap kolom digunakan perintah
str
str(rate)## tibble [35 x 26] (S3: tbl_df/tbl/data.frame)
## $ Kabupaten: chr [1:35] "Banjarnegara" "Banyumas" "Batang" "Blora" ...
## $ 1996 : num [1:35] 78.3 89.8 84.2 71.8 78.8 ...
## $ 1997 : num [1:35] 86.3 89.3 87.7 90.8 81.2 ...
## $ 1998 : num [1:35] 84.2 89.2 86.1 74.8 79 ...
## $ 1999 : num [1:35] 85.9 91.2 85.8 74.1 81.4 ...
## $ 2000 : num [1:35] 84.8 91.6 85.6 79.5 83.9 ...
## $ 2001 : num [1:35] 81.9 88.1 83.5 75.1 78.2 ...
## $ 2002 : num [1:35] 82.4 89.7 85 80.6 81.9 ...
## $ 2003 : num [1:35] 84.1 92 85.2 79.6 81.7 ...
## $ 2004 : num [1:35] 85 92.8 84.3 81.2 84.2 ...
## $ 2005 : num [1:35] 87 92 83.6 78.5 84.9 ...
## $ 2006 : num [1:35] 88.2 91.2 87 80.6 84.5 ...
## $ 2007 : num [1:35] 87.3 93.3 86.3 81.3 86 ...
## $ 2008 : num [1:35] 88.2 94.7 88.5 84.1 85 ...
## $ 2009 : num [1:35] 86.8 93.3 87.3 81 84.5 ...
## $ 2010 : num [1:35] 87.4 93.9 88.1 81.8 85.8 ...
## $ 2011 : num [1:35] 88.2 94.1 90.4 85.1 88.6 ...
## $ 2012 : num [1:35] 87.6 94.2 89.9 83.4 87.5 ...
## $ 2013 : num [1:35] 91.2 94.8 92.3 86.6 88.1 ...
## $ 2014 : num [1:35] 93.4 94.8 94.4 89.4 88.5 ...
## $ 2015 : num [1:35] 95.1 96.3 93.1 87 91.5 ...
## $ 2016 : num [1:35] 91.9 96 93 88.7 90.4 ...
## $ 2017 : num [1:35] 92.2 96.1 93.5 87.3 90.4 ...
## $ 2018 : num [1:35] 90.7 94.9 94.6 88.3 91 ...
## $ 2019 : num [1:35] 91.2 94.6 93.2 88 90.7 ...
## $ 2020 : num [1:35] 93 96 94.1 88 92.8 ...
Ringkasan Data
Sebelum melakukan visualisasi data , perlu diketahui gambaran umum
data melalui summary
summary(rate)## Kabupaten 1996 1997 1998
## Length:35 Min. :68.40 Min. :71.40 Min. :67.51
## Class :character 1st Qu.:78.62 1st Qu.:82.32 1st Qu.:80.92
## Mode :character Median :82.00 Median :84.78 Median :85.96
## Mean :82.30 Mean :86.01 Mean :85.00
## 3rd Qu.:86.13 3rd Qu.:90.24 3rd Qu.:89.36
## Max. :95.22 Max. :94.73 Max. :95.06
## 1999 2000 2001 2002
## Min. :71.56 Min. :74.24 Min. :71.85 Min. :75.30
## 1st Qu.:83.06 1st Qu.:84.15 1st Qu.:81.11 1st Qu.:82.56
## Median :85.76 Median :85.86 Median :84.31 Median :86.46
## Mean :85.35 Mean :86.23 Mean :84.12 Mean :86.34
## 3rd Qu.:88.98 3rd Qu.:88.83 3rd Qu.:87.46 3rd Qu.:88.91
## Max. :95.72 Max. :94.50 Max. :94.36 Max. :95.64
## 2003 2004 2005 2006
## Min. :73.29 Min. :72.62 Min. :74.89 Min. :76.35
## 1st Qu.:82.86 1st Qu.:84.92 1st Qu.:84.96 1st Qu.:86.57
## Median :86.37 Median :86.61 Median :87.93 Median :88.87
## Mean :86.59 Mean :87.35 Mean :88.03 Mean :88.98
## 3rd Qu.:90.31 3rd Qu.:90.83 3rd Qu.:91.34 3rd Qu.:91.69
## Max. :96.21 Max. :95.60 Max. :96.53 Max. :97.08
## 2007 2008 2009 2010
## Min. :80.94 Min. :79.24 Min. :81.03 Min. :81.84
## 1st Qu.:86.74 1st Qu.:88.36 1st Qu.:86.67 1st Qu.:88.00
## Median :88.71 Median :89.38 Median :90.19 Median :90.74
## Mean :89.21 Mean :89.86 Mean :89.84 Mean :90.51
## 3rd Qu.:92.42 3rd Qu.:92.91 3rd Qu.:92.22 3rd Qu.:93.43
## Max. :97.17 Max. :96.91 Max. :96.68 Max. :96.80
## 2011 2012 2013 2014
## Min. :83.22 Min. :82.07 Min. :82.91 Min. :85.94
## 1st Qu.:88.95 1st Qu.:88.84 1st Qu.:90.58 1st Qu.:91.98
## Median :91.36 Median :90.94 Median :92.31 Median :93.61
## Mean :90.82 Mean :90.90 Mean :92.18 Mean :93.29
## 3rd Qu.:92.70 3rd Qu.:93.14 3rd Qu.:94.21 3rd Qu.:95.19
## Max. :96.96 Max. :97.52 Max. :98.11 Max. :98.26
## 2015 2016 2017 2018
## Min. :82.42 Min. :83.22 Min. :86.47 Min. :87.19
## 1st Qu.:92.17 1st Qu.:92.50 1st Qu.:92.48 1st Qu.:92.11
## Median :94.21 Median :93.69 Median :93.60 Median :94.20
## Mean :93.49 Mean :93.55 Mean :93.65 Mean :93.93
## 3rd Qu.:95.20 3rd Qu.:95.26 3rd Qu.:94.96 3rd Qu.:95.09
## Max. :98.30 Max. :97.71 Max. :98.13 Max. :98.57
## 2019 2020
## Min. :86.70 Min. :88.01
## 1st Qu.:93.02 1st Qu.:93.58
## Median :93.93 Median :94.69
## Mean :93.84 Mean :94.73
## 3rd Qu.:94.70 3rd Qu.:96.38
## Max. :98.61 Max. :99.00
Univariate Exploration
Histogram
Eksplorasi univariate digunakan data tingkat melek huruf pada tahun 1996, 2010, dan 2020. Pemilihan tahun dengan pertimbangan selang yang tidak terpaut jauh. Selain itu dengan memperhatikan tingkat perkembangan Literacy rate yang bisa dilihat peningkatannya.
hist(rate$`1996`, breaks = 25, col = "coral", xlab = "Literacy Rate (Persen)", main = "Tingkat Melek Huruf di Prov. Jawa Tengah", xlim = c(60,100))
hist(rate$`2010`, col='green', add=TRUE)
hist(rate$`2020`,col='yellow', add=TRUE)Dari Histogram diatas dapat dilihat bahwa tingkat melek huruf di kab./kota di Jawa Tengah dari tahun 1996, 2010, hingga 2020 melangami peningkatan yang cukup signifikan.
Boxplot
boxplot(rate$`1996`, horizontal = T, col = 'coral', xlab="Literacy Rate (Persen)")
boxplot(rate$`2010`, horizontal = T, col = 'green', add = TRUE)
boxplot(rate$`2020`, horizontal = T, col = 'yellow', add = TRUE)Berdasarkan Boxplot Literacy Rate dari tahun 1996, 2010, hingga 2020 mengalami peningkatan hal ini bisa dilihat secara jelas dari nilai median ataupun nilia mu-nya. Selain itu, tidak nampak pencilan yang cukup berarti dari boxplot diatas.
Kernel Density Plot
Kernel density plot adalah salah satu tipe plot yang menampilkan sebaran nilai dari dataset dengan menggunakan kurva kontinu.
#data frame
rate1 <- rate$`1996`
rate2 <- rate$`2010`
rate3 <- rate$`2020`
#plot first kernel density plot
kd1 <- density(rate2)
plot(kd1, col='coral', lwd=3 )
#plot second kernel density plot
kd2 <- density(rate1)
lines(kd2, col='green', lwd=3)
#plot third kernel density plot
kd3 <- density(rate3)
lines(kd3, col='blue', lwd=3)Kernel Density Plot memperlihatkan sebaran data yang cukup beragam.
Dari Gambar diatas terlihat bahwa kurva yang warna coral
memperlihatkan sebaran data yang mendekati sebaran normal.
Barplot
# Barplot
barplot(rate1)barplot(rate2)barplot(rate3)Barplot yang ditampilkan kurang memenuhi relevansi karena dataset yang digunakan tidak memiliki variabel kategori. Namun barplot masih bisa digunakn untuk menganalisis tingkat melek huruf ditahun 1996, 2010, dan 2020 yang terlihat cukup dinamis antara satu kab/kota dengan lainnya.
Piechart
pie(rate1) #1996pie(rate2) #2010pie(rate3) #2020Piechart memperlihatkan baik pada tahun 1996, 2010, maupun 2020 tingkat melek huruf antara satu wilayah dengan wilayah yang lain tidak memperlihatkan perbedaan yang ekstrim.
Bivariate Exploration
Scatter Plot
Scatter Plot merupakan Grafik 2 dimensi yang menampilkan axis
x dan y. Setiap point pada plot mewakili satu
observasi.
plot(rate2, rate3, ylab="Literacy Rate 2020",
xlab = "Literacy Rate 2010",
pch = 19, col = "coral", cex = 1.5)Dari Scatter Plot diatas terlihat bahwa korelasi antara tingkat melek huruf pada tahun 2010 linear positif dengan tahun 2020. Plot menyebar di sekitar garis normal sehingga dapat diasumsikan data berdistribusi normal.
Correlation Matrix
pairs(rate[,c(16,26)], pch=19, col="coral", cex=1.5)Dari plot diatas terlihat bahwa korelasi antara tingkat melek huruf tahun 2010 dengan tahun 2020 berjalan secara linear positif dan menyebar normal.
Chart Correlation
chart.Correlation(rate[,c(2,16,26)], histogram = TRUE, pch= 19)Chart Correlation menunjukkan kekuatan hubungan antara satu variabel dengan variabel yang lain. Nilai korelasi untuk tingkat melek huruf tahun 1996 denngan 2010 sebesar 0.84 (84%), korelasi 1996 dengan 2020 sebesar 0.88 (88%), dan tahun 2010 dengan 2020 sebesar 0.83 (83%). Dari angka-angka korelasi tersebut menunjukkan bahwa tingkat hubungan antara ketiga variabel tersebut cukup kuat.
Pasangan Matriks Korelasi
pairs(rate[,c(2,16,26)], pch=19, col="coral", cex=1.5)Ketiga variabel terlihat berkorelasi positif dan menyebar normal.
2D Density Plot
ggplot(rate, aes(x=rate1, y=rate3) ) +
geom_point()ggplot(rate, aes(x=rate1, y=rate3) ) +
geom_bin2d() +
theme_bw()# Bin size control + color palette
ggplot(rate, aes(x=rate1, y=rate3) ) +
geom_bin2d(bins = 70) +
scale_fill_continuous(type = "viridis") +
theme_bw()Kontur
ggplot(rate, aes(x=rate1, y=rate3) ) +
geom_density_2d()Area
ggplot(rate, aes(x=rate1, y=rate3) ) +
stat_density_2d(aes(fill = ..level..), geom = "polygon")Area dan Kontur
ggplot(rate, aes(x=rate1, y=rate3)) +
stat_density_2d(aes(fill = ..level..), geom = "polygon", colour="white")Bubble Plot
fig <- plot_ly(rate, x = ~rate1, y = ~rate3, type = 'scatter', mode = 'markers',
marker = list(size = ~rate1, opacity = 0.3))
fig <- fig %>% layout(title = 'Tingkat Melek Huruf di Jawa Tengah',
xaxis = list(showgrid = FALSE),
yaxis = list(showgrid = FALSE))
figLocal Regression (LOESS)
#DataFrame
df <- data.frame(x=rate2, y=rate3)
#view first six rows of data frame
head(df)#fit several LOESS regression models to dataset
loess50 <- loess(y ~ x, data=df, span=.5)
smooth50 <- predict(loess50)
loess75 <- loess(y ~ x, data=df, span=.75)
smooth75 <- predict(loess75)
loess90 <- loess(y ~ x, data=df, span=.9)
smooth90 <- predict(loess90)
#create scatterplot with each regression line overlaid
plot(df$x, df$y, pch=19, main='Loess Regression Models')
lines(smooth50, x=df$x, col='red')
lines(smooth75, x=df$x, col='purple')
lines(smooth90, x=df$x, col='blue')
legend('bottomright', legend=c('.5', '.75', '.9'),
col=c('red', 'purple', 'blue'), pch=19, title='Smoothing Span')#define k-fold cross validation method
ctrl <- trainControl(method = "cv", number = 5)
grid <- expand.grid(span = seq(0.5, 0.9, len = 5), degree = 1)
#perform cross-validation using smoothing spans ranginf from 0.5 to 0.9
model <- train(y ~ x, data = df, method = "gamLoess", tuneGrid=grid, trControl = ctrl)## Loading required package: gam
## Warning: package 'gam' was built under R version 4.0.5
## Loading required package: splines
## Loading required package: foreach
## Warning: package 'foreach' was built under R version 4.0.5
##
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
##
## accumulate, when
## Loaded gam 1.20
## Warning in gam.lo(data[["lo(x, span = 0.5, degree = 1)"]], z, w, span = 0.5, :
## eval 96.8
## Warning in gam.lo(data[["lo(x, span = 0.5, degree = 1)"]], z, w, span = 0.5, :
## upperlimit 96.402
## Warning in gam.lo(data[["lo(x, span = 0.5, degree = 1)"]], z, w, span = 0.5, :
## extrapolation not allowed with blending
## Warning in gam.lo(data[["lo(x, span = 0.6, degree = 1)"]], z, w, span = 0.6, :
## eval 96.8
## Warning in gam.lo(data[["lo(x, span = 0.6, degree = 1)"]], z, w, span = 0.6, :
## upperlimit 96.402
## Warning in gam.lo(data[["lo(x, span = 0.6, degree = 1)"]], z, w, span = 0.6, :
## extrapolation not allowed with blending
## Warning in gam.lo(data[["lo(x, span = 0.7, degree = 1)"]], z, w, span = 0.7, :
## eval 96.8
## Warning in gam.lo(data[["lo(x, span = 0.7, degree = 1)"]], z, w, span = 0.7, :
## upperlimit 96.402
## Warning in gam.lo(data[["lo(x, span = 0.7, degree = 1)"]], z, w, span = 0.7, :
## extrapolation not allowed with blending
## Warning in gam.lo(data[["lo(x, span = 0.8, degree = 1)"]], z, w, span = 0.8, :
## eval 96.8
## Warning in gam.lo(data[["lo(x, span = 0.8, degree = 1)"]], z, w, span = 0.8, :
## upperlimit 96.402
## Warning in gam.lo(data[["lo(x, span = 0.8, degree = 1)"]], z, w, span = 0.8, :
## extrapolation not allowed with blending
## Warning in gam.lo(data[["lo(x, span = 0.9, degree = 1)"]], z, w, span = 0.9, :
## eval 96.8
## Warning in gam.lo(data[["lo(x, span = 0.9, degree = 1)"]], z, w, span = 0.9, :
## upperlimit 96.402
## Warning in gam.lo(data[["lo(x, span = 0.9, degree = 1)"]], z, w, span = 0.9, :
## extrapolation not allowed with blending
## Warning in gam.lo(data[["lo(x, span = 0.5, degree = 1)"]], z, w, span = 0.5, :
## eval 81.84
## Warning in gam.lo(data[["lo(x, span = 0.5, degree = 1)"]], z, w, span = 0.5, :
## lowerlimit 81.956
## Warning in gam.lo(data[["lo(x, span = 0.5, degree = 1)"]], z, w, span = 0.5, :
## extrapolation not allowed with blending
## Warning in gam.lo(data[["lo(x, span = 0.6, degree = 1)"]], z, w, span = 0.6, :
## eval 81.84
## Warning in gam.lo(data[["lo(x, span = 0.6, degree = 1)"]], z, w, span = 0.6, :
## lowerlimit 81.956
## Warning in gam.lo(data[["lo(x, span = 0.6, degree = 1)"]], z, w, span = 0.6, :
## extrapolation not allowed with blending
## Warning in gam.lo(data[["lo(x, span = 0.7, degree = 1)"]], z, w, span = 0.7, :
## eval 81.84
## Warning in gam.lo(data[["lo(x, span = 0.7, degree = 1)"]], z, w, span = 0.7, :
## lowerlimit 81.956
## Warning in gam.lo(data[["lo(x, span = 0.7, degree = 1)"]], z, w, span = 0.7, :
## extrapolation not allowed with blending
## Warning in gam.lo(data[["lo(x, span = 0.8, degree = 1)"]], z, w, span = 0.8, :
## eval 81.84
## Warning in gam.lo(data[["lo(x, span = 0.8, degree = 1)"]], z, w, span = 0.8, :
## lowerlimit 81.956
## Warning in gam.lo(data[["lo(x, span = 0.8, degree = 1)"]], z, w, span = 0.8, :
## extrapolation not allowed with blending
## Warning in gam.lo(data[["lo(x, span = 0.9, degree = 1)"]], z, w, span = 0.9, :
## eval 81.84
## Warning in gam.lo(data[["lo(x, span = 0.9, degree = 1)"]], z, w, span = 0.9, :
## lowerlimit 81.956
## Warning in gam.lo(data[["lo(x, span = 0.9, degree = 1)"]], z, w, span = 0.9, :
## extrapolation not allowed with blending
#print results of k-fold cross-validation
print(model)## Generalized Additive Model using LOESS
##
## 35 samples
## 1 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 28, 28, 27, 28, 29
## Resampling results across tuning parameters:
##
## span RMSE Rsquared MAE
## 0.5 1.595631 0.6543796 1.103961
## 0.6 1.559878 0.6703389 1.090204
## 0.7 1.475092 0.7085617 1.052077
## 0.8 1.454115 0.7175444 1.041339
## 0.9 1.446529 0.7212316 1.041767
##
## Tuning parameter 'degree' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were span = 0.9 and degree = 1.
Berdasarkan grafik LOESS Regression terlihat bahwa model regresi yang cukup mewakili model data adalah regresi lokal dengan span 0.9. Dilihat dari nilai RMSE, model terbaik adalah model yang memiliki nilai RMSE terkecil, dari output diatas nilai RMSE terkecil sebesar 1.495821 dengan span 0.9. Sehingga dapat disimpulkan bahwa baik berdasarkan grafik maupun uji validasi silang model regresi lokal terbaik adalah regresi lokal (LOESS) dengan span 0.9.
Referensi
Blog: https://profeksis.blogspot.com/, Email: mistereko@apps.ipb.ac.id, Rpubs: https://rpubs.com/profeksis↩︎