Literacy Rate di Jawa Tengah

Package

library(graphics)
library(aplore3)
## Warning: package 'aplore3' was built under R version 4.0.5
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.0.5
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.0     v dplyr   1.0.5
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.0.5
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(readxl)
## Warning: package 'readxl' was built under R version 4.0.5
library(ggplot2)
library(PerformanceAnalytics)
## Warning: package 'PerformanceAnalytics' was built under R version 4.0.5
## Loading required package: xts
## Warning: package 'xts' was built under R version 4.0.5
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.0.5
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
## 
##     first, last
## 
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
## 
##     legend
library(hrbrthemes)
## Warning: package 'hrbrthemes' was built under R version 4.0.5
## NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.
##       Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and
##       if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow
library(viridis)
## Warning: package 'viridis' was built under R version 4.0.5
## Loading required package: viridisLite
## Warning: package 'viridisLite' was built under R version 4.0.5
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.0.5
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(ggrepel)
## Warning: package 'ggrepel' was built under R version 4.0.5
library(plotly)
## Warning: package 'plotly' was built under R version 4.0.5
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift

Data

Data yang digunakan merupakan data Literacy Rate Kabupaten/Kota di Provinsi Jawa Tengah Th. 1996 - 2020 yang didapatkan dari Indonesia Database for Policy and Economic Research. Literacy rate merupakan tingkat melek huruf yang ditentukan oleh persentase suatu penduduk dari kelompok usia tertentu yang dapat membaca dan menulis. Dalam data ini, literacy rate yang digunakan merupakan persentase untuk kelompok usia 15 tahun keatas.

Data terdiri dari Literacy rata (tingkat melek huruf) di 35 kab/kota di Jawa Tengah dengan selang pengamatan selama 25 tahun dari 1996 - 2020.

Import Data

rate <- read_excel("D:/VISUALISASI DATA/Literacy.xlsx")
rate

Tipe Data

untuk mengetahui tipe data di setiap kolom digunakan perintah str

str(rate)
## tibble [35 x 26] (S3: tbl_df/tbl/data.frame)
##  $ Kabupaten: chr [1:35] "Banjarnegara" "Banyumas" "Batang" "Blora" ...
##  $ 1996     : num [1:35] 78.3 89.8 84.2 71.8 78.8 ...
##  $ 1997     : num [1:35] 86.3 89.3 87.7 90.8 81.2 ...
##  $ 1998     : num [1:35] 84.2 89.2 86.1 74.8 79 ...
##  $ 1999     : num [1:35] 85.9 91.2 85.8 74.1 81.4 ...
##  $ 2000     : num [1:35] 84.8 91.6 85.6 79.5 83.9 ...
##  $ 2001     : num [1:35] 81.9 88.1 83.5 75.1 78.2 ...
##  $ 2002     : num [1:35] 82.4 89.7 85 80.6 81.9 ...
##  $ 2003     : num [1:35] 84.1 92 85.2 79.6 81.7 ...
##  $ 2004     : num [1:35] 85 92.8 84.3 81.2 84.2 ...
##  $ 2005     : num [1:35] 87 92 83.6 78.5 84.9 ...
##  $ 2006     : num [1:35] 88.2 91.2 87 80.6 84.5 ...
##  $ 2007     : num [1:35] 87.3 93.3 86.3 81.3 86 ...
##  $ 2008     : num [1:35] 88.2 94.7 88.5 84.1 85 ...
##  $ 2009     : num [1:35] 86.8 93.3 87.3 81 84.5 ...
##  $ 2010     : num [1:35] 87.4 93.9 88.1 81.8 85.8 ...
##  $ 2011     : num [1:35] 88.2 94.1 90.4 85.1 88.6 ...
##  $ 2012     : num [1:35] 87.6 94.2 89.9 83.4 87.5 ...
##  $ 2013     : num [1:35] 91.2 94.8 92.3 86.6 88.1 ...
##  $ 2014     : num [1:35] 93.4 94.8 94.4 89.4 88.5 ...
##  $ 2015     : num [1:35] 95.1 96.3 93.1 87 91.5 ...
##  $ 2016     : num [1:35] 91.9 96 93 88.7 90.4 ...
##  $ 2017     : num [1:35] 92.2 96.1 93.5 87.3 90.4 ...
##  $ 2018     : num [1:35] 90.7 94.9 94.6 88.3 91 ...
##  $ 2019     : num [1:35] 91.2 94.6 93.2 88 90.7 ...
##  $ 2020     : num [1:35] 93 96 94.1 88 92.8 ...

Ringkasan Data

Sebelum melakukan visualisasi data , perlu diketahui gambaran umum data melalui summary

summary(rate)
##   Kabupaten              1996            1997            1998      
##  Length:35          Min.   :68.40   Min.   :71.40   Min.   :67.51  
##  Class :character   1st Qu.:78.62   1st Qu.:82.32   1st Qu.:80.92  
##  Mode  :character   Median :82.00   Median :84.78   Median :85.96  
##                     Mean   :82.30   Mean   :86.01   Mean   :85.00  
##                     3rd Qu.:86.13   3rd Qu.:90.24   3rd Qu.:89.36  
##                     Max.   :95.22   Max.   :94.73   Max.   :95.06  
##       1999            2000            2001            2002      
##  Min.   :71.56   Min.   :74.24   Min.   :71.85   Min.   :75.30  
##  1st Qu.:83.06   1st Qu.:84.15   1st Qu.:81.11   1st Qu.:82.56  
##  Median :85.76   Median :85.86   Median :84.31   Median :86.46  
##  Mean   :85.35   Mean   :86.23   Mean   :84.12   Mean   :86.34  
##  3rd Qu.:88.98   3rd Qu.:88.83   3rd Qu.:87.46   3rd Qu.:88.91  
##  Max.   :95.72   Max.   :94.50   Max.   :94.36   Max.   :95.64  
##       2003            2004            2005            2006      
##  Min.   :73.29   Min.   :72.62   Min.   :74.89   Min.   :76.35  
##  1st Qu.:82.86   1st Qu.:84.92   1st Qu.:84.96   1st Qu.:86.57  
##  Median :86.37   Median :86.61   Median :87.93   Median :88.87  
##  Mean   :86.59   Mean   :87.35   Mean   :88.03   Mean   :88.98  
##  3rd Qu.:90.31   3rd Qu.:90.83   3rd Qu.:91.34   3rd Qu.:91.69  
##  Max.   :96.21   Max.   :95.60   Max.   :96.53   Max.   :97.08  
##       2007            2008            2009            2010      
##  Min.   :80.94   Min.   :79.24   Min.   :81.03   Min.   :81.84  
##  1st Qu.:86.74   1st Qu.:88.36   1st Qu.:86.67   1st Qu.:88.00  
##  Median :88.71   Median :89.38   Median :90.19   Median :90.74  
##  Mean   :89.21   Mean   :89.86   Mean   :89.84   Mean   :90.51  
##  3rd Qu.:92.42   3rd Qu.:92.91   3rd Qu.:92.22   3rd Qu.:93.43  
##  Max.   :97.17   Max.   :96.91   Max.   :96.68   Max.   :96.80  
##       2011            2012            2013            2014      
##  Min.   :83.22   Min.   :82.07   Min.   :82.91   Min.   :85.94  
##  1st Qu.:88.95   1st Qu.:88.84   1st Qu.:90.58   1st Qu.:91.98  
##  Median :91.36   Median :90.94   Median :92.31   Median :93.61  
##  Mean   :90.82   Mean   :90.90   Mean   :92.18   Mean   :93.29  
##  3rd Qu.:92.70   3rd Qu.:93.14   3rd Qu.:94.21   3rd Qu.:95.19  
##  Max.   :96.96   Max.   :97.52   Max.   :98.11   Max.   :98.26  
##       2015            2016            2017            2018      
##  Min.   :82.42   Min.   :83.22   Min.   :86.47   Min.   :87.19  
##  1st Qu.:92.17   1st Qu.:92.50   1st Qu.:92.48   1st Qu.:92.11  
##  Median :94.21   Median :93.69   Median :93.60   Median :94.20  
##  Mean   :93.49   Mean   :93.55   Mean   :93.65   Mean   :93.93  
##  3rd Qu.:95.20   3rd Qu.:95.26   3rd Qu.:94.96   3rd Qu.:95.09  
##  Max.   :98.30   Max.   :97.71   Max.   :98.13   Max.   :98.57  
##       2019            2020      
##  Min.   :86.70   Min.   :88.01  
##  1st Qu.:93.02   1st Qu.:93.58  
##  Median :93.93   Median :94.69  
##  Mean   :93.84   Mean   :94.73  
##  3rd Qu.:94.70   3rd Qu.:96.38  
##  Max.   :98.61   Max.   :99.00

Univariate Exploration

Histogram

Eksplorasi univariate digunakan data tingkat melek huruf pada tahun 1996, 2010, dan 2020. Pemilihan tahun dengan pertimbangan selang yang tidak terpaut jauh. Selain itu dengan memperhatikan tingkat perkembangan Literacy rate yang bisa dilihat peningkatannya.

hist(rate$`1996`, breaks = 25, col = "coral", xlab = "Literacy Rate (Persen)", main = "Tingkat Melek Huruf di Prov. Jawa Tengah", xlim = c(60,100))
hist(rate$`2010`, col='green', add=TRUE)
hist(rate$`2020`,col='yellow', add=TRUE)

Dari Histogram diatas dapat dilihat bahwa tingkat melek huruf di kab./kota di Jawa Tengah dari tahun 1996, 2010, hingga 2020 melangami peningkatan yang cukup signifikan.

Boxplot

boxplot(rate$`1996`, horizontal = T, col = 'coral', xlab="Literacy Rate (Persen)")
boxplot(rate$`2010`, horizontal = T, col = 'green', add = TRUE)
boxplot(rate$`2020`, horizontal = T, col = 'yellow', add = TRUE)

Berdasarkan Boxplot Literacy Rate dari tahun 1996, 2010, hingga 2020 mengalami peningkatan hal ini bisa dilihat secara jelas dari nilai median ataupun nilia mu-nya. Selain itu, tidak nampak pencilan yang cukup berarti dari boxplot diatas.

Kernel Density Plot

Kernel density plot adalah salah satu tipe plot yang menampilkan sebaran nilai dari dataset dengan menggunakan kurva kontinu.

#data frame 
rate1 <- rate$`1996`
rate2 <- rate$`2010`
rate3 <- rate$`2020`

#plot first kernel density plot
kd1 <- density(rate2)
plot(kd1, col='coral', lwd=3 )

#plot second kernel density plot
kd2 <- density(rate1)
lines(kd2, col='green', lwd=3)

#plot third kernel density plot
kd3 <- density(rate3)
lines(kd3, col='blue', lwd=3)

Kernel Density Plot memperlihatkan sebaran data yang cukup beragam. Dari Gambar diatas terlihat bahwa kurva yang warna coral memperlihatkan sebaran data yang mendekati sebaran normal.

Barplot

# Barplot
barplot(rate1)

barplot(rate2)

barplot(rate3)

Barplot yang ditampilkan kurang memenuhi relevansi karena dataset yang digunakan tidak memiliki variabel kategori. Namun barplot masih bisa digunakn untuk menganalisis tingkat melek huruf ditahun 1996, 2010, dan 2020 yang terlihat cukup dinamis antara satu kab/kota dengan lainnya.

Piechart

pie(rate1) #1996

pie(rate2) #2010

pie(rate3) #2020

Piechart memperlihatkan baik pada tahun 1996, 2010, maupun 2020 tingkat melek huruf antara satu wilayah dengan wilayah yang lain tidak memperlihatkan perbedaan yang ekstrim.

Bivariate Exploration

Scatter Plot

Scatter Plot merupakan Grafik 2 dimensi yang menampilkan axis x dan y. Setiap point pada plot mewakili satu observasi.

plot(rate2, rate3, ylab="Literacy Rate 2020",
     xlab = "Literacy Rate 2010",
     pch = 19, col = "coral", cex = 1.5)

Dari Scatter Plot diatas terlihat bahwa korelasi antara tingkat melek huruf pada tahun 2010 linear positif dengan tahun 2020. Plot menyebar di sekitar garis normal sehingga dapat diasumsikan data berdistribusi normal.

Correlation Matrix

pairs(rate[,c(16,26)], pch=19, col="coral", cex=1.5)

Dari plot diatas terlihat bahwa korelasi antara tingkat melek huruf tahun 2010 dengan tahun 2020 berjalan secara linear positif dan menyebar normal.

Chart Correlation

chart.Correlation(rate[,c(2,16,26)], histogram = TRUE, pch= 19)

Chart Correlation menunjukkan kekuatan hubungan antara satu variabel dengan variabel yang lain. Nilai korelasi untuk tingkat melek huruf tahun 1996 denngan 2010 sebesar 0.84 (84%), korelasi 1996 dengan 2020 sebesar 0.88 (88%), dan tahun 2010 dengan 2020 sebesar 0.83 (83%). Dari angka-angka korelasi tersebut menunjukkan bahwa tingkat hubungan antara ketiga variabel tersebut cukup kuat.

Pasangan Matriks Korelasi

pairs(rate[,c(2,16,26)], pch=19, col="coral", cex=1.5)

Ketiga variabel terlihat berkorelasi positif dan menyebar normal.

2D Density Plot

ggplot(rate, aes(x=rate1, y=rate3) ) +
  geom_point()

ggplot(rate, aes(x=rate1, y=rate3) ) +
  geom_bin2d() +
  theme_bw()

# Bin size control + color palette
ggplot(rate, aes(x=rate1, y=rate3) ) +
  geom_bin2d(bins = 70) +
  scale_fill_continuous(type = "viridis") +
  theme_bw()

Kontur

ggplot(rate, aes(x=rate1, y=rate3) ) +
  geom_density_2d()

Area

ggplot(rate, aes(x=rate1, y=rate3) ) +
  stat_density_2d(aes(fill = ..level..), geom = "polygon")

Area dan Kontur

ggplot(rate, aes(x=rate1, y=rate3)) +
  stat_density_2d(aes(fill = ..level..), geom = "polygon", colour="white")

Bubble Plot

fig <- plot_ly(rate, x = ~rate1, y = ~rate3, type = 'scatter', mode = 'markers',
        marker = list(size = ~rate1, opacity = 0.3))
fig <- fig %>% layout(title = 'Tingkat Melek Huruf di Jawa Tengah',
         xaxis = list(showgrid = FALSE),
         yaxis = list(showgrid = FALSE))

fig

Local Regression (LOESS)

#DataFrame
df <- data.frame(x=rate2, y=rate3)

#view first six rows of data frame
head(df)
#fit several LOESS regression models to dataset
loess50 <- loess(y ~ x, data=df, span=.5)
smooth50 <- predict(loess50) 

loess75 <- loess(y ~ x, data=df, span=.75)
smooth75 <- predict(loess75) 

loess90 <- loess(y ~ x, data=df, span=.9)
smooth90 <- predict(loess90) 

#create scatterplot with each regression line overlaid
plot(df$x, df$y, pch=19, main='Loess Regression Models')
lines(smooth50, x=df$x, col='red')
lines(smooth75, x=df$x, col='purple')
lines(smooth90, x=df$x, col='blue')
legend('bottomright', legend=c('.5', '.75', '.9'),
        col=c('red', 'purple', 'blue'), pch=19, title='Smoothing Span')

#define k-fold cross validation method
ctrl <- trainControl(method = "cv", number = 5)
grid <- expand.grid(span = seq(0.5, 0.9, len = 5), degree = 1)

#perform cross-validation using smoothing spans ranginf from 0.5 to 0.9
model <- train(y ~ x, data = df, method = "gamLoess", tuneGrid=grid, trControl = ctrl)
## Loading required package: gam
## Warning: package 'gam' was built under R version 4.0.5
## Loading required package: splines
## Loading required package: foreach
## Warning: package 'foreach' was built under R version 4.0.5
## 
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
## 
##     accumulate, when
## Loaded gam 1.20
## Warning in gam.lo(data[["lo(x, span = 0.5, degree = 1)"]], z, w, span = 0.5, :
## eval 96.8
## Warning in gam.lo(data[["lo(x, span = 0.5, degree = 1)"]], z, w, span = 0.5, :
## upperlimit 96.402
## Warning in gam.lo(data[["lo(x, span = 0.5, degree = 1)"]], z, w, span = 0.5, :
## extrapolation not allowed with blending
## Warning in gam.lo(data[["lo(x, span = 0.6, degree = 1)"]], z, w, span = 0.6, :
## eval 96.8
## Warning in gam.lo(data[["lo(x, span = 0.6, degree = 1)"]], z, w, span = 0.6, :
## upperlimit 96.402
## Warning in gam.lo(data[["lo(x, span = 0.6, degree = 1)"]], z, w, span = 0.6, :
## extrapolation not allowed with blending
## Warning in gam.lo(data[["lo(x, span = 0.7, degree = 1)"]], z, w, span = 0.7, :
## eval 96.8
## Warning in gam.lo(data[["lo(x, span = 0.7, degree = 1)"]], z, w, span = 0.7, :
## upperlimit 96.402
## Warning in gam.lo(data[["lo(x, span = 0.7, degree = 1)"]], z, w, span = 0.7, :
## extrapolation not allowed with blending
## Warning in gam.lo(data[["lo(x, span = 0.8, degree = 1)"]], z, w, span = 0.8, :
## eval 96.8
## Warning in gam.lo(data[["lo(x, span = 0.8, degree = 1)"]], z, w, span = 0.8, :
## upperlimit 96.402
## Warning in gam.lo(data[["lo(x, span = 0.8, degree = 1)"]], z, w, span = 0.8, :
## extrapolation not allowed with blending
## Warning in gam.lo(data[["lo(x, span = 0.9, degree = 1)"]], z, w, span = 0.9, :
## eval 96.8
## Warning in gam.lo(data[["lo(x, span = 0.9, degree = 1)"]], z, w, span = 0.9, :
## upperlimit 96.402
## Warning in gam.lo(data[["lo(x, span = 0.9, degree = 1)"]], z, w, span = 0.9, :
## extrapolation not allowed with blending
## Warning in gam.lo(data[["lo(x, span = 0.5, degree = 1)"]], z, w, span = 0.5, :
## eval 81.84
## Warning in gam.lo(data[["lo(x, span = 0.5, degree = 1)"]], z, w, span = 0.5, :
## lowerlimit 81.956
## Warning in gam.lo(data[["lo(x, span = 0.5, degree = 1)"]], z, w, span = 0.5, :
## extrapolation not allowed with blending
## Warning in gam.lo(data[["lo(x, span = 0.6, degree = 1)"]], z, w, span = 0.6, :
## eval 81.84
## Warning in gam.lo(data[["lo(x, span = 0.6, degree = 1)"]], z, w, span = 0.6, :
## lowerlimit 81.956
## Warning in gam.lo(data[["lo(x, span = 0.6, degree = 1)"]], z, w, span = 0.6, :
## extrapolation not allowed with blending
## Warning in gam.lo(data[["lo(x, span = 0.7, degree = 1)"]], z, w, span = 0.7, :
## eval 81.84
## Warning in gam.lo(data[["lo(x, span = 0.7, degree = 1)"]], z, w, span = 0.7, :
## lowerlimit 81.956
## Warning in gam.lo(data[["lo(x, span = 0.7, degree = 1)"]], z, w, span = 0.7, :
## extrapolation not allowed with blending
## Warning in gam.lo(data[["lo(x, span = 0.8, degree = 1)"]], z, w, span = 0.8, :
## eval 81.84
## Warning in gam.lo(data[["lo(x, span = 0.8, degree = 1)"]], z, w, span = 0.8, :
## lowerlimit 81.956
## Warning in gam.lo(data[["lo(x, span = 0.8, degree = 1)"]], z, w, span = 0.8, :
## extrapolation not allowed with blending
## Warning in gam.lo(data[["lo(x, span = 0.9, degree = 1)"]], z, w, span = 0.9, :
## eval 81.84
## Warning in gam.lo(data[["lo(x, span = 0.9, degree = 1)"]], z, w, span = 0.9, :
## lowerlimit 81.956
## Warning in gam.lo(data[["lo(x, span = 0.9, degree = 1)"]], z, w, span = 0.9, :
## extrapolation not allowed with blending
#print results of k-fold cross-validation
print(model)
## Generalized Additive Model using LOESS 
## 
## 35 samples
##  1 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 28, 28, 27, 28, 29 
## Resampling results across tuning parameters:
## 
##   span  RMSE      Rsquared   MAE     
##   0.5   1.595631  0.6543796  1.103961
##   0.6   1.559878  0.6703389  1.090204
##   0.7   1.475092  0.7085617  1.052077
##   0.8   1.454115  0.7175444  1.041339
##   0.9   1.446529  0.7212316  1.041767
## 
## Tuning parameter 'degree' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were span = 0.9 and degree = 1.

Berdasarkan grafik LOESS Regression terlihat bahwa model regresi yang cukup mewakili model data adalah regresi lokal dengan span 0.9. Dilihat dari nilai RMSE, model terbaik adalah model yang memiliki nilai RMSE terkecil, dari output diatas nilai RMSE terkecil sebesar 1.495821 dengan span 0.9. Sehingga dapat disimpulkan bahwa baik berdasarkan grafik maupun uji validasi silang model regresi lokal terbaik adalah regresi lokal (LOESS) dengan span 0.9.