Penjelasan

Menyusun Model Prediksi itu menantang dan menyenangkan, Permasalahan yang sering dihadapi adalah bagaimana prediksi itu bisa dihitung dan diperkirakan. Penjelasan melalui model “Black Box” bukanlah soal yang mudah, oleh karenanya kita butuh penjelasan bagaimana proses itu dihitung dengan gamblang.

Data

Kita coba dengan data hasil ujian PISA 2012 yang datanya bisa di download dan diinstall librarynya pada [Data Pisa] (https://github.com/pbiecek/PISA2012lite)

Untuk kasus ini dipilih : Indonesia, Malaysia, Thailand, Singapore, Vietnam, dan Cina

## 'data.frame':    18002 obs. of  14 variables:
##  $ CNT     : Factor w/ 5 levels "Indonesia","Malaysia",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ OECD    : Factor w/ 2 levels "Non-OECD","OECD": 1 1 1 1 1 1 1 1 1 1 ...
##  $ ST04Q01 : Factor w/ 2 levels "Female","Male": 1 2 1 1 1 1 1 2 1 1 ...
##  $ ST05Q01 : Factor w/ 3 levels "No  ","Yes, for one year or less ",..: 3 3 3 3 3 2 2 1 2 1 ...
##  $ ST08Q01 : Factor w/ 4 levels "None  ","One or two times  ",..: 1 1 1 1 1 2 2 1 2 3 ...
##  $ ST09Q01 : Factor w/ 4 levels "None  ","One or two times  ",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ST115Q01: num  1 1 1 1 1 2 1 1 2 2 ...
##  $ ST20Q01 : Factor w/ 2 levels "Country of test",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ST27Q01 : Factor w/ 4 levels "None","One","Two",..: 2 4 4 4 4 4 2 1 2 3 ...
##  $ ST27Q02 : Factor w/ 4 levels "None","One","Two",..: 1 4 4 3 4 2 2 1 2 2 ...
##  $ ST46Q04 : Factor w/ 4 levels "Strongly agree",..: 2 2 2 2 1 2 2 1 2 1 ...
##  $ WEALTH  : num  -1.87 0.05 -0.96 -1.41 -0.23 -1.05 -4.34 -4.34 -3.67 -3.16 ...
##  $ ISCEDO  : Factor w/ 4 levels "General","Pre-Vocational",..: 1 1 1 1 1 3 3 3 3 3 ...
##  $ PV1MATH : num  408 385 529 388 482 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:180945] 4 6 7 9 10 11 17 18 19 20 ...
##   .. ..- attr(*, "names")= chr [1:180945] "4" "6" "7" "9" ...
##         CNT             OECD         ST04Q01    
##  Indonesia:3535   Non-OECD:18002   Female:9596  
##  Malaysia :3262   OECD    :    0   Male  :8406  
##  Singapore:3648                                 
##  Thailand :4333                                 
##  Vietnam  :3224                                 
##                                                 
##                         ST05Q01                       ST08Q01     
##  No                         : 2823   None                 :13271  
##  Yes, for one year or less  : 3462   One or two times     : 3625  
##  Yes, for more than one year:11717   Three or four times  :  688  
##                                      Five or more times   :  418  
##                                                                   
##                                                                   
##                   ST09Q01         ST115Q01                ST20Q01     
##  None                 :15158   Min.   :1.000   Country of test:17420  
##  One or two times     : 2318   1st Qu.:1.000   Other country  :  582  
##  Three or four times  :  351   Median :1.000                          
##  Five or more times   :  175   Mean   :1.231                          
##                                3rd Qu.:1.000                          
##                                Max.   :4.000                          
##           ST27Q01               ST27Q02                  ST46Q04    
##  None         :  354   None         : 517   Strongly agree   :3158  
##  One          : 2261   One          :8681   Agree            :9440  
##  Two          : 3113   Two          :5739   Disagree         :4926  
##  Three or more:12274   Three or more:3065   Strongly disagree: 478  
##                                                                     
##                                                                     
##      WEALTH                  ISCEDO         PV1MATH     
##  Min.   :-6.650   General       :16223   Min.   :107.7  
##  1st Qu.:-2.300   Pre-Vocational:    0   1st Qu.:384.3  
##  Median :-1.320   Vocational    : 1757   Median :453.5  
##  Mean   :-1.369   Modular       :   22   Mean   :464.9  
##  3rd Qu.:-0.410                          3rd Qu.:538.1  
##  Max.   : 3.030                          Max.   :885.9
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
##     CNT    OECD ST04Q01 ST05Q01 ST08Q01 ST09Q01 ST20Q01 ST27Q01 ST27Q02 
##       5       1       2       3       4       4       2       4       4 
## ST46Q04  ISCEDO 
##       4       3

Tampilan Data dalam tabulasi

library(knitr)
library(dplyr)
pisa[1:10,] %>%
  knitr::kable(caption = "First 10 rows")
First 10 rows
PV1MATH CNT ST04Q01 ST05Q01 ST08Q01 ST09Q01 ST115Q01 ST20Q01 ST27Q01 ST27Q02 ST46Q04 WEALTH ISCEDO
221390 407.5479 Indonesia Female Yes, for more than one year None None 1 Country of test One None Agree -1.87 General
221392 385.1924 Indonesia Male Yes, for more than one year None None 1 Country of test Three or more Three or more Agree 0.05 General
221394 528.9843 Indonesia Female Yes, for more than one year None None 1 Country of test Three or more Three or more Agree -0.96 General
221395 388.2303 Indonesia Female Yes, for more than one year None None 1 Country of test Three or more Two Agree -1.41 General
221396 482.3259 Indonesia Female Yes, for more than one year None None 1 Country of test Three or more Three or more Strongly agree -0.23 General
221397 263.6781 Indonesia Female Yes, for one year or less One or two times None 2 Country of test Three or more One Agree -1.05 Vocational
221398 368.7568 Indonesia Female Yes, for one year or less One or two times None 1 Country of test One One Agree -4.34 Vocational
221401 387.2176 Indonesia Male No None None 1 Country of test None None Strongly agree -4.34 Vocational
221403 430.6823 Indonesia Female Yes, for one year or less One or two times None 2 Country of test One One Agree -3.67 Vocational
221404 413.2341 Indonesia Female No Three or four times None 2 Country of test Two One Strongly agree -3.16 Vocational

Tampilan Norma

library(OneR)
library(ggplot2)
pisa$PIV1MATH.l <- bin(pisa$PV1MATH, nbins = 3, method = "content")
intervals <- paste(levels(pisa$PIV1MATH.l), collapse = " ")
intervals <- gsub("\\(|]", "", intervals)
intervals <- gsub(",", " ", intervals)
intervals <- as.numeric(unique(strsplit(intervals, " ")[[1]]))
pisa %>%
  ggplot() +
  geom_density(aes(x = PV1MATH), color = "blue", fill = "blue", alpha = 0.4) +
  geom_vline(xintercept = intervals[2]) +
  geom_vline(xintercept = intervals[3])

intervals[2]
## [1] 407
intervals[3]
## [1] 505

Tampilan Grafik perbandingan Negara

pisa <- select(pisa, -PV1MATH) %>%
  mutate(PIV1MATH.l = plyr::revalue(PIV1MATH.l, c("(107,407]" = "low", "(407,505]" = "medium", "(505,887]" = "high")))

pisa %>%
  ggplot(aes(x = CNT, fill = PIV1MATH.l)) +
  geom_bar(position = "dodge", alpha = 0.7) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
        plot.margin = unit(c(0, 0, 0, 1.5), "cm")) +
  scale_fill_brewer(palette = "Set1")

str(pisa)
## 'data.frame':    18002 obs. of  13 variables:
##  $ CNT       : Factor w/ 5 levels "Indonesia","Malaysia",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ST04Q01   : Factor w/ 2 levels "Female","Male": 1 2 1 1 1 1 1 2 1 1 ...
##  $ ST05Q01   : Factor w/ 3 levels "No  ","Yes, for one year or less ",..: 3 3 3 3 3 2 2 1 2 1 ...
##  $ ST08Q01   : Factor w/ 4 levels "None  ","One or two times  ",..: 1 1 1 1 1 2 2 1 2 3 ...
##  $ ST09Q01   : Factor w/ 4 levels "None  ","One or two times  ",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ST115Q01  : num  1 1 1 1 1 2 1 1 2 2 ...
##  $ ST20Q01   : Factor w/ 2 levels "Country of test",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ST27Q01   : Factor w/ 4 levels "None","One","Two",..: 2 4 4 4 4 4 2 1 2 3 ...
##  $ ST27Q02   : Factor w/ 4 levels "None","One","Two",..: 1 4 4 3 4 2 2 1 2 2 ...
##  $ ST46Q04   : Factor w/ 4 levels "Strongly agree",..: 2 2 2 2 1 2 2 1 2 1 ...
##  $ WEALTH    : num  -1.87 0.05 -0.96 -1.41 -0.23 -1.05 -4.34 -4.34 -3.67 -3.16 ...
##  $ ISCEDO    : Factor w/ 4 levels "General","Pre-Vocational",..: 1 1 1 1 1 3 3 3 3 3 ...
##  $ PIV1MATH.l: Factor w/ 3 levels "low","medium",..: 2 1 3 1 2 1 1 1 2 2 ...

Menguji dengan Model Regresi Linier

library(caret)
## Loading required package: lattice
library(foreach)
set.seed(42)
index <- createDataPartition(pisa$PIV1MATH.l, p = 0.7, list = FALSE)
train_data <- pisa[index, ]
test_data  <- pisa[-index, ]

#linier Model
model <- lm(PIV1MATH.l~., data = train_data)
## Warning in model.response(mf, "numeric"): using type = "numeric" with a
## factor response will be ignored
## Warning in Ops.factor(y, z$residuals): '-' not meaningful for factors

Mulai Memprediksi

Kita mencoba memprediksi untuk siswa dengan data urutan ke 4500

library(breakDown)
library(ggplot2)
new_observation <- test_data[4500,]
explain_1 <- broken(model, new_observation)
explain_1
##                                      contribution
## (Intercept)                               2.0e+00
## CNT = Vietnam                             4.6e-01
## ST05Q01 = Yes, for one year or less      -7.0e-02
## ST08Q01 = None                            4.2e-02
## ST09Q01 = None                            3.3e-02
## ST27Q01 = Three or more                   1.9e-02
## ISCEDO = General                          1.6e-02
## ST04Q01 = Male                            1.2e-02
## WEALTH = -1.32                            5.4e-03
## ST46Q04 = Disagree                        4.3e-03
## ST27Q02 = Two                            -3.4e-03
## ST20Q01 = Country of test                -2.8e-03
## ST115Q01 = 1                              7.7e-05
## final_prognosis                           2.5e+00
## baseline:  0
plot(explain_1) + ggtitle("breakDown plot for random forest predictors of PISA Score")

print(new_observation)
##           CNT ST04Q01                    ST05Q01 ST08Q01 ST09Q01 ST115Q01
## 15028 Vietnam    Male Yes, for one year or less   None    None          1
##               ST20Q01       ST27Q01 ST27Q02  ST46Q04 WEALTH  ISCEDO
## 15028 Country of test Three or more     Two Disagree  -1.32 General
##       PIV1MATH.l
## 15028       high

Dari perkiraan model score akan jatuh di level tinggi = 2.5 , sedangkan data nyatanya ke 4500 nyatanya ada di tingkat + High tinggi juga artinya model dan prediksi dan data sudah cocok

Kita mencoba memprediksi untuk siswa dengan data urutan ke 750

new_observation <- pisa[750,]
explain_1 <- broken(model, new_observation)
explain_1
##                                       contribution
## (Intercept)                                2.0e+00
## CNT = Indonesia                           -3.7e-01
## WEALTH = -2.75                            -1.8e-01
## ST08Q01 = One or two times                -1.0e-01
## ST05Q01 = Yes, for more than one year      7.5e-02
## ST09Q01 = None                             3.3e-02
## ST27Q01 = Two                             -2.9e-02
## ISCEDO = General                           1.6e-02
## ST46Q04 = Strongly agree                   1.3e-02
## ST04Q01 = Male                             1.2e-02
## ST20Q01 = Country of test                 -2.8e-03
## ST27Q02 = One                             -2.2e-03
## ST115Q01 = 1                               7.7e-05
## final_prognosis                            1.5e+00
## baseline:  0
plot(explain_1) + ggtitle("breakDown plot for linear predictors of PISA SCore")

print(new_observation)
##           CNT ST04Q01                     ST05Q01            ST08Q01
## 750 Indonesia    Male Yes, for more than one year One or two times  
##     ST09Q01 ST115Q01         ST20Q01 ST27Q01 ST27Q02        ST46Q04 WEALTH
## 750  None          1 Country of test     Two     One Strongly agree  -2.75
##      ISCEDO PIV1MATH.l
## 750 General        low

Dari model perkiraan dengan data yang ada diprediksi di level rendah, sedangkan angka nyatanya juga di level rendah. terbukti lagi model prediksinya sudah cocok.