Prediksi Score PISA Matematika

Penjelasan

Menyusun Model Prediksi itu menantang dan menyenangkan, Permasalahan yang sering dihadapi adalah bagaimana prediksi itu bisa dihitung dan diperkirakan. Penjelasan melalui model “Black Box” bukanlah soal yang mudah, oleh karenanya kita butuh penjelasan bagaimana proses itu dihitung dengan gamblang.

Data

Kita coba dengan data hasil ujian PISA 2012 yang datanya bisa di download dan diinstall librarynya pada [Data Pisa] (https://github.com/pbiecek/PISA2012lite)

Untuk kasus ini dipilih : Indonesia, Malaysia, Thailand, Singapore, Vietnam, dan Cina

## 'data.frame':    18002 obs. of  14 variables:
##  $ CNT     : Factor w/ 5 levels "Indonesia","Malaysia",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ OECD    : Factor w/ 2 levels "Non-OECD","OECD": 1 1 1 1 1 1 1 1 1 1 ...
##  $ ST04Q01 : Factor w/ 2 levels "Female","Male": 1 2 1 1 1 1 1 2 1 1 ...
##  $ ST05Q01 : Factor w/ 3 levels "No  ","Yes, for one year or less ",..: 3 3 3 3 3 2 2 1 2 1 ...
##  $ ST08Q01 : Factor w/ 4 levels "None  ","One or two times  ",..: 1 1 1 1 1 2 2 1 2 3 ...
##  $ ST09Q01 : Factor w/ 4 levels "None  ","One or two times  ",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ST115Q01: num  1 1 1 1 1 2 1 1 2 2 ...
##  $ ST20Q01 : Factor w/ 2 levels "Country of test",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ST27Q01 : Factor w/ 4 levels "None","One","Two",..: 2 4 4 4 4 4 2 1 2 3 ...
##  $ ST27Q02 : Factor w/ 4 levels "None","One","Two",..: 1 4 4 3 4 2 2 1 2 2 ...
##  $ ST46Q04 : Factor w/ 4 levels "Strongly agree",..: 2 2 2 2 1 2 2 1 2 1 ...
##  $ WEALTH  : num  -1.87 0.05 -0.96 -1.41 -0.23 -1.05 -4.34 -4.34 -3.67 -3.16 ...
##  $ ISCEDO  : Factor w/ 4 levels "General","Pre-Vocational",..: 1 1 1 1 1 3 3 3 3 3 ...
##  $ PV1MATH : num  408 385 529 388 482 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:180945] 4 6 7 9 10 11 17 18 19 20 ...
##   .. ..- attr(*, "names")= chr [1:180945] "4" "6" "7" "9" ...

##         CNT             OECD         ST04Q01    
##  Indonesia:3535   Non-OECD:18002   Female:9596  
##  Malaysia :3262   OECD    :    0   Male  :8406  
##  Singapore:3648                                 
##  Thailand :4333                                 
##  Vietnam  :3224                                 
##                                                 
##                         ST05Q01                       ST08Q01     
##  No                         : 2823   None                 :13271  
##  Yes, for one year or less  : 3462   One or two times     : 3625  
##  Yes, for more than one year:11717   Three or four times  :  688  
##                                      Five or more times   :  418  
##                                                                   
##                                                                   
##                   ST09Q01         ST115Q01                ST20Q01     
##  None                 :15158   Min.   :1.000   Country of test:17420  
##  One or two times     : 2318   1st Qu.:1.000   Other country  :  582  
##  Three or four times  :  351   Median :1.000                          
##  Five or more times   :  175   Mean   :1.231                          
##                                3rd Qu.:1.000                          
##                                Max.   :4.000                          
##           ST27Q01               ST27Q02                  ST46Q04    
##  None         :  354   None         : 517   Strongly agree   :3158  
##  One          : 2261   One          :8681   Agree            :9440  
##  Two          : 3113   Two          :5739   Disagree         :4926  
##  Three or more:12274   Three or more:3065   Strongly disagree: 478  
##                                                                     
##                                                                     
##      WEALTH                  ISCEDO         PV1MATH     
##  Min.   :-6.650   General       :16223   Min.   :107.7  
##  1st Qu.:-2.300   Pre-Vocational:    0   1st Qu.:384.3  
##  Median :-1.320   Vocational    : 1757   Median :453.5  
##  Mean   :-1.369   Modular       :   22   Mean   :464.9  
##  3rd Qu.:-0.410                          3rd Qu.:538.1  
##  Max.   : 3.030                          Max.   :885.9

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

##     CNT    OECD ST04Q01 ST05Q01 ST08Q01 ST09Q01 ST20Q01 ST27Q01 ST27Q02 
##       5       1       2       3       4       4       2       4       4 
## ST46Q04  ISCEDO 
##       4       3

Tampilan Data dalam tabulasi

library(knitr)
library(dplyr)
pisa[1:10,] %>%
  knitr::kable(caption = "First 10 rows")

First 10 rows
	PV1MATH	CNT	ST04Q01	ST05Q01	ST08Q01	ST09Q01	ST115Q01	ST20Q01	ST27Q01	ST27Q02	ST46Q04	WEALTH	ISCEDO
221390	407.5479	Indonesia	Female	Yes, for more than one year	None	None	1	Country of test	One	None	Agree	-1.87	General
221392	385.1924	Indonesia	Male	Yes, for more than one year	None	None	1	Country of test	Three or more	Three or more	Agree	0.05	General
221394	528.9843	Indonesia	Female	Yes, for more than one year	None	None	1	Country of test	Three or more	Three or more	Agree	-0.96	General
221395	388.2303	Indonesia	Female	Yes, for more than one year	None	None	1	Country of test	Three or more	Two	Agree	-1.41	General
221396	482.3259	Indonesia	Female	Yes, for more than one year	None	None	1	Country of test	Three or more	Three or more	Strongly agree	-0.23	General
221397	263.6781	Indonesia	Female	Yes, for one year or less	One or two times	None	2	Country of test	Three or more	One	Agree	-1.05	Vocational
221398	368.7568	Indonesia	Female	Yes, for one year or less	One or two times	None	1	Country of test	One	One	Agree	-4.34	Vocational
221401	387.2176	Indonesia	Male	No	None	None	1	Country of test	None	None	Strongly agree	-4.34	Vocational
221403	430.6823	Indonesia	Female	Yes, for one year or less	One or two times	None	2	Country of test	One	One	Agree	-3.67	Vocational
221404	413.2341	Indonesia	Female	No	Three or four times	None	2	Country of test	Two	One	Strongly agree	-3.16	Vocational

Tampilan Norma

library(OneR)
library(ggplot2)
pisa$PIV1MATH.l <- bin(pisa$PV1MATH, nbins = 3, method = "content")
intervals <- paste(levels(pisa$PIV1MATH.l), collapse = " ")
intervals <- gsub("\\(|]", "", intervals)
intervals <- gsub(",", " ", intervals)
intervals <- as.numeric(unique(strsplit(intervals, " ")[[1]]))
pisa %>%
  ggplot() +
  geom_density(aes(x = PV1MATH), color = "blue", fill = "blue", alpha = 0.4) +
  geom_vline(xintercept = intervals[2]) +
  geom_vline(xintercept = intervals[3])

intervals[2]

## [1] 407

intervals[3]

## [1] 505

Tampilan Grafik perbandingan Negara

pisa <- select(pisa, -PV1MATH) %>%
  mutate(PIV1MATH.l = plyr::revalue(PIV1MATH.l, c("(107,407]" = "low", "(407,505]" = "medium", "(505,887]" = "high")))

pisa %>%
  ggplot(aes(x = CNT, fill = PIV1MATH.l)) +
  geom_bar(position = "dodge", alpha = 0.7) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
        plot.margin = unit(c(0, 0, 0, 1.5), "cm")) +
  scale_fill_brewer(palette = "Set1")

str(pisa)

## 'data.frame':    18002 obs. of  13 variables:
##  $ CNT       : Factor w/ 5 levels "Indonesia","Malaysia",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ST04Q01   : Factor w/ 2 levels "Female","Male": 1 2 1 1 1 1 1 2 1 1 ...
##  $ ST05Q01   : Factor w/ 3 levels "No  ","Yes, for one year or less ",..: 3 3 3 3 3 2 2 1 2 1 ...
##  $ ST08Q01   : Factor w/ 4 levels "None  ","One or two times  ",..: 1 1 1 1 1 2 2 1 2 3 ...
##  $ ST09Q01   : Factor w/ 4 levels "None  ","One or two times  ",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ST115Q01  : num  1 1 1 1 1 2 1 1 2 2 ...
##  $ ST20Q01   : Factor w/ 2 levels "Country of test",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ST27Q01   : Factor w/ 4 levels "None","One","Two",..: 2 4 4 4 4 4 2 1 2 3 ...
##  $ ST27Q02   : Factor w/ 4 levels "None","One","Two",..: 1 4 4 3 4 2 2 1 2 2 ...
##  $ ST46Q04   : Factor w/ 4 levels "Strongly agree",..: 2 2 2 2 1 2 2 1 2 1 ...
##  $ WEALTH    : num  -1.87 0.05 -0.96 -1.41 -0.23 -1.05 -4.34 -4.34 -3.67 -3.16 ...
##  $ ISCEDO    : Factor w/ 4 levels "General","Pre-Vocational",..: 1 1 1 1 1 3 3 3 3 3 ...
##  $ PIV1MATH.l: Factor w/ 3 levels "low","medium",..: 2 1 3 1 2 1 1 1 2 2 ...

Menguji dengan Model Regresi Linier

library(caret)

## Loading required package: lattice

library(foreach)
set.seed(42)
index <- createDataPartition(pisa$PIV1MATH.l, p = 0.7, list = FALSE)
train_data <- pisa[index, ]
test_data  <- pisa[-index, ]

#linier Model
model <- lm(PIV1MATH.l~., data = train_data)

## Warning in model.response(mf, "numeric"): using type = "numeric" with a
## factor response will be ignored

## Warning in Ops.factor(y, z$residuals): '-' not meaningful for factors

Mulai Memprediksi

Kita mencoba memprediksi untuk siswa dengan data urutan ke 4500

library(breakDown)
library(ggplot2)
new_observation <- test_data[4500,]
explain_1 <- broken(model, new_observation)
explain_1

##                                      contribution
## (Intercept)                               2.0e+00
## CNT = Vietnam                             4.6e-01
## ST05Q01 = Yes, for one year or less      -7.0e-02
## ST08Q01 = None                            4.2e-02
## ST09Q01 = None                            3.3e-02
## ST27Q01 = Three or more                   1.9e-02
## ISCEDO = General                          1.6e-02
## ST04Q01 = Male                            1.2e-02
## WEALTH = -1.32                            5.4e-03
## ST46Q04 = Disagree                        4.3e-03
## ST27Q02 = Two                            -3.4e-03
## ST20Q01 = Country of test                -2.8e-03
## ST115Q01 = 1                              7.7e-05
## final_prognosis                           2.5e+00
## baseline:  0

plot(explain_1) + ggtitle("breakDown plot for random forest predictors of PISA Score")

print(new_observation)

##           CNT ST04Q01                    ST05Q01 ST08Q01 ST09Q01 ST115Q01
## 15028 Vietnam    Male Yes, for one year or less   None    None          1
##               ST20Q01       ST27Q01 ST27Q02  ST46Q04 WEALTH  ISCEDO
## 15028 Country of test Three or more     Two Disagree  -1.32 General
##       PIV1MATH.l
## 15028       high

Dari perkiraan model score akan jatuh di level tinggi = 2.5 , sedangkan data nyatanya ke 4500 nyatanya ada di tingkat + High tinggi juga artinya model dan prediksi dan data sudah cocok

Kita mencoba memprediksi untuk siswa dengan data urutan ke 750

new_observation <- pisa[750,]
explain_1 <- broken(model, new_observation)
explain_1

##                                       contribution
## (Intercept)                                2.0e+00
## CNT = Indonesia                           -3.7e-01
## WEALTH = -2.75                            -1.8e-01
## ST08Q01 = One or two times                -1.0e-01
## ST05Q01 = Yes, for more than one year      7.5e-02
## ST09Q01 = None                             3.3e-02
## ST27Q01 = Two                             -2.9e-02
## ISCEDO = General                           1.6e-02
## ST46Q04 = Strongly agree                   1.3e-02
## ST04Q01 = Male                             1.2e-02
## ST20Q01 = Country of test                 -2.8e-03
## ST27Q02 = One                             -2.2e-03
## ST115Q01 = 1                               7.7e-05
## final_prognosis                            1.5e+00
## baseline:  0

plot(explain_1) + ggtitle("breakDown plot for linear predictors of PISA SCore")

print(new_observation)

##           CNT ST04Q01                     ST05Q01            ST08Q01
## 750 Indonesia    Male Yes, for more than one year One or two times  
##     ST09Q01 ST115Q01         ST20Q01 ST27Q01 ST27Q02        ST46Q04 WEALTH
## 750  None          1 Country of test     Two     One Strongly agree  -2.75
##      ISCEDO PIV1MATH.l
## 750 General        low

Dari model perkiraan dengan data yang ada diprediksi di level rendah, sedangkan angka nyatanya juga di level rendah. terbukti lagi model prediksinya sudah cocok.

Prediksi Score PISA Matematika

Heru Wiryanto

December 6, 2017

Penjelasan

Data

Menguji dengan Model Regresi Linier

Mulai Memprediksi