Data yang akurat sangat dibutuhkan untuk pengambilan kebijakan. Salah satunya yaitu data mengenai karakteristik sosial-demografi penduduk. Berbagai macam kebijakan dapat dilandasi oleh data tersebut.
Katakanlah semisal pemerintah ingin memberikan bantuan kepada penduduk dengan income di bawah nominal tertentu. Data yang dimiliki adalah karakteristik sosio-demografi seperti usia, pekerjaan, status perkawinan, pendidikan, jenis kelamin, ras, jam kerja, dll. Harapannya pemerintah dapat mengotomatisasi penggolongan income berdasarkan karakteristik yang dimiliki penduduk.
Berikut ini adalah data yang tersedia. Sebelum melakukan pemodelan lebih lanjut, hal yang perlu dilakukan adalah eksplorasi data. Jika dalam eksplorasi tersebut ditemukan beberapa ‘kesalahan’ atau ‘keanehan’ dalam data, tentu harus kita tindak lanjut dengan cleaning, imputing, dan beberapa treatment lain.
library(readr)
library(skimr)
library(dplyr)
raw=read_csv("D:/1. DATA MINING/income_evaluation.csv")
head(raw)
## # A tibble: 6 x 15
## age workclass fnlwgt educa~1 educa~2 marit~3 occup~4 relat~5 race sex
## <dbl> <chr> <dbl> <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 39 State-gov 77516 Bachel~ 13 Never-~ Adm-cl~ Not-in~ White Male
## 2 50 Self-emp-not~ 83311 Bachel~ 13 Marrie~ Exec-m~ Husband White Male
## 3 38 Private 215646 HS-grad 9 Divorc~ Handle~ Not-in~ White Male
## 4 53 Private 234721 11th 7 Marrie~ Handle~ Husband Black Male
## 5 28 Private 338409 Bachel~ 13 Marrie~ Prof-s~ Wife Black Fema~
## 6 37 Private 284582 Masters 14 Marrie~ Exec-m~ Wife White Fema~
## # ... with 5 more variables: `capital-gain` <dbl>, `capital-loss` <dbl>,
## # `hours-per-week` <dbl>, `native-country` <chr>, income <chr>, and
## # abbreviated variable names 1: education, 2: `education-num`,
## # 3: `marital-status`, 4: occupation, 5: relationship
Variabel finalweight tidak diperlukan dalam analisis, maka dapat kita buang
data=data.frame(raw[,-3])
skim(data)
| Name | data |
| Number of rows | 32561 |
| Number of columns | 14 |
| _______________________ | |
| Column type frequency: | |
| character | 9 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| workclass | 0 | 1 | 1 | 16 | 0 | 9 | 0 |
| education | 0 | 1 | 3 | 12 | 0 | 16 | 0 |
| marital.status | 0 | 1 | 7 | 21 | 0 | 7 | 0 |
| occupation | 0 | 1 | 1 | 17 | 0 | 15 | 0 |
| relationship | 0 | 1 | 4 | 14 | 0 | 6 | 0 |
| race | 0 | 1 | 5 | 18 | 0 | 5 | 0 |
| sex | 0 | 1 | 4 | 6 | 0 | 2 | 0 |
| native.country | 0 | 1 | 1 | 26 | 0 | 42 | 0 |
| income | 0 | 1 | 4 | 5 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 0 | 1 | 38.58 | 13.64 | 17 | 28 | 37 | 48 | 90 | ▇▇▅▂▁ |
| education.num | 0 | 1 | 10.08 | 2.57 | 1 | 9 | 10 | 12 | 16 | ▁▁▇▃▁ |
| capital.gain | 0 | 1 | 1077.65 | 7385.29 | 0 | 0 | 0 | 0 | 99999 | ▇▁▁▁▁ |
| capital.loss | 0 | 1 | 87.30 | 402.96 | 0 | 0 | 0 | 0 | 4356 | ▇▁▁▁▁ |
| hours.per.week | 0 | 1 | 40.44 | 12.35 | 1 | 40 | 40 | 45 | 99 | ▁▇▃▁▁ |
Dari summary di atas tampak bahwa tidak ada missing value (NA). Namun, jika kita amati lebih dalam ada beberapa tuple yang tercatat “?” dan tidak sesuai ketentuan atribut.
sort(unique(data$workclass))
## [1] "?" "Federal-gov" "Local-gov" "Never-worked"
## [5] "Private" "Self-emp-inc" "Self-emp-not-inc" "State-gov"
## [9] "Without-pay"
sort(unique(data$occupation))
## [1] "?" "Adm-clerical" "Armed-Forces"
## [4] "Craft-repair" "Exec-managerial" "Farming-fishing"
## [7] "Handlers-cleaners" "Machine-op-inspct" "Other-service"
## [10] "Priv-house-serv" "Prof-specialty" "Protective-serv"
## [13] "Sales" "Tech-support" "Transport-moving"
sort(unique(data$native.country))
## [1] "?" "Cambodia"
## [3] "Canada" "China"
## [5] "Columbia" "Cuba"
## [7] "Dominican-Republic" "Ecuador"
## [9] "El-Salvador" "England"
## [11] "France" "Germany"
## [13] "Greece" "Guatemala"
## [15] "Haiti" "Holand-Netherlands"
## [17] "Honduras" "Hong"
## [19] "Hungary" "India"
## [21] "Iran" "Ireland"
## [23] "Italy" "Jamaica"
## [25] "Japan" "Laos"
## [27] "Mexico" "Nicaragua"
## [29] "Outlying-US(Guam-USVI-etc)" "Peru"
## [31] "Philippines" "Poland"
## [33] "Portugal" "Puerto-Rico"
## [35] "Scotland" "South"
## [37] "Taiwan" "Thailand"
## [39] "Trinadad&Tobago" "United-States"
## [41] "Vietnam" "Yugoslavia"
Nilai “?” akan kita anggap sebagai NA dan selanjutnya kita lakukan imputasi.
data=replace(data,data=="?",NA)
Ada beberapa atribut yang tipe datanya masih berupa character. Berikut cara mengubah tipe character menjadi factor
data2 <- as.data.frame(unclass(data), stringsAsFactors = TRUE)
str(data2)
## 'data.frame': 32561 obs. of 14 variables:
## $ age : num 39 50 38 53 28 37 49 52 31 42 ...
## $ workclass : Factor w/ 8 levels "Federal-gov",..: 7 6 4 4 4 4 4 6 4 4 ...
## $ education : Factor w/ 16 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
## $ education.num : num 13 13 9 7 13 14 5 9 14 13 ...
## $ marital.status: Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
## $ occupation : Factor w/ 14 levels "Adm-clerical",..: 1 4 6 6 10 4 8 4 10 4 ...
## $ relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
## $ race : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
## $ capital.gain : num 2174 0 0 0 0 ...
## $ capital.loss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hours.per.week: num 40 13 40 40 40 40 16 45 50 40 ...
## $ native.country: Factor w/ 41 levels "Cambodia","Canada",..: 39 39 39 39 5 39 23 39 39 39 ...
## $ income : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...
summary(data2$workclass)
## Federal-gov Local-gov Never-worked Private
## 960 2093 7 22696
## Self-emp-inc Self-emp-not-inc State-gov Without-pay
## 1116 2541 1298 14
## NA's
## 1836
summary(data2$occupation)
## Adm-clerical Armed-Forces Craft-repair Exec-managerial
## 3770 9 4099 4066
## Farming-fishing Handlers-cleaners Machine-op-inspct Other-service
## 994 1370 2002 3295
## Priv-house-serv Prof-specialty Protective-serv Sales
## 149 4140 649 3650
## Tech-support Transport-moving NA's
## 928 1597 1843
summary(data2$native.country)
## Cambodia Canada
## 19 121
## China Columbia
## 75 59
## Cuba Dominican-Republic
## 95 70
## Ecuador El-Salvador
## 28 106
## England France
## 90 29
## Germany Greece
## 137 29
## Guatemala Haiti
## 64 44
## Holand-Netherlands Honduras
## 1 13
## Hong Hungary
## 20 13
## India Iran
## 100 43
## Ireland Italy
## 24 73
## Jamaica Japan
## 81 62
## Laos Mexico
## 18 643
## Nicaragua Outlying-US(Guam-USVI-etc)
## 34 14
## Peru Philippines
## 31 198
## Poland Portugal
## 60 37
## Puerto-Rico Scotland
## 114 12
## South Taiwan
## 80 51
## Thailand Trinadad&Tobago
## 18 19
## United-States Vietnam
## 29170 67
## Yugoslavia NA's
## 16 583
Hasil summary di atas menunjukkan bahwa karakter “?” sudah menjadi NA
Berikut ini, visualisasi data yang masih mengandung NA
counts2 <- table(data2$income, data2$workclass)
par(mar=c(4,7,2,3))
barplot(counts2, main="Income by Workclass",
xlab="Workclass", col=c("Grey","Pink"),
legend = rownames(counts2),cex.names = 0.8,horiz = T,las=1)
Pada grafik di atas, terlihat bahwa sebagian besar penduduk bekerja pada workclass jenis private. Bila dibagi menurut kategori income, hampir semua jenis workclass cukup timpang kecuali “self-emp-inc” dengan jumlah kategori income >50K nya banyak.
counts4 <- table(data2$income, data2$marital.status)
par(mar=c(4,7,2,3))
barplot(counts4, main="Income by Marital Status",
xlab="Marital Status", col=c("Grey","Pink"),
legend = rownames(counts4),cex.names = 0.7,cex.axis = 0.8, horiz = T, las=1)
Jika berdasarkan marital status, terlihat bahwa rasio income yang tidak terlalu timpang berada di kategori married-civ-spouse.
counts5 <- table(data2$income, data2$occupation)
par(mar=c(4,7,2,1))
barplot(counts5, main="Income by Occupation",
xlab="Occupation", col=c("Grey","Pink"),
legend = rownames(counts5),cex.names =0.8,horiz = T, las=1)
Berdasarkan occupation, terlihat bahwa kategori prof-speciality dan exec-managerial memiliki jumlah penduduk dengan income >50K yang cukup banyak dibanding kategori lain.
counts6 <- table(data2$income, data2$relationship)
par(mar=c(4,7,2,1))
barplot(counts6, main="Income by Relationship",
xlab="Relationship", col=c("Grey","Pink"),
legend = rownames(counts6),cex.names =0.8,horiz = T, las=1)
Berdasarkan relationship, kategori own-child dan other relative cenderung memiliki income di bawah 50K. Sementara kategori husband cenderung memiliki income >50K.
counts <- table(data2$income, data2$race)
par(mar=c(4,7,2,2))
barplot(counts, main="Income by Races",
xlab="Races", col=c("Grey","Pink"),
legend = rownames(counts),cex.names = 0.8,horiz = T,las=1)
Jika berdasarkan race, tidak terlalu tampak perbedaan antara ras yang satu dengan yang lain untuk mengklasifikasikan income seseorang. Ini mengindikasikan bahwa korelasi race dengan income kecil.
counts7 <- table(data2$income, data2$sex)
barplot(counts7, main="Income by Sex",
xlab="Sex", col=c("Grey","Pink"),
legend = rownames(counts7))
Jika dilihat dari sex, rasio ketimpangan income pada female lebih tinggi dibandingkan male.
counts3 <- table(data2$income, data2$education.num)
barplot(counts3, main="Income by Education",
xlab="Education", col=c("Grey","Pink"),
legend = rownames(counts3))
Terlihat dari grafik di atas, mulai dari education kategori ke-13 (Bachelors) sampai ke-16 (Doctorate) rasio ketimpangan income semakin mengecil. Ini mengindikasikan adanya korelasi positif antara education dengan income.
counts8 <- table(data2$income, data2$native.country)
par(mar=c(4,6,2,2))
barplot(counts8, main="Income by Native Country",
xlab="Native Country", col=c("Grey","Pink"),
legend = rownames(counts8),horiz = T,las=1,cex.names = 0.5,cex.axis = 0.7)
Perbedaan ketimpangan income antar native country tidak dapat terlihat secara jelas. Hampir semua kategori native country memiliki ketimpangan income yang cukup besar.
library(ggplot2)
ggplot(data2, aes(x = age)) +xlab("Age")+
geom_histogram(aes(color = income, fill = income),
position = "identity", bins = 30, alpha = 0.4) +
scale_color_manual(values = c("#00AFBB", "red")) +
scale_fill_manual(values = c("#00AFBB", "Red"))
Berdasarkan histogram di atas, terlihat perbedaan sebaran age dari kedua kategori income. Age dari penduduk dengan income <=50K tersebar pada usia sekitar 30 ke bawah. Sementara penduduk dengan income >50K mengelompok pada usia 35 ke atas sampai 50-an.
ggplot(data2, aes(x = capital.gain)) +xlab("Capital Gain")+
geom_histogram(aes(color = income, fill = income),
position = "identity", bins = 30, alpha = 0.4) +
scale_color_manual(values = c("#00AFBB", "red")) +
scale_fill_manual(values = c("#00AFBB", "Red"))
Jika ditinjau dari nilai capital gain, terdapat penduduk dengan income >50K yang memiliki capital gain mencapai 100000. Sementara penduduk dengan income rendah, capital gainnya hanya berkisar di bawah angka 25000.
ggplot(data2, aes(x = capital.loss)) +xlab("Capital Loss")+
geom_histogram(aes(color = income, fill = income),
position = "identity", bins = 30, alpha = 0.4) +
scale_color_manual(values = c("#00AFBB", "red")) +
scale_fill_manual(values = c("#00AFBB", "Red"))
Pada grafik di atas, perbedaan antara kedua kategori income tidak terlalu terlihat. Keduanya sama-sama mengumpul di angka sekitar 0 dan 2000.
ggplot(data2, aes(x = hours.per.week)) + xlab("Hours per Week")+
geom_histogram(aes(color = income, fill = income),
position = "identity", bins = 30, alpha = 0.4) +
scale_color_manual(values = c("#00AFBB", "red")) +
scale_fill_manual(values = c("#00AFBB", "Red"))
Nilai hours per week penduduk dengan income >50K juga tampak menyebar di angka yang lebih besar dibanding penduduk dengan income <=50K.
library(reshape2)
cordata <- data.matrix(data2)
cormat <- round(cor(cordata, method = "pearson"),2)
melted_cormat <- melt(cormat)
# Peroleh segitiga bawah dari matriks korelasi
get_lower_tri<-function(cormat){
cormat[upper.tri(cormat)] <- NA
return(cormat)
}
# Peroleh segitiga atas dari matriks korelasi
get_upper_tri <- function(cormat){
cormat[lower.tri(cormat)]<- NA
return(cormat)
}
upper_tri <- get_upper_tri(cormat)
# Atur matriks korelasi
melted_cormat <- melt(upper_tri, na.rm = TRUE)
# Buat ggheatmap
ggheatmap <- ggplot(melted_cormat, aes(Var2, Var1, fill = value))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal()+ # minimal theme
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 9, hjust = 1))+
coord_fixed()
ggheatmap +
geom_text(aes(Var2, Var1, label = value), color = "black", size = 2) +
theme(
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.major = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
axis.ticks = element_blank(),
legend.justification = c(1, 0),
legend.position = c(0.6, 0.7),
legend.direction = "horizontal")+
guides(fill = guide_colorbar(barwidth = 7, barheight = 1,
title.position = "top", title.hjust = 0.5))
Pada korelasi plot di atas, tampak bahwa beberapa atribut yang kosong. Ini dikarenakan masih adanya missing value. Namun, secara umum atribut yang memiliki korelasi cukup besar terhadap income adalah education numerik dan relationship.
Ada hal yang menarik dari plot korelasi di atas, korelasi education.num dengan income berbeda jauh dibanding education dengan income. Hal ini disebabkan education.num merupakan level factor yang dijadikan numerik. Level ini sudah bertipe ordinal, artinya mengurutkan dari level preschool sampai doctorate. Berbeda dengan education. Apabila kita jadikan education as numeric, angka tersebut masih acak dan tidak mencerminkan tingkatan pendidikan (bukan ordinal). Sehingga, korelasi yang dihasilkan pun kecil.
Selanjutnya, yang kita gunakan adalah variabel education.num yang kita jadikan factor.
Seperti yang telah dipaparkan sebelumnya, variabel education.num lebih merepresentasikan tingkat pendidikan penduduk. Oleh karena itu, variabel education dapat kita buang. Sementara variabel education.num kita ubah namanya dan menjadikannya factor.
data3=select(data2,-education)%>%rename(education=education.num)
data3$education=as.factor(data3$education)
Pada plot matriks korelasi sebelumnya, tampak bahwa beberapa atribut/variabel kosong yang disebabkan oleh adanya missing value. Oleh karena itu, perlu kita lakukan imputasi. Imputasi yang dilakukan yaitu dengan metode knn dengan nilai k=5.
library(VIM)
data3 = kNN(data3, k=5, imp_var = FALSE)
sum(is.na(data3))
## [1] 0
Output di atas menunjukkan bahwa jumlah missing value dari data kita sudah 0.
Agar prediksi kita tidak condong ke suatu kategori tertentu, sebaiknya kategori tersebut tidak-lah timpang. Dalam melakukan oversampling, data yang kita miliki harus diubah tipenya dengan function unclass().
data4=data3
data4[, c("workclass",'education','marital.status','occupation','relationship','race','sex','native.country','income')] =sapply(data3[, c('workclass','education','marital.status','occupation','relationship','race','sex','native.country','income')], unclass)
abbrev_x <- c("","<=50K","",">50K","")
ggplot(data4, aes(x = income)) +
geom_bar(fill = "lavender") +
scale_x_continuous(breaks = seq(0.5,2.5,by=0.5), labels = abbrev_x) +
ggtitle("Income Category") +
theme(legend.position="none")+
theme_dark()
Plot di atas menunjukkan bahwa kategori income <=50K jauh lebih banyak observasinya dibandingkan income >50K.
library(imbalance)
imbalanceRatio(data4, classAttr = "income")
## [1] 0.3171926
Rasio antara income tinggi dan rendah mencapai 0.31. Artinya, tergolong ke dalam mild imbalance data.
set.seed(999)
smote <- oversample(data4, ratio = 0.95, method = "SMOTE", classAttr = "income")
abbrev_x <- c("","<=50K","",">50K","")
ggplot(smote, aes(x = income)) +
geom_bar(fill = "lavender") +
scale_x_continuous(breaks = seq(0.5,2.5,by=0.5), labels = abbrev_x) +
ggtitle("Income Category") +
theme_dark()
imbalanceRatio(smote, classAttr = "income")
## [1] 0.95
Setelah dilakukan oversampling dengan metode SMOTE, tampak bahwa rasio kategori income sudah seimbang yakni 95%.
set.seed(999)
dataADASYN <- oversample(data4, method = "ADASYN", classAttr = "income")
abbrev_x <- c("","<=50K","",">50K","")
ggplot(dataADASYN, aes(x = income)) +
geom_bar(fill = "lavender") +
scale_x_continuous(breaks = seq(0.5,2.5,by=0.5), labels = abbrev_x) +
ggtitle("Income Category") +
theme_dark()
Jika kita menggunakan metode adasyn, kelas mayoritas menjadi berubah yang semula <=50K menjadi >50K. Oleh karena itu, untuk langkah selanjutnya kita menggunakan data hasil SMOTE.
Catatan: Sebelumnya telah dicoba menggunakan metode MWMOTE. Namun, computing resource tidak cukup untuk memprosesnya sampai akhir.
Setelah dilakukan imputasi dan oversampling, mari kita lihat plot matriks korelasinya.
library(reshape2)
cordata <- data.matrix(smote)
cormat <- round(cor(cordata, method = "pearson"),2)
melted_cormat <- melt(cormat)
# Peroleh segitiga bawah dari matriks korelasi
get_lower_tri<-function(cormat){
cormat[upper.tri(cormat)] <- NA
return(cormat)
}
# Peroleh segitiga atas dari matriks korelasi
get_upper_tri <- function(cormat){
cormat[lower.tri(cormat)]<- NA
return(cormat)
}
upper_tri <- get_upper_tri(cormat)
# Atur matriks korelasi
melted_cormat <- melt(upper_tri, na.rm = TRUE)
# Buat ggheatmap
ggheatmap <- ggplot(melted_cormat, aes(Var2, Var1, fill = value))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal()+ # minimal theme
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 9, hjust = 1))+
coord_fixed()
ggheatmap +
geom_text(aes(Var2, Var1, label = value), color = "black", size = 2) +
theme(
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.major = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
axis.ticks = element_blank(),
legend.justification = c(1, 0),
legend.position = c(0.6, 0.7),
legend.direction = "horizontal")+
guides(fill = guide_colorbar(barwidth = 7, barheight = 1,
title.position = "top", title.hjust = 0.5))
Terlihat bahwa variabel yang memiliki korelasi terbesar terhadap income adalah education, relationship, age, marital status, dan hours per week.
Terlebih dahulu kita kembalikan tipe data seperti semula.
fix=smote
fix$workclass=as.factor(fix$workclass)
fix$education=as.factor(fix$education)
fix$marital.status=as.factor(fix$marital.status)
fix$occupation=as.factor(fix$occupation)
fix$relationship=as.factor(fix$relationship)
fix$race=as.factor(fix$race)
fix$sex=as.factor(fix$sex)
fix$native.country=as.factor(fix$native.country)
fix$income=factor(fix$income,levels = c(1, 2),labels = c("<=50K", ">50K"))
Kita bagi 80% data untuk training membuat model dan 20% untuk testing evaluasi model.
library(caTools)
set.seed(999)
sample <- sample.split(fix$income, SplitRatio = 0.8)
train <- subset(fix, sample == TRUE)
test <- subset(fix, sample == FALSE)
prop.table(table(train$income))
##
## <=50K >50K
## 0.5128232 0.4871768
prop.table(table(test$income))
##
## <=50K >50K
## 0.5128099 0.4871901
Proporsi kelas antara data train dan test sudah sama, sehingga kita bisa melakukan pemodelan.
library(caret)
set.seed(999)
# Tentukan cross validation berulang dengan 10 folds/lipatan dan satu pengulangan
evaluationSetting <- trainControl(method='repeatedcv',
number=10,
repeats=1,
# Evaluate performance using
# the following function summaryFunction
summaryFunction = multiClassSummary)
metric <- "Accuracy"
library(rpart)
library(rpart.plot)
fit=rpart(income~.,data=train,method = 'class')
rpart.plot(fit,extra =100)
Didapatkan decision tree seperti pada gambar di atas. Decision tree di atas dapat diartikan sebagai berikut:
Berikut kode dari variabel-variabel kategorik. Kode tidak ditampilkan di decision tree agar terlihat ringkas dan bisa dibaca.
DT_Model <- caret::train(factor(income)~.,
data=smote,
method="rpart",
metric=metric,
tuneGrid = expand.grid(cp = seq(0, 0.1, 0.001)),
parms = list(split='information'),
trControl=evaluationSetting)
prediksi=predict(fit,test,type = 'class')
confusionMatrix(prediksi,test$income,mode = "everything",positive = "<=50K")
## Confusion Matrix and Statistics
##
## Reference
## Prediction <=50K >50K
## <=50K 3939 695
## >50K 1005 4002
##
## Accuracy : 0.8237
## 95% CI : (0.8159, 0.8312)
## No Information Rate : 0.5128
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6477
##
## Mcnemar's Test P-Value : 6.663e-14
##
## Sensitivity : 0.7967
## Specificity : 0.8520
## Pos Pred Value : 0.8500
## Neg Pred Value : 0.7993
## Precision : 0.8500
## Recall : 0.7967
## F1 : 0.8225
## Prevalence : 0.5128
## Detection Rate : 0.4086
## Detection Prevalence : 0.4807
## Balanced Accuracy : 0.8244
##
## 'Positive' Class : <=50K
##
Berdasarkan confusion matrix di atas, diperoleh informasi sebagai berikut:
print(DT_Model)
## CART
##
## 48204 samples
## 12 predictor
## 2 classes: '1', '2'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 43384, 43384, 43383, 43384, 43384, 43384, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa F1 Sensitivity Specificity
## 0.000 0.8562775 0.7125397 0.8582327 0.8484223 0.8645451
## 0.001 0.8343705 0.6688606 0.8353084 0.8192557 0.8502805
## 0.002 0.8260100 0.6520592 0.8277936 0.8154531 0.8371219
## 0.003 0.8248484 0.6497640 0.8263465 0.8127427 0.8375897
## 0.004 0.8247239 0.6495931 0.8254739 0.8084547 0.8418486
## 0.005 0.8239978 0.6481822 0.8243739 0.8055421 0.8434242
## 0.006 0.8234378 0.6470957 0.8235004 0.8031553 0.8447868
## 0.007 0.8230851 0.6464605 0.8224570 0.7991100 0.8483211
## 0.008 0.8230851 0.6464605 0.8224570 0.7991100 0.8483211
## 0.009 0.8205543 0.6413580 0.8203616 0.7991100 0.8431263
## 0.010 0.8205543 0.6413580 0.8203616 0.7991100 0.8431263
## 0.011 0.8198074 0.6399424 0.8188744 0.7943770 0.8465753
## 0.012 0.8193303 0.6390300 0.8180049 0.7917880 0.8483215
## 0.013 0.8022981 0.6058125 0.7920111 0.7387945 0.8691458
## 0.014 0.8009913 0.6032453 0.7902374 0.7353560 0.8700819
## 0.015 0.7997880 0.6009505 0.7878920 0.7286812 0.8746380
## 0.016 0.7795413 0.5617273 0.7536983 0.6579693 0.9075115
## 0.017 0.7795413 0.5617273 0.7536983 0.6579693 0.9075115
## 0.018 0.7742099 0.5510022 0.7493001 0.6580906 0.8964407
## 0.019 0.7742099 0.5510022 0.7493001 0.6580906 0.8964407
## 0.020 0.7742099 0.5510022 0.7493001 0.6580906 0.8964407
## 0.021 0.7742099 0.5510022 0.7493001 0.6580906 0.8964407
## 0.022 0.7742099 0.5510022 0.7493001 0.6580906 0.8964407
## 0.023 0.7742099 0.5510022 0.7493001 0.6580906 0.8964407
## 0.024 0.7742099 0.5510022 0.7493001 0.6580906 0.8964407
## 0.025 0.7742099 0.5510022 0.7493001 0.6580906 0.8964407
## 0.026 0.7742099 0.5510022 0.7493001 0.6580906 0.8964407
## 0.027 0.7742099 0.5510022 0.7493001 0.6580906 0.8964407
## 0.028 0.7698531 0.5420933 0.7472525 0.6637136 0.8815769
## 0.029 0.7679029 0.5380928 0.7464806 0.6667476 0.8743793
## 0.030 0.7665961 0.5353863 0.7462762 0.6700647 0.8682065
## 0.031 0.7665961 0.5353863 0.7462762 0.6700647 0.8682065
## 0.032 0.7665961 0.5353863 0.7462762 0.6700647 0.8682065
## 0.033 0.7665961 0.5353863 0.7462762 0.6700647 0.8682065
## 0.034 0.7665961 0.5353863 0.7462762 0.6700647 0.8682065
## 0.035 0.7665961 0.5353863 0.7462762 0.6700647 0.8682065
## 0.036 0.7665961 0.5353863 0.7462762 0.6700647 0.8682065
## 0.037 0.7665961 0.5353863 0.7462762 0.6700647 0.8682065
## 0.038 0.7665961 0.5353863 0.7462762 0.6700647 0.8682065
## 0.039 0.7665961 0.5353863 0.7462762 0.6700647 0.8682065
## 0.040 0.7665961 0.5353863 0.7462762 0.6700647 0.8682065
## 0.041 0.7665961 0.5353863 0.7462762 0.6700647 0.8682065
## 0.042 0.7665961 0.5353863 0.7462762 0.6700647 0.8682065
## 0.043 0.7665961 0.5353863 0.7462762 0.6700647 0.8682065
## 0.044 0.7665961 0.5353863 0.7462762 0.6700647 0.8682065
## 0.045 0.7665961 0.5353863 0.7462762 0.6700647 0.8682065
## 0.046 0.7665961 0.5353863 0.7462762 0.6700647 0.8682065
## 0.047 0.7665961 0.5353863 0.7462762 0.6700647 0.8682065
## 0.048 0.7665961 0.5353863 0.7462762 0.6700647 0.8682065
## 0.049 0.7665961 0.5353863 0.7462762 0.6700647 0.8682065
## 0.050 0.7665961 0.5353863 0.7462762 0.6700647 0.8682065
## 0.051 0.7665961 0.5353863 0.7462762 0.6700647 0.8682065
## 0.052 0.7568248 0.5156379 0.7388148 0.6703883 0.8478084
## 0.053 0.7568248 0.5156379 0.7388148 0.6703883 0.8478084
## 0.054 0.7568248 0.5156379 0.7388148 0.6703883 0.8478084
## 0.055 0.7568248 0.5156379 0.7388148 0.6703883 0.8478084
## 0.056 0.7568248 0.5156379 0.7388148 0.6703883 0.8478084
## 0.057 0.7568248 0.5156379 0.7388148 0.6703883 0.8478084
## 0.058 0.7568248 0.5156379 0.7388148 0.6703883 0.8478084
## 0.059 0.7568248 0.5156379 0.7388148 0.6703883 0.8478084
## 0.060 0.7568248 0.5156379 0.7388148 0.6703883 0.8478084
## 0.061 0.7568248 0.5156379 0.7388148 0.6703883 0.8478084
## 0.062 0.7568248 0.5156379 0.7388148 0.6703883 0.8478084
## 0.063 0.7568248 0.5156379 0.7388148 0.6703883 0.8478084
## 0.064 0.7568248 0.5156379 0.7388148 0.6703883 0.8478084
## 0.065 0.7568248 0.5156379 0.7388148 0.6703883 0.8478084
## 0.066 0.7568248 0.5156379 0.7388148 0.6703883 0.8478084
## 0.067 0.7568248 0.5156379 0.7388148 0.6703883 0.8478084
## 0.068 0.7568248 0.5156379 0.7388148 0.6703883 0.8478084
## 0.069 0.7568248 0.5156379 0.7388148 0.6703883 0.8478084
## 0.070 0.7568248 0.5156379 0.7388148 0.6703883 0.8478084
## 0.071 0.7568248 0.5156379 0.7388148 0.6703883 0.8478084
## 0.072 0.7568248 0.5156379 0.7388148 0.6703883 0.8478084
## 0.073 0.7568248 0.5156379 0.7388148 0.6703883 0.8478084
## 0.074 0.7568248 0.5156379 0.7388148 0.6703883 0.8478084
## 0.075 0.7568248 0.5156379 0.7388148 0.6703883 0.8478084
## 0.076 0.7568248 0.5156379 0.7388148 0.6703883 0.8478084
## 0.077 0.7568248 0.5156379 0.7388148 0.6703883 0.8478084
## 0.078 0.7568248 0.5156379 0.7388148 0.6703883 0.8478084
## 0.079 0.7568248 0.5156379 0.7388148 0.6703883 0.8478084
## 0.080 0.7568248 0.5156379 0.7388148 0.6703883 0.8478084
## 0.081 0.7568248 0.5156379 0.7388148 0.6703883 0.8478084
## 0.082 0.7568248 0.5156379 0.7388148 0.6703883 0.8478084
## 0.083 0.7568248 0.5156379 0.7388148 0.6703883 0.8478084
## 0.084 0.7568248 0.5156379 0.7388148 0.6703883 0.8478084
## 0.085 0.7568248 0.5156379 0.7388148 0.6703883 0.8478084
## 0.086 0.7568248 0.5156379 0.7388148 0.6703883 0.8478084
## 0.087 0.7568248 0.5156379 0.7388148 0.6703883 0.8478084
## 0.088 0.7568248 0.5156379 0.7388148 0.6703883 0.8478084
## 0.089 0.7568248 0.5156379 0.7388148 0.6703883 0.8478084
## 0.090 0.7568248 0.5156379 0.7388148 0.6703883 0.8478084
## 0.091 0.7568248 0.5156379 0.7388148 0.6703883 0.8478084
## 0.092 0.7568248 0.5156379 0.7388148 0.6703883 0.8478084
## 0.093 0.7568248 0.5156379 0.7388148 0.6703883 0.8478084
## 0.094 0.7568248 0.5156379 0.7388148 0.6703883 0.8478084
## 0.095 0.7568248 0.5156379 0.7388148 0.6703883 0.8478084
## 0.096 0.7568248 0.5156379 0.7388148 0.6703883 0.8478084
## 0.097 0.7568248 0.5156379 0.7388148 0.6703883 0.8478084
## 0.098 0.7568248 0.5156379 0.7388148 0.6703883 0.8478084
## 0.099 0.7568248 0.5156379 0.7388148 0.6703883 0.8478084
## 0.100 0.7568248 0.5156379 0.7388148 0.6703883 0.8478084
## Pos_Pred_Value Neg_Pred_Value Precision Recall Detection_Rate
## 0.8683497 0.8442789 0.8683497 0.8484223 0.4350886
## 0.8521321 0.8172837 0.8521321 0.8192557 0.4201311
## 0.8409452 0.8118745 0.8409452 0.8154531 0.4181811
## 0.8411194 0.8098956 0.8411194 0.8127427 0.4167914
## 0.8439462 0.8071755 0.8439462 0.8084547 0.4145922
## 0.8446729 0.8049865 0.8446729 0.8055421 0.4130986
## 0.8453892 0.8032411 0.8453892 0.8031553 0.4118747
## 0.8477913 0.8007311 0.8477913 0.7991100 0.4098002
## 0.8477913 0.8007311 0.8477913 0.7991100 0.4098002
## 0.8433188 0.7997586 0.8433188 0.7991100 0.4098002
## 0.8433188 0.7997586 0.8433188 0.7991100 0.4098002
## 0.8453066 0.7965748 0.8453066 0.7943770 0.4073730
## 0.8462539 0.7947951 0.8462539 0.7917880 0.4060452
## 0.8580322 0.7620985 0.8580322 0.7387945 0.3788684
## 0.8583422 0.7597321 0.8583422 0.7353560 0.3771052
## 0.8614370 0.7557096 0.8614370 0.7286812 0.3736823
## 0.8822097 0.7160387 0.8822097 0.6579693 0.3374203
## 0.8822097 0.7160387 0.8822097 0.6579693 0.3374203
## 0.8699171 0.7135696 0.8699171 0.6580906 0.3374826
## 0.8699171 0.7135696 0.8699171 0.6580906 0.3374826
## 0.8699171 0.7135696 0.8699171 0.6580906 0.3374826
## 0.8699171 0.7135696 0.8699171 0.6580906 0.3374826
## 0.8699171 0.7135696 0.8699171 0.6580906 0.3374826
## 0.8699171 0.7135696 0.8699171 0.6580906 0.3374826
## 0.8699171 0.7135696 0.8699171 0.6580906 0.3374826
## 0.8699171 0.7135696 0.8699171 0.6580906 0.3374826
## 0.8699171 0.7135696 0.8699171 0.6580906 0.3374826
## 0.8699171 0.7135696 0.8699171 0.6580906 0.3374826
## 0.8562163 0.7137265 0.8562163 0.6637136 0.3403664
## 0.8496514 0.7139919 0.8496514 0.6667476 0.3419224
## 0.8439817 0.7147048 0.8439817 0.6700647 0.3436233
## 0.8439817 0.7147048 0.8439817 0.6700647 0.3436233
## 0.8439817 0.7147048 0.8439817 0.6700647 0.3436233
## 0.8439817 0.7147048 0.8439817 0.6700647 0.3436233
## 0.8439817 0.7147048 0.8439817 0.6700647 0.3436233
## 0.8439817 0.7147048 0.8439817 0.6700647 0.3436233
## 0.8439817 0.7147048 0.8439817 0.6700647 0.3436233
## 0.8439817 0.7147048 0.8439817 0.6700647 0.3436233
## 0.8439817 0.7147048 0.8439817 0.6700647 0.3436233
## 0.8439817 0.7147048 0.8439817 0.6700647 0.3436233
## 0.8439817 0.7147048 0.8439817 0.6700647 0.3436233
## 0.8439817 0.7147048 0.8439817 0.6700647 0.3436233
## 0.8439817 0.7147048 0.8439817 0.6700647 0.3436233
## 0.8439817 0.7147048 0.8439817 0.6700647 0.3436233
## 0.8439817 0.7147048 0.8439817 0.6700647 0.3436233
## 0.8439817 0.7147048 0.8439817 0.6700647 0.3436233
## 0.8439817 0.7147048 0.8439817 0.6700647 0.3436233
## 0.8439817 0.7147048 0.8439817 0.6700647 0.3436233
## 0.8439817 0.7147048 0.8439817 0.6700647 0.3436233
## 0.8439817 0.7147048 0.8439817 0.6700647 0.3436233
## 0.8439817 0.7147048 0.8439817 0.6700647 0.3436233
## 0.8439817 0.7147048 0.8439817 0.6700647 0.3436233
## 0.8265741 0.7098501 0.8265741 0.6703883 0.3437892
## 0.8265741 0.7098501 0.8265741 0.6703883 0.3437892
## 0.8265741 0.7098501 0.8265741 0.6703883 0.3437892
## 0.8265741 0.7098501 0.8265741 0.6703883 0.3437892
## 0.8265741 0.7098501 0.8265741 0.6703883 0.3437892
## 0.8265741 0.7098501 0.8265741 0.6703883 0.3437892
## 0.8265741 0.7098501 0.8265741 0.6703883 0.3437892
## 0.8265741 0.7098501 0.8265741 0.6703883 0.3437892
## 0.8265741 0.7098501 0.8265741 0.6703883 0.3437892
## 0.8265741 0.7098501 0.8265741 0.6703883 0.3437892
## 0.8265741 0.7098501 0.8265741 0.6703883 0.3437892
## 0.8265741 0.7098501 0.8265741 0.6703883 0.3437892
## 0.8265741 0.7098501 0.8265741 0.6703883 0.3437892
## 0.8265741 0.7098501 0.8265741 0.6703883 0.3437892
## 0.8265741 0.7098501 0.8265741 0.6703883 0.3437892
## 0.8265741 0.7098501 0.8265741 0.6703883 0.3437892
## 0.8265741 0.7098501 0.8265741 0.6703883 0.3437892
## 0.8265741 0.7098501 0.8265741 0.6703883 0.3437892
## 0.8265741 0.7098501 0.8265741 0.6703883 0.3437892
## 0.8265741 0.7098501 0.8265741 0.6703883 0.3437892
## 0.8265741 0.7098501 0.8265741 0.6703883 0.3437892
## 0.8265741 0.7098501 0.8265741 0.6703883 0.3437892
## 0.8265741 0.7098501 0.8265741 0.6703883 0.3437892
## 0.8265741 0.7098501 0.8265741 0.6703883 0.3437892
## 0.8265741 0.7098501 0.8265741 0.6703883 0.3437892
## 0.8265741 0.7098501 0.8265741 0.6703883 0.3437892
## 0.8265741 0.7098501 0.8265741 0.6703883 0.3437892
## 0.8265741 0.7098501 0.8265741 0.6703883 0.3437892
## 0.8265741 0.7098501 0.8265741 0.6703883 0.3437892
## 0.8265741 0.7098501 0.8265741 0.6703883 0.3437892
## 0.8265741 0.7098501 0.8265741 0.6703883 0.3437892
## 0.8265741 0.7098501 0.8265741 0.6703883 0.3437892
## 0.8265741 0.7098501 0.8265741 0.6703883 0.3437892
## 0.8265741 0.7098501 0.8265741 0.6703883 0.3437892
## 0.8265741 0.7098501 0.8265741 0.6703883 0.3437892
## 0.8265741 0.7098501 0.8265741 0.6703883 0.3437892
## 0.8265741 0.7098501 0.8265741 0.6703883 0.3437892
## 0.8265741 0.7098501 0.8265741 0.6703883 0.3437892
## 0.8265741 0.7098501 0.8265741 0.6703883 0.3437892
## 0.8265741 0.7098501 0.8265741 0.6703883 0.3437892
## 0.8265741 0.7098501 0.8265741 0.6703883 0.3437892
## 0.8265741 0.7098501 0.8265741 0.6703883 0.3437892
## 0.8265741 0.7098501 0.8265741 0.6703883 0.3437892
## 0.8265741 0.7098501 0.8265741 0.6703883 0.3437892
## 0.8265741 0.7098501 0.8265741 0.6703883 0.3437892
## 0.8265741 0.7098501 0.8265741 0.6703883 0.3437892
## 0.8265741 0.7098501 0.8265741 0.6703883 0.3437892
## 0.8265741 0.7098501 0.8265741 0.6703883 0.3437892
## 0.8265741 0.7098501 0.8265741 0.6703883 0.3437892
## Balanced_Accuracy
## 0.8564837
## 0.8347681
## 0.8262875
## 0.8251662
## 0.8251517
## 0.8244832
## 0.8239710
## 0.8237156
## 0.8237156
## 0.8211182
## 0.8211182
## 0.8204762
## 0.8200548
## 0.8039701
## 0.8027189
## 0.8016596
## 0.7827404
## 0.7827404
## 0.7772656
## 0.7772656
## 0.7772656
## 0.7772656
## 0.7772656
## 0.7772656
## 0.7772656
## 0.7772656
## 0.7772656
## 0.7772656
## 0.7726453
## 0.7705634
## 0.7691356
## 0.7691356
## 0.7691356
## 0.7691356
## 0.7691356
## 0.7691356
## 0.7691356
## 0.7691356
## 0.7691356
## 0.7691356
## 0.7691356
## 0.7691356
## 0.7691356
## 0.7691356
## 0.7691356
## 0.7691356
## 0.7691356
## 0.7691356
## 0.7691356
## 0.7691356
## 0.7691356
## 0.7691356
## 0.7590984
## 0.7590984
## 0.7590984
## 0.7590984
## 0.7590984
## 0.7590984
## 0.7590984
## 0.7590984
## 0.7590984
## 0.7590984
## 0.7590984
## 0.7590984
## 0.7590984
## 0.7590984
## 0.7590984
## 0.7590984
## 0.7590984
## 0.7590984
## 0.7590984
## 0.7590984
## 0.7590984
## 0.7590984
## 0.7590984
## 0.7590984
## 0.7590984
## 0.7590984
## 0.7590984
## 0.7590984
## 0.7590984
## 0.7590984
## 0.7590984
## 0.7590984
## 0.7590984
## 0.7590984
## 0.7590984
## 0.7590984
## 0.7590984
## 0.7590984
## 0.7590984
## 0.7590984
## 0.7590984
## 0.7590984
## 0.7590984
## 0.7590984
## 0.7590984
## 0.7590984
## 0.7590984
## 0.7590984
## 0.7590984
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.
Berdasarkan output di atas, didapatkan model optimal dengan nilai akurasi 85,68%, F1 score 85,90%, sensitivity 85,04%, dan spesificity 86,36%