MikalaiD
19 June 2016
Earlier today you saw the mock forecast of whether Martin’s characters will die or not. The initial hypothesis was that the author simply gets rid of some characters in every new book, creating artificial turnover to keep readers’ attention.
P.S.: Thanks to Kaggle for the dataset.
First of all, I loaded the necessary libraries, dplyr and caret, and then loaded the data.
library(dplyr)
library(caret)
# The file is comma-separated, so read.csv() is the direct fit
data<-read.csv("./character-deaths.csv", header = TRUE)
head(data)
## Name Allegiances Death.Year Book.of.Death
## 1 Addam Marbrand Lannister NA NA
## 2 Aegon Frey (Jinglebell) None 299 3
## 3 Aegon Targaryen House Targaryen NA NA
## 4 Adrack Humble House Greyjoy 300 5
## 5 Aemon Costayne Lannister NA NA
## 6 Aemon Estermont Baratheon NA NA
## Death.Chapter Book.Intro.Chapter Gender Nobility GoT CoK SoS FfC DwD
## 1 NA 56 1 1 1 1 1 1 0
## 2 51 49 1 1 0 0 1 0 0
## 3 NA 5 1 1 0 0 0 0 1
## 4 20 20 1 1 0 0 0 0 1
## 5 NA NA 1 1 0 0 1 0 0
## 6 NA NA 1 1 0 1 1 0 0
The raw data includes the number of the book in which a character died, or NA if he/she is still alive. To transform it into something more straightforward, I wrote the simplest function possible.
dead<-function(x){
if(is.na(x)){
"Alive"
} else {
"Dead"
}
}
Using dplyr, I transformed the data into a clean dataset that can be used for the forecast.
data<-data%>%dplyr::select(Name, Allegiances, Book.of.Death, Nobility, GoT, CoK, SoS, FfC, DwD) %>%
mutate(Dead=as.factor(sapply(Book.of.Death, dead)), litAge=GoT+CoK+SoS+FfC+DwD)%>%
mutate(Allegiances=as.factor(gsub("House ", "", Allegiances)))%>%
dplyr::select(Name, Allegiances, Nobility, litAge, Dead)
So, in the final clean version I have the following variables:
names(data)
## [1] "Name" "Allegiances" "Nobility" "litAge" "Dead"
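The recoding done with `dead()` and `sapply()` above can also be written in vectorized form. Here is a minimal sketch in base R on a made-up toy data frame. The `toy` rows are hypothetical and not taken from the Kaggle file; only the column names match the real dataset.

```r
# Hypothetical toy data with the same column names as the real dataset
toy <- data.frame(
  Name = c("A", "B", "C"),
  Book.of.Death = c(NA, 3, 5),
  GoT = c(1, 0, 0), CoK = c(1, 0, 0), SoS = c(1, 1, 0),
  FfC = c(1, 0, 0), DwD = c(0, 0, 1)
)

# ifelse() is vectorized, so no sapply() over single values is needed
toy$Dead <- factor(ifelse(is.na(toy$Book.of.Death), "Alive", "Dead"),
                   levels = c("Alive", "Dead"))

# litAge = number of books in which the character appears
toy$litAge <- rowSums(toy[, c("GoT", "CoK", "SoS", "FfC", "DwD")])

print(toy[, c("Name", "litAge", "Dead")])
```

Fixing the factor levels explicitly keeps "Alive" as the first level even if a subset happens to contain only dead characters.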
For the forecast I’ve used the random forest method. The confusion matrix (computed here on the training data) shows that the forecast is still very bad. Nevertheless, other methods did not give any better results (nor did combining several of them). So, please don’t be sad if my shiny app said your favorite character will die - probably he won’t :)
inTr<-createDataPartition(data$Dead, p=0.75, list=FALSE)
train<-data[inTr,]
test<-data[-inTr,]
model<-train(Dead~Nobility+Allegiances+litAge, data=train, method="rf")
prediction<-predict(model, train)
confusionMatrix(prediction,train$Dead)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Alive Dead
## Alive 448 211
## Dead 10 20
##
## Accuracy : 0.6792
## 95% CI : (0.643, 0.714)
## No Information Rate : 0.6647
## P-Value [Acc > NIR] : 0.2222
##
## Kappa : 0.0825
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.97817
## Specificity : 0.08658
## Pos Pred Value : 0.67982
## Neg Pred Value : 0.66667
## Prevalence : 0.66473
## Detection Rate : 0.65022
## Detection Prevalence : 0.95646
## Balanced Accuracy : 0.53237
##
## 'Positive' Class : Alive
##
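To see why this result counts as bad, it helps to recompute the key statistics by hand from the four counts in the matrix above. This is a small base R sketch, not part of the original analysis. The variable names are my own, and "Alive" is treated as the positive class, as in the caret output.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference; matrix() fills column by column)
cm <- matrix(c(448, 10, 211, 20), nrow = 2,
             dimnames = list(Prediction = c("Alive", "Dead"),
                             Reference  = c("Alive", "Dead")))

accuracy    <- sum(diag(cm)) / sum(cm)                 # correct predictions / all cases
nir         <- max(colSums(cm)) / sum(cm)              # always guessing the majority class
sensitivity <- cm["Alive", "Alive"] / sum(cm[, "Alive"])  # Alive characters found
specificity <- cm["Dead", "Dead"] / sum(cm[, "Dead"])     # Dead characters found
balanced    <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, nir = nir, sensitivity = sensitivity,
              specificity = specificity, balanced = balanced), 4))
```

These values should reproduce the figures reported by `confusionMatrix()`. The accuracy is barely above the no-information rate, and only 20 of the 231 dead characters are recognised as dead, so the model is doing little more than predicting "Alive" for almost everyone.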