Developing Data Products

MikalaiD

19 June 2016

Is forecasting applicable to literature?

Earlier today you saw the mock forecast of whether Martin’s characters will die or not. The initial hypothesis was that the author simply gets rid of some characters in every new book to create artificial turnover and keep readers’ attention.

P.S.: Thanks to Kaggle for the dataset.

Data loading

First of all, I loaded the necessary libraries, dplyr and caret, and then loaded the data.

library(dplyr)
library(caret)

# The file is comma-separated, so read.csv (sep = ",", dec = ".") is the natural reader
data <- read.csv("./character-deaths.csv", header = TRUE)

head(data)
##                      Name     Allegiances Death.Year Book.of.Death
## 1          Addam Marbrand       Lannister         NA            NA
## 2 Aegon Frey (Jinglebell)            None        299             3
## 3         Aegon Targaryen House Targaryen         NA            NA
## 4           Adrack Humble   House Greyjoy        300             5
## 5          Aemon Costayne       Lannister         NA            NA
## 6         Aemon Estermont       Baratheon         NA            NA
##   Death.Chapter Book.Intro.Chapter Gender Nobility GoT CoK SoS FfC DwD
## 1            NA                 56      1        1   1   1   1   1   0
## 2            51                 49      1        1   0   0   1   0   0
## 3            NA                  5      1        1   0   0   0   0   1
## 4            20                 20      1        1   0   0   0   0   1
## 5            NA                 NA      1        1   0   0   1   0   0
## 6            NA                 NA      1        1   0   1   1   0   0

Writing function

The raw data includes the number of the book in which a character died, or NA if he/she is still alive. To transform it into something more straightforward, I wrote the simplest function possible.

dead<-function(x){
        if(is.na(x)){
                "Alive"
        } else {
                "Dead"
        }
}
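Because dead() handles one value at a time, it has to be applied element by element (with sapply, as in the next step). As a minimal sketch of an alternative, the same labelling can be done on a whole vector at once with base R's ifelse. The vector book_of_death below is a made-up example, not data from the file.

```r
# Hypothetical input: book of death for four characters (NA = still alive)
book_of_death <- c(NA, 3, NA, 5)

# Vectorized equivalent of dead(): label every element in one call
status <- ifelse(is.na(book_of_death), "Alive", "Dead")

print(status)
```

With this form the sapply call becomes unnecessary, although the result for each value is the same as with dead().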

Transforming data

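Using dplyr, I transformed the data into a clean dataset that can be used for the forecast. Each character gets a Dead label and a litAge value, which is the number of books he/she appears in; the "House " prefix is also removed from Allegiances so that, for example, "House Targaryen" and "Targaryen" become the same level.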

data<-data%>%dplyr::select(Name, Allegiances, Book.of.Death, Nobility, GoT, CoK, SoS, FfC, DwD) %>%
        mutate(Dead=as.factor(sapply(Book.of.Death, dead)), litAge=GoT+CoK+SoS+FfC+DwD)%>%
        mutate(Allegiances=as.factor(gsub("House ", "", Allegiances)))%>%
        dplyr::select(Name, Allegiances, Nobility, litAge, Dead)

So, in the final clean version I have the following variables:

names(data)
## [1] "Name"        "Allegiances" "Nobility"    "litAge"      "Dead"
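To make the litAge variable concrete: in the pipeline above it is the sum of the five book indicator columns (GoT, CoK, SoS, FfC, DwD), i.e. the number of books a character appears in. The sketch below reproduces that sum in base R on two made-up rows; the data frame toy and its values are assumptions for illustration only.

```r
# Hypothetical toy rows; the values are invented, not taken from the dataset
toy <- data.frame(
  Name = c("Character A", "Character B"),
  GoT = c(1, 0),
  CoK = c(1, 0),
  SoS = c(1, 1),
  FfC = c(1, 0),
  DwD = c(0, 0)
)

# litAge = number of books in which the character appears
toy$litAge <- with(toy, GoT + CoK + SoS + FfC + DwD)

print(toy[, c("Name", "litAge")])
```

Here Character A appears in four books and Character B in one, so litAge is 4 and 1 respectively.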

Forecast

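For the forecast I used the random forest method. The confusion matrix shows that the forecast is still very poor: accuracy (0.68) is barely above the no-information rate (0.66), and only about 9% of the characters who actually died are predicted as dead. Note also that the matrix below is computed on the training partition, not on the held-out test set. Nevertheless, other methods did not give any better results, and neither did combining several of them. So please don’t be sad if my Shiny app said your favorite character will die: probably he won’t :)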

inTr<-createDataPartition(data$Dead, p=0.75, list=FALSE)
train<-data[inTr,]
test<-data[-inTr,]


model<-train(Dead~Nobility+Allegiances+litAge, data=train, method="rf")

prediction<-predict(model, train)

confusionMatrix(prediction,train$Dead)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Alive Dead
##      Alive   448  211
##      Dead     10   20
##                                         
##                Accuracy : 0.6792        
##                  95% CI : (0.643, 0.714)
##     No Information Rate : 0.6647        
##     P-Value [Acc > NIR] : 0.2222        
##                                         
##                   Kappa : 0.0825        
##  Mcnemar's Test P-Value : <2e-16        
##                                         
##             Sensitivity : 0.97817       
##             Specificity : 0.08658       
##          Pos Pred Value : 0.67982       
##          Neg Pred Value : 0.66667       
##              Prevalence : 0.66473       
##          Detection Rate : 0.65022       
##    Detection Prevalence : 0.95646       
##       Balanced Accuracy : 0.53237       
##                                         
##        'Positive' Class : Alive         
##
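
As a check on how the statistics above follow from the four cells of the confusion matrix, the arithmetic can be reproduced in base R. This is only a sketch: the variable names are my own, and the counts are copied from the output above, with Alive as the positive class.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference)
cm <- matrix(
  c(448, 10, 211, 20),
  nrow = 2,
  dimnames = list(Prediction = c("Alive", "Dead"),
                  Reference  = c("Alive", "Dead"))
)

total <- sum(cm)

# Share of all predictions that are correct
accuracy <- (cm["Alive", "Alive"] + cm["Dead", "Dead"]) / total   # 0.6792

# Accuracy of always predicting the majority class (Alive)
no_information_rate <- sum(cm[, "Alive"]) / total                 # 0.66473

# Share of truly alive characters predicted as alive
sensitivity <- cm["Alive", "Alive"] / sum(cm[, "Alive"])          # 0.97817

# Share of truly dead characters predicted as dead
specificity <- cm["Dead", "Dead"] / sum(cm[, "Dead"])             # 0.08658

balanced_accuracy <- (sensitivity + specificity) / 2              # 0.53237

print(round(c(accuracy = accuracy,
              no_information_rate = no_information_rate,
              sensitivity = sensitivity,
              specificity = specificity,
              balanced_accuracy = balanced_accuracy), 5))
```

The gap between sensitivity and specificity shows where the model fails: it labels almost everyone Alive, so it is right about survivors but misses 211 of the 231 dead characters.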