Class Discussion Assignment

The data set character-deaths.csv, found on kaggle, contains a list of 917 characters in Game of Thrones, the best TV show ever in history (IMHO). The lists indicate whether the character has died in the show or not, and a number of variables, including the gender, nobility, etc. data for the character.

got <- read.csv('character-deaths.csv')
head(got)
##                      Name     Allegiances Death.Year Book.of.Death
## 1          Addam Marbrand       Lannister         NA            NA
## 2 Aegon Frey (Jinglebell)            None        299             3
## 3         Aegon Targaryen House Targaryen         NA            NA
## 4           Adrack Humble   House Greyjoy        300             5
## 5          Aemon Costayne       Lannister         NA            NA
## 6         Aemon Estermont       Baratheon         NA            NA
##   Death.Chapter Book.Intro.Chapter Gender Nobility GoT CoK SoS FfC DwD
## 1            NA                 56      1        1   1   1   1   1   0
## 2            51                 49      1        1   0   0   1   0   0
## 3            NA                  5      1        1   0   0   0   0   1
## 4            20                 20      1        1   0   0   0   0   1
## 5            NA                 NA      1        1   0   0   1   0   0
## 6            NA                 NA      1        1   0   1   1   0   0
str(got)
## 'data.frame':    917 obs. of  13 variables:
##  $ Name              : Factor w/ 916 levels "Addam Marbrand",..: 1 3 4 2 5 6 7 8 9 10 ...
##  $ Allegiances       : Factor w/ 21 levels "Arryn","Baratheon",..: 13 16 10 6 13 2 15 16 6 15 ...
##  $ Death.Year        : int  NA 299 NA 300 NA NA 300 300 NA NA ...
##  $ Book.of.Death     : int  NA 3 NA 5 NA NA 4 5 NA NA ...
##  $ Death.Chapter     : int  NA 51 NA 20 NA NA 35 NA NA NA ...
##  $ Book.Intro.Chapter: int  56 49 5 20 NA NA 21 59 11 0 ...
##  $ Gender            : int  1 1 1 1 1 1 1 0 1 1 ...
##  $ Nobility          : int  1 1 1 1 1 1 1 1 1 0 ...
##  $ GoT               : int  1 0 0 0 0 0 1 1 0 0 ...
##  $ CoK               : int  1 0 0 0 0 1 0 1 1 0 ...
##  $ SoS               : int  1 1 0 0 1 1 1 1 0 1 ...
##  $ FfC               : int  1 0 0 0 0 0 1 0 1 0 ...
##  $ DwD               : int  0 0 1 1 0 0 0 1 0 0 ...

Here, I wish to create a multi-variate regression model to see if the Death.Chapter can be related to the following variables:

  • Gender: Gender of the character, a dichotomous term
  • Nobility: Whether the character is a noble, a dichotomous term
  • Book.Intro.Chapter: The chapter of a book where the character is introduced, a quantitative term.

To throw into the mix, I also include a quadratic term and few interaction term between the dichotomous terms and the quantitative term:

  • intro.square: Squaring the Book.Intro.Chapter term
  • intro.noble: Multiplying Book.Intro.Chapter by Nobility
  • intro.gender: Multiplying Book.Intro.Chapter by Gender
got$intro.square <- got$Book.Intro.Chapter ^ 2
got$intro.noble <- got$Book.Intro.Chapter * got$Nobility
got$intro.gender <- got$Book.Intro.Chapter * got$Gender

The target variable Death.Chapter is the chapter of the book when the character died. Keep in mind that GoT is based on many books of the series “Songs of Fire and Ice” by George RR Martin. The target variable does not predict which book, just which chapter. Let’s keep it simple for now.

model <- lm(Death.Chapter ~ Gender + Nobility + Book.Intro.Chapter + 
              intro.square + intro.noble + intro.gender, data = got)
summary(model)
## 
## Call:
## lm(formula = Death.Chapter ~ Gender + Nobility + Book.Intro.Chapter + 
##     intro.square + intro.noble + intro.gender, data = got)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.465 -13.407  -1.502  13.988  41.820 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        18.348032   5.903957   3.108  0.00207 ** 
## Gender              1.222154   5.751378   0.212  0.83187    
## Nobility           18.956234   3.779758   5.015 9.26e-07 ***
## Book.Intro.Chapter  0.994950   0.234774   4.238 3.04e-05 ***
## intro.square       -0.002700   0.002778  -0.972  0.33185    
## intro.noble        -0.537851   0.107801  -4.989 1.05e-06 ***
## intro.gender       -0.222670   0.167585  -1.329  0.18500    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.89 on 289 degrees of freedom
##   (621 observations deleted due to missingness)
## Multiple R-squared:  0.2408, Adjusted R-squared:  0.2251 
## F-statistic: 15.28 on 6 and 289 DF,  p-value: 3.308e-15

The model dropped 621 observations because these are characters still alive, and the Death.Chapter is NA. The model reports a adjusted R-squared of 0.23. Let’s drop an insignificant variable, said, Gender, which has the highest p-value.

model <- lm(Death.Chapter ~ Nobility + Book.Intro.Chapter + 
              intro.square + intro.noble + intro.gender, data = got)
summary(model)
## 
## Call:
## lm(formula = Death.Chapter ~ Nobility + Book.Intro.Chapter + 
##     intro.square + intro.noble + intro.gender, data = got)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.593 -13.357  -1.448  13.950  40.898 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        19.452571   2.795142   6.959 2.28e-11 ***
## Nobility           18.882004   3.757380   5.025 8.81e-07 ***
## Book.Intro.Chapter  0.968958   0.200064   4.843 2.08e-06 ***
## intro.square       -0.002711   0.002773  -0.978   0.3290    
## intro.noble        -0.535663   0.107132  -5.000 9.94e-07 ***
## intro.gender       -0.193605   0.096673  -2.003   0.0461 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.87 on 290 degrees of freedom
##   (621 observations deleted due to missingness)
## Multiple R-squared:  0.2407, Adjusted R-squared:  0.2276 
## F-statistic: 18.39 on 5 and 290 DF,  p-value: 7.412e-16

The adjusted R-squared increased slightly.

Residual analysis is done as follows:

plot(fitted(model), resid(model), xlab='Fitted Values', ylab='Residuals', main='Residuals vs Fitted Values')

The residuals seem to be constant along the fitted values.

qqnorm(resid(model))
qqline(resid(model))

The points are near the normal line, but at this point, I would not be confident to say the model is appropriate.

Fun Part

Now, let’s get to the fun part. I’m going to try to use logistic regression to predict the death of characters that are still alive in the show. The R function I will use is glm.

Let’s first introduce a new column Dead to ID the characters as dead or alive.

got$Dead <- !is.na(got$Death.Year)

I will reserve the rows of several of my favorite characters and save them for the prediction at the end. I will keep them out of the training portion.

favor_name <- c('Jon Snow', 'Daenerys Targaryen', 'Tyrion Lannister', 'Arya Stark', 'Sansa Stark')
favor_char <- got[which(got$Name %in% favor_name), ]
got <- got[-which(got$Name %in% favor_name), ]

I will now scramble the rows, and split the remaining data set into 80/20 train/test split.

set.seed(1)
got <- got[sample(nrow(got)), ]
split <- round(nrow(got)*0.8)
train <- got[1:split, ]
test <- got[(split+1):nrow(got), ]

The glm is built using the following predictors:

  • Allegiances: which houses/faction the character belongs to
  • Book.Intro.Chapter: the chapter of a book the character was introduced.
  • Gender: character’s gender
  • Nobility: whether the character is a noble
  • GoT: whether the character appeared in the book A Game of Thrones
  • CoK: whether the character appeared in the book A Clash of Kings
  • SoS: whether the character appeared in the book A Storm of Swords
  • FfC: whether the character appeared in the book A Feast for Crows
  • DwD: whether the character appeared in the book A Dance with Dragons

I will train the model using the train set.

model <- glm(Dead ~ Allegiances + Gender + Nobility + Book.Intro.Chapter+ GoT + CoK +
               SoS + FfC + DwD, data = train)
model
## 
## Call:  glm(formula = Dead ~ Allegiances + Gender + Nobility + Book.Intro.Chapter + 
##     GoT + CoK + SoS + FfC + DwD, data = train)
## 
## Coefficients:
##                (Intercept)        AllegiancesBaratheon  
##                   0.296087                    0.084014  
##         AllegiancesGreyjoy      AllegiancesHouse Arryn  
##                   0.032485                    0.069911  
## AllegiancesHouse Baratheon    AllegiancesHouse Greyjoy  
##                   0.176898                    0.404964  
## AllegiancesHouse Lannister    AllegiancesHouse Martell  
##                   0.305131                    0.157270  
##     AllegiancesHouse Stark  AllegiancesHouse Targaryen  
##                   0.174381                   -0.011178  
##     AllegiancesHouse Tully     AllegiancesHouse Tyrell  
##                   0.389135                    0.043514  
##       AllegiancesLannister          AllegiancesMartell  
##                   0.030983                   -0.012773  
##   AllegiancesNight's Watch             AllegiancesNone  
##                   0.225051                    0.107074  
##           AllegiancesStark        AllegiancesTargaryen  
##                   0.162857                    0.082539  
##           AllegiancesTully           AllegiancesTyrell  
##                   0.032084                   -0.090317  
##        AllegiancesWildling                      Gender  
##                   0.411984                    0.056142  
##                   Nobility          Book.Intro.Chapter  
##                  -0.043275                   -0.002142  
##                        GoT                         CoK  
##                   0.098115                    0.065228  
##                        SoS                         FfC  
##                   0.016791                   -0.224466  
##                        DwD  
##                  -0.194873  
## 
## Degrees of Freedom: 719 Total (i.e. Null);  691 Residual
##   (10 observations deleted due to missingness)
## Null Deviance:       160 
## Residual Deviance: 130.1     AIC: 871.4

Let’s test the model using the test set. Here, the model will predict a probability of death for the characters. We will use 0.5 as the threshold; if the predicted probability is higher than 0.5, the character is presumed dead.

cols <- c('Allegiances', 'Gender', 'Nobility', 'Book.Intro.Chapter', 'GoT', 'CoK', 'SoS', 'FfC', 'DwD')
fitted.results <- predict(model, newdata=subset(test,select=cols, type='response'))
predict_death <- ifelse(fitted.results > 0.5, TRUE, FALSE)
accuracy <- mean(predict_death == test$Dead, na.rm = TRUE)
print(paste('Accuracy', accuracy))
## [1] "Accuracy 0.711111111111111"

The prediction accuracy is 71.11%.

Finally, let’s see if my favorite characters will die…

prediction<- predict(model, newdata=subset(favor_char,select=cols, type='response'))
names <- favor_char$Name
data.frame(`Character`=names, `Death Probability`=prediction, row.names=NULL)
##            Character Death.Probability
## 1         Arya Stark         0.1721809
## 2 Daenerys Targaryen         0.2204702
## 3           Jon Snow         0.2926582
## 4        Sansa Stark         0.3563447
## 5   Tyrion Lannister         0.3166317

It’s good to see that none of my favorite characters has a high chance of dying. Let’s hope they will endure this winter, till the end…