The data set character-deaths.csv, found on kaggle, contains a list of 917 characters in Game of Thrones, the best TV show ever in history (IMHO). The lists indicate whether the character has died in the show or not, and a number of variables, including the gender, nobility, etc. data for the character.
got <- read.csv('character-deaths.csv')
head(got)
## Name Allegiances Death.Year Book.of.Death
## 1 Addam Marbrand Lannister NA NA
## 2 Aegon Frey (Jinglebell) None 299 3
## 3 Aegon Targaryen House Targaryen NA NA
## 4 Adrack Humble House Greyjoy 300 5
## 5 Aemon Costayne Lannister NA NA
## 6 Aemon Estermont Baratheon NA NA
## Death.Chapter Book.Intro.Chapter Gender Nobility GoT CoK SoS FfC DwD
## 1 NA 56 1 1 1 1 1 1 0
## 2 51 49 1 1 0 0 1 0 0
## 3 NA 5 1 1 0 0 0 0 1
## 4 20 20 1 1 0 0 0 0 1
## 5 NA NA 1 1 0 0 1 0 0
## 6 NA NA 1 1 0 1 1 0 0
str(got)
## 'data.frame': 917 obs. of 13 variables:
## $ Name : Factor w/ 916 levels "Addam Marbrand",..: 1 3 4 2 5 6 7 8 9 10 ...
## $ Allegiances : Factor w/ 21 levels "Arryn","Baratheon",..: 13 16 10 6 13 2 15 16 6 15 ...
## $ Death.Year : int NA 299 NA 300 NA NA 300 300 NA NA ...
## $ Book.of.Death : int NA 3 NA 5 NA NA 4 5 NA NA ...
## $ Death.Chapter : int NA 51 NA 20 NA NA 35 NA NA NA ...
## $ Book.Intro.Chapter: int 56 49 5 20 NA NA 21 59 11 0 ...
## $ Gender : int 1 1 1 1 1 1 1 0 1 1 ...
## $ Nobility : int 1 1 1 1 1 1 1 1 1 0 ...
## $ GoT : int 1 0 0 0 0 0 1 1 0 0 ...
## $ CoK : int 1 0 0 0 0 1 0 1 1 0 ...
## $ SoS : int 1 1 0 0 1 1 1 1 0 1 ...
## $ FfC : int 1 0 0 0 0 0 1 0 1 0 ...
## $ DwD : int 0 0 1 1 0 0 0 1 0 0 ...
Here, I wish to create a multi-variate regression model to see if the Death.Chapter can be related to the following variables:
Gender: Gender of the character, a dichotomous termNobility: Whether the character is a noble, a dichotomous termBook.Intro.Chapter: The chapter of a book where the character is introduced, a quantitative term.To throw into the mix, I also include a quadratic term and few interaction term between the dichotomous terms and the quantitative term:
intro.square: Squaring the Book.Intro.Chapter termintro.noble: Multiplying Book.Intro.Chapter by Nobilityintro.gender: Multiplying Book.Intro.Chapter by Gendergot$intro.square <- got$Book.Intro.Chapter ^ 2
got$intro.noble <- got$Book.Intro.Chapter * got$Nobility
got$intro.gender <- got$Book.Intro.Chapter * got$Gender
The target variable Death.Chapter is the chapter of the book when the character died. Keep in mind that GoT is based on many books of the series “Songs of Fire and Ice” by George RR Martin. The target variable does not predict which book, just which chapter. Let’s keep it simple for now.
model <- lm(Death.Chapter ~ Gender + Nobility + Book.Intro.Chapter +
intro.square + intro.noble + intro.gender, data = got)
summary(model)
##
## Call:
## lm(formula = Death.Chapter ~ Gender + Nobility + Book.Intro.Chapter +
## intro.square + intro.noble + intro.gender, data = got)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.465 -13.407 -1.502 13.988 41.820
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.348032 5.903957 3.108 0.00207 **
## Gender 1.222154 5.751378 0.212 0.83187
## Nobility 18.956234 3.779758 5.015 9.26e-07 ***
## Book.Intro.Chapter 0.994950 0.234774 4.238 3.04e-05 ***
## intro.square -0.002700 0.002778 -0.972 0.33185
## intro.noble -0.537851 0.107801 -4.989 1.05e-06 ***
## intro.gender -0.222670 0.167585 -1.329 0.18500
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.89 on 289 degrees of freedom
## (621 observations deleted due to missingness)
## Multiple R-squared: 0.2408, Adjusted R-squared: 0.2251
## F-statistic: 15.28 on 6 and 289 DF, p-value: 3.308e-15
The model dropped 621 observations because these are characters still alive, and the Death.Chapter is NA. The model reports a adjusted R-squared of 0.23. Let’s drop an insignificant variable, said, Gender, which has the highest p-value.
model <- lm(Death.Chapter ~ Nobility + Book.Intro.Chapter +
intro.square + intro.noble + intro.gender, data = got)
summary(model)
##
## Call:
## lm(formula = Death.Chapter ~ Nobility + Book.Intro.Chapter +
## intro.square + intro.noble + intro.gender, data = got)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.593 -13.357 -1.448 13.950 40.898
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.452571 2.795142 6.959 2.28e-11 ***
## Nobility 18.882004 3.757380 5.025 8.81e-07 ***
## Book.Intro.Chapter 0.968958 0.200064 4.843 2.08e-06 ***
## intro.square -0.002711 0.002773 -0.978 0.3290
## intro.noble -0.535663 0.107132 -5.000 9.94e-07 ***
## intro.gender -0.193605 0.096673 -2.003 0.0461 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.87 on 290 degrees of freedom
## (621 observations deleted due to missingness)
## Multiple R-squared: 0.2407, Adjusted R-squared: 0.2276
## F-statistic: 18.39 on 5 and 290 DF, p-value: 7.412e-16
The adjusted R-squared increased slightly.
Residual analysis is done as follows:
plot(fitted(model), resid(model), xlab='Fitted Values', ylab='Residuals', main='Residuals vs Fitted Values')
The residuals seem to be constant along the fitted values.
qqnorm(resid(model))
qqline(resid(model))
The points are near the normal line, but at this point, I would not be confident to say the model is appropriate.
Now, let’s get to the fun part. I’m going to try to use logistic regression to predict the death of characters that are still alive in the show. The R function I will use is glm.
Let’s first introduce a new column Dead to ID the characters as dead or alive.
got$Dead <- !is.na(got$Death.Year)
I will reserve the rows of several of my favorite characters and save them for the prediction at the end. I will keep them out of the training portion.
favor_name <- c('Jon Snow', 'Daenerys Targaryen', 'Tyrion Lannister', 'Arya Stark', 'Sansa Stark')
favor_char <- got[which(got$Name %in% favor_name), ]
got <- got[-which(got$Name %in% favor_name), ]
I will now scramble the rows, and split the remaining data set into 80/20 train/test split.
set.seed(1)
got <- got[sample(nrow(got)), ]
split <- round(nrow(got)*0.8)
train <- got[1:split, ]
test <- got[(split+1):nrow(got), ]
The glm is built using the following predictors:
Allegiances: which houses/faction the character belongs toBook.Intro.Chapter: the chapter of a book the character was introduced.Gender: character’s genderNobility: whether the character is a nobleGoT: whether the character appeared in the book A Game of ThronesCoK: whether the character appeared in the book A Clash of KingsSoS: whether the character appeared in the book A Storm of SwordsFfC: whether the character appeared in the book A Feast for CrowsDwD: whether the character appeared in the book A Dance with DragonsI will train the model using the train set.
model <- glm(Dead ~ Allegiances + Gender + Nobility + Book.Intro.Chapter+ GoT + CoK +
SoS + FfC + DwD, data = train)
model
##
## Call: glm(formula = Dead ~ Allegiances + Gender + Nobility + Book.Intro.Chapter +
## GoT + CoK + SoS + FfC + DwD, data = train)
##
## Coefficients:
## (Intercept) AllegiancesBaratheon
## 0.296087 0.084014
## AllegiancesGreyjoy AllegiancesHouse Arryn
## 0.032485 0.069911
## AllegiancesHouse Baratheon AllegiancesHouse Greyjoy
## 0.176898 0.404964
## AllegiancesHouse Lannister AllegiancesHouse Martell
## 0.305131 0.157270
## AllegiancesHouse Stark AllegiancesHouse Targaryen
## 0.174381 -0.011178
## AllegiancesHouse Tully AllegiancesHouse Tyrell
## 0.389135 0.043514
## AllegiancesLannister AllegiancesMartell
## 0.030983 -0.012773
## AllegiancesNight's Watch AllegiancesNone
## 0.225051 0.107074
## AllegiancesStark AllegiancesTargaryen
## 0.162857 0.082539
## AllegiancesTully AllegiancesTyrell
## 0.032084 -0.090317
## AllegiancesWildling Gender
## 0.411984 0.056142
## Nobility Book.Intro.Chapter
## -0.043275 -0.002142
## GoT CoK
## 0.098115 0.065228
## SoS FfC
## 0.016791 -0.224466
## DwD
## -0.194873
##
## Degrees of Freedom: 719 Total (i.e. Null); 691 Residual
## (10 observations deleted due to missingness)
## Null Deviance: 160
## Residual Deviance: 130.1 AIC: 871.4
Let’s test the model using the test set. Here, the model will predict a probability of death for the characters. We will use 0.5 as the threshold; if the predicted probability is higher than 0.5, the character is presumed dead.
cols <- c('Allegiances', 'Gender', 'Nobility', 'Book.Intro.Chapter', 'GoT', 'CoK', 'SoS', 'FfC', 'DwD')
fitted.results <- predict(model, newdata=subset(test,select=cols, type='response'))
predict_death <- ifelse(fitted.results > 0.5, TRUE, FALSE)
accuracy <- mean(predict_death == test$Dead, na.rm = TRUE)
print(paste('Accuracy', accuracy))
## [1] "Accuracy 0.711111111111111"
The prediction accuracy is 71.11%.
Finally, let’s see if my favorite characters will die…
prediction<- predict(model, newdata=subset(favor_char,select=cols, type='response'))
names <- favor_char$Name
data.frame(`Character`=names, `Death Probability`=prediction, row.names=NULL)
## Character Death.Probability
## 1 Arya Stark 0.1721809
## 2 Daenerys Targaryen 0.2204702
## 3 Jon Snow 0.2926582
## 4 Sansa Stark 0.3563447
## 5 Tyrion Lannister 0.3166317
It’s good to see that none of my favorite characters has a high chance of dying. Let’s hope they will endure this winter, till the end…