Data

The dataset that was used this week is a Game of Thrones character prediction that was obtained through kaggle.com. Using this dataset, I intend to measure the popularity of gender and age that has been collected. The data is generally compsed of 33 variables and 1900 observations. The binary variable being used is isAlive which indicates whether or not a character from the novels has been killed of. The variable has been coded using 0 for dead and 1 for living. Other variables being used are popularity, isNoble, isMarried, and age.

## # A tibble: 6 x 23
##    S.no  plod name  title  male culture dateOfBirth DateoFdeath NobleHouse
##   <dbl> <dbl> <chr> <chr> <dbl> <chr>         <dbl>       <dbl> <chr>     
## 1  1922 0.885 Barra None      1 Unkown          298           0 Baratheon 
## 2  1195 0.563 Edri~ None      1 Unkown          287           0 Baratheon 
## 3  1785 0.55  Renl~ Lord~     1 Storml~         277           0 Baratheon 
## 4  1812 0.934 Stef~ Stor~     1 Unkown          246           0 Baratheon 
## 5  1682 0.426 Dona~ None      1 Unkown          244           0 Baratheon 
## 6  1683 0.026 Dome~ None      1 Northm~         279           0 Bolton    
## # ... with 14 more variables: spouse <chr>, book1 <dbl>, book2 <dbl>,
## #   book3 <dbl>, book4 <dbl>, book5 <dbl>, isMarried <dbl>, isNoble <dbl>,
## #   age <dbl>, numDeadRelations <dbl>, boolDeadRelations <dbl>,
## #   isPopular <dbl>, popularity <dbl>, isAlive <dbl>

Logistic Regression

Model 1

model_1 <- glm(isAlive ~ popularity, family = 'binomial', data = GoT.Character.Prediction)
summary(model_1)
## 
## Call:
## glm(formula = isAlive ~ popularity, family = "binomial", data = GoT.Character.Prediction)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3773  -1.3679   0.9921   0.9950   1.0263  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  0.45843    0.18526   2.475   0.0133 *
## popularity  -0.09194    0.50347  -0.183   0.8551  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 293.43  on 218  degrees of freedom
## Residual deviance: 293.40  on 217  degrees of freedom
## AIC: 297.4
## 
## Number of Fisher Scoring iterations: 4

From model 1, the coefficient is negative. From this, it can be interpretted as character popularity decreases when character status is living. Interpreting the y-intercept, it can be stated that the odds of a character still being alive throughout the Game of Thrones story is .61292. The value of the slope can be interpreted as for every one-unit of popularity, the log odds of a character still being alive decreases by -.023. Moreover, from the z value, is less than one standard deviation away from zero indicating that the results are not statistically significant.

Model 2

model_2 <- glm(isAlive ~ popularity + isNoble, family = 'binomial', data = GoT.Character.Prediction)
summary(model_2)
## 
## Call:
## glm(formula = isAlive ~ popularity + isNoble, family = "binomial", 
##     data = GoT.Character.Prediction)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9983  -0.9041   0.6911   0.7441   1.6249  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.0496     0.2324   4.517 6.27e-06 ***
## popularity    1.8626     0.6332   2.942  0.00326 ** 
## isNoble      -2.1521     0.3603  -5.973 2.33e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 293.43  on 218  degrees of freedom
## Residual deviance: 252.83  on 216  degrees of freedom
## AIC: 258.83
## 
## Number of Fisher Scoring iterations: 4

For model 2, the variable of isNoble has be added. Based on this addition, the odds of a character being alive is significantly reduced by -2.15 and the log odds of survival based on character popularity has increased to 1.86.

Model 3

model_3 <- glm(isAlive ~ popularity + isNoble + isMarried, family = 'binomial', data = GoT.Character.Prediction)
summary(model_3)
## 
## Call:
## glm(formula = isAlive ~ popularity + isNoble + isMarried, family = "binomial", 
##     data = GoT.Character.Prediction)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9705  -0.8856   0.6665   0.7343   1.7102  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.1282     0.2381   4.737 2.17e-06 ***
## popularity    1.8814     0.6376   2.951  0.00317 ** 
## isNoble      -1.8480     0.3938  -4.692 2.70e-06 ***
## isMarried    -0.6484     0.3601  -1.801  0.07175 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 293.43  on 218  degrees of freedom
## Residual deviance: 249.62  on 215  degrees of freedom
## AIC: 257.62
## 
## Number of Fisher Scoring iterations: 4

Model 3 introduces the vairable of isMarried. The addition of this variable finds that the log odds of character survival for characyers that are married is significantly reduced by -.65. However, when compared to isNoble, the log odds of survival are higher than log odds of survival for nobles, even with a significantly increase from -2.15 to -1.84.

Model 4

model_4 <- glm(isAlive ~ popularity + isNoble + isMarried + age, family = 'binomial', data = GoT.Character.Prediction)
summary(model_4)
## 
## Call:
## glm(formula = isAlive ~ popularity + isNoble + isMarried + age, 
##     family = "binomial", data = GoT.Character.Prediction)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0148  -0.8982   0.6227   0.7398   1.8745  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.609293   0.321561   5.005  5.6e-07 ***
## popularity   1.643093   0.652345   2.519 0.011777 *  
## isNoble     -1.584641   0.411720  -3.849 0.000119 ***
## isMarried   -0.500931   0.369652  -1.355 0.175372    
## age         -0.017385   0.007463  -2.330 0.019830 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 293.43  on 218  degrees of freedom
## Residual deviance: 243.93  on 214  degrees of freedom
## AIC: 253.93
## 
## Number of Fisher Scoring iterations: 4

With model 4, we introduce the variable of age. This model indicates that the each year a character ages, the log odds of survival decreases -.017.

Interaction Model

model_5 <- glm(isAlive ~ (popularity * male) + isNoble + isMarried + (age * male), family = 'binomial' , data = GoT.Character.Prediction)
summary(model_5)
## 
## Call:
## glm(formula = isAlive ~ (popularity * male) + isNoble + isMarried + 
##     (age * male), family = "binomial", data = GoT.Character.Prediction)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2238  -0.8871   0.3369   0.8823   1.9411  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      3.25589    0.72252   4.506  6.6e-06 ***
## popularity       2.51583    1.83451   1.371  0.17025    
## male            -2.64899    0.81923  -3.234  0.00122 ** 
## isNoble         -1.34606    0.42100  -3.197  0.00139 ** 
## isMarried       -0.70989    0.38055  -1.865  0.06212 .  
## age             -0.04304    0.01394  -3.088  0.00202 ** 
## popularity:male -0.98097    1.89469  -0.518  0.60463    
## male:age         0.04232    0.01634   2.590  0.00959 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 293.43  on 218  degrees of freedom
## Residual deviance: 222.62  on 211  degrees of freedom
## AIC: 238.62
## 
## Number of Fisher Scoring iterations: 5

Adding the vairable of male to interact with popularity, the data indicates that based on popularity the log odds of survival among males is -.98 and the odds of survivial for males based on age is .042. We can therefore indicate that gender does influence does influence the log odds of survival.

Regression Models

Model 1 Model 2 Model 3 Model 4
(Intercept) 0.460* 0.730*** 0.745*** 0.838***
(0.185) (0.044) (0.044) (0.058)
popularity -0.096 0.417** 0.414** 0.356**
(0.504) (0.127) (0.126) (0.127)
isNoble -0.481*** -0.409*** -0.344***
(0.070) (0.079) (0.082)
isMarried -0.139 -0.109
(0.074) (0.074)
age -0.003*
(0.001)
AIC 297.396 271.969 270.416 266.300
BIC 304.175 285.525 287.362 286.634
Log Likelihood -146.698 -131.984 -130.208 -127.150
Deviance 293.396 42.800 42.111 40.951
Num. obs. 219 219 219 219
p < 0.001, p < 0.01, p < 0.05

Model 4 = Best Model

Based on both the AIC and BIC, it can be concluded not only that the best model to use is Model 4, but also that the models get show improvement as variables are included.

Visualization of Best Model

library(visreg)
visreg(model_4, "popularity", by = "age", scale = "response")

As we can see, based on both popularity and age, younger Game of Throne characters are more likely to be alive when compared to the older ages.

Visual of Interaction Model

library(visreg)
visreg(model_5, "popularity", by = "male", scale = "response")

The first graph indicates that females characters not only have higer popularity, by they they also have higher odds of surviving the Game of Thrones

library(interactions)
## Warning: package 'interactions' was built under R version 3.5.3
interact_plot(model_5, pred = popularity , modx = male)

The interaction graph further reinforces the notion that female characters are not only more popular, but that they also have higher odds of survival.