library(readr)
library(dplyr)
library(glmnet)
##讀檔
tbl_titanic <- read_csv("titanic3.csv", na = c("", "NA"))

2-1 Conduct logistic regression analysis according to the two models.

先對資料做前置處理

##轉換資料形態
tbl_titanic <- tbl_titanic %>%
select(survived, pclass, sex, age, sibsp, parch, fare, embarked) %>%
mutate_at(.vars = vars(pclass, survived, sex, embarked),
.funs = funs(as.factor))
summary(tbl_titanic)
   survived pclass      sex           age            sibsp       
   0:809    1:323   female:466   Min.   : 0.17   Min.   :0.0000  
   1:500    2:277   male  :843   1st Qu.:21.00   1st Qu.:0.0000  
            3:709                Median :28.00   Median :0.0000  
                                 Mean   :29.88   Mean   :0.4989  
                                 3rd Qu.:39.00   3rd Qu.:1.0000  
                                 Max.   :80.00   Max.   :8.0000  
                                 NA's   :263                     
       parch            fare         embarked  
   Min.   :0.000   Min.   :  0.000   C   :270  
   1st Qu.:0.000   1st Qu.:  7.896   Q   :123  
   Median :0.000   Median : 14.454   S   :914  
   Mean   :0.385   Mean   : 33.295   NA's:  2  
   3rd Qu.:0.000   3rd Qu.: 31.275             
   Max.   :9.000   Max.   :512.329             
                   NA's   :1
##處理missing data
tbl_titanic <- tbl_titanic %>%
mutate(age_na = is.na(age), age = replace(age, is.na(age), 0),
fare = replace(fare, is.na(fare), median(fare, na.rm = T)),
embarked = replace(embarked, is.na(embarked), "S"))
summary(tbl_titanic)
   survived pclass      sex           age            sibsp       
   0:809    1:323   female:466   Min.   : 0.00   Min.   :0.0000  
   1:500    2:277   male  :843   1st Qu.: 7.00   1st Qu.:0.0000  
            3:709                Median :24.00   Median :0.0000  
                                 Mean   :23.88   Mean   :0.4989  
                                 3rd Qu.:35.00   3rd Qu.:1.0000  
                                 Max.   :80.00   Max.   :8.0000  
       parch            fare         embarked   age_na       
   Min.   :0.000   Min.   :  0.000   C:270    Mode :logical  
   1st Qu.:0.000   1st Qu.:  7.896   Q:123    FALSE:1046     
   Median :0.000   Median : 14.454   S:916    TRUE :263      
   Mean   :0.385   Mean   : 33.281                           
   3rd Qu.:0.000   3rd Qu.: 31.275                           
   Max.   :9.000   Max.   :512.329

建構 “survived ~ sex + pclass” logistic regression model

set.seed(56500)
idc_train <- sample(c(T, F), nrow(tbl_titanic), replace = T, prob = c(.7, .3))
glm_fit_1 <- tbl_titanic %>% glm(survived ~ sex+pclass, data = .,
family = binomial(link = "logit"),
subset = idc_train)
summary(glm_fit_1)
  
  Call:
  glm(formula = survived ~ sex + pclass, family = binomial(link = "logit"), 
      data = ., subset = idc_train)
  
  Deviance Residuals: 
      Min       1Q   Median       3Q      Max  
  -2.0974  -0.6883  -0.4580   0.7398   2.1479  
  
  Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
  (Intercept)   2.0822     0.2043  10.192  < 2e-16 ***
  sexmale      -2.4756     0.1743 -14.202  < 2e-16 ***
  pclass2      -0.9260     0.2343  -3.953 7.73e-05 ***
  pclass3      -1.8085     0.2055  -8.798  < 2e-16 ***
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
  (Dispersion parameter for binomial family taken to be 1)
  
      Null deviance: 1227.86  on 927  degrees of freedom
  Residual deviance:  886.13  on 924  degrees of freedom
  AIC: 894.13
  
  Number of Fisher Scoring iterations: 4

建構 “survived ~ sex * pclass” logistic regression model

glm_fit_2 <- tbl_titanic %>% glm(survived ~ sex*pclass, data = .,
family = binomial(link = "logit"),
subset = idc_train)
summary(glm_fit_2)
  
  Call:
  glm(formula = survived ~ sex * pclass, family = binomial(link = "logit"), 
      data = ., subset = idc_train)
  
  Deviance Residuals: 
      Min       1Q   Median       3Q      Max  
  -2.5601  -0.5669  -0.5605   0.5207   1.9640  
  
  Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
  (Intercept)       3.2387     0.5095   6.356 2.07e-10 ***
  sexmale          -3.9071     0.5445  -7.175 7.21e-13 ***
  pclass2          -1.3088     0.6220  -2.104 0.035358 *  
  pclass3          -3.3654     0.5339  -6.304 2.90e-10 ***
  sexmale:pclass2   0.2303     0.6968   0.331 0.741002    
  sexmale:pclass3   2.2623     0.5876   3.850 0.000118 ***
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
  (Dispersion parameter for binomial family taken to be 1)
  
      Null deviance: 1227.86  on 927  degrees of freedom
  Residual deviance:  854.13  on 922  degrees of freedom
  AIC: 866.13
  
  Number of Fisher Scoring iterations: 5

2-2 Please calculate the probability of y = 1 under each combination of sex and pclass via the result of 1.

已知 \(P(y=1|x)=\frac{exp(\beta_0+\sum_{p=1}^{P}\beta_p x_p)}{1+exp(\beta_0+\sum_{p=1}^{P}\beta_p x_p)}\)

model 1 沒有 sex 及 pclass 的交互作用,而 Coefficients 分別為

(Intercept) sexmale pclass2 pclass3
2.0822 -2.4756 -0.9260 -1.8085

其中 sexfemale 以及 pclass1 為 0

\(P(y=1|x=female,x=pclass1)=\frac{exp(\beta_0)}{1+exp(\beta_0)}=\frac{exp(2.0822)}{1+exp(2.0822)}=0.8896\)

\(P(y=1|x=female,x=pclass2)=\frac{exp(\beta_0+\beta_2x_2)}{1+exp(\beta_0+\beta_2 x_2)}=\frac{exp(2.0822-0.9260)}{1+exp(2.0822-0.9260)}=0.7606\)

\(P(y=1|x=female,x=pclass3)=\frac{exp(\beta_0+\beta_2x_2)}{1+exp(\beta_0+\beta_2 x_2)}=\frac{exp(2.0822-1.8085)}{1+exp(2.0822-1.8085)}=0.5680\)

\(P(y=1|x=male,x=pclass1)=\frac{exp(\beta_0+\beta_1x_1)}{1+exp(\beta_0+\beta_1x_1)}=\frac{exp(2.0822-2.4756)}{1+exp(2.0822-2.4756)}=0.4029\)

\(P(y=1|x=male,x=pclass2)=\frac{exp(\beta_0+\beta_1x_1+\beta_2x_2)}{1+exp(\beta_0+\beta_1x_1+\beta_2x_2)}=\frac{exp(2.0822-2.4756-0.9260)}{1+exp(2.0822-2.4756-0.9260)}=0.2109\)

\(P(y=1|x=male,x=pclass3)=\frac{exp(\beta_0+\beta_1x_1+\beta_2x_2)}{1+exp(\beta_0+\beta_1x_1+\beta_2x_2)}=\frac{exp(2.0822-2.4756-1.8085)}{1+exp(2.0822-2.4756-1.8085)}=0.0996\)

model 2 有 sex 及 pclass 的交互作用,而 Coefficients 分別為

(Intercept) sexmale pclass2 pclass3 sexmale:pclass2 sexmale:pclass3
3.2387 -3.9071 -1.3088 -3.3654 0.2303 2.2623

其中 sexfemale 以及 pclass1 為 0

\(P(y=1|x=female,x=pclass1)=\frac{exp(\beta_0)}{1+exp(\beta_0)}=\frac{exp(3.2387)}{1+exp(3.2387)}=0.9623\)

\(P(y=1|x=female,x=pclass2)=\frac{exp(\beta_0+\beta_2x_2)}{1+exp(\beta_0+\beta_2 x_2)}=\frac{exp(3.2387-1.3088)}{1+exp(3.2387-1.3088)}=0.8732\)

\(P(y=1|x=female,x=pclass3)=\frac{exp(\beta_0+\beta_2x_2)}{1+exp(\beta_0+\beta_2 x_2)}=\frac{exp(3.2387-3.3654)}{1+exp(3.2387-3.3654)}=0.4684\)

\(P(y=1|x=male,x=pclass1)=\frac{exp(\beta_0+\beta_1x_1)}{1+exp(\beta_0+\beta_1x_1)}=\frac{exp(3.2387-3.9071)}{1+exp(3.2387-3.9071)}=0.3389\)

\(P(y=1|x=male,x=pclass2)=\frac{exp(\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3)}{1+exp(\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3)}=\frac{exp(3.2387-3.9071-1.3088+0.2303)}{1+exp(3.2387-3.9071-1.3088+0.2303)}=0.1484\)

\(P(y=1|x=male,x=pclass3)=\frac{exp(\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3)}{1+exp(\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3)}=\frac{exp(3.2387-3.9071-3.3654+2.2623)}{1+exp(3.2387-3.9071-3.3654+2.2623)}=0.1454\)


2-3 Now calculate the probability of y = 1 under each combination of sex and pclass again based on the following flat contingency table.

tbl_titanic[idc_train, ] %>% select(survived, sex, pclass) %>% ftable()
                  pclass   1   2   3
  survived sex                      
  0        female          4   9  84
           male           80 109 294
  1        female        102  62  74
           male           41  19  50

由上表得知

\(P(y=1|x=female,x=pclass1)=102/106=0.9623\)

\(P(y=1|x=female,x=pclass2)=62/71=0.8732\)

\(P(y=1|x=female,x=pclass3)=74/158=0.4684\)

\(P(y=1|x=male,x=pclass1)=41/121=0.3388\)

\(P(y=1|x=male,x=pclass2)=19/128=0.1484\)

\(P(y=1|x=male,x=pclass3)=50/344=0.1453\)


2-4 Please compare the results of 2 and 3.

由 2 與 3 的結果可以得知,在第一個 model 中雖然沒有辦法得到 training set 正確的機率,但也算非常接近,而 model 2 的結果可以很好的代表整個 training set,不過哪個比較好還是要使用 testing set 做驗證,但在 training set 中 model 2 確實做的比較好。