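Can You Know the Outcome Before the Date?

Data Overview

A questionnaire filled in before the event records how much each person likes various kinds of activities.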

plot

plot

plot

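Regression Model

The independent variables are each person's own ratings (1-10) of their interest in various activities.
The response variable is whether the partner is interested in that person.
Since the response is binary, the model is a logistic regression, so \(dec_{pred}\) is on the log-odds scale.

\(\begin{aligned} dec_{pred} = {} & \beta_0+\beta_1\ sports+\beta_2\ tvsports+\beta_3\ exercise+\beta_4\ dining+\beta_5\ museums \\ & +\beta_6\ art+\beta_7\ hiking+\beta_8\ gaming+\beta_9\ clubbing+\beta_{10}\ reading+\beta_{11}\ tv \\ & +\beta_{12}\ theater+\beta_{13}\ movies+\beta_{14}\ concerts+\beta_{15}\ music+\beta_{16}\ shopping \\ & +\beta_{17}\ yoga \end{aligned}\)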

##                  Estimate Std. Error      z value     Pr(>|z|)
## (Intercept) -9.699499e-01 0.22588400 -4.294017896 1.754683e-05
## sports       4.396830e-02 0.01360625  3.231477886 1.231519e-03
## tvsports    -4.164909e-02 0.01220659 -3.412017613 6.448395e-04
## exercise     3.937751e-02 0.01264759  3.113440461 1.849198e-03
## dining       3.568129e-02 0.01831000  1.948732524 5.132738e-02
## museums     -8.176128e-05 0.02747476 -0.002975868 9.976256e-01
## art          4.771713e-03 0.02382029  0.200321402 8.412292e-01
## hiking       3.268697e-02 0.01181010  2.767712358 5.645125e-03
## gaming      -5.778836e-02 0.01149035 -5.029294471 4.922879e-07
## clubbing     4.421329e-02 0.01155367  3.826774246 1.298335e-04
## reading     -1.099251e-02 0.01465726 -0.749970135 4.532727e-01
## tv          -1.468673e-02 0.01370993 -1.071248172 2.840579e-01
## theater     -2.280711e-02 0.01699074 -1.342326062 1.794903e-01
## movies      -2.991890e-02 0.02004273 -1.492755758 1.355011e-01
## concerts    -2.531595e-02 0.01840439 -1.375538457 1.689646e-01
## music        2.590403e-02 0.02018995  1.283016432 1.994863e-01
## shopping     5.367783e-02 0.01326894  4.045373347 5.223980e-05
## yoga         4.590679e-03 0.01104360  0.415686942 6.776391e-01
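The estimates above are easier to read as odds ratios. The following is a minimal sketch, assuming `model0` uses the same logit link as the later models, so that each estimate is a change in log-odds per one-point increase in a rating. The values are copied from the table; the vector name `est` is only for illustration.

```r
# Estimates copied from the coefficient table above (log-odds scale).
est <- c(sports = 0.04396830, tvsports = -0.04164909, gaming = -0.05778836,
         clubbing = 0.04421329, shopping = 0.05367783)

# exp() turns a change in log-odds into a multiplicative change in the odds.
odds_ratio <- exp(est)
print(round(odds_ratio, 3))

# Effect of moving a rating from 1 to 10 (nine points), other ratings held fixed.
print(round(exp(9 * est), 3))
```

An odds ratio below 1 (tvsports, gaming) means a higher rating goes with lower odds that the partner says yes; above 1 (sports, clubbing, shopping) means higher odds.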

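Drop the Non-significant Variables and Test

# Drop the nine predictors whose Wald tests are clearly non-significant
model1.1 <- update(model0, . ~ . - museums - art - reading - theater - movies
                   - concerts - music - yoga - tv)
# Likelihood-ratio test of the reduced model against the full model
anova(model1.1, model0, test = "Chisq")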
## Analysis of Deviance Table
## 
## Model 1: dec_o ~ sports + tvsports + exercise + dining + hiking + gaming + 
##     clubbing + shopping
## Model 2: dec_o ~ sports + tvsports + exercise + dining + museums + art + 
##     hiking + gaming + clubbing + reading + tv + theater + movies + 
##     concerts + music + shopping + yoga
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
## 1      5796     7794.2                       
## 2      5787     7777.9  9   16.305  0.06078 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
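The p-value in the last column can be reproduced by hand from the deviance difference and its degrees of freedom. A small sketch, using only the two numbers printed in the table (the variable names are illustrative):

```r
# Deviance difference and degrees of freedom read from the table above.
dev_diff <- 16.305
df_diff  <- 9

# Upper-tail chi-square probability = p-value of the likelihood-ratio test.
p_value <- pchisq(dev_diff, df = df_diff, lower.tail = FALSE)
print(p_value)

# 5% critical value, for comparison with dev_diff.
print(qchisq(0.95, df = df_diff))
```

Because the p-value is about 0.06, the nine dropped variables are not jointly significant at the 5% level, which is why the smaller model is kept.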

AIC

AIC(model0,model1.1)
##          df      AIC
## model0   18 7813.887
## model1.1  9 7812.192
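The AIC values can be cross-checked against the residual deviances in the analysis-of-deviance table. This sketch assumes the response is an ungrouped 0/1 variable, in which case the saturated log-likelihood is zero and AIC reduces to deviance plus twice the number of parameters.

```r
# Residual deviances from the analysis-of-deviance table; parameter counts
# from the df column of the AIC output.
deviance_val <- c(model0 = 7777.9, model1.1 = 7794.2)
n_param      <- c(model0 = 18,     model1.1 = 9)

# Assumption: ungrouped binary response, so -2 * logLik equals the deviance
# and AIC = deviance + 2 * k.
aic_by_hand <- deviance_val + 2 * n_param
print(aic_by_hand)
```

The results agree with the printed AIC values up to the rounding of the deviances.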

Stepwise Regression

#step(model1.1,.~.^2)

model.new<-glm(formula = dec_o ~ sports + tvsports + exercise 
               + dining + hiking + gaming + clubbing + shopping 
               + hiking:clubbing + dining:gaming + dining:clubbing
               + dining:shopping + sports:tvsports 
               + tvsports:clubbing + gaming:clubbing 
               + sports:gaming + exercise:dining 
               + clubbing:shopping + exercise:hiking 
               + dining:hiking, family = binomial(link = "logit"),
                 data = train)
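The commented-out `step()` call searches over two-way interactions of the terms already in the model. The sketch below shows the same kind of call on synthetic data, since the real `train` data frame is not shown here; `toy`, `main_only`, and `selected` are hypothetical names.

```r
set.seed(1)  # synthetic data only; not the speed-dating data

n  <- 500
x1 <- runif(n, 1, 10)
x2 <- runif(n, 1, 10)
x3 <- runif(n, 1, 10)
# The simulated truth contains an x1:x2 interaction.
eta <- -1 + 0.1 * x1 - 0.2 * x2 + 0.05 * x1 * x2
toy <- data.frame(y = rbinom(n, 1, plogis(eta)), x1, x2, x3)

main_only <- glm(y ~ x1 + x2 + x3, family = binomial(link = "logit"), data = toy)

# scope = . ~ .^2 allows every two-way interaction of the current terms;
# step() adds or drops one term at a time while that lowers the AIC.
selected <- step(main_only, scope = . ~ .^2, trace = 0)
print(formula(selected))
print(AIC(main_only, selected))
```

With eight main effects there are 28 candidate two-way interactions, so the search on the real data can take a while, which may be why the call is commented out and its result written out as an explicit formula.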

AIC(model0,model1.1,model.new)
##           df      AIC
## model0    18 7813.887
## model1.1   9 7812.192
## model.new 21 7740.359

Prediction Results

Tested on a held-out 30% of the data
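The split and the prediction step are not shown on the slides. A minimal sketch of how such a table could be produced, assuming a random 70/30 split and a 0.5 probability cutoff (both assumptions), on synthetic data with hypothetical names:

```r
set.seed(42)  # fixes the random split so the table can be reproduced

# Synthetic stand-in for the real data: one rating and a 0/1 decision.
n   <- 1000
toy <- data.frame(rating = sample(1:10, n, replace = TRUE))
toy$dec <- rbinom(n, 1, plogis(-2 + 0.3 * toy$rating))

# 70% of the rows for fitting, the remaining 30% for testing.
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train_toy <- toy[train_idx, ]
test_toy  <- toy[-train_idx, ]

fit <- glm(dec ~ rating, family = binomial(link = "logit"), data = train_toy)

# Predicted probabilities on the held-out rows, cut at 0.5 (assumed threshold).
prob  <- predict(fit, newdata = test_toy, type = "response")
Ypred <- factor(as.integer(prob > 0.5), levels = c(0, 1))
tab_toy <- table(Y = factor(test_toy$dec, levels = c(0, 1)), Ypred = Ypred)
print(tab_toy)
```

Fixing the seed matters here: without it, every run draws a different test set and the confusion table changes from run to run.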

tab
##    Ypred
## Y      0    1
##   0 1260  242
##   1  755  237

Accuracy = \(\frac{1260+237}{1260+237+755+242}=0.600\)
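Accuracy alone hides how the errors are distributed. The sketch below recomputes it from the counts in the table as printed and adds sensitivity, specificity, and the accuracy of always predicting the majority class; the object names are illustrative.

```r
# Confusion table as printed above: rows = actual Y, columns = predicted.
tab_interest <- matrix(c(1260, 242,
                          755, 237),
                       nrow = 2, byrow = TRUE,
                       dimnames = list(Y = c("0", "1"), Ypred = c("0", "1")))

accuracy    <- sum(diag(tab_interest)) / sum(tab_interest)
sensitivity <- tab_interest["1", "1"] / sum(tab_interest["1", ])  # actual 1 predicted as 1
specificity <- tab_interest["0", "0"] / sum(tab_interest["0", ])  # actual 0 predicted as 0
# Accuracy of always predicting the majority class, as a baseline.
baseline    <- max(rowSums(tab_interest)) / sum(tab_interest)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, baseline = baseline), 3))
```

With these counts the accuracy is about 0.600, against about 0.602 for always predicting 0, so the interest-only model does not beat the trivial baseline.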

ROC curve

\(auc = 0.5692534\)
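An AUC of 0.5 corresponds to random guessing, so 0.569 is only slightly better than chance. The slides do not say how the AUC was computed; one way to obtain it without extra packages is the rank (Mann-Whitney) identity, sketched here with a hypothetical helper `auc_rank()` and made-up scores.

```r
# AUC via the rank (Mann-Whitney) identity: the probability that a randomly
# chosen positive case gets a higher score than a randomly chosen negative one.
auc_rank <- function(score, label) {
  r     <- rank(score)              # average ranks handle ties
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Tiny hand-made example (hypothetical scores, not from the study).
label <- c(0, 0, 1, 1)
score <- c(0.1, 0.4, 0.35, 0.8)
print(auc_rank(score, label))  # 3 of the 4 positive/negative pairs are ordered correctly: 0.75
```

On the real data, `score` would be the predicted probabilities on the test set and `label` the observed decisions.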

Can You Know the Outcome Before the Date?
No.

Try the Ratings Given During the Date

Variables

  • attr : Attractive
  • sinc : Sincere
  • intel : Intelligent
  • fun : Fun
  • amb : Ambitious
  • shar : Shared Interests/Hobbies

plot

Regression Model

Each attribute is rated from 1 to 10, and the response variable is whether the rater wants to see the partner again.

\(dec_{pred}=\beta_0+\beta_1\ attr+\beta_2\ sinc+\beta_3\ intel+\beta_4\ fun+\beta_5\ amb+\beta_6\ shar\)

## glm(formula = dec ~ ., family = binomial(link = "logit"), data = train2)
##                Estimate Std. Error     z value      Pr(>|z|)
## (Intercept) -5.22889632 0.22344214 -23.4015674 4.119389e-121
## attr         0.54920080 0.02574977  21.3283767 6.191958e-101
## sinc        -0.09626930 0.02961857  -3.2503017  1.152826e-03
## intel       -0.01118552 0.03614843  -0.3094331  7.569921e-01
## fun          0.27111068 0.02823758   9.6010599  7.912652e-22
## amb         -0.15745055 0.02805145  -5.6129191  1.989416e-08
## shar         0.26755610 0.02221556  12.0436353  2.095150e-33
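To see what these estimates imply on the probability scale, the linear predictor can be passed through the inverse logit. A minimal sketch with the estimates copied from the table; the helper `p_yes()` and the rating profiles are hypothetical.

```r
# Estimates copied from the coefficient table above.
beta <- c(intercept = -5.22889632, attr = 0.54920080, sinc = -0.09626930,
          intel = -0.01118552, fun = 0.27111068, amb = -0.15745055,
          shar = 0.26755610)

# Predicted probability of saying yes for one set of ratings (1-10 each).
p_yes <- function(attr, sinc, intel, fun, amb, shar) {
  x <- c(1, attr, sinc, intel, fun, amb, shar)
  plogis(sum(beta * x))   # inverse logit of the linear predictor
}

# Hypothetical partner rated 7 on everything, then with attr raised to 9.
print(p_yes(7, 7, 7, 7, 7, 7))
print(p_yes(9, 7, 7, 7, 7, 7))
```

Raising `attr` increases the predicted probability, while raising `sinc` or `amb` with everything else fixed lowers it, in line with the signs of the estimates.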

Drop the Non-significant Variable and Refit

model01<-update(model00,.~.-intel)
## glm(formula = dec ~ attr + sinc + fun + amb + shar, family = binomial(link = "logit"), 
##     data = train2)
##               Estimate Std. Error    z value      Pr(>|z|)
## (Intercept) -5.2592939 0.21234630 -24.767533 2.007129e-135
## attr         0.5497900 0.02573646  21.362300 2.996857e-101
## sinc        -0.1004672 0.02615828  -3.840742  1.226630e-04
## fun          0.2707435 0.02809545   9.636563  5.603269e-22
## amb         -0.1605729 0.02591154  -6.196963  5.756298e-10
## shar         0.2669980 0.02219823  12.027897  2.535355e-33

AIC

AIC(model00,model01)
##         df      AIC
## model00  7 5032.409
## model01  6 5035.351

Predicting with the Original Model

Tested on a held-out 30% of the data

tab
##    Ypred
## Y     0   1
##   0 940 247
##   1 293 623

Accuracy = \(\frac{940+623}{940+623+293+247}=0.743\)

Accuracy improves considerably over the interest-only model.

ROC curve

\(auc = 0.822\)

Stepwise Regression

Take the pairwise interactions between variables into account

model02<-glm(formula = dec ~ attr + sinc + intel + fun + amb + shar + 
    fun:amb + intel:shar + attr:shar, family = binomial(link = "logit"), 
    data = train2)

Besides the original variables, three interaction terms are added:
fun:amb, intel:shar, and attr:shar.
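In an R formula, a term such as `fun:amb` adds one column to the design matrix, equal to the product of the two ratings. A short sketch with two hypothetical raters:

```r
# Two hypothetical raters, to show what the design matrix looks like.
ratings <- data.frame(fun = c(2, 5), amb = c(3, 4))

# fun:amb adds a single column equal to the product of the two ratings.
X <- model.matrix(~ fun + amb + fun:amb, data = ratings)
print(X)
```

The coefficient on that column lets the effect of `fun` on the log-odds depend on the value of `amb`, and vice versa.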

tab
##    Ypred
## Y     0   1
##   0 943 244
##   1 288 628

Accuracy = \(\frac{943+628}{943+628+288+244}=0.747\)
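The two confusion tables for the during-date models can be compared directly. A small sketch using the counts exactly as printed in the two tables (object names are illustrative):

```r
# Confusion tables as printed (rows = actual Y, columns = predicted).
tab_main <- matrix(c(940, 247, 293, 623), nrow = 2, byrow = TRUE)  # main effects only
tab_int  <- matrix(c(943, 244, 288, 628), nrow = 2, byrow = TRUE)  # with interactions

accuracy <- function(tab) sum(diag(tab)) / sum(tab)

print(round(c(main = accuracy(tab_main), interactions = accuracy(tab_int)), 3))
# Number of additional test cases classified correctly by the larger model.
print(sum(diag(tab_int)) - sum(diag(tab_main)))
```

The interaction model gets 8 more of the 2,103 test cases right, a gain of under half a percentage point, so the extra terms add little predictive value.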

ROC curve

\(auc = 0.8205\)

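Conclusion

Before the date, interests alone make it hard to predict whether things will go any further.
With the ratings given during the date, and holding the other ratings fixed, a higher score on “Sincere” or “Ambitious”
goes with a lower probability of a second meeting.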