Logistic regression

Bernoulli distribution

On Lecture 1 we explored two applications related to the normal and Poisson regression. For this lecture we will focus on logistic regression where the conditional distribution of the response variable given the covariates is a Bernoulli distribution. This is known as binomial logistic regression whereas when the response variable can take more than two possible values the model is known as multinomial logistic regression. As an example of a binomial logistic regression model you can think of the situation where the response variable is if a student has passed or failed his exam. It would be interesting to see whether the number of hours students’ spent revising predicted success in their exam. In the multinomial case the response variable can be a negative, neutral, or positive feedback a student gave for a course. In that case, the teacher would be interested to find out if there was an effect between how well written the slides were and the positive feedback for the course.

If $Y$ is a random variable which follows the Bernoulli distribution we have that $\text{Pr}(Y=0)=q=1-p, \text{Pr}(Y=1)=p$. The probability mass distribution for $Y$ is \[\text{Pr}(Y=y)=p^y (1-p)^{1-y} \quad \text{for} \ y \in \{0,1\}\] with expected value and variance \[\begin{align*} \mathbb{E}[Y]&=p,\\ \mathbb{V}[Y]&=pq. \end{align*}\]

Logistic regression

In binomial logistic regression we represent a function of the mean $p$ of the binary response variable $Y$ as a function of the $k$ covariates $x_{i1}, \ldots, x_{ik}$ as: \[\begin{equation} \text{logit} (p_i)=\beta _0+\beta _1x_{i1}+\ldots+\beta _{k}x_{ik}+\epsilon _i \quad i=1, \ldots, n,\end{equation}\]

where $\text{logit} (p_i)=\log\{\frac{p_i}{1-p_i}\}$ i.e. the logit link function models the log odds of the mean. Another way to look at the previous equation is $\log\{\frac{p_i}{1-p_i}\}=\log\{\ \frac{\text{Pr}(Y=1)}{\text{Pr}(Y=0)} \}$. In case you haven’t already noticed; the mean $p_i$ of the distribution is equal to the probability $\text{Pr}(Y_i=1)$.

In matrix form,

\[\begin{equation} \text{logit} (\mathbb{E}[\boldsymbol{y}])=\boldsymbol{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}, \end{equation}\]

where

\[ \boldsymbol{y}=\left(\begin{array}{c} y_1\\ y_2\\ \vdots\\ y_n \end{array}\right), \boldsymbol{X}=\left( \begin{array}{cccc} 1 & x_{11} & \ldots & x_{1k} \\ 1 & x_{21} & \ldots & x_{2k} \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_{n1} & \ldots & x_{nk} \end{array}\right), \boldsymbol{\beta}=\left(\begin{array}{c} \beta_0\\ \beta_1\\ \vdots\\ \beta_k \end{array}\right), \boldsymbol{\epsilon}=\left(\begin{array}{c} \epsilon_1\\ \epsilon_2\\ \vdots\\ \epsilon_n \end{array}\right) \]

This is similar to the equation in the Regression Modelling section on the Theory page on Lecture 1 but with the difference that we are now modelling a function of the mean of the response variable.

Solving the previous equation with respect to the mean $p_i$ we get \[p_i=\frac{1}{1+\exp\{-\boldsymbol{X}\boldsymbol{\beta}\}}\] which means that we can estimate $p_i$ by $\hat{p_i}$, the estimated probability of $Y_i=1$ by \[\hat{p_i}=\frac{1}{1+\exp\{-\boldsymbol{X}\hat{\boldsymbol{\beta}}\}}\]

ROC curve

We have seen that after using a logistic regression model we can attempt to predict the outcome of the response variable given the covariates. This is a way of assessing the performance of the model. The receiving operating characteristic (ROC) curve is a measure of classifier performance. Using the proportion of positive data points that are correctly considered as positive (true positive rate) and the proportion of negative data points that are mistakenly considered as positive (false positive rate), one can generate a graph that shows the trade off between the rate at which the model correctly predicts something versus the rate of incorrectly predicting something. To summarise, the ROC curve is a graph where on the x-axis we have the false positive rate and on the y-axis the true positive rate. These two rates are equal to $1$ - the specificity of the model, and the sensitivity of the model respectively where

Sensitivity is the proportion of patients who have the disease that are correctly identified.
Specificity is the proportion of patients that don’t have the disease that are correctly identified.

Area under the curve

The area under the ROC curve, known as AUC, is used as a measure of a diagnostic test’s discriminatory power. As an example, if were looking at people having a rare disease or not, the AUC measures whether the model discriminates between two persons (one ill and one healthy). The maximum value for the AUC is $1$, which indicates a perfect test (i.e. $100\%$ sensitivity and $100\%$ specificity). An AUC value of $0.5$ indicates no discrimiative value and is represented by a straight diagonal line.

The graph below shows the ROC curve, in black, for a specific binomial logistic regression model with an AUC of $0.802$; and in red the diagonal line which represents an AUC value of $0.5$.

For a short summary of the ROC curve you can have a look here. After you do that, check the Further reading section and specifically the self help subsection ($8$ and $9$).

Variable selection

Up until now we have assumed that all the covariates that are included in the model are chosen in advance. On this lecture we will see how we can decide which covariates should be included in the model. This is known as variable selection or model selection. It is important to note that the term model selection can be used for lots of different things. As an example, model selection may include choosing between different forms of link functions, different distributions of the errors, etc. For this lecture I will use the term variable selection which is what we will focus on. We will try to see how we can answer the question “Which covariates should be included in the model and which covariates should be excluded from the model?”.

This can either be done manually by adding a covariate in the model and comparing the two models (the model without the covariate and the model with it) or using an automatic method (e.g. forward selection, backward elimination, etc). Automatic methods are preferred when there are a large number of variables in the model.

Stepwise procedures

Backward elimination: This is a procedure where we start with including all covariates in the model and removing them one by one if they satisfy a criterion (e.g. p-value greater than 0.2). The procedure will stop when none of the remaining covariates atisfy that criterion.
Forward selection: This is a procedure where we start with including no covariates at all in the model and we add covariates in the model one by one if they satisfy a criterion (e.g. we choose the covariate with the lowest p-value which also has to be smaller than 0.2). The procedure will stop when none of the excuded covariates satisfy that criterion.
Stepwise regression: This is a combination of the previous two procedures where covariates are added or removed at each stage of the procedure. All these procedures should be taken with a grain of salt since they do have drawbacks. For instance, when removing less significant covariates this has as a result for the significance of the remaining covariates to be increased. In addition, these procedures are not directly connected with specific objectives (e.g. prediction, inference). It is important to remember that even though some covariates have been excuded from the model; that does not mean they are unrelated to the response variable. It is just that they provide no additional explanatory effect beyond the alreay included covariates of the model.

Note: Different criteria can be used for the previous methods (e.g. the model with the minimum AIC,BIC, etc)

Automated model selection

Automated model selection procedures are particularly useful when the number of covariates is large and it is not feasible to fit all possible models. The glmulti function located in the glmulti package is the best option I have found up until now. It is flexible enough to handle all GLMs (unlike the leaps package in R which deals only with linear regression) and any information criteria. The bestglm function located in the bestglm package is an alternative that offers different options for the information criteria but suffers from the same problem as before, it is only applicable to linear models, and it does not allow interactions between the covariates.

For more information on the glmultipackage and a brief overview of the stepwise model selection methods you can have a look here.

Titanic data set

Location of data set

On April 15, 1912, during its maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class. There is currently a competition/challenge on Kaggle about the sinking of the Titanic where the goal is to accurately predict the survival of a passenger based on information about him/her (e.g. age, gender, passenger’s class, etc). I would advise you to sign up on Kaggle and then download the Titanic data set. This consists of a training data set (comprised of 891 passengers) and a test data set (comprised of 418 passengers). You can also find the data set on Moodle. We will only focus on the training data set for this lecture.

Note: The original names of these variables are PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, and Embarked.

passenger.id	survived	passenger.class	name	gender	age	siblings.spouses.aboard	ticket.number	fare	cabin	port.of.embarkation
1	0	3	Braund, Mr. Owen Harris	male	22	1	A/5 21171	7.2500		S
2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Thayer)	female	38	1	PC 17599	71.2833	C85	C
3	1	3	Heikkinen, Miss. Laina	female	26	0	STON/O2. 3101282	7.9250		S
4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35	1	113803	53.1000	C123	S
5	0	3	Allen, Mr. William Henry	male	35	0	373450	8.0500		S
6	0	3	Moran, Mr. James	male	NA	0	330877	8.4583		Q
7	0	1	McCarthy, Mr. Timothy J	male	54	0	17463	51.8625	E46	S

Exploratory data analysis

We can find out more information about the structure of the data set by using the command str.

str(full)

Classes 'tbl_df', 'tbl' and 'data.frame':   891 obs. of  12 variables:
 $ passenger.id           : int  1 2 3 4 5 6 7 8 9 10 ...
 $ survived               : int  0 1 1 1 0 0 0 0 1 1 ...
 $ passenger.class        : int  3 1 3 1 3 3 1 3 3 2 ...
 $ name                   : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
 $ gender                 : chr  "male" "female" "female" "female" ...
 $ age                    : num  22 38 26 35 35 NA 54 2 27 14 ...
 $ siblings.spouses.aboard: int  1 1 0 1 0 0 0 3 0 1 ...
 $ parents.children.aboard: int  0 0 0 0 0 0 0 1 2 0 ...
 $ ticket.number          : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
 $ fare                   : num  7.25 71.28 7.92 53.1 8.05 ...
 $ cabin                  : chr  "" "C85" "" "C123" ...
 $ port.of.embarkation    : chr  "S" "C" "S" "S" ...

We will first remove some columns we are not interested in, handle the missing values and convert some of the variables into more appropriate data types. After that we can produce result summaries of each variable, using the command summary.

 survived passenger.class    gender               age        siblings.spouses.aboard
 0:549    1:216           Length:891         Min.   : 0.42   Min.   :0.000          
 1:342    2:184           Class :character   1st Qu.:22.00   1st Qu.:0.000          
          3:491           Mode  :character   Median :29.70   Median :0.000          
                                             Mean   :29.70   Mean   :0.523          
                                             3rd Qu.:35.00   3rd Qu.:1.000          
                                             Max.   :80.00   Max.   :8.000          
 parents.children.aboard
 Min.   :0.0000         
 1st Qu.:0.0000         
 Median :0.0000         
 Mean   :0.3816         
 3rd Qu.:0.0000         
 Max.   :6.0000

Basic plots

We can plot the contingency table of survival among the two genders and the contingency table of survival with respect to each passenger’s class.

Regression analysis

The most common model for these sort of data is the binomial logistic regression model. We can run the model using the glm function with the argument family=binomial.


Call:
glm(formula = survived ~ ., family = binomial(link = "logit"), 
    data = full)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.6564  -0.6111  -0.4229   0.6130   2.4317  

Coefficients:
                        Estimate Std. Error z value Pr(>|z|)    
(Intercept)              4.07680    0.40613  10.038  < 2e-16 ***
passenger.class2        -1.19117    0.26191  -4.548 5.41e-06 ***
passenger.class3        -2.34887    0.24278  -9.675  < 2e-16 ***
gendermale              -2.76864    0.19878 -13.928  < 2e-16 ***
age                     -0.04016    0.00782  -5.136 2.80e-07 ***
siblings.spouses.aboard -0.33480    0.10873  -3.079  0.00207 ** 
parents.children.aboard -0.08166    0.11467  -0.712  0.47640    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1186.66  on 890  degrees of freedom
Residual deviance:  790.33  on 884  degrees of freedom
AIC: 804.33

Number of Fisher Scoring iterations: 5

If the covariates are not independent of each other, this will have as an effect that the variance of the model coefficients will increase. We would like the variance of the model coefficients to decrease in order for the model to be more robust. That is why is a good idea to omit the variables that show signs of multicollinearity from the model. A way that we can check multicollinearity is by using the vif function located in the car package. Usually, one should omit variables with a vif higher than $5$

                            GVIF Df GVIF^(1/(2*Df))
passenger.class         1.311086  2        1.070059
gender                  1.195490  1        1.093385
age                     1.296394  1        1.138593
siblings.spouses.aboard 1.240057  1        1.113578
parents.children.aboard 1.213541  1        1.101609

This is a good time to test if the residuals are correlated (we would not like that to be the case). We can use the durbinWatsonTest for which the null hypothesis is that the residuals are not correlated.

 lag Autocorrelation D-W Statistic p-value
   1      0.03249683      1.934519   0.324
 Alternative hypothesis: rho != 0

Good news. We have failed to reject the null hypothesis. It seems that it is safe to assume that this model is valid.

We can now plot the odds ratios with their confidence intervals.

We can also plot the predicted probabilities of survival, related to specifc model covariates. In this case we will choose the age and the passenger’s class.

We can use the predict function with the argument type=response to predict the probability of survival based on this model and look at the ROC curve (more info on this). This is also a way to test the overall fit of the model. We would not like for the model to perform as well as a random guess. We can also compare the ROC curves for different models to help us choose between a model.

Let’s say that we want to keep the false positive rate lower than $20\%$ (we want to incorrectly predict that a passenger survived with a probability less than $0.2$). In that case if we assume that

library(ROCR)
full$Prid <-predict(model.bin,full,type="response")
score <- prediction(full$Prid,full$survived)
perf <- performance(score,"tpr","fpr")

where perf is the performance of the predictions of the model regarding the true positive and false positive measures. We can now find the cutoff point by

cutoffs <- data.frame(cut=perf@alpha.values[[1]], fpr=perf@x.values[[1]], 
                      tpr=perf@y.values[[1]])
cutoffs <- cutoffs[order(cutoffs$tpr, decreasing=TRUE),]
head(subset(cutoffs, fpr < 0.2))

          cut       fpr       tpr
256 0.3876013 0.1948998 0.7777778
257 0.3840028 0.1967213 0.7777778
258 0.3788025 0.1985428 0.7777778
255 0.3965224 0.1948998 0.7748538
254 0.4061724 0.1948998 0.7690058
253 0.4064211 0.1948998 0.7660819

So, in this case we can choose a cutoff value of 0.41 which means that any passengers with predicted probabilities of survival above this value will be considered as survivors.

Variable selection

We will first start with the full model and using the backwards elimination.

model.bin <- glm(survived~., family=binomial(link="logit"), data=full)
#Backwards selection
backward <- step(model.bin)

Start:  AIC=804.33
survived ~ passenger.class + gender + age + siblings.spouses.aboard + 
    parents.children.aboard

                          Df Deviance     AIC
- parents.children.aboard  1   790.84  802.84
<none>                         790.33  804.33
- siblings.spouses.aboard  1   801.51  813.51
- age                      1   818.94  830.94
- passenger.class          2   901.34  911.34
- gender                   1  1037.28 1049.28

Step:  AIC=802.84
survived ~ passenger.class + gender + age + siblings.spouses.aboard

                          Df Deviance     AIC
<none>                         790.84  802.84
- siblings.spouses.aboard  1   805.29  815.29
- age                      1   819.15  829.15
- passenger.class          2   901.80  909.80
- gender                   1  1043.77 1053.77

backward


Call:  glm(formula = survived ~ passenger.class + gender + age + siblings.spouses.aboard, 
    family = binomial(link = "logit"), data = full)

Coefficients:
            (Intercept)         passenger.class2         passenger.class3               gendermale  
                4.02742                 -1.18983                 -2.34780                 -2.74022  
                    age  siblings.spouses.aboard  
               -0.03985                 -0.35826  

Degrees of Freedom: 890 Total (i.e. Null);  885 Residual
Null Deviance:      1187 
Residual Deviance: 790.8    AIC: 802.8

formula(backward)

survived ~ passenger.class + gender + age + siblings.spouses.aboard

We can also start from the empty model (with only the intercept) and use the forward selection.

#Forward selection
model.empty <- glm(survived~1, family=binomial(link="logit"), data=full)
forward <- step(model.empty,
                scope=list(lower=formula(model.empty),upper=formula(model.bin)), direction="forward")

Start:  AIC=1188.66
survived ~ 1

                          Df Deviance    AIC
+ gender                   1    917.8  921.8
+ passenger.class          2   1083.1 1089.1
+ parents.children.aboard  1   1180.8 1184.8
+ age                      1   1182.3 1186.3
<none>                         1186.7 1188.7
+ siblings.spouses.aboard  1   1185.5 1189.5

Step:  AIC=921.8
survived ~ gender

                          Df Deviance    AIC
+ passenger.class          2   826.89 834.89
+ siblings.spouses.aboard  1   904.69 910.69
+ parents.children.aboard  1   914.43 920.43
<none>                         917.80 921.80
+ age                      1   917.06 923.06

Step:  AIC=834.89
survived ~ gender + passenger.class

                          Df Deviance    AIC
+ age                      1   805.29 815.29
+ siblings.spouses.aboard  1   819.15 829.15
<none>                         826.89 834.89
+ parents.children.aboard  1   825.06 835.06

Step:  AIC=815.29
survived ~ gender + passenger.class + age

                          Df Deviance    AIC
+ siblings.spouses.aboard  1   790.84 802.84
+ parents.children.aboard  1   801.51 813.51
<none>                         805.29 815.29

Step:  AIC=802.84
survived ~ gender + passenger.class + age + siblings.spouses.aboard

                          Df Deviance    AIC
<none>                         790.84 802.84
+ parents.children.aboard  1   790.33 804.33

forward


Call:  glm(formula = survived ~ gender + passenger.class + age + siblings.spouses.aboard, 
    family = binomial(link = "logit"), data = full)

Coefficients:
            (Intercept)               gendermale         passenger.class2         passenger.class3  
                4.02742                 -2.74022                 -1.18983                 -2.34780  
                    age  siblings.spouses.aboard  
               -0.03985                 -0.35826  

Degrees of Freedom: 890 Total (i.e. Null);  885 Residual
Null Deviance:      1187 
Residual Deviance: 790.8    AIC: 802.8

formula(forward)

survived ~ gender + passenger.class + age + siblings.spouses.aboard

We can also use the argument direction=both in the step function. This allows the algorithm to choose between forward and backward selection at each step.

Start:  AIC=1188.66
survived ~ 1

                          Df Deviance    AIC
+ gender                   1    917.8  921.8
+ passenger.class          2   1083.1 1089.1
+ parents.children.aboard  1   1180.8 1184.8
+ age                      1   1182.3 1186.3
<none>                         1186.7 1188.7
+ siblings.spouses.aboard  1   1185.5 1189.5

Step:  AIC=921.8
survived ~ gender

                          Df Deviance     AIC
+ passenger.class          2   826.89  834.89
+ siblings.spouses.aboard  1   904.69  910.69
+ parents.children.aboard  1   914.43  920.43
<none>                         917.80  921.80
+ age                      1   917.06  923.06
- gender                   1  1186.66 1188.66

Step:  AIC=834.89
survived ~ gender + passenger.class

                          Df Deviance     AIC
+ age                      1   805.29  815.29
+ siblings.spouses.aboard  1   819.15  829.15
<none>                         826.89  834.89
+ parents.children.aboard  1   825.06  835.06
- passenger.class          2   917.80  921.80
- gender                   1  1083.11 1089.11

Step:  AIC=815.29
survived ~ gender + passenger.class + age

                          Df Deviance     AIC
+ siblings.spouses.aboard  1   790.84  802.84
+ parents.children.aboard  1   801.51  813.51
<none>                         805.29  815.29
- age                      1   826.89  834.89
- passenger.class          2   917.06  923.06
- gender                   1  1046.48 1054.48

Step:  AIC=802.84
survived ~ gender + passenger.class + age + siblings.spouses.aboard

                          Df Deviance     AIC
<none>                         790.84  802.84
+ parents.children.aboard  1   790.33  804.33
- siblings.spouses.aboard  1   805.29  815.29
- age                      1   819.15  829.15
- passenger.class          2   901.80  909.80
- gender                   1  1043.77 1053.77


Call:  glm(formula = survived ~ gender + passenger.class + age + siblings.spouses.aboard, 
    family = binomial(link = "logit"), data = full)

Coefficients:
            (Intercept)               gendermale         passenger.class2         passenger.class3  
                4.02742                 -2.74022                 -1.18983                 -2.34780  
                    age  siblings.spouses.aboard  
               -0.03985                 -0.35826  

Degrees of Freedom: 890 Total (i.e. Null);  885 Residual
Null Deviance:      1187 
Residual Deviance: 790.8    AIC: 802.8

survived ~ gender + passenger.class + age + siblings.spouses.aboard

Finally, we can use the glmulti function, allow for interactions between the covariates, and use the BIC as the criterion we would like to be minimised.


Call:
fitfunc(formula = as.formula(x), family = ..1, data = data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.0484  -0.6186  -0.4897   0.3663   2.5423  

Coefficients:
                                                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)                                      5.240113   0.690632   7.587 3.26e-14 ***
passenger.class2                                -1.197178   0.733635  -1.632 0.102713    
passenger.class3                                -3.870291   0.630624  -6.137 8.40e-10 ***
gendermale                                      -3.996719   0.623808  -6.407 1.48e-10 ***
age                                             -0.045827   0.008544  -5.364 8.15e-08 ***
parents.children.aboard:siblings.spouses.aboard -0.255927   0.072199  -3.545 0.000393 ***
passenger.class2:gendermale                     -0.396289   0.805568  -0.492 0.622764    
passenger.class3:gendermale                      2.078590   0.666602   3.118 0.001820 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1186.66  on 890  degrees of freedom
Residual deviance:  755.89  on 883  degrees of freedom
AIC: 771.89

Number of Fisher Scoring iterations: 6

Regression and data visualisation (questions)

Simple linear regression

We will first focus on the linear_reg data set which is located on Moodle. Download it and load it in R. The data set is comprised of $8$ columns, $4$ of them refer to covariates while the other $4$ refer to response variables. The full data set can be seen below

x1	x2	x3	x4	y1	y2	y3	y4
10	10	10	8	8.04	9.14	7.46	6.58
8	8	8	8	6.95	8.14	6.77	5.76
13	13	13	8	7.58	8.74	12.74	7.71
9	9	9	8	8.81	8.77	7.11	8.84
11	11	11	8	8.33	9.26	7.81	8.47
14	14	14	8	9.96	8.10	8.84	7.04
6	6	6	8	7.24	6.13	6.08	5.25
4	4	4	19	4.26	3.10	5.39	12.50
12	12	12	8	10.84	9.13	8.15	5.56
7	7	7	8	4.82	7.26	6.42	7.91
5	5	5	8	5.68	4.74	5.73	6.89

Your job is simple. You have to focus on two specific columns and run a simple linear regression model. Specifically, the 1st model should be comprised of $(x_1, y_1)$, the 2nd of $(x_2, y_2)$, the 3rd of $(x_3, y_3)$ and the last model is comprised of $(x_4, y_4)$.

Note: The response variables are the $y$ variables while the covariates are the $x$ variables.

Questions:

Do you notice anything?
How would you report these findings?
Would you try anything different for some of those data sets?

Single data set

We will now focus on the single_dataset data set which is located on Moodle. Download it and load it in R.

 [1] 1.03 1.24 1.47 1.52 1.92 1.93 1.94 1.95 1.96 1.97 1.98 1.99 2.72 2.75 2.78 2.81 2.84 2.87 2.90
[20] 2.93 2.96 2.99 3.60 3.64 3.66 3.72 3.77 3.88 3.91 4.14 4.54 4.77 4.81 5.62

Questions:

What would you do to visualise such a data set?
Can you explain your findings?

Multiple data sets

Now we will focus on the multiple_dataset which can also be found on Moodle. This is comprised of 3 different artificial data sets. The first $7$ rows of the data set can be seen below.

Note: You should also try and visualise them in a single plot

data1	data2	data3
-2.375872	0.2464703	-2.182037
-1.889814	1.4788872	-1.268482
-2.501377	-0.9337329	-1.875719
-1.042831	3.6399040	1.000932
-1.802295	-3.0531474	-1.936155
-2.492281	-3.6871996	-3.052505
-1.707543	0.0360403	3.032020

Questions:

Can you find a way to visualise all the data sets in a single graph
Do you notice any similarities or differences?

What is happening?

Data from a study on average SAT scores are located on the mosaic package in R. The data set is titled SAT. Using the command head(SAT) you can see the first $6$ rows of the data set

       state expend ratio salary frac verbal math  sat
1    Alabama  4.405  17.2 31.144    8    491  538 1029
2     Alaska  8.963  17.6 47.951   47    445  489  934
3    Arizona  4.778  19.3 32.175   27    448  496  944
4   Arkansas  4.459  17.1 28.934    6    482  523 1005
5 California  4.992  24.0 41.078   45    417  485  902
6   Colorado  5.443  18.4 34.571   29    462  518  980

We will focus only on the columns salary, frac, and sat which denote the estimated average annual salary of teachers, the percentage of all eligible students taking the SAT, and the average total SAT score. I would like you to run a simple linear regression model where the response variable is the SAT score and the only covariate is the salary of the teachers; and then run another model that also includes the percentage of all eligible students taking the SAT as an additional covariate.

Questions:

Do you see anything unusual?
How would you report these findings?

Explain the graph

Finally, let’s assume that we want to be able to predict if a patient has a serious disease based on some personal characteristics (e.g. good diet, smoker or not, etc). We can fit a logistic regression model and find out how good are the predictions of the model. If we have named the model logistic.glm, we can type logistic.glm$fitted values to see the fitted values. All these values are between $0$ and $1$ and denote the probability of each person having the disease. We can choose a cutoff value such as $0.4$ so that any person that has a fitted value of at least $0.4$ is classified as having that disease, and any person with a fitted value lower than $0.4$ is classified as a person that doesn’t have the disease.

We may also want to optimise different measures. Here are three of them:

Sensitivity: The proportion of patients who have the disease that are correctly identified.
Specificity: The proportion of patients that don’t have the disease that are correctly identified.
Correct classification rate: The proportions of predictions that are correct.

The x-axis on the next plot refers to the value of the cutoff point while the y-axis refers to the value of the measure we want to optimise.

Questions:

Can you explain what you see in the plot?
If your job was to talk to the doctor, explain these findings, and suggest new strategies what would you tell him/her?

Regression and data visualisation (answers)

Simple linear regression

Exploratory data analysis This data set is known as the Ancsombe data set and is located in the datasets library. You should install the package with install.packages("datasets"), load it with library(datasets), and finally load the data set with data(anscombe).

We can find out more information about the structure of the data set by using the command str(anscombe), and produce result summaries of each variable, using the command summary(anscombe).

str(anscombe)

'data.frame':   11 obs. of  8 variables:
 $ x1: num  10 8 13 9 11 14 6 4 12 7 ...
 $ x2: num  10 8 13 9 11 14 6 4 12 7 ...
 $ x3: num  10 8 13 9 11 14 6 4 12 7 ...
 $ x4: num  8 8 8 8 8 8 8 19 8 8 ...
 $ y1: num  8.04 6.95 7.58 8.81 8.33 ...
 $ y2: num  9.14 8.14 8.74 8.77 9.26 8.1 6.13 3.1 9.13 7.26 ...
 $ y3: num  7.46 6.77 12.74 7.11 7.81 ...
 $ y4: num  6.58 5.76 7.71 8.84 8.47 7.04 5.25 12.5 5.56 7.91 ...

summary(anscombe)

       x1             x2             x3             x4           y1               y2       
 Min.   : 4.0   Min.   : 4.0   Min.   : 4.0   Min.   : 8   Min.   : 4.260   Min.   :3.100  
 1st Qu.: 6.5   1st Qu.: 6.5   1st Qu.: 6.5   1st Qu.: 8   1st Qu.: 6.315   1st Qu.:6.695  
 Median : 9.0   Median : 9.0   Median : 9.0   Median : 8   Median : 7.580   Median :8.140  
 Mean   : 9.0   Mean   : 9.0   Mean   : 9.0   Mean   : 9   Mean   : 7.501   Mean   :7.501  
 3rd Qu.:11.5   3rd Qu.:11.5   3rd Qu.:11.5   3rd Qu.: 8   3rd Qu.: 8.570   3rd Qu.:8.950  
 Max.   :14.0   Max.   :14.0   Max.   :14.0   Max.   :19   Max.   :10.840   Max.   :9.260  
       y3              y4        
 Min.   : 5.39   Min.   : 5.250  
 1st Qu.: 6.25   1st Qu.: 6.170  
 Median : 7.11   Median : 7.040  
 Mean   : 7.50   Mean   : 7.501  
 3rd Qu.: 7.98   3rd Qu.: 8.190  
 Max.   :12.74   Max.   :12.500

We can also see that the results from the individual linear models are similar for each data set.

line1 <- lm(y1 ~ x1, data=anscombe)
line2 <- lm(y2 ~ x2, data=anscombe)
line3 <- lm(y3 ~ x3, data=anscombe)
line4 <- lm(y4 ~ x4, data=anscombe)
line1


Call:
lm(formula = y1 ~ x1, data = anscombe)

Coefficients:
(Intercept)           x1  
     3.0001       0.5001

line2


Call:
lm(formula = y2 ~ x2, data = anscombe)

Coefficients:
(Intercept)           x2  
      3.001        0.500

line3


Call:
lm(formula = y3 ~ x3, data = anscombe)

Coefficients:
(Intercept)           x3  
     3.0025       0.4997

line4


Call:
lm(formula = y4 ~ x4, data = anscombe)

Coefficients:
(Intercept)           x4  
     3.0017       0.4999

We can now plot all 4 data sets and the regression lines for each one.

circle.size = 3
colors = list('red', 'blue', 'green', 'yellow')


#plot1
plot1 <- ggplot(anscombe, aes(x=x1, y=y1)) + geom_point(size=circle.size, pch=21, fill=colors[[1]]) +
  geom_abline(intercept=line1$coefficients[1], slope=line1$coefficients[2])+xlim(3,19)+ylim(2,14)

#plot2
plot2 <- ggplot(anscombe, aes(x=x2, y=y2)) + geom_point(size=circle.size, pch=21, fill=colors[[2]]) +
  geom_abline(intercept=line2$coefficients[1], slope=line2$coefficients[2])+xlim(3,19)+ylim(2,14)

#plot3
plot3 <- ggplot(anscombe, aes(x=x3, y=y3)) + geom_point(size=circle.size, pch=21, fill=colors[[3]]) +
  geom_abline(intercept=line3$coefficients[1], slope=line3$coefficients[2])+xlim(3,19)+ylim(2,14)

#plot4
plot4 <- ggplot(anscombe, aes(x=x4, y=y4)) + geom_point(size=circle.size, pch=21, fill=colors[[4]]) +
  geom_abline(intercept=line4$coefficients[1], slope=line4$coefficients[2])+xlim(3,19)+ylim(2,14)

grid.arrange(plot1, plot2, plot3, plot4)

Single data set

We should always be careful when using the histogram funtion in R. The results can look totally different depending on the value of the breaks argument.

par(mfrow=c(1,2))
x <- c(1.03, 1.24, 1.47, 1.52, 1.92, 1.93, 1.94, 1.95, 1.96, 1.97, 1.98,
       1.99, 2.72, 2.75, 2.78, 2.81, 2.84, 2.87, 2.9, 2.93, 2.96, 2.99, 3.6,
       3.64, 3.66, 3.72, 3.77, 3.88, 3.91, 4.14, 4.54, 4.77, 4.81, 5.62)
hist(x,breaks=seq(0.3,6.7,by=0.8),xlim=c(0,6.7),col="red",freq=FALSE)
hist(x,breaks=0:8,col="blue",freq=FALSE)

Multiple data sets

It is important to go a bit deeper than just the usual boxplot command in R. In this case we can also use the beanplot package in R. This provides more information about the data sets.

library("beanplot")
set.seed(1)
 par(mfrow = c(1, 2), mai = c(0.5, 0.5, 0.5, 0.1))
 mu <- 2
 si <- 0.6
 c <- 500
 bimodal <- c(rnorm(c/2, -mu, si), rnorm(c/2, mu, si))
 uniform <- runif(c, -4, 4)
 normal <- rnorm(c, 0, 1.5)
 ylim <- c(-7, 7)
 boxplot(bimodal, uniform, normal, ylim = ylim, main = "boxplot", names = 1:3)
 beanplot(bimodal, uniform, normal, ylim = ylim, main = "beanplot",col = c("#CAB2D6", "#33A02C", "#B2DF8A"), border = "#CAB2D6")

 par(mfrow=c(1,1))

What is happening?

Using the function summary on both models we have

model.1 <- lm(sat ~ salary, data=SAT)
summary(model.1)


Call:
lm(formula = sat ~ salary, data = SAT)

Residuals:
     Min       1Q   Median       3Q      Max 
-147.125  -45.354    4.073   42.193  125.279 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1158.859     57.659  20.098  < 2e-16 ***
salary        -5.540      1.632  -3.394  0.00139 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 67.89 on 48 degrees of freedom
Multiple R-squared:  0.1935,    Adjusted R-squared:  0.1767 
F-statistic: 11.52 on 1 and 48 DF,  p-value: 0.001391

model.2 <- lm(sat ~ salary+frac, data=SAT)
summary(model.2)


Call:
lm(formula = sat ~ salary + frac, data = SAT)

Residuals:
    Min      1Q  Median      3Q     Max 
-78.313 -26.731   3.168  18.951  75.590 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 987.9005    31.8775  30.991   <2e-16 ***
salary        2.1804     1.0291   2.119   0.0394 *  
frac         -2.7787     0.2285 -12.163    4e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 33.69 on 47 degrees of freedom
Multiple R-squared:  0.8056,    Adjusted R-squared:  0.7973 
F-statistic: 97.36 on 2 and 47 DF,  p-value: < 2.2e-16

So on the first model the salary covariate has a negative coefficient while for the second model it has a positive coefficient. In both models the coefficient is significantly different than $0$.

Maybe a plot will help, but with different colours to represent low (red), medium (blue), and high (green) values for the fraction of the students taking the SAT.

SAT$fracgrp = cut(SAT$frac, breaks=c(0, 22, 49, 81),
  labels=c("low", "medium", "high"))
SAT$fraccount = 1 + as.numeric(SAT$fracgrp=="medium") +
  2*as.numeric(SAT$fracgrp=="high")
panel.labels = function(x, y, col='black', labels='x',...)
  { panel.text(x,y,labels,col=col,...)}
xyplot(sat ~salary, data=SAT, groups=fracgrp,
  cex=0.6, panel=panel.labels, labels=SAT$state,
  col=c('red','blue','green')[SAT$fraccount])

Within each of the $3$ different groups, the slope is positive, while overall there is a negative slope. This is due to the result of a strong negative association between sat and frac, as well as a strong positive association between salary and frac. This is known as the Simpson’s paradox and is the result of the absence within a model of a confounding variable (i.e. the frac covariate in this case). For more information on the paradox have a look here.

Explain the graph

library(memisc)
suicides <- as.data.set(spss.system.file("2nd_project_lecture02.sav"))
#Complex model
log.suicide <- glm(suicide_risk~age+marital_status+mother_negligence+father_negligence+self_estrangement
                   +isolation+normlessness+meaninglessness+drug_use+
                     metal+worshipping+vicarious,family="binomial",data=suicides)

mod <- log.suicide
y <- suicides$suicide_risk
# using a cutoff of cut, calculate sensitivity, specificity, and classification rate
perf = function(cut, mod, y)
{
  yhat = (mod$fit>cut)
  w = which(y==1)
  sensitivity = mean( yhat[w] == 1 )
  specificity = mean( yhat[-w] == 0 )
  c.rate = mean( y==yhat )
   out = t(as.matrix(c(sensitivity, specificity, c.rate)))
  colnames(out) = c("sensitivity", "specificity", "c.rate")
  return(out)
}



s = seq(.01,.99,length=1000)
OUT = matrix(0,1000,3)
for(i in 1:1000) OUT[i,]=perf(s[i],mod,y)
plot(s,OUT[,1],xlab="Cutoff",ylab="Value",cex.lab=1,cex.axis=1,ylim=c(0,1),type="l",lwd=2,axes=FALSE,col=2)
axis(1,seq(0,1,length=5),seq(0,1,length=5),cex.lab=0.5)
axis(2,seq(0,1,length=5),seq(0,1,length=5),cex.lab=0.5)
#mtext("Cutoff",
 #     side=1) #Add text to the x-axis
#mtext("Probability",
#      side=2)
lines(s,OUT[,2],col="darkgreen",lwd=2)
lines(s,OUT[,3],col=4,lwd=2)
#lines(s,OUT[,4],col="darkred",lwd=2)
box()
#legend(0,.25,col=c(2,"darkgreen",4,"darkred"),lwd=c(2,2,2,2),c("Sensitivity","Specificity","Classification Rate","Distance"))
legend(0.4,.25,col=c(2,"darkgreen",4),lwd=c(2,2,2),c("Sensitivity","Specificity","Classification Rate"))

The three different measures are:

Sensitivity: The proportion of 1’s that are correctly identified
Specificity: The proportion of 0’s that are correctly identified
Correct classification rate: The proportions of predictions that are correct

Some comments that we can make based on the graph are:

Looking at when the cutoff value is 0.01 (all people with fitted values above 0.01 are classified as having the disease) we can see that the sensitivity is 100%. This makes sense since we classify almost all people as having the disease so we expect that $P(\hat{Y_i}=1|Y_i=1)$ would be close to 1. On the other hand, since we specify almost everyone as having the disease, that means that the specificity will be low since the model does not correctly identify all the people who haven’t the disease.
Looking at when the cutoff value is 0.99 (all people with fitted values above 0.99 are classified as having the disease) we can see that the specificity is 100%. This makes sense since we classify almost all people as not having the disease so we expect that $P(\hat{Y_i}=0|Y_i=0)$ would be close to 1. On the other hand, since we specify almost everyone as not having the disease, that means that the sensitivity will be low since the model does not correctly identify all the people who have the disease.
Deciding on which cutoff value we will choose is really important, but before we do that we have to decide which is the most important measure to optimise. You can see from the graph that different cutoff values optimise different measures. For example, if our model was aiming to evaluate a diagnostic test for a serious disease that has a relatively safe cure, the sensitivity is far more important that the specificity. In another case, if the disease was relatively minor and the treatment was risky, specificity would be more important to control.
It seems that if we want to choose a cutoff value which gives high values on all three different measures; we could choose the value 0.25 since all three probabilities are above 0.75.

Regression modelling

Description of data set

Download the data set project_lecture03.sav which is located on Moodle. Install the package memisc and load the data set in R with the command name_of_data_set <- as.data.set(spss.system.file("project_lecture03.sav")).

The data set is comprised of $121$ rows and $15$ columns. Each row refers to a different person while the columns refer to:

age denotes the age of a person.
age_group denotes in which age group the person belongs to. The value $1$ refers to $14-16$ years old while $2$ refers to $16-19$ years old.
drug_use refers to how many times the person used drug substances during the last year.
father_negligence and mother_negligence are scales in which a high score is associated with a perception of cold and rejecting family relationships.
gender shows the gender of the person.
isolation corresponds to a subjective perception of lack of support.
marital_status shows the marital status of the person’s parents. The two possible values are together or separated/divorced.
meaninglessness describes youths that may doubt the relevance of school in attaining future employment.
metal describes how much the person listens to metal music.
normlessness is defined as a belief that socially dissaproved behaviours may be used to achieve certain goals.
self_estrangement refers to persons who have a negative self-perception and who are overwhelmed by difficulties they consider out of control.
suicide_risk shows if a person is considered to be a suicide risk. The value $0$ refers to persons who are not considered to be suicide risks while the value $1$ refers to persons who are considered to be suicide risks.
vicarious music refers to when somebody listens to music when angry and/or bringing out aggressiveness by listening to music.
worshipping refers to behavioural manifestation of worshipping (e.g. hanging posters, acquiring information about singers, hanging out with other fans).

Note: For all the previous variables that are measured in a scale (i.e. father_negligence, mother_negligence, isolation, meaninglessness, metal, normlessness, self_estrangement, vicarious, worshipping); high values of the scale correspond to a behaviour/feeling that happens more, while low values correspond to a behaviour/feeling that is less present.

Group project

Things to do:

Run a regression model using variables from this data set.

Things to be careful of or/and think about:

Choose the appropriate model for the response variable you decide to use.
What is the question that you try to answer using this model?
Why did you choose this model instead of another?
Is the model valid?
Do you have any concerns about any of the variables or even the data set as a whole?
Can you think of any additional variables that would make the data set even more interesting?
Be ready to present your results.

Titanic data set

Location of data set

library(ggplot2) # visualisation
library(ggthemes) # visualisation
library(scales) # visualisation
library(dplyr) # data manipulation
library(mice) # imputation
library(bindrcpp)
library(sjPlot)
library(sjmisc)
#Instead of train.csv  you should use Titanic.csv (this is how the data set is titled on Moodle)
train <- read.csv('train.csv', stringsAsFactors = F)

#test  <- read.csv('test.csv', stringsAsFactors = F)
full  <- bind_rows(train)#, test) # bind training & test data
colnames(full) <- c("passenger.id","survived","passenger.class","name","gender","age","siblings.spouses.aboard","parents.children.aboard","ticket.number","fare","cabin","port.of.embarkation")
full <- as.tbl(full)
# check data
#str(full)

library(knitr)
kable(full[c(1:7),])

passenger.id	survived	passenger.class	name	gender	age	siblings.spouses.aboard	ticket.number	fare	cabin	port.of.embarkation
1	0	3	Braund, Mr. Owen Harris	male	22	1	A/5 21171	7.2500		S
2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Thayer)	female	38	1	PC 17599	71.2833	C85	C
3	1	3	Heikkinen, Miss. Laina	female	26	0	STON/O2. 3101282	7.9250		S
4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35	1	113803	53.1000	C123	S
5	0	3	Allen, Mr. William Henry	male	35	0	373450	8.0500		S
6	0	3	Moran, Mr. James	male	NA	0	330877	8.4583		Q
7	0	1	McCarthy, Mr. Timothy J	male	54	0	17463	51.8625	E46	S

Exploratory data analysis

str(full)

Classes 'tbl_df', 'tbl' and 'data.frame':   891 obs. of  12 variables:
 $ passenger.id           : int  1 2 3 4 5 6 7 8 9 10 ...
 $ survived               : int  0 1 1 1 0 0 0 0 1 1 ...
 $ passenger.class        : int  3 1 3 1 3 3 1 3 3 2 ...
 $ name                   : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
 $ gender                 : chr  "male" "female" "female" "female" ...
 $ age                    : num  22 38 26 35 35 NA 54 2 27 14 ...
 $ siblings.spouses.aboard: int  1 1 0 1 0 0 0 3 0 1 ...
 $ parents.children.aboard: int  0 0 0 0 0 0 0 1 2 0 ...
 $ ticket.number          : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
 $ fare                   : num  7.25 71.28 7.92 53.1 8.05 ...
 $ cabin                  : chr  "" "C85" "" "C123" ...
 $ port.of.embarkation    : chr  "S" "C" "S" "S" ...

removecols <- c("passenger.id","name","ticket.number","fare","cabin","port.of.embarkation")
remove <- which(names(full) %in% removecols)
full <- full[-remove]
full$age[is.na(full$age)] <- mean(full$age,na.rm=T)
full$survived <- as.factor(full$survived)
full$passenger.class <- as.factor(full$passenger.class)
summary(full)

 survived passenger.class    gender               age        siblings.spouses.aboard
 0:549    1:216           Length:891         Min.   : 0.42   Min.   :0.000          
 1:342    2:184           Class :character   1st Qu.:22.00   1st Qu.:0.000          
          3:491           Mode  :character   Median :29.70   Median :0.000          
                                             Mean   :29.70   Mean   :0.523          
                                             3rd Qu.:35.00   3rd Qu.:1.000          
                                             Max.   :80.00   Max.   :8.000          
 parents.children.aboard
 Min.   :0.0000         
 1st Qu.:0.0000         
 Median :0.0000         
 Mean   :0.3816         
 3rd Qu.:0.0000         
 Max.   :6.0000

Basic plots

sjp.xtab(full$survived,full$gender,geom.colors = c("pink","blue"),show.values = FALSE,show.total = FALSE)

sjp.xtab(full$survived,full$passenger.class,geom.colors = c("red","purple","green"),show.values = FALSE,show.total = FALSE)

Regression analysis

model.bin <- glm(survived~., family=binomial(link="logit"), data=full)
summary(model.bin)


Call:
glm(formula = survived ~ ., family = binomial(link = "logit"), 
    data = full)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.6564  -0.6111  -0.4229   0.6130   2.4317  

Coefficients:
                        Estimate Std. Error z value Pr(>|z|)    
(Intercept)              4.07680    0.40613  10.038  < 2e-16 ***
passenger.class2        -1.19117    0.26191  -4.548 5.41e-06 ***
passenger.class3        -2.34887    0.24278  -9.675  < 2e-16 ***
gendermale              -2.76864    0.19878 -13.928  < 2e-16 ***
age                     -0.04016    0.00782  -5.136 2.80e-07 ***
siblings.spouses.aboard -0.33480    0.10873  -3.079  0.00207 ** 
parents.children.aboard -0.08166    0.11467  -0.712  0.47640    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1186.66  on 890  degrees of freedom
Residual deviance:  790.33  on 884  degrees of freedom
AIC: 804.33

Number of Fisher Scoring iterations: 5

library(car)
vif(model.bin)

                            GVIF Df GVIF^(1/(2*Df))
passenger.class         1.311086  2        1.070059
gender                  1.195490  1        1.093385
age                     1.296394  1        1.138593
siblings.spouses.aboard 1.240057  1        1.113578
parents.children.aboard 1.213541  1        1.101609

durbinWatsonTest(model.bin)

 lag Autocorrelation D-W Statistic p-value
   1      0.03249683      1.934519   0.298
 Alternative hypothesis: rho != 0

sjp.glm(model.bin)

sjp.glm(model.bin,type="pred",vars=c("age","passenger.class"),facet.grid = FALSE,geom.colors = c("red","purple","green"))

library(ROCR)
full$Prid <-predict(model.bin,full,type="response")
score <- prediction(full$Prid,full$survived)
perf <- performance(score,"tpr","fpr")
auc <- performance(score,"auc")
plot(performance(score,"tpr","fpr"),col="green",main = paste("Area under the curve:", round(auc@y.values[[1]] ,3)))

library(ROCR)
full$Prid <-predict(model.bin,full,type="response")
score <- prediction(full$Prid,full$survived)
perf <- performance(score,"tpr","fpr")

cutoffs <- data.frame(cut=perf@alpha.values[[1]], fpr=perf@x.values[[1]], 
                      tpr=perf@y.values[[1]])
cutoffs <- cutoffs[order(cutoffs$tpr, decreasing=TRUE),]
head(subset(cutoffs, fpr < 0.2))

          cut       fpr       tpr
256 0.3876013 0.1948998 0.7777778
257 0.3840028 0.1967213 0.7777778
258 0.3788025 0.1985428 0.7777778
255 0.3965224 0.1948998 0.7748538
254 0.4061724 0.1948998 0.7690058
253 0.4064211 0.1948998 0.7660819

Variable selection

train <- read.csv('train.csv', stringsAsFactors = F)
full  <- bind_rows(train)#, test) # bind training & test data
colnames(full) <- c("passenger.id","survived","passenger.class","name","gender","age","siblings.spouses.aboard","parents.children.aboard","ticket.number","fare","cabin","port.of.embarkation")
full <- as.tbl(full)
removecols <- c("passenger.id","name","ticket.number","fare","cabin","port.of.embarkation")
remove <- which(names(full) %in% removecols)
full <- full[-remove]
full$age[is.na(full$age)] <- mean(full$age,na.rm=T)
full$survived <- as.factor(full$survived)
full$passenger.class <- as.factor(full$passenger.class)

model.bin <- glm(survived~., family=binomial(link="logit"), data=full)
#Backwards selection
backward <- step(model.bin)

Start:  AIC=804.33
survived ~ passenger.class + gender + age + siblings.spouses.aboard + 
    parents.children.aboard

                          Df Deviance     AIC
- parents.children.aboard  1   790.84  802.84
<none>                         790.33  804.33
- siblings.spouses.aboard  1   801.51  813.51
- age                      1   818.94  830.94
- passenger.class          2   901.34  911.34
- gender                   1  1037.28 1049.28

Step:  AIC=802.84
survived ~ passenger.class + gender + age + siblings.spouses.aboard

                          Df Deviance     AIC
<none>                         790.84  802.84
- siblings.spouses.aboard  1   805.29  815.29
- age                      1   819.15  829.15
- passenger.class          2   901.80  909.80
- gender                   1  1043.77 1053.77

backward


Call:  glm(formula = survived ~ passenger.class + gender + age + siblings.spouses.aboard, 
    family = binomial(link = "logit"), data = full)

Coefficients:
            (Intercept)         passenger.class2         passenger.class3               gendermale  
                4.02742                 -1.18983                 -2.34780                 -2.74022  
                    age  siblings.spouses.aboard  
               -0.03985                 -0.35826  

Degrees of Freedom: 890 Total (i.e. Null);  885 Residual
Null Deviance:      1187 
Residual Deviance: 790.8    AIC: 802.8

formula(backward)

survived ~ passenger.class + gender + age + siblings.spouses.aboard

#Forward selection
model.empty <- glm(survived~1, family=binomial(link="logit"), data=full)
forward <- step(model.empty,
                scope=list(lower=formula(model.empty),upper=formula(model.bin)), direction="forward")

Start:  AIC=1188.66
survived ~ 1

                          Df Deviance    AIC
+ gender                   1    917.8  921.8
+ passenger.class          2   1083.1 1089.1
+ parents.children.aboard  1   1180.8 1184.8
+ age                      1   1182.3 1186.3
<none>                         1186.7 1188.7
+ siblings.spouses.aboard  1   1185.5 1189.5

Step:  AIC=921.8
survived ~ gender

                          Df Deviance    AIC
+ passenger.class          2   826.89 834.89
+ siblings.spouses.aboard  1   904.69 910.69
+ parents.children.aboard  1   914.43 920.43
<none>                         917.80 921.80
+ age                      1   917.06 923.06

Step:  AIC=834.89
survived ~ gender + passenger.class

                          Df Deviance    AIC
+ age                      1   805.29 815.29
+ siblings.spouses.aboard  1   819.15 829.15
<none>                         826.89 834.89
+ parents.children.aboard  1   825.06 835.06

Step:  AIC=815.29
survived ~ gender + passenger.class + age

                          Df Deviance    AIC
+ siblings.spouses.aboard  1   790.84 802.84
+ parents.children.aboard  1   801.51 813.51
<none>                         805.29 815.29

Step:  AIC=802.84
survived ~ gender + passenger.class + age + siblings.spouses.aboard

                          Df Deviance    AIC
<none>                         790.84 802.84
+ parents.children.aboard  1   790.33 804.33

forward


Call:  glm(formula = survived ~ gender + passenger.class + age + siblings.spouses.aboard, 
    family = binomial(link = "logit"), data = full)

Coefficients:
            (Intercept)               gendermale         passenger.class2         passenger.class3  
                4.02742                 -2.74022                 -1.18983                 -2.34780  
                    age  siblings.spouses.aboard  
               -0.03985                 -0.35826  

Degrees of Freedom: 890 Total (i.e. Null);  885 Residual
Null Deviance:      1187 
Residual Deviance: 790.8    AIC: 802.8

formula(forward)

survived ~ gender + passenger.class + age + siblings.spouses.aboard

bothways <-step(model.empty, list(lower=formula(model.empty),upper=formula(model.bin)),
         direction="both")

Start:  AIC=1188.66
survived ~ 1

                          Df Deviance    AIC
+ gender                   1    917.8  921.8
+ passenger.class          2   1083.1 1089.1
+ parents.children.aboard  1   1180.8 1184.8
+ age                      1   1182.3 1186.3
<none>                         1186.7 1188.7
+ siblings.spouses.aboard  1   1185.5 1189.5

Step:  AIC=921.8
survived ~ gender

                          Df Deviance     AIC
+ passenger.class          2   826.89  834.89
+ siblings.spouses.aboard  1   904.69  910.69
+ parents.children.aboard  1   914.43  920.43
<none>                         917.80  921.80
+ age                      1   917.06  923.06
- gender                   1  1186.66 1188.66

Step:  AIC=834.89
survived ~ gender + passenger.class

                          Df Deviance     AIC
+ age                      1   805.29  815.29
+ siblings.spouses.aboard  1   819.15  829.15
<none>                         826.89  834.89
+ parents.children.aboard  1   825.06  835.06
- passenger.class          2   917.80  921.80
- gender                   1  1083.11 1089.11

Step:  AIC=815.29
survived ~ gender + passenger.class + age

                          Df Deviance     AIC
+ siblings.spouses.aboard  1   790.84  802.84
+ parents.children.aboard  1   801.51  813.51
<none>                         805.29  815.29
- age                      1   826.89  834.89
- passenger.class          2   917.06  923.06
- gender                   1  1046.48 1054.48

Step:  AIC=802.84
survived ~ gender + passenger.class + age + siblings.spouses.aboard

                          Df Deviance     AIC
<none>                         790.84  802.84
+ parents.children.aboard  1   790.33  804.33
- siblings.spouses.aboard  1   805.29  815.29
- age                      1   819.15  829.15
- passenger.class          2   901.80  909.80
- gender                   1  1043.77 1053.77

bothways


Call:  glm(formula = survived ~ gender + passenger.class + age + siblings.spouses.aboard, 
    family = binomial(link = "logit"), data = full)

Coefficients:
            (Intercept)               gendermale         passenger.class2         passenger.class3  
                4.02742                 -2.74022                 -1.18983                 -2.34780  
                    age  siblings.spouses.aboard  
               -0.03985                 -0.35826  

Degrees of Freedom: 890 Total (i.e. Null);  885 Residual
Null Deviance:      1187 
Residual Deviance: 790.8    AIC: 802.8

formula(bothways)

survived ~ gender + passenger.class + age + siblings.spouses.aboard

library(glmulti)
glmulti.logistic <-
  glmulti(survived~., data = full,
          level = 2,               # Use level 2 to include interactions
          method = "h",            # Use "h" for exhaustive approach
          crit = "bic",            # BIC as the criterion
          confsetsize = 5,        # Keep 5 best models
          plotty = F, report = F,   # Without a graph and no reports
          fitfunction = "glm",     # glm function
          family = binomial)       # binomial family for logistic regression


summary(glmulti.logistic@objects[[1]])


Call:
fitfunc(formula = as.formula(x), family = ..1, data = data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.0484  -0.6186  -0.4897   0.3663   2.5423  

Coefficients:
                                                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)                                      5.240113   0.690632   7.587 3.26e-14 ***
passenger.class2                                -1.197178   0.733635  -1.632 0.102713    
passenger.class3                                -3.870291   0.630624  -6.137 8.40e-10 ***
gendermale                                      -3.996719   0.623808  -6.407 1.48e-10 ***
age                                             -0.045827   0.008544  -5.364 8.15e-08 ***
parents.children.aboard:siblings.spouses.aboard -0.255927   0.072199  -3.545 0.000393 ***
passenger.class2:gendermale                     -0.396289   0.805568  -0.492 0.622764    
passenger.class3:gendermale                      2.078590   0.666602   3.118 0.001820 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1186.66  on 890  degrees of freedom
Residual deviance:  755.89  on 883  degrees of freedom
AIC: 771.89

Number of Fisher Scoring iterations: 6

Logistic regression

Bernoulli distribution

Logistic regression

ROC curve

Area under the curve

Variable selection

Variable selection

Stepwise procedures

Automated model selection

Further reading

Reference

Self help

Titanic data set

Location of data set

Exploratory data analysis

Basic plots

Regression analysis

Variable selection

Regression and data visualisation (questions)

Simple linear regression

Single data set

Multiple data sets

What is happening?

Explain the graph

Regression and data visualisation (answers)

Simple linear regression

Single data set

Multiple data sets

What is happening?

Explain the graph

Regression modelling

Description of data set

Group project

Titanic data set

Location of data set

Exploratory data analysis

Basic plots

Regression analysis

Variable selection