On Lecture 1 we explored two applications related to the normal and Poisson regression. For this lecture we will focus on logistic regression where the conditional distribution of the response variable given the covariates is a Bernoulli distribution. This is known as binomial logistic regression whereas when the response variable can take more than two possible values the model is known as multinomial logistic regression. As an example of a binomial logistic regression model you can think of the situation where the response variable is if a student has passed or failed his exam. It would be interesting to see whether the number of hours students’ spent revising predicted success in their exam. In the multinomial case the response variable can be a negative, neutral, or positive feedback a student gave for a course. In that case, the teacher would be interested to find out if there was an effect between how well written the slides were and the positive feedback for the course.
If \(Y\) is a random variable which follows the Bernoulli distribution we have that \(\text{Pr}(Y=0)=q=1-p, \text{Pr}(Y=1)=p\). The probability mass distribution for \(Y\) is \[\text{Pr}(Y=y)=p^y (1-p)^{1-y} \quad \text{for} \ y \in \{0,1\}\] with expected value and variance \[\begin{align*} \mathbb{E}[Y]&=p,\\ \mathbb{V}[Y]&=pq. \end{align*}\]where \(\text{logit} (p_i)=\log\{\frac{p_i}{1-p_i}\}\) i.e. the logit link function models the log odds of the mean. Another way to look at the previous equation is \(\log\{\frac{p_i}{1-p_i}\}=\log\{\ \frac{\text{Pr}(Y=1)}{\text{Pr}(Y=0)} \}\). In case you haven’t already noticed; the mean \(p_i\) of the distribution is equal to the probability \(\text{Pr}(Y_i=1)\).
In matrix form,
\[\begin{equation} \text{logit} (\mathbb{E}[\boldsymbol{y}])=\boldsymbol{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}, \end{equation}\]where
\[ \boldsymbol{y}=\left(\begin{array}{c} y_1\\ y_2\\ \vdots\\ y_n \end{array}\right), \boldsymbol{X}=\left( \begin{array}{cccc} 1 & x_{11} & \ldots & x_{1k} \\ 1 & x_{21} & \ldots & x_{2k} \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_{n1} & \ldots & x_{nk} \end{array}\right), \boldsymbol{\beta}=\left(\begin{array}{c} \beta_0\\ \beta_1\\ \vdots\\ \beta_k \end{array}\right), \boldsymbol{\epsilon}=\left(\begin{array}{c} \epsilon_1\\ \epsilon_2\\ \vdots\\ \epsilon_n \end{array}\right) \]
This is similar to the equation in the Regression Modelling section on the Theory page on Lecture 1 but with the difference that we are now modelling a function of the mean of the response variable.
Solving the previous equation with respect to the mean \(p_i\) we get \[p_i=\frac{1}{1+\exp\{-\boldsymbol{X}\boldsymbol{\beta}\}}\] which means that we can estimate \(p_i\) by \(\hat{p_i}\), the estimated probability of \(Y_i=1\) by \[\hat{p_i}=\frac{1}{1+\exp\{-\boldsymbol{X}\hat{\boldsymbol{\beta}}\}}\]
We have seen that after using a logistic regression model we can attempt to predict the outcome of the response variable given the covariates. This is a way of assessing the performance of the model. The receiving operating characteristic (ROC) curve is a measure of classifier performance. Using the proportion of positive data points that are correctly considered as positive (true positive rate) and the proportion of negative data points that are mistakenly considered as positive (false positive rate), one can generate a graph that shows the trade off between the rate at which the model correctly predicts something versus the rate of incorrectly predicting something. To summarise, the ROC curve is a graph where on the x-axis we have the false positive rate and on the y-axis the true positive rate. These two rates are equal to \(1\) - the specificity of the model, and the sensitivity of the model respectively where
Sensitivity is the proportion of patients who have the disease that are correctly identified.
Specificity is the proportion of patients that don’t have the disease that are correctly identified.
The area under the ROC curve, known as AUC, is used as a measure of a diagnostic test’s discriminatory power. As an example, if were looking at people having a rare disease or not, the AUC measures whether the model discriminates between two persons (one ill and one healthy). The maximum value for the AUC is \(1\), which indicates a perfect test (i.e. \(100\%\) sensitivity and \(100\%\) specificity). An AUC value of \(0.5\) indicates no discrimiative value and is represented by a straight diagonal line.
The graph below shows the ROC curve, in black, for a specific binomial logistic regression model with an AUC of \(0.802\); and in red the diagonal line which represents an AUC value of \(0.5\).
For a short summary of the ROC curve you can have a look here. After you do that, check the Further reading section and specifically the self help subsection (\(8\) and \(9\)).
Up until now we have assumed that all the covariates that are included in the model are chosen in advance. On this lecture we will see how we can decide which covariates should be included in the model. This is known as variable selection or model selection. It is important to note that the term model selection can be used for lots of different things. As an example, model selection may include choosing between different forms of link functions, different distributions of the errors, etc. For this lecture I will use the term variable selection which is what we will focus on. We will try to see how we can answer the question “Which covariates should be included in the model and which covariates should be excluded from the model?”.
This can either be done manually by adding a covariate in the model and comparing the two models (the model without the covariate and the model with it) or using an automatic method (e.g. forward selection, backward elimination, etc). Automatic methods are preferred when there are a large number of variables in the model.
Backward elimination: This is a procedure where we start with including all covariates in the model and removing them one by one if they satisfy a criterion (e.g. p-value greater than 0.2). The procedure will stop when none of the remaining covariates atisfy that criterion.
Forward selection: This is a procedure where we start with including no covariates at all in the model and we add covariates in the model one by one if they satisfy a criterion (e.g. we choose the covariate with the lowest p-value which also has to be smaller than 0.2). The procedure will stop when none of the excuded covariates satisfy that criterion.
Stepwise regression: This is a combination of the previous two procedures where covariates are added or removed at each stage of the procedure. All these procedures should be taken with a grain of salt since they do have drawbacks. For instance, when removing less significant covariates this has as a result for the significance of the remaining covariates to be increased. In addition, these procedures are not directly connected with specific objectives (e.g. prediction, inference). It is important to remember that even though some covariates have been excuded from the model; that does not mean they are unrelated to the response variable. It is just that they provide no additional explanatory effect beyond the alreay included covariates of the model.
Note: Different criteria can be used for the previous methods (e.g. the model with the minimum AIC,BIC, etc)
Automated model selection procedures are particularly useful when the number of covariates is large and it is not feasible to fit all possible models. The glmulti function located in the glmulti package is the best option I have found up until now. It is flexible enough to handle all GLMs (unlike the leaps package in R which deals only with linear regression) and any information criteria. The bestglm function located in the bestglm package is an alternative that offers different options for the information criteria but suffers from the same problem as before, it is only applicable to linear models, and it does not allow interactions between the covariates.
For more information on the glmultipackage and a brief overview of the stepwise model selection methods you can have a look here.
On April 15, 1912, during its maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class. There is currently a competition/challenge on Kaggle about the sinking of the Titanic where the goal is to accurately predict the survival of a passenger based on information about him/her (e.g. age, gender, passenger’s class, etc). I would advise you to sign up on Kaggle and then download the Titanic data set. This consists of a training data set (comprised of 891 passengers) and a test data set (comprised of 418 passengers). You can also find the data set on Moodle. We will only focus on the training data set for this lecture.
Note: The original names of these variables are PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, and Embarked.
| passenger.id | survived | passenger.class | name | gender | age | siblings.spouses.aboard | parents.children.aboard | ticket.number | fare | cabin | port.of.embarkation |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 | S | |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 | S | |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 | S | |
| 6 | 0 | 3 | Moran, Mr. James | male | NA | 0 | 0 | 330877 | 8.4583 | Q | |
| 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54 | 0 | 0 | 17463 | 51.8625 | E46 | S |
We can find out more information about the structure of the data set by using the command str.
str(full)Classes 'tbl_df', 'tbl' and 'data.frame': 891 obs. of 12 variables:
$ passenger.id : int 1 2 3 4 5 6 7 8 9 10 ...
$ survived : int 0 1 1 1 0 0 0 0 1 1 ...
$ passenger.class : int 3 1 3 1 3 3 1 3 3 2 ...
$ name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
$ gender : chr "male" "female" "female" "female" ...
$ age : num 22 38 26 35 35 NA 54 2 27 14 ...
$ siblings.spouses.aboard: int 1 1 0 1 0 0 0 3 0 1 ...
$ parents.children.aboard: int 0 0 0 0 0 0 0 1 2 0 ...
$ ticket.number : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
$ fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ cabin : chr "" "C85" "" "C123" ...
$ port.of.embarkation : chr "S" "C" "S" "S" ...
We will first remove some columns we are not interested in, handle the missing values and convert some of the variables into more appropriate data types. After that we can produce result summaries of each variable, using the command summary.
survived passenger.class gender age siblings.spouses.aboard
0:549 1:216 Length:891 Min. : 0.42 Min. :0.000
1:342 2:184 Class :character 1st Qu.:22.00 1st Qu.:0.000
3:491 Mode :character Median :29.70 Median :0.000
Mean :29.70 Mean :0.523
3rd Qu.:35.00 3rd Qu.:1.000
Max. :80.00 Max. :8.000
parents.children.aboard
Min. :0.0000
1st Qu.:0.0000
Median :0.0000
Mean :0.3816
3rd Qu.:0.0000
Max. :6.0000
We can plot the contingency table of survival among the two genders and the contingency table of survival with respect to each passenger’s class.
The most common model for these sort of data is the binomial logistic regression model. We can run the model using the glm function with the argument family=binomial.
Call:
glm(formula = survived ~ ., family = binomial(link = "logit"),
data = full)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.6564 -0.6111 -0.4229 0.6130 2.4317
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.07680 0.40613 10.038 < 2e-16 ***
passenger.class2 -1.19117 0.26191 -4.548 5.41e-06 ***
passenger.class3 -2.34887 0.24278 -9.675 < 2e-16 ***
gendermale -2.76864 0.19878 -13.928 < 2e-16 ***
age -0.04016 0.00782 -5.136 2.80e-07 ***
siblings.spouses.aboard -0.33480 0.10873 -3.079 0.00207 **
parents.children.aboard -0.08166 0.11467 -0.712 0.47640
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1186.66 on 890 degrees of freedom
Residual deviance: 790.33 on 884 degrees of freedom
AIC: 804.33
Number of Fisher Scoring iterations: 5
If the covariates are not independent of each other, this will have as an effect that the variance of the model coefficients will increase. We would like the variance of the model coefficients to decrease in order for the model to be more robust. That is why is a good idea to omit the variables that show signs of multicollinearity from the model. A way that we can check multicollinearity is by using the vif function located in the car package. Usually, one should omit variables with a vif higher than \(5\)
GVIF Df GVIF^(1/(2*Df))
passenger.class 1.311086 2 1.070059
gender 1.195490 1 1.093385
age 1.296394 1 1.138593
siblings.spouses.aboard 1.240057 1 1.113578
parents.children.aboard 1.213541 1 1.101609
This is a good time to test if the residuals are correlated (we would not like that to be the case). We can use the durbinWatsonTest for which the null hypothesis is that the residuals are not correlated.
lag Autocorrelation D-W Statistic p-value
1 0.03249683 1.934519 0.324
Alternative hypothesis: rho != 0
Good news. We have failed to reject the null hypothesis. It seems that it is safe to assume that this model is valid.
We can now plot the odds ratios with their confidence intervals.
We can also plot the predicted probabilities of survival, related to specifc model covariates. In this case we will choose the age and the passenger’s class.
We can use the predict function with the argument type=response to predict the probability of survival based on this model and look at the ROC curve (more info on this). This is also a way to test the overall fit of the model. We would not like for the model to perform as well as a random guess. We can also compare the ROC curves for different models to help us choose between a model.
Let’s say that we want to keep the false positive rate lower than \(20\%\) (we want to incorrectly predict that a passenger survived with a probability less than \(0.2\)). In that case if we assume that
library(ROCR)
full$Prid <-predict(model.bin,full,type="response")
score <- prediction(full$Prid,full$survived)
perf <- performance(score,"tpr","fpr")perf is the performance of the predictions of the model regarding the true positive and false positive measures. We can now find the cutoff point by
cutoffs <- data.frame(cut=perf@alpha.values[[1]], fpr=perf@x.values[[1]],
tpr=perf@y.values[[1]])
cutoffs <- cutoffs[order(cutoffs$tpr, decreasing=TRUE),]
head(subset(cutoffs, fpr < 0.2)) cut fpr tpr
256 0.3876013 0.1948998 0.7777778
257 0.3840028 0.1967213 0.7777778
258 0.3788025 0.1985428 0.7777778
255 0.3965224 0.1948998 0.7748538
254 0.4061724 0.1948998 0.7690058
253 0.4064211 0.1948998 0.7660819
So, in this case we can choose a cutoff value of 0.41 which means that any passengers with predicted probabilities of survival above this value will be considered as survivors.
We will first start with the full model and using the backwards elimination.
model.bin <- glm(survived~., family=binomial(link="logit"), data=full)
#Backwards selection
backward <- step(model.bin)Start: AIC=804.33
survived ~ passenger.class + gender + age + siblings.spouses.aboard +
parents.children.aboard
Df Deviance AIC
- parents.children.aboard 1 790.84 802.84
<none> 790.33 804.33
- siblings.spouses.aboard 1 801.51 813.51
- age 1 818.94 830.94
- passenger.class 2 901.34 911.34
- gender 1 1037.28 1049.28
Step: AIC=802.84
survived ~ passenger.class + gender + age + siblings.spouses.aboard
Df Deviance AIC
<none> 790.84 802.84
- siblings.spouses.aboard 1 805.29 815.29
- age 1 819.15 829.15
- passenger.class 2 901.80 909.80
- gender 1 1043.77 1053.77
backward
Call: glm(formula = survived ~ passenger.class + gender + age + siblings.spouses.aboard,
family = binomial(link = "logit"), data = full)
Coefficients:
(Intercept) passenger.class2 passenger.class3 gendermale
4.02742 -1.18983 -2.34780 -2.74022
age siblings.spouses.aboard
-0.03985 -0.35826
Degrees of Freedom: 890 Total (i.e. Null); 885 Residual
Null Deviance: 1187
Residual Deviance: 790.8 AIC: 802.8
formula(backward)survived ~ passenger.class + gender + age + siblings.spouses.aboard
We can also start from the empty model (with only the intercept) and use the forward selection.
#Forward selection
model.empty <- glm(survived~1, family=binomial(link="logit"), data=full)
forward <- step(model.empty,
scope=list(lower=formula(model.empty),upper=formula(model.bin)), direction="forward")Start: AIC=1188.66
survived ~ 1
Df Deviance AIC
+ gender 1 917.8 921.8
+ passenger.class 2 1083.1 1089.1
+ parents.children.aboard 1 1180.8 1184.8
+ age 1 1182.3 1186.3
<none> 1186.7 1188.7
+ siblings.spouses.aboard 1 1185.5 1189.5
Step: AIC=921.8
survived ~ gender
Df Deviance AIC
+ passenger.class 2 826.89 834.89
+ siblings.spouses.aboard 1 904.69 910.69
+ parents.children.aboard 1 914.43 920.43
<none> 917.80 921.80
+ age 1 917.06 923.06
Step: AIC=834.89
survived ~ gender + passenger.class
Df Deviance AIC
+ age 1 805.29 815.29
+ siblings.spouses.aboard 1 819.15 829.15
<none> 826.89 834.89
+ parents.children.aboard 1 825.06 835.06
Step: AIC=815.29
survived ~ gender + passenger.class + age
Df Deviance AIC
+ siblings.spouses.aboard 1 790.84 802.84
+ parents.children.aboard 1 801.51 813.51
<none> 805.29 815.29
Step: AIC=802.84
survived ~ gender + passenger.class + age + siblings.spouses.aboard
Df Deviance AIC
<none> 790.84 802.84
+ parents.children.aboard 1 790.33 804.33
forward
Call: glm(formula = survived ~ gender + passenger.class + age + siblings.spouses.aboard,
family = binomial(link = "logit"), data = full)
Coefficients:
(Intercept) gendermale passenger.class2 passenger.class3
4.02742 -2.74022 -1.18983 -2.34780
age siblings.spouses.aboard
-0.03985 -0.35826
Degrees of Freedom: 890 Total (i.e. Null); 885 Residual
Null Deviance: 1187
Residual Deviance: 790.8 AIC: 802.8
formula(forward)survived ~ gender + passenger.class + age + siblings.spouses.aboard
We can also use the argument direction=both in the step function. This allows the algorithm to choose between forward and backward selection at each step.
Start: AIC=1188.66
survived ~ 1
Df Deviance AIC
+ gender 1 917.8 921.8
+ passenger.class 2 1083.1 1089.1
+ parents.children.aboard 1 1180.8 1184.8
+ age 1 1182.3 1186.3
<none> 1186.7 1188.7
+ siblings.spouses.aboard 1 1185.5 1189.5
Step: AIC=921.8
survived ~ gender
Df Deviance AIC
+ passenger.class 2 826.89 834.89
+ siblings.spouses.aboard 1 904.69 910.69
+ parents.children.aboard 1 914.43 920.43
<none> 917.80 921.80
+ age 1 917.06 923.06
- gender 1 1186.66 1188.66
Step: AIC=834.89
survived ~ gender + passenger.class
Df Deviance AIC
+ age 1 805.29 815.29
+ siblings.spouses.aboard 1 819.15 829.15
<none> 826.89 834.89
+ parents.children.aboard 1 825.06 835.06
- passenger.class 2 917.80 921.80
- gender 1 1083.11 1089.11
Step: AIC=815.29
survived ~ gender + passenger.class + age
Df Deviance AIC
+ siblings.spouses.aboard 1 790.84 802.84
+ parents.children.aboard 1 801.51 813.51
<none> 805.29 815.29
- age 1 826.89 834.89
- passenger.class 2 917.06 923.06
- gender 1 1046.48 1054.48
Step: AIC=802.84
survived ~ gender + passenger.class + age + siblings.spouses.aboard
Df Deviance AIC
<none> 790.84 802.84
+ parents.children.aboard 1 790.33 804.33
- siblings.spouses.aboard 1 805.29 815.29
- age 1 819.15 829.15
- passenger.class 2 901.80 909.80
- gender 1 1043.77 1053.77
Call: glm(formula = survived ~ gender + passenger.class + age + siblings.spouses.aboard,
family = binomial(link = "logit"), data = full)
Coefficients:
(Intercept) gendermale passenger.class2 passenger.class3
4.02742 -2.74022 -1.18983 -2.34780
age siblings.spouses.aboard
-0.03985 -0.35826
Degrees of Freedom: 890 Total (i.e. Null); 885 Residual
Null Deviance: 1187
Residual Deviance: 790.8 AIC: 802.8
survived ~ gender + passenger.class + age + siblings.spouses.aboard
Finally, we can use the glmulti function, allow for interactions between the covariates, and use the BIC as the criterion we would like to be minimised.
Call:
fitfunc(formula = as.formula(x), family = ..1, data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.0484 -0.6186 -0.4897 0.3663 2.5423
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 5.240113 0.690632 7.587 3.26e-14 ***
passenger.class2 -1.197178 0.733635 -1.632 0.102713
passenger.class3 -3.870291 0.630624 -6.137 8.40e-10 ***
gendermale -3.996719 0.623808 -6.407 1.48e-10 ***
age -0.045827 0.008544 -5.364 8.15e-08 ***
parents.children.aboard:siblings.spouses.aboard -0.255927 0.072199 -3.545 0.000393 ***
passenger.class2:gendermale -0.396289 0.805568 -0.492 0.622764
passenger.class3:gendermale 2.078590 0.666602 3.118 0.001820 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1186.66 on 890 degrees of freedom
Residual deviance: 755.89 on 883 degrees of freedom
AIC: 771.89
Number of Fisher Scoring iterations: 6
We will first focus on the linear_reg data set which is located on Moodle. Download it and load it in R. The data set is comprised of \(8\) columns, \(4\) of them refer to covariates while the other \(4\) refer to response variables. The full data set can be seen below
| x1 | x2 | x3 | x4 | y1 | y2 | y3 | y4 |
|---|---|---|---|---|---|---|---|
| 10 | 10 | 10 | 8 | 8.04 | 9.14 | 7.46 | 6.58 |
| 8 | 8 | 8 | 8 | 6.95 | 8.14 | 6.77 | 5.76 |
| 13 | 13 | 13 | 8 | 7.58 | 8.74 | 12.74 | 7.71 |
| 9 | 9 | 9 | 8 | 8.81 | 8.77 | 7.11 | 8.84 |
| 11 | 11 | 11 | 8 | 8.33 | 9.26 | 7.81 | 8.47 |
| 14 | 14 | 14 | 8 | 9.96 | 8.10 | 8.84 | 7.04 |
| 6 | 6 | 6 | 8 | 7.24 | 6.13 | 6.08 | 5.25 |
| 4 | 4 | 4 | 19 | 4.26 | 3.10 | 5.39 | 12.50 |
| 12 | 12 | 12 | 8 | 10.84 | 9.13 | 8.15 | 5.56 |
| 7 | 7 | 7 | 8 | 4.82 | 7.26 | 6.42 | 7.91 |
| 5 | 5 | 5 | 8 | 5.68 | 4.74 | 5.73 | 6.89 |
Your job is simple. You have to focus on two specific columns and run a simple linear regression model. Specifically, the 1st model should be comprised of \((x_1, y_1)\), the 2nd of \((x_2, y_2)\), the 3rd of \((x_3, y_3)\) and the last model is comprised of \((x_4, y_4)\).
Note: The response variables are the \(y\) variables while the covariates are the \(x\) variables.
Questions:
Do you notice anything?
How would you report these findings?
Would you try anything different for some of those data sets?
We will now focus on the single_dataset data set which is located on Moodle. Download it and load it in R.
[1] 1.03 1.24 1.47 1.52 1.92 1.93 1.94 1.95 1.96 1.97 1.98 1.99 2.72 2.75 2.78 2.81 2.84 2.87 2.90
[20] 2.93 2.96 2.99 3.60 3.64 3.66 3.72 3.77 3.88 3.91 4.14 4.54 4.77 4.81 5.62
Questions:
What would you do to visualise such a data set?
Can you explain your findings?
Now we will focus on the multiple_dataset which can also be found on Moodle. This is comprised of 3 different artificial data sets. The first \(7\) rows of the data set can be seen below.
Note: You should also try and visualise them in a single plot
| data1 | data2 | data3 |
|---|---|---|
| -2.375872 | 0.2464703 | -2.182037 |
| -1.889814 | 1.4788872 | -1.268482 |
| -2.501377 | -0.9337329 | -1.875719 |
| -1.042831 | 3.6399040 | 1.000932 |
| -1.802295 | -3.0531474 | -1.936155 |
| -2.492281 | -3.6871996 | -3.052505 |
| -1.707543 | 0.0360403 | 3.032020 |
Questions:
Can you find a way to visualise all the data sets in a single graph
Do you notice any similarities or differences?
Data from a study on average SAT scores are located on the mosaic package in R. The data set is titled SAT. Using the command head(SAT) you can see the first \(6\) rows of the data set
state expend ratio salary frac verbal math sat
1 Alabama 4.405 17.2 31.144 8 491 538 1029
2 Alaska 8.963 17.6 47.951 47 445 489 934
3 Arizona 4.778 19.3 32.175 27 448 496 944
4 Arkansas 4.459 17.1 28.934 6 482 523 1005
5 California 4.992 24.0 41.078 45 417 485 902
6 Colorado 5.443 18.4 34.571 29 462 518 980
We will focus only on the columns salary, frac, and sat which denote the estimated average annual salary of teachers, the percentage of all eligible students taking the SAT, and the average total SAT score. I would like you to run a simple linear regression model where the response variable is the SAT score and the only covariate is the salary of the teachers; and then run another model that also includes the percentage of all eligible students taking the SAT as an additional covariate.
Questions:
Do you see anything unusual?
How would you report these findings?
Finally, let’s assume that we want to be able to predict if a patient has a serious disease based on some personal characteristics (e.g. good diet, smoker or not, etc). We can fit a logistic regression model and find out how good are the predictions of the model. If we have named the model logistic.glm, we can type logistic.glm$fitted values to see the fitted values. All these values are between \(0\) and \(1\) and denote the probability of each person having the disease. We can choose a cutoff value such as \(0.4\) so that any person that has a fitted value of at least \(0.4\) is classified as having that disease, and any person with a fitted value lower than \(0.4\) is classified as a person that doesn’t have the disease.
We may also want to optimise different measures. Here are three of them:
Sensitivity: The proportion of patients who have the disease that are correctly identified.
Specificity: The proportion of patients that don’t have the disease that are correctly identified.
Correct classification rate: The proportions of predictions that are correct.
The x-axis on the next plot refers to the value of the cutoff point while the y-axis refers to the value of the measure we want to optimise.
Questions:
Can you explain what you see in the plot?
If your job was to talk to the doctor, explain these findings, and suggest new strategies what would you tell him/her?
Exploratory data analysis This data set is known as the Ancsombe data set and is located in the datasets library. You should install the package with install.packages("datasets"), load it with library(datasets), and finally load the data set with data(anscombe).
We can find out more information about the structure of the data set by using the command str(anscombe), and produce result summaries of each variable, using the command summary(anscombe).
str(anscombe)'data.frame': 11 obs. of 8 variables:
$ x1: num 10 8 13 9 11 14 6 4 12 7 ...
$ x2: num 10 8 13 9 11 14 6 4 12 7 ...
$ x3: num 10 8 13 9 11 14 6 4 12 7 ...
$ x4: num 8 8 8 8 8 8 8 19 8 8 ...
$ y1: num 8.04 6.95 7.58 8.81 8.33 ...
$ y2: num 9.14 8.14 8.74 8.77 9.26 8.1 6.13 3.1 9.13 7.26 ...
$ y3: num 7.46 6.77 12.74 7.11 7.81 ...
$ y4: num 6.58 5.76 7.71 8.84 8.47 7.04 5.25 12.5 5.56 7.91 ...
summary(anscombe) x1 x2 x3 x4 y1 y2
Min. : 4.0 Min. : 4.0 Min. : 4.0 Min. : 8 Min. : 4.260 Min. :3.100
1st Qu.: 6.5 1st Qu.: 6.5 1st Qu.: 6.5 1st Qu.: 8 1st Qu.: 6.315 1st Qu.:6.695
Median : 9.0 Median : 9.0 Median : 9.0 Median : 8 Median : 7.580 Median :8.140
Mean : 9.0 Mean : 9.0 Mean : 9.0 Mean : 9 Mean : 7.501 Mean :7.501
3rd Qu.:11.5 3rd Qu.:11.5 3rd Qu.:11.5 3rd Qu.: 8 3rd Qu.: 8.570 3rd Qu.:8.950
Max. :14.0 Max. :14.0 Max. :14.0 Max. :19 Max. :10.840 Max. :9.260
y3 y4
Min. : 5.39 Min. : 5.250
1st Qu.: 6.25 1st Qu.: 6.170
Median : 7.11 Median : 7.040
Mean : 7.50 Mean : 7.501
3rd Qu.: 7.98 3rd Qu.: 8.190
Max. :12.74 Max. :12.500
We can also see that the results from the individual linear models are similar for each data set.
line1 <- lm(y1 ~ x1, data=anscombe)
line2 <- lm(y2 ~ x2, data=anscombe)
line3 <- lm(y3 ~ x3, data=anscombe)
line4 <- lm(y4 ~ x4, data=anscombe)
line1
Call:
lm(formula = y1 ~ x1, data = anscombe)
Coefficients:
(Intercept) x1
3.0001 0.5001
line2
Call:
lm(formula = y2 ~ x2, data = anscombe)
Coefficients:
(Intercept) x2
3.001 0.500
line3
Call:
lm(formula = y3 ~ x3, data = anscombe)
Coefficients:
(Intercept) x3
3.0025 0.4997
line4
Call:
lm(formula = y4 ~ x4, data = anscombe)
Coefficients:
(Intercept) x4
3.0017 0.4999
We can now plot all 4 data sets and the regression lines for each one.
circle.size = 3
colors = list('red', 'blue', 'green', 'yellow')
#plot1
plot1 <- ggplot(anscombe, aes(x=x1, y=y1)) + geom_point(size=circle.size, pch=21, fill=colors[[1]]) +
geom_abline(intercept=line1$coefficients[1], slope=line1$coefficients[2])+xlim(3,19)+ylim(2,14)
#plot2
plot2 <- ggplot(anscombe, aes(x=x2, y=y2)) + geom_point(size=circle.size, pch=21, fill=colors[[2]]) +
geom_abline(intercept=line2$coefficients[1], slope=line2$coefficients[2])+xlim(3,19)+ylim(2,14)
#plot3
plot3 <- ggplot(anscombe, aes(x=x3, y=y3)) + geom_point(size=circle.size, pch=21, fill=colors[[3]]) +
geom_abline(intercept=line3$coefficients[1], slope=line3$coefficients[2])+xlim(3,19)+ylim(2,14)
#plot4
plot4 <- ggplot(anscombe, aes(x=x4, y=y4)) + geom_point(size=circle.size, pch=21, fill=colors[[4]]) +
geom_abline(intercept=line4$coefficients[1], slope=line4$coefficients[2])+xlim(3,19)+ylim(2,14)
grid.arrange(plot1, plot2, plot3, plot4)We should always be careful when using the histogram funtion in R. The results can look totally different depending on the value of the breaks argument.
par(mfrow=c(1,2))
x <- c(1.03, 1.24, 1.47, 1.52, 1.92, 1.93, 1.94, 1.95, 1.96, 1.97, 1.98,
1.99, 2.72, 2.75, 2.78, 2.81, 2.84, 2.87, 2.9, 2.93, 2.96, 2.99, 3.6,
3.64, 3.66, 3.72, 3.77, 3.88, 3.91, 4.14, 4.54, 4.77, 4.81, 5.62)
hist(x,breaks=seq(0.3,6.7,by=0.8),xlim=c(0,6.7),col="red",freq=FALSE)
hist(x,breaks=0:8,col="blue",freq=FALSE)It is important to go a bit deeper than just the usual boxplot command in R. In this case we can also use the beanplot package in R. This provides more information about the data sets.
library("beanplot")
set.seed(1)
par(mfrow = c(1, 2), mai = c(0.5, 0.5, 0.5, 0.1))
mu <- 2
si <- 0.6
c <- 500
bimodal <- c(rnorm(c/2, -mu, si), rnorm(c/2, mu, si))
uniform <- runif(c, -4, 4)
normal <- rnorm(c, 0, 1.5)
ylim <- c(-7, 7)
boxplot(bimodal, uniform, normal, ylim = ylim, main = "boxplot", names = 1:3)
beanplot(bimodal, uniform, normal, ylim = ylim, main = "beanplot",col = c("#CAB2D6", "#33A02C", "#B2DF8A"), border = "#CAB2D6") par(mfrow=c(1,1))Using the function summary on both models we have
model.1 <- lm(sat ~ salary, data=SAT)
summary(model.1)
Call:
lm(formula = sat ~ salary, data = SAT)
Residuals:
Min 1Q Median 3Q Max
-147.125 -45.354 4.073 42.193 125.279
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1158.859 57.659 20.098 < 2e-16 ***
salary -5.540 1.632 -3.394 0.00139 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 67.89 on 48 degrees of freedom
Multiple R-squared: 0.1935, Adjusted R-squared: 0.1767
F-statistic: 11.52 on 1 and 48 DF, p-value: 0.001391
model.2 <- lm(sat ~ salary+frac, data=SAT)
summary(model.2)
Call:
lm(formula = sat ~ salary + frac, data = SAT)
Residuals:
Min 1Q Median 3Q Max
-78.313 -26.731 3.168 18.951 75.590
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 987.9005 31.8775 30.991 <2e-16 ***
salary 2.1804 1.0291 2.119 0.0394 *
frac -2.7787 0.2285 -12.163 4e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 33.69 on 47 degrees of freedom
Multiple R-squared: 0.8056, Adjusted R-squared: 0.7973
F-statistic: 97.36 on 2 and 47 DF, p-value: < 2.2e-16
So on the first model the salary covariate has a negative coefficient while for the second model it has a positive coefficient. In both models the coefficient is significantly different than \(0\).
Maybe a plot will help, but with different colours to represent low (red), medium (blue), and high (green) values for the fraction of the students taking the SAT.
SAT$fracgrp = cut(SAT$frac, breaks=c(0, 22, 49, 81),
labels=c("low", "medium", "high"))
SAT$fraccount = 1 + as.numeric(SAT$fracgrp=="medium") +
2*as.numeric(SAT$fracgrp=="high")
panel.labels = function(x, y, col='black', labels='x',...)
{ panel.text(x,y,labels,col=col,...)}
xyplot(sat ~salary, data=SAT, groups=fracgrp,
cex=0.6, panel=panel.labels, labels=SAT$state,
col=c('red','blue','green')[SAT$fraccount])Within each of the \(3\) different groups, the slope is positive, while overall there is a negative slope. This is due to the result of a strong negative association between sat and frac, as well as a strong positive association between salary and frac. This is known as the Simpson’s paradox and is the result of the absence within a model of a confounding variable (i.e. the frac covariate in this case). For more information on the paradox have a look here.
library(memisc)
suicides <- as.data.set(spss.system.file("2nd_project_lecture02.sav"))
#Complex model
log.suicide <- glm(suicide_risk~age+marital_status+mother_negligence+father_negligence+self_estrangement
+isolation+normlessness+meaninglessness+drug_use+
metal+worshipping+vicarious,family="binomial",data=suicides)
mod <- log.suicide
y <- suicides$suicide_risk
# using a cutoff of cut, calculate sensitivity, specificity, and classification rate
perf = function(cut, mod, y)
{
yhat = (mod$fit>cut)
w = which(y==1)
sensitivity = mean( yhat[w] == 1 )
specificity = mean( yhat[-w] == 0 )
c.rate = mean( y==yhat )
out = t(as.matrix(c(sensitivity, specificity, c.rate)))
colnames(out) = c("sensitivity", "specificity", "c.rate")
return(out)
}
s = seq(.01,.99,length=1000)
OUT = matrix(0,1000,3)
for(i in 1:1000) OUT[i,]=perf(s[i],mod,y)
plot(s,OUT[,1],xlab="Cutoff",ylab="Value",cex.lab=1,cex.axis=1,ylim=c(0,1),type="l",lwd=2,axes=FALSE,col=2)
axis(1,seq(0,1,length=5),seq(0,1,length=5),cex.lab=0.5)
axis(2,seq(0,1,length=5),seq(0,1,length=5),cex.lab=0.5)
#mtext("Cutoff",
# side=1) #Add text to the x-axis
#mtext("Probability",
# side=2)
lines(s,OUT[,2],col="darkgreen",lwd=2)
lines(s,OUT[,3],col=4,lwd=2)
#lines(s,OUT[,4],col="darkred",lwd=2)
box()
#legend(0,.25,col=c(2,"darkgreen",4,"darkred"),lwd=c(2,2,2,2),c("Sensitivity","Specificity","Classification Rate","Distance"))
legend(0.4,.25,col=c(2,"darkgreen",4),lwd=c(2,2,2),c("Sensitivity","Specificity","Classification Rate"))The three different measures are:
Sensitivity: The proportion of 1’s that are correctly identified
Specificity: The proportion of 0’s that are correctly identified
Correct classification rate: The proportions of predictions that are correct
Some comments that we can make based on the graph are:
Looking at when the cutoff value is 0.01 (all people with fitted values above 0.01 are classified as having the disease) we can see that the sensitivity is 100%. This makes sense since we classify almost all people as having the disease so we expect that \(P(\hat{Y_i}=1|Y_i=1)\) would be close to 1. On the other hand, since we specify almost everyone as having the disease, that means that the specificity will be low since the model does not correctly identify all the people who haven’t the disease.
Looking at when the cutoff value is 0.99 (all people with fitted values above 0.99 are classified as having the disease) we can see that the specificity is 100%. This makes sense since we classify almost all people as not having the disease so we expect that \(P(\hat{Y_i}=0|Y_i=0)\) would be close to 1. On the other hand, since we specify almost everyone as not having the disease, that means that the sensitivity will be low since the model does not correctly identify all the people who have the disease.
Deciding on which cutoff value we will choose is really important, but before we do that we have to decide which is the most important measure to optimise. You can see from the graph that different cutoff values optimise different measures. For example, if our model was aiming to evaluate a diagnostic test for a serious disease that has a relatively safe cure, the sensitivity is far more important that the specificity. In another case, if the disease was relatively minor and the treatment was risky, specificity would be more important to control.
It seems that if we want to choose a cutoff value which gives high values on all three different measures; we could choose the value 0.25 since all three probabilities are above 0.75.
Download the data set project_lecture03.sav which is located on Moodle. Install the package memisc and load the data set in R with the command name_of_data_set <- as.data.set(spss.system.file("project_lecture03.sav")).
The data set is comprised of \(121\) rows and \(15\) columns. Each row refers to a different person while the columns refer to:
age denotes the age of a person.
age_group denotes in which age group the person belongs to. The value \(1\) refers to \(14-16\) years old while \(2\) refers to \(16-19\) years old.
drug_use refers to how many times the person used drug substances during the last year.
father_negligence and mother_negligence are scales in which a high score is associated with a perception of cold and rejecting family relationships.
gender shows the gender of the person.
isolation corresponds to a subjective perception of lack of support.
marital_status shows the marital status of the person’s parents. The two possible values are together or separated/divorced.
meaninglessness describes youths that may doubt the relevance of school in attaining future employment.
metal describes how much the person listens to metal music.
normlessness is defined as a belief that socially dissaproved behaviours may be used to achieve certain goals.
self_estrangement refers to persons who have a negative self-perception and who are overwhelmed by difficulties they consider out of control.
suicide_risk shows if a person is considered to be a suicide risk. The value \(0\) refers to persons who are not considered to be suicide risks while the value \(1\) refers to persons who are considered to be suicide risks.
vicarious music refers to when somebody listens to music when angry and/or bringing out aggressiveness by listening to music.
worshipping refers to behavioural manifestation of worshipping (e.g. hanging posters, acquiring information about singers, hanging out with other fans).
Note: For all the previous variables that are measured in a scale (i.e. father_negligence, mother_negligence, isolation, meaninglessness, metal, normlessness, self_estrangement, vicarious, worshipping); high values of the scale correspond to a behaviour/feeling that happens more, while low values correspond to a behaviour/feeling that is less present.
Things to do:
Things to be careful of or/and think about:
Choose the appropriate model for the response variable you decide to use.
What is the question that you try to answer using this model?
Why did you choose this model instead of another?
Is the model valid?
Do you have any concerns about any of the variables or even the data set as a whole?
Can you think of any additional variables that would make the data set even more interesting?
Be ready to present your results.
library(ggplot2) # visualisation
library(ggthemes) # visualisation
library(scales) # visualisation
library(dplyr) # data manipulation
library(mice) # imputation
library(bindrcpp)
library(sjPlot)
library(sjmisc)
#Instead of train.csv you should use Titanic.csv (this is how the data set is titled on Moodle)
train <- read.csv('train.csv', stringsAsFactors = F)
#test <- read.csv('test.csv', stringsAsFactors = F)
full <- bind_rows(train)#, test) # bind training & test data
colnames(full) <- c("passenger.id","survived","passenger.class","name","gender","age","siblings.spouses.aboard","parents.children.aboard","ticket.number","fare","cabin","port.of.embarkation")
full <- as.tbl(full)
# check data
#str(full)library(knitr)
kable(full[c(1:7),])| passenger.id | survived | passenger.class | name | gender | age | siblings.spouses.aboard | parents.children.aboard | ticket.number | fare | cabin | port.of.embarkation |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 | S | |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 | S | |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 | S | |
| 6 | 0 | 3 | Moran, Mr. James | male | NA | 0 | 0 | 330877 | 8.4583 | Q | |
| 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54 | 0 | 0 | 17463 | 51.8625 | E46 | S |
str(full)Classes 'tbl_df', 'tbl' and 'data.frame': 891 obs. of 12 variables:
$ passenger.id : int 1 2 3 4 5 6 7 8 9 10 ...
$ survived : int 0 1 1 1 0 0 0 0 1 1 ...
$ passenger.class : int 3 1 3 1 3 3 1 3 3 2 ...
$ name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
$ gender : chr "male" "female" "female" "female" ...
$ age : num 22 38 26 35 35 NA 54 2 27 14 ...
$ siblings.spouses.aboard: int 1 1 0 1 0 0 0 3 0 1 ...
$ parents.children.aboard: int 0 0 0 0 0 0 0 1 2 0 ...
$ ticket.number : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
$ fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ cabin : chr "" "C85" "" "C123" ...
$ port.of.embarkation : chr "S" "C" "S" "S" ...
removecols <- c("passenger.id","name","ticket.number","fare","cabin","port.of.embarkation")
remove <- which(names(full) %in% removecols)
full <- full[-remove]
full$age[is.na(full$age)] <- mean(full$age,na.rm=T)
full$survived <- as.factor(full$survived)
full$passenger.class <- as.factor(full$passenger.class)
summary(full) survived passenger.class gender age siblings.spouses.aboard
0:549 1:216 Length:891 Min. : 0.42 Min. :0.000
1:342 2:184 Class :character 1st Qu.:22.00 1st Qu.:0.000
3:491 Mode :character Median :29.70 Median :0.000
Mean :29.70 Mean :0.523
3rd Qu.:35.00 3rd Qu.:1.000
Max. :80.00 Max. :8.000
parents.children.aboard
Min. :0.0000
1st Qu.:0.0000
Median :0.0000
Mean :0.3816
3rd Qu.:0.0000
Max. :6.0000
sjp.xtab(full$survived,full$gender,geom.colors = c("pink","blue"),show.values = FALSE,show.total = FALSE)sjp.xtab(full$survived,full$passenger.class,geom.colors = c("red","purple","green"),show.values = FALSE,show.total = FALSE)model.bin <- glm(survived~., family=binomial(link="logit"), data=full)
summary(model.bin)
Call:
glm(formula = survived ~ ., family = binomial(link = "logit"),
data = full)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.6564 -0.6111 -0.4229 0.6130 2.4317
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.07680 0.40613 10.038 < 2e-16 ***
passenger.class2 -1.19117 0.26191 -4.548 5.41e-06 ***
passenger.class3 -2.34887 0.24278 -9.675 < 2e-16 ***
gendermale -2.76864 0.19878 -13.928 < 2e-16 ***
age -0.04016 0.00782 -5.136 2.80e-07 ***
siblings.spouses.aboard -0.33480 0.10873 -3.079 0.00207 **
parents.children.aboard -0.08166 0.11467 -0.712 0.47640
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1186.66 on 890 degrees of freedom
Residual deviance: 790.33 on 884 degrees of freedom
AIC: 804.33
Number of Fisher Scoring iterations: 5
library(car)
vif(model.bin) GVIF Df GVIF^(1/(2*Df))
passenger.class 1.311086 2 1.070059
gender 1.195490 1 1.093385
age 1.296394 1 1.138593
siblings.spouses.aboard 1.240057 1 1.113578
parents.children.aboard 1.213541 1 1.101609
durbinWatsonTest(model.bin) lag Autocorrelation D-W Statistic p-value
1 0.03249683 1.934519 0.298
Alternative hypothesis: rho != 0
sjp.glm(model.bin)sjp.glm(model.bin,type="pred",vars=c("age","passenger.class"),facet.grid = FALSE,geom.colors = c("red","purple","green"))library(ROCR)
full$Prid <-predict(model.bin,full,type="response")
score <- prediction(full$Prid,full$survived)
perf <- performance(score,"tpr","fpr")
auc <- performance(score,"auc")
plot(performance(score,"tpr","fpr"),col="green",main = paste("Area under the curve:", round(auc@y.values[[1]] ,3)))library(ROCR)
full$Prid <-predict(model.bin,full,type="response")
score <- prediction(full$Prid,full$survived)
perf <- performance(score,"tpr","fpr")cutoffs <- data.frame(cut=perf@alpha.values[[1]], fpr=perf@x.values[[1]],
tpr=perf@y.values[[1]])
cutoffs <- cutoffs[order(cutoffs$tpr, decreasing=TRUE),]
head(subset(cutoffs, fpr < 0.2)) cut fpr tpr
256 0.3876013 0.1948998 0.7777778
257 0.3840028 0.1967213 0.7777778
258 0.3788025 0.1985428 0.7777778
255 0.3965224 0.1948998 0.7748538
254 0.4061724 0.1948998 0.7690058
253 0.4064211 0.1948998 0.7660819
train <- read.csv('train.csv', stringsAsFactors = F)
full <- bind_rows(train)#, test) # bind training & test data
colnames(full) <- c("passenger.id","survived","passenger.class","name","gender","age","siblings.spouses.aboard","parents.children.aboard","ticket.number","fare","cabin","port.of.embarkation")
full <- as.tbl(full)
removecols <- c("passenger.id","name","ticket.number","fare","cabin","port.of.embarkation")
remove <- which(names(full) %in% removecols)
full <- full[-remove]
full$age[is.na(full$age)] <- mean(full$age,na.rm=T)
full$survived <- as.factor(full$survived)
full$passenger.class <- as.factor(full$passenger.class)model.bin <- glm(survived~., family=binomial(link="logit"), data=full)
#Backwards selection
backward <- step(model.bin)Start: AIC=804.33
survived ~ passenger.class + gender + age + siblings.spouses.aboard +
parents.children.aboard
Df Deviance AIC
- parents.children.aboard 1 790.84 802.84
<none> 790.33 804.33
- siblings.spouses.aboard 1 801.51 813.51
- age 1 818.94 830.94
- passenger.class 2 901.34 911.34
- gender 1 1037.28 1049.28
Step: AIC=802.84
survived ~ passenger.class + gender + age + siblings.spouses.aboard
Df Deviance AIC
<none> 790.84 802.84
- siblings.spouses.aboard 1 805.29 815.29
- age 1 819.15 829.15
- passenger.class 2 901.80 909.80
- gender 1 1043.77 1053.77
backward
Call: glm(formula = survived ~ passenger.class + gender + age + siblings.spouses.aboard,
family = binomial(link = "logit"), data = full)
Coefficients:
(Intercept) passenger.class2 passenger.class3 gendermale
4.02742 -1.18983 -2.34780 -2.74022
age siblings.spouses.aboard
-0.03985 -0.35826
Degrees of Freedom: 890 Total (i.e. Null); 885 Residual
Null Deviance: 1187
Residual Deviance: 790.8 AIC: 802.8
formula(backward)survived ~ passenger.class + gender + age + siblings.spouses.aboard
#Forward selection
model.empty <- glm(survived~1, family=binomial(link="logit"), data=full)
forward <- step(model.empty,
scope=list(lower=formula(model.empty),upper=formula(model.bin)), direction="forward")Start: AIC=1188.66
survived ~ 1
Df Deviance AIC
+ gender 1 917.8 921.8
+ passenger.class 2 1083.1 1089.1
+ parents.children.aboard 1 1180.8 1184.8
+ age 1 1182.3 1186.3
<none> 1186.7 1188.7
+ siblings.spouses.aboard 1 1185.5 1189.5
Step: AIC=921.8
survived ~ gender
Df Deviance AIC
+ passenger.class 2 826.89 834.89
+ siblings.spouses.aboard 1 904.69 910.69
+ parents.children.aboard 1 914.43 920.43
<none> 917.80 921.80
+ age 1 917.06 923.06
Step: AIC=834.89
survived ~ gender + passenger.class
Df Deviance AIC
+ age 1 805.29 815.29
+ siblings.spouses.aboard 1 819.15 829.15
<none> 826.89 834.89
+ parents.children.aboard 1 825.06 835.06
Step: AIC=815.29
survived ~ gender + passenger.class + age
Df Deviance AIC
+ siblings.spouses.aboard 1 790.84 802.84
+ parents.children.aboard 1 801.51 813.51
<none> 805.29 815.29
Step: AIC=802.84
survived ~ gender + passenger.class + age + siblings.spouses.aboard
Df Deviance AIC
<none> 790.84 802.84
+ parents.children.aboard 1 790.33 804.33
forward
Call: glm(formula = survived ~ gender + passenger.class + age + siblings.spouses.aboard,
family = binomial(link = "logit"), data = full)
Coefficients:
(Intercept) gendermale passenger.class2 passenger.class3
4.02742 -2.74022 -1.18983 -2.34780
age siblings.spouses.aboard
-0.03985 -0.35826
Degrees of Freedom: 890 Total (i.e. Null); 885 Residual
Null Deviance: 1187
Residual Deviance: 790.8 AIC: 802.8
formula(forward)survived ~ gender + passenger.class + age + siblings.spouses.aboard
bothways <-step(model.empty, list(lower=formula(model.empty),upper=formula(model.bin)),
direction="both")Start: AIC=1188.66
survived ~ 1
Df Deviance AIC
+ gender 1 917.8 921.8
+ passenger.class 2 1083.1 1089.1
+ parents.children.aboard 1 1180.8 1184.8
+ age 1 1182.3 1186.3
<none> 1186.7 1188.7
+ siblings.spouses.aboard 1 1185.5 1189.5
Step: AIC=921.8
survived ~ gender
Df Deviance AIC
+ passenger.class 2 826.89 834.89
+ siblings.spouses.aboard 1 904.69 910.69
+ parents.children.aboard 1 914.43 920.43
<none> 917.80 921.80
+ age 1 917.06 923.06
- gender 1 1186.66 1188.66
Step: AIC=834.89
survived ~ gender + passenger.class
Df Deviance AIC
+ age 1 805.29 815.29
+ siblings.spouses.aboard 1 819.15 829.15
<none> 826.89 834.89
+ parents.children.aboard 1 825.06 835.06
- passenger.class 2 917.80 921.80
- gender 1 1083.11 1089.11
Step: AIC=815.29
survived ~ gender + passenger.class + age
Df Deviance AIC
+ siblings.spouses.aboard 1 790.84 802.84
+ parents.children.aboard 1 801.51 813.51
<none> 805.29 815.29
- age 1 826.89 834.89
- passenger.class 2 917.06 923.06
- gender 1 1046.48 1054.48
Step: AIC=802.84
survived ~ gender + passenger.class + age + siblings.spouses.aboard
Df Deviance AIC
<none> 790.84 802.84
+ parents.children.aboard 1 790.33 804.33
- siblings.spouses.aboard 1 805.29 815.29
- age 1 819.15 829.15
- passenger.class 2 901.80 909.80
- gender 1 1043.77 1053.77
bothways
Call: glm(formula = survived ~ gender + passenger.class + age + siblings.spouses.aboard,
family = binomial(link = "logit"), data = full)
Coefficients:
(Intercept) gendermale passenger.class2 passenger.class3
4.02742 -2.74022 -1.18983 -2.34780
age siblings.spouses.aboard
-0.03985 -0.35826
Degrees of Freedom: 890 Total (i.e. Null); 885 Residual
Null Deviance: 1187
Residual Deviance: 790.8 AIC: 802.8
formula(bothways)survived ~ gender + passenger.class + age + siblings.spouses.aboard
library(glmulti)
glmulti.logistic <-
glmulti(survived~., data = full,
level = 2, # Use level 2 to include interactions
method = "h", # Use "h" for exhaustive approach
crit = "bic", # BIC as the criterion
confsetsize = 5, # Keep 5 best models
plotty = F, report = F, # Without a graph and no reports
fitfunction = "glm", # glm function
family = binomial) # binomial family for logistic regression
summary(glmulti.logistic@objects[[1]])
Call:
fitfunc(formula = as.formula(x), family = ..1, data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.0484 -0.6186 -0.4897 0.3663 2.5423
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 5.240113 0.690632 7.587 3.26e-14 ***
passenger.class2 -1.197178 0.733635 -1.632 0.102713
passenger.class3 -3.870291 0.630624 -6.137 8.40e-10 ***
gendermale -3.996719 0.623808 -6.407 1.48e-10 ***
age -0.045827 0.008544 -5.364 8.15e-08 ***
parents.children.aboard:siblings.spouses.aboard -0.255927 0.072199 -3.545 0.000393 ***
passenger.class2:gendermale -0.396289 0.805568 -0.492 0.622764
passenger.class3:gendermale 2.078590 0.666602 3.118 0.001820 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1186.66 on 890 degrees of freedom
Residual deviance: 755.89 on 883 degrees of freedom
AIC: 771.89
Number of Fisher Scoring iterations: 6