In this example, we are going to examine a dataset of part of the passenger list for the Titanic. Our aim is to use the data to fit a model which can predict each passenger’s survival/nonsurvival, given their age, sex and class of ticket.
When faced with a categorical response variable, that is, when the outcome is yes/no, success/failure, etc., we need to use logistic regression. This gives us a probability for each passenger, which we can interpret as yes (> 0.5) or no (<= 0.5).
In this kind of situation, linear regression could produce probabilities outside [0, 1] and would perform poorly, as the linearity assumption does not hold. Instead, we assume a linear relationship between the independent variable(s) and the log-odds of the dependent variable.
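As a quick illustration of this mapping (not part of the dataset analysis), the inverse of the log-odds transformation is the logistic function, available in R as ‘plogis’; it squashes any log-odds value into a probability strictly between 0 and 1:
plogis(c(-2, 0, 2))
## [1] 0.1192029 0.5000000 0.8807971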
Load the ‘titanic’ dataset and assign it to the variable ‘dati’:
dati <- read.delim("Titanic.txt", header = TRUE, sep = ",")
We examine the structure of the dataset:
str(dati)
## 'data.frame': 1046 obs. of 4 variables:
## $ Sopravvissuti: chr "yes" "yes" "no" "no" ...
## $ Genere : chr "female" "male" "female" "male" ...
## $ Eta : num 29 0.917 2 30 25 ...
## $ Classe : chr "1st" "1st" "1st" "1st" ...
We have 1046 observations of 4 variables: sopravvissuti (survived), a yes/no category; genere (sex); età (age); and classe (class of ticket). The first, second and fourth variables are categorical and the third is numerical. We use the ‘factor’ function to change the categorical variables into factors. Factors store the data as integer codes associated with levels, which is memory efficient and gives the levels a defined order, which can be important for analysis and plotting.
dati$Sopravvissuti <- factor(dati$Sopravvissuti)
dati$Genere <- factor(dati$Genere)
dati$Classe <- factor(dati$Classe)
Now we re-examine the dataset:
str(dati)
## 'data.frame': 1046 obs. of 4 variables:
## $ Sopravvissuti: Factor w/ 2 levels "no","yes": 2 2 1 1 1 2 2 1 2 1 ...
## $ Genere : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
## $ Eta : num 29 0.917 2 30 25 ...
## $ Classe : Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...
We can see the categorical variables (survival, sex and class) are now factors with two, two and three levels respectively.
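To peek at the integer codes a factor stores internally (these match the codes shown by ‘str’ above):
head(as.integer(dati$Genere))
## [1] 1 2 1 2 1 2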
To see an overview of the data, we can use the ‘summary’ function:
summary(dati)
## Sopravvissuti Genere Eta Classe
## no :619 female:388 Min. : 0.1667 1st:284
## yes:427 male :658 1st Qu.:21.0000 2nd:261
## Median :28.0000 3rd:501
## Mean :29.8811
## 3rd Qu.:39.0000
## Max. :80.0000
For the categorical variables, we can see the number of datapoints in each category. The passenger ages go from two months old to 80 years old, with the main concentration of ages between 20 and 40. We can plot the histogram using the ‘hist’ function to get a better feel for this variable:
hist(dati$Eta, col = rainbow(16))
In a ‘typical’ population we would expect to see ages spread out according to a normal distribution; in our sample, however, the distribution is positively skewed.
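A quick numerical check consistent with positive skew is that the mean age (29.88, from the summary above) exceeds the median (28):
mean(dati$Eta) > median(dati$Eta)
## [1] TRUE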
Before performing our logistic regression, we should check the contrast coding for the factor variables:
contrasts(dati$Sopravvissuti)
## yes
## no 0
## yes 1
contrasts(dati$Genere)
## male
## female 0
## male 1
contrasts(dati$Classe)
## 2nd 3rd
## 1st 0 0
## 2nd 1 0
## 3rd 0 1
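These are R’s default treatment contrasts, with ‘no’, ‘female’ and ‘1st’ as the baseline levels. If a different baseline were wanted, the ‘relevel’ function changes the reference level; purely as an illustration (we keep the defaults for the analysis below):
contrasts(relevel(dati$Classe, ref = "3rd"))
##     1st 2nd
## 3rd   0   0
## 1st   1   0
## 2nd   0   1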
We use the ‘glm’ function and indicate that our response variable, Sopravvissuti (survival), should be modelled on all of the other variables. We set the argument ‘family’ to binomial as we have a binary yes/no response.
glm.titanic <- glm(Sopravvissuti ~ ., data = dati, family = binomial)
We use the function ‘summary’ to see the output of the logistic model:
summary(glm.titanic)
##
## Call:
## glm(formula = Sopravvissuti ~ ., family = binomial, data = dati)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.522074 0.326702 10.781 < 2e-16 ***
## Generemale -2.497845 0.166037 -15.044 < 2e-16 ***
## Eta -0.034393 0.006331 -5.433 5.56e-08 ***
## Classe2nd -1.280570 0.225538 -5.678 1.36e-08 ***
## Classe3rd -2.289661 0.225802 -10.140 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1414.62 on 1045 degrees of freedom
## Residual deviance: 982.45 on 1041 degrees of freedom
## AIC: 992.45
##
## Number of Fisher Scoring iterations: 4
From the last column of the coefficients table, ‘Pr(>|z|)’, we can see that all of the variables have very low p-values, confirming their statistical significance. The estimates in the first column are on the log-odds scale: the more negative the estimate, the lower the chance of survival (bearing in mind that 0 indicates non-survival and 1 survival). The age coefficient applies per year of age, which is why its estimate is much smaller in magnitude than the others. The most influential predictors leading to non-survival were: being male, having a third-class ticket, then having a second-class ticket.
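Because the coefficients are log-odds, exponentiating them gives odds ratios, which can be easier to interpret; a quick sketch using the fitted model above:
# odds ratios: e.g. exp(-2.50) is roughly 0.08 for Generemale, so being male
# multiplies the odds of survival by about 0.08, holding age and class fixed
exp(coef(glm.titanic))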
To evaluate the efficacy of our logistic model, we can use the ‘predict’ function, specifying the type ‘response’ and assigning the result to ‘predictions’.
predictions <- predict(glm.titanic, type = "response")
head(predictions)
## Allen, Miss. Elisabeth Walton Allison, Master. Hudson Trevor
## 0.9258533 0.7296211
## Allison, Miss. Helen Loraine Allison, Mr. Hudson Joshua Crei
## 0.9693290 0.4981081
## Allison, Mrs. Hudson J C (Bessi Anderson, Mr. Harry
## 0.9347616 0.3482715
The ‘predictions’ vector now contains 1046 probabilities, one for each passenger, estimating their chance of survival based on the logistic model. We consider a probability greater than 0.5 as ‘survived’ and less than or equal to 0.5 as ‘did not survive’. We can examine the histogram of the model’s output using the ‘hist’ function:
library(wesanderson) # provides the wes_palette colour function
hist(predictions, main = "Histogram of Predictions", sub = "Non-survival Survival", col = wes_palette(n = 5, name = "Moonrise3"))
abline(v=0.5, col = "darkred")
As expected, the majority of predictions are less than 0.5, as we know that 619/1046 = 59% of the passengers on our list did not survive. In order to analyse whether our predictions match the actual outcomes, we need a vector of 1046 “yes”/“no” values, one per passenger according to the model, which we can then compare with the actual outcomes in the ‘Sopravvissuti’ column of the dataset.
One way is to first construct a variable ‘sum.pred’ using the ‘rep’ function to form a vector of 1046 elements, each containing “no”. Then we assign the value “yes” to every element of ‘sum.pred’ whose corresponding value in ‘predictions’ is greater than 0.5.
sum.pred <- rep("no", nrow(dati))
sum(sum.pred == "no")
## [1] 1046
sum.pred[predictions > 0.5] <- "yes"
sum(sum.pred == "no")
## [1] 646
Now the number of “no”s has dropped from 1046 to 646. We can construct a confusion matrix to see how well the model classified the data, which compares the ‘survived’ attribute with our model’s predictions, passenger by passenger:
confusion.matrix <- table(dati$Sopravvissuti, sum.pred)
confusion.matrix
## sum.pred
## no yes
## no 520 99
## yes 126 301
We can find the passengers which have been classified correctly by the model along the leading diagonal (top left and bottom right):
cat("Percentage passengers classified correctly (original model): ",
round(100*sum(diag(confusion.matrix))/sum(confusion.matrix),0),
"\nPercentage passengers classified incorrectly (original model):",
round(100*(confusion.matrix[1,2]+confusion.matrix[2,1])/sum(confusion.matrix),0),
"\n")
## Percentage passengers classified correctly (original model): 78
## Percentage passengers classified incorrectly (original model): 22
This model classifies 78% of the passengers correctly. Bear in mind, however, that this accuracy is measured on the same data used to fit the model, so it may be optimistic.
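Beyond overall accuracy, the confusion matrix also yields per-class rates; a brief sketch (the terms ‘sensitivity’ and ‘specificity’ are standard names, not part of the original analysis):
sensitivity <- confusion.matrix["yes", "yes"] / sum(confusion.matrix["yes", ]) # survivors correctly identified
specificity <- confusion.matrix["no", "no"] / sum(confusion.matrix["no", ])   # non-survivors correctly identified
cat("Sensitivity:", round(sensitivity, 2), "\nSpecificity:", round(specificity, 2), "\n")
## Sensitivity: 0.7 
## Specificity: 0.84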
As an alternative method, since this dataset has a large number of observations, it is feasible to split the data into a training set containing 70% of the datapoints and a testing set containing the remaining 30%. We use the ‘nrow’ function to find how many rows are in the dataset, and we calculate 70% of the total.
nrow(dati)*0.70
## [1] 732.2
As the result is a decimal, we use ‘ceiling’ to round up to the next integer and assign the result to ‘percent70’.
percent70 <- ceiling(nrow(dati)*0.70)
We are going to fit the new model using just over 70% of the data and test it using the remaining 30%. The necessary steps are:
- Form a vector of the numbers from 1 to 1046 using the ‘seq_len’ function. Pick 733 of these numbers at random using the ‘sample’ function and assign the result to ‘train_nos’.
- Create a logical vector of 1046 values that will be used to split the dataset into training and testing sets. We do this by testing which numbers from 1 to 1046 appear in ‘train_nos’: a match returns TRUE and a non-match returns FALSE, producing a logical vector of length 1046 which we assign to the variable ‘train’.
- Check the numbers of TRUE and FALSE values in ‘train’, using the function ‘sum’ to count the TRUE values.
train_nos <- sample(seq_len(nrow(dati)), percent70, replace = FALSE, prob = NULL)
train <- seq_len(nrow(dati)) %in% train_nos
cat("Training set size:", sum(train), "\nTesting set size:", sum(!train), "\n")
## Training set size: 733
## Testing set size: 313
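Note that ‘sample’ draws a different random subset on each run. For a reproducible split, one could fix the random seed before sampling (the seed value here is arbitrary and was not set in the original run):
set.seed(123) # arbitrary seed; makes the 70:30 split reproducible across runs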
We create a new model, whose output is assigned to ‘glm.train’, using the ‘glm’ function. The arguments are the same as before, except this time we fit only the subset of datapoints selected by our logical vector ‘train’.
glm.train <- glm(Sopravvissuti ~ ., data = dati, family = binomial, subset = train)
summary(glm.train)
##
## Call:
## glm(formula = Sopravvissuti ~ ., family = binomial, data = dati,
## subset = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.579865 0.390319 9.172 < 2e-16 ***
## Generemale -2.467504 0.199677 -12.358 < 2e-16 ***
## Eta -0.035205 0.007446 -4.728 2.27e-06 ***
## Classe2nd -1.278419 0.268730 -4.757 1.96e-06 ***
## Classe3rd -2.356405 0.266542 -8.841 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 989.63 on 732 degrees of freedom
## Residual deviance: 692.46 on 728 degrees of freedom
## AIC: 702.46
##
## Number of Fisher Scoring iterations: 4
Using the ‘summary’ function, we can see that the parameters differ slightly from those found with the original model, although the most influential predictors leading to non-survival are still: being male, having a third-class ticket, then having a second-class ticket.
We use the ‘predict’ function to evaluate the new model. The function generates probabilities for the data rows which are not in the training set (and are therefore in the testing set). The output is assigned to the variable ‘glm.test’.
glm.test <- predict(glm.train, dati[!train,], type = "response")
Next, we convert this vector of probabilities into a vector of “yes” and “no” values. This allows us to directly compare it with the corresponding labels in the ‘survived/sopravvissuti’ category from the testing set in the original data.
sum.pred.test <- ifelse(glm.test > 0.5, "yes", "no")
confusion.matrix2 <- table(dati$Sopravvissuti[!train], sum.pred.test)
confusion.matrix2
## sum.pred.test
## no yes
## no 149 34
## yes 33 97
We can now calculate the percentages of passengers classified correctly and incorrectly, and compare those with the classification percentages from the original model.
cat("Percentage passengers classified correctly (split model): ",
round(100*sum(diag(confusion.matrix2))/sum(confusion.matrix2),0),
"\nPercentage passengers classified incorrectly (split model):",
round(100*(confusion.matrix2[1,2]+confusion.matrix2[2,1])/sum(confusion.matrix2),0),
"\n")
## Percentage passengers classified correctly (split model): 79
## Percentage passengers classified incorrectly (split model): 21
Running the code for the second model may produce varying outcomes: sometimes it yields better results than the original model, which used the entire dataset for training, and sometimes worse. For comparison:
cat("Percentage passengers classified correctly (original model): ",
round(100*sum(diag(confusion.matrix))/sum(confusion.matrix),0),
"\nPercentage passengers classified incorrectly (original model):",
round(100*(confusion.matrix[1,2]+confusion.matrix[2,1])/sum(confusion.matrix),0),
"\n")
## Percentage passengers classified correctly (original model): 78
## Percentage passengers classified incorrectly (original model): 22
However, we need to be cautious about overfitting, which occurs when a model learns not only the underlying patterns in the training data, but also the noise or random variations. This leads to the model performing exceptionally well on the training data but struggling to generalize to new, unseen data.
Overfitting would be evidenced by a gap of roughly 5-10 percentage points between the training and testing accuracies. To guard against it, we can employ additional techniques, such as cross-validation or regularisation, or incorporate more data to ensure that the training and testing errors remain similar.
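One such technique is k-fold cross-validation. A minimal sketch using ‘cv.glm’ from the ‘boot’ package (the cost function and the choice of K = 10 are our own assumptions, not part of the original analysis, and the result will vary slightly from run to run):
library(boot) # provides cv.glm for k-fold cross-validation
# cost function: misclassification rate at the 0.5 threshold;
# cv.glm passes the 0/1 response and the predicted probabilities
cost <- function(obs, pred) mean((pred > 0.5) != obs)
cv.err <- cv.glm(dati, glm.titanic, cost = cost, K = 10)
cv.err$delta[1] # cross-validated misclassification rate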
In this case, the difference in percentage accuracy was:
difference <- round(100*sum(diag(confusion.matrix))/sum(confusion.matrix)
- 100*sum(diag(confusion.matrix2))/sum(confusion.matrix2),1)
cat("Difference in percentage error between models:",
difference,
"\n")
## Difference in percentage error between models: -0.1
cat("Overfitting evident?", abs(difference) >= 5)
## Overfitting evident? FALSE
Our models achieved around 78% accuracy in predicting each passenger’s survival or non-survival based on their age, sex and class of ticket. Initially, we used a logistic regression model applied to all 1046 datapoints. For comparison, we then employed a 70:30 training-to-testing data split, which yielded a similar level of accuracy, indicating that overfitting was not a significant concern in the original model. Rerunning the second model may, however, produce variable results due to the randomness of the training sample it uses.
Logistic regression assumes a linear relationship between the independent variables and the log-odds of the dependent variable. To ensure the validity of this assumption, further examination of these relationships is necessary. Additionally, while the survival and non-survival groups were reasonably balanced (59% vs. 41%), imbalanced groups in other scenarios could negatively impact the reliability of the model. Exploring adjustments to the threshold for predicting “yes” (currently set at 0.5) could further refine the model’s precision.
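As a brief sketch of such threshold tuning (the grid of cut-offs below is our own illustrative choice), one could compare overall accuracy on the full dataset at a few candidate cut-offs:
for (t in c(0.4, 0.5, 0.6)) {
  pred.t <- ifelse(predictions > t, "yes", "no")
  cat("Threshold", t, "- accuracy:", round(100 * mean(pred.t == dati$Sopravvissuti)), "\n")
}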
A well-performing logistic regression model requires a sufficient number of datapoints; a common rule of thumb is at least 10-15 events per predictor variable. When we examine the distribution of datapoints in our sample, some small subsets emerge as potential concerns:
table(dati$Sopravvissuti, dati$Genere)
##
## female male
## no 96 523
## yes 292 135
table(dati$Sopravvissuti, dati$Classe)
##
## 1st 2nd 3rd
## no 103 146 370
## yes 181 115 131
There were only 103 passengers in 1st class who did not survive, and only 96 females overall who did not survive. Filtering for females who did not survive and organising them by class, we have:
table(dati$Classe, (dati$Sopravvissuti == "no" & dati$Genere == "female"))
##
## FALSE TRUE
## 1st 279 5
## 2nd 250 11
## 3rd 421 80
There were only 5 females in first class who did not survive, so this small subset may impact the logistic regression model’s reliability, particularly in the split model. A longer passenger list, if available, might give us more datapoints in each subset of each class, enhancing the model’s performance.
The current model demonstrates a reasonable level of accuracy, although addressing the limitations in data size and structure could improve its predictive power and ideally push its accuracy above the 80-90% level.