Introduction

To begin with our practical tasks we load the Brexit data set given to us, containing socio-demographic statistics about electoral wards and their response towards the Brexit referendum, w.r.t Leaving the EU.

#loading the data and inspecting first few rows
brexit_data <- read.csv("brexit.csv",header=TRUE,stringsAsFactors = T)
head(brexit_data)

##        abc1   notBornUK medianIncome medianAge withHigherEd voteBrexit
## 1 0.1336406 0.012605042    0.2525773 0.5000000   0.08552632       TRUE
## 2 0.1290323 0.113445378    0.1082474 0.2727273   0.11184210       TRUE
## 3 0.1612903 0.004201681    0.1288660 0.6363636   0.11842105       TRUE
## 4 0.3225806 0.046218487    0.2268041 0.4545455   0.21710526       TRUE
## 5 0.3456221 0.058823529    0.2010309 0.5454545   0.24342105       TRUE
## 6 0.2211982 0.012605042    0.2319588 0.4545455   0.09210526       TRUE

summary(brexit_data$voteBrexit)

##    Mode   FALSE    TRUE 
## logical     107     237

Question 1 :

Use K-means algorithm (kmeans function in R) to cluster the inputs. Validate the optimal number of clusters. Justify your choice. Visualise the clusters and analyse the validated k-means model. Are the inferred clusters associated with or predictive of the outputs?

As there are 2 levels based on output column voteBrexit, initially we cluster the input data in 2 clusters to determine if these clusters can represent the output trend.

# clustering the wards based on inputs
km <- kmeans(brexit_data[1:5], 2)
km

## K-means clustering with 2 clusters of sizes 71, 273
## 
## Cluster means:
##        abc1  notBornUK medianIncome medianAge withHigherEd
## 1 0.6618420 0.42289028    0.5168433 0.3072983    0.5660675
## 2 0.3715501 0.09248315    0.2339224 0.5631036    0.2512531
## 
## Clustering vector:
##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 1 2 2 2 1 1 1
##  [38] 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 2 2 2 2 1 2 2 2
## [112] 2 2 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [149] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 1 2 2 2 2 2 2
## [186] 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 2 2 2 2 1 2 2 2 2 1 2 2 2 2 2
## [223] 2 2 2 2 1 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [260] 1 1 2 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 1
## [297] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [334] 2 2 2 1 2 2 2 2 2 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 12.38362 22.99400
##  (between_SS / total_SS =  41.1 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

In order to validate the optimal number of clusters for our data, the NBClust() function is used. It gives us the comparison of clustering the dataset based on input columns using Hubert index and D-index values. In the graphs it can be seen that there is a significant elbow point after increasing number of clusters from 2 to 4, and that the Hubert value is increasing from there on. Thus it is safe to conclude that 2 is optimal number of cluster.

#Installing NBClust library and dependencies to validate optimal number of clusters
if(!require(NbClust)) install.packages("NbClust",repos = "http://cran.us.r-project.org")

## Loading required package: NbClust

library("NbClust")
cl <- NbClust(brexit_data[1:5], min.nc=2, max.nc=15, method="kmeans")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
##

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 9 proposed 2 as the best number of clusters 
## * 6 proposed 3 as the best number of clusters 
## * 2 proposed 4 as the best number of clusters 
## * 1 proposed 6 as the best number of clusters 
## * 1 proposed 7 as the best number of clusters 
## * 1 proposed 8 as the best number of clusters 
## * 2 proposed 9 as the best number of clusters 
## * 1 proposed 11 as the best number of clusters 
## * 1 proposed 15 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************

#Visualising the clusters with ggplot2
result_df <- data.frame(brexit_data,km$cluster)
head(result_df)

##        abc1   notBornUK medianIncome medianAge withHigherEd voteBrexit
## 1 0.1336406 0.012605042    0.2525773 0.5000000   0.08552632       TRUE
## 2 0.1290323 0.113445378    0.1082474 0.2727273   0.11184210       TRUE
## 3 0.1612903 0.004201681    0.1288660 0.6363636   0.11842105       TRUE
## 4 0.3225806 0.046218487    0.2268041 0.4545455   0.21710526       TRUE
## 5 0.3456221 0.058823529    0.2010309 0.5454545   0.24342105       TRUE
## 6 0.2211982 0.012605042    0.2319588 0.4545455   0.09210526       TRUE
##   km.cluster
## 1          2
## 2          2
## 3          2
## 4          2
## 5          2
## 6          2

result_df$voteBrexit <- factor(result_df$voteBrexit, levels = c(FALSE, TRUE), labels = c("False", "True"))

In the first visualisation we use the cluster ID as hue and plot for all input columns

if(!require(ggplot2)) install.packages("ggplot2",repos = "http://cran.us.r-project.org")

## Loading required package: ggplot2

library(ggplot2)
cluster_id = km$cluster
cols = c("Red", "Blue")
#Visualising using cluster id
plot(result_df[1:5], col=cols[cluster_id])

For the second visualisation we use the output column voteBrexit as hue and observe all input column combinations. By comparing the two graphs we can observe that the decision boundary formed by our clustering model is fitting well for some column combinations like abc1~notBornUK, abc1~medianAge, notBornUK~withHigherEd as seen in the first visualisation. However this has no similarity with the Brexit opinion distribution. The decision boundaries in second graph is not clearly distinguishable and it can be concluded that the inferred clusters are not predictive of our output column.

#Visualising the input using Brexit opinion
suppressWarnings({ 
 plot(result_df[1:5],col=result_df$voteBrexit,pch = 19,legend.text = c("False", "True"))
})

Before moving on to the next practical tasks, we set a fixed seed value to help reproducibility and consistent evaluation across all models.

#setting seed value
set.seed(4)

Question 2 :

Consider a logistic regression model for the data. Use the model with all inputs to find which input variables are relevant for explaining the output by interpreting the model. Identify the direction and magnitude of each input effect from the fitted coefficients. Which inputs would you say have strong effects? Order the inputs in terms of decreasing effect. Justify your reasoning. Compare your findings with the plots shown on the Guardian website. Do your findings agree with these plots? Comment on your findings

For a logistic regression model we use the glm() function with binomial family.

#Creating a Logistic Regression model
model_1 <- glm(voteBrexit ~ abc1+notBornUK+medianIncome+medianAge+withHigherEd, family =binomial,data = brexit_data)
summary(model_1)

## 
## Call:
## glm(formula = voteBrexit ~ abc1 + notBornUK + medianIncome + 
##     medianAge + withHigherEd, family = binomial, data = brexit_data)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -0.1386     0.8477  -0.164 0.870122    
## abc1          17.5780     2.9114   6.038 1.56e-09 ***
## notBornUK      5.6861     1.8033   3.153 0.001615 ** 
## medianIncome  -6.3857     1.9217  -3.323 0.000891 ***
## medianAge      5.9209     1.4066   4.209 2.56e-05 ***
## withHigherEd -26.7443     3.5762  -7.478 7.52e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 426.52  on 343  degrees of freedom
## Residual deviance: 247.39  on 338  degrees of freedom
## AIC: 259.39
## 
## Number of Fisher Scoring iterations: 6

The logistic regression model states that all the input columns are fairly significant to the output. Following is the interpretation of the coefficients :

abc1 : Positive direction , Magnitude : 17.57
notBornUK : Positive direction , Magnitude : 5.68
medianIncome : Negative direction , Magnitude : 6.38
medianAge : Positive direction , Magnitude : 5.92
withHigherEd : Negative direction , Magnitude : 26.74

From the co-efficients, we can say that abc1 has a strong positive relation to the output, whereas withHigherEd has a strong negative relation to the output. Saying that , we must interpret that a higher value of abc1 i.e. higher proportion of individuals in Upper or Middle class in a ward exhibited more tendency to vote in favour of Brexit. On other hand a higher value of withHigherEd: i.e. more proportion of residents with any university-level education in a ward exhibited lesser tendency to vote in favor of Brexit.

As a result we can order the input columns in decreasing order of effect considering only the absolute value of the magnitude as the signs associated are signifying the direction.

withHigherEd
abc1
medianIncome
notBornUK
medianAge

The findings of our model agree with the findings of Guardian website. While looking at the plots it can be inferred that proportion of residents with Higher Education and residents within ABC1 social class were correlated to the Leave/Remain decision while other factors like Median Income, Median Age were fairly distributed across both the opinions.As we quote from the given link “The best predictor of a vote for remain is the proportion of residents who have a degree”. This is same as the interpretation discussed above. Wards having higher university-level educated population were favoring the Remain decision, i.e. against leaving the EU.

Question 3

Discuss factors that may affect interpretability of the regression coefficients of the fitted model. Based on your discussion, explain whether you can reliably determine which inputs are relevant for modelling the output and order the input variables based on their relevance in decreasing order. Justify your reasoning. Use BAGGING for logistic regression for Task 2 to analyse variability of the regression coefficients to support your answer.

Factors effecting the interpretability of the regression coefficients are given as :

Collinearity : When two or more input variables in a model are highly correlated, the interpretation becomes challenging. The similar inputs increase the standard error of the models and it difficult to understand the true effect.
Variable Scale : Variable scale effects model interpretation. Larger variables have larger coefficients indicating they are more significant, although that is not actually true. This factor is not applicable here as all the variables are scaled between 0 to 1.
Interaction Pattern : If one variable is dependent on another variable, understanding the actual effect of those variables is challenging. Understanding these variables require careful consideration and domain knowledge.
Non-linearity : If input and output variables are related non-linearly, then coefficients of linear regression / logistic regression is not sufficient to understand the effect, and the model cannot capture the true relation.

In order to understand the reliability and relevance of the input variable coefficients, we find the correlation matrix and visualise the correlation patterns using hierarchical clustering with help of corrplot library.

if(!require(corrplot)) install.packages("corrplot",repos = "http://cran.us.r-project.org")

## Loading required package: corrplot

## corrplot 0.95 loaded

library(corrplot)
cor_matrix<- cor(brexit_data[1:5])
cor_matrix

##                    abc1  notBornUK medianIncome  medianAge withHigherEd
## abc1          1.0000000  0.3947856    0.7835105 -0.1662686    0.8863129
## notBornUK     0.3947856  1.0000000    0.5584333 -0.7314555    0.5501448
## medianIncome  0.7835105  0.5584333    1.0000000 -0.3467963    0.7412892
## medianAge    -0.1662686 -0.7314555   -0.3467963  1.0000000   -0.2356651
## withHigherEd  0.8863129  0.5501448    0.7412892 -0.2356651    1.0000000

corrplot(cor_matrix, method = "color", type = "upper", order = "hclust", tl.col = "black", tl.srt = 45)

From the plot it is evident that, withHigherEd has quite high correlation with Median Income and abc1 variables, and the latter two variables are correlated amongst each other as well. This actually makes sense in terms of demographic interpretation as well. It is possible that people from higher and middle class are able to afford higher education and have a higher median income than financially weaker sections in a ward.
Secondly notBornUK and medianAge don’t show very high signs of correlation, and can be considered relevant.

Hence the variables choosen in order of relevance are as follows :

withHigherEd
notBornUK
medianAge

Next we apply the BAGGING technique on our model.

library(boot)


# Define the function to extract coefficients from logistic regression model
coef_fun <- function(data, indices) {
  fit <- glm(formula = formula(model_1), data = data[indices, ], family = binomial)
  return(coef(fit))
}

# Set the number of iterations for bootstrapping
n_boot <- 1000

# Perform bootstrapping
boot_results <- boot(data = brexit_data, statistic = coef_fun, R = n_boot)

# Calculate the mean coefficients
mean_coef <- colMeans(boot_results$t)

# Calculate the standard errors of coefficients
se_coef <- apply(boot_results$t, 2, sd)

# Print mean coefficients and standard errors and compare with single model
result <- data.frame(coefficients(model_1),mean_coef,se_coef)
result

##              coefficients.model_1.   mean_coef   se_coef
## (Intercept)             -0.1385963  -0.1311362 0.7416401
## abc1                    17.5779980  18.4111562 3.0182646
## notBornUK                5.6861383   5.9755413 1.8648124
## medianIncome            -6.3857396  -6.8106970 2.1550825
## medianAge                5.9208767   6.1129262 1.3389821
## withHigherEd           -26.7442592 -27.8658581 3.9848256

On applying the BAGGING technique, we find out that standard error for the 3 correlated coefficients i.e. withHigherEd , abc1, and medianIncome are higher compared to the other variables. So it can be concluded that collinearity exists between these 3 variables and only one can be considered for relevance to the model.

Question 4

Based on your discussion for Task 3, present and carry out an alternative logistic regression approach to carry out the analysis for Task 2. Discuss benefits and disadvantages of your approach.

As we discussed in Task 3, a new logistic regression approach is used by building the model with only 3 input variables namely notBornUK, medianAge, withHigherEd.

#Creating a alternative Logistic Regression model
model_2 <- glm(voteBrexit ~ notBornUK+medianAge+withHigherEd, family =binomial,data = brexit_data)
summary(model_2)

## 
## Call:
## glm(formula = voteBrexit ~ notBornUK + medianAge + withHigherEd, 
##     family = binomial, data = brexit_data)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    2.2667     0.6866   3.302 0.000962 ***
## notBornUK      0.8167     1.3304   0.614 0.539282    
## medianAge      2.9134     1.0871   2.680 0.007365 ** 
## withHigherEd  -9.0665     1.2373  -7.328 2.34e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 426.52  on 343  degrees of freedom
## Residual deviance: 296.94  on 340  degrees of freedom
## AIC: 304.94
## 
## Number of Fisher Scoring iterations: 5

The new approach states that withHigherEd is the most significant to the output. Following is the interpretation of the coefficients :

notBornUK : Positive direction , Magnitude : 0.81
medianAge : Positive direction , Magnitude : 2.91
withHigherEd : Negative direction , Magnitude : 9.06

From the co-efficients, we can say that withHigherEd has a strong negative relation to the output in this model as well.The order the input columns in decreasing order of effect is as follows :

withHigherEd
medianAge
notBornUK

The findings of this model is echoing the findings of the Guardian website that withHigherEd is most significant to the output , indicating the proportion of people with higher education and university degrees were most effective in deciding whether an electoral ward decided in favor of leaving or not.

Advantage : By reducing the number of input variable columns, we are decreasing the complexity of the model. This is also decreasing the standard errors as we can see from comparing the model summaries. Also earlier the model was considering all input variables as significant due to multiple collinearity effect, whereas in the new model the most significant input is clearly distinguished.
Disadvantage : AIC score of the Task 4 model is going upto 304.9 whereas in Task 2 it was at 259.39. This indicates that with lesser input variables, the model might be under-fitting the data.This problem can be dealt with by using ensemble techniques such as BAGGING

Brexit Breakdown: A Statistical Exploration of Demographic Influences on Brexit Voting Pattern

Rajarshi Nandi

2024-05-10

Introduction

Question 1 :

Question 2 :

Question 3

Question 4