library(kableExtra)
library(boot)
library(DT)
library(MASS)
library(leaps)
library(tidyverse)
library(pander)

1 Introduction

Generalized linear model (GLM) is usually applicable for a continuous response variable given continuous and/or categorical predictors. The use of GLM is based on some basic assumptions e.g the data need to be independently distributed, the target variable does not need to be normally distributed, GLM does not assume a linear relationship between the target variable and the input variables, errors need to be independent but not normally distributed. All these assumptions were key points considered prior to the analysis. Most of the input variables in the microtus dataset are highly correlated. Hence, building a logistic model with all the input variables will be problematic.To circumvent this obstacle, variables that seem important for predicting the target (Group) were selected and used in building the model.

df_microtus<-read.csv(file = "df_microtus.csv")[-1]
datatable(df_microtus,filter = "top",caption = "Table.1: Microtus Dataset")

2 Exploratory Analysis

Started the analysis by separating the known species from the unknown. Among the known species, there are two distinct species namely; Multiplex and subterraneus. Again, these were separated followed by summary statistics on the Known species to have a fair idea of the distribution of the parameters used involved.

df_microtus[,c(2:9)]<-lapply(df_microtus[,c(2:9)],as.numeric)
df_Known<-filter(df_microtus,Group==c("multiplex", "subterraneus")) #Extracting the known species 
df_unknown<-filter(df_microtus,Group=="unknown")  #Extracting the unknown species
df_multiplex<-filter(df_microtus,Group=="multiplex") #Extracting only multiplex species
df_subterraneus<-filter(df_microtus,Group=="subterraneus")  #Extracting only subterraneus species
pander(summary(df_Known[,-1])) #Computing summary statistics on the known species
Table continues below
M1Left M2Left M3Left Foramen Pbone
Min. :1619 Min. :1355 Min. :1402 Min. :3451 Min. :4368
1st Qu.:1747 1st Qu.:1504 1st Qu.:1566 1st Qu.:3687 1st Qu.:4807
Median :1888 Median :1549 Median :1708 Median :3868 Median :5037
Mean :1900 Mean :1562 Mean :1694 Mean :3872 Mean :5050
3rd Qu.:2040 3rd Qu.:1612 3rd Qu.:1772 3rd Qu.:4021 3rd Qu.:5254
Max. :2282 Max. :1814 Max. :2145 Max. :4464 Max. :5874
Length Height Rostrum
Min. :1965 Min. :725.0 Min. :395.0
1st Qu.:2210 1st Qu.:750.0 1st Qu.:425.0
Median :2305 Median :775.0 Median :450.0
Mean :2290 Mean :781.4 Mean :446.4
3rd Qu.:2355 3rd Qu.:800.0 3rd Qu.:465.0
Max. :2600 Max. :910.0 Max. :524.0

Figure 2.1 shows how the input variables in the dataset are correlated with one another. It can be observed that most of the predictors are strongly correlated.

Group<-ifelse(df_Known$Group=="multiplex",1,2)
pairs(df_Known[ , -1],
      col = c("cornflowerblue", "purple")[Group],   #Change color by group
      pch = c(8, 18)[Group],                        #Change points by group
      labels =c("M1Left","M2Left","M3Left","Foramen","Pbone","Length","Height","Rostrum"),
      main = "Pairwise scatterplots on the Microtus Dataset")
Pairwise scatterplots on the Microtus Dataset

Figure 2.1: Pairwise scatterplots on the Microtus Dataset

3 Variable Selection

Selecting the input variables to be used for the logistic regression (GLM): First, created an object by name Group containing the Group variables(name of the species) and set it outside of the dataset. Next, constructed correlation matrix on the input variables, setting the upper triangle and perfectly correlated variables to 0. Then filtered the correlation matrix, removing any input variables with correlation absolute value of 0.8 or above.

Group_Known<-df_Known$Group #Creating a variable "Group" to contain the names of the known species
Known_Species<-df_Known[,!names(df_Known) %in% 'Group']
Matrix<- cor(Known_Species)  #Computing correlation matrix on the input variables
Matrix[upper.tri(Matrix)]<-0
diag(Matrix)<-0        #Setting the upper triangle and perfectly correlated variables to 0
Filter_Known_Species<-Known_Species[,!apply(Matrix,2,function(x) any(x > abs(0.8)))]#Removing any input variable with correlation absolute value of 0.8 or above
Filtered_Dataset<-cbind.data.frame(Filter_Known_Species,Group_Known)
Filtered_Dataset<-Filtered_Dataset %>%
  mutate(Group_Known= ifelse(Group_Known=='multiplex',1,0))
varibale_sel<-glm(Group_Known ~., data=Filtered_Dataset,family = binomial)
stepAIC(varibale_sel,direction = "both")#Used stepAIC with the glm model as input model for the feature selection
## Start:  AIC=16
## Group_Known ~ M1Left + M2Left + M3Left + Foramen + Pbone + Height + 
##     Rostrum
## 
##           Df Deviance    AIC
## - Foramen  1    0.000 14.000
## - Rostrum  1    0.000 14.000
## - Pbone    1    0.000 14.000
## - M3Left   1    0.000 14.000
## - Height   1    0.000 14.000
## - M2Left   1    0.000 14.000
## <none>          0.000 16.000
## - M1Left   1   10.601 24.601
## 
## Step:  AIC=14
## Group_Known ~ M1Left + M2Left + M3Left + Pbone + Height + Rostrum
## 
##           Df Deviance    AIC
## - Rostrum  1   0.0000 12.000
## - M3Left   1   0.0000 12.000
## - Pbone    1   0.0000 12.000
## - Height   1   0.0000 12.000
## <none>         0.0000 14.000
## + Foramen  1   0.0000 16.000
## - M2Left   1   6.8524 18.852
## - M1Left   1  10.7524 22.752
## 
## Step:  AIC=12
## Group_Known ~ M1Left + M2Left + M3Left + Pbone + Height
## 
##           Df Deviance    AIC
## - Pbone    1   0.0000 10.000
## - M3Left   1   0.0000 10.000
## - Height   1   0.0000 10.000
## <none>         0.0000 12.000
## + Rostrum  1   0.0000 14.000
## + Foramen  1   0.0000 14.000
## - M2Left   1   7.5962 17.596
## - M1Left   1  12.0232 22.023
## 
## Step:  AIC=10
## Group_Known ~ M1Left + M2Left + M3Left + Height
## 
##           Df Deviance    AIC
## - M3Left   1   0.0000  8.000
## <none>         0.0000 10.000
## + Pbone    1   0.0000 12.000
## + Rostrum  1   0.0000 12.000
## + Foramen  1   0.0000 12.000
## - M2Left   1   9.3662 17.366
## - Height   1   9.5699 17.570
## - M1Left   1  17.7874 25.787
## 
## Step:  AIC=8
## Group_Known ~ M1Left + M2Left + Height
## 
##           Df Deviance    AIC
## <none>         0.0000  8.000
## + M3Left   1   0.0000 10.000
## + Pbone    1   0.0000 10.000
## + Foramen  1   0.0000 10.000
## + Rostrum  1   0.0000 10.000
## - M2Left   1   9.6533 15.653
## - Height   1  10.2697 16.270
## - M1Left   1  17.7974 23.797
## 
## Call:  glm(formula = Group_Known ~ M1Left + M2Left + Height, family = binomial, 
##     data = Filtered_Dataset)
## 
## Coefficients:
## (Intercept)       M1Left       M2Left       Height  
##  -25752.644        3.812        6.725       10.499  
## 
## Degrees of Freedom: 44 Total (i.e. Null);  41 Residual
## Null Deviance:       62.36 
## Residual Deviance: 1.662e-07     AIC: 8

4 Model Building

Building the logistic model with the three input variables, namely: M1Left, M2Left, Height. Since the microtus dataset have few observations, the known species data extracted from the dataset was used as both the train and test data. The name of the species (Group name) was set aside of the data before building the model. The model was then used to predict the species names in the data.

Model<-glm(Group_Known ~ M1Left + M2Left + Height,data=Filtered_Dataset,family="binomial")#Building the logistic model with glm ()
pred <-predict(Model,data=Filtered_Dataset[,-8],type ="response")#Predicting the species names in the known data (Group name removed)
Prediction<- ifelse(pred >= 0.5,"multiplex","subterraneus")

Comparing the predicted species names with the actual names in the known data

Pred_Grp_names<-data.frame(df_Known,Prediction)
datatable(Pred_Grp_names,filter = "top",caption="Table.2: Predicted verses Actual Group names in the Known Data") # Dataframe comparing the species names side by side

5 Model Performance

Building a K-fold cross validation error rate. The error rate associated with the logistic regression model with K-fold cross validation (k = 10) is approximately 0.09 or 9% suggesting the selected input variables and the logistic regression model are in good standing in predicting the identity of the unknown species.

set.seed(12345)
cv_error <- cv.glm(Filtered_Dataset,Model, K = 10)$delta[1]
cv_error
## [1] 0.08888889

Prediction of the unknown species identity: The general linear model and the selected parameters were used in predicting the identity of the unknown species.

pred_unknown<- predict(Model,newdata =df_unknown[,-1], type = "response")
Predict_Group <- ifelse(pred_unknown >= 0.5,"multiplex","subterraneus")
Pred_unknw_Grp<-data.frame(df_unknown, Predict_Group)
datatable(Pred_unknw_Grp,filter = "top",caption="Table.3: Prediction of the Unknown Species Identity")