library(kableExtra)
library(boot)
library(DT)
library(MASS)
library(leaps)
library(tidyverse)
library(pander)

1 Introduction

Generalized linear model (GLM) is usually applicable for a continuous response variable given continuous and/or categorical predictors. The use of GLM is based on some basic assumptions e.g the data need to be independently distributed, the target variable does not need to be normally distributed, GLM does not assume a linear relationship between the target variable and the input variables, errors need to be independent but not normally distributed. All these assumptions were key points considered prior to the analysis. Most of the input variables in the microtus dataset are highly correlated. Hence, building a logistic model with all the input variables will be problematic.To circumvent this obstacle, variables that seem important for predicting the target (Group) were selected and used in building the model.

df_microtus<-read.csv(file = "df_microtus.csv")[-1]
datatable(df_microtus,filter = "top",caption = "Table.1: Microtus Dataset")

2 Exploratory Analysis

Started the analysis by separating the known species from the unknown. Among the known species, there are two distinct species namely; Multiplex and subterraneus. Again, these were separated followed by summary statistics on the Known species to have a fair idea of the distribution of the parameters used involved.

df_microtus[,c(2:9)]<-lapply(df_microtus[,c(2:9)],as.numeric)
df_Known<-filter(df_microtus,Group==c("multiplex", "subterraneus")) #Extracting the known species 
df_unknown<-filter(df_microtus,Group=="unknown")  #Extracting the unknown species
df_multiplex<-filter(df_microtus,Group=="multiplex") #Extracting only multiplex species
df_subterraneus<-filter(df_microtus,Group=="subterraneus")  #Extracting only subterraneus species
pander(summary(df_Known[,-1])) #Computing summary statistics on the known species

Table continues below
M1Left	M2Left	M3Left	Foramen	Pbone
Min. :1619	Min. :1355	Min. :1402	Min. :3451	Min. :4368
1st Qu.:1747	1st Qu.:1504	1st Qu.:1566	1st Qu.:3687	1st Qu.:4807
Median :1888	Median :1549	Median :1708	Median :3868	Median :5037
Mean :1900	Mean :1562	Mean :1694	Mean :3872	Mean :5050
3rd Qu.:2040	3rd Qu.:1612	3rd Qu.:1772	3rd Qu.:4021	3rd Qu.:5254
Max. :2282	Max. :1814	Max. :2145	Max. :4464	Max. :5874

Length	Height	Rostrum
Min. :1965	Min. :725.0	Min. :395.0
1st Qu.:2210	1st Qu.:750.0	1st Qu.:425.0
Median :2305	Median :775.0	Median :450.0
Mean :2290	Mean :781.4	Mean :446.4
3rd Qu.:2355	3rd Qu.:800.0	3rd Qu.:465.0
Max. :2600	Max. :910.0	Max. :524.0

Figure 2.1 shows how the input variables in the dataset are correlated with one another. It can be observed that most of the predictors are strongly correlated.

Group<-ifelse(df_Known$Group=="multiplex",1,2)
pairs(df_Known[ , -1],
      col = c("cornflowerblue", "purple")[Group],   #Change color by group
      pch = c(8, 18)[Group],                        #Change points by group
      labels =c("M1Left","M2Left","M3Left","Foramen","Pbone","Length","Height","Rostrum"),
      main = "Pairwise scatterplots on the Microtus Dataset")

Figure 2.1: Pairwise scatterplots on the Microtus Dataset

3 Variable Selection

Selecting the input variables to be used for the logistic regression (GLM): First, created an object by name Group containing the Group variables(name of the species) and set it outside of the dataset. Next, constructed correlation matrix on the input variables, setting the upper triangle and perfectly correlated variables to 0. Then filtered the correlation matrix, removing any input variables with correlation absolute value of 0.8 or above.

Group_Known<-df_Known$Group #Creating a variable "Group" to contain the names of the known species
Known_Species<-df_Known[,!names(df_Known) %in% 'Group']
Matrix<- cor(Known_Species)  #Computing correlation matrix on the input variables
Matrix[upper.tri(Matrix)]<-0
diag(Matrix)<-0        #Setting the upper triangle and perfectly correlated variables to 0
Filter_Known_Species<-Known_Species[,!apply(Matrix,2,function(x) any(x > abs(0.8)))]#Removing any input variable with correlation absolute value of 0.8 or above
Filtered_Dataset<-cbind.data.frame(Filter_Known_Species,Group_Known)
Filtered_Dataset<-Filtered_Dataset %>%
  mutate(Group_Known= ifelse(Group_Known=='multiplex',1,0))

varibale_sel<-glm(Group_Known ~., data=Filtered_Dataset,family = binomial)
stepAIC(varibale_sel,direction = "both")#Used stepAIC with the glm model as input model for the feature selection

## Start:  AIC=16
## Group_Known ~ M1Left + M2Left + M3Left + Foramen + Pbone + Height + 
##     Rostrum
## 
##           Df Deviance    AIC
## - Foramen  1    0.000 14.000
## - Rostrum  1    0.000 14.000
## - Pbone    1    0.000 14.000
## - M3Left   1    0.000 14.000
## - Height   1    0.000 14.000
## - M2Left   1    0.000 14.000
## <none>          0.000 16.000
## - M1Left   1   10.601 24.601
## 
## Step:  AIC=14
## Group_Known ~ M1Left + M2Left + M3Left + Pbone + Height + Rostrum
## 
##           Df Deviance    AIC
## - Rostrum  1   0.0000 12.000
## - M3Left   1   0.0000 12.000
## - Pbone    1   0.0000 12.000
## - Height   1   0.0000 12.000
## <none>         0.0000 14.000
## + Foramen  1   0.0000 16.000
## - M2Left   1   6.8524 18.852
## - M1Left   1  10.7524 22.752
## 
## Step:  AIC=12
## Group_Known ~ M1Left + M2Left + M3Left + Pbone + Height
## 
##           Df Deviance    AIC
## - Pbone    1   0.0000 10.000
## - M3Left   1   0.0000 10.000
## - Height   1   0.0000 10.000
## <none>         0.0000 12.000
## + Rostrum  1   0.0000 14.000
## + Foramen  1   0.0000 14.000
## - M2Left   1   7.5962 17.596
## - M1Left   1  12.0232 22.023
## 
## Step:  AIC=10
## Group_Known ~ M1Left + M2Left + M3Left + Height
## 
##           Df Deviance    AIC
## - M3Left   1   0.0000  8.000
## <none>         0.0000 10.000
## + Pbone    1   0.0000 12.000
## + Rostrum  1   0.0000 12.000
## + Foramen  1   0.0000 12.000
## - M2Left   1   9.3662 17.366
## - Height   1   9.5699 17.570
## - M1Left   1  17.7874 25.787
## 
## Step:  AIC=8
## Group_Known ~ M1Left + M2Left + Height
## 
##           Df Deviance    AIC
## <none>         0.0000  8.000
## + M3Left   1   0.0000 10.000
## + Pbone    1   0.0000 10.000
## + Foramen  1   0.0000 10.000
## + Rostrum  1   0.0000 10.000
## - M2Left   1   9.6533 15.653
## - Height   1  10.2697 16.270
## - M1Left   1  17.7974 23.797

## 
## Call:  glm(formula = Group_Known ~ M1Left + M2Left + Height, family = binomial, 
##     data = Filtered_Dataset)
## 
## Coefficients:
## (Intercept)       M1Left       M2Left       Height  
##  -25752.644        3.812        6.725       10.499  
## 
## Degrees of Freedom: 44 Total (i.e. Null);  41 Residual
## Null Deviance:       62.36 
## Residual Deviance: 1.662e-07     AIC: 8

4 Model Building

Building the logistic model with the three input variables, namely: M1Left, M2Left, Height. Since the microtus dataset have few observations, the known species data extracted from the dataset was used as both the train and test data. The name of the species (Group name) was set aside of the data before building the model. The model was then used to predict the species names in the data.

Model<-glm(Group_Known ~ M1Left + M2Left + Height,data=Filtered_Dataset,family="binomial")#Building the logistic model with glm ()
pred <-predict(Model,data=Filtered_Dataset[,-8],type ="response")#Predicting the species names in the known data (Group name removed)
Prediction<- ifelse(pred >= 0.5,"multiplex","subterraneus")

Comparing the predicted species names with the actual names in the known data

Pred_Grp_names<-data.frame(df_Known,Prediction)
datatable(Pred_Grp_names,filter = "top",caption="Table.2: Predicted verses Actual Group names in the Known Data") # Dataframe comparing the species names side by side

5 Model Performance

Building a K-fold cross validation error rate. The error rate associated with the logistic regression model with K-fold cross validation (k = 10) is approximately 0.09 or 9% suggesting the selected input variables and the logistic regression model are in good standing in predicting the identity of the unknown species.

set.seed(12345)
cv_error <- cv.glm(Filtered_Dataset,Model, K = 10)$delta[1]
cv_error

## [1] 0.08888889

Prediction of the unknown species identity: The general linear model and the selected parameters were used in predicting the identity of the unknown species.

pred_unknown<- predict(Model,newdata =df_unknown[,-1], type = "response")
Predict_Group <- ifelse(pred_unknown >= 0.5,"multiplex","subterraneus")
Pred_unknw_Grp<-data.frame(df_unknown, Predict_Group)
datatable(Pred_unknw_Grp,filter = "top",caption="Table.3: Prediction of the Unknown Species Identity")

ANALYSIS ON MICROTUS DATASET

Samuel Adjei

04/08/2021

1 Introduction

2 Exploratory Analysis

3 Variable Selection

4 Model Building

5 Model Performance