library(kableExtra)
library(boot)
library(DT)
library(MASS)
library(leaps)
library(tidyverse)
library(pander)
Generalized linear model (GLM) is usually applicable for a continuous response variable given continuous and/or categorical predictors. The use of GLM is based on some basic assumptions e.g the data need to be independently distributed, the target variable does not need to be normally distributed, GLM does not assume a linear relationship between the target variable and the input variables, errors need to be independent but not normally distributed. All these assumptions were key points considered prior to the analysis. Most of the input variables in the microtus dataset are highly correlated. Hence, building a logistic model with all the input variables will be problematic.To circumvent this obstacle, variables that seem important for predicting the target (Group) were selected and used in building the model.
df_microtus<-read.csv(file = "df_microtus.csv")[-1]
datatable(df_microtus,filter = "top",caption = "Table.1: Microtus Dataset")
Started the analysis by separating the known species from the unknown. Among the known species, there are two distinct species namely; Multiplex and subterraneus. Again, these were separated followed by summary statistics on the Known species to have a fair idea of the distribution of the parameters used involved.
df_microtus[,c(2:9)]<-lapply(df_microtus[,c(2:9)],as.numeric)
df_Known<-filter(df_microtus,Group==c("multiplex", "subterraneus")) #Extracting the known species
df_unknown<-filter(df_microtus,Group=="unknown") #Extracting the unknown species
df_multiplex<-filter(df_microtus,Group=="multiplex") #Extracting only multiplex species
df_subterraneus<-filter(df_microtus,Group=="subterraneus") #Extracting only subterraneus species
pander(summary(df_Known[,-1])) #Computing summary statistics on the known species
| M1Left | M2Left | M3Left | Foramen | Pbone |
|---|---|---|---|---|
| Min. :1619 | Min. :1355 | Min. :1402 | Min. :3451 | Min. :4368 |
| 1st Qu.:1747 | 1st Qu.:1504 | 1st Qu.:1566 | 1st Qu.:3687 | 1st Qu.:4807 |
| Median :1888 | Median :1549 | Median :1708 | Median :3868 | Median :5037 |
| Mean :1900 | Mean :1562 | Mean :1694 | Mean :3872 | Mean :5050 |
| 3rd Qu.:2040 | 3rd Qu.:1612 | 3rd Qu.:1772 | 3rd Qu.:4021 | 3rd Qu.:5254 |
| Max. :2282 | Max. :1814 | Max. :2145 | Max. :4464 | Max. :5874 |
| Length | Height | Rostrum |
|---|---|---|
| Min. :1965 | Min. :725.0 | Min. :395.0 |
| 1st Qu.:2210 | 1st Qu.:750.0 | 1st Qu.:425.0 |
| Median :2305 | Median :775.0 | Median :450.0 |
| Mean :2290 | Mean :781.4 | Mean :446.4 |
| 3rd Qu.:2355 | 3rd Qu.:800.0 | 3rd Qu.:465.0 |
| Max. :2600 | Max. :910.0 | Max. :524.0 |
Figure 2.1 shows how the input variables in the dataset are correlated with one another. It can be observed that most of the predictors are strongly correlated.
Group<-ifelse(df_Known$Group=="multiplex",1,2)
pairs(df_Known[ , -1],
col = c("cornflowerblue", "purple")[Group], #Change color by group
pch = c(8, 18)[Group], #Change points by group
labels =c("M1Left","M2Left","M3Left","Foramen","Pbone","Length","Height","Rostrum"),
main = "Pairwise scatterplots on the Microtus Dataset")
Figure 2.1: Pairwise scatterplots on the Microtus Dataset
Selecting the input variables to be used for the logistic regression (GLM): First, created an object by name Group containing the Group variables(name of the species) and set it outside of the dataset. Next, constructed correlation matrix on the input variables, setting the upper triangle and perfectly correlated variables to 0. Then filtered the correlation matrix, removing any input variables with correlation absolute value of 0.8 or above.
Group_Known<-df_Known$Group #Creating a variable "Group" to contain the names of the known species
Known_Species<-df_Known[,!names(df_Known) %in% 'Group']
Matrix<- cor(Known_Species) #Computing correlation matrix on the input variables
Matrix[upper.tri(Matrix)]<-0
diag(Matrix)<-0 #Setting the upper triangle and perfectly correlated variables to 0
Filter_Known_Species<-Known_Species[,!apply(Matrix,2,function(x) any(x > abs(0.8)))]#Removing any input variable with correlation absolute value of 0.8 or above
Filtered_Dataset<-cbind.data.frame(Filter_Known_Species,Group_Known)
Filtered_Dataset<-Filtered_Dataset %>%
mutate(Group_Known= ifelse(Group_Known=='multiplex',1,0))
varibale_sel<-glm(Group_Known ~., data=Filtered_Dataset,family = binomial)
stepAIC(varibale_sel,direction = "both")#Used stepAIC with the glm model as input model for the feature selection
## Start: AIC=16
## Group_Known ~ M1Left + M2Left + M3Left + Foramen + Pbone + Height +
## Rostrum
##
## Df Deviance AIC
## - Foramen 1 0.000 14.000
## - Rostrum 1 0.000 14.000
## - Pbone 1 0.000 14.000
## - M3Left 1 0.000 14.000
## - Height 1 0.000 14.000
## - M2Left 1 0.000 14.000
## <none> 0.000 16.000
## - M1Left 1 10.601 24.601
##
## Step: AIC=14
## Group_Known ~ M1Left + M2Left + M3Left + Pbone + Height + Rostrum
##
## Df Deviance AIC
## - Rostrum 1 0.0000 12.000
## - M3Left 1 0.0000 12.000
## - Pbone 1 0.0000 12.000
## - Height 1 0.0000 12.000
## <none> 0.0000 14.000
## + Foramen 1 0.0000 16.000
## - M2Left 1 6.8524 18.852
## - M1Left 1 10.7524 22.752
##
## Step: AIC=12
## Group_Known ~ M1Left + M2Left + M3Left + Pbone + Height
##
## Df Deviance AIC
## - Pbone 1 0.0000 10.000
## - M3Left 1 0.0000 10.000
## - Height 1 0.0000 10.000
## <none> 0.0000 12.000
## + Rostrum 1 0.0000 14.000
## + Foramen 1 0.0000 14.000
## - M2Left 1 7.5962 17.596
## - M1Left 1 12.0232 22.023
##
## Step: AIC=10
## Group_Known ~ M1Left + M2Left + M3Left + Height
##
## Df Deviance AIC
## - M3Left 1 0.0000 8.000
## <none> 0.0000 10.000
## + Pbone 1 0.0000 12.000
## + Rostrum 1 0.0000 12.000
## + Foramen 1 0.0000 12.000
## - M2Left 1 9.3662 17.366
## - Height 1 9.5699 17.570
## - M1Left 1 17.7874 25.787
##
## Step: AIC=8
## Group_Known ~ M1Left + M2Left + Height
##
## Df Deviance AIC
## <none> 0.0000 8.000
## + M3Left 1 0.0000 10.000
## + Pbone 1 0.0000 10.000
## + Foramen 1 0.0000 10.000
## + Rostrum 1 0.0000 10.000
## - M2Left 1 9.6533 15.653
## - Height 1 10.2697 16.270
## - M1Left 1 17.7974 23.797
##
## Call: glm(formula = Group_Known ~ M1Left + M2Left + Height, family = binomial,
## data = Filtered_Dataset)
##
## Coefficients:
## (Intercept) M1Left M2Left Height
## -25752.644 3.812 6.725 10.499
##
## Degrees of Freedom: 44 Total (i.e. Null); 41 Residual
## Null Deviance: 62.36
## Residual Deviance: 1.662e-07 AIC: 8
Building the logistic model with the three input variables, namely: M1Left, M2Left, Height. Since the microtus dataset have few observations, the known species data extracted from the dataset was used as both the train and test data. The name of the species (Group name) was set aside of the data before building the model. The model was then used to predict the species names in the data.
Model<-glm(Group_Known ~ M1Left + M2Left + Height,data=Filtered_Dataset,family="binomial")#Building the logistic model with glm ()
pred <-predict(Model,data=Filtered_Dataset[,-8],type ="response")#Predicting the species names in the known data (Group name removed)
Prediction<- ifelse(pred >= 0.5,"multiplex","subterraneus")
Comparing the predicted species names with the actual names in the known data
Pred_Grp_names<-data.frame(df_Known,Prediction)
datatable(Pred_Grp_names,filter = "top",caption="Table.2: Predicted verses Actual Group names in the Known Data") # Dataframe comparing the species names side by side
Building a K-fold cross validation error rate. The error rate associated with the logistic regression model with K-fold cross validation (k = 10) is approximately 0.09 or 9% suggesting the selected input variables and the logistic regression model are in good standing in predicting the identity of the unknown species.
set.seed(12345)
cv_error <- cv.glm(Filtered_Dataset,Model, K = 10)$delta[1]
cv_error
## [1] 0.08888889
Prediction of the unknown species identity: The general linear model and the selected parameters were used in predicting the identity of the unknown species.
pred_unknown<- predict(Model,newdata =df_unknown[,-1], type = "response")
Predict_Group <- ifelse(pred_unknown >= 0.5,"multiplex","subterraneus")
Pred_unknw_Grp<-data.frame(df_unknown, Predict_Group)
datatable(Pred_unknw_Grp,filter = "top",caption="Table.3: Prediction of the Unknown Species Identity")