Machine learning (ML) is a subset of artificial intelligence (AI) that focuses on the development of algorithms and statistical models that enable computer systems to learn from and make predictions or decisions based on data. In other words, it’s a field of study and practice where computers are trained to perform tasks without being explicitly programmed for each task.
Supervised learning and unsupervised learning are two fundamental categories of machine learning, each with its own distinct characteristics and use cases:
Definition: Supervised learning is a type of machine learning where the algorithm learns from labeled training data. It involves a clear relationship between input and output, where the model is trained to map input data to corresponding target labels.
Key Characteristics: * The training data consists of pairs of input features (attributes or variables) and their corresponding target labels. * The goal is to learn a mapping function that can accurately predict the target label for new, unseen data based on the input features. * Supervised learning tasks include classification (predicting discrete labels) and regression (predicting continuous values).
Examples: * Classification: Spam email detection, image recognition, sentiment analysis. * Regression: House price prediction, stock price forecasting, temperature prediction.
Evaluation Metrics: * For classification: Accuracy, precision, recall, F1-score, ROC curve, etc. * For regression: Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), R-squared, etc.
Data pre-processing, feature engineering, model selection, training, evaluation, and testing are common steps in a supervised learning pipeline.
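As a concrete illustration of the classification metrics listed above, here is a minimal R sketch using toy vectors (not tied to any dataset in this document):
# Toy example: accuracy, precision, recall and F1 for a binary classifier.
actual    <- factor(c(1, 0, 1, 1, 0, 1, 0, 0, 1, 0), levels = c(0, 1))
predicted <- factor(c(1, 0, 1, 0, 0, 1, 1, 0, 1, 0), levels = c(0, 1))
cm <- table(Predicted = predicted, Actual = actual)
accuracy  <- sum(diag(cm)) / sum(cm)
precision <- cm["1", "1"] / sum(cm["1", ])   ### TP / (TP + FP)
recall    <- cm["1", "1"] / sum(cm[, "1"])   ### TP / (TP + FN)
f1        <- 2 * precision * recall / (precision + recall)
round(c(accuracy = accuracy, precision = precision, recall = recall, F1 = f1), 3)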
Definition: Unsupervised learning is a type of machine learning where the algorithm learns from unlabeled data. It aims to discover patterns, structures, or relationships in the data without explicit guidance or target labels.
Key Characteristics: * The training data does not contain labeled target information. * The primary objective is to explore the inherent structure within the data, such as clustering similar data points or reducing dimensionality.
Examples: * Clustering: Grouping similar customer profiles for targeted marketing. * Dimensionality Reduction: Reducing the number of features while preserving essential information (e.g., Principal Component Analysis).
Evaluation Metrics: Evaluation in unsupervised learning can be more challenging since there are no ground truth labels. Metrics depend on the specific task, such as silhouette score for clustering or explained variance ratio for dimensionality reduction.
Workflow: * Data preprocessing, model selection, training, and evaluation are steps in an unsupervised learning pipeline. * Visualization and domain knowledge are often essential for interpreting results.
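As a brief illustration of the unsupervised metrics mentioned above, the sketch below (which assumes the cluster package is installed and uses the built-in iris data purely for illustration) computes an average silhouette width for a k-means solution and the explained variance ratio from a principal component analysis:
library(cluster)   ### for the silhouette() function
x   <- scale(iris[, 1:4])
km  <- kmeans(x, centers = 3, nstart = 25)
sil <- silhouette(km$cluster, dist(x))
mean(sil[, "sil_width"])                  ### average silhouette width (closer to 1 is better)
pca <- prcomp(x)
round(pca$sdev^2 / sum(pca$sdev^2), 3)    ### explained variance ratio per component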
In addition to supervised and unsupervised learning, there is also a category known as semi-supervised learning, which combines elements of both. In semi-supervised learning, you have a small amount of labeled data and a larger amount of unlabeled data. The goal is to leverage both sources of information to build better models.
In summary, supervised learning is about learning from labeled data to make predictions, unsupervised learning is about discovering patterns in unlabeled data, and semi-supervised learning combines elements of both to leverage labeled and unlabeled data for learning and prediction.
Classification is a fundamental concept in machine learning that involves categorizing data into predefined classes or labels based on the features or attributes of the data. Classification models are widely used in various fields, including finance, healthcare, natural language processing, image recognition, and more. Here’s a comprehensive write-up on classification models in machine learning:
Classification is a supervised learning task where the goal is to assign a class label to an input data point. The classes can be binary (two classes, often referred to as positive and negative) or multi-class (more than two classes). The objective of classification is to build a model that can generalize from the training data to make accurate predictions on new, unseen data.
Multinomial logistic regression is used in situations where the dependent variable has more than two categories. It is an extension of the basic (binary) logistic regression model. For example, if the dependent variable has three categories, A, B and C, we would estimate regression models of A vs B and A vs C; these models are estimated simultaneously. The model predicts the probabilities of a categorical dependent variable that has more than two outcome classes. The dependent variable is nominal in nature, meaning there is no ordering of the target classes, i.e. the classes cannot be meaningfully ranked (e.g. 1, 2, 3, 4). Examples of multinomial logistic regression topics include:
Favorite ice cream flavor * Dependent Variable: Ice cream flavor - Chocolate, Vanilla, Coffee, Strawberry * Independent Variables: Gender, Age, Ethnicity
Current employment status * Dependent Variable: Employment status - Full-time employment, Part-time employment, Training, Unemployed * Independent Variables: Gender, Age, Qualifications, Area a person lives, number of children
Sentencing outcome status * Dependent Variable: Sentencing status - Prison, Community Penalty, Fine, Discharge * Independent Variables: Gender, Age, Previous Convictions
Multinomial logistic regression assumes that:
1. Each independent variable has a single value for each case.
2. The dependent variable cannot be perfectly predicted from the independent variables for each case.
3. There is no multicollinearity between the independent variables. Multicollinearity occurs when two or more independent variables are highly correlated with each other, which makes it difficult to understand how much each independent variable contributes to the predicted category of the dependent variable.
4. There should be no outliers in the data points.
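In its general form (standard notation, not specific to any example above), a multinomial logistic regression with reference category K estimates one equation per non-reference category:
\[ \ln\left(\frac{P(Y = j)}{P(Y = K)}\right) = \beta_{0j} + \beta_{1j}x_1 + \cdots + \beta_{pj}x_p, \qquad j = 1, \dots, K-1 \]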
The following packages are required: sjmisc and sjPlot. To create the multinomial models the nnet package needs to be installed.
library(sjmisc)
library(sjPlot)
library(nnet)
library(wakefield)
library(dplyr)
library(caTools)
library(ROCR)
library(stargazer)
library(ISLR)
library(ISLR2)
library(MASS)
library(caret)
library(splines)
library(splines2)
library(pROC)
library(randomForest)
library(rpart)
library(rpart.plot)
library(rattle)
library(party)
library(partykit)
library(ggplot2)
For the purposes of this exercise the wakefield package will be used to generate a random data set. This will be a pseudo data set resembling the number of children in a local authority who are either a Child in Need (CIN), subject to a Child Protection Plan (CPP), or a Child Looked After (CLA).
set.seed(999)
df.CPP <-
r_data_frame(
n = 12500,
r_sample(n = 255, x = "CPP", prob = NULL, name = "Type"),
id,
r_sample( x = c("Male", "Female"),
prob = c(0.52, 0.48),
name = "Sex"),
r_sample( x = c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17"),
prob = c(0.12, 0.08, 0.08, 0.08, 0.08, 0.06, 0.06, 0.06, 0.06, 0.06, 0.038, 0.038, 0.038, 0.038, 0.038, 0.038, 0.01, 0.01),
name = "Age")
)
set.seed(999)
df.CIN <-
r_data_frame(
n = 97000,
r_sample(n = 255, x = "CIN", prob = NULL, name = "Type"),
id,
r_sample( x = c("Male", "Female"),
prob = c(0.55, 0.45),
name = "Sex"),
r_sample( x = c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17"),
prob = c(0.05, 0.04, 0.04, 0.04, 0.04, 0.044, 0.044, 0.044, 0.044, 0.044, 0.053, 0.053, 0.053, 0.053, 0.053, 0.053, 0.115, 0.115),
name = "Age")
)
set.seed(999)
df.CLA <-
r_data_frame(
n = 16200,
r_sample(n = 255, x = "CLA", prob = NULL, name = "Type"),
id,
r_sample( x = c("Male", "Female"),
prob = c(0.57, 0.43),
name = "Sex"),
r_sample( x = c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17"),
prob = c(0.05, 0.035, 0.035, 0.035, 0.035, 0.038, 0.038, 0.038, 0.038, 0.038, 0.065, 0.065, 0.065, 0.065, 0.065, 0.065, 0.115, 0.115),
name = "Age")
)
mydata <- rbind(df.CIN, df.CLA, df.CPP)
head(mydata,5)
# A tibble: 5 × 4
Type ID Sex Age
<chr> <chr> <chr> <chr>
1 CIN 00001 Female 16
2 CIN 00002 Male 10
3 CIN 00003 Male 2
4 CIN 00004 Male 14
5 CIN 00005 Female 14
This set of data contains four variables. The dependent variable is Type, which details the Children's Services outcome for each child. Using multinomial modelling we will explore how this outcome varies by sex. The next step is to request the frequencies of these two variables to check how they are coded.
frq(mydata, Type, Sex)
Type <character>
# total N=125700 valid N=125700 mean=1.33 sd=0.65
Value | N | Raw % | Valid % | Cum. %
----------------------------------------
CIN | 97000 | 77.17 | 77.17 | 77.17
CLA | 16200 | 12.89 | 12.89 | 90.06
CPP | 12500 | 9.94 | 9.94 | 100.00
<NA> | 0 | 0.00 | <NA> | <NA>
Sex <character>
# total N=125700 valid N=125700 mean=1.55 sd=0.50
Value | N | Raw % | Valid % | Cum. %
-----------------------------------------
Female | 56513 | 44.96 | 44.96 | 44.96
Male | 69187 | 55.04 | 55.04 | 100.00
<NA> | 0 | 0.00 | <NA> | <NA>
The most common outcome was CIN (77% of all children), followed by CLA (13%) and CPP (10%). We can also see that there were 69,187 males and 56,513 females. It is useful to see whether Type differs by Sex with a cross-tabulation. Here we can see that CLA outcomes are slightly more likely for males, while CPP outcomes are more common for females.
sjt.xtab(mydata$Type, mydata$Sex, show.col.prc = TRUE)
| Type | Female | Male | Total |
|---|---|---|---|
| CIN | 43647 (77.2 %) | 53353 (77.1 %) | 97000 (77.2 %) |
| CLA | 6931 (12.3 %) | 9269 (13.4 %) | 16200 (12.9 %) |
| CPP | 5935 (10.5 %) | 6565 (9.5 %) | 12500 (9.9 %) |
| Total | 56513 (100 %) | 69187 (100 %) | 125700 (100 %) |

χ² = 63.131 · df = 2 · Cramér's V = 0.022 · p = 0.000
To run this regression the first step is to create a new “factor” variable, Type.f that replicates the values of Type but which replaces the values with text labels. Check that the coding of the variable has worked.
mydata$Type.f <- factor(mydata$Type, labels = c("CIN", "CLA", "CPP"))
head(mydata, n = 10)
# A tibble: 10 × 5
Type ID Sex Age Type.f
<chr> <chr> <chr> <chr> <fct>
1 CIN 00001 Female 16 CIN
2 CIN 00002 Male 10 CIN
3 CIN 00003 Male 2 CIN
4 CIN 00004 Male 14 CIN
5 CIN 00005 Female 14 CIN
6 CIN 00006 Female 16 CIN
7 CIN 00007 Male 16 CIN
8 CIN 00008 Female 17 CIN
9 CIN 00009 Male 7 CIN
10 CIN 00010 Female 2 CIN
The next step of the multinomial logistic regression is to select a reference category. By default R orders the factors alphabetically, so would treat CIN as the reference group. However, it is possible to change this using the relevel command. We will choose CLA as the reference group.
mydata$Type.f <- relevel(mydata$Type.f, ref = "CLA")
It is useful to create a new dummy variable, Male, which takes the value 0 if female and 1 if male. The purpose of this is to make the sex effect clearer to see in the model.
mydata$Male <- rec(mydata$Sex, rec = "Male = 1; Female = 0; else = NA")
Finally, the multinom command is used to run a multinomial model. The summary() command can be used once the model has converged to view the results.
### Fit the model
model1 <-multinom(Type.f ~ Male, data = mydata)
# weights: 9 (4 variable)
initial value 138095.564686
iter 10 value 87153.669778
iter 10 value 87153.669764
final value 87153.669764
converged
### Model summary
summary(model1)
Call:
multinom(formula = Type.f ~ Male, data = mydata)
Coefficients:
(Intercept) Male1
CIN 1.8401323 -0.08990659
CPP -0.1551774 -0.18968963
Std. Errors:
(Intercept) Male1
CIN 0.01293019 0.01714112
CPP 0.01768551 0.02393687
Residual Deviance: 174307.3
AIC: 174315.3
The multinomial regression model results are included in the Coefficients: table, with the associated standard errors in the Std. Errors: table. The (Intercept) column gives the estimates for the reference category (female), whilst the Male1 column gives the effect of being male compared to the reference. Model results are given in logits, so we need to take the exponential to read them as odds and odds ratios. This can be achieved with the exp(coef()) command.
exp(coef(model1))
(Intercept) Male1
CIN 6.2973712 0.9140166
CPP 0.8562633 0.8272158
To calculate the 95% confidence interval (CI) and restructure the results to make them easier to read, use the tab_model function.
tab_model(model1)
| Predictors | Odds Ratios | CI | p | Response |
|---|---|---|---|---|
| (Intercept) | 6.30 | 6.14 – 6.46 | <0.001 | CIN |
| Male [1] | 0.91 | 0.88 – 0.95 | <0.001 | CIN |
| (Intercept) | 0.86 | 0.83 – 0.89 | <0.001 | CPP |
| Male [1] | 0.83 | 0.79 – 0.87 | <0.001 | CPP |
| Observations | 125700 | | | |
| R2 / R2 adjusted | 0.000 / 0.000 | | | |
Here we can see that the odds of becoming CIN rather than CLA are 6.3 for females (the reference group). Compared to females, males have 0.91 times the odds of becoming CIN rather than CLA. Similarly, for males the odds of becoming subject to a Child Protection Plan rather than CLA are 0.83 times those of females. Underneath the table we can see the approximate R2, which is 0.000. We would therefore want to consider other potential predictors of Children's Services outcomes.
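The p-values shown by tab_model can also be obtained directly from the model object; the sketch below mirrors the two-tailed z-test used later in this document:
z1 <- summary(model1)$coefficients / summary(model1)$standard.errors
p1 <- (1 - pnorm(abs(z1), 0, 1)) * 2   ### two-tailed p-values for each coefficient
round(p1, 4)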
Sometimes it is desirable to request the predicted probabilities to better understand the chances of receiving different Children’s Services Outcomes. This can be achieved by using the fitted() command.
predprob <-fitted(model1)
head(predprob)
CLA CIN CPP
1 0.1226447 0.7723392 0.1050161
2 0.1339725 0.7711328 0.0948947
3 0.1339725 0.7711328 0.0948947
4 0.1339725 0.7711328 0.0948947
5 0.1226447 0.7723392 0.1050161
6 0.1226447 0.7723392 0.1050161
The first row of data relates to a female (see the first observation in the data above). The table illustrates that the probability of a female becoming CIN is 0.77, CPP is 0.11 and CLA is 0.12. The second row refers to a male, with probabilities of CIN = 0.77, CPP = 0.09 and CLA = 0.13. All rounded to 2 d.p.
It is possible to extend the model by adding in the age of the child/young person. This variable needs to be numeric. For the purposes of this model the reference category will be CLA.
mydata$Age <- as.numeric(mydata$Age)
model2 <-multinom(Type.f ~ Male + Age, data = mydata)
# weights: 12 (6 variable)
initial value 138095.564686
iter 10 value 86003.438226
iter 20 value 84340.937713
final value 84340.823598
converged
### model summary
summary(model2)
Call:
multinom(formula = Type.f ~ Male + Age, data = mydata)
Coefficients:
(Intercept) Male1 Age
CIN 2.010631 -0.08921876 -0.01681853
CPP 1.039220 -0.18518794 -0.14516870
Std. Errors:
(Intercept) Male1 Age
CIN 0.02088667 0.01714897 0.001589901
CPP 0.02602449 0.02437257 0.002306540
Residual Deviance: 168681.6
AIC: 168693.6
tab_model(model2)
| Predictors | Odds Ratios | CI | p | Response |
|---|---|---|---|---|
| (Intercept) | 7.47 | 7.17 – 7.78 | <0.001 | CIN |
| Male [1] | 0.91 | 0.88 – 0.95 | <0.001 | CIN |
| Age | 0.98 | 0.98 – 0.99 | <0.001 | CIN |
| (Intercept) | 2.83 | 2.69 – 2.97 | <0.001 | CPP |
| Male [1] | 0.83 | 0.79 – 0.87 | <0.001 | CPP |
| Age | 0.86 | 0.86 – 0.87 | <0.001 | CPP |
| Observations | 125700 | | | |
| R2 / R2 adjusted | 0.033 / 0.033 | | | |
The results in the table illustrate that the odds of becoming CIN rather than CLA fall as age increases: for every additional year, the odds fall by approximately 2% (1 - 0.98 = 0.02). This means that older children are less likely to become CIN rather than CLA. The same is true for the odds of CPP rather than CLA, which fall by approximately 14% per year. Underneath the table we can see the approximate R2, which is 0.03. We would therefore want to consider other potential predictors of Children's Services outcomes.
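The percentage changes in the odds quoted above can be read straight from the exponentiated coefficients; a minimal sketch using model2:
round((exp(coef(model2)) - 1) * 100, 1)   ### % change in the odds per one-unit increase, relative to the CLA reference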
Finally, if we want to explore the changes in predicted probability associated with sex whilst holding age constant, we can create another data set varying sex but holding Age at the sample mean.
sub.Type <- data.frame(Male = c(0,1), Age = mean(mydata$Age))
sub.Type$Male <- as.factor(sub.Type$Male)
p <- predict(model2, newdata = sub.Type, "probs")
barplot(p, ylab = "Probability", ylim=c(0,1), xlab = "Children's Service Outcome", beside = T, legend = TRUE)
For someone of average age, the predicted probability of becoming CIN is approximately 0.79 for both females and males. The probability of a female (of average age) becoming CPP is approximately 0.09 compared to 0.08 for a male, and the probability of a male (of average age) becoming CLA is approximately 0.14 compared to 0.12 for a female.
mydata1 <- read.csv("Cardiotocographic.csv", header = TRUE)
str(mydata1)
'data.frame': 2126 obs. of 22 variables:
$ LB : int 120 132 133 134 132 134 134 122 122 122 ...
$ AC : num 0 0.00638 0.00332 0.00256 0.00651 ...
$ FM : num 0 0 0 0 0 0 0 0 0 0 ...
$ UC : num 0 0.00638 0.00831 0.00768 0.00814 ...
$ DL : num 0 0.00319 0.00332 0.00256 0 ...
$ DS : num 0 0 0 0 0 0 0 0 0 0 ...
$ DP : num 0 0 0 0 0 ...
$ ASTV : int 73 17 16 16 16 26 29 83 84 86 ...
$ MSTV : num 0.5 2.1 2.1 2.4 2.4 5.9 6.3 0.5 0.5 0.3 ...
$ ALTV : int 43 0 0 0 0 0 0 6 5 6 ...
$ MLTV : num 2.4 10.4 13.4 23 19.9 0 0 15.6 13.6 10.6 ...
$ Width : int 64 130 130 117 117 150 150 68 68 68 ...
$ Min : int 62 68 68 53 53 50 50 62 62 62 ...
$ Max : int 126 198 198 170 170 200 200 130 130 130 ...
$ Nmax : int 2 6 5 11 9 5 6 0 0 1 ...
$ Nzeros : int 0 1 1 0 0 3 3 0 0 0 ...
$ Mode : int 120 141 141 137 137 76 71 122 122 122 ...
$ Mean : int 137 136 135 134 136 107 107 122 122 122 ...
$ Median : int 121 140 138 137 138 107 106 123 123 123 ...
$ Variance: int 73 12 13 13 11 170 215 3 3 1 ...
$ Tendency: int 1 0 0 1 1 0 0 1 1 1 ...
$ NSP : int 2 1 1 1 1 3 3 3 3 3 ...
From the structure of the data above, the first 21 variables are independent variables. The response NSP is coded as: NSP = 1, normal patient; NSP = 2, suspect patient; NSP = 3, pathologic patient.
* LB - FHR baseline (beats per minute)
* AC - # of accelerations per second
* FM - # of fetal movements per second
* UC - # of uterine contractions per second
* DL - # of light decelerations per second
* DS - # of severe decelerations per second
* DP - # of prolonged decelerations per second
* ASTV - percentage of time with abnormal short term variability
* MSTV - mean value of short term variability
* ALTV - percentage of time with abnormal long term variability
* MLTV - mean value of long term variability
* Width - width of FHR histogram
* Min - minimum of FHR histogram
* Max - maximum of FHR histogram
* Nmax - # of histogram peaks
* Nzeros - # of histogram zeros
* Mode - histogram mode
* Mean - histogram mean
* Median - histogram median
* Variance - histogram variance
* Tendency - histogram tendency
* CLASS - FHR pattern class code (1 to 10)
* NSP - fetal state class code (N = normal; S = suspect; P = pathologic)
mydata1$NSP <- as.factor(mydata1$NSP)
str(mydata1)
'data.frame': 2126 obs. of 22 variables:
$ LB : int 120 132 133 134 132 134 134 122 122 122 ...
$ AC : num 0 0.00638 0.00332 0.00256 0.00651 ...
$ FM : num 0 0 0 0 0 0 0 0 0 0 ...
$ UC : num 0 0.00638 0.00831 0.00768 0.00814 ...
$ DL : num 0 0.00319 0.00332 0.00256 0 ...
$ DS : num 0 0 0 0 0 0 0 0 0 0 ...
$ DP : num 0 0 0 0 0 ...
$ ASTV : int 73 17 16 16 16 26 29 83 84 86 ...
$ MSTV : num 0.5 2.1 2.1 2.4 2.4 5.9 6.3 0.5 0.5 0.3 ...
$ ALTV : int 43 0 0 0 0 0 0 6 5 6 ...
$ MLTV : num 2.4 10.4 13.4 23 19.9 0 0 15.6 13.6 10.6 ...
$ Width : int 64 130 130 117 117 150 150 68 68 68 ...
$ Min : int 62 68 68 53 53 50 50 62 62 62 ...
$ Max : int 126 198 198 170 170 200 200 130 130 130 ...
$ Nmax : int 2 6 5 11 9 5 6 0 0 1 ...
$ Nzeros : int 0 1 1 0 0 3 3 0 0 0 ...
$ Mode : int 120 141 141 137 137 76 71 122 122 122 ...
$ Mean : int 137 136 135 134 136 107 107 122 122 122 ...
$ Median : int 121 140 138 137 138 107 106 123 123 123 ...
$ Variance: int 73 12 13 13 11 170 215 3 3 1 ...
$ Tendency: int 1 0 0 1 1 0 0 1 1 1 ...
$ NSP : Factor w/ 3 levels "1","2","3": 2 1 1 1 1 3 3 3 3 3 ...
Let us make use of set.seed() so that if you partition the data set again you do not get different train and test data sets.
ind <- sample(2, nrow(mydata1),
replace = TRUE,
prob = c(0.6, 0.4))
training <- mydata1[ind==1,]
testing <- mydata1[ind ==2,]
From the code above, we can see that the training data set has 1277 observations of 22 variables, while the testing data set has 849 observations of 22 variables.
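The partition can be checked directly; a quick sketch (the exact counts depend on the random draw):
dim(training); dim(testing)                 ### rows and columns in each partition
round(prop.table(table(training$NSP)), 3)   ### class balance in the training set
round(prop.table(table(testing$NSP)), 3)    ### class balance in the testing set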
## Load the following library
library(nnet) ### nnet stands for neural networks
One important thing we must tell the system is the reference level. Out of levels 1, 2 and 3, which is our reference level? In this case, the reference level will be the normal patient (NSP = 1).
training$NSP <- relevel(training$NSP, ref = "1")
In the command below, we are going to use all of the other variables as independent variables.
#### the model
mymodel <- multinom(NSP~.,data = training)
# weights: 69 (44 variable)
initial value 1405.125117
iter 10 value 570.285714
iter 20 value 544.822165
iter 30 value 407.104754
iter 40 value 370.928272
iter 50 value 365.143677
iter 60 value 311.150926
iter 70 value 309.067691
iter 80 value 292.156603
iter 90 value 287.381568
iter 100 value 287.381315
final value 287.381315
stopped after 100 iterations
You can see that the fitting runs through 100 iterations and then stops. From the results, the first iteration starts with a very high error and the error keeps reducing; by iteration 100 the error is essentially constant.
### Summary of the model
summary(mymodel)
Call:
multinom(formula = NSP ~ ., data = training)
Coefficients:
(Intercept) LB AC FM UC DL DS
2 -17.31012 -0.02760348 -974.35380 9.892921 -338.8540 -83.49005 -0.2330054
3 -15.41322 0.45653346 -91.35794 18.082785 -450.5167 -235.30982 8.9150235
DP ASTV MSTV ALTV MLTV Width Min
2 88.28689 0.06818958 -0.3194275 0.01767543 -0.03723673 0.006975364 0.01753461
3 125.06839 0.17025220 -1.1293653 0.06533746 0.07014182 0.013265491 0.02272449
Max Nmax Nzeros Mode Mean Median
2 0.02450997 0.1348233 -0.07132604 -0.02765878 0.3191267 -0.2073014
3 0.03598999 -0.3105369 0.66652874 0.03184939 -0.1523690 -0.3815680
Variance Tendency
2 0.05909011 0.2512291
3 0.07054533 0.7363657
Std. Errors:
(Intercept) LB AC FM UC DL
2 2.033101 0.03373358 0.001820328 0.1559106 0.007687022 0.002437925
3 0.458621 0.05064984 0.001237864 0.1546391 0.004267391 0.001239707
DS DP ASTV MSTV ALTV MLTV
2 2.762331e-05 6.180971e-05 0.01306283 0.3053944 0.007186479 0.04043266
3 2.739531e-05 1.778715e-04 0.02131316 0.4800381 0.012537553 0.06574320
Width Min Max Nmax Nzeros Mode Mean
2 0.005392413 0.009131205 0.01189680 0.06922511 0.1658503 0.04037489 0.06747207
3 0.007502056 0.013456759 0.01305705 0.13035293 0.3267609 0.04546155 0.05035061
Median Variance Tendency
2 0.08923069 0.01075478 0.3265719
3 0.07095862 0.01331678 0.4978992
Residual Deviance: 574.7626
AIC: 658.7626
From the results above, we estimated the model with all 21 independent variables, so we need to check which variables are statistically significant. To do this we conduct a two-tailed z-test.
### 2-tailed z-test
z <- summary(mymodel)$coefficients/summary(mymodel)$standard.errors
p <- (1-pnorm(abs(z),0,1)) *2
p
(Intercept) LB AC FM UC DL DS DP ASTV MSTV ALTV
2 0 0.4131978 0 0 0 0 0 0 1.788047e-07 0.29558369 1.391154e-02
3 0 0.0000000 0 0 0 0 0 0 1.332268e-15 0.01863979 1.874807e-07
MLTV Width Min Max Nmax Nzeros Mode
2 0.3570731 0.19582041 0.05482053 0.039377951 0.05146201 0.66714998 0.4933130
3 0.2860143 0.07701958 0.09127608 0.005844775 0.01720589 0.04136971 0.4835661
Mean Median Variance Tendency
2 2.247844e-06 2.016802e-02 3.922364e-08 0.4417201
3 2.476815e-03 7.559813e-08 1.174137e-07 0.1391549
From the results above, we can drop the variables that are insignificant; in other words, such variables are not contributing anything to the model. For instance, we can drop MLTV, Width, Min, Mode, Nmax, Nzeros and Tendency. We can now run the model again.
mymodel <- multinom(NSP~. -MLTV -Width -Min -Mode -Nmax - Nzeros -Tendency,data = training)
# weights: 48 (30 variable)
initial value 1405.125117
iter 10 value 553.294583
iter 20 value 422.401296
iter 30 value 390.556415
iter 40 value 374.779936
iter 50 value 317.131571
iter 60 value 315.331314
iter 70 value 311.344612
iter 80 value 309.705667
iter 90 value 308.639685
iter 100 value 301.591114
final value 301.591114
stopped after 100 iterations
summary(mymodel)
Call:
multinom(formula = NSP ~ . - MLTV - Width - Min - Mode - Nmax -
Nzeros - Tendency, data = training)
Coefficients:
(Intercept) LB AC FM UC DL DS
2 -15.79030 0.007146002 -674.8893 8.888663 -290.0121 -105.6467 -0.2778796
3 -13.84179 0.389746487 -71.2671 13.133329 -305.7423 -178.6945 6.8575186
DP ASTV MSTV ALTV Max Mean Median
2 58.26660 0.06138004 -0.4759162 0.02116653 0.03138541 0.21240327 -0.1629087
3 70.41143 0.14405678 -1.0565171 0.05530839 0.01752014 -0.09907106 -0.2902961
Variance
2 0.04740086
3 0.05509592
Std. Errors:
(Intercept) LB AC FM UC DL
2 1.8386025 0.03127154 0.011449897 2.164807 0.010586139 0.007383440
3 0.4227068 0.04389029 0.002072531 1.096782 0.007202979 0.003833048
DS DP ASTV MSTV ALTV Max
2 7.227337e-05 0.003915843 0.01192198 0.3209805 0.006273795 0.01129398
3 7.301935e-05 0.001910488 0.02013791 0.4395718 0.009685572 0.01125022
Mean Median Variance
2 0.05776442 0.06491073 0.009016669
3 0.03775549 0.04616799 0.011546313
Residual Deviance: 603.1822
AIC: 663.1822
z <- summary(mymodel)$coefficients/summary(mymodel)$standard.errors
p <- (1-pnorm(abs(z),0,1)) *2
p
(Intercept) LB AC FM UC DL DS DP ASTV MSTV
2 0 0.8192463 0 4.025988e-05 0 0 0 0 2.626099e-07 0.13815543
3 0 0.0000000 0 0.000000e+00 0 0 0 0 8.457679e-13 0.01623832
ALTV Max Mean Median Variance
2 7.413829e-04 0.00545349 0.0002359368 1.208218e-02 1.464039e-07
3 1.127175e-08 0.11939559 0.0086899387 3.219538e-10 1.826478e-06
The model above is now our final model. We will use the coefficients in this model to write the equations:
\[ \ln\left(\frac{P(NSP=2)}{P(NSP=1)}\right) = -15.79030 + 0.007146002*LB - 674.8893*AC + 8.888663*FM - 290.0121*UC - 105.6467*DL - 0.2778796*DS + 58.26660*DP + \cdots + 0.04740086*Variance \]
\[ \ln\left(\frac{P(NSP=3)}{P(NSP=1)}\right) = -13.84179 + 0.389746487*LB - 71.2671*AC + 13.133329*FM - 305.7423*UC - 178.6945*DL + 6.8575186*DS + 70.41143*DP + \cdots + 0.05509592*Variance \]
For the first equation, the left-hand side is the natural logarithm of the ratio of the probability that NSP is 2 to the probability that NSP is 1; that is, the log-odds of the patient being suspect rather than normal.
For the second equation, the left-hand side is the natural logarithm of the ratio of the probability that NSP is 3 to the probability that NSP is 1; that is, the log-odds of the patient being pathologic rather than normal.
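To convert these two log-odds equations into the three class probabilities, the standard multinomial (softmax) transformation applies. Writing η2 and η3 for the right-hand sides of the two equations above:
\[ P(NSP=1) = \frac{1}{1 + e^{\eta_2} + e^{\eta_3}}, \qquad P(NSP=2) = \frac{e^{\eta_2}}{1 + e^{\eta_2} + e^{\eta_3}}, \qquad P(NSP=3) = \frac{e^{\eta_3}}{1 + e^{\eta_2} + e^{\eta_3}} \]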
## Confusion Matrix
### Misclassification Error
In this step we will create the confusion matrix for the model, for both the training and the test data sets. We shall also calculate the misclassification error for both data sets.
Remember that we created the model using only the variables that are significant; variables that were not significant were removed. We will therefore use our preferred model to create the confusion matrix.
### Confusion matrix for the training data set
p <- predict(mymodel, training)
head(p)
[1] 2 1 1 2 1 2
Levels: 1 2 3
The results show the classification of the first six patients. We can now compare these with the actual patients in our training data set.
head(training$NSP)
[1] 2 1 1 3 3 2
Levels: 1 2 3
From the results above, the first patient was predicted to be suspect and in reality the patient was suspect; the second and third patients were predicted to be normal and in reality they were normal. The fourth patient was predicted to be suspect but in reality was pathologic, and the fifth patient was predicted to be normal but was in fact pathologic. Lastly, the sixth patient was predicted to be suspect and in reality was suspect. Let us now create the confusion matrix.
confusion_matrix <- table(p, training$NSP)
confusion_matrix
p 1 2 3
1 943 48 8
2 32 123 19
3 6 11 89
The column values 1, 2 and 3 are the actual values, while the row values 1, 2 and 3 are the predicted values. From the matrix, 943 patients were normal and the model predicted them as normal. However, 48 patients who were suspect were predicted as normal, and 8 pathologic patients were predicted as normal. Of the suspect patients, 123 were correctly predicted as suspect, and 89 pathologic patients were correctly predicted as pathologic. The numbers on the diagonal are correct classifications; the rest are misclassifications.
Using the results above, we can calculate the accuracy rate of our model.
### Correct classification
sum(diag(confusion_matrix))/sum(confusion_matrix)
[1] 0.9030493
The model has 90.3% correct classification.
1-sum(diag(confusion_matrix))/sum(confusion_matrix)
[1] 0.09695074
The model has 9.695074% misclassification.
attach(training)
confusionMatrix(p, NSP, positive = "1")
Confusion Matrix and Statistics
Reference
Prediction 1 2 3
1 943 48 8
2 32 123 19
3 6 11 89
Overall Statistics
Accuracy : 0.903
95% CI : (0.8855, 0.9187)
No Information Rate : 0.767
P-Value [Acc > NIR] : <2e-16
Kappa : 0.7408
Mcnemar's Test P-Value : 0.1317
Statistics by Class:
Class: 1 Class: 2 Class: 3
Sensitivity 0.9613 0.67582 0.76724
Specificity 0.8121 0.95351 0.98538
Pos Pred Value 0.9439 0.70690 0.83962
Neg Pred Value 0.8643 0.94661 0.97698
Prevalence 0.7670 0.14230 0.09070
Detection Rate 0.7373 0.09617 0.06959
Detection Prevalence 0.7811 0.13604 0.08288
Balanced Accuracy 0.8867 0.81467 0.87631
p1 <- predict(mymodel, testing)
head(p1)
[1] 1 1 3 3 1 1
Levels: 1 2 3
head(testing$NSP)
[1] 1 1 3 3 3 1
Levels: 1 2 3
confusion_matrix1 <- table(p1, testing$NSP)
confusion_matrix1
p1 1 2 3
1 635 34 3
2 34 76 11
3 5 3 46
sum(diag(confusion_matrix1))/sum(confusion_matrix1)
[1] 0.8937426
1- sum(diag(confusion_matrix1))/sum(confusion_matrix1)
[1] 0.1062574
attach(testing)
confusionMatrix(p1, NSP, positive = "1")
Confusion Matrix and Statistics
Reference
Prediction 1 2 3
1 635 34 3
2 34 76 11
3 5 3 46
Overall Statistics
Accuracy : 0.8937
95% CI : (0.871, 0.9137)
No Information Rate : 0.7957
P-Value [Acc > NIR] : 1.888e-14
Kappa : 0.6921
Mcnemar's Test P-Value : 0.1666
Statistics by Class:
Class: 1 Class: 2 Class: 3
Sensitivity 0.9421 0.67257 0.76667
Specificity 0.7861 0.93869 0.98983
Pos Pred Value 0.9449 0.62810 0.85185
Neg Pred Value 0.7771 0.94904 0.98235
Prevalence 0.7957 0.13341 0.07084
Detection Rate 0.7497 0.08973 0.05431
Detection Prevalence 0.7934 0.14286 0.06375
Balanced Accuracy 0.8641 0.80563 0.87825
In the context of multinomial logistic regression, accuracy and sensitivity are both important metrics for evaluating the performance of the model. Accuracy represents the overall correctness of the model’s predictions across all categories or classes. It measures the proportion of correctly predicted instances out of the total instances in the data set. While accuracy is a useful metric, especially when classes are balanced, it may not provide a complete picture of model performance when dealing with imbalanced datasets or when different classes have varying levels of importance.
On the other hand, sensitivity, also known as the true positive rate or recall, focuses on the model’s ability to correctly identify instances from a specific class. It measures the proportion of true positives (correctly predicted instances of a particular class) out of all actual instances belonging to that class. Sensitivity is particularly important when dealing with imbalanced datasets or when certain classes are more critical to identify correctly than others, such as in medical diagnosis.
In summary, while accuracy gives an overall view of model performance, sensitivity provides insights into the model’s ability to correctly detect specific classes, making it a valuable metric when dealing with imbalanced datasets or when certain classes have greater significance in the application. Both metrics should be considered in the context of the problem you’re addressing to fully assess the effectiveness of your multinomial logistic regression model.
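As a minimal illustration using the confusion_matrix object created above for the training data (rows are predictions, columns are actual classes), the per-class sensitivity and precision can be computed directly:
round(diag(confusion_matrix) / colSums(confusion_matrix), 4)  ### sensitivity (recall): correct / actual totals per class
round(diag(confusion_matrix) / rowSums(confusion_matrix), 4)  ### precision (pos pred value): correct / predicted totals per class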
Suppose we want to know how many patients were normal, suspect and pathologic in the training data set; we can use the command below.
frq(training, NSP, Tendency)
NSP <categorical>
# total N=1279 valid N=1279 mean=1.32 sd=0.63
Value | N | Raw % | Valid % | Cum. %
--------------------------------------
1 | 981 | 76.70 | 76.70 | 76.70
2 | 182 | 14.23 | 14.23 | 90.93
3 | 116 | 9.07 | 9.07 | 100.00
<NA> | 0 | 0.00 | <NA> | <NA>
Tendency <integer>
# total N=1279 valid N=1279 mean=0.32 sd=0.61
Value | N | Raw % | Valid % | Cum. %
--------------------------------------
-1 | 101 | 7.90 | 7.90 | 7.90
0 | 672 | 52.54 | 52.54 | 60.44
1 | 506 | 39.56 | 39.56 | 100.00
<NA> | 0 | 0.00 | <NA> | <NA>
From the training data, 981 patients were normal, 182 were suspect and 116 were pathologic.
n <- table(training$NSP)
n/sum(n)*100
1 2 3
76.700547 14.229867 9.069586
From the data, 76.7% of the patients were normal, 14.23% were suspect and 9.07% were pathologic.
confusion_matrix/colSums(confusion_matrix)
p 1 2 3
1 0.961264016 0.048929664 0.008154944
2 0.175824176 0.675824176 0.104395604
3 0.051724138 0.094827586 0.767241379
From the results above, the model gives 96.13% correct classification for normal patients, 67.58% for suspect patients and about 76.72% for pathologic patients. In other words, the model does better at classifying classes 1 and 3 than class 2.
confusion_matrix1/colSums(confusion_matrix1)
p1 1 2 3
1 0.942136499 0.050445104 0.004451039
2 0.300884956 0.672566372 0.097345133
3 0.083333333 0.050000000 0.766666667
From the results above, the model gives 94.21% correct classification for normal patients, 67.26% for suspect patients and about 76.67% for pathologic patients in the testing data. In other words, the model again does better at classifying classes 1 and 3 than class 2 for the testing data.
mydata2 <- read.csv("Cardiotocographic.csv", header = TRUE)
str(mydata2)
'data.frame': 2126 obs. of 22 variables:
$ LB : int 120 132 133 134 132 134 134 122 122 122 ...
$ AC : num 0 0.00638 0.00332 0.00256 0.00651 ...
$ FM : num 0 0 0 0 0 0 0 0 0 0 ...
$ UC : num 0 0.00638 0.00831 0.00768 0.00814 ...
$ DL : num 0 0.00319 0.00332 0.00256 0 ...
$ DS : num 0 0 0 0 0 0 0 0 0 0 ...
$ DP : num 0 0 0 0 0 ...
$ ASTV : int 73 17 16 16 16 26 29 83 84 86 ...
$ MSTV : num 0.5 2.1 2.1 2.4 2.4 5.9 6.3 0.5 0.5 0.3 ...
$ ALTV : int 43 0 0 0 0 0 0 6 5 6 ...
$ MLTV : num 2.4 10.4 13.4 23 19.9 0 0 15.6 13.6 10.6 ...
$ Width : int 64 130 130 117 117 150 150 68 68 68 ...
$ Min : int 62 68 68 53 53 50 50 62 62 62 ...
$ Max : int 126 198 198 170 170 200 200 130 130 130 ...
$ Nmax : int 2 6 5 11 9 5 6 0 0 1 ...
$ Nzeros : int 0 1 1 0 0 3 3 0 0 0 ...
$ Mode : int 120 141 141 137 137 76 71 122 122 122 ...
$ Mean : int 137 136 135 134 136 107 107 122 122 122 ...
$ Median : int 121 140 138 137 138 107 106 123 123 123 ...
$ Variance: int 73 12 13 13 11 170 215 3 3 1 ...
$ Tendency: int 1 0 0 1 1 0 0 1 1 1 ...
$ NSP : int 2 1 1 1 1 3 3 3 3 3 ...
mydata2$NSP <- as.ordered(mydata2$NSP)
mydata2$Tendency<-as.factor(mydata2$Tendency)
str(mydata2)
'data.frame': 2126 obs. of 22 variables:
$ LB : int 120 132 133 134 132 134 134 122 122 122 ...
$ AC : num 0 0.00638 0.00332 0.00256 0.00651 ...
$ FM : num 0 0 0 0 0 0 0 0 0 0 ...
$ UC : num 0 0.00638 0.00831 0.00768 0.00814 ...
$ DL : num 0 0.00319 0.00332 0.00256 0 ...
$ DS : num 0 0 0 0 0 0 0 0 0 0 ...
$ DP : num 0 0 0 0 0 ...
$ ASTV : int 73 17 16 16 16 26 29 83 84 86 ...
$ MSTV : num 0.5 2.1 2.1 2.4 2.4 5.9 6.3 0.5 0.5 0.3 ...
$ ALTV : int 43 0 0 0 0 0 0 6 5 6 ...
$ MLTV : num 2.4 10.4 13.4 23 19.9 0 0 15.6 13.6 10.6 ...
$ Width : int 64 130 130 117 117 150 150 68 68 68 ...
$ Min : int 62 68 68 53 53 50 50 62 62 62 ...
$ Max : int 126 198 198 170 170 200 200 130 130 130 ...
$ Nmax : int 2 6 5 11 9 5 6 0 0 1 ...
$ Nzeros : int 0 1 1 0 0 3 3 0 0 0 ...
$ Mode : int 120 141 141 137 137 76 71 122 122 122 ...
$ Mean : int 137 136 135 134 136 107 107 122 122 122 ...
$ Median : int 121 140 138 137 138 107 106 123 123 123 ...
$ Variance: int 73 12 13 13 11 170 215 3 3 1 ...
$ Tendency: Factor w/ 3 levels "-1","0","1": 3 2 2 3 3 2 2 3 3 3 ...
$ NSP : Ord.factor w/ 3 levels "1"<"2"<"3": 2 1 1 1 1 3 3 3 3 3 ...
frq(mydata2, NSP, Tendency)
NSP <ordinal>
# total N=2126 valid N=2126 mean=1.30 sd=0.61
Value | N | Raw % | Valid % | Cum. %
---------------------------------------
1 | 1655 | 77.85 | 77.85 | 77.85
2 | 295 | 13.88 | 13.88 | 91.72
3 | 176 | 8.28 | 8.28 | 100.00
<NA> | 0 | 0.00 | <NA> | <NA>
Tendency <categorical>
# total N=2126 valid N=2126 mean=0.32 sd=0.61
Value | N | Raw % | Valid % | Cum. %
---------------------------------------
-1 | 165 | 7.76 | 7.76 | 7.76
0 | 1115 | 52.45 | 52.45 | 60.21
1 | 846 | 39.79 | 39.79 | 100.00
<NA> | 0 | 0.00 | <NA> | <NA>
summary(mydata2)
LB AC FM UC
Min. :106.0 Min. :0.000000 Min. :0.000000 Min. :0.000000
1st Qu.:126.0 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.001876
Median :133.0 Median :0.001630 Median :0.000000 Median :0.004482
Mean :133.3 Mean :0.003170 Mean :0.009474 Mean :0.004357
3rd Qu.:140.0 3rd Qu.:0.005631 3rd Qu.:0.002512 3rd Qu.:0.006525
Max. :160.0 Max. :0.019284 Max. :0.480634 Max. :0.014925
DL DS DP ASTV
Min. :0.000000 Min. :0.000e+00 Min. :0.0000000 Min. :12.00
1st Qu.:0.000000 1st Qu.:0.000e+00 1st Qu.:0.0000000 1st Qu.:32.00
Median :0.000000 Median :0.000e+00 Median :0.0000000 Median :49.00
Mean :0.001885 Mean :3.585e-06 Mean :0.0001566 Mean :46.99
3rd Qu.:0.003264 3rd Qu.:0.000e+00 3rd Qu.:0.0000000 3rd Qu.:61.00
Max. :0.015385 Max. :1.353e-03 Max. :0.0053476 Max. :87.00
MSTV ALTV MLTV Width
Min. :0.200 Min. : 0.000 Min. : 0.000 Min. : 3.00
1st Qu.:0.700 1st Qu.: 0.000 1st Qu.: 4.600 1st Qu.: 37.00
Median :1.200 Median : 0.000 Median : 7.400 Median : 67.50
Mean :1.333 Mean : 9.847 Mean : 8.188 Mean : 70.45
3rd Qu.:1.700 3rd Qu.:11.000 3rd Qu.:10.800 3rd Qu.:100.00
Max. :7.000 Max. :91.000 Max. :50.700 Max. :180.00
Min Max Nmax Nzeros
Min. : 50.00 Min. :122 Min. : 0.000 Min. : 0.0000
1st Qu.: 67.00 1st Qu.:152 1st Qu.: 2.000 1st Qu.: 0.0000
Median : 93.00 Median :162 Median : 3.000 Median : 0.0000
Mean : 93.58 Mean :164 Mean : 4.068 Mean : 0.3236
3rd Qu.:120.00 3rd Qu.:174 3rd Qu.: 6.000 3rd Qu.: 0.0000
Max. :159.00 Max. :238 Max. :18.000 Max. :10.0000
Mode Mean Median Variance Tendency
Min. : 60.0 Min. : 73.0 Min. : 77.0 Min. : 0.00 -1: 165
1st Qu.:129.0 1st Qu.:125.0 1st Qu.:129.0 1st Qu.: 2.00 0 :1115
Median :139.0 Median :136.0 Median :139.0 Median : 7.00 1 : 846
Mean :137.5 Mean :134.6 Mean :138.1 Mean : 18.81
3rd Qu.:148.0 3rd Qu.:145.0 3rd Qu.:148.0 3rd Qu.: 24.00
Max. :187.0 Max. :182.0 Max. :186.0 Max. :269.00
NSP
1:1655
2: 295
3: 176
attach(mydata2)
xtabs(~NSP+Tendency, mydata2)
Tendency
NSP -1 0 1
1 101 887 667
2 15 137 143
3 49 91 36
Let us partition the data into train and test
ind1 <- sample(2, nrow(mydata2), replace = TRUE, prob = c(0.8, 0.2))
train <- mydata2[ind1 == 1,]
test <- mydata2[ind1 == 2,]
The train data set has 1729 observations with 22 variables, while the test data set has 397 observations with 22 variables
To estimate this model, we need the library called MASS
library(MASS)
# Load the ggplot2 library
library(ggplot2)
# Calculate the percentage of each NSP category
percentage_data <- mydata2 %>%
group_by(NSP) %>%
summarize(Percentage = n() / nrow(mydata2) * 100)
# Create the bar plot
ggplot(percentage_data, aes(x = NSP, y = Percentage, fill = NSP)) +
geom_bar(stat = "identity") +
geom_text(aes(label = sprintf("%.1f%%", Percentage)), vjust = -0.5) +
labs(title = "Distribution of NSP",
x = "NSP",
y = "Percentage") +
theme_minimal()
ggplot(data= mydata2, aes(x=NSP, y =Mean, fill=Tendency)) +
geom_bar(position = "dodge",
stat = "summary",
fun = "mean")+
theme_minimal()+
labs(title = "Bar Graph of Mean VS NSP and Tendency",
x = "NSP",
y = "Mean")+
theme(axis.text.x = element_text(angle = 360, vjust = 0.5))
library(ggplot2)
ggplot(mydata2, aes(x = as.factor(NSP), fill = Tendency)) +
  geom_bar(aes(y = ..count../tapply(..count.., ..x.., sum)[..x..]), position = "dodge") +
  geom_text(aes(y = ..count../tapply(..count.., ..x.., sum)[..x..],
                label = scales::percent(..count../tapply(..count.., ..x.., sum)[..x..])),
            stat = "count", position = position_dodge(0.9), vjust = -0.5) +
  ylab("Percentage of NSP Across Tendency") +
  xlab("NSP") +
  labs(title = "Bar Graph Showing the Distribution of NSP For Various Tendencies") +
  scale_y_continuous(labels = scales::percent)
### Estimate the model
mymodel2 <- polr(NSP ~ LB+AC+FM, train, Hess = TRUE)
summary(mymodel2)
Call:
polr(formula = NSP ~ LB + AC + FM, data = train, Hess = TRUE)
Coefficients:
Value Std. Error t value
LB 0.03836 0.006623 5.792e+00
AC -804.40604 0.005577 -1.442e+05
FM 7.42540 1.159603 6.403e+00
Intercepts:
Value Std. Error t value
1|2 5.4924 0.9074 6.0525
2|3 6.8839 0.9180 7.4991
Residual Deviance: 1820.135
AIC: 1830.135
From the results above, we can see that for a one-unit increase in FM we expect about a 7.43 increase in the expected value of NSP on the log-odds scale, given that all other variables in the model are held constant.
(ctable <- coef(summary(mymodel2)))
Value Std. Error t value
LB 0.03836167 0.006623396 5.791843e+00
AC -804.40604304 0.005577380 -1.442265e+05
FM 7.42540119 1.159602960 6.403400e+00
1|2 5.49235454 0.907448582 6.052524e+00
2|3 6.88388486 0.917964521 7.499075e+00
p <- pnorm(abs(ctable[,"t value"]),lower.tail = FALSE)*2
(ctable<- cbind(ctable, "p-value" = p))
Value Std. Error t value p-value
LB 0.03836167 0.006623396 5.791843e+00 6.961803e-09
AC -804.40604304 0.005577380 -1.442265e+05 0.000000e+00
FM 7.42540119 1.159602960 6.403400e+00 1.519547e-10
1|2 5.49235454 0.907448582 6.052524e+00 1.425935e-09
2|3 6.88388486 0.917964521 7.499075e+00 6.426971e-14
pred <- predict(mymodel2, train[1:5,], type = 'prob')
print(pred, digits = 3)
1 2 3
1 0.709 0.19857 0.092752
2 0.996 0.00288 0.000956
3 0.955 0.03318 0.011496
5 0.997 0.00258 0.000858
6 0.768 0.16225 0.069947
train[1:5, 1:3]
LB AC FM
1 120 0.000000000 0
2 132 0.006379585 0
3 133 0.003322259 0
5 132 0.006514658 0
6 134 0.001049318 0
Using the results above, we can write the model and predict the probability of NSP being 1,2 or 3.
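In the proportional-odds parameterisation used by polr, the class probabilities come from the two intercepts (cut-points) and a single linear predictor built from the coefficients above:
\[ P(NSP \le j) = \frac{1}{1 + \exp\left(-(\zeta_j - \eta)\right)}, \qquad \eta = 0.03836*LB - 804.406*AC + 7.4254*FM, \qquad \zeta_{1|2} = 5.4924,\ \zeta_{2|3} = 6.8839 \]
The class probabilities follow as P(NSP = 1) = P(NSP ≤ 1), P(NSP = 2) = P(NSP ≤ 2) − P(NSP ≤ 1) and P(NSP = 3) = 1 − P(NSP ≤ 2). For the first training row (LB = 120, AC = FM = 0) this gives η ≈ 4.60 and probabilities of roughly 0.709, 0.199 and 0.093, matching the first row of predicted probabilities shown above.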
mymodel2 <- polr(NSP ~., train, Hess = TRUE)
summary(mymodel2)
Call:
polr(formula = NSP ~ ., data = train, Hess = TRUE)
Coefficients:
Value Std. Error t value
LB 7.603e-03 2.176e-02 3.493e-01
AC -7.248e+02 8.192e-03 -8.848e+04
FM 4.207e+00 2.086e+00 2.017e+00
UC -2.359e+02 1.316e-02 -1.792e+04
DL -3.232e+00 3.184e-03 -1.015e+03
DS 2.121e+03 8.713e-05 2.435e+07
DP 3.280e+03 2.617e-03 1.254e+06
ASTV 9.513e-02 8.779e-03 1.084e+01
MSTV -1.216e-01 2.209e-01 -5.505e-01
ALTV 3.469e-02 4.912e-03 7.063e+00
MLTV 3.134e-02 2.703e-02 1.159e+00
Width 4.019e-02 8.460e-03 4.751e+00
Min 5.858e-02 1.181e-02 4.962e+00
Nmax -2.576e-02 4.864e-02 -5.296e-01
Nzeros 8.720e-02 1.183e-01 7.373e-01
Mode -8.525e-02 1.943e-02 -4.388e+00
Mean 6.602e-02 2.417e-02 2.732e+00
Median -6.256e-03 3.269e-02 -1.914e-01
Variance 2.093e-02 5.495e-03 3.809e+00
Tendency0 7.972e-01 3.833e-01 2.080e+00
Tendency1 1.124e+00 4.831e-01 2.326e+00
Intercepts:
Value Std. Error t value
1|2 12.7352 1.5281 8.3342
2|3 15.9194 1.5666 10.1615
Residual Deviance: 977.6412
AIC: 1023.641
### Calculating the P-values for our Model
(ctable <- coef(summary(mymodel2)))
Value Std. Error t value
LB 7.602905e-03 2.176351e-02 3.493418e-01
AC -7.248121e+02 8.192246e-03 -8.847538e+04
FM 4.206558e+00 2.086063e+00 2.016506e+00
UC -2.358793e+02 1.316491e-02 -1.791728e+04
DL -3.232100e+00 3.183799e-03 -1.015171e+03
DS 2.121300e+03 8.712903e-05 2.434665e+07
DP 3.280390e+03 2.616513e-03 1.253726e+06
ASTV 9.513327e-02 8.779237e-03 1.083617e+01
MSTV -1.216135e-01 2.209055e-01 -5.505229e-01
ALTV 3.469108e-02 4.911631e-03 7.063045e+00
MLTV 3.133798e-02 2.703051e-02 1.159356e+00
Width 4.018983e-02 8.459766e-03 4.750702e+00
Min 5.858344e-02 1.180706e-02 4.961731e+00
Nmax -2.576105e-02 4.864326e-02 -5.295913e-01
Nzeros 8.719934e-02 1.182689e-01 7.372971e-01
Mode -8.525058e-02 1.942733e-02 -4.388179e+00
Mean 6.601984e-02 2.416965e-02 2.731518e+00
Median -6.256364e-03 3.269334e-02 -1.913651e-01
Variance 2.092894e-02 5.494913e-03 3.808784e+00
Tendency0 7.971815e-01 3.832595e-01 2.080004e+00
Tendency1 1.123538e+00 4.830942e-01 2.325713e+00
1|2 1.273523e+01 1.528062e+00 8.334238e+00
2|3 1.591943e+01 1.566648e+00 1.016146e+01
p <- pnorm(abs(ctable[,"t value"]),lower.tail = FALSE)*2
(ctable<- cbind(ctable, "p-value" = p))
Value Std. Error t value p-value
LB 7.602905e-03 2.176351e-02 3.493418e-01 7.268327e-01
AC -7.248121e+02 8.192246e-03 -8.847538e+04 0.000000e+00
FM 4.206558e+00 2.086063e+00 2.016506e+00 4.374712e-02
UC -2.358793e+02 1.316491e-02 -1.791728e+04 0.000000e+00
DL -3.232100e+00 3.183799e-03 -1.015171e+03 0.000000e+00
DS 2.121300e+03 8.712903e-05 2.434665e+07 0.000000e+00
DP 3.280390e+03 2.616513e-03 1.253726e+06 0.000000e+00
ASTV 9.513327e-02 8.779237e-03 1.083617e+01 2.319886e-27
MSTV -1.216135e-01 2.209055e-01 -5.505229e-01 5.819608e-01
ALTV 3.469108e-02 4.911631e-03 7.063045e+00 1.628925e-12
MLTV 3.133798e-02 2.703051e-02 1.159356e+00 2.463110e-01
Width 4.018983e-02 8.459766e-03 4.750702e+00 2.027113e-06
Min 5.858344e-02 1.180706e-02 4.961731e+00 6.986763e-07
Nmax -2.576105e-02 4.864326e-02 -5.295913e-01 5.963953e-01
Nzeros 8.719934e-02 1.182689e-01 7.372971e-01 4.609417e-01
Mode -8.525058e-02 1.942733e-02 -4.388179e+00 1.143039e-05
Mean 6.601984e-02 2.416965e-02 2.731518e+00 6.304329e-03
Median -6.256364e-03 3.269334e-02 -1.913651e-01 8.482396e-01
Variance 2.092894e-02 5.494913e-03 3.808784e+00 1.396520e-04
Tendency0 7.971815e-01 3.832595e-01 2.080004e+00 3.752514e-02
Tendency1 1.123538e+00 4.830942e-01 2.325713e+00 2.003387e-02
1|2 1.273523e+01 1.528062e+00 8.334238e+00 7.799903e-17
2|3 1.591943e+01 1.566648e+00 1.016146e+01 2.946346e-24
### Estimate the model
mymodel2 <- polr(NSP ~ AC+FM+UC+DL+DS+DP+ASTV+Width+Min+Variance, train, Hess = TRUE)
summary(mymodel2)
Call:
polr(formula = NSP ~ AC + FM + UC + DL + DS + DP + ASTV + Width +
Min + Variance, data = train, Hess = TRUE)
Coefficients:
Value Std. Error t value
AC -847.07581 0.0078071 -1.085e+05
FM 4.85295 2.0472349 2.370e+00
UC -258.32577 0.0140492 -1.839e+04
DL -102.94351 0.0021385 -4.814e+04
DS 2454.39606 0.0002667 9.202e+06
DP 2437.19066 0.0026491 9.200e+05
ASTV 0.09775 0.0072388 1.350e+01
Width 0.01970 0.0049289 3.996e+00
Min 0.03737 0.0061958 6.031e+00
Variance 0.02902 0.0041693 6.961e+00
Intercepts:
Value Std. Error t value
1|2 10.3025 1.0066 10.2348
2|3 12.9969 1.0332 12.5796
Residual Deviance: 1064.245
AIC: 1088.245
### Calculating the P-values for our Model
(ctable <- coef(summary(mymodel2)))
Value Std. Error t value
AC -847.07580590 0.0078070958 -1.085008e+05
FM 4.85295401 2.0472348679 2.370492e+00
UC -258.32577168 0.0140491822 -1.838725e+04
DL -102.94351232 0.0021385158 -4.813783e+04
DS 2454.39606181 0.0002667268 9.201912e+06
DP 2437.19066147 0.0026490693 9.200177e+05
ASTV 0.09774735 0.0072388477 1.350316e+01
Width 0.01969580 0.0049288793 3.995999e+00
Min 0.03736699 0.0061957662 6.031052e+00
Variance 0.02902359 0.0041693409 6.961193e+00
1|2 10.30254198 1.0066138832 1.023485e+01
2|3 12.99690142 1.0331704602 1.257963e+01
p <- pnorm(abs(ctable[,"t value"]),lower.tail = FALSE)*2
(ctable<- cbind(ctable, "p-value" = p))
Value Std. Error t value p-value
AC -847.07580590 0.0078070958 -1.085008e+05 0.000000e+00
FM 4.85295401 2.0472348679 2.370492e+00 1.776443e-02
UC -258.32577168 0.0140491822 -1.838725e+04 0.000000e+00
DL -102.94351232 0.0021385158 -4.813783e+04 0.000000e+00
DS 2454.39606181 0.0002667268 9.201912e+06 0.000000e+00
DP 2437.19066147 0.0026490693 9.200177e+05 0.000000e+00
ASTV 0.09774735 0.0072388477 1.350316e+01 1.498002e-41
Width 0.01969580 0.0049288793 3.995999e+00 6.442203e-05
Min 0.03736699 0.0061957662 6.031052e+00 1.628959e-09
Variance 0.02902359 0.0041693409 6.961193e+00 3.374030e-12
1|2 10.30254198 1.0066138832 1.023485e+01 1.384097e-24
2|3 12.99690142 1.0331704602 1.257963e+01 2.733093e-36
pred1 <- predict(mymodel2, train)
head(pred1,10)
[1] 2 1 1 1 2 2 2 2 2 2
Levels: 1 2 3
### Confusion Matrix
tab2 <- table(pred1, train$NSP)
tab2
pred1 1 2 3
1 1252 99 8
2 62 124 46
3 2 18 95
sum(diag(tab2))/sum(tab2)
[1] 0.8622509
An accuracy score of 0.8623 for the model above indicates that it correctly predicted the outcome category for approximately 86.23% of the instances in the training data set. In other words, the model successfully classified 86.23% of the instances into their respective outcome categories, while the remaining 13.77% were incorrectly classified.
Accuracy is a commonly used metric for evaluating the overall performance of classification models, but it is essential to interpret it in the context of your specific problem and dataset. If the dataset has imbalanced classes (i.e., some outcome categories are much more prevalent than others), a high accuracy score may not necessarily indicate a good model: the model might achieve high accuracy simply by predicting the majority class most of the time, so you should also consider other metrics like precision, recall and F1-score, especially for minority classes.
The significance of accuracy also depends on the real-world consequences of misclassification. In some applications, misclassifying certain categories may have more severe consequences than others, and sensitivity and specificity can provide a more nuanced view of the model's performance for specific classes. It is also important to establish a baseline accuracy to compare your model against; if the dataset has a natural baseline (e.g., always predicting the majority class or random guessing), your model's accuracy should be compared with that baseline.
In summary, an accuracy of 0.8623 suggests that the model is performing well in terms of overall classification accuracy. However, to get a more comprehensive understanding of its performance, it is essential to consider other metrics and contextual factors, especially if your dataset has imbalanced classes or specific classes with different levels of importance.
### Misclassification
1 - sum(diag(tab2))/sum(tab2)
[1] 0.1377491
attach(train)
confusionMatrix(pred1, NSP, positive = "1")
Confusion Matrix and Statistics
Reference
Prediction 1 2 3
1 1252 99 8
2 62 124 46
3 2 18 95
Overall Statistics
Accuracy : 0.8623
95% CI : (0.845, 0.8783)
No Information Rate : 0.7714
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.6178
Mcnemar's Test P-Value : 2.108e-05
Statistics by Class:
Class: 1 Class: 2 Class: 3
Sensitivity 0.9514 0.51452 0.63758
Specificity 0.7256 0.92628 0.98715
Pos Pred Value 0.9213 0.53448 0.82609
Neg Pred Value 0.8156 0.92062 0.96606
Prevalence 0.7714 0.14127 0.08734
Detection Rate 0.7339 0.07268 0.05569
Detection Prevalence 0.7966 0.13599 0.06741
Balanced Accuracy 0.8385 0.72040 0.81237
pred2 <- predict(mymodel2, test)
head(pred2,10)
[1] 1 3 2 1 1 1 2 1 1 1
Levels: 1 2 3
tab3 <- table(pred2, test$NSP)
tab3
pred2 1 2 3
1 322 19 5
2 16 32 9
3 1 3 13
sum(diag(tab3))/sum(tab3)
[1] 0.8738095
1-sum(diag(tab3))/sum(tab3)
[1] 0.1261905
attach(test)
confusionMatrix(pred2, NSP, positive = "1")
Confusion Matrix and Statistics
Reference
Prediction 1 2 3
1 322 19 5
2 16 32 9
3 1 3 13
Overall Statistics
Accuracy : 0.8738
95% CI : (0.8382, 0.904)
No Information Rate : 0.8071
P-Value [Acc > NIR] : 0.0001836
Kappa : 0.5994
Mcnemar's Test P-Value : 0.1153765
Statistics by Class:
Class: 1 Class: 2 Class: 3
Sensitivity 0.9499 0.59259 0.48148
Specificity 0.7037 0.93169 0.98982
Pos Pred Value 0.9306 0.56140 0.76471
Neg Pred Value 0.7703 0.93939 0.96526
Prevalence 0.8071 0.12857 0.06429
Detection Rate 0.7667 0.07619 0.03095
Detection Prevalence 0.8238 0.13571 0.04048
Balanced Accuracy 0.8268 0.76214 0.73565
The output above presents the confusion matrix and various statistics for the model on the test data. Let's break down each part of the output in detail:
The confusion matrix is a tabular representation that summarises the model's performance by showing the count of correct and incorrect predictions for each class (category). In this case there are three classes (1, 2 and 3) for the outcome variable. Rows represent the predicted classes, while columns represent the actual (reference) classes. For example, in the top-left cell (Row 1, Column 1) there are 322 instances where the model correctly predicted class 1 when the actual class was also 1. In the top-middle cell (Row 1, Column 2) there are 19 instances where the model incorrectly predicted class 1 when the actual class was 2.
Accuracy: The overall accuracy of the model is 0.8738, or 87.38%. This means that the model correctly predicted the outcome for approximately 87.38% of all instances in the test data set. 95% CI (Confidence Interval): The confidence interval provides a range within which the true accuracy is likely to fall; here it is between 0.8382 and 0.9040. No Information Rate (NIR): NIR is the accuracy achieved by always predicting the majority class, in this case 0.8071. The model's accuracy should be significantly better than the NIR to be considered meaningful. Kappa: Kappa is a statistic that measures the agreement between the model's predictions and the actual values beyond what would be expected by chance; a kappa of 0.5994 indicates moderate agreement.
Mcnemar's Test P-Value: McNemar's test is a statistical test used to determine whether there is a systematic difference between the model's predictions and the actual values. The p-value here (0.1154) does not indicate a significant difference.
Sensitivity: Sensitivity, also known as the True Positive Rate or Recall, measures the proportion of true positive predictions for each class. For Class 1 it is 0.9499, which means the model is very good at correctly identifying Class 1 instances.
Specificity: Specificity measures the proportion of true negatives for each class. For Class 3 it is 0.98982, indicating the model's ability to correctly identify instances that are not Class 3.
Pos Pred Value: Positive Predictive Value is the proportion of true positive predictions out of all positive predictions made by the model. For Class 1 it is 0.9306, suggesting that when the model predicts Class 1 it is correct approximately 93.06% of the time.
Neg Pred Value: Negative Predictive Value is the proportion of true negative predictions out of all negative predictions made by the model. For Class 3 it is 0.96526, indicating high accuracy in predicting that an instance does not belong to Class 3.
Prevalence: Prevalence represents the proportion of instances belonging to each class in the dataset.
Detection Rate: Detection Rate measures how well the model detects instances of each class.
Detection Prevalence: Detection Prevalence is the proportion of instances that the model predicted as each class.
Balanced Accuracy: Balanced Accuracy is the average of sensitivity and specificity and provides a balanced measure of classification performance for each class.
In summary, this output provides a comprehensive assessment of the multinomial logistic regression model’s performance. It includes overall accuracy, class-specific metrics, and additional statistical measures, allowing you to evaluate the model’s strengths and weaknesses for each class and as a whole.
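As a minimal sketch using the tab3 object from the test data above (rows are predictions, columns are actual classes), several of these statistics can be reproduced directly:
n_total              <- sum(tab3)
prevalence           <- colSums(tab3) / n_total   ### share of actual cases per class
detection_rate       <- diag(tab3) / n_total      ### share of all cases correctly detected per class
detection_prevalence <- rowSums(tab3) / n_total   ### share predicted as each class
sensitivity          <- diag(tab3) / colSums(tab3)
round(rbind(prevalence, detection_rate, detection_prevalence, sensitivity), 4)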
mydata4 <- read.csv("Cardiotocographic.csv", header = TRUE)
str(mydata4)
'data.frame': 2126 obs. of 22 variables:
$ LB : int 120 132 133 134 132 134 134 122 122 122 ...
$ AC : num 0 0.00638 0.00332 0.00256 0.00651 ...
$ FM : num 0 0 0 0 0 0 0 0 0 0 ...
$ UC : num 0 0.00638 0.00831 0.00768 0.00814 ...
$ DL : num 0 0.00319 0.00332 0.00256 0 ...
$ DS : num 0 0 0 0 0 0 0 0 0 0 ...
$ DP : num 0 0 0 0 0 ...
$ ASTV : int 73 17 16 16 16 26 29 83 84 86 ...
$ MSTV : num 0.5 2.1 2.1 2.4 2.4 5.9 6.3 0.5 0.5 0.3 ...
$ ALTV : int 43 0 0 0 0 0 0 6 5 6 ...
$ MLTV : num 2.4 10.4 13.4 23 19.9 0 0 15.6 13.6 10.6 ...
$ Width : int 64 130 130 117 117 150 150 68 68 68 ...
$ Min : int 62 68 68 53 53 50 50 62 62 62 ...
$ Max : int 126 198 198 170 170 200 200 130 130 130 ...
$ Nmax : int 2 6 5 11 9 5 6 0 0 1 ...
$ Nzeros : int 0 1 1 0 0 3 3 0 0 0 ...
$ Mode : int 120 141 141 137 137 76 71 122 122 122 ...
$ Mean : int 137 136 135 134 136 107 107 122 122 122 ...
$ Median : int 121 140 138 137 138 107 106 123 123 123 ...
$ Variance: int 73 12 13 13 11 170 215 3 3 1 ...
$ Tendency: int 1 0 0 1 1 0 0 1 1 1 ...
$ NSP : int 2 1 1 1 1 3 3 3 3 3 ...
attach(mydata4)
The loaded data has 2126 observations of 22 variables. This is the CTG (cardiotocography) dataset: it contains fetal heart rate (FHR) and uterine contraction (UC) features measured on cardiotocograms. The 2126 fetal cardiotocograms (CTGs) were automatically processed and their diagnostic features measured; each CTG was then classified by three expert obstetricians and a consensus label assigned as Normal, Suspect or Pathologic. The response variable is NSP (Normal, Suspect, Pathologic) and the remaining 21 variables are predictors. Remember that NSP is stored as an integer, so we have to convert it to a factor.
mydata4$NSP <- as.factor(mydata4$NSP)
str(mydata4)
'data.frame': 2126 obs. of 22 variables:
$ LB : int 120 132 133 134 132 134 134 122 122 122 ...
$ AC : num 0 0.00638 0.00332 0.00256 0.00651 ...
$ FM : num 0 0 0 0 0 0 0 0 0 0 ...
$ UC : num 0 0.00638 0.00831 0.00768 0.00814 ...
$ DL : num 0 0.00319 0.00332 0.00256 0 ...
$ DS : num 0 0 0 0 0 0 0 0 0 0 ...
$ DP : num 0 0 0 0 0 ...
$ ASTV : int 73 17 16 16 16 26 29 83 84 86 ...
$ MSTV : num 0.5 2.1 2.1 2.4 2.4 5.9 6.3 0.5 0.5 0.3 ...
$ ALTV : int 43 0 0 0 0 0 0 6 5 6 ...
$ MLTV : num 2.4 10.4 13.4 23 19.9 0 0 15.6 13.6 10.6 ...
$ Width : int 64 130 130 117 117 150 150 68 68 68 ...
$ Min : int 62 68 68 53 53 50 50 62 62 62 ...
$ Max : int 126 198 198 170 170 200 200 130 130 130 ...
$ Nmax : int 2 6 5 11 9 5 6 0 0 1 ...
$ Nzeros : int 0 1 1 0 0 3 3 0 0 0 ...
$ Mode : int 120 141 141 137 137 76 71 122 122 122 ...
$ Mean : int 137 136 135 134 136 107 107 122 122 122 ...
$ Median : int 121 140 138 137 138 107 106 123 123 123 ...
$ Variance: int 73 12 13 13 11 170 215 3 3 1 ...
$ Tendency: int 1 0 0 1 1 0 0 1 1 1 ...
$ NSP : Factor w/ 3 levels "1","2","3": 2 1 1 1 1 3 3 3 3 3 ...
Now let us look at how many observations are present for each factor (Normal, Suspect, Pathologic)
frq(mydata4, NSP)
NSP <categorical>
# total N=2126 valid N=2126 mean=1.30 sd=0.61
Value | N | Raw % | Valid % | Cum. %
---------------------------------------
1 | 1655 | 77.85 | 77.85 | 77.85
2 | 295 | 13.88 | 13.88 | 91.72
3 | 176 | 8.28 | 8.28 | 100.00
<NA> | 0 | 0.00 | <NA> | <NA>
From the frequency table above, the majority of the observations fall in the normal category (1655), followed by the suspect category (295) and finally the pathologic category (176).
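The frq() call above is assumed to come from the sjmisc package; the same counts and percentages can be cross-checked in base R:
table(mydata4$NSP)                                   # raw counts per class
round(100 * prop.table(table(mydata4$NSP)), 2)       # percentages per class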
We shall start by setting a random seed so that the analysis is reproducible
set.seed(123)
ind <- sample(2, nrow(mydata4), replace = TRUE, prob = c(0.7, 0.3))
train_data <- mydata4[ind ==1,] ## 1495 observations
test_data <- mydata4[ind ==2,] ## 631 observations
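The split above is a simple random one. As an aside (a sketch, not used in what follows), caret’s createDataPartition() can produce a stratified split that keeps the NSP class proportions similar in both sets:
library(caret)
set.seed(123)
idx_strat   <- createDataPartition(mydata4$NSP, p = 0.7, list = FALSE)
train_strat <- mydata4[idx_strat, ]    # roughly 70%, stratified by NSP
test_strat  <- mydata4[-idx_strat, ]   # remaining 30%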
Random forest models are built by aggregating several decision trees. Instead of using one decision tree, a random forest grows hundreds of decision trees and aggregates their results to come up with a final model. Remember, random forests can be used for classification as well as regression: if the response variable is categorical, the algorithm builds a classification model, and if the response variable is continuous, it builds a regression model.
Consider an example below
knitr::include_graphics("random.png")
knitr::include_graphics("random2.png")
According to the information given, the algorithm predicted that the patient belongs to the suspect category. We are going to make use of random forests; however, let us first consider the simpler case of a single decision tree.
library(rpart)    # rpart() fits the decision tree
library(rattle)   # fancyRpartPlot() draws it
modFitA1 <- rpart(NSP ~ ., data = train_data, method = "class")
fancyRpartPlot(modFitA1)
library(randomForest)
set.seed(222)
We will therefore estimate our classification model with NSP as the dependent variable and all the remaining variables as independent variables
rf <- randomForest(NSP~.,data = train_data)
Let us now look at the model using the print function
print(rf)
Call:
randomForest(formula = NSP ~ ., data = train_data)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 4
OOB estimate of error rate: 5.62%
Confusion matrix:
1 2 3 class.error
1 1142 17 1 0.01551724
2 47 164 2 0.23004695
3 6 11 105 0.13934426
The output shows the formula used, with NSP as the DV and all the other variables as IVs. The data used by this algorithm is the training data. The random forest is a classification forest because the response variable is a factor. The default number of trees is 500. The mtry value, the number of variables tried at each split, is 4. The out-of-bag (OOB) estimate of the error rate is approximately 5.62%, so we have about 94.38% accuracy, which is quite okay. The error when predicting Normal patients is very low compared to the error when predicting Suspect or Pathologic patients.
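As a quick cross-check (a sketch using the fitted object), the headline OOB figure can be pulled directly from rf: the last row of rf$err.rate holds the error rates after all 500 trees.
oob_error <- rf$err.rate[nrow(rf$err.rate), "OOB"]   # OOB error after 500 trees
1 - oob_error                                        # overall OOB accuracy (about 0.94)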
So what attributes does this model have? We can check using the command below;
attributes(rf)
$names
[1] "call" "type" "predicted" "err.rate"
[5] "confusion" "votes" "oob.times" "classes"
[9] "importance" "importanceSD" "localImportance" "proximity"
[13] "ntree" "mtry" "forest" "y"
[17] "test" "inbag" "terms"
$class
[1] "randomForest.formula" "randomForest"
The model contains all the attributes listed above, for example, if we need the confusion matrix only, we can use the command below
rf$confusion
1 2 3 class.error
1 1142 17 1 0.01551724
2 47 164 2 0.23004695
3 6 11 105 0.13934426
We will make predictions using the caret package
library(caret)
We can view the prediction from the model vs the real data
p1 <- predict(rf, train_data)
head(p1)
1 3 6 7 9 10
2 1 3 3 3 3
Levels: 1 2 3
Observations from the real data
head(train_data$NSP)
[1] 2 1 3 3 3 3
Levels: 1 2 3
The first six predictions are all accurate, that is, 100% accuracy for these six observations
confusionMatrix(p1, train_data$NSP)
Confusion Matrix and Statistics
Reference
Prediction 1 2 3
1 1160 2 0
2 0 211 0
3 0 0 122
Overall Statistics
Accuracy : 0.9987
95% CI : (0.9952, 0.9998)
No Information Rate : 0.7759
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9964
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 1 Class: 2 Class: 3
Sensitivity 1.0000 0.9906 1.00000
Specificity 0.9940 1.0000 1.00000
Pos Pred Value 0.9983 1.0000 1.00000
Neg Pred Value 1.0000 0.9984 1.00000
Prevalence 0.7759 0.1425 0.08161
Detection Rate 0.7759 0.1411 0.08161
Detection Prevalence 0.7773 0.1411 0.08161
Balanced Accuracy 0.9970 0.9953 1.00000
From the confusion matrix, the model predicted that 1160 patients belong to group 1, and they were indeed observed to belong to group 1. The model predicted 211 patients in group 2 who were observed in group 2, and 122 patients in group 3 who were observed in group 3. However, two patients were misclassified into group 1 while they were observed to belong to group 2.
The sensitivity statistics tell us how often each class was correctly classified. In other words, sensitivity for class 1 is 1.00, that is, patients in class 1 were correctly classified as class 1 100% of the time.
For each bootstrap iteration and its corresponding tree, the prediction error is estimated using the data not in the bootstrap sample (the out-of-bag, or OOB, data): * Classification: accuracy * Regression: R-squared & RMSE
Now that all the data points in the training set have been seen by the model, we can try predicting on the test data, which the model has not seen. We can view the predictions from the model versus the real data
p2 <- predict(rf, test_data)
head(p2)
2 4 5 8 11 16
1 1 1 3 2 1
Levels: 1 2 3
Observations from the real data
head(test_data$NSP)
[1] 1 1 1 3 2 1
Levels: 1 2 3
confusionMatrix(p2, test_data$NSP)
Confusion Matrix and Statistics
Reference
Prediction 1 2 3
1 482 17 4
2 11 61 4
3 2 4 46
Overall Statistics
Accuracy : 0.9334
95% CI : (0.9111, 0.9516)
No Information Rate : 0.7845
P-Value [Acc > NIR] : <2e-16
Kappa : 0.8109
Mcnemar's Test P-Value : 0.5823
Statistics by Class:
Class: 1 Class: 2 Class: 3
Sensitivity 0.9737 0.74390 0.85185
Specificity 0.8456 0.97268 0.98960
Pos Pred Value 0.9583 0.80263 0.88462
Neg Pred Value 0.8984 0.96216 0.98618
Prevalence 0.7845 0.12995 0.08558
Detection Rate 0.7639 0.09667 0.07290
Detection Prevalence 0.7971 0.12044 0.08241
Balanced Accuracy 0.9097 0.85829 0.92073
The test data has not been seen by the random forest model, and the accuracy has come down to 93.34%. The 95% CI is still good, ranging between 91% and 95%. There is slightly more misclassification on the test data than on the training data. Besides, sensitivity for predicting class 1 is still higher than for classes 2 and 3.
We will make a plot using the command below to see the error rate in our model
plot(rf, main ="Error Rate of the Random Forest Model")
As the number of trees grows, the out-of-bag error comes down and then remains roughly constant. From the plot, the error does not improve after about 300 trees.
t<- tuneRF(train_data[,-22], train_data[,22],
stepFactor = 0.5,
plot = TRUE,
ntreeTry = 300,
trace = TRUE,
improve = 0.05)  ## train_data[,-22] selects all columns except column 22 (NSP, the response); train_data[,22] is NSP itself
mtry = 4 OOB error = 5.55%
Searching left ...
mtry = 8 OOB error = 5.22%
0.06024096 0.05
mtry = 16 OOB error = 5.55%
-0.06410256 0.05
Searching right ...
mtry = 2 OOB error = 6.82%
-0.3076923 0.05
The out-of-bag error is high when mtry = 2, comes down when mtry increases to 4, comes down further when mtry = 8, and rises again when mtry = 16. This gives us an idea of which mtry value we should choose. We can now go back to our random forest model and change a few things to tune it
rf2 <- randomForest(NSP~.,data = train_data,
ntree = 300,
mtry = 8,
importance = TRUE)
print(rf2)
Call:
randomForest(formula = NSP ~ ., data = train_data, ntree = 300, mtry = 8, importance = TRUE)
Type of random forest: classification
Number of trees: 300
No. of variables tried at each split: 8
OOB estimate of error rate: 5.02%
Confusion matrix:
1 2 3 class.error
1 1138 16 6 0.01896552
2 41 171 1 0.19718310
3 7 4 111 0.09016393
Earlier, the out-of-bag error was 5.62%; in this model the OOB error has come down to 5.02%. Besides, the classification errors for classes 2 and 3 have improved compared to the previous model. However, the classification error for class 1 got slightly worse. Let us check the accuracy level of our model
p_1 <- predict(rf2, train_data)
head(p_1)
1 3 6 7 9 10
2 1 3 3 3 3
Levels: 1 2 3
head(train_data$NSP)
[1] 2 1 3 3 3 3
Levels: 1 2 3
confusionMatrix(p_1, train_data$NSP)
Confusion Matrix and Statistics
Reference
Prediction 1 2 3
1 1159 1 0
2 1 212 0
3 0 0 122
Overall Statistics
Accuracy : 0.9987
95% CI : (0.9952, 0.9998)
No Information Rate : 0.7759
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9964
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 1 Class: 2 Class: 3
Sensitivity 0.9991 0.9953 1.00000
Specificity 0.9970 0.9992 1.00000
Pos Pred Value 0.9991 0.9953 1.00000
Neg Pred Value 0.9970 0.9992 1.00000
Prevalence 0.7759 0.1425 0.08161
Detection Rate 0.7753 0.1418 0.08161
Detection Prevalence 0.7759 0.1425 0.08161
Balanced Accuracy 0.9981 0.9973 1.00000
The real test to see how our model is performing will be based on the test data as shown below
p_2 <- predict(rf2, test_data)
head(p_2)
2 4 5 8 11 16
1 1 1 3 2 1
Levels: 1 2 3
head(test_data$NSP)
[1] 1 1 1 3 2 1
Levels: 1 2 3
confusionMatrix(p_2, test_data$NSP)
Confusion Matrix and Statistics
Reference
Prediction 1 2 3
1 482 16 3
2 11 61 2
3 2 5 49
Overall Statistics
Accuracy : 0.9382
95% CI : (0.9165, 0.9557)
No Information Rate : 0.7845
P-Value [Acc > NIR] : <2e-16
Kappa : 0.8256
Mcnemar's Test P-Value : 0.4915
Statistics by Class:
Class: 1 Class: 2 Class: 3
Sensitivity 0.9737 0.74390 0.90741
Specificity 0.8603 0.97632 0.98787
Pos Pred Value 0.9621 0.82432 0.87500
Neg Pred Value 0.9000 0.96230 0.99130
Prevalence 0.7845 0.12995 0.08558
Detection Rate 0.7639 0.09667 0.07765
Detection Prevalence 0.7940 0.11727 0.08875
Balanced Accuracy 0.9170 0.86011 0.94764
Not much has changed in our model, but there are some improvements
Remember we have 300 trees in this model
hist(treesize(rf2),
main = "Number of Nodes for the Trees",
col = "blue")
The histogram shows the distribution of the number of nodes in each of those 300 trees. From the histogram above, there are about 80 trees with approximately 80 nodes each. We also have a few trees with around sixty nodes and a few trees with more than 100 nodes. The majority of trees have around 80 nodes.
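A numeric summary (a one-line sketch) of the same tree sizes gives the exact range and median behind the histogram:
summary(treesize(rf2))   # five-number summary of terminal-node counts per tree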
We can also find which variable played an important role in the model using the command below
varImpPlot(rf2, main = "Variable Importance in our Random Forest Model")
The chart shows how much worse the model would perform if each variable were removed. In other words, the first chart shows the mean decrease in accuracy when a variable is removed. Some variables contribute more to the model than others: for example, the DS variable contributes almost nothing, while variables such as ASTV and ALTV contribute a lot. The second chart measures how pure the nodes are at the ends of the trees with and without each variable. From that chart, the first four variables stand out as the most important predictors, with the largest mean decrease in Gini.
Let us view the top ten variables
varImpPlot(rf2,
sort = TRUE,
n.var = 10,
main = "Variable Importance in our Random Forest Model: Top 10")
We can also get quantitative values for the variable importance.
rf2$importance
1 2 3 MeanDecreaseAccuracy
LB 1.553953e-02 0.0258200214 1.071221e-02 1.657837e-02
AC 2.943162e-02 0.0871365675 4.320153e-02 3.856144e-02
FM 2.486324e-03 0.0114239468 3.655310e-03 3.824810e-03
UC 5.177781e-03 0.0477066403 3.462117e-02 1.362509e-02
DL 1.390607e-03 0.0011991263 1.702274e-02 2.630076e-03
DS 7.698229e-06 0.0000000000 0.000000e+00 6.006006e-06
DP 1.127090e-02 0.0061564706 5.551498e-02 1.410008e-02
ASTV 2.836333e-02 0.2595434435 1.826003e-01 7.354497e-02
MSTV 2.988848e-02 0.1897893962 2.050543e-01 6.687412e-02
ALTV 3.806223e-02 0.1688399845 1.836021e-01 6.840553e-02
MLTV 6.206474e-03 0.0319690528 3.022055e-02 1.179525e-02
Width 1.034185e-02 0.0101418426 3.082483e-02 1.190649e-02
Min 9.490114e-03 0.0115299398 3.972243e-02 1.219832e-02
Max 1.356609e-02 0.0105077479 8.713479e-03 1.272368e-02
Nmax 3.965936e-03 0.0029536825 1.343104e-02 4.631204e-03
Nzeros 7.601632e-04 0.0015629388 -1.894073e-05 8.093686e-04
Mode 2.117570e-02 0.0233126568 3.929863e-02 2.287848e-02
Mean 3.464728e-02 0.0451425315 2.300968e-01 5.206579e-02
Median 2.396184e-02 0.0295561776 5.979045e-02 2.760608e-02
Variance 9.396708e-03 0.0087543808 2.225253e-02 1.037441e-02
Tendency 6.445705e-04 0.0004500091 2.162051e-03 7.167475e-04
MeanDecreaseGini
LB 17.54785188
AC 25.78439515
FM 7.89618149
UC 19.55790339
DL 3.77273295
DS 0.06106956
DP 28.40131621
ASTV 88.66220697
MSTV 85.09145239
ALTV 77.94261740
MLTV 22.57844113
Width 13.62028859
Min 12.04096306
Max 14.62494129
Nmax 9.66268459
Nzeros 2.56566805
Mode 23.96025039
Mean 62.67833384
Median 25.72019197
Variance 10.03170156
Tendency 2.46446976
Let us now see how often each predictor variable is actually used in the random forest
varUsed(rf2)
[1] 1299 996 897 1605 375 5 724 2156 1278 2240 1553 1247 1143 1352 982
[16] 327 1227 1504 1296 881 322
The results above show how many times each variable was used in the model, in the same order as the predictor columns. DS was used only 5 times in the model.
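To make the vector easier to read, it can be paired with the predictor names (a small sketch, assuming NSP is still the 22nd column of train_data):
data.frame(variable   = names(train_data)[-22],   # predictor names, NSP dropped
           times_used = varUsed(rf2))             # counts in the same order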
Partial dependence plots give a good graphical depiction of the marginal effect of a variable on the class probability (classification) or the response (regression)
partialPlot(rf2, train_data, ASTV, "1")
When ASTV is less than about 60, the model is more likely to predict class 1 than when ASTV is above 60. Similarly, we can look at the plot for class 3
partialPlot(rf2, train_data, ASTV, "3")
From the plot, when ASTV is more than 60 the model predicts class 3 more often than when ASTV is less than 60. From the previous results we saw that misclassification was generally highest for class 2. Let’s make the ASTV plot for class 2
partialPlot(rf2, train_data, ASTV, "2")
We can see there is more confusion with class 2.
Let us get the first tree
getTree(rf2, 1, labelVar = TRUE)
left daughter right daughter split var split point status prediction
1 2 3 ALTV 7.500000e+00 1 <NA>
2 4 5 Mean 1.080000e+02 1 <NA>
3 6 7 MSTV 4.500000e-01 1 <NA>
4 8 9 DP 5.181345e-04 1 <NA>
5 10 11 ASTV 7.400000e+01 1 <NA>
6 12 13 ALTV 6.850000e+01 1 <NA>
7 14 15 Max 1.850000e+02 1 <NA>
8 16 17 Max 1.775000e+02 1 <NA>
9 18 19 Width 7.250000e+01 1 <NA>
10 20 21 DP 1.489997e-03 1 <NA>
11 22 23 Median 1.520000e+02 1 <NA>
12 24 25 Min 1.255000e+02 1 <NA>
13 0 0 <NA> 0.000000e+00 -1 3
14 26 27 UC 3.834495e-03 1 <NA>
15 28 29 ASTV 5.050000e+01 1 <NA>
16 0 0 <NA> 0.000000e+00 -1 3
17 0 0 <NA> 0.000000e+00 -1 1
18 0 0 <NA> 0.000000e+00 -1 2
19 0 0 <NA> 0.000000e+00 -1 3
20 30 31 Mode 1.040000e+02 1 <NA>
21 32 33 Mode 1.145000e+02 1 <NA>
22 34 35 MSTV 5.500000e-01 1 <NA>
23 36 37 Mode 1.530000e+02 1 <NA>
24 38 39 UC 5.547580e-03 1 <NA>
25 40 41 Max 1.425000e+02 1 <NA>
26 42 43 LB 1.365000e+02 1 <NA>
27 44 45 Mean 1.585000e+02 1 <NA>
28 0 0 <NA> 0.000000e+00 -1 1
29 46 47 MSTV 9.500000e-01 1 <NA>
30 48 49 DP 8.785115e-04 1 <NA>
31 50 51 LB 1.425000e+02 1 <NA>
32 52 53 Min 6.900000e+01 1 <NA>
33 54 55 ASTV 6.350000e+01 1 <NA>
34 0 0 <NA> 0.000000e+00 -1 3
35 0 0 <NA> 0.000000e+00 -1 2
36 0 0 <NA> 0.000000e+00 -1 1
37 0 0 <NA> 0.000000e+00 -1 2
38 56 57 Min 1.235000e+02 1 <NA>
39 58 59 Nmax 5.000000e-01 1 <NA>
40 0 0 <NA> 0.000000e+00 -1 3
41 60 61 ASTV 8.000000e+01 1 <NA>
42 62 63 Width 1.025000e+02 1 <NA>
43 64 65 MSTV 5.500000e-01 1 <NA>
44 66 67 Width 1.195000e+02 1 <NA>
45 0 0 <NA> 0.000000e+00 -1 2
46 68 69 Width 1.320000e+02 1 <NA>
47 0 0 <NA> 0.000000e+00 -1 2
48 0 0 <NA> 0.000000e+00 -1 2
49 0 0 <NA> 0.000000e+00 -1 1
50 70 71 MSTV 4.500000e-01 1 <NA>
51 72 73 ASTV 4.200000e+01 1 <NA>
52 0 0 <NA> 0.000000e+00 -1 2
53 0 0 <NA> 0.000000e+00 -1 1
54 74 75 FM 1.870177e-01 1 <NA>
55 0 0 <NA> 0.000000e+00 -1 3
56 76 77 FM 2.219280e-03 1 <NA>
57 0 0 <NA> 0.000000e+00 -1 1
58 0 0 <NA> 0.000000e+00 -1 2
59 0 0 <NA> 0.000000e+00 -1 1
60 78 79 MLTV 9.100000e+00 1 <NA>
61 0 0 <NA> 0.000000e+00 -1 3
62 80 81 Median 1.415000e+02 1 <NA>
63 0 0 <NA> 0.000000e+00 -1 2
64 82 83 LB 1.550000e+02 1 <NA>
65 84 85 AC 4.170140e-04 1 <NA>
66 86 87 FM 9.012281e-03 1 <NA>
67 0 0 <NA> 0.000000e+00 -1 2
68 0 0 <NA> 0.000000e+00 -1 2
69 0 0 <NA> 0.000000e+00 -1 1
70 88 89 Width 2.200000e+01 1 <NA>
71 90 91 Width 1.750000e+01 1 <NA>
72 0 0 <NA> 0.000000e+00 -1 1
73 92 93 Width 2.350000e+01 1 <NA>
74 0 0 <NA> 0.000000e+00 -1 1
75 94 95 MSTV 1.850000e+00 1 <NA>
76 96 97 FM 1.703756e-03 1 <NA>
77 0 0 <NA> 0.000000e+00 -1 2
78 98 99 UC 8.173059e-03 1 <NA>
79 0 0 <NA> 0.000000e+00 -1 1
80 0 0 <NA> 0.000000e+00 -1 1
81 100 101 Width 5.100000e+01 1 <NA>
82 102 103 Median 1.595000e+02 1 <NA>
83 0 0 <NA> 0.000000e+00 -1 1
84 104 105 LB 1.455000e+02 1 <NA>
85 106 107 UC 3.772002e-03 1 <NA>
86 108 109 Nmax 6.500000e+00 1 <NA>
87 0 0 <NA> 0.000000e+00 -1 2
88 0 0 <NA> 0.000000e+00 -1 2
89 0 0 <NA> 0.000000e+00 -1 1
90 110 111 ASTV 5.600000e+01 1 <NA>
91 112 113 DP 8.138395e-04 1 <NA>
92 114 115 Tendency 5.000000e-01 1 <NA>
93 116 117 MLTV 8.450000e+00 1 <NA>
94 0 0 <NA> 0.000000e+00 -1 1
95 118 119 Variance 4.650000e+01 1 <NA>
96 120 121 ALTV 1.500000e+01 1 <NA>
97 0 0 <NA> 0.000000e+00 -1 3
98 122 123 Mode 1.355000e+02 1 <NA>
99 0 0 <NA> 0.000000e+00 -1 1
100 0 0 <NA> 0.000000e+00 -1 1
101 0 0 <NA> 0.000000e+00 -1 2
102 0 0 <NA> 0.000000e+00 -1 2
103 0 0 <NA> 0.000000e+00 -1 1
104 124 125 ASTV 4.350000e+01 1 <NA>
105 0 0 <NA> 0.000000e+00 -1 1
106 0 0 <NA> 0.000000e+00 -1 1
107 0 0 <NA> 0.000000e+00 -1 2
108 126 127 Mode 1.620000e+02 1 <NA>
109 128 129 Variance 6.000000e+00 1 <NA>
110 0 0 <NA> 0.000000e+00 -1 1
111 0 0 <NA> 0.000000e+00 -1 2
112 130 131 Median 1.475000e+02 1 <NA>
113 132 133 FM 2.242762e-01 1 <NA>
114 0 0 <NA> 0.000000e+00 -1 2
115 0 0 <NA> 0.000000e+00 -1 1
116 0 0 <NA> 0.000000e+00 -1 1
117 134 135 Min 1.345000e+02 1 <NA>
118 0 0 <NA> 0.000000e+00 -1 2
119 0 0 <NA> 0.000000e+00 -1 3
120 136 137 Max 1.500000e+02 1 <NA>
121 0 0 <NA> 0.000000e+00 -1 2
122 138 139 Nmax 1.500000e+00 1 <NA>
123 140 141 ASTV 5.850000e+01 1 <NA>
124 0 0 <NA> 0.000000e+00 -1 1
125 142 143 ALTV 9.500000e+00 1 <NA>
126 144 145 ALTV 6.050000e+01 1 <NA>
127 0 0 <NA> 0.000000e+00 -1 2
128 146 147 Min 1.015000e+02 1 <NA>
129 0 0 <NA> 0.000000e+00 -1 1
130 0 0 <NA> 0.000000e+00 -1 1
131 148 149 MLTV 9.350000e+00 1 <NA>
132 0 0 <NA> 0.000000e+00 -1 1
133 150 151 Mode 1.380000e+02 1 <NA>
134 152 153 ASTV 4.350000e+01 1 <NA>
135 154 155 MSTV 7.500000e-01 1 <NA>
136 0 0 <NA> 0.000000e+00 -1 3
137 0 0 <NA> 0.000000e+00 -1 1
138 0 0 <NA> 0.000000e+00 -1 2
139 0 0 <NA> 0.000000e+00 -1 3
140 0 0 <NA> 0.000000e+00 -1 1
141 0 0 <NA> 0.000000e+00 -1 2
142 0 0 <NA> 0.000000e+00 -1 1
143 156 157 Max 1.470000e+02 1 <NA>
144 0 0 <NA> 0.000000e+00 -1 1
145 158 159 Min 1.160000e+02 1 <NA>
146 0 0 <NA> 0.000000e+00 -1 1
147 0 0 <NA> 0.000000e+00 -1 2
148 0 0 <NA> 0.000000e+00 -1 1
149 160 161 AC 1.562341e-03 1 <NA>
150 0 0 <NA> 0.000000e+00 -1 1
151 0 0 <NA> 0.000000e+00 -1 2
152 162 163 Min 8.950000e+01 1 <NA>
153 164 165 DL 3.508204e-03 1 <NA>
154 0 0 <NA> 0.000000e+00 -1 1
155 0 0 <NA> 0.000000e+00 -1 2
156 0 0 <NA> 0.000000e+00 -1 1
157 0 0 <NA> 0.000000e+00 -1 2
158 0 0 <NA> 0.000000e+00 -1 2
159 0 0 <NA> 0.000000e+00 -1 1
160 166 167 Min 1.225000e+02 1 <NA>
161 0 0 <NA> 0.000000e+00 -1 1
162 0 0 <NA> 0.000000e+00 -1 1
163 0 0 <NA> 0.000000e+00 -1 2
164 168 169 ALTV 6.500000e+00 1 <NA>
165 0 0 <NA> 0.000000e+00 -1 2
166 0 0 <NA> 0.000000e+00 -1 2
167 0 0 <NA> 0.000000e+00 -1 1
168 0 0 <NA> 0.000000e+00 -1 1
169 0 0 <NA> 0.000000e+00 -1 2
From the results, whenever the status is -1 it means the node is a terminal node, and the prediction at that terminal node is the patient’s NSP value; for example, a prediction of 2 means the patient is classified as suspect. Similarly, other terminal nodes (status -1) predict NSP values of 1 or 3.
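As a compact summary (a sketch using the same getTree() output), the terminal nodes of this first tree can be tabulated by their predicted class:
tree1 <- getTree(rf2, k = 1, labelVar = TRUE)
table(tree1$prediction[tree1$status == -1])   # number of terminal nodes per predicted class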
patients <- read.csv("patients.csv")
patients$NSP <- as.factor(patients$NSP)
str(patients)
'data.frame': 6 obs. of 22 variables:
$ LB : int 134 122 122 122 151 131
$ AC : num 0.0014 0 0 0 0 ...
$ FM : num 0 0 0 0 0 ...
$ UC : num 0.012623 0 0.001517 0.002967 0.000834 ...
$ DL : num 0.008415 0 0 0 0.000834 ...
$ DS : int 0 0 0 0 0 0
$ DP : num 0.00281 0 0 0 0 ...
$ ASTV : int 29 83 84 86 64 28
$ MSTV : num 6.3 0.5 0.5 0.3 1.9 1.5
$ ALTV : int 0 6 5 6 9 0
$ MLTV : num 0 15.6 13.6 10.6 27.6 5.4
$ Width : int 150 68 68 68 130 87
$ Min : int 50 62 62 62 56 71
$ Max : int 200 130 130 130 186 158
$ Nmax : int 6 0 0 1 2 2
$ Nzeros : int 3 0 0 0 0 0
$ Mode : int 71 122 122 122 150 141
$ Mean : int 107 122 122 122 148 137
$ Median : int 106 123 123 123 151 141
$ Variance: int 215 3 3 1 9 10
$ Tendency: int 0 1 1 1 1 1
$ NSP : Factor w/ 3 levels "1","2","3": 3 3 3 3 2 1
head(patients,5)
LB AC FM UC DL DS DP ASTV MSTV ALTV MLTV
1 134 0.001402525 0 0.012622721 0.008415147 0 0.002805049 29 6.3 0 0.0
2 122 0.000000000 0 0.000000000 0.000000000 0 0.000000000 83 0.5 6 15.6
3 122 0.000000000 0 0.001517451 0.000000000 0 0.000000000 84 0.5 5 13.6
4 122 0.000000000 0 0.002967359 0.000000000 0 0.000000000 86 0.3 6 10.6
5 151 0.000000000 0 0.000834028 0.000834028 0 0.000000000 64 1.9 9 27.6
Width Min Max Nmax Nzeros Mode Mean Median Variance Tendency NSP
1 150 50 200 6 3 71 107 106 215 0 3
2 68 62 130 0 0 122 122 123 3 1 3
3 68 62 130 0 0 122 122 123 3 1 3
4 68 62 130 1 0 122 122 123 1 1 3
5 130 56 186 2 0 150 148 151 9 1 2
# Use the trained random forest model to classify the patient
predicted_class <- predict(rf2, patients)
# The 'predicted_class' variable now contains the predicted class (1, 2, or 3)
# You can print or use it as needed
print(predicted_class)
1 2 3 4 5 6
3 3 3 3 2 1
Levels: 1 2 3
### Actual data
head(patients$NSP)
[1] 3 3 3 3 2 1
Levels: 1 2 3
confusionMatrix(predicted_class, patients$NSP)
Confusion Matrix and Statistics
Reference
Prediction 1 2 3
1 1 0 0
2 0 1 0
3 0 0 4
Overall Statistics
Accuracy : 1
95% CI : (0.5407, 1)
No Information Rate : 0.6667
P-Value [Acc > NIR] : 0.08779
Kappa : 1
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 1 Class: 2 Class: 3
Sensitivity 1.0000 1.0000 1.0000
Specificity 1.0000 1.0000 1.0000
Pos Pred Value 1.0000 1.0000 1.0000
Neg Pred Value 1.0000 1.0000 1.0000
Prevalence 0.1667 0.1667 0.6667
Detection Rate 0.1667 0.1667 0.6667
Detection Prevalence 0.1667 0.1667 0.6667
Balanced Accuracy 1.0000 1.0000 1.0000
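If needed, the class membership probabilities behind these predictions can also be inspected (a one-line sketch using the same fitted forest):
predict(rf2, patients, type = "prob")   # per-class probabilities for the six patients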
str(mydata4)
'data.frame': 2126 obs. of 22 variables:
$ LB : int 120 132 133 134 132 134 134 122 122 122 ...
$ AC : num 0 0.00638 0.00332 0.00256 0.00651 ...
$ FM : num 0 0 0 0 0 0 0 0 0 0 ...
$ UC : num 0 0.00638 0.00831 0.00768 0.00814 ...
$ DL : num 0 0.00319 0.00332 0.00256 0 ...
$ DS : num 0 0 0 0 0 0 0 0 0 0 ...
$ DP : num 0 0 0 0 0 ...
$ ASTV : int 73 17 16 16 16 26 29 83 84 86 ...
$ MSTV : num 0.5 2.1 2.1 2.4 2.4 5.9 6.3 0.5 0.5 0.3 ...
$ ALTV : int 43 0 0 0 0 0 0 6 5 6 ...
$ MLTV : num 2.4 10.4 13.4 23 19.9 0 0 15.6 13.6 10.6 ...
$ Width : int 64 130 130 117 117 150 150 68 68 68 ...
$ Min : int 62 68 68 53 53 50 50 62 62 62 ...
$ Max : int 126 198 198 170 170 200 200 130 130 130 ...
$ Nmax : int 2 6 5 11 9 5 6 0 0 1 ...
$ Nzeros : int 0 1 1 0 0 3 3 0 0 0 ...
$ Mode : int 120 141 141 137 137 76 71 122 122 122 ...
$ Mean : int 137 136 135 134 136 107 107 122 122 122 ...
$ Median : int 121 140 138 137 138 107 106 123 123 123 ...
$ Variance: int 73 12 13 13 11 170 215 3 3 1 ...
$ Tendency: int 1 0 0 1 1 0 0 1 1 1 ...
$ NSP : Factor w/ 3 levels "1","2","3": 2 1 1 1 1 3 3 3 3 3 ...
par(mfrow=c(1,2))
boxplot(LB ~ NSP)
boxplot(ASTV ~ NSP)
boxplot(Width ~ NSP)
boxplot(Median ~ NSP)
par(mfrow=c(1,1))
pairs( cbind(AC, ASTV, MSTV, Width, Min, Max), pch=19, lower.panel=NULL, cex=.5)
train_model <- trainControl(method = "repeatedcv", number = 5, repeats=10)
model.cart <- train( NSP ~.,
data = mydata4,
method = "rpart",
trControl = train_model)
model.cart
CART
2126 samples
21 predictor
3 classes: '1', '2', '3'
No pre-processing
Resampling: Cross-Validated (5 fold, repeated 10 times)
Summary of sample sizes: 1701, 1701, 1701, 1701, 1700, 1701, ...
Resampling results across tuning parameters:
cp Accuracy Kappa
0.08917197 0.8742240 0.6550994
0.17622081 0.8401331 0.5268462
0.20806794 0.8002354 0.2604269
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.08917197.
model.cart$finalModel
n= 2126
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 2126 471 1 (0.778457197 0.138758231 0.082784572)
2) MSTV>=0.55 1754 202 1 (0.884834664 0.053591790 0.061573546)
4) Mean>=107.5 1649 107 1 (0.935112189 0.055791389 0.009096422) *
5) Mean< 107.5 105 12 3 (0.095238095 0.019047619 0.885714286) *
3) MSTV< 0.55 372 171 2 (0.276881720 0.540322581 0.182795699) *
confusionMatrix(predict(model.cart, mydata4),
reference=mydata4$NSP, positive="1")
Confusion Matrix and Statistics
Reference
Prediction 1 2 3
1 1542 92 15
2 103 201 68
3 10 2 93
Overall Statistics
Accuracy : 0.8636
95% CI : (0.8483, 0.8779)
No Information Rate : 0.7785
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.6292
Mcnemar's Test P-Value : 8.841e-14
Statistics by Class:
Class: 1 Class: 2 Class: 3
Sensitivity 0.9317 0.68136 0.52841
Specificity 0.7728 0.90661 0.99385
Pos Pred Value 0.9351 0.54032 0.88571
Neg Pred Value 0.7631 0.94641 0.95893
Prevalence 0.7785 0.13876 0.08278
Detection Rate 0.7253 0.09454 0.04374
Detection Prevalence 0.7756 0.17498 0.04939
Balanced Accuracy 0.8523 0.79398 0.76113
fancyRpartPlot(model.cart$finalModel)
model.rf <- train(
NSP ~.,
data = mydata4,
method = "rf",
trControl = train_model)
model.rf
Random Forest
2126 samples
21 predictor
3 classes: '1', '2', '3'
No pre-processing
Resampling: Cross-Validated (5 fold, repeated 10 times)
Summary of sample sizes: 1701, 1700, 1701, 1701, 1701, 1701, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 0.9318432 0.8044846
11 0.9442125 0.8433624
21 0.9394626 0.8303969
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 11.
summary(model.rf$finalModel)
Length Class Mode
call 4 -none- call
type 1 -none- character
predicted 2126 factor numeric
err.rate 2000 -none- numeric
confusion 12 -none- numeric
votes 6378 matrix numeric
oob.times 2126 -none- numeric
classes 3 -none- character
importance 21 -none- numeric
importanceSD 0 -none- NULL
localImportance 0 -none- NULL
proximity 0 -none- NULL
ntree 1 -none- numeric
mtry 1 -none- numeric
forest 14 -none- list
y 2126 factor numeric
test 0 -none- NULL
inbag 0 -none- NULL
xNames 21 -none- character
problemType 1 -none- character
tuneValue 1 data.frame list
obsLevels 3 -none- character
param 0 -none- list
model.rf$finalModel
Call:
randomForest(x = x, y = y, mtry = param$mtry)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 11
OOB estimate of error rate: 5.08%
Confusion matrix:
1 2 3 class.error
1 1624 23 8 0.01873112
2 57 232 6 0.21355932
3 8 6 162 0.07954545
plot(model.rf$finalModel)
varImp(model.rf$finalModel)
Overall
LB 22.9384967
AC 31.5312698
FM 12.6508906
UC 29.4776634
DL 5.1060651
DS 0.5361335
DP 44.8461939
ASTV 124.7502475
MSTV 137.3937671
ALTV 105.3323077
MLTV 25.7922601
Width 17.2978347
Min 17.1618847
Max 18.6372755
Nmax 14.0623581
Nzeros 2.8341471
Mode 33.9885301
Mean 95.8131455
Median 25.7409495
Variance 11.3537206
Tendency 3.5667884
plot( varImp(model.rf) )
yhat = 1-predict(model.rf$finalModel, type="prob")[,1]
plot(mydata4$ASTV, yhat)
scatter.smooth(mydata4$MSTV, yhat, span=.4)
scatter.smooth(mydata4$ASTV, yhat, span=.4)
scatter.smooth(mydata4$ALTV, yhat, span=.4)
scatter.smooth(mydata4$Mean, yhat, span=.4)
scatter.smooth(mydata4$DP, yhat, span=.4)
scatter.smooth(mydata4$Mode, yhat, span=.4)
scatter.smooth(mydata4$AC, yhat, span=.4)
scatter.smooth(mydata4$UC, yhat, span=.4)
boxplot(yhat ~ Tendency)
boxplot(yhat ~ mydata4$NSP)
confusionMatrix(predict(model.rf, mydata4),
reference=mydata4$NSP, positive="1")
Confusion Matrix and Statistics
Reference
Prediction 1 2 3
1 1655 2 0
2 0 293 0
3 0 0 176
Overall Statistics
Accuracy : 0.9991
95% CI : (0.9966, 0.9999)
No Information Rate : 0.7785
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9974
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 1 Class: 2 Class: 3
Sensitivity 1.0000 0.9932 1.00000
Specificity 0.9958 1.0000 1.00000
Pos Pred Value 0.9988 1.0000 1.00000
Neg Pred Value 1.0000 0.9989 1.00000
Prevalence 0.7785 0.1388 0.08278
Detection Rate 0.7785 0.1378 0.08278
Detection Prevalence 0.7794 0.1378 0.08278
Balanced Accuracy 0.9979 0.9966 1.00000
heart <- read.csv("heart_failure_clinical_records_dataset.csv")
head(heart,5)
time age anaemia creatinine_phosphokinase diabetes ejection_fraction
1 4 75 0 582 0 20
2 6 55 0 7861 0 38
3 7 65 0 146 0 20
4 7 50 1 111 0 20
5 8 65 1 160 1 20
high_blood_pressure platelets serum_creatinine serum_sodium sex smoking
1 1 265000 1.9 130 1 0
2 0 263358 1.1 136 1 0
3 0 162000 1.3 129 1 1
4 0 210000 1.9 137 1 0
5 0 327000 2.7 116 0 0
heart.failure
1 1
2 1
3 1
4 1
5 1
heart$anaemia <- as.factor(heart$anaemia)
heart$diabetes <- as.factor(heart$diabetes)
heart$high_blood_pressure <- as.factor(heart$high_blood_pressure)
heart$sex <- as.factor(heart$sex)
heart$smoking <- as.factor(heart$smoking)
heart$heart.failure <- as.factor(heart$heart.failure)
str(heart)
'data.frame': 299 obs. of 13 variables:
$ time : int 4 6 7 7 8 8 10 10 10 10 ...
$ age : num 75 55 65 50 65 90 75 60 65 80 ...
$ anaemia : Factor w/ 2 levels "0","1": 1 1 1 2 2 2 2 2 1 2 ...
$ creatinine_phosphokinase: int 582 7861 146 111 160 47 246 315 157 123 ...
$ diabetes : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 2 1 1 ...
$ ejection_fraction : int 20 38 20 20 20 40 15 60 65 35 ...
$ high_blood_pressure : Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 1 1 2 ...
$ platelets : num 265000 263358 162000 210000 327000 ...
$ serum_creatinine : num 1.9 1.1 1.3 1.9 2.7 2.1 1.2 1.1 1.5 9.4 ...
$ serum_sodium : int 130 136 129 137 116 132 137 131 138 133 ...
$ sex : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 2 1 2 ...
$ smoking : Factor w/ 2 levels "0","1": 1 1 2 1 1 2 1 2 1 2 ...
$ heart.failure : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
library(rpart.plot)   # rpart.plot() for plotting the tree
library(partykit)     # as.party() for the party-style plot
heart_failure <- rpart(heart.failure ~ .,
data=heart,
method="class")
rpart.plot(heart_failure, main = "Decision Tree for Heart Failure")
tree.model.party1 <- as.party(heart_failure)
plot(tree.model.party1)
fancyRpartPlot(heart_failure, main = "Decision Tree for Heart Failure")
We can use the fitted decision tree above to make predictions on heart failure, as in the short sketch below, before moving on to K-Nearest Neighbours.
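A short sketch of that prediction step, using the fitted rpart object directly rather than reading the plot, and comparing against the observed outcome:
heart_pred <- predict(heart_failure, newdata = heart, type = "class")
table(Predicted = heart_pred, Actual = heart$heart.failure)   # in-sample confusion table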
frq(iris, Species)
Species <categorical>
# total N=150 valid N=150 mean=2.00 sd=0.82
Value | N | Raw % | Valid % | Cum. %
------------------------------------------
setosa | 50 | 33.33 | 33.33 | 33.33
versicolor | 50 | 33.33 | 33.33 | 66.67
virginica | 50 | 33.33 | 33.33 | 100.00
<NA> | 0 | 0.00 | <NA> | <NA>
str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
flowers_data <- iris
flowers_data <- flowers_data[!(flowers_data$Species == 'virginica'),]
tree.model <- rpart(Species ~ . ,
data = flowers_data)
tree.model.party <-
as.party(tree.model)
plot(tree.model.party)
flowers_data <- iris
tree.model <- rpart(Species~ .,
data = flowers_data)
tree.model.party <- as.party(tree.model)
plot(tree.model.party)
K-Nearest Neighbors (K-NN) is a versatile and intuitive machine learning algorithm used for both classification and regression tasks. At its core, K-NN operates on the principle of similarity: it classifies or predicts a data point based on the majority class or average of its K nearest neighbors in the feature space. The choice of K, the number of neighbors, is a critical parameter that influences the algorithm’s performance. A smaller K can make the model sensitive to noise and outliers, potentially leading to overfitting, while a larger K may result in a smoother decision boundary or prediction but could miss important local patterns in the data.
One of the key strengths of K-NN is its simplicity and ease of implementation. It doesn’t require a training phase; instead, it relies on the entire dataset during prediction. Additionally, K-NN can be applied to both numerical and categorical data, making it a versatile choice for various types of datasets. However, it’s important to consider the impact of scaling and distance metric selection when working with K-NN, as the algorithm’s performance can be sensitive to these factors. Overall, K-NN is a valuable tool in the machine learning toolkit, especially for cases where interpretability and simplicity are desired, but it should be used judiciously, with careful consideration of K and data preprocessing steps to achieve optimal results. Note that many people confuse k-nearest neighbours with k-means clustering; they are not the same thing. K-means clustering is an unsupervised machine learning algorithm, while k-NN is a supervised method used for classification and regression.
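As a minimal illustration of the idea (a sketch outside the admissions example that follows, using the class package rather than caret), k-NN can be run directly on the iris data loaded above:
library(class)
set.seed(42)
idx     <- sample(nrow(iris), 100)                  # 100 rows for training
train_x <- scale(iris[idx, 1:4])                    # centre and scale the features
test_x  <- scale(iris[-idx, 1:4],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))
pred    <- knn(train_x, test_x, cl = iris$Species[idx], k = 5)
table(Predicted = pred, Actual = iris$Species[-idx])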
knitr::include_graphics("knn.png")
knitr::include_graphics("knn2.png")
library(caret)
library(pROC)
library(mlbench)
k_nn <- read.csv("binary.csv")
head(k_nn,5)
admit gre gpa rank
1 0 380 3.61 3
2 1 660 3.67 3
3 1 800 4.00 1
4 1 640 3.19 4
5 0 520 2.93 4
str(k_nn)
'data.frame': 400 obs. of 4 variables:
$ admit: int 0 1 1 1 0 1 1 0 1 0 ...
$ gre : int 380 660 800 640 520 760 560 400 540 700 ...
$ gpa : num 3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
$ rank : int 3 3 1 4 4 2 1 2 3 2 ...
k_nn$admit <- as.factor(k_nn$admit)
str(k_nn)
'data.frame': 400 obs. of 4 variables:
$ admit: Factor w/ 2 levels "0","1": 1 2 2 2 1 2 2 1 2 1 ...
$ gre : int 380 660 800 640 520 760 560 400 540 700 ...
$ gpa : num 3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
$ rank : int 3 3 1 4 4 2 1 2 3 2 ...
k_nn$admit <- factor(k_nn$admit, levels = c(0,1),
labels = c("No","Yes"))
str(k_nn)
'data.frame': 400 obs. of 4 variables:
$ admit: Factor w/ 2 levels "No","Yes": 1 2 2 2 1 2 2 1 2 1 ...
$ gre : int 380 660 800 640 520 760 560 400 540 700 ...
$ gpa : num 3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
$ rank : int 3 3 1 4 4 2 1 2 3 2 ...
frq(k_nn, admit)
admit <categorical>
# total N=400 valid N=400 mean=1.32 sd=0.47
Value | N | Raw % | Valid % | Cum. %
--------------------------------------
No | 273 | 68.25 | 68.25 | 68.25
Yes | 127 | 31.75 | 31.75 | 100.00
<NA> | 0 | 0.00 | <NA> | <NA>
set.seed(1234)
ind3 <- sample(2, nrow(k_nn), replace = TRUE, prob = c(0.7, 0.3))
training <- k_nn[ind3 == 1,]
testing <- k_nn[ind3 ==2,]
str(testing)
'data.frame': 116 obs. of 4 variables:
$ admit: Factor w/ 2 levels "No","Yes": 1 1 1 2 2 2 1 2 2 1 ...
$ gre : int 520 700 480 800 520 780 400 500 520 400 ...
$ gpa : num 2.93 3.08 3.44 3.66 3.74 3.22 3.05 3.13 2.68 3.35 ...
$ rank : int 4 2 3 1 4 2 2 2 3 3 ...
str(training)
'data.frame': 284 obs. of 4 variables:
$ admit: Factor w/ 2 levels "No","Yes": 1 2 2 2 2 2 1 2 1 1 ...
$ gre : int 380 660 800 640 760 560 400 540 700 800 ...
$ gpa : num 3.61 3.67 4 3.19 3 2.98 3.08 3.39 3.92 4 ...
$ rank : int 3 3 1 4 2 1 2 3 2 4 ...
Before we estimate our k-NN model, let us specify the cross-validation (train control) settings
trControl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 3)
FIT <- train(admit~.,
data = training,
method = 'knn',
tuneLength = 20,
trControl = trControl,
preProc = c("center","scale"))
FIT
k-Nearest Neighbors
284 samples
3 predictor
2 classes: 'No', 'Yes'
Pre-processing: centered (3), scaled (3)
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 256, 255, 256, 255, 255, 256, ...
Resampling results across tuning parameters:
k Accuracy Kappa
5 0.6351040 0.09145898
7 0.6762425 0.16991017
9 0.6750930 0.15244021
11 0.6682348 0.13513104
13 0.6645430 0.12637105
15 0.6668774 0.11457683
17 0.6751314 0.13646317
19 0.6716010 0.12128138
21 0.6704543 0.11501118
23 0.6784620 0.12415411
25 0.6856431 0.13593414
27 0.6821128 0.12652554
29 0.6821538 0.12157653
31 0.6915134 0.13821970
33 0.6892556 0.13647228
35 0.6870389 0.13270602
37 0.6882321 0.12953000
39 0.6823618 0.10789173
41 0.6766119 0.09507794
43 0.6811713 0.10907586
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 31.
The results above indicate that we have fitted k-nearest neighbours on a sample of 284 observations, with three predictors (GRE, GPA and RANK) and two classes, “Yes” and “No”, where Yes means the student was admitted and No means the student was not admitted. We carried out ten-fold cross-validation repeated three times: in each round the training data is split into ten parts, nine of which are used to build the model while the remaining part is used to assess its performance. The output reports accuracy and kappa for each candidate k. Accuracy was used to select the optimal model using the largest value, and the final value used for the model was k = 31.
plot(FIT)
varImp(FIT)
ROC curve variable importance
Importance
gpa 100.00
rank 25.18
gre 0.00
From the output above, GPA turns out to be the most important variable and GRE the least important variable in our model.
pred <- predict(FIT, newdata = testing)
confusionMatrix(pred, testing$admit)
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 79 31
Yes 3 3
Accuracy : 0.7069
95% CI : (0.6152, 0.7877)
No Information Rate : 0.7069
P-Value [Acc > NIR] : 0.5461
Kappa : 0.0681
Mcnemar's Test P-Value : 3.649e-06
Sensitivity : 0.96341
Specificity : 0.08824
Pos Pred Value : 0.71818
Neg Pred Value : 0.50000
Prevalence : 0.70690
Detection Rate : 0.68103
Detection Prevalence : 0.94828
Balanced Accuracy : 0.52582
'Positive' Class : No
The model has an accuracy of 70.69%, and the no-information rate is also 70.69%. Basically, this means that if the model simply predicted “No” for every student, it would be correct 70.69% of the time, which is why the p-value for accuracy exceeding the NIR is not significant (0.5461). The quick sketch below shows where that rate comes from.
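A one-line sketch: the no-information rate is just the share of the largest class in the test set.
max(prop.table(table(testing$admit)))   # proportion of the majority class ("No")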
We can make a few changes to improve the performance of the model.
trControl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 3,
classProbs = TRUE,
summaryFunction = twoClassSummary)
Class probabilities are needed if we want to use ROC for selecting the best model with the optimal K
set.seed(222)
FIT <- train(admit~.,
data = training,
method = 'knn',
tuneLength = 20,
trControl = trControl,
preProc = c("center","scale"),
metric = "ROC",
tuneGrid = expand.grid(k= 1:60))
FIT
k-Nearest Neighbors
284 samples
3 predictor
2 classes: 'No', 'Yes'
Pre-processing: centered (3), scaled (3)
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 256, 256, 256, 256, 255, 256, ...
Resampling results across tuning parameters:
k ROC Sens Spec
1 0.5412281 0.7100000 0.36962963
2 0.5631871 0.7019298 0.35703704
3 0.5822368 0.7973684 0.34074074
4 0.5622904 0.7770175 0.26111111
5 0.5878558 0.8065789 0.28518519
6 0.5911501 0.8241228 0.27740741
7 0.5884016 0.8585965 0.28296296
8 0.5892982 0.8550877 0.26851852
9 0.5965010 0.8725439 0.29074074
10 0.5899123 0.8657895 0.27444444
11 0.5955945 0.8799123 0.28555556
12 0.5876170 0.8695614 0.27740741
13 0.5942300 0.8659649 0.24185185
14 0.5949854 0.8885965 0.25703704
15 0.5987768 0.8816667 0.22777778
16 0.6125097 0.8974561 0.22074074
17 0.6257943 0.8976316 0.23629630
18 0.6256481 0.9008772 0.21481481
19 0.6286647 0.8976316 0.22851852
20 0.6372612 0.9027193 0.22185185
21 0.6386209 0.9064035 0.21074074
22 0.6393519 0.9113158 0.22518519
23 0.6397710 0.9200000 0.21444444
24 0.6438743 0.9268421 0.21740741
25 0.6460039 0.9215789 0.19962963
26 0.6482359 0.9252632 0.19925926
27 0.6554288 0.9287719 0.20296296
28 0.6596881 0.9356140 0.21037037
29 0.6670419 0.9356140 0.19925926
30 0.6726901 0.9426316 0.19888889
31 0.6659649 0.9460526 0.17740741
32 0.6657797 0.9478070 0.18148148
33 0.6623830 0.9495614 0.18851852
34 0.6627973 0.9461404 0.19259259
35 0.6652242 0.9495614 0.18851852
36 0.6679678 0.9442982 0.18148148
37 0.6682261 0.9477193 0.18148148
38 0.6704825 0.9426316 0.18555556
39 0.6682846 0.9459649 0.18925926
40 0.6617788 0.9442982 0.18185185
41 0.6624708 0.9408772 0.17481481
42 0.6600097 0.9443860 0.15703704
43 0.6633138 0.9478070 0.16037037
44 0.6617008 0.9495614 0.16000000
45 0.6627924 0.9478070 0.14962963
46 0.6605897 0.9478070 0.15037037
47 0.6598294 0.9494737 0.14629630
48 0.6578558 0.9512281 0.14962963
49 0.6581481 0.9511404 0.14259259
50 0.6615692 0.9546491 0.14592593
51 0.6638060 0.9546491 0.13148148
52 0.6637037 0.9581579 0.14222222
53 0.6655507 0.9546491 0.12814815
54 0.6666715 0.9528947 0.11703704
55 0.6662671 0.9597368 0.11370370
56 0.6639133 0.9614912 0.11777778
57 0.6619737 0.9597368 0.10333333
58 0.6612037 0.9632456 0.10333333
59 0.6640156 0.9597368 0.09592593
60 0.6681189 0.9597368 0.09962963
ROC was used to select the optimal model using the largest value.
The final value used for the model was k = 30.
The results above indicate that we have fitted k-nearest neighbours on a sample of 284 observations, with three predictors (GRE, GPA and RANK) and two classes, “Yes” and “No”, where Yes means the student was admitted and No means the student was not. We again carried out ten-fold cross-validation repeated three times: in each round the training data is split into ten parts, nine of which are used to build the model and the remaining part to assess it. This time the summary reports ROC (the area under the ROC curve), sensitivity and specificity rather than accuracy and kappa. ROC was used to select the optimal model using the largest value; the final value used for the model was k = 30, with an area under the curve of about 0.6727, that is, 67.27%.
plot(FIT)
varImp(FIT)
ROC curve variable importance
Importance
gpa 100.00
rank 25.18
gre 0.00
pred <- predict(FIT, newdata = testing)
confusionMatrix(pred, testing$admit)
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 79 29
Yes 3 5
Accuracy : 0.7241
95% CI : (0.6334, 0.803)
No Information Rate : 0.7069
P-Value [Acc > NIR] : 0.3848
Kappa : 0.1423
Mcnemar's Test P-Value : 9.897e-06
Sensitivity : 0.9634
Specificity : 0.1471
Pos Pred Value : 0.7315
Neg Pred Value : 0.6250
Prevalence : 0.7069
Detection Rate : 0.6810
Detection Prevalence : 0.9310
Balanced Accuracy : 0.5552
'Positive' Class : No
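Because class probabilities are now being produced, a test-set ROC curve can be drawn with the pROC package loaded earlier (a sketch; the “Yes” column name matches the positive class label defined above):
prob_yes <- predict(FIT, newdata = testing, type = "prob")[, "Yes"]
roc_test <- roc(response = testing$admit, predictor = prob_yes)
plot(roc_test, main = "Test-set ROC, tuned k-NN")
auc(roc_test)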
For the purposes of training, we will use the Boston Housing data, loaded below with data("Boston"). This version of the dataset (with these column names) comes from the MASS package; the mlbench package loaded earlier provides a similar dataset called BostonHousing.
data("Boston")
str(Boston)
'data.frame': 506 obs. of 14 variables:
$ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
$ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
$ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
$ chas : int 0 0 0 0 0 0 0 0 0 0 ...
$ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
$ rm : num 6.58 6.42 7.18 7 7.15 ...
$ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
$ dis : num 4.09 4.97 4.97 6.06 6.06 ...
$ rad : int 1 2 2 3 3 3 5 5 5 5 ...
$ tax : num 296 242 242 222 222 222 311 311 311 311 ...
$ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
$ black : num 397 397 393 395 397 ...
$ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
$ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
data <- Boston
head(data,5)
crim zn indus chas nox rm age dis rad tax ptratio black lstat
1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
medv
1 24.0
2 21.6
3 34.7
4 33.4
5 36.2
The dependent variable of interest is “medv”. If you want to know more about the Boston Housing data, use the command below
?Boston
Boston Data Description: a data set containing housing values in 506 suburbs of Boston; a data frame with 506 rows and 14 variables.
* crim: per capita crime rate by town.
* zn: proportion of residential land zoned for lots over 25,000 sq.ft.
* indus: proportion of non-retail business acres per town.
* chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
* nox: nitrogen oxides concentration (parts per 10 million).
* rm: average number of rooms per dwelling.
* age: proportion of owner-occupied units built prior to 1940.
* dis: weighted mean of distances to five Boston employment centres.
* rad: index of accessibility to radial highways.
* tax: full-value property-tax rate per $10,000.
* ptratio: pupil-teacher ratio by town.
* black: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town.
* lstat: lower status of the population (percent).
* medv: median value of owner-occupied homes in $1000s.
set.seed(1234)
ind <- sample(2, nrow(data), replace = TRUE, prob = c(0.7, 0.3))
training <- data[ind == 1,]
testing <- data[ind ==2,]
trControl <- trainControl(method = 'repeatedcv',
number = 10,
repeats = 3)
set.seed(333)
attach(data)
FIT <- train(medv~.,
data = training,
tuneGrid = expand.grid(k=1:70),
method = 'knn',
trControl = trControl,
preProc = c('center', 'scale'))
FIT
k-Nearest Neighbors
355 samples
13 predictor
Pre-processing: centered (13), scaled (13)
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 320, 320, 319, 320, 319, 319, ...
Resampling results across tuning parameters:
k RMSE Rsquared MAE
1 4.247396 0.7819818 2.799416
2 3.979579 0.8083403 2.651736
3 3.950952 0.8194244 2.602271
4 4.081028 0.8117352 2.663417
5 4.187455 0.8009220 2.751447
6 4.269958 0.7936822 2.830755
7 4.259754 0.7944348 2.840942
8 4.285647 0.7923607 2.885038
9 4.277810 0.7942649 2.875932
10 4.288972 0.7960606 2.886287
11 4.306883 0.7963191 2.899734
12 4.311103 0.7955246 2.920360
13 4.336054 0.7947843 2.950035
14 4.405470 0.7888339 3.005893
15 4.423224 0.7887495 3.030541
16 4.492374 0.7852007 3.080057
17 4.520643 0.7848370 3.103929
18 4.566745 0.7827552 3.133610
19 4.608306 0.7806416 3.162693
20 4.638719 0.7801816 3.187662
21 4.671443 0.7777975 3.221833
22 4.715731 0.7762199 3.257161
23 4.773820 0.7719154 3.309642
24 4.810877 0.7707091 3.345512
25 4.838732 0.7700164 3.370597
26 4.867535 0.7704611 3.385367
27 4.883757 0.7726904 3.402671
28 4.913500 0.7720243 3.427319
29 4.951461 0.7704870 3.453771
30 4.977230 0.7704819 3.469728
31 5.018419 0.7690603 3.497270
32 5.041679 0.7698973 3.512520
33 5.072888 0.7694062 3.539960
34 5.103519 0.7669878 3.559422
35 5.132717 0.7659705 3.576025
36 5.170067 0.7641995 3.596993
37 5.201979 0.7640409 3.619875
38 5.242001 0.7625534 3.648506
39 5.269697 0.7618512 3.666397
40 5.313147 0.7587825 3.694215
41 5.347397 0.7563999 3.718244
42 5.377001 0.7539976 3.737528
43 5.409438 0.7517684 3.756962
44 5.432267 0.7502295 3.775538
45 5.464056 0.7477553 3.799406
46 5.492641 0.7452937 3.812879
47 5.513741 0.7437770 3.825341
48 5.532227 0.7436683 3.844284
49 5.554146 0.7425758 3.860164
50 5.569495 0.7407185 3.873060
51 5.590183 0.7391508 3.884943
52 5.611812 0.7376250 3.900505
53 5.627486 0.7363782 3.915204
54 5.644846 0.7348713 3.926904
55 5.660921 0.7345985 3.940371
56 5.682031 0.7333447 3.953439
57 5.696958 0.7326176 3.963038
58 5.712698 0.7318748 3.973957
59 5.730909 0.7309311 3.987815
60 5.749974 0.7301051 3.999038
61 5.772356 0.7288745 4.014376
62 5.790233 0.7273679 4.026197
63 5.806207 0.7264301 4.037537
64 5.821958 0.7251828 4.046288
65 5.836567 0.7240744 4.055216
66 5.848333 0.7232492 4.063122
67 5.855005 0.7233243 4.068135
68 5.870225 0.7226127 4.077585
69 5.885600 0.7213242 4.089970
70 5.902702 0.7203636 4.102866
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 3.
The results above show that we have run K-Nearest Neighbors with 355 samples and 13 predictors. We used ten-fold cross-validation repeated three times: the data is divided into ten folds, nine folds are used for model estimation and one fold is used for model assessment and validation. RMSE was used to select the optimal model using the smallest value, and the final value used for the model was k = 3.
plot(FIT)
The graph above shows that the lowest RMSE occurs at k = 3. After that point, the RMSE increases steadily.
varImp(FIT)
loess r-squared variable importance
Overall
lstat 100.00
rm 94.28
ptratio 40.21
tax 37.51
indus 36.41
crim 35.25
nox 31.91
rad 25.82
black 24.68
age 22.83
dis 19.78
zn 16.23
chas 0.00
The output above ranks the variables in our model from the most important to the least. The variable “chas” has an importance of zero; in other words, “chas” is the least important variable in our k-NN model.
pred <- predict(FIT, newdata = testing)
Because the response variable is numeric and continuous, we will calculate the root mean squared error (RMSE) for assessment.
RMSE(pred, testing$medv)
[1] 6.127333
plot(pred, testing$medv)
FIT <- train(medv~.,
data = training,
tuneGrid = expand.grid(k=1:70),
method = 'knn',
metric = 'Rsquared',
trControl = trControl,
preProc = c('center', 'scale'))
FIT
k-Nearest Neighbors
355 samples
13 predictor
Pre-processing: centered (13), scaled (13)
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 320, 320, 320, 319, 319, 320, ...
Resampling results across tuning parameters:
k RMSE Rsquared MAE
1 4.155683 0.7889735 2.751110
2 3.920046 0.8085024 2.615415
3 3.894027 0.8191854 2.576399
4 3.990820 0.8169846 2.633052
5 4.098328 0.8092759 2.721242
6 4.164762 0.8005553 2.782908
7 4.250046 0.7913992 2.849220
8 4.237511 0.7935728 2.849656
9 4.241670 0.7954279 2.867654
10 4.223247 0.8011824 2.864390
11 4.256550 0.7990746 2.888878
12 4.275542 0.7980135 2.909365
13 4.305207 0.7965859 2.951738
14 4.355170 0.7937387 2.994838
15 4.370063 0.7945428 3.019494
16 4.413595 0.7935461 3.052740
17 4.459155 0.7910273 3.086315
18 4.502035 0.7897903 3.119118
19 4.555873 0.7856789 3.160073
20 4.599092 0.7832938 3.192969
21 4.637681 0.7810155 3.224354
22 4.673391 0.7798530 3.253564
23 4.720768 0.7770125 3.283033
24 4.763076 0.7746145 3.315225
25 4.796324 0.7735186 3.353459
26 4.825073 0.7728740 3.378364
27 4.850184 0.7736520 3.402117
28 4.876660 0.7735504 3.423978
29 4.913905 0.7728784 3.453461
30 4.945720 0.7723983 3.475163
31 4.986189 0.7706545 3.500545
32 5.014103 0.7695456 3.513243
33 5.042506 0.7700251 3.532718
34 5.063848 0.7713883 3.547570
35 5.096916 0.7698529 3.572919
36 5.130072 0.7690975 3.594625
37 5.158998 0.7682367 3.616503
38 5.198007 0.7654431 3.638909
39 5.230199 0.7640846 3.656377
40 5.263299 0.7625566 3.676656
41 5.295303 0.7606504 3.696649
42 5.334168 0.7561353 3.721177
43 5.365412 0.7538500 3.744440
44 5.399053 0.7512384 3.763381
45 5.431207 0.7482837 3.789278
46 5.450129 0.7482528 3.802465
47 5.471523 0.7474810 3.817650
48 5.498410 0.7455536 3.828871
49 5.519348 0.7443257 3.845853
50 5.541579 0.7428898 3.861870
51 5.568951 0.7407798 3.880670
52 5.580325 0.7401795 3.892955
53 5.597908 0.7379495 3.906649
54 5.618826 0.7369065 3.920215
55 5.634407 0.7365578 3.932015
56 5.652281 0.7355451 3.945043
57 5.672553 0.7340832 3.961924
58 5.686516 0.7331445 3.974976
59 5.706086 0.7318256 3.989420
60 5.715973 0.7322444 3.992434
61 5.736630 0.7303771 4.004595
62 5.754764 0.7291911 4.017459
63 5.772238 0.7279232 4.029730
64 5.783690 0.7273407 4.035692
65 5.797816 0.7260285 4.042914
66 5.809293 0.7256345 4.048952
67 5.823800 0.7251425 4.057399
68 5.836885 0.7243181 4.068262
69 5.849227 0.7239215 4.081166
70 5.863107 0.7226044 4.092137
Rsquared was used to select the optimal model using the largest value.
The final value used for the model was k = 3.
plot(FIT)
varImp(FIT)
loess r-squared variable importance
Overall
lstat 100.00
rm 94.28
ptratio 40.21
tax 37.51
indus 36.41
crim 35.25
nox 31.91
rad 25.82
black 24.68
age 22.83
dis 19.78
zn 16.23
chas 0.00
pred <- predict(FIT, newdata = testing)
data_pred <- data.frame(testing, pred)
head(data_pred,5)
crim zn indus chas nox rm age dis rad tax ptratio black lstat
5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
14 0.62976 0 8.14 0 0.538 5.949 61.8 4.7075 4 307 21.0 396.90 8.26
16 0.62739 0 8.14 0 0.538 5.834 56.5 4.4986 4 307 21.0 395.62 8.47
26 0.84054 0 8.14 0 0.538 5.599 85.7 4.4546 4 307 21.0 303.42 16.51
28 0.95577 0 8.14 0 0.538 6.047 88.8 4.4534 4 307 21.0 306.38 17.28
medv pred
5 36.2 32.26667
14 20.4 18.93333
16 19.9 19.20000
26 13.9 14.13333
28 14.8 14.40000
str(data_pred)
'data.frame': 151 obs. of 15 variables:
$ crim : num 0.069 0.6298 0.6274 0.8405 0.9558 ...
$ zn : num 0 0 0 0 0 0 0 0 75 0 ...
$ indus : num 2.18 8.14 8.14 8.14 8.14 8.14 5.96 5.96 2.95 6.91 ...
$ chas : int 0 0 0 0 0 0 0 0 0 0 ...
$ nox : num 0.458 0.538 0.538 0.538 0.538 0.538 0.499 0.499 0.428 0.448 ...
$ rm : num 7.15 5.95 5.83 5.6 6.05 ...
$ age : num 54.2 61.8 56.5 85.7 88.8 94.4 68.2 30.2 21.8 62 ...
$ dis : num 6.06 4.71 4.5 4.45 4.45 ...
$ rad : int 3 4 4 4 4 4 5 5 3 3 ...
$ tax : num 222 307 307 307 307 307 279 279 252 233 ...
$ ptratio: num 18.7 21 21 21 21 21 19.2 19.2 18.3 17.9 ...
$ black : num 397 397 396 303 306 ...
$ lstat : num 5.33 8.26 8.47 16.51 17.28 ...
$ medv : num 36.2 20.4 19.9 13.9 14.8 18.4 18.9 24.7 30.8 19.4 ...
$ pred : num 32.3 18.9 19.2 14.1 14.4 ...
# Sample data with two variables
data <- data.frame(data_pred$medv, data_pred$pred
)
# Create a line plot for both variables
plot(data$data_pred.medv, type = "l", col = "blue", xlab = "Time Axis", ylab = "MEDV", main = "Actual vs Predicted MEDV")
lines(data$data_pred.pred, col = "red")
# Add a legend
legend("topright", legend = c("Actual MEDV", "Predicted MEDV"), col = c("blue", "red"), lty = 1)
Because the response variable is numeric and continuous, we will again calculate the root mean squared error (RMSE) for assessment.
RMSE(pred, testing$medv)
[1] 6.127333
plot(pred, testing$medv)
Interpreting a k-Nearest Neighbors (k-NN) regression model is somewhat different from interpreting a traditional linear regression model. In k-NN regression, instead of fitting a mathematical equation to your data, the model makes predictions based on the values of the k nearest data points in the training dataset. Here’s how you can interpret a k-NN regression model:
For each data point you want to predict, the k-NN algorithm identifies the k nearest data points in the training dataset based on some distance metric (e.g., Euclidean distance). It then calculates the average (or weighted average) of the target values (the values you’re trying to predict) for those k nearest data points, and this average is used as the prediction for the new data point.
Tuning Parameter (k): The most important parameter in k-NN regression is “k,” which represents the number of nearest neighbors to consider. A small k (e.g., 1 or 3) may result in a model that closely follows the training data but is sensitive to noise. A large k (e.g., 10 or more) may result in a smoother prediction surface but might not capture local variations. The choice of k should be based on cross-validation and the characteristics of your data.
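As a hedged sketch of that tuning process, the following chooses k by 10-fold cross-validation with caret, assuming the same Boston housing training split used to build FIT above.
library(caret)
set.seed(123)
fit_k <- train(medv ~ ., data = training, method = "knn",
               preProcess = c("center", "scale"),
               trControl = trainControl(method = "cv", number = 10),
               tuneGrid = data.frame(k = seq(1, 25, by = 2)))
fit_k$bestTune   # the k with the best cross-validated performance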
The choice of distance metric (e.g., Euclidean, Manhattan, etc.) can impact the model’s performance. Different distance metrics can lead to different interpretations of “closeness” among data points.
Unlike linear regression, k-NN regression can capture non-linear relationships in the data. Interpretation may not involve coefficients or slope values as in linear regression.
k-NN regression captures local patterns in the data. Interpretation involves understanding the local behavior around a prediction point. The model does not provide a global equation that describes the entire dataset.
Visualizations can be helpful for interpretation. Plotting the k nearest neighbors for specific data points can provide insights into why the model made a particular prediction.
k-NN does not provide feature importance scores like some other models (e.g., Random Forest or Gradient Boosting). However, you can still analyze feature importance indirectly by examining which features are most influential in determining the nearest neighbors.
Use appropriate regression evaluation metrics like Mean Absolute Error (MAE), Mean Squared Error (MSE), or R-squared to assess model performance. These metrics can give you a sense of how well the k-NN regression model fits the data.
Scaling and normalization of features can significantly affect the results, as k-NN is sensitive to the scale of input variables. Interpretation may involve considering the effects of feature scaling.
In summary, interpreting a k-NN regression model involves understanding its prediction process, the choice of hyperparameters (especially k), the impact of distance metrics, and the local nature of the model’s predictions. Visualizations and performance metrics play a crucial role in assessing and explaining the model’s behavior on your specific dataset.
In k-Nearest Neighbors (k-NN) regression, accuracy is not typically used as an evaluation metric because k-NN regression is a type of regression, not classification. Accuracy is more suitable for classification problems where you’re predicting discrete class labels.
For regression tasks, you typically use different evaluation metrics to assess the performance of your model. Common metrics for regression tasks include:
Mean Absolute Error (MAE): measures the average absolute difference between the predicted and actual values. It is less sensitive to outliers than Mean Squared Error.
Mean Squared Error (MSE): measures the average of the squared differences between predicted and actual values, giving higher weight to larger errors.
Root Mean Squared Error (RMSE): the square root of the MSE; it provides an interpretable measure of the average error in the same units as the target variable.
R-squared: measures the proportion of the variance in the dependent variable that is predictable from the independent variables. A higher R-squared indicates a better fit.
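As an illustration only, these metrics can be computed by hand for the k-NN predictions above, assuming pred and testing from the earlier chunks are still in memory.
mae  <- mean(abs(pred - testing$medv))                     # Mean Absolute Error
mse  <- mean((pred - testing$medv)^2)                      # Mean Squared Error
rmse <- sqrt(mse)                                          # Root Mean Squared Error
r2   <- 1 - sum((testing$medv - pred)^2) /
            sum((testing$medv - mean(testing$medv))^2)     # R-squared
c(MAE = mae, MSE = mse, RMSE = rmse, R2 = r2)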
A Support Vector Machine (SVM) is a versatile and widely-used supervised machine learning algorithm designed for classification and regression tasks. SVM aims to find an optimal hyperplane or decision boundary that best separates data points into different classes or predicts continuous values, while maximizing the margin, which is the distance between the boundary and the nearest data points of each class. What sets SVM apart is its ability to handle high-dimensional data effectively and its flexibility to deal with non-linear relationships in the data through the kernel trick, which allows data to be implicitly mapped into higher-dimensional spaces. In binary classification, SVM seeks to find the hyperplane that best separates the two classes, and in multiclass scenarios, it uses strategies like one-versus-rest (OvR) or one-versus-one (OvO) to extend classification capabilities. SVM has found applications in diverse fields, including text classification, image recognition, and financial prediction, owing to its robustness and effectiveness in modeling complex decision boundaries.
SVM’s key concepts include support vectors, which are data points that lie closest to the decision boundary and significantly influence the model, and the regularization parameter (C), which balances the trade-off between achieving a wide margin and minimizing misclassification errors. By maximizing the margin, SVM generally offers good generalization to new, unseen data. However, it can be computationally intensive for large datasets, and its interpretability is limited compared to simpler models like logistic regression. Despite these limitations, SVM remains a valuable tool in machine learning, particularly when dealing with high-dimensional data and complex classification problems, thanks to its ability to find optimal separating hyperplanes in various feature spaces.
Support Vector Machines (SVMs) have several key features and characteristics that make them a powerful and widely-used machine learning algorithm:
Effective for High-Dimensional Data: SVMs are effective in scenarios where the number of features (dimensions) is greater than the number of samples. They excel in high-dimensional spaces, such as text classification or image recognition.
Optimal Hyperplane: SVM aims to find the optimal hyperplane that maximizes the margin, which is the distance between the decision boundary and the nearest data points of each class. This leads to better generalization to unseen data.
Kernel Trick: SVMs can handle non-linear data by implicitly mapping it into higher-dimensional spaces using kernel functions (e.g., polynomial, radial basis function). This allows SVMs to model complex decision boundaries.
Sparsity: SVMs are “sparse” models, meaning only a subset of data points, called support vectors, influence the decision boundary. This makes them memory-efficient and suitable for large datasets.
Binary and Multiclass Classification: SVMs can be used for both binary and multiclass classification tasks. Strategies like one-versus-rest (OvR) or one-versus-one (OvO) are employed for multiclass classification.
Regularization Parameter (C): The parameter C controls the trade-off between maximizing the margin and minimizing the classification error. Higher C values lead to a narrower margin but fewer misclassifications, while lower C values prioritize a wider margin but may tolerate some misclassifications.
Robust to Outliers: SVMs are less sensitive to outliers in the data compared to some other algorithms because the decision boundary is influenced by support vectors rather than all data points.
Interpretability: While not as interpretable as linear models like logistic regression, SVMs can provide insights into feature importance through the examination of support vectors and their associated coefficients.
Versatile: SVMs are versatile and can be applied to various machine learning tasks, including classification, regression (Support Vector Regression), and outlier detection.
Well-Defined Theoretical Framework: SVMs are founded on a well-defined mathematical framework with strong theoretical underpinnings, making them attractive for rigorous analysis.
Popular in Real-World Applications: SVMs have been successfully applied in a wide range of real-world applications, including image and text classification, bioinformatics, finance, and more.
Scalability: While SVMs can be computationally intensive for large datasets, methods like Stochastic Gradient Descent (SGD) and the use of approximate solvers have been developed to enhance scalability.
Overall, Support Vector Machines offer a powerful tool for both linear and non-linear classification problems, with a strong emphasis on maximizing the margin between classes, robustness, and adaptability to various data types and applications.
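As a small, hedged sketch of the regularization parameter in practice, the snippet below uses the e1071 package (introduced formally in the next section) and the built-in iris data to show how the number of support vectors typically shrinks as the cost C grows and the margin narrows; the exact counts depend on the data.
library(e1071)
for (c_val in c(0.1, 1, 100)) {
  m <- svm(Species ~ ., data = iris, kernel = "linear", cost = c_val)
  cat("cost =", c_val, "-> support vectors:", nrow(m$SV), "\n")
}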
In this support vector machine demonstration, we are going to use the built-in “iris” data set.
## View the data set
data <- iris
head(data,5)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
str(data)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
attach(data)
The data set has 150 observations and 5 variables, 4 of which are numeric and one of which is categorical: Species is a factor variable. Using the four numeric variables, we want to build a classification model that will help us predict the correct species. We need to load some libraries, including “ggplot2”.
Let us make a plot of petal length and petal width
qplot(Petal.Length, Petal.Width, data = data)
Let us color the scatter plot and see if there is a possibility of separation.
qplot(Petal.Length, Petal.Width,
color = Species, data = data)
From the scatter plot above, there is separation based on the species, as shown by the different colors. The red points are “setosa”, the green ones are “versicolor”, and the blue ones are “virginica”. We can see that there is some separation, since the setosa species are far from the other two.
Let us make a plot of sepal length and sepal width
qplot(Sepal.Length, Sepal.Width,
color = Species, data = data)
Let us get a quick review of what a support vector machine is. Consider the figure below.
knitr::include_graphics("support.png")
The color-coded points are data points for two categories, that is, two distinct classes. In other words, we are looking for the optimal separating hyperplane between these two classes, and we achieve that by maximizing the margin around the separating hyperplane. Points that lie on the margin boundaries are called the support vectors, and the middle line is the separating hyperplane. In situations where we cannot obtain a linear separator, the data points are projected into a higher-dimensional space so that they become linearly separable; for this we make use of kernels. The program that helps us achieve this is called the support vector machine. For the purpose of this demonstration, load the “e1071” library, which is commonly used for support vector machines. We can now develop a support vector machine.
library(e1071)
model <- svm(Species~., data = data)
summary(model)
Call:
svm(formula = Species ~ ., data = data)
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 1
Number of Support Vectors: 51
( 8 22 21 )
Number of Classes: 3
Levels:
setosa versicolor virginica
The output above gives the formula used. The type used is C-classification with a radial SVM kernel. Since our dependent variable, Species, is a factor variable, classification is the default type; if the dependent variable were continuous, we would instead have a regression. The cost parameter is the cost of constraint violation, with a default of one, and gamma is 0.25. The model has 51 support vectors: 8 belong to the first species, 22 to the second species, and 21 to the third species. The three species are setosa, versicolor, and virginica. Let us plot the results and see what comes out. Since we have four variables, we will need to make some modifications; if we had only two variables (Petal length and Petal width), the code below would be sufficient.
## plot(model, data = data)
We will make use of a slice for the other two quantitative variables: Sepal width is held constant at 3 and Sepal length at 4.
plot(model, data = data,
Petal.Width~Petal.Length,
slice = list(Sepal.Width = 3, Sepal.Length = 4))
The above plot shows a visualization of the support vector machine: the data points, the support vectors (represented by crosses), and the decision boundaries for the three species. The lower part of the graph comprises data points belonging to the first species, the middle part comprises data points belonging to the second species, and the upper part comprises data points belonging to the third species. The background color represents the predicted class in each region. As noted above, we have 51 support vectors: 8, 22, and 21 belonging to the first, second, and third species, respectively.
pred <- predict(model, data)
confusionMatrix(pred, data$Species)
Confusion Matrix and Statistics
Reference
Prediction setosa versicolor virginica
setosa 50 0 0
versicolor 0 48 2
virginica 0 2 48
Overall Statistics
Accuracy : 0.9733
95% CI : (0.9331, 0.9927)
No Information Rate : 0.3333
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.96
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: setosa Class: versicolor Class: virginica
Sensitivity 1.0000 0.9600 0.9600
Specificity 1.0000 0.9800 0.9800
Pos Pred Value 1.0000 0.9600 0.9600
Neg Pred Value 1.0000 0.9800 0.9800
Prevalence 0.3333 0.3333 0.3333
Detection Rate 0.3333 0.3200 0.3200
Detection Prevalence 0.3333 0.3333 0.3333
Balanced Accuracy 1.0000 0.9700 0.9700
From the results above, there were 50 setosa species and the model predicted all 50 as setosa. Additionally, the data has 50 versicolor species, of which the model correctly predicted 48, and 50 virginica species, of which the model also correctly predicted 48. However, there are misclassifications: 2 virginica observations were predicted as versicolor, and 2 versicolor observations were predicted as virginica. Overall, the model has an accuracy of approximately 97.33%, so the misclassification rate is approximately 2.67%. It is also good to note that this model used the radial kernel.
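As a quick sanity check (a sketch, reusing pred and data from the chunks above), the accuracy and misclassification rate can also be computed directly from the predictions.
mean(pred == data$Species)   # accuracy: proportion of correctly classified observations
mean(pred != data$Species)   # misclassification rate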
Other than radial function, we can use linear function as shown below.
model1 <- svm(Species~.,
kernel = "linear",
data = data)
summary(model1)
Call:
svm(formula = Species ~ ., data = data, kernel = "linear")
Parameters:
SVM-Type: C-classification
SVM-Kernel: linear
cost: 1
Number of Support Vectors: 29
( 2 15 12 )
Number of Classes: 3
Levels:
setosa versicolor virginica
Now the kernel type is linear rather than radial. The number of support vectors has been reduced to 29, compared to the previous 51. In this new model, 2 of the support vectors belong to the first species, 15 belong to the second species, and 12 belong to the third species. Let us view the plot and see if there is any change.
plot(model1, data = data,
Petal.Width~Petal.Length,
slice = list(Sepal.Width = 3, Sepal.Length = 4))
Let us look at the confusion matrix
pred1 <- predict(model1, data)
confusionMatrix(pred1, data$Species)
Confusion Matrix and Statistics
Reference
Prediction setosa versicolor virginica
setosa 50 0 0
versicolor 0 46 1
virginica 0 4 49
Overall Statistics
Accuracy : 0.9667
95% CI : (0.9239, 0.9891)
No Information Rate : 0.3333
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.95
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: setosa Class: versicolor Class: virginica
Sensitivity 1.0000 0.9200 0.9800
Specificity 1.0000 0.9900 0.9600
Pos Pred Value 1.0000 0.9787 0.9245
Neg Pred Value 1.0000 0.9612 0.9897
Prevalence 0.3333 0.3333 0.3333
Detection Rate 0.3333 0.3067 0.3267
Detection Prevalence 0.3333 0.3133 0.3533
Balanced Accuracy 1.0000 0.9550 0.9700
In the new model, we have a slightly higher misclassification rate of approximately 3.33%, with a model accuracy of approximately 96.67%.
model2 <- svm(Species~.,
kernel = "polynomial",
data = data)
summary(model2)
Call:
svm(formula = Species ~ ., data = data, kernel = "polynomial")
Parameters:
SVM-Type: C-classification
SVM-Kernel: polynomial
cost: 1
degree: 3
coef.0: 0
Number of Support Vectors: 54
( 6 26 22 )
Number of Classes: 3
Levels:
setosa versicolor virginica
Using the polynomial kernel function, the model has 54 support vectors, with 6, 26, and 22 belonging to the first, second, and third categories, respectively. Let us make the plot.
plot(model2, data = data,
Petal.Width~Petal.Length,
slice = list(Sepal.Width = 3, Sepal.Length = 4))
pred2 <- predict(model2, data)
confusionMatrix(pred2, data$Species)
Confusion Matrix and Statistics
Reference
Prediction setosa versicolor virginica
setosa 50 0 0
versicolor 0 50 7
virginica 0 0 43
Overall Statistics
Accuracy : 0.9533
95% CI : (0.9062, 0.981)
No Information Rate : 0.3333
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.93
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: setosa Class: versicolor Class: virginica
Sensitivity 1.0000 1.0000 0.8600
Specificity 1.0000 0.9300 1.0000
Pos Pred Value 1.0000 0.8772 1.0000
Neg Pred Value 1.0000 1.0000 0.9346
Prevalence 0.3333 0.3333 0.3333
Detection Rate 0.3333 0.3333 0.2867
Detection Prevalence 0.3333 0.3800 0.2867
Balanced Accuracy 1.0000 0.9650 0.9300
From the model above, the accuracy decreased to 95.33% compared to the previous model, with the misclassification rate increasing to approximately 4.67%. The model classifies 7 observations as belonging to the second category (versicolor) when in reality they belong to the third category (virginica).
model3 <- svm(Species~.,
kernel = "sigmoid",
data = data)
summary(model3)
Call:
svm(formula = Species ~ ., data = data, kernel = "sigmoid")
Parameters:
SVM-Type: C-classification
SVM-Kernel: sigmoid
cost: 1
coef.0: 0
Number of Support Vectors: 54
( 6 26 22 )
Number of Classes: 3
Levels:
setosa versicolor virginica
The results above represent the training of a Support Vector Machine (SVM) classifier using the sigmoid kernel for a multi-class classification problem. The formula “Species ~ .” indicates that the goal is to predict the “Species” variable using all the other variables in the dataset. The SVM type is C-classification, the standard soft-margin formulation for classification tasks (multi-class problems are handled internally with pairwise classifiers). The “cost” parameter is set to 1, controlling the trade-off between maximizing the margin and minimizing classification errors. The “coef.0” parameter, a constant term in the sigmoid kernel, is set to 0.
The model has identified 54 support vectors. Support vectors are the data points closest to the decision boundary, and they play a crucial role in defining the decision boundary of the SVM. In this case, there are 6 support vectors for the “setosa” class, 26 support vectors for the “versicolor” class, and 22 support vectors for the “virginica” class.
There are three classes in this classification problem: “setosa,” “versicolor,” and “virginica.” The model has learned to distinguish between these three classes based on the provided features. The sigmoid kernel is a non-linear kernel that can capture complex relationships between features and class labels. Overall, this SVM model with the sigmoid kernel has been trained to classify data into one of these three flower species based on the given dataset.
plot(model3, data = data,
Petal.Width~Petal.Length,
slice = list(Sepal.Width = 3, Sepal.Length = 4))
pred3 <- predict(model3, data)
confusionMatrix(pred3, data$Species)
Confusion Matrix and Statistics
Reference
Prediction setosa versicolor virginica
setosa 49 0 0
versicolor 1 41 7
virginica 0 9 43
Overall Statistics
Accuracy : 0.8867
95% CI : (0.8248, 0.9326)
No Information Rate : 0.3333
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.83
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: setosa Class: versicolor Class: virginica
Sensitivity 0.9800 0.8200 0.8600
Specificity 1.0000 0.9200 0.9100
Pos Pred Value 1.0000 0.8367 0.8269
Neg Pred Value 0.9901 0.9109 0.9286
Prevalence 0.3333 0.3333 0.3333
Detection Rate 0.3267 0.2733 0.2867
Detection Prevalence 0.3267 0.3267 0.3467
Balanced Accuracy 0.9900 0.8700 0.8850
The confusion matrix and associated statistics provide a comprehensive evaluation of the performance of a classification model, presumably for a multiclass problem (with three classes: setosa, versicolor, and virginica). Here’s what each part of the output means:
The confusion matrix displays the counts of correct and incorrect predictions for each class; here the three classes are setosa, versicolor, and virginica. For example, the first row shows that 49 instances were correctly predicted as setosa and that no instances of the other classes were predicted as setosa. The second row shows that 41 instances of versicolor were correctly predicted, while 1 setosa instance and 7 virginica instances were incorrectly predicted as versicolor.
Overall Statistics:
Accuracy is the overall classification accuracy of the model, which is approximately 88.67%. It’s the ratio of correctly predicted instances to the total instances. The 95% Confidence Interval (CI) provides a range for the accuracy estimate. The No Information Rate (NIR) represents the accuracy that would be achieved by simply predicting the majority class all the time (0.3333 in this case). The Kappa statistic measures the agreement between actual and predicted classifications, with a value of 0.83 indicating substantial agreement. This model has a lower accuracy as compared to the previous one, with a higher mis-classification error of approximately 11.33%.
Sensitivity (also known as True Positive Rate or Recall) measures the proportion of true positives for each class. It tells you how well the model is at correctly identifying instances of each class. Specificity measures the proportion of true negatives for each class. It tells you how well the model is at correctly identifying instances not belonging to each class. Pos Pred Value (Positive Predictive Value or Precision) is the proportion of true positives among all predicted positives for each class. Neg Pred Value (Negative Predictive Value) is the proportion of true negatives among all predicted negatives for each class. Prevalence is the proportion of each class in the dataset. Detection Rate is the rate at which the model correctly detects each class. Detection Prevalence is the rate at which the model predicts each class. Balanced Accuracy is a measure that averages sensitivity and specificity and is particularly useful when dealing with imbalanced datasets.
In summary, this analysis provides a detailed breakdown of the model’s performance for each class, including its ability to correctly classify instances, along with overall accuracy and additional statistics. It’s essential to consider these metrics in the context of your specific problem and objectives to determine the model’s effectiveness.
From all the support vector machine models above, the sigmoid kernel function performed the worst and the radial kernel function performed the best. Let us see if we can tune our model to get a better classification.
This kind of optimization helps us select the best model.
set.seed(222) ### To attain reproducibility
### Tuning
t.model<-tune(svm, Species~.,
data = data,
ranges = list(epsilon = seq(0,1,0.1), cost=2^(2:9)))
The code above uses the tune function to tune hyperparameters for a Support Vector Machine (SVM) classifier, with the “Species” variable as the target and all other features in the “data” dataset as predictors. Specifically, it tunes two hyperparameters: “epsilon” and “cost.”
tune: This function is typically used for hyperparameter tuning in R. It takes various arguments to specify the model to be tuned, the data, and the hyperparameter search space.
svm: The svm function is the model that you are trying to tune. It’s an SVM classifier for classification tasks.
Species~.: This formula specifies that you want to predict the “Species” variable based on all other variables in the dataset (denoted by the .).
data = data: This argument specifies the dataset from which the model should be trained and validated.
ranges: This argument specifies the hyperparameter search space. It appears to be defining a range for two hyperparameters:
epsilon: A range from 0 to 1 with a step of 0.1. cost: A range of values obtained by raising 2 to the power of integers from 2 to 9. The tune function will perform a grid search over the specified hyperparameter ranges, training and evaluating the SVM model with different combinations of “epsilon” and “cost.” It will likely use some form of cross-validation to assess the model’s performance for each combination of hyperparameters.
After running this code, you should get the best combination of hyperparameters that optimizes the model’s performance on the dataset, as well as the corresponding model performance metrics for each combination. This is a common approach to find the best hyperparameters for machine learning models.
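One aside before looking at the plot: in e1071, epsilon belongs to the epsilon-insensitive loss used for support vector regression, so for a C-classification task the grid search above is effectively only over cost. A hedged alternative sketch that tunes gamma (the radial kernel width) together with cost could look like this.
set.seed(222)
t.alt <- tune(svm, Species~., data = data,
              ranges = list(gamma = 10^(-2:1), cost = 2^(2:7)))
t.alt$best.parameters   # best gamma/cost combination found by cross-validation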
plot(t.model)
The graph shows the performance of the SVM across the two parameters, “cost” and “epsilon”. Darker regions imply better results. From the graph, if we reduce the cost value we can get equally good results instead of going all the way to 2^9.
t.model1<-tune(svm, Species~.,
data = data,
ranges = list(epsilon = seq(0,1,0.1), cost=2^(2:7)))
plot(t.model1)
The model seems to perform somewhat better. Let us now get the summary of this model.
summary(t.model1)
Parameter tuning of 'svm':
- sampling method: 10-fold cross validation
- best parameters:
epsilon cost
0 4
- best performance: 0.04666667
- Detailed performance results:
epsilon cost error dispersion
1 0.0 4 0.04666667 0.04499657
2 0.1 4 0.04666667 0.04499657
3 0.2 4 0.04666667 0.04499657
4 0.3 4 0.04666667 0.04499657
5 0.4 4 0.04666667 0.04499657
6 0.5 4 0.04666667 0.04499657
7 0.6 4 0.04666667 0.04499657
8 0.7 4 0.04666667 0.04499657
9 0.8 4 0.04666667 0.04499657
10 0.9 4 0.04666667 0.04499657
11 1.0 4 0.04666667 0.04499657
12 0.0 8 0.04666667 0.04499657
13 0.1 8 0.04666667 0.04499657
14 0.2 8 0.04666667 0.04499657
15 0.3 8 0.04666667 0.04499657
16 0.4 8 0.04666667 0.04499657
17 0.5 8 0.04666667 0.04499657
18 0.6 8 0.04666667 0.04499657
19 0.7 8 0.04666667 0.04499657
20 0.8 8 0.04666667 0.04499657
21 0.9 8 0.04666667 0.04499657
22 1.0 8 0.04666667 0.04499657
23 0.0 16 0.05333333 0.04216370
24 0.1 16 0.05333333 0.04216370
25 0.2 16 0.05333333 0.04216370
26 0.3 16 0.05333333 0.04216370
27 0.4 16 0.05333333 0.04216370
28 0.5 16 0.05333333 0.04216370
29 0.6 16 0.05333333 0.04216370
30 0.7 16 0.05333333 0.04216370
31 0.8 16 0.05333333 0.04216370
32 0.9 16 0.05333333 0.04216370
33 1.0 16 0.05333333 0.04216370
34 0.0 32 0.04666667 0.03220306
35 0.1 32 0.04666667 0.03220306
36 0.2 32 0.04666667 0.03220306
37 0.3 32 0.04666667 0.03220306
38 0.4 32 0.04666667 0.03220306
39 0.5 32 0.04666667 0.03220306
40 0.6 32 0.04666667 0.03220306
41 0.7 32 0.04666667 0.03220306
42 0.8 32 0.04666667 0.03220306
43 0.9 32 0.04666667 0.03220306
44 1.0 32 0.04666667 0.03220306
45 0.0 64 0.05333333 0.04216370
46 0.1 64 0.05333333 0.04216370
47 0.2 64 0.05333333 0.04216370
48 0.3 64 0.05333333 0.04216370
49 0.4 64 0.05333333 0.04216370
50 0.5 64 0.05333333 0.04216370
51 0.6 64 0.05333333 0.04216370
52 0.7 64 0.05333333 0.04216370
53 0.8 64 0.05333333 0.04216370
54 0.9 64 0.05333333 0.04216370
55 1.0 64 0.05333333 0.04216370
56 0.0 128 0.07333333 0.04919099
57 0.1 128 0.07333333 0.04919099
58 0.2 128 0.07333333 0.04919099
59 0.3 128 0.07333333 0.04919099
60 0.4 128 0.07333333 0.04919099
61 0.5 128 0.07333333 0.04919099
62 0.6 128 0.07333333 0.04919099
63 0.7 128 0.07333333 0.04919099
64 0.8 128 0.07333333 0.04919099
65 0.9 128 0.07333333 0.04919099
66 1.0 128 0.07333333 0.04919099
The sampling method is 10-fold cross-validation, with the best parameters being epsilon = 0 and cost = 4. Note that the cross-validation error is identical across all epsilon values for a given cost; this is expected because epsilon belongs to the epsilon-insensitive loss used in support vector regression and does not affect C-classification. Using these results we can choose our best model.
mymodel <- t.model1$best.model
summary(mymodel)
Call:
best.tune(METHOD = svm, train.x = Species ~ ., data = data, ranges = list(epsilon = seq(0,
1, 0.1), cost = 2^(2:7)))
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 4
Number of Support Vectors: 37
( 6 17 14 )
Number of Classes: 3
Levels:
setosa versicolor virginica
The output above represents the best-tuned Support Vector Machine (SVM) model for a classification task with three classes (setosa, versicolor, virginica). Here’s an explanation of the key details:
Model Type: The model type is C-classification, indicating that it’s a classification model.
SVM Kernel: The best-tuned SVM model uses the radial kernel. The radial kernel, also known as the Radial Basis Function (RBF) kernel, is a popular choice for SVMs as it can capture complex, non-linear decision boundaries effectively.
Hyperparameters:
The best-tuned hyperparameters are: Cost: 4. Epsilon does not appear in this output because it does not influence C-classification, so only the cost mattered during tuning.
Number of Support Vectors: The model has 37 support vectors in total: 6 belong to the setosa class, 17 to the versicolor class, and 14 to the virginica class.
Number of Classes: There are three classes in this classification problem: setosa, versicolor, and virginica. Levels:
This section lists the levels or class labels for the target variable, which are setosa, versicolor, and virginica. The best-tuned SVM model is selected based on the hyperparameters that maximize its performance on the training dataset, typically through techniques like cross-validation. In this case, the radial kernel and a cost value of 4 have been identified as the optimal hyperparameters. These hyperparameters are chosen to strike a balance between achieving a good fit to the training data and preventing overfitting. The number of support vectors indicates that the model is relying on a relatively small subset of the training data to define the decision boundary. This can be beneficial for generalization and model simplicity.
Overall, this best-tuned SVM model is designed to classify new data points into one of the three classes (setosa, versicolor, or virginica) using the radial kernel and the specified cost parameter, which have been determined as the optimal choices through the tuning process.
plot(mymodel, data = data,
Petal.Width~Petal.Length,
slice = list(Sepal.Width = 3, Sepal.Length = 4))
pred4 <- predict(mymodel, data)
confusionMatrix(pred4, data$Species)
Confusion Matrix and Statistics
Reference
Prediction setosa versicolor virginica
setosa 50 0 0
versicolor 0 48 0
virginica 0 2 50
Overall Statistics
Accuracy : 0.9867
95% CI : (0.9527, 0.9984)
No Information Rate : 0.3333
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.98
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: setosa Class: versicolor Class: virginica
Sensitivity 1.0000 0.9600 1.0000
Specificity 1.0000 1.0000 0.9800
Pos Pred Value 1.0000 1.0000 0.9615
Neg Pred Value 1.0000 0.9804 1.0000
Prevalence 0.3333 0.3333 0.3333
Detection Rate 0.3333 0.3200 0.3333
Detection Prevalence 0.3333 0.3200 0.3467
Balanced Accuracy 1.0000 0.9800 0.9900
From the model above, the accuracy (98.67%) improved significantly compared to all previous models, with a very low misclassification error of about 1.33%.
Principal Component Analysis (PCA) is a widely used dimensionality reduction technique in data analysis and machine learning. It aims to simplify complex datasets while preserving as much of the original information as possible. PCA works by transforming the original features into a new set of orthogonal (uncorrelated) features called principal components. These components are ordered in such a way that the first component explains the maximum variance in the data, the second component explains the second most, and so on.
The first step in PCA involves centering the data by subtracting the mean from each feature, ensuring that the data has a mean of zero. Then, PCA calculates the covariance matrix of the centered data, which captures the relationships between the features. The eigenvectors and eigenvalues of this covariance matrix are computed. The eigenvectors represent the direction of maximum variance in the data, and the eigenvalues indicate the amount of variance explained by each eigenvector. These eigenvectors are the principal components, and they form an orthogonal basis for the data.
In practice, PCA is used for various purposes, such as data compression, visualization, and noise reduction. By retaining only a subset of the top principal components that explain most of the variance, you can reduce the dimensionality of the data while maintaining its essential structure. This can be particularly useful for visualizing high-dimensional data or for speeding up machine learning algorithms that suffer from the curse of dimensionality. However, it’s essential to keep in mind that PCA assumes linear relationships between variables, and it may not work well for data with nonlinear dependencies.
In summary, PCA is a powerful technique for dimensionality reduction that transforms high-dimensional data into a lower-dimensional representation while preserving as much variance as possible. It relies on the computation of eigenvectors and eigenvalues of the covariance matrix and can be applied to a wide range of data analysis tasks, making it a fundamental tool in the toolkit of data scientists and machine learning practitioners.
The Iris dataset is a well-known dataset in the field of machine learning and statistics. It was introduced by the British biologist and statistician Ronald A. Fisher in 1936. The dataset consists of 150 samples of iris flowers from three different species: Setosa, Versicolor, and Virginica. Each sample has four features: sepal length, sepal width, petal length, and petal width, all measured in centimeters. The primary purpose of this dataset is to serve as a benchmark for classification and pattern recognition algorithms. Researchers often use it for tasks like supervised learning, clustering, and data visualization. It’s a simple and widely used dataset for practicing and demonstrating various machine learning techniques.
data <- iris
head(data,5)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
str(data)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
We are going to use the iris data, which has 150 observations and 5 variables. Among the five variables, four are numeric and one is categorical. Check the summary of the data set using the command below.
summary(data)
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Species
setosa :50
versicolor:50
virginica :50
The variables have different scales, so it is a good idea to normalize the data. Normalization helps ensure that one variable does not overshadow the others. For the species variable, each species has fifty observations.
This is the first and most important step in machine learning: partitioning the data into training and testing sets.
set.seed(123)
ind <- sample(2, nrow(data),
replace = TRUE,
prob = c(0.8, 0.2))
training <- data[ind == 1,] ## this is the training data set
testing <- data[ind ==2,] ### this is the testing data set
We are going to make use of the “psych” package.
library(psych)
Now, we want to see the correlation among independent variables.
pairs.panels(training[,-5],
gap = 0,
bg = c("red","yellow","blue")[training$Species],
pch = 21)
This is the plot that we get: the lower triangular part gives the scatter plots, with the species color coded, and the upper triangular part gives the correlation coefficients. The correlation among the independent variables is very high, which gives rise to a multicollinearity problem. In such cases, the estimates will be unstable and the predictions inaccurate. One way of handling such cases is to apply principal component analysis.
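As a quick numeric complement to the plot (a sketch on the same training split), the pairwise correlations can also be printed directly.
round(cor(training[, -5]), 2)   # correlation matrix of the four numeric predictors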
Setting center and scale to TRUE ensures that, before the principal component analysis is performed, all four variables are normalized, that is, each has a mean of zero and a variance of one.
pc <- prcomp(training[,-5],
center = TRUE,
scale. = TRUE)
Check the attributes stored in the pc object.
attributes(pc)
$names
[1] "sdev" "rotation" "center" "scale" "x"
$class
[1] "prcomp"
We can now check what is contained in each of these attributes; for example, the center attribute:
pc$center
Sepal.Length Sepal.Width Petal.Length Petal.Width
5.826446 3.021488 3.759504 1.201653
mean(training$Sepal.Length)
[1] 5.826446
pc$scale
Sepal.Length Sepal.Width Petal.Length Petal.Width
0.8105316 0.4270454 1.7377945 0.7557539
sd(training$Sepal.Length)
[1] 0.8105316
print(pc)
Standard deviations (1, .., p=4):
[1] 1.7237041 0.9309494 0.3749824 0.1468522
Rotation (n x k) = (4 x 4):
PC1 PC2 PC3 PC4
Sepal.Length 0.5201832 -0.37206837 0.7292637 0.2432058
Sepal.Width -0.2937566 -0.92188195 -0.2184802 -0.1269139
Petal.Length 0.5747475 -0.03477237 -0.1603396 -0.8017153
Petal.Width 0.5592690 -0.10241915 -0.6282771 0.5310334
When we print pc, we get the standard deviations and the rotations, which are also called loadings. Because we have four variables, we will have four principal components: PC1, PC2, PC3, and PC4. Each principal component is a normalized linear combination of the original variables. The rotations, or loadings, are the coefficients of the continuous variables, and the values in the matrix above lie between -1 and 1. From the results, we can see that PC1 increases as sepal length, petal length, and petal width increase, while sepal width decreases as PC1 increases.
summary(pc)
Importance of components:
PC1 PC2 PC3 PC4
Standard deviation 1.7237 0.9309 0.37498 0.14685
Proportion of Variance 0.7428 0.2167 0.03515 0.00539
Cumulative Proportion 0.7428 0.9595 0.99461 1.00000
From the results above, principal component 1 (PC1) captures the majority of the variability (74.28%), while PC2 captures 21.67%, PC3 captures 3.515%, and PC4 captures 0.539%. Looking at the cumulative proportion, by PC2 more than 95% of the variability has already been explained (95.95%). Principal components 3 and 4 do not play an important role in capturing the variability in the independent variables, so we can say that PC1 and PC2 capture the majority of the variability.
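A scree plot and the cumulative proportion of variance make this easy to see; here is a minimal sketch using the pc object above.
screeplot(pc, type = "lines", main = "Scree plot of the principal components")
cumsum(pc$sdev^2) / sum(pc$sdev^2)   # cumulative proportion of variance explained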
The orthogonality of principal components is a fundamental concept in Principal Component Analysis (PCA) and plays a crucial role in the dimensionality reduction process. In PCA, the goal is to find a set of new orthogonal axes (principal components) in the data space that captures the maximum variance. Here’s a detailed explanation of why these principal components are orthogonal:
Centering the Data: PCA begins by centering the data, which means subtracting the mean of each feature from the data points. This step ensures that the data has a mean of zero for each feature, which simplifies subsequent calculations.
Covariance Matrix: The next step in PCA involves calculating the covariance matrix of the centered data. The covariance matrix is a square matrix where each element represents the covariance between two features. Importantly, the covariance matrix is symmetric, meaning that the covariance between feature A and feature B is the same as the covariance between feature B and feature A.
Eigenvalue Decomposition: PCA proceeds by performing eigenvalue decomposition on the covariance matrix. This involves finding the eigenvalues and corresponding eigenvectors of the matrix. The eigenvectors are the directions in the original feature space, and the eigenvalues represent the amount of variance explained by each eigenvector.
Orthogonality Emerges: The key observation is that the eigenvectors (principal components) obtained from the eigenvalue decomposition of the covariance matrix are orthogonal to each other. This orthogonality arises from the properties of the covariance matrix. Here’s why:
Symmetric matrices always have orthogonal eigenvectors. In other words, the eigenvectors extracted from a symmetric matrix are perpendicular to each other. Orthogonal Principal Components: As a result of this property, the principal components obtained from PCA are orthogonal axes in the original feature space. This orthogonality means that they are uncorrelated with each other, capturing different directions of maximum variance in the data. The first principal component represents the direction of maximum variance, the second principal component represents the second most significant variance, and so on.
Dimensionality Reduction: The orthogonality of principal components allows for efficient dimensionality reduction. By selecting a subset of the principal components and projecting the data onto these components, you can reduce the dimensionality of the data while preserving most of its variance. This is particularly valuable in cases where the original data has a high dimensionality, as it simplifies analysis and visualization.
In summary, the orthogonality of principal components in PCA emerges from the properties of the covariance matrix, which is symmetric. This orthogonality ensures that the principal components capture different directions of maximum variance in the data, making PCA a powerful technique for dimensionality reduction and data analysis.
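A minimal sketch to verify this numerically with the pc object computed above: the loading matrix should be (approximately) orthonormal and the component scores uncorrelated.
round(t(pc$rotation) %*% pc$rotation, 10)   # close to the identity matrix: loadings are orthonormal
round(cor(pc$x), 10)                        # off-diagonal entries are zero: scores are uncorrelated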
We are going to use the pairs.panels() function; the principal component scores are stored in pc$x.
pairs.panels(pc$x,
gap = 0,
bg = c("red","yellow","blue")[training$Species],
pch = 21)
The correlation coefficients are zero, implying that there is no correlation between the four principal components. Since the principal components are orthogonal to each other, the correlation between them is always zero, which helps us get rid of the multicollinearity problem.
library(devtools)
library(ggbiplot)
g <- ggbiplot(pc,
obs.scale = 1,
var.scale = 1,
groups = training$Species,
ellipse = TRUE,
circle = TRUE,
ellipse.prob = 0.68)
g <- g + scale_color_discrete(name = '')
g <- g + theme(legend.direction = 'horizontal',
legend.position = 'top')
print(g)
On the x-axis we have PC1, which explains 74.3% of the variability in the independent variables, while on the y-axis we have PC2, which explains approximately 21.7%. The species are color coded. Each ellipse is drawn to capture 68% of the data for its species, as specified in the code above. The angle between two arrows tells us about the correlation between the corresponding variables: for instance, Petal Length and Petal Width are strongly correlated, which you can confirm from the earlier correlation plot. Similarly, Sepal Length has a high correlation with Petal Length and Petal Width, whereas the correlation between Sepal Width and the other variables is very low. Additionally, the correlation between PC1 and Sepal Length, Petal Length, and Petal Width is positive, while Sepal Width and PC1 are negatively correlated. PC2 has a negative correlation with all the variables, and its most strongly negative correlation is with Sepal Width.
trg <- predict(pc, training)
head (trg,20)
PC1 PC2 PC3 PC4
1 -2.316985 -0.51656076 0.15198103 0.024535104
2 -2.101400 0.65462004 0.22783845 0.113119002
3 -2.400406 0.31668054 -0.04520373 0.039803496
6 -2.152380 -1.55087801 0.02331317 -0.002195439
7 -2.495085 -0.08471701 -0.32985812 -0.025509039
9 -2.353502 1.10001571 -0.17086821 -0.007190525
10 -2.211116 0.45029655 0.25058344 -0.032999514
12 -2.374584 -0.16697547 -0.08523184 -0.128031117
13 -2.239579 0.71407620 0.22099743 0.012847900
14 -2.659690 0.94960025 -0.20119026 0.001221504
15 -2.277826 -1.91326086 0.54444439 0.178247932
17 -2.284673 -1.54287422 0.06021963 0.182340825
18 -2.242983 -0.53011268 0.06884854 0.094800487
19 -1.965059 -1.45916436 0.42752708 0.047275401
22 -2.273484 -0.97741447 -0.12583235 0.059493671
23 -2.838956 -0.49491022 -0.31214094 0.029323708
25 -2.275364 -0.17297831 -0.11291169 -0.266433315
26 -1.971075 0.60471390 0.29935873 0.050856588
27 -2.098225 -0.28588779 -0.07154981 0.072511087
28 -2.219733 -0.56446595 0.23272793 0.008406757
trg <- data.frame(trg, training[5])
head(trg,5)
PC1 PC2 PC3 PC4 Species
1 -2.316985 -0.51656076 0.15198103 0.024535104 setosa
2 -2.101400 0.65462004 0.22783845 0.113119002 setosa
3 -2.400406 0.31668054 -0.04520373 0.039803496 setosa
6 -2.152380 -1.55087801 0.02331317 -0.002195439 setosa
7 -2.495085 -0.08471701 -0.32985812 -0.025509039 setosa
tst <- predict(pc, testing)
tst <- data.frame(tst, testing[5])
head(tst,5)
PC1 PC2 PC3 PC4 Species
4 -2.329649 0.5744574 -0.10246958 -0.05275129 setosa
5 -2.449951 -0.6865310 0.01084664 -0.03518968 setosa
8 -2.279301 -0.2567830 0.10394179 -0.02188561 setosa
11 -2.228953 -1.0880234 0.31035317 0.00898006 setosa
16 -2.369933 -2.7639612 0.05588250 0.03149451 setosa
Our data is now ready for further analysis. In this case, we are going to fit a multinomial logistic regression model on the principal components. We will use only PC1 and PC2, because together they capture more than 95% of the variability.
library(nnet)
trg$Species <- relevel(trg$Species, ref = "setosa")
mymodel <- multinom(Species~ PC1 + PC2, data = trg)
# weights: 12 (6 variable)
initial value 132.932087
iter 10 value 19.674294
iter 20 value 17.413814
iter 30 value 17.299839
iter 40 value 17.297242
iter 50 value 17.296292
final value 17.294539
converged
summary(mymodel)
Call:
multinom(formula = Species ~ PC1 + PC2, data = trg)
Coefficients:
(Intercept) PC1 PC2
versicolor 10.584590 15.26832 4.019053
virginica 3.007363 21.95663 4.228617
Std. Errors:
(Intercept) PC1 PC2
versicolor 273.3561 177.9735 223.6350
virginica 273.3636 177.9818 223.6359
Residual Deviance: 34.58908
AIC: 46.58908
p <- predict(mymodel, trg)
confusionMatrix(p, trg$Species)
Confusion Matrix and Statistics
Reference
Prediction setosa versicolor virginica
setosa 39 0 0
versicolor 0 38 4
virginica 0 5 35
Overall Statistics
Accuracy : 0.9256
95% CI : (0.8635, 0.9654)
No Information Rate : 0.3554
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.8884
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: setosa Class: versicolor Class: virginica
Sensitivity 1.0000 0.8837 0.8974
Specificity 1.0000 0.9487 0.9390
Pos Pred Value 1.0000 0.9048 0.8750
Neg Pred Value 1.0000 0.9367 0.9506
Prevalence 0.3223 0.3554 0.3223
Detection Rate 0.3223 0.3140 0.2893
Detection Prevalence 0.3223 0.3471 0.3306
Balanced Accuracy 1.0000 0.9162 0.9182
The provided information represents the performance evaluation of a classification model through a confusion matrix and various statistical metrics. The confusion matrix reveals that the model has correctly identified 39 samples of the “setosa” class, 38 samples of the “versicolor” class, and 35 samples of the “virginica” class, with very few misclassifications. Overall, the model demonstrates a high accuracy of 92.56%, significantly outperforming the No Information Rate (NIR) of 35.54%. The Kappa statistic also indicates substantial agreement between predictions and actual values. In a class-specific analysis, the model shows good sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) across all classes, reflecting its ability to correctly classify samples from each class. Additionally, the balanced accuracy metric indicates a well-balanced performance across classes. Overall, this model appears to perform impressively well in classifying the three classes, with a particular strength in distinguishing the “setosa” class and reasonable performance in the “versicolor” and “virginica” classes.
p1 <- predict(mymodel, tst)
confusionMatrix(p1, tst$Species)
Confusion Matrix and Statistics
Reference
Prediction setosa versicolor virginica
setosa 11 0 0
versicolor 0 7 2
virginica 0 0 9
Overall Statistics
Accuracy : 0.931
95% CI : (0.7723, 0.9915)
No Information Rate : 0.3793
P-Value [Acc > NIR] : 7.016e-10
Kappa : 0.8961
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: setosa Class: versicolor Class: virginica
Sensitivity 1.0000 1.0000 0.8182
Specificity 1.0000 0.9091 1.0000
Pos Pred Value 1.0000 0.7778 1.0000
Neg Pred Value 1.0000 1.0000 0.9000
Prevalence 0.3793 0.2414 0.3793
Detection Rate 0.3793 0.2414 0.3103
Detection Prevalence 0.3793 0.3103 0.3103
Balanced Accuracy 1.0000 0.9545 0.9091
K-means clustering is a widely used unsupervised machine learning algorithm that is employed for data segmentation and pattern recognition. It’s a partitioning algorithm that categorizes data points into K distinct and non-overlapping clusters, where K is a user-defined parameter. The algorithm’s objective is to minimize the sum of squared distances (variance) within each cluster, making it a “centroid-based” clustering method.
The K-means algorithm can be summarized in four key steps:
Initialization: It begins with the selection of K initial cluster centroids. These centroids can be randomly chosen from the dataset or through other initialization methods like K-means++ to enhance the convergence of the algorithm. The centroids represent the initial cluster centers.
Assignment: In this step, each data point is assigned to the cluster whose centroid is closest to it. The distance metric, often Euclidean distance, is used to determine the proximity of a point to each centroid. Consequently, each data point becomes a member of the nearest cluster.
Update Centroids: After assigning data points to clusters, the algorithm recalculates the centroids of each cluster. These new centroids are computed by taking the mean (average) of all data points in each cluster. The centroids are updated iteratively until they converge or until a predefined convergence criterion is met.
Iteration: Steps 2 and 3 are repeated iteratively until either the centroids do not change significantly between iterations or a maximum number of iterations is reached. The algorithm converges when the centroids stabilize, indicating that the clusters have been formed.
K-means has several applications, including image compression, customer segmentation, anomaly detection, and more. However, it has some limitations. It assumes that clusters are spherical and equally sized, which may not always hold in real-world data. It is also sensitive to the initial placement of centroids and may converge to local minima. Addressing these limitations may involve using alternative clustering algorithms, such as hierarchical clustering or density-based clustering, depending on the nature of the data and the goals of the analysis. Proper evaluation and validation of clustering results are crucial to ensuring the effectiveness of K-means in a given context.
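A minimal k-means sketch on the iris measurements, assuming K = 3 and scaled features; cluster labels are arbitrary, so the table only shows how the clusters line up with the known species.
set.seed(123)
km <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 25)
table(cluster = km$cluster, species = iris$Species)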
Naive Bayes is a simple yet effective classification algorithm widely used in machine learning and natural language processing. It’s based on Bayes’ theorem and makes a “naive” assumption that all features are conditionally independent given the class label, which means that the presence or absence of one feature does not affect the presence or absence of another feature. This simplifying assumption allows for efficient and straightforward probabilistic classification.
The core idea behind Naive Bayes is to compute the probability of a given instance belonging to a particular class based on the joint probabilities of the individual features and the class label. This is done using Bayes’ theorem, which calculates the posterior probability of the class given the data. Mathematically, it can be represented as:
\[ P(\text{class} \mid \text{features}) = \frac{P(\text{class}) \, P(\text{features} \mid \text{class})}{P(\text{features})} \]
where:
\[ P(\text{class} \mid \text{features}) \] is the posterior probability of the class given the features;
\[ P(\text{class}) \] is the prior probability of the class;
\[ P(\text{features} \mid \text{class}) \] is the likelihood of the features given the class; and
\[ P(\text{features}) \] is the marginal probability of the features.
In practice, the joint likelihood P(features | class) can be challenging to estimate, so Naive Bayes makes the simplifying assumption that all features are conditionally independent given the class, which allows us to factorize it as the product of individual probabilities:
\[ P(\text{features} \mid \text{class}) = P(x_1 \mid \text{class}) \, P(x_2 \mid \text{class}) \cdots P(x_n \mid \text{class}) \]
This simplification significantly reduces the computational complexity and is the reason why it’s called “naive.” Naive Bayes is commonly used in text classification tasks like spam detection and sentiment analysis. It’s particularly efficient with high-dimensional data, where other algorithms may suffer from the curse of dimensionality. Despite its simplicity and the independence assumption, Naive Bayes often performs surprisingly well, especially when the features are conditionally independent or when the data is well-suited to the independence assumption. However, it may not work well when there are strong dependencies between features, as it fails to capture these relationships.
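A hedged sketch of Naive Bayes in R, using the e1071 package already loaded above and the iris data again; with numeric features this fits a Gaussian Naive Bayes.
library(e1071)
nb <- naiveBayes(Species ~ ., data = iris)
nb_pred <- predict(nb, iris[, -5])
table(predicted = nb_pred, actual = iris$Species)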
Extreme Gradient Boosting (XGBoost) is a powerful and widely-used ensemble machine learning algorithm known for its exceptional performance in various predictive modeling tasks, particularly in structured/tabular data problems. XGBoost belongs to the gradient boosting family of algorithms, which iteratively improves the predictive capability of a weak learner, typically decision trees, by combining them into a strong predictive model. What sets XGBoost apart is its efficiency, scalability, and regularized learning approach.
XGBoost employs a gradient boosting framework, where each new decision tree is trained to correct the errors made by the ensemble of previous trees. It optimizes a loss function, typically a measure of prediction error, by fitting each tree to the negative gradient of the loss. This way, it focuses on the most challenging examples, continuously refining its predictions. XGBoost incorporates several techniques to improve model performance, such as regularization, which helps prevent overfitting, and a second-order approximation of the loss function for more accurate and faster convergence. Moreover, it allows users to specify custom evaluation metrics and handles missing data gracefully. XGBoost’s parallel processing capabilities and the use of distributed computing frameworks make it highly scalable and suitable for large datasets. Its success has led to its integration into various machine learning libraries and platforms, making it accessible to a wide range of users.
In practice, XGBoost has demonstrated its effectiveness across a multitude of applications, including Kaggle competitions and real-world problems in areas like finance, healthcare, and recommendation systems. Its versatility, speed, and ability to handle structured data make it a preferred choice for many data scientists and machine learning practitioners when seeking state-of-the-art performance in predictive modeling tasks. As a result, XGBoost has become an essential tool in the machine learning toolbox and continues to drive advances in the field.
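A minimal, hedged sketch of fitting a multiclass XGBoost model in R with the xgboost package (assuming it is installed and using its classic xgboost() interface); class labels must be zero-based integers for the multi:softmax objective.
library(xgboost)
X <- as.matrix(iris[, 1:4])
y <- as.integer(iris$Species) - 1            # classes encoded as 0, 1, 2
bst <- xgboost(data = X, label = y, nrounds = 50,
               objective = "multi:softmax", num_class = 3,
               max_depth = 3, eta = 0.3, verbose = 0)
table(predicted = predict(bst, X), actual = y)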