Machine learning (ML) is a subset of artificial intelligence (AI) that focuses on the development of algorithms and statistical models that enable computer systems to learn from and make predictions or decisions based on data. In other words, it’s a field of study and practice where computers are trained to perform tasks without being explicitly programmed for each task.
Supervised learning and unsupervised learning are two fundamental categories of machine learning, each with its own distinct characteristics and use cases:
Definition: Supervised learning is a type of machine learning where the algorithm learns from labeled training data. It involves a clear relationship between input and output, where the model is trained to map input data to corresponding target labels.
Key Characteristics: * The training data consists of pairs of input features (attributes or variables) and their corresponding target labels. * The goal is to learn a mapping function that can accurately predict the target label for new, unseen data based on the input features. * Supervised learning tasks include classification (predicting discrete labels) and regression (predicting continuous values).
Examples: * Classification: Spam email detection, image recognition, sentiment analysis. * Regression: House price prediction, stock price forecasting, temperature prediction.
Evaluation Metrics: * For classification: Accuracy, precision, recall, F1-score, ROC curve, etc. * For regression: Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), R-squared, etc.
Data pre-processing, feature engineering, model selection, training, evaluation, and testing are common steps in a supervised learning pipeline.
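As a concrete illustration of the classification metrics listed above, here is a minimal R sketch using toy vectors (not tied to any dataset in this document):
# Toy example: accuracy, precision, recall and F1 for a binary classifier.
actual    <- factor(c(1, 0, 1, 1, 0, 1, 0, 0, 1, 0), levels = c(0, 1))
predicted <- factor(c(1, 0, 1, 0, 0, 1, 1, 0, 1, 0), levels = c(0, 1))
cm <- table(Predicted = predicted, Actual = actual)
accuracy  <- sum(diag(cm)) / sum(cm)
precision <- cm["1", "1"] / sum(cm["1", ])   ### TP / (TP + FP)
recall    <- cm["1", "1"] / sum(cm[, "1"])   ### TP / (TP + FN)
f1        <- 2 * precision * recall / (precision + recall)
round(c(accuracy = accuracy, precision = precision, recall = recall, F1 = f1), 3)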
Definition: Unsupervised learning is a type of machine learning where the algorithm learns from unlabeled data. It aims to discover patterns, structures, or relationships in the data without explicit guidance or target labels.
Key Characteristics: * The training data does not contain labeled target information. * The primary objective is to explore the inherent structure within the data, such as clustering similar data points or reducing dimensionality.
Examples: * Clustering: Grouping similar customer profiles for targeted marketing. * Dimensionality Reduction: Reducing the number of features while preserving essential information (e.g., Principal Component Analysis).
Evaluation Metrics: Evaluation in unsupervised learning can be more challenging since there are no ground truth labels. Metrics depend on the specific task, such as silhouette score for clustering or explained variance ratio for dimensionality reduction.
Workflow: * Data preprocessing, model selection, training, and evaluation are steps in an unsupervised learning pipeline. * Visualization and domain knowledge are often essential for interpreting results.
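As a brief illustration of the unsupervised metrics mentioned above, the sketch below (which assumes the cluster package is installed and uses the built-in iris data purely for illustration) computes an average silhouette width for a k-means solution and the explained variance ratio from a principal component analysis:
library(cluster)   ### for the silhouette() function
x   <- scale(iris[, 1:4])
km  <- kmeans(x, centers = 3, nstart = 25)
sil <- silhouette(km$cluster, dist(x))
mean(sil[, "sil_width"])                  ### average silhouette width (closer to 1 is better)
pca <- prcomp(x)
round(pca$sdev^2 / sum(pca$sdev^2), 3)    ### explained variance ratio per component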
In addition to supervised and unsupervised learning, there is also a category known as semi-supervised learning, which combines elements of both. In semi-supervised learning, you have a small amount of labeled data and a larger amount of unlabeled data. The goal is to leverage both sources of information to build better models.
In summary, supervised learning is about learning from labeled data to make predictions, unsupervised learning is about discovering patterns in unlabeled data, and semi-supervised learning combines elements of both to leverage labeled and unlabeled data for learning and prediction.
Classification is a fundamental concept in machine learning that involves categorizing data into predefined classes or labels based on the features or attributes of the data. Classification models are widely used in various fields, including finance, healthcare, natural language processing, image recognition, and more. Here’s a comprehensive write-up on classification models in machine learning:
Classification is a supervised learning task where the goal is to assign a class label to an input data point. The classes can be binary (two classes, often referred to as positive and negative) or multi-class (more than two classes). The objective of classification is to build a model that can generalize from the training data to make accurate predictions on new, unseen data.
Multinomial logistic regression is used in situations where the dependent variable has more than two categories. It is an extension of the basic (binary) logistic regression model. For example, if the dependent variable has three categories, A, B and C, we would estimate regression models of A vs B and A vs C; these models are estimated simultaneously. The model predicts the probabilities of a categorical dependent variable that has more than two outcome classes. The dependent variable is nominal in nature, meaning there is no ordering of the target classes, i.e. the classes cannot be meaningfully ranked (e.g. 1, 2, 3, 4). Examples of multinomial logistic regression topics include:
Favorite ice cream flavor * Dependent Variable: Ice cream flavor - Chocolate, Vanilla, Coffee, Strawberry * Independent Variables: Gender, Age, Ethnicity
Current employment status * Dependent Variable: Employment status - Full-time employment, Part-time employment, Training, Unemployed * Independent Variables: Gender, Age, Qualifications, Area a person lives, number of children
Sentencing outcome status * Dependent Variable: Sentencing status - Prison, Community Penalty, Fine, Discharge * Independent Variables: Gender, Age, Previous Convictions
Multinomial logistic regression assumes that:
1. Each independent variable has a single value for each case.
2. The dependent variable cannot be perfectly predicted from the independent variables for each case.
3. There is no multicollinearity between the independent variables. Multicollinearity occurs when two or more independent variables are highly correlated with each other, which makes it difficult to understand how much each independent variable contributes to the predicted category of the dependent variable.
4. There should be no outliers in the data points.
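In its general form (standard notation, not specific to any example above), a multinomial logistic regression with reference category K estimates one equation per non-reference category:
\[ \ln\left(\frac{P(Y = j)}{P(Y = K)}\right) = \beta_{0j} + \beta_{1j}x_1 + \cdots + \beta_{pj}x_p, \qquad j = 1, \dots, K-1 \]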
The following packages are required: sjmisc and sjPlot. To create the multinomial models the nnet package needs to be installed.
library(sjmisc)
library(sjPlot)
library(nnet)
library(wakefield)
library(dplyr)
library(caTools)
library(ROCR)
library(stargazer)
library(ISLR)
library(ISLR2)
library(MASS)
library(caret)
library(splines)
library(splines2)
library(pROC)
library(randomForest)
library(rpart)
library(rpart.plot)
library(rattle)
library(party)
library(partykit)
library(ggplot2)
For the purposes of this exercise the wakefield package will be used to generate a random data set. This will be a pseudo data set resembling the number of children in a local authority who are either a Child in Need (CIN), subject to a Child Protection Plan (CPP), or a Child Looked After (CLA).
set.seed(999)
df.CPP <-
r_data_frame(
n = 12500,
r_sample(n = 255, x = "CPP", prob = NULL, name = "Type"),
id,
r_sample( x = c("Male", "Female"),
prob = c(0.52, 0.48),
name = "Sex"),
r_sample( x = c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17"),
prob = c(0.12, 0.08, 0.08, 0.08, 0.08, 0.06, 0.06, 0.06, 0.06, 0.06, 0.038, 0.038, 0.038, 0.038, 0.038, 0.038, 0.01, 0.01),
name = "Age")
)
set.seed(999)
df.CIN <-
r_data_frame(
n = 97000,
r_sample(n = 255, x = "CIN", prob = NULL, name = "Type"),
id,
r_sample( x = c("Male", "Female"),
prob = c(0.55, 0.45),
name = "Sex"),
r_sample( x = c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17"),
prob = c(0.05, 0.04, 0.04, 0.04, 0.04, 0.044, 0.044, 0.044, 0.044, 0.044, 0.053, 0.053, 0.053, 0.053, 0.053, 0.053, 0.115, 0.115),
name = "Age")
)
set.seed(999)
df.CLA <-
r_data_frame(
n = 16200,
r_sample(n = 255, x = "CLA", prob = NULL, name = "Type"),
id,
r_sample( x = c("Male", "Female"),
prob = c(0.57, 0.43),
name = "Sex"),
r_sample( x = c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17"),
prob = c(0.05, 0.035, 0.035, 0.035, 0.035, 0.038, 0.038, 0.038, 0.038, 0.038, 0.065, 0.065, 0.065, 0.065, 0.065, 0.065, 0.115, 0.115),
name = "Age")
)
mydata <- rbind(df.CIN, df.CLA, df.CPP)
head(mydata,5)
# A tibble: 5 × 4
Type ID Sex Age
<chr> <chr> <chr> <chr>
1 CIN 00001 Female 16
2 CIN 00002 Male 10
3 CIN 00003 Male 2
4 CIN 00004 Male 14
5 CIN 00005 Female 14
This set of data contains four variables. The dependent variable is Type, which details the Children's Services outcome for each child. Using multinomial modelling we will explore how this outcome varies by sex. The next step is to request the frequencies of these two variables to check how they are coded.
frq(mydata, Type, Sex)
Type <character>
# total N=125700 valid N=125700 mean=1.33 sd=0.65
Value | N | Raw % | Valid % | Cum. %
----------------------------------------
CIN | 97000 | 77.17 | 77.17 | 77.17
CLA | 16200 | 12.89 | 12.89 | 90.06
CPP | 12500 | 9.94 | 9.94 | 100.00
<NA> | 0 | 0.00 | <NA> | <NA>
Sex <character>
# total N=125700 valid N=125700 mean=1.55 sd=0.50
Value | N | Raw % | Valid % | Cum. %
-----------------------------------------
Female | 56513 | 44.96 | 44.96 | 44.96
Male | 69187 | 55.04 | 55.04 | 100.00
<NA> | 0 | 0.00 | <NA> | <NA>
The most common outcome was CIN (77% of all children), followed by CLA (13%) and CPP (10%). We can also see that there were 69,187 males and 56,513 females. It is useful to see whether Type differs by Sex with a cross-tabulation. Here we can see that CLA outcomes are slightly more likely for males, while CPP outcomes are more common for females.
sjt.xtab(mydata$Type, mydata$Sex, show.col.prc = TRUE)
| Type | Female | Male | Total |
|---|---|---|---|
| CIN | 43647 (77.2 %) | 53353 (77.1 %) | 97000 (77.2 %) |
| CLA | 6931 (12.3 %) | 9269 (13.4 %) | 16200 (12.9 %) |
| CPP | 5935 (10.5 %) | 6565 (9.5 %) | 12500 (9.9 %) |
| Total | 56513 (100 %) | 69187 (100 %) | 125700 (100 %) |

χ² = 63.131 · df = 2 · Cramér's V = 0.022 · p = 0.000
To run this regression the first step is to create a new “factor” variable, Type.f that replicates the values of Type but which replaces the values with text labels. Check that the coding of the variable has worked.
mydata$Type.f <- factor(mydata$Type, labels = c("CIN", "CLA", "CPP"))
head(mydata, n = 10)
# A tibble: 10 × 5
Type ID Sex Age Type.f
<chr> <chr> <chr> <chr> <fct>
1 CIN 00001 Female 16 CIN
2 CIN 00002 Male 10 CIN
3 CIN 00003 Male 2 CIN
4 CIN 00004 Male 14 CIN
5 CIN 00005 Female 14 CIN
6 CIN 00006 Female 16 CIN
7 CIN 00007 Male 16 CIN
8 CIN 00008 Female 17 CIN
9 CIN 00009 Male 7 CIN
10 CIN 00010 Female 2 CIN
The next step of the multinomial logistic regression is to select a reference category. By default R orders the factors alphabetically, so would treat CIN as the reference group. However, it is possible to change this using the relevel command. We will choose CLA as the reference group.
mydata$Type.f <- relevel(mydata$Type.f, ref = "CLA")
It is useful to create a new dummy variable, Male, which takes the value 0 if female and 1 if male. The purpose of this is to make the sex effect clearer to see in the model.
mydata$Male <- rec(mydata$Sex, rec = "Male = 1; Female = 0; else = NA")
Finally, the multinom command is used to run a multinomial model. The summary() command can be used once the model has converged to view the results.
### Fit the model
model1 <-multinom(Type.f ~ Male, data = mydata)
# weights: 9 (4 variable)
initial value 138095.564686
iter 10 value 87153.669778
iter 10 value 87153.669764
final value 87153.669764
converged
### Model summary
summary(model1)
Call:
multinom(formula = Type.f ~ Male, data = mydata)
Coefficients:
(Intercept) Male1
CIN 1.8401323 -0.08990659
CPP -0.1551774 -0.18968963
Std. Errors:
(Intercept) Male1
CIN 0.01293019 0.01714112
CPP 0.01768551 0.02393687
Residual Deviance: 174307.3
AIC: 174315.3
The multinomial regression model results are included in the Coefficients: table, with the associated standard errors in the Std. Errors: table. The (Intercept) column gives the estimates for the reference category (female), whilst the Male1 column gives the effect of being male compared to the reference. Model results are given in logits, so we need to take the exponential to read them as odds and odds ratios. This can be achieved with the exp(coef()) command.
exp(coef(model1))
(Intercept) Male1
CIN 6.2973712 0.9140166
CPP 0.8562633 0.8272158
To calculate the 95% confidence interval (CI) and restructure the results to make them easier to read, use the tab_model function.
tab_model(model1)
| Predictors | Odds Ratios | CI | p | Response |
|---|---|---|---|---|
| (Intercept) | 6.30 | 6.14 – 6.46 | <0.001 | CIN |
| Male [1] | 0.91 | 0.88 – 0.95 | <0.001 | CIN |
| (Intercept) | 0.86 | 0.83 – 0.89 | <0.001 | CPP |
| Male [1] | 0.83 | 0.79 – 0.87 | <0.001 | CPP |
| Observations | 125700 | | | |
| R2 / R2 adjusted | 0.000 / 0.000 | | | |
Here we can see that the odds of becoming CIN rather than CLA are 6.3 for females (the reference group). Compared to females, males have 0.91 times the odds of becoming CIN rather than CLA. Similarly, for males the odds of becoming subject to a Child Protection Plan rather than CLA are 0.83 times those of females. Underneath the table we can see the approximate R2, which is 0.000. We would therefore want to consider other potential predictors of Children's Services outcomes.
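The p-values shown by tab_model can also be obtained directly from the model object; the sketch below mirrors the two-tailed z-test used later in this document:
z1 <- summary(model1)$coefficients / summary(model1)$standard.errors
p1 <- (1 - pnorm(abs(z1), 0, 1)) * 2   ### two-tailed p-values for each coefficient
round(p1, 4)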
Sometimes it is desirable to request the predicted probabilities to better understand the chances of receiving different Children’s Services Outcomes. This can be achieved by using the fitted() command.
predprob <-fitted(model1)
head(predprob)
CLA CIN CPP
1 0.1226447 0.7723392 0.1050161
2 0.1339725 0.7711328 0.0948947
3 0.1339725 0.7711328 0.0948947
4 0.1339725 0.7711328 0.0948947
5 0.1226447 0.7723392 0.1050161
6 0.1226447 0.7723392 0.1050161
The first row of data relates to a female (see the first observation in the data above). The table illustrates that the probability of a female becoming CIN is 0.77, CPP is 0.11 and CLA is 0.12. The second row refers to a male, with probabilities of CIN = 0.77, CPP = 0.09 and CLA = 0.13. All rounded to 2 d.p.
It is possible to extend the model by adding in the age of the child/young person. This variable needs to be numeric. For the purposes of this model the reference category will be CLA.
mydata$Age <- as.numeric(mydata$Age)
model2 <-multinom(Type.f ~ Male + Age, data = mydata)
# weights: 12 (6 variable)
initial value 138095.564686
iter 10 value 86003.438226
iter 20 value 84340.937713
final value 84340.823598
converged
### model summary
summary(model2)
Call:
multinom(formula = Type.f ~ Male + Age, data = mydata)
Coefficients:
(Intercept) Male1 Age
CIN 2.010631 -0.08921876 -0.01681853
CPP 1.039220 -0.18518794 -0.14516870
Std. Errors:
(Intercept) Male1 Age
CIN 0.02088667 0.01714897 0.001589901
CPP 0.02602449 0.02437257 0.002306540
Residual Deviance: 168681.6
AIC: 168693.6
tab_model(model2)
| Predictors | Odds Ratios | CI | p | Response |
|---|---|---|---|---|
| (Intercept) | 7.47 | 7.17 – 7.78 | <0.001 | CIN |
| Male [1] | 0.91 | 0.88 – 0.95 | <0.001 | CIN |
| Age | 0.98 | 0.98 – 0.99 | <0.001 | CIN |
| (Intercept) | 2.83 | 2.69 – 2.97 | <0.001 | CPP |
| Male [1] | 0.83 | 0.79 – 0.87 | <0.001 | CPP |
| Age | 0.86 | 0.86 – 0.87 | <0.001 | CPP |
| Observations | 125700 | | | |
| R2 / R2 adjusted | 0.033 / 0.033 | | | |
The results in the table illustrate that the odds of becoming CIN rather than CLA fall as age increases: for every additional year, the odds fall by approximately 2% (1 - 0.98 = 0.02). This means that older children are less likely to become CIN rather than CLA. The same is true for the odds of CPP rather than CLA, which fall by approximately 14% per year. Underneath the table we can see the approximate R2, which is 0.03. We would therefore want to consider other potential predictors of Children's Services outcomes.
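The percentage changes in the odds quoted above can be read straight from the exponentiated coefficients; a minimal sketch using model2:
round((exp(coef(model2)) - 1) * 100, 1)   ### % change in the odds per one-unit increase, relative to the CLA reference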
Finally, if we want to explore the changes in predicted probability associated with sex whilst holding age constant, we can create another data set varying sex but holding Age at the sample mean.
sub.Type <- data.frame(Male = c(0,1), Age = mean(mydata$Age))
sub.Type$Male <- as.factor(sub.Type$Male)
p <- predict(model2, newdata = sub.Type, "probs")
barplot(p, ylab = "Probability", ylim=c(0,1), xlab = "Children's Service Outcome", beside = T, legend = TRUE)
For someone of average age, the predicted probability of becoming CIN is approximately 0.79 for both females and males. The probability of a female (of average age) becoming CPP is approximately 0.09 compared to 0.08 for a male, and the probability of a male (of average age) becoming CLA is approximately 0.14 compared to 0.12 for a female.
mydata1 <- read.csv("Cardiotocographic.csv", header = TRUE)
str(mydata1)
'data.frame': 2126 obs. of 22 variables:
$ LB : int 120 132 133 134 132 134 134 122 122 122 ...
$ AC : num 0 0.00638 0.00332 0.00256 0.00651 ...
$ FM : num 0 0 0 0 0 0 0 0 0 0 ...
$ UC : num 0 0.00638 0.00831 0.00768 0.00814 ...
$ DL : num 0 0.00319 0.00332 0.00256 0 ...
$ DS : num 0 0 0 0 0 0 0 0 0 0 ...
$ DP : num 0 0 0 0 0 ...
$ ASTV : int 73 17 16 16 16 26 29 83 84 86 ...
$ MSTV : num 0.5 2.1 2.1 2.4 2.4 5.9 6.3 0.5 0.5 0.3 ...
$ ALTV : int 43 0 0 0 0 0 0 6 5 6 ...
$ MLTV : num 2.4 10.4 13.4 23 19.9 0 0 15.6 13.6 10.6 ...
$ Width : int 64 130 130 117 117 150 150 68 68 68 ...
$ Min : int 62 68 68 53 53 50 50 62 62 62 ...
$ Max : int 126 198 198 170 170 200 200 130 130 130 ...
$ Nmax : int 2 6 5 11 9 5 6 0 0 1 ...
$ Nzeros : int 0 1 1 0 0 3 3 0 0 0 ...
$ Mode : int 120 141 141 137 137 76 71 122 122 122 ...
$ Mean : int 137 136 135 134 136 107 107 122 122 122 ...
$ Median : int 121 140 138 137 138 107 106 123 123 123 ...
$ Variance: int 73 12 13 13 11 170 215 3 3 1 ...
$ Tendency: int 1 0 0 1 1 0 0 1 1 1 ...
$ NSP : int 2 1 1 1 1 3 3 3 3 3 ...
From the structure of the data above, the first 21 variables are independent variables. The response NSP is coded as: NSP = 1, normal patient; NSP = 2, suspect patient; NSP = 3, pathologic patient.
* LB - FHR baseline (beats per minute)
* AC - # of accelerations per second
* FM - # of fetal movements per second
* UC - # of uterine contractions per second
* DL - # of light decelerations per second
* DS - # of severe decelerations per second
* DP - # of prolonged decelerations per second
* ASTV - percentage of time with abnormal short term variability
* MSTV - mean value of short term variability
* ALTV - percentage of time with abnormal long term variability
* MLTV - mean value of long term variability
* Width - width of FHR histogram
* Min - minimum of FHR histogram
* Max - maximum of FHR histogram
* Nmax - # of histogram peaks
* Nzeros - # of histogram zeros
* Mode - histogram mode
* Mean - histogram mean
* Median - histogram median
* Variance - histogram variance
* Tendency - histogram tendency
* CLASS - FHR pattern class code (1 to 10)
* NSP - fetal state class code (N = normal; S = suspect; P = pathologic)
mydata1$NSP <- as.factor(mydata1$NSP)
str(mydata1)
'data.frame': 2126 obs. of 22 variables:
$ LB : int 120 132 133 134 132 134 134 122 122 122 ...
$ AC : num 0 0.00638 0.00332 0.00256 0.00651 ...
$ FM : num 0 0 0 0 0 0 0 0 0 0 ...
$ UC : num 0 0.00638 0.00831 0.00768 0.00814 ...
$ DL : num 0 0.00319 0.00332 0.00256 0 ...
$ DS : num 0 0 0 0 0 0 0 0 0 0 ...
$ DP : num 0 0 0 0 0 ...
$ ASTV : int 73 17 16 16 16 26 29 83 84 86 ...
$ MSTV : num 0.5 2.1 2.1 2.4 2.4 5.9 6.3 0.5 0.5 0.3 ...
$ ALTV : int 43 0 0 0 0 0 0 6 5 6 ...
$ MLTV : num 2.4 10.4 13.4 23 19.9 0 0 15.6 13.6 10.6 ...
$ Width : int 64 130 130 117 117 150 150 68 68 68 ...
$ Min : int 62 68 68 53 53 50 50 62 62 62 ...
$ Max : int 126 198 198 170 170 200 200 130 130 130 ...
$ Nmax : int 2 6 5 11 9 5 6 0 0 1 ...
$ Nzeros : int 0 1 1 0 0 3 3 0 0 0 ...
$ Mode : int 120 141 141 137 137 76 71 122 122 122 ...
$ Mean : int 137 136 135 134 136 107 107 122 122 122 ...
$ Median : int 121 140 138 137 138 107 106 123 123 123 ...
$ Variance: int 73 12 13 13 11 170 215 3 3 1 ...
$ Tendency: int 1 0 0 1 1 0 0 1 1 1 ...
$ NSP : Factor w/ 3 levels "1","2","3": 2 1 1 1 1 3 3 3 3 3 ...
Let us make use of set.seed() so that if you partition the data set again you do not get different train and test data sets.
ind <- sample(2, nrow(mydata1),
replace = TRUE,
prob = c(0.6, 0.4))
training <- mydata1[ind==1,]
testing <- mydata1[ind ==2,]
From the code above, we can see that the training data set has 1277 observations of 22 variables, while the testing data set has 849 observations of 22 variables.
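The partition can be checked directly; a quick sketch (the exact counts depend on the random draw):
dim(training); dim(testing)                 ### rows and columns in each partition
round(prop.table(table(training$NSP)), 3)   ### class balance in the training set
round(prop.table(table(testing$NSP)), 3)    ### class balance in the testing set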
## Load the following library
library(nnet) ### nnet stands for neural networks
One important thing we must tell the system is the reference level. Out of levels 1, 2 and 3, which is our reference level? In this case, the reference level will be the normal patient (NSP = 1).
training$NSP <- relevel(training$NSP, ref = "1")
In the command below, we are going to use all of the other variables as independent variables.
#### the model
mymodel <- multinom(NSP~.,data = training)
# weights: 69 (44 variable)
initial value 1405.125117
iter 10 value 570.285714
iter 20 value 544.822165
iter 30 value 407.104754
iter 40 value 370.928272
iter 50 value 365.143677
iter 60 value 311.150926
iter 70 value 309.067691
iter 80 value 292.156603
iter 90 value 287.381568
iter 100 value 287.381315
final value 287.381315
stopped after 100 iterations
You can see that the fitting runs through 100 iterations and then stops. From the results, the first iteration starts with a very high error and the error keeps reducing; by iteration 100 the error is essentially constant.
### Summary of the model
summary(mymodel)
Call:
multinom(formula = NSP ~ ., data = training)
Coefficients:
(Intercept) LB AC FM UC DL DS
2 -17.31012 -0.02760348 -974.35380 9.892921 -338.8540 -83.49005 -0.2330054
3 -15.41322 0.45653346 -91.35794 18.082785 -450.5167 -235.30982 8.9150235
DP ASTV MSTV ALTV MLTV Width Min
2 88.28689 0.06818958 -0.3194275 0.01767543 -0.03723673 0.006975364 0.01753461
3 125.06839 0.17025220 -1.1293653 0.06533746 0.07014182 0.013265491 0.02272449
Max Nmax Nzeros Mode Mean Median
2 0.02450997 0.1348233 -0.07132604 -0.02765878 0.3191267 -0.2073014
3 0.03598999 -0.3105369 0.66652874 0.03184939 -0.1523690 -0.3815680
Variance Tendency
2 0.05909011 0.2512291
3 0.07054533 0.7363657
Std. Errors:
(Intercept) LB AC FM UC DL
2 2.033101 0.03373358 0.001820328 0.1559106 0.007687022 0.002437925
3 0.458621 0.05064984 0.001237864 0.1546391 0.004267391 0.001239707
DS DP ASTV MSTV ALTV MLTV
2 2.762331e-05 6.180971e-05 0.01306283 0.3053944 0.007186479 0.04043266
3 2.739531e-05 1.778715e-04 0.02131316 0.4800381 0.012537553 0.06574320
Width Min Max Nmax Nzeros Mode Mean
2 0.005392413 0.009131205 0.01189680 0.06922511 0.1658503 0.04037489 0.06747207
3 0.007502056 0.013456759 0.01305705 0.13035293 0.3267609 0.04546155 0.05035061
Median Variance Tendency
2 0.08923069 0.01075478 0.3265719
3 0.07095862 0.01331678 0.4978992
Residual Deviance: 574.7626
AIC: 658.7626
From the results above, we estimated the model with all 21 independent variables, so we need to check which variables are statistically significant. To do this we conduct a two-tailed z-test.
### 2-tailed z-test
z <- summary(mymodel)$coefficients/summary(mymodel)$standard.errors
p <- (1-pnorm(abs(z),0,1)) *2
p
(Intercept) LB AC FM UC DL DS DP ASTV MSTV ALTV
2 0 0.4131978 0 0 0 0 0 0 1.788047e-07 0.29558369 1.391154e-02
3 0 0.0000000 0 0 0 0 0 0 1.332268e-15 0.01863979 1.874807e-07
MLTV Width Min Max Nmax Nzeros Mode
2 0.3570731 0.19582041 0.05482053 0.039377951 0.05146201 0.66714998 0.4933130
3 0.2860143 0.07701958 0.09127608 0.005844775 0.01720589 0.04136971 0.4835661
Mean Median Variance Tendency
2 2.247844e-06 2.016802e-02 3.922364e-08 0.4417201
3 2.476815e-03 7.559813e-08 1.174137e-07 0.1391549
From the results above, we can drop the variables that are insignificant; in other words, such variables are not contributing anything to the model. For instance, we can drop MLTV, Width, Min, Mode, Nmax, Nzeros and Tendency. We can now run the model again.
mymodel <- multinom(NSP~. -MLTV -Width -Min -Mode -Nmax - Nzeros -Tendency,data = training)
# weights: 48 (30 variable)
initial value 1405.125117
iter 10 value 553.294583
iter 20 value 422.401296
iter 30 value 390.556415
iter 40 value 374.779936
iter 50 value 317.131571
iter 60 value 315.331314
iter 70 value 311.344612
iter 80 value 309.705667
iter 90 value 308.639685
iter 100 value 301.591114
final value 301.591114
stopped after 100 iterations
summary(mymodel)
Call:
multinom(formula = NSP ~ . - MLTV - Width - Min - Mode - Nmax -
Nzeros - Tendency, data = training)
Coefficients:
(Intercept) LB AC FM UC DL DS
2 -15.79030 0.007146002 -674.8893 8.888663 -290.0121 -105.6467 -0.2778796
3 -13.84179 0.389746487 -71.2671 13.133329 -305.7423 -178.6945 6.8575186
DP ASTV MSTV ALTV Max Mean Median
2 58.26660 0.06138004 -0.4759162 0.02116653 0.03138541 0.21240327 -0.1629087
3 70.41143 0.14405678 -1.0565171 0.05530839 0.01752014 -0.09907106 -0.2902961
Variance
2 0.04740086
3 0.05509592
Std. Errors:
(Intercept) LB AC FM UC DL
2 1.8386025 0.03127154 0.011449897 2.164807 0.010586139 0.007383440
3 0.4227068 0.04389029 0.002072531 1.096782 0.007202979 0.003833048
DS DP ASTV MSTV ALTV Max
2 7.227337e-05 0.003915843 0.01192198 0.3209805 0.006273795 0.01129398
3 7.301935e-05 0.001910488 0.02013791 0.4395718 0.009685572 0.01125022
Mean Median Variance
2 0.05776442 0.06491073 0.009016669
3 0.03775549 0.04616799 0.011546313
Residual Deviance: 603.1822
AIC: 663.1822
z <- summary(mymodel)$coefficients/summary(mymodel)$standard.errors
p <- (1-pnorm(abs(z),0,1)) *2
p
(Intercept) LB AC FM UC DL DS DP ASTV MSTV
2 0 0.8192463 0 4.025988e-05 0 0 0 0 2.626099e-07 0.13815543
3 0 0.0000000 0 0.000000e+00 0 0 0 0 8.457679e-13 0.01623832
ALTV Max Mean Median Variance
2 7.413829e-04 0.00545349 0.0002359368 1.208218e-02 1.464039e-07
3 1.127175e-08 0.11939559 0.0086899387 3.219538e-10 1.826478e-06
The model above is now our final model. We will use the coefficients in this model to write the equations:
\[ \ln\left(\frac{P(NSP=2)}{P(NSP=1)}\right) = -15.79030 + 0.007146002*LB - 674.8893*AC + 8.888663*FM - 290.0121*UC - 105.6467*DL - 0.2778796*DS + 58.26660*DP + \cdots + 0.04740086*Variance \]
\[ \ln\left(\frac{P(NSP=3)}{P(NSP=1)}\right) = -13.84179 + 0.389746487*LB - 71.2671*AC + 13.133329*FM - 305.7423*UC - 178.6945*DL + 6.8575186*DS + 70.41143*DP + \cdots + 0.05509592*Variance \]
For the first equation, the left-hand side is the natural logarithm of the ratio of the probability that NSP is 2 to the probability that NSP is 1; that is, the log-odds of the patient being suspect rather than normal.
For the second equation, the left-hand side is the natural logarithm of the ratio of the probability that NSP is 3 to the probability that NSP is 1; that is, the log-odds of the patient being pathologic rather than normal.
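To convert these two log-odds equations into the three class probabilities, the standard multinomial (softmax) transformation applies. Writing η2 and η3 for the right-hand sides of the two equations above:
\[ P(NSP=1) = \frac{1}{1 + e^{\eta_2} + e^{\eta_3}}, \qquad P(NSP=2) = \frac{e^{\eta_2}}{1 + e^{\eta_2} + e^{\eta_3}}, \qquad P(NSP=3) = \frac{e^{\eta_3}}{1 + e^{\eta_2} + e^{\eta_3}} \]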
## Confusion Matrix
### Misclassification Error
In this step we will create the confusion matrix for the model, for both the training and the test data sets. We shall also calculate the misclassification error for both data sets.
Remember that we created the model using only the variables that are significant; variables that were not significant were removed. We will therefore use our preferred model to create the confusion matrix.
### Confusion matrix for the training data set
p <- predict(mymodel, training)
head(p)
[1] 2 1 1 2 1 2
Levels: 1 2 3
The results show the classification of the first six patients. We can now compare these with the actual patients in our training data set.
head(training$NSP)
[1] 2 1 1 3 3 2
Levels: 1 2 3
From the results above, the first patient was predicted to be suspect and in reality the patient was suspect; the second and third patients were predicted to be normal and in reality they were normal. The fourth patient was predicted to be suspect but in reality was pathologic, and the fifth patient was predicted to be normal but was in fact pathologic. Lastly, the sixth patient was predicted to be suspect and in reality was suspect. Let us now create the confusion matrix.
confusion_matrix <- table(p, training$NSP)
confusion_matrix
p 1 2 3
1 943 48 8
2 32 123 19
3 6 11 89
The column values 1, 2 and 3 are the actual values, while the row values 1, 2 and 3 are the predicted values. From the matrix, 943 patients were normal and the model predicted them as normal. However, 48 patients who were suspect were predicted as normal, and 8 pathologic patients were predicted as normal. Of the suspect patients, 123 were correctly predicted as suspect, and 89 pathologic patients were correctly predicted as pathologic. The numbers on the diagonal are correct classifications; the rest are misclassifications.
Using the results above, we can calculate the accuracy rate of our model.
### Correct classification
sum(diag(confusion_matrix))/sum(confusion_matrix)
[1] 0.9030493
The model has 90.3% correct classification.
1-sum(diag(confusion_matrix))/sum(confusion_matrix)
[1] 0.09695074
The model has 9.695074% misclassification.
attach(training)
confusionMatrix(p, NSP, positive = "1")
Confusion Matrix and Statistics
Reference
Prediction 1 2 3
1 943 48 8
2 32 123 19
3 6 11 89
Overall Statistics
Accuracy : 0.903
95% CI : (0.8855, 0.9187)
No Information Rate : 0.767
P-Value [Acc > NIR] : <2e-16
Kappa : 0.7408
Mcnemar's Test P-Value : 0.1317
Statistics by Class:
Class: 1 Class: 2 Class: 3
Sensitivity 0.9613 0.67582 0.76724
Specificity 0.8121 0.95351 0.98538
Pos Pred Value 0.9439 0.70690 0.83962
Neg Pred Value 0.8643 0.94661 0.97698
Prevalence 0.7670 0.14230 0.09070
Detection Rate 0.7373 0.09617 0.06959
Detection Prevalence 0.7811 0.13604 0.08288
Balanced Accuracy 0.8867 0.81467 0.87631
p1 <- predict(mymodel, testing)
head(p1)
[1] 1 1 3 3 1 1
Levels: 1 2 3
head(testing$NSP)
[1] 1 1 3 3 3 1
Levels: 1 2 3
confusion_matrix1 <- table(p1, testing$NSP)
confusion_matrix1
p1 1 2 3
1 635 34 3
2 34 76 11
3 5 3 46
sum(diag(confusion_matrix1))/sum(confusion_matrix1)
[1] 0.8937426
1- sum(diag(confusion_matrix1))/sum(confusion_matrix1)
[1] 0.1062574
attach(testing)
confusionMatrix(p1, NSP, positive = "1")
Confusion Matrix and Statistics
Reference
Prediction 1 2 3
1 635 34 3
2 34 76 11
3 5 3 46
Overall Statistics
Accuracy : 0.8937
95% CI : (0.871, 0.9137)
No Information Rate : 0.7957
P-Value [Acc > NIR] : 1.888e-14
Kappa : 0.6921
Mcnemar's Test P-Value : 0.1666
Statistics by Class:
Class: 1 Class: 2 Class: 3
Sensitivity 0.9421 0.67257 0.76667
Specificity 0.7861 0.93869 0.98983
Pos Pred Value 0.9449 0.62810 0.85185
Neg Pred Value 0.7771 0.94904 0.98235
Prevalence 0.7957 0.13341 0.07084
Detection Rate 0.7497 0.08973 0.05431
Detection Prevalence 0.7934 0.14286 0.06375
Balanced Accuracy 0.8641 0.80563 0.87825
In the context of multinomial logistic regression, accuracy and sensitivity are both important metrics for evaluating the performance of the model. Accuracy represents the overall correctness of the model’s predictions across all categories or classes. It measures the proportion of correctly predicted instances out of the total instances in the data set. While accuracy is a useful metric, especially when classes are balanced, it may not provide a complete picture of model performance when dealing with imbalanced datasets or when different classes have varying levels of importance.
On the other hand, sensitivity, also known as the true positive rate or recall, focuses on the model’s ability to correctly identify instances from a specific class. It measures the proportion of true positives (correctly predicted instances of a particular class) out of all actual instances belonging to that class. Sensitivity is particularly important when dealing with imbalanced datasets or when certain classes are more critical to identify correctly than others, such as in medical diagnosis.
In summary, while accuracy gives an overall view of model performance, sensitivity provides insights into the model’s ability to correctly detect specific classes, making it a valuable metric when dealing with imbalanced datasets or when certain classes have greater significance in the application. Both metrics should be considered in the context of the problem you’re addressing to fully assess the effectiveness of your multinomial logistic regression model.
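As a minimal illustration using the confusion_matrix object created above for the training data (rows are predictions, columns are actual classes), the per-class sensitivity and precision can be computed directly:
round(diag(confusion_matrix) / colSums(confusion_matrix), 4)  ### sensitivity (recall): correct / actual totals per class
round(diag(confusion_matrix) / rowSums(confusion_matrix), 4)  ### precision (pos pred value): correct / predicted totals per class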
Suppose we want to know how many patients were normal, suspect and pathologic in the training data set; we can use the command below.
frq(training, NSP, Tendency)
NSP <categorical>
# total N=1279 valid N=1279 mean=1.32 sd=0.63
Value | N | Raw % | Valid % | Cum. %
--------------------------------------
1 | 981 | 76.70 | 76.70 | 76.70
2 | 182 | 14.23 | 14.23 | 90.93
3 | 116 | 9.07 | 9.07 | 100.00
<NA> | 0 | 0.00 | <NA> | <NA>
Tendency <integer>
# total N=1279 valid N=1279 mean=0.32 sd=0.61
Value | N | Raw % | Valid % | Cum. %
--------------------------------------
-1 | 101 | 7.90 | 7.90 | 7.90
0 | 672 | 52.54 | 52.54 | 60.44
1 | 506 | 39.56 | 39.56 | 100.00
<NA> | 0 | 0.00 | <NA> | <NA>
From the training data, 981 patients were normal, 182 were suspect and 116 were pathologic.
n <- table(training$NSP)
n/sum(n)*100
1 2 3
76.700547 14.229867 9.069586
From the data, 76.7% of the patients were normal, 14.23% were suspect and 9.07% were pathologic.
confusion_matrix/colSums(confusion_matrix)
p 1 2 3
1 0.961264016 0.048929664 0.008154944
2 0.175824176 0.675824176 0.104395604
3 0.051724138 0.094827586 0.767241379
From the results above, the model gives 96.13% correct classification for normal patients, 67.58% for suspect patients and about 76.72% for pathologic patients. In other words, the model does better at classifying classes 1 and 3 than class 2.
confusion_matrix1/colSums(confusion_matrix1)
p1 1 2 3
1 0.942136499 0.050445104 0.004451039
2 0.300884956 0.672566372 0.097345133
3 0.083333333 0.050000000 0.766666667
From the results above, the model gives 94.21% correct classification for normal patients, 67.26% for suspect patients and about 76.67% for pathologic patients in the testing data. In other words, the model again does better at classifying classes 1 and 3 than class 2 for the testing data.
mydata2 <- read.csv("Cardiotocographic.csv", header = TRUE)
str(mydata2)
'data.frame': 2126 obs. of 22 variables:
$ LB : int 120 132 133 134 132 134 134 122 122 122 ...
$ AC : num 0 0.00638 0.00332 0.00256 0.00651 ...
$ FM : num 0 0 0 0 0 0 0 0 0 0 ...
$ UC : num 0 0.00638 0.00831 0.00768 0.00814 ...
$ DL : num 0 0.00319 0.00332 0.00256 0 ...
$ DS : num 0 0 0 0 0 0 0 0 0 0 ...
$ DP : num 0 0 0 0 0 ...
$ ASTV : int 73 17 16 16 16 26 29 83 84 86 ...
$ MSTV : num 0.5 2.1 2.1 2.4 2.4 5.9 6.3 0.5 0.5 0.3 ...
$ ALTV : int 43 0 0 0 0 0 0 6 5 6 ...
$ MLTV : num 2.4 10.4 13.4 23 19.9 0 0 15.6 13.6 10.6 ...
$ Width : int 64 130 130 117 117 150 150 68 68 68 ...
$ Min : int 62 68 68 53 53 50 50 62 62 62 ...
$ Max : int 126 198 198 170 170 200 200 130 130 130 ...
$ Nmax : int 2 6 5 11 9 5 6 0 0 1 ...
$ Nzeros : int 0 1 1 0 0 3 3 0 0 0 ...
$ Mode : int 120 141 141 137 137 76 71 122 122 122 ...
$ Mean : int 137 136 135 134 136 107 107 122 122 122 ...
$ Median : int 121 140 138 137 138 107 106 123 123 123 ...
$ Variance: int 73 12 13 13 11 170 215 3 3 1 ...
$ Tendency: int 1 0 0 1 1 0 0 1 1 1 ...
$ NSP : int 2 1 1 1 1 3 3 3 3 3 ...
mydata2$NSP <- as.ordered(mydata2$NSP)
mydata2$Tendency<-as.factor(mydata2$Tendency)
str(mydata2)
'data.frame': 2126 obs. of 22 variables:
$ LB : int 120 132 133 134 132 134 134 122 122 122 ...
$ AC : num 0 0.00638 0.00332 0.00256 0.00651 ...
$ FM : num 0 0 0 0 0 0 0 0 0 0 ...
$ UC : num 0 0.00638 0.00831 0.00768 0.00814 ...
$ DL : num 0 0.00319 0.00332 0.00256 0 ...
$ DS : num 0 0 0 0 0 0 0 0 0 0 ...
$ DP : num 0 0 0 0 0 ...
$ ASTV : int 73 17 16 16 16 26 29 83 84 86 ...
$ MSTV : num 0.5 2.1 2.1 2.4 2.4 5.9 6.3 0.5 0.5 0.3 ...
$ ALTV : int 43 0 0 0 0 0 0 6 5 6 ...
$ MLTV : num 2.4 10.4 13.4 23 19.9 0 0 15.6 13.6 10.6 ...
$ Width : int 64 130 130 117 117 150 150 68 68 68 ...
$ Min : int 62 68 68 53 53 50 50 62 62 62 ...
$ Max : int 126 198 198 170 170 200 200 130 130 130 ...
$ Nmax : int 2 6 5 11 9 5 6 0 0 1 ...
$ Nzeros : int 0 1 1 0 0 3 3 0 0 0 ...
$ Mode : int 120 141 141 137 137 76 71 122 122 122 ...
$ Mean : int 137 136 135 134 136 107 107 122 122 122 ...
$ Median : int 121 140 138 137 138 107 106 123 123 123 ...
$ Variance: int 73 12 13 13 11 170 215 3 3 1 ...
$ Tendency: Factor w/ 3 levels "-1","0","1": 3 2 2 3 3 2 2 3 3 3 ...
$ NSP : Ord.factor w/ 3 levels "1"<"2"<"3": 2 1 1 1 1 3 3 3 3 3 ...
frq(mydata2, NSP, Tendency)
NSP <ordinal>
# total N=2126 valid N=2126 mean=1.30 sd=0.61
Value | N | Raw % | Valid % | Cum. %
---------------------------------------
1 | 1655 | 77.85 | 77.85 | 77.85
2 | 295 | 13.88 | 13.88 | 91.72
3 | 176 | 8.28 | 8.28 | 100.00
<NA> | 0 | 0.00 | <NA> | <NA>
Tendency <categorical>
# total N=2126 valid N=2126 mean=0.32 sd=0.61
Value | N | Raw % | Valid % | Cum. %
---------------------------------------
-1 | 165 | 7.76 | 7.76 | 7.76
0 | 1115 | 52.45 | 52.45 | 60.21
1 | 846 | 39.79 | 39.79 | 100.00
<NA> | 0 | 0.00 | <NA> | <NA>
summary(mydata2)
LB AC FM UC
Min. :106.0 Min. :0.000000 Min. :0.000000 Min. :0.000000
1st Qu.:126.0 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.001876
Median :133.0 Median :0.001630 Median :0.000000 Median :0.004482
Mean :133.3 Mean :0.003170 Mean :0.009474 Mean :0.004357
3rd Qu.:140.0 3rd Qu.:0.005631 3rd Qu.:0.002512 3rd Qu.:0.006525
Max. :160.0 Max. :0.019284 Max. :0.480634 Max. :0.014925
DL DS DP ASTV
Min. :0.000000 Min. :0.000e+00 Min. :0.0000000 Min. :12.00
1st Qu.:0.000000 1st Qu.:0.000e+00 1st Qu.:0.0000000 1st Qu.:32.00
Median :0.000000 Median :0.000e+00 Median :0.0000000 Median :49.00
Mean :0.001885 Mean :3.585e-06 Mean :0.0001566 Mean :46.99
3rd Qu.:0.003264 3rd Qu.:0.000e+00 3rd Qu.:0.0000000 3rd Qu.:61.00
Max. :0.015385 Max. :1.353e-03 Max. :0.0053476 Max. :87.00
MSTV ALTV MLTV Width
Min. :0.200 Min. : 0.000 Min. : 0.000 Min. : 3.00
1st Qu.:0.700 1st Qu.: 0.000 1st Qu.: 4.600 1st Qu.: 37.00
Median :1.200 Median : 0.000 Median : 7.400 Median : 67.50
Mean :1.333 Mean : 9.847 Mean : 8.188 Mean : 70.45
3rd Qu.:1.700 3rd Qu.:11.000 3rd Qu.:10.800 3rd Qu.:100.00
Max. :7.000 Max. :91.000 Max. :50.700 Max. :180.00
Min Max Nmax Nzeros
Min. : 50.00 Min. :122 Min. : 0.000 Min. : 0.0000
1st Qu.: 67.00 1st Qu.:152 1st Qu.: 2.000 1st Qu.: 0.0000
Median : 93.00 Median :162 Median : 3.000 Median : 0.0000
Mean : 93.58 Mean :164 Mean : 4.068 Mean : 0.3236
3rd Qu.:120.00 3rd Qu.:174 3rd Qu.: 6.000 3rd Qu.: 0.0000
Max. :159.00 Max. :238 Max. :18.000 Max. :10.0000
Mode Mean Median Variance Tendency
Min. : 60.0 Min. : 73.0 Min. : 77.0 Min. : 0.00 -1: 165
1st Qu.:129.0 1st Qu.:125.0 1st Qu.:129.0 1st Qu.: 2.00 0 :1115
Median :139.0 Median :136.0 Median :139.0 Median : 7.00 1 : 846
Mean :137.5 Mean :134.6 Mean :138.1 Mean : 18.81
3rd Qu.:148.0 3rd Qu.:145.0 3rd Qu.:148.0 3rd Qu.: 24.00
Max. :187.0 Max. :182.0 Max. :186.0 Max. :269.00
NSP
1:1655
2: 295
3: 176
attach(mydata2)
xtabs(~NSP+Tendency, mydata2)
Tendency
NSP -1 0 1
1 101 887 667
2 15 137 143
3 49 91 36
Let us partition the data into train and test
ind1 <- sample(2, nrow(mydata2), replace = TRUE, prob = c(0.8, 0.2))
train <- mydata2[ind1 == 1,]
test <- mydata2[ind1 == 2,]
The train data set has 1729 observations with 22 variables, while the test data set has 397 observations with 22 variables
To estimate this model, we need the library called MASS
library(MASS)
# Load the ggplot2 library
library(ggplot2)
# Calculate the percentage of each NSP category
percentage_data <- mydata2 %>%
group_by(NSP) %>%
summarize(Percentage = n() / nrow(mydata2) * 100)
# Create the bar plot
ggplot(percentage_data, aes(x = NSP, y = Percentage, fill = NSP)) +
geom_bar(stat = "identity") +
geom_text(aes(label = sprintf("%.1f%%", Percentage)), vjust = -0.5) +
labs(title = "Distribution of NSP",
x = "NSP",
y = "Percentage") +
theme_minimal()
ggplot(data= mydata2, aes(x=NSP, y =Mean, fill=Tendency)) +
geom_bar(position = "dodge",
stat = "summary",
fun = "mean")+
theme_minimal()+
labs(title = "Bar Graph of Mean VS NSP and Tendency",
x = "NSP",
y = "Mean")+
theme(axis.text.x = element_text(angle = 360, vjust = 0.5))
library(ggplot2)
ggplot(mydata2, aes(x = as.factor(NSP), fill = Tendency)) +
  geom_bar(aes(y = ..count../tapply(..count.., ..x.., sum)[..x..]), position = "dodge") +
  geom_text(aes(y = ..count../tapply(..count.., ..x.., sum)[..x..],
                label = scales::percent(..count../tapply(..count.., ..x.., sum)[..x..])),
            stat = "count", position = position_dodge(0.9), vjust = -0.5) +
  ylab("Percentage of NSP Across Tendency") +
  xlab("NSP") +
  labs(title = "Bar Graph Showing the Distribution of NSP For Various Tendencies") +
  scale_y_continuous(labels = scales::percent)
### Estimate the model
mymodel2 <- polr(NSP ~ LB+AC+FM, train, Hess = TRUE)
summary(mymodel2)
Call:
polr(formula = NSP ~ LB + AC + FM, data = train, Hess = TRUE)
Coefficients:
Value Std. Error t value
LB 0.03836 0.006623 5.792e+00
AC -804.40604 0.005577 -1.442e+05
FM 7.42540 1.159603 6.403e+00
Intercepts:
Value Std. Error t value
1|2 5.4924 0.9074 6.0525
2|3 6.8839 0.9180 7.4991
Residual Deviance: 1820.135
AIC: 1830.135
From the results above, we can see that for a one-unit increase in FM we expect about a 7.43 increase in the expected value of NSP on the log-odds scale, given that all other variables in the model are held constant.
(ctable <- coef(summary(mymodel2)))
Value Std. Error t value
LB 0.03836167 0.006623396 5.791843e+00
AC -804.40604304 0.005577380 -1.442265e+05
FM 7.42540119 1.159602960 6.403400e+00
1|2 5.49235454 0.907448582 6.052524e+00
2|3 6.88388486 0.917964521 7.499075e+00
p <- pnorm(abs(ctable[,"t value"]),lower.tail = FALSE)*2
(ctable<- cbind(ctable, "p-value" = p))
Value Std. Error t value p-value
LB 0.03836167 0.006623396 5.791843e+00 6.961803e-09
AC -804.40604304 0.005577380 -1.442265e+05 0.000000e+00
FM 7.42540119 1.159602960 6.403400e+00 1.519547e-10
1|2 5.49235454 0.907448582 6.052524e+00 1.425935e-09
2|3 6.88388486 0.917964521 7.499075e+00 6.426971e-14
pred <- predict(mymodel2, train[1:5,], type = 'prob')
print(pred, digits = 3)
1 2 3
1 0.709 0.19857 0.092752
2 0.996 0.00288 0.000956
3 0.955 0.03318 0.011496
5 0.997 0.00258 0.000858
6 0.768 0.16225 0.069947
train[1:5, 1:3]
LB AC FM
1 120 0.000000000 0
2 132 0.006379585 0
3 133 0.003322259 0
5 132 0.006514658 0
6 134 0.001049318 0
Using the results above, we can write the model and predict the probability of NSP being 1,2 or 3.
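In the proportional-odds parameterisation used by polr, the class probabilities come from the two intercepts (cut-points) and a single linear predictor built from the coefficients above:
\[ P(NSP \le j) = \frac{1}{1 + \exp\left(-(\zeta_j - \eta)\right)}, \qquad \eta = 0.03836*LB - 804.406*AC + 7.4254*FM, \qquad \zeta_{1|2} = 5.4924,\ \zeta_{2|3} = 6.8839 \]
The class probabilities follow as P(NSP = 1) = P(NSP ≤ 1), P(NSP = 2) = P(NSP ≤ 2) − P(NSP ≤ 1) and P(NSP = 3) = 1 − P(NSP ≤ 2). For the first training row (LB = 120, AC = FM = 0) this gives η ≈ 4.60 and probabilities of roughly 0.709, 0.199 and 0.093, matching the first row of predicted probabilities shown above.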
mymodel2 <- polr(NSP ~., train, Hess = TRUE)
summary(mymodel2)
Call:
polr(formula = NSP ~ ., data = train, Hess = TRUE)
Coefficients:
Value Std. Error t value
LB 7.603e-03 2.176e-02 3.493e-01
AC -7.248e+02 8.192e-03 -8.848e+04
FM 4.207e+00 2.086e+00 2.017e+00
UC -2.359e+02 1.316e-02 -1.792e+04
DL -3.232e+00 3.184e-03 -1.015e+03
DS 2.121e+03 8.713e-05 2.435e+07
DP 3.280e+03 2.617e-03 1.254e+06
ASTV 9.513e-02 8.779e-03 1.084e+01
MSTV -1.216e-01 2.209e-01 -5.505e-01
ALTV 3.469e-02 4.912e-03 7.063e+00
MLTV 3.134e-02 2.703e-02 1.159e+00
Width 4.019e-02 8.460e-03 4.751e+00
Min 5.858e-02 1.181e-02 4.962e+00
Nmax -2.576e-02 4.864e-02 -5.296e-01
Nzeros 8.720e-02 1.183e-01 7.373e-01
Mode -8.525e-02 1.943e-02 -4.388e+00
Mean 6.602e-02 2.417e-02 2.732e+00
Median -6.256e-03 3.269e-02 -1.914e-01
Variance 2.093e-02 5.495e-03 3.809e+00
Tendency0 7.972e-01 3.833e-01 2.080e+00
Tendency1 1.124e+00 4.831e-01 2.326e+00
Intercepts:
Value Std. Error t value
1|2 12.7352 1.5281 8.3342
2|3 15.9194 1.5666 10.1615
Residual Deviance: 977.6412
AIC: 1023.641
### Calculating the P-values for our Model
(ctable <- coef(summary(mymodel2)))
Value Std. Error t value
LB 7.602905e-03 2.176351e-02 3.493418e-01
AC -7.248121e+02 8.192246e-03 -8.847538e+04
FM 4.206558e+00 2.086063e+00 2.016506e+00
UC -2.358793e+02 1.316491e-02 -1.791728e+04
DL -3.232100e+00 3.183799e-03 -1.015171e+03
DS 2.121300e+03 8.712903e-05 2.434665e+07
DP 3.280390e+03 2.616513e-03 1.253726e+06
ASTV 9.513327e-02 8.779237e-03 1.083617e+01
MSTV -1.216135e-01 2.209055e-01 -5.505229e-01
ALTV 3.469108e-02 4.911631e-03 7.063045e+00
MLTV 3.133798e-02 2.703051e-02 1.159356e+00
Width 4.018983e-02 8.459766e-03 4.750702e+00
Min 5.858344e-02 1.180706e-02 4.961731e+00
Nmax -2.576105e-02 4.864326e-02 -5.295913e-01
Nzeros 8.719934e-02 1.182689e-01 7.372971e-01
Mode -8.525058e-02 1.942733e-02 -4.388179e+00
Mean 6.601984e-02 2.416965e-02 2.731518e+00
Median -6.256364e-03 3.269334e-02 -1.913651e-01
Variance 2.092894e-02 5.494913e-03 3.808784e+00
Tendency0 7.971815e-01 3.832595e-01 2.080004e+00
Tendency1 1.123538e+00 4.830942e-01 2.325713e+00
1|2 1.273523e+01 1.528062e+00 8.334238e+00
2|3 1.591943e+01 1.566648e+00 1.016146e+01
p <- pnorm(abs(ctable[,"t value"]),lower.tail = FALSE)*2
(ctable<- cbind(ctable, "p-value" = p))
Value Std. Error t value p-value
LB 7.602905e-03 2.176351e-02 3.493418e-01 7.268327e-01
AC -7.248121e+02 8.192246e-03 -8.847538e+04 0.000000e+00
FM 4.206558e+00 2.086063e+00 2.016506e+00 4.374712e-02
UC -2.358793e+02 1.316491e-02 -1.791728e+04 0.000000e+00
DL -3.232100e+00 3.183799e-03 -1.015171e+03 0.000000e+00
DS 2.121300e+03 8.712903e-05 2.434665e+07 0.000000e+00
DP 3.280390e+03 2.616513e-03 1.253726e+06 0.000000e+00
ASTV 9.513327e-02 8.779237e-03 1.083617e+01 2.319886e-27
MSTV -1.216135e-01 2.209055e-01 -5.505229e-01 5.819608e-01
ALTV 3.469108e-02 4.911631e-03 7.063045e+00 1.628925e-12
MLTV 3.133798e-02 2.703051e-02 1.159356e+00 2.463110e-01
Width 4.018983e-02 8.459766e-03 4.750702e+00 2.027113e-06
Min 5.858344e-02 1.180706e-02 4.961731e+00 6.986763e-07
Nmax -2.576105e-02 4.864326e-02 -5.295913e-01 5.963953e-01
Nzeros 8.719934e-02 1.182689e-01 7.372971e-01 4.609417e-01
Mode -8.525058e-02 1.942733e-02 -4.388179e+00 1.143039e-05
Mean 6.601984e-02 2.416965e-02 2.731518e+00 6.304329e-03
Median -6.256364e-03 3.269334e-02 -1.913651e-01 8.482396e-01
Variance 2.092894e-02 5.494913e-03 3.808784e+00 1.396520e-04
Tendency0 7.971815e-01 3.832595e-01 2.080004e+00 3.752514e-02
Tendency1 1.123538e+00 4.830942e-01 2.325713e+00 2.003387e-02
1|2 1.273523e+01 1.528062e+00 8.334238e+00 7.799903e-17
2|3 1.591943e+01 1.566648e+00 1.016146e+01 2.946346e-24
### Estimate the model
mymodel2 <- polr(NSP ~ AC+FM+UC+DL+DS+DP+ASTV+Width+Min+Variance, train, Hess = TRUE)
summary(mymodel2)
Call:
polr(formula = NSP ~ AC + FM + UC + DL + DS + DP + ASTV + Width +
Min + Variance, data = train, Hess = TRUE)
Coefficients:
Value Std. Error t value
AC -847.07581 0.0078071 -1.085e+05
FM 4.85295 2.0472349 2.370e+00
UC -258.32577 0.0140492 -1.839e+04
DL -102.94351 0.0021385 -4.814e+04
DS 2454.39606 0.0002667 9.202e+06
DP 2437.19066 0.0026491 9.200e+05
ASTV 0.09775 0.0072388 1.350e+01
Width 0.01970 0.0049289 3.996e+00
Min 0.03737 0.0061958 6.031e+00
Variance 0.02902 0.0041693 6.961e+00
Intercepts:
Value Std. Error t value
1|2 10.3025 1.0066 10.2348
2|3 12.9969 1.0332 12.5796
Residual Deviance: 1064.245
AIC: 1088.245
### Calculating the P-values for our Model
(ctable <- coef(summary(mymodel2)))
Value Std. Error t value
AC -847.07580590 0.0078070958 -1.085008e+05
FM 4.85295401 2.0472348679 2.370492e+00
UC -258.32577168 0.0140491822 -1.838725e+04
DL -102.94351232 0.0021385158 -4.813783e+04
DS 2454.39606181 0.0002667268 9.201912e+06
DP 2437.19066147 0.0026490693 9.200177e+05
ASTV 0.09774735 0.0072388477 1.350316e+01
Width 0.01969580 0.0049288793 3.995999e+00
Min 0.03736699 0.0061957662 6.031052e+00
Variance 0.02902359 0.0041693409 6.961193e+00
1|2 10.30254198 1.0066138832 1.023485e+01
2|3 12.99690142 1.0331704602 1.257963e+01
p <- pnorm(abs(ctable[,"t value"]),lower.tail = FALSE)*2
(ctable<- cbind(ctable, "p-value" = p))
Value Std. Error t value p-value
AC -847.07580590 0.0078070958 -1.085008e+05 0.000000e+00
FM 4.85295401 2.0472348679 2.370492e+00 1.776443e-02
UC -258.32577168 0.0140491822 -1.838725e+04 0.000000e+00
DL -102.94351232 0.0021385158 -4.813783e+04 0.000000e+00
DS 2454.39606181 0.0002667268 9.201912e+06 0.000000e+00
DP 2437.19066147 0.0026490693 9.200177e+05 0.000000e+00
ASTV 0.09774735 0.0072388477 1.350316e+01 1.498002e-41
Width 0.01969580 0.0049288793 3.995999e+00 6.442203e-05
Min 0.03736699 0.0061957662 6.031052e+00 1.628959e-09
Variance 0.02902359 0.0041693409 6.961193e+00 3.374030e-12
1|2 10.30254198 1.0066138832 1.023485e+01 1.384097e-24
2|3 12.99690142 1.0331704602 1.257963e+01 2.733093e-36
pred1 <- predict(mymodel2, train)
head(pred1,10)
[1] 2 1 1 1 2 2 2 2 2 2
Levels: 1 2 3
### Confusion Matrix
tab2 <- table(pred1, train$NSP)
tab2
pred1 1 2 3
1 1252 99 8
2 62 124 46
3 2 18 95
sum(diag(tab2))/sum(tab2)
[1] 0.8622509
An accuracy score of 0.8623 for the model above indicates that it correctly predicted the outcome category for approximately 86.23% of the instances in the training data set. In other words, the model successfully classified 86.23% of the instances into their respective outcome categories, while the remaining 13.77% were incorrectly classified.
Accuracy is a commonly used metric for evaluating the overall performance of classification models, but it is essential to interpret it in the context of your specific problem and dataset. If the dataset has imbalanced classes (i.e., some outcome categories are much more prevalent than others), a high accuracy score may not necessarily indicate a good model: the model might achieve high accuracy simply by predicting the majority class most of the time, so you should also consider other metrics like precision, recall and F1-score, especially for minority classes.
The significance of accuracy also depends on the real-world consequences of misclassification. In some applications, misclassifying certain categories may have more severe consequences than others, and sensitivity and specificity can provide a more nuanced view of the model's performance for specific classes. It is also important to establish a baseline accuracy to compare your model against; if the dataset has a natural baseline (e.g., always predicting the majority class or random guessing), your model's accuracy should be compared with that baseline.
In summary, an accuracy of 0.8623 suggests that the model is performing well in terms of overall classification accuracy. However, to get a more comprehensive understanding of its performance, it is essential to consider other metrics and contextual factors, especially if your dataset has imbalanced classes or specific classes with different levels of importance.
### Misclassification
1 - sum(diag(tab2))/sum(tab2)
[1] 0.1377491
attach(train)
confusionMatrix(pred1, NSP, positive = "1")
Confusion Matrix and Statistics
Reference
Prediction 1 2 3
1 1252 99 8
2 62 124 46
3 2 18 95
Overall Statistics
Accuracy : 0.8623
95% CI : (0.845, 0.8783)
No Information Rate : 0.7714
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.6178
Mcnemar's Test P-Value : 2.108e-05
Statistics by Class:
Class: 1 Class: 2 Class: 3
Sensitivity 0.9514 0.51452 0.63758
Specificity 0.7256 0.92628 0.98715
Pos Pred Value 0.9213 0.53448 0.82609
Neg Pred Value 0.8156 0.92062 0.96606
Prevalence 0.7714 0.14127 0.08734
Detection Rate 0.7339 0.07268 0.05569
Detection Prevalence 0.7966 0.13599 0.06741
Balanced Accuracy 0.8385 0.72040 0.81237
pred2 <- predict(mymodel2, test)
head(pred2,10)
[1] 1 3 2 1 1 1 2 1 1 1
Levels: 1 2 3
tab3 <- table(pred2, test$NSP)
tab3
pred2 1 2 3
1 322 19 5
2 16 32 9
3 1 3 13
sum(diag(tab3))/sum(tab3)
[1] 0.8738095
1-sum(diag(tab3))/sum(tab3)
[1] 0.1261905
attach(test)
confusionMatrix(pred2, NSP, positive = "1")
Confusion Matrix and Statistics
Reference
Prediction 1 2 3
1 322 19 5
2 16 32 9
3 1 3 13
Overall Statistics
Accuracy : 0.8738
95% CI : (0.8382, 0.904)
No Information Rate : 0.8071
P-Value [Acc > NIR] : 0.0001836
Kappa : 0.5994
Mcnemar's Test P-Value : 0.1153765
Statistics by Class:
Class: 1 Class: 2 Class: 3
Sensitivity 0.9499 0.59259 0.48148
Specificity 0.7037 0.93169 0.98982
Pos Pred Value 0.9306 0.56140 0.76471
Neg Pred Value 0.7703 0.93939 0.96526
Prevalence 0.8071 0.12857 0.06429
Detection Rate 0.7667 0.07619 0.03095
Detection Prevalence 0.8238 0.13571 0.04048
Balanced Accuracy 0.8268 0.76214 0.73565
The output above presents the confusion matrix and various statistics for the model on the test data. Let's break down each part of the output in detail:
The confusion matrix is a tabular representation that summarises the model's performance by showing the count of correct and incorrect predictions for each class (category). In this case there are three classes (1, 2 and 3) for the outcome variable. Rows represent the predicted classes, while columns represent the actual (reference) classes. For example, in the top-left cell (Row 1, Column 1) there are 322 instances where the model correctly predicted class 1 when the actual class was also 1. In the top-middle cell (Row 1, Column 2) there are 19 instances where the model incorrectly predicted class 1 when the actual class was 2.
Accuracy: The overall accuracy of the model is 0.8738, or 87.38%. This means that the model correctly predicted the outcome for approximately 87.38% of all instances in the test data set. 95% CI (Confidence Interval): The confidence interval provides a range within which the true accuracy is likely to fall; here it is between 0.8382 and 0.9040. No Information Rate (NIR): NIR is the accuracy achieved by always predicting the majority class, in this case 0.8071. The model's accuracy should be significantly better than the NIR to be considered meaningful. Kappa: Kappa is a statistic that measures the agreement between the model's predictions and the actual values beyond what would be expected by chance; a kappa of 0.5994 indicates moderate agreement.
Mcnemar's Test P-Value: McNemar's test is a statistical test used to determine whether there is a systematic difference between the model's predictions and the actual values. The p-value here (0.1154) does not indicate a significant difference.
Sensitivity: Sensitivity, also known as the True Positive Rate or Recall, measures the proportion of true positive predictions for each class. For Class 1 it is 0.9499, which means the model is very good at correctly identifying Class 1 instances.
Specificity: Specificity measures the proportion of true negatives for each class. For Class 3 it is 0.98982, indicating the model's ability to correctly identify instances that are not Class 3.
Pos Pred Value: Positive Predictive Value is the proportion of true positive predictions out of all positive predictions made by the model. For Class 1 it is 0.9306, suggesting that when the model predicts Class 1 it is correct approximately 93.06% of the time.
Neg Pred Value: Negative Predictive Value is the proportion of true negative predictions out of all negative predictions made by the model. For Class 3 it is 0.96526, indicating high accuracy in predicting that an instance does not belong to Class 3.
Prevalence: Prevalence represents the proportion of instances belonging to each class in the dataset.
Detection Rate: Detection Rate measures how well the model detects instances of each class.
Detection Prevalence: Detection Prevalence is the proportion of instances that the model predicted as each class.
Balanced Accuracy: Balanced Accuracy is the average of sensitivity and specificity and provides a balanced measure of classification performance for each class.
In summary, this output provides a comprehensive assessment of the multinomial logistic regression model’s performance. It includes overall accuracy, class-specific metrics, and additional statistical measures, allowing you to evaluate the model’s strengths and weaknesses for each class and as a whole.
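As a minimal sketch using the tab3 object from the test data above (rows are predictions, columns are actual classes), several of these statistics can be reproduced directly:
n_total              <- sum(tab3)
prevalence           <- colSums(tab3) / n_total   ### share of actual cases per class
detection_rate       <- diag(tab3) / n_total      ### share of all cases correctly detected per class
detection_prevalence <- rowSums(tab3) / n_total   ### share predicted as each class
sensitivity          <- diag(tab3) / colSums(tab3)
round(rbind(prevalence, detection_rate, detection_prevalence, sensitivity), 4)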
mydata4 <- read.csv("Cardiotocographic.csv", header = TRUE)
str(mydata4)
'data.frame': 2126 obs. of 22 variables:
$ LB : int 120 132 133 134 132 134 134 122 122 122 ...
$ AC : num 0 0.00638 0.00332 0.00256 0.00651 ...
$ FM : num 0 0 0 0 0 0 0 0 0 0 ...
$ UC : num 0 0.00638 0.00831 0.00768 0.00814 ...
$ DL : num 0 0.00319 0.00332 0.00256 0 ...
$ DS : num 0 0 0 0 0 0 0 0 0 0 ...
$ DP : num 0 0 0 0 0 ...
$ ASTV : int 73 17 16 16 16 26 29 83 84 86 ...
$ MSTV : num 0.5 2.1 2.1 2.4 2.4 5.9 6.3 0.5 0.5 0.3 ...
$ ALTV : int 43 0 0 0 0 0 0 6 5 6 ...
$ MLTV : num 2.4 10.4 13.4 23 19.9 0 0 15.6 13.6 10.6 ...
$ Width : int 64 130 130 117 117 150 150 68 68 68 ...
$ Min : int 62 68 68 53 53 50 50 62 62 62 ...
$ Max : int 126 198 198 170 170 200 200 130 130 130 ...
$ Nmax : int 2 6 5 11 9 5 6 0 0 1 ...
$ Nzeros : int 0 1 1 0 0 3 3 0 0 0 ...
$ Mode : int 120 141 141 137 137 76 71 122 122 122 ...
$ Mean : int 137 136 135 134 136 107 107 122 122 122 ...
$ Median : int 121 140 138 137 138 107 106 123 123 123 ...
$ Variance: int 73 12 13 13 11 170 215 3 3 1 ...
$ Tendency: int 1 0 0 1 1 0 0 1 1 1 ...
$ NSP : int 2 1 1 1 1 3 3 3 3 3 ...
attach(mydata4)
The loaded data has 2126 observations of 22 variables. This is the CTG (cardiotocography) dataset: it contains fetal heart rate (FHR) and uterine contraction (UC) features measured on cardiotocograms. The 2126 fetal cardiotocograms (CTGs) were automatically processed and their diagnostic features measured; each CTG was then classified by three expert obstetricians and a consensus label assigned as Normal, Suspect or Pathologic. The response variable is NSP (Normal, Suspect, Pathologic) and the remaining 21 variables are predictors. Remember that NSP is stored as an integer, so we have to convert it to a factor.
mydata4$NSP <- as.factor(mydata4$NSP)
str(mydata4)
'data.frame': 2126 obs. of 22 variables:
$ LB : int 120 132 133 134 132 134 134 122 122 122 ...
$ AC : num 0 0.00638 0.00332 0.00256 0.00651 ...
$ FM : num 0 0 0 0 0 0 0 0 0 0 ...
$ UC : num 0 0.00638 0.00831 0.00768 0.00814 ...
$ DL : num 0 0.00319 0.00332 0.00256 0 ...
$ DS : num 0 0 0 0 0 0 0 0 0 0 ...
$ DP : num 0 0 0 0 0 ...
$ ASTV : int 73 17 16 16 16 26 29 83 84 86 ...
$ MSTV : num 0.5 2.1 2.1 2.4 2.4 5.9 6.3 0.5 0.5 0.3 ...
$ ALTV : int 43 0 0 0 0 0 0 6 5 6 ...
$ MLTV : num 2.4 10.4 13.4 23 19.9 0 0 15.6 13.6 10.6 ...
$ Width : int 64 130 130 117 117 150 150 68 68 68 ...
$ Min : int 62 68 68 53 53 50 50 62 62 62 ...
$ Max : int 126 198 198 170 170 200 200 130 130 130 ...
$ Nmax : int 2 6 5 11 9 5 6 0 0 1 ...
$ Nzeros : int 0 1 1 0 0 3 3 0 0 0 ...
$ Mode : int 120 141 141 137 137 76 71 122 122 122 ...
$ Mean : int 137 136 135 134 136 107 107 122 122 122 ...
$ Median : int 121 140 138 137 138 107 106 123 123 123 ...
$ Variance: int 73 12 13 13 11 170 215 3 3 1 ...
$ Tendency: int 1 0 0 1 1 0 0 1 1 1 ...
$ NSP : Factor w/ 3 levels "1","2","3": 2 1 1 1 1 3 3 3 3 3 ...
Now let us look at how many observations are present for each factor (Normal, Suspect, Pathologic)
frq(mydata4, NSP)
NSP <categorical>
# total N=2126 valid N=2126 mean=1.30 sd=0.61
Value | N | Raw % | Valid % | Cum. %
---------------------------------------
1 | 1655 | 77.85 | 77.85 | 77.85
2 | 295 | 13.88 | 13.88 | 91.72
3 | 176 | 8.28 | 8.28 | 100.00
<NA> | 0 | 0.00 | <NA> | <NA>
From the frequency table above, the majority of the observations fall in the normal category (1655), followed by the suspect category (295) and finally the pathologic category (176).
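The frq() call above is assumed to come from the sjmisc package; the same counts and percentages can be cross-checked in base R:
table(mydata4$NSP)                                   # raw counts per class
round(100 * prop.table(table(mydata4$NSP)), 2)       # percentages per class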
We shall start by setting a random seed so that the analysis is reproducible
set.seed(123)
ind <- sample(2, nrow(mydata4), replace = TRUE, prob = c(0.7, 0.3))
train_data <- mydata4[ind ==1,] ## 1495 observations
test_data <- mydata4[ind ==2,] ## 631 observations
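The split above is a simple random one. As an aside (a sketch, not used in what follows), caret’s createDataPartition() can produce a stratified split that keeps the NSP class proportions similar in both sets:
library(caret)
set.seed(123)
idx_strat   <- createDataPartition(mydata4$NSP, p = 0.7, list = FALSE)
train_strat <- mydata4[idx_strat, ]    # roughly 70%, stratified by NSP
test_strat  <- mydata4[-idx_strat, ]   # remaining 30%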
Random forest models are built by aggregating several decision trees. Instead of using one decision tree, a random forest grows hundreds of decision trees and aggregates their results to come up with a final model. Remember, random forests can be used for classification as well as regression: if the response variable is categorical, the algorithm builds a classification model, and if the response variable is continuous, it builds a regression model.
Consider an example below
knitr::include_graphics("random.png")
knitr::include_graphics("random2.png")
According to the information given, the algorithm predicted that the patient belongs to the suspect category. We are going to make use of random forests; however, let us first consider the simpler case of a single decision tree.
library(rpart)    # rpart() fits the decision tree
library(rattle)   # fancyRpartPlot() draws it
modFitA1 <- rpart(NSP ~ ., data = train_data, method = "class")
fancyRpartPlot(modFitA1)
library(randomForest)
set.seed(222)
We will therefore estimate our classification model with NSP as the dependent variable and all the remaining variables as independent variables
rf <- randomForest(NSP~.,data = train_data)
Let us now look at the model using the print function
print(rf)
Call:
randomForest(formula = NSP ~ ., data = train_data)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 4
OOB estimate of error rate: 5.62%
Confusion matrix:
1 2 3 class.error
1 1142 17 1 0.01551724
2 47 164 2 0.23004695
3 6 11 105 0.13934426
The output shows the formula used, with NSP as the DV and all the other variables as IVs. The data used by this algorithm is the training data. The random forest is a classification forest because the response variable is a factor. The default number of trees is 500. The mtry value, the number of variables tried at each split, is 4. The out-of-bag (OOB) estimate of the error rate is approximately 5.62%, so we have about 94.38% accuracy, which is quite okay. The error when predicting Normal patients is very low compared to the error when predicting Suspect or Pathologic patients.
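As a quick cross-check (a sketch using the fitted object), the headline OOB figure can be pulled directly from rf: the last row of rf$err.rate holds the error rates after all 500 trees.
oob_error <- rf$err.rate[nrow(rf$err.rate), "OOB"]   # OOB error after 500 trees
1 - oob_error                                        # overall OOB accuracy (about 0.94)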
So what attributes does this model have? We can check using the command below;
attributes(rf)
$names
[1] "call" "type" "predicted" "err.rate"
[5] "confusion" "votes" "oob.times" "classes"
[9] "importance" "importanceSD" "localImportance" "proximity"
[13] "ntree" "mtry" "forest" "y"
[17] "test" "inbag" "terms"
$class
[1] "randomForest.formula" "randomForest"
The model contains all the attributes listed above, for example, if we need the confusion matrix only, we can use the command below
rf$confusion
1 2 3 class.error
1 1142 17 1 0.01551724
2 47 164 2 0.23004695
3 6 11 105 0.13934426
We will make predictions using the caret package
library(caret)
We can view the prediction from the model vs the real data
p1 <- predict(rf, train_data)
head(p1)
1 3 6 7 9 10
2 1 3 3 3 3
Levels: 1 2 3
Observations from the real data
head(train_data$NSP)
[1] 2 1 3 3 3 3
Levels: 1 2 3
The first six predictions are all accurate, that is, 100% accuracy for these six observations
confusionMatrix(p1, train_data$NSP)
Confusion Matrix and Statistics
Reference
Prediction 1 2 3
1 1160 2 0
2 0 211 0
3 0 0 122
Overall Statistics
Accuracy : 0.9987
95% CI : (0.9952, 0.9998)
No Information Rate : 0.7759
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9964
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 1 Class: 2 Class: 3
Sensitivity 1.0000 0.9906 1.00000
Specificity 0.9940 1.0000 1.00000
Pos Pred Value 0.9983 1.0000 1.00000
Neg Pred Value 1.0000 0.9984 1.00000
Prevalence 0.7759 0.1425 0.08161
Detection Rate 0.7759 0.1411 0.08161
Detection Prevalence 0.7773 0.1411 0.08161
Balanced Accuracy 0.9970 0.9953 1.00000
From the confusion matrix, the model predicted that 1160 patients belong to group 1, and they were indeed observed to belong to group 1. The model predicted 211 patients in group 2 who were observed in group 2, and 122 patients in group 3 who were observed in group 3. However, two patients were misclassified into group 1 while they were observed to belong to group 2.
The sensitivity statistics tell us how often each class was correctly classified. In other words, sensitivity for class 1 is 1.00, that is, patients in class 1 were correctly classified as class 1 100% of the time.
For each bootstrap iteration and its corresponding tree, the prediction error is estimated using the data not in the bootstrap sample (the out-of-bag, or OOB, data): * Classification: accuracy * Regression: R-squared & RMSE
Now that all the data points in the training set have been seen by the model, we can try predicting on the test data, which the model has not seen. We can view the predictions from the model versus the real data
p2 <- predict(rf, test_data)
head(p2)
2 4 5 8 11 16
1 1 1 3 2 1
Levels: 1 2 3
Observations from the real data
head(test_data$NSP)
[1] 1 1 1 3 2 1
Levels: 1 2 3
confusionMatrix(p2, test_data$NSP)
Confusion Matrix and Statistics
Reference
Prediction 1 2 3
1 482 17 4
2 11 61 4
3 2 4 46
Overall Statistics
Accuracy : 0.9334
95% CI : (0.9111, 0.9516)
No Information Rate : 0.7845
P-Value [Acc > NIR] : <2e-16
Kappa : 0.8109
Mcnemar's Test P-Value : 0.5823
Statistics by Class:
Class: 1 Class: 2 Class: 3
Sensitivity 0.9737 0.74390 0.85185
Specificity 0.8456 0.97268 0.98960
Pos Pred Value 0.9583 0.80263 0.88462
Neg Pred Value 0.8984 0.96216 0.98618
Prevalence 0.7845 0.12995 0.08558
Detection Rate 0.7639 0.09667 0.07290
Detection Prevalence 0.7971 0.12044 0.08241
Balanced Accuracy 0.9097 0.85829 0.92073
The test data has not been seen by the random forest model, and the accuracy has come down to 93.34%. The 95% CI is still good, ranging between 91% and 95%. There is slightly more misclassification on the test data than on the training data. Besides, sensitivity for predicting class 1 is still higher than for classes 2 and 3.
We will make a plot using the command below to see the error rate in our model
plot(rf, main ="Error Rate of the Random Forest Model")
As the number of trees grows, the out-of-bag error comes down and then remains roughly constant. From the plot, the error does not improve after about 300 trees.
t<- tuneRF(train_data[,-22], train_data[,22],
stepFactor = 0.5,
plot = TRUE,
ntreeTry = 300,
trace = TRUE,
improve = 0.05)  ## train_data[,-22] selects all columns except column 22 (NSP, the response); train_data[,22] is NSP itself
mtry = 4 OOB error = 5.55%
Searching left ...
mtry = 8 OOB error = 5.22%
0.06024096 0.05
mtry = 16 OOB error = 5.55%
-0.06410256 0.05
Searching right ...
mtry = 2 OOB error = 6.82%
-0.3076923 0.05
The out-of-bag error is high when mtry = 2, comes down when mtry increases to 4, comes down further when mtry = 8, and rises again when mtry = 16. This gives us an idea of which mtry value we should choose. We can now go back to our random forest model and change a few things to tune it
rf2 <- randomForest(NSP~.,data = train_data,
ntree = 300,
mtry = 8,
importance = TRUE)
print(rf2)
Call:
randomForest(formula = NSP ~ ., data = train_data, ntree = 300, mtry = 8, importance = TRUE)
Type of random forest: classification
Number of trees: 300
No. of variables tried at each split: 8
OOB estimate of error rate: 5.02%
Confusion matrix:
1 2 3 class.error
1 1138 16 6 0.01896552
2 41 171 1 0.19718310
3 7 4 111 0.09016393
Earlier, the out-of-bag error was 5.62%; in this model the OOB error has come down to 5.02%. Besides, the classification errors for classes 2 and 3 have improved compared to the previous model. However, the classification error for class 1 got slightly worse. Let us check the accuracy level of our model
p_1 <- predict(rf2, train_data)
head(p_1)
1 3 6 7 9 10
2 1 3 3 3 3
Levels: 1 2 3
head(train_data$NSP)
[1] 2 1 3 3 3 3
Levels: 1 2 3
confusionMatrix(p_1, train_data$NSP)
Confusion Matrix and Statistics
Reference
Prediction 1 2 3
1 1159 1 0
2 1 212 0
3 0 0 122
Overall Statistics
Accuracy : 0.9987
95% CI : (0.9952, 0.9998)
No Information Rate : 0.7759
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9964
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 1 Class: 2 Class: 3
Sensitivity 0.9991 0.9953 1.00000
Specificity 0.9970 0.9992 1.00000
Pos Pred Value 0.9991 0.9953 1.00000
Neg Pred Value 0.9970 0.9992 1.00000
Prevalence 0.7759 0.1425 0.08161
Detection Rate 0.7753 0.1418 0.08161
Detection Prevalence 0.7759 0.1425 0.08161
Balanced Accuracy 0.9981 0.9973 1.00000
The real test to see how our model is performing will be based on the test data as shown below
p_2 <- predict(rf2, test_data)
head(p_2)
2 4 5 8 11 16
1 1 1 3 2 1
Levels: 1 2 3
head(test_data$NSP)
[1] 1 1 1 3 2 1
Levels: 1 2 3
confusionMatrix(p_2, test_data$NSP)
Confusion Matrix and Statistics
Reference
Prediction 1 2 3
1 482 16 3
2 11 61 2
3 2 5 49
Overall Statistics
Accuracy : 0.9382
95% CI : (0.9165, 0.9557)
No Information Rate : 0.7845
P-Value [Acc > NIR] : <2e-16
Kappa : 0.8256
Mcnemar's Test P-Value : 0.4915
Statistics by Class:
Class: 1 Class: 2 Class: 3
Sensitivity 0.9737 0.74390 0.90741
Specificity 0.8603 0.97632 0.98787
Pos Pred Value 0.9621 0.82432 0.87500
Neg Pred Value 0.9000 0.96230 0.99130
Prevalence 0.7845 0.12995 0.08558
Detection Rate 0.7639 0.09667 0.07765
Detection Prevalence 0.7940 0.11727 0.08875
Balanced Accuracy 0.9170 0.86011 0.94764
Not much has changed in our model, but there are some improvements
Remember we have 300 trees in this model
hist(treesize(rf2),
main = "Number of Nodes for the Trees",
col = "blue")
The histogram shows the distribution of the number of nodes in each of those 300 trees. From the histogram above, there are about 80 trees with approximately 80 nodes each. We also have a few trees with around sixty nodes and a few trees with more than 100 nodes. The majority of trees have around 80 nodes.
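A numeric summary (a one-line sketch) of the same tree sizes gives the exact range and median behind the histogram:
summary(treesize(rf2))   # five-number summary of terminal-node counts per tree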
We can also find which variable played an important role in the model using the command below
varImpPlot(rf2, main = "Variable Importance in our Random Forest Model")
The chart shows how much worse the model would perform if each variable were removed. In other words, the first chart shows the mean decrease in accuracy when a variable is removed. Some variables contribute more to the model than others: for example, the DS variable contributes almost nothing, while variables such as ASTV and ALTV contribute a lot. The second chart measures how pure the nodes are at the ends of the trees with and without each variable. From that chart, the first four variables stand out as the most important predictors, with the largest mean decrease in Gini.
Let us view the top ten variables
varImpPlot(rf2,
sort = TRUE,
n.var = 10,
main = "Variable Importance in our Random Forest Model: Top 10")
We can also get quantitative values for the variable importance.
rf2$importance
1 2 3 MeanDecreaseAccuracy
LB 1.553953e-02 0.0258200214 1.071221e-02 1.657837e-02
AC 2.943162e-02 0.0871365675 4.320153e-02 3.856144e-02
FM 2.486324e-03 0.0114239468 3.655310e-03 3.824810e-03
UC 5.177781e-03 0.0477066403 3.462117e-02 1.362509e-02
DL 1.390607e-03 0.0011991263 1.702274e-02 2.630076e-03
DS 7.698229e-06 0.0000000000 0.000000e+00 6.006006e-06
DP 1.127090e-02 0.0061564706 5.551498e-02 1.410008e-02
ASTV 2.836333e-02 0.2595434435 1.826003e-01 7.354497e-02
MSTV 2.988848e-02 0.1897893962 2.050543e-01 6.687412e-02
ALTV 3.806223e-02 0.1688399845 1.836021e-01 6.840553e-02
MLTV 6.206474e-03 0.0319690528 3.022055e-02 1.179525e-02
Width 1.034185e-02 0.0101418426 3.082483e-02 1.190649e-02
Min 9.490114e-03 0.0115299398 3.972243e-02 1.219832e-02
Max 1.356609e-02 0.0105077479 8.713479e-03 1.272368e-02
Nmax 3.965936e-03 0.0029536825 1.343104e-02 4.631204e-03
Nzeros 7.601632e-04 0.0015629388 -1.894073e-05 8.093686e-04
Mode 2.117570e-02 0.0233126568 3.929863e-02 2.287848e-02
Mean 3.464728e-02 0.0451425315 2.300968e-01 5.206579e-02
Median 2.396184e-02 0.0295561776 5.979045e-02 2.760608e-02
Variance 9.396708e-03 0.0087543808 2.225253e-02 1.037441e-02
Tendency 6.445705e-04 0.0004500091 2.162051e-03 7.167475e-04
MeanDecreaseGini
LB 17.54785188
AC 25.78439515
FM 7.89618149
UC 19.55790339
DL 3.77273295
DS 0.06106956
DP 28.40131621
ASTV 88.66220697
MSTV 85.09145239
ALTV 77.94261740
MLTV 22.57844113
Width 13.62028859
Min 12.04096306
Max 14.62494129
Nmax 9.66268459
Nzeros 2.56566805
Mode 23.96025039
Mean 62.67833384
Median 25.72019197
Variance 10.03170156
Tendency 2.46446976
Let us now see how often each predictor variable is actually used in the random forest
varUsed(rf2)
[1] 1299 996 897 1605 375 5 724 2156 1278 2240 1553 1247 1143 1352 982
[16] 327 1227 1504 1296 881 322
The results above show how many times each variable was used in the model, in the same order as the predictor columns. DS was used only 5 times in the model.
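To make the vector easier to read, it can be paired with the predictor names (a small sketch, assuming NSP is still the 22nd column of train_data):
data.frame(variable   = names(train_data)[-22],   # predictor names, NSP dropped
           times_used = varUsed(rf2))             # counts in the same order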
Partial dependence plots give a good graphical depiction of the marginal effect of a variable on the class probability (classification) or the response (regression)
partialPlot(rf2, train_data, ASTV, "1")
When ASTV is less than about 60, the model is more likely to predict class 1 than when ASTV is above 60. Similarly, we can look at the plot for class 3
partialPlot(rf2, train_data, ASTV, "3")
From the plot, when ASTV is more than 60 the model predicts class 3 more often than when ASTV is less than 60. From the previous results we saw that misclassification was generally highest for class 2. Let’s make the ASTV plot for class 2
partialPlot(rf2, train_data, ASTV, "2")
We can see there is more confusion with class 2.
Let us get the first tree
getTree(rf2, 1, labelVar = TRUE)
left daughter right daughter split var split point status prediction
1 2 3 ALTV 7.500000e+00 1 <NA>
2 4 5 Mean 1.080000e+02 1 <NA>
3 6 7 MSTV 4.500000e-01 1 <NA>
4 8 9 DP 5.181345e-04 1 <NA>
5 10 11 ASTV 7.400000e+01 1 <NA>
6 12 13 ALTV 6.850000e+01 1 <NA>
7 14 15 Max 1.850000e+02 1 <NA>
8 16 17 Max 1.775000e+02 1 <NA>
9 18 19 Width 7.250000e+01 1 <NA>
10 20 21 DP 1.489997e-03 1 <NA>
11 22 23 Median 1.520000e+02 1 <NA>
12 24 25 Min 1.255000e+02 1 <NA>
13 0 0 <NA> 0.000000e+00 -1 3
14 26 27 UC 3.834495e-03 1 <NA>
15 28 29 ASTV 5.050000e+01 1 <NA>
16 0 0 <NA> 0.000000e+00 -1 3
17 0 0 <NA> 0.000000e+00 -1 1
18 0 0 <NA> 0.000000e+00 -1 2
19 0 0 <NA> 0.000000e+00 -1 3
20 30 31 Mode 1.040000e+02 1 <NA>
21 32 33 Mode 1.145000e+02 1 <NA>
22 34 35 MSTV 5.500000e-01 1 <NA>
23 36 37 Mode 1.530000e+02 1 <NA>
24 38 39 UC 5.547580e-03 1 <NA>
25 40 41 Max 1.425000e+02 1 <NA>
26 42 43 LB 1.365000e+02 1 <NA>
27 44 45 Mean 1.585000e+02 1 <NA>
28 0 0 <NA> 0.000000e+00 -1 1
29 46 47 MSTV 9.500000e-01 1 <NA>
30 48 49 DP 8.785115e-04 1 <NA>
31 50 51 LB 1.425000e+02 1 <NA>
32 52 53 Min 6.900000e+01 1 <NA>
33 54 55 ASTV 6.350000e+01 1 <NA>
34 0 0 <NA> 0.000000e+00 -1 3
35 0 0 <NA> 0.000000e+00 -1 2
36 0 0 <NA> 0.000000e+00 -1 1
37 0 0 <NA> 0.000000e+00 -1 2
38 56 57 Min 1.235000e+02 1 <NA>
39 58 59 Nmax 5.000000e-01 1 <NA>
40 0 0 <NA> 0.000000e+00 -1 3
41 60 61 ASTV 8.000000e+01 1 <NA>
42 62 63 Width 1.025000e+02 1 <NA>
43 64 65 MSTV 5.500000e-01 1 <NA>
44 66 67 Width 1.195000e+02 1 <NA>
45 0 0 <NA> 0.000000e+00 -1 2
46 68 69 Width 1.320000e+02 1 <NA>
47 0 0 <NA> 0.000000e+00 -1 2
48 0 0 <NA> 0.000000e+00 -1 2
49 0 0 <NA> 0.000000e+00 -1 1
50 70 71 MSTV 4.500000e-01 1 <NA>
51 72 73 ASTV 4.200000e+01 1 <NA>
52 0 0 <NA> 0.000000e+00 -1 2
53 0 0 <NA> 0.000000e+00 -1 1
54 74 75 FM 1.870177e-01 1 <NA>
55 0 0 <NA> 0.000000e+00 -1 3
56 76 77 FM 2.219280e-03 1 <NA>
57 0 0 <NA> 0.000000e+00 -1 1
58 0 0 <NA> 0.000000e+00 -1 2
59 0 0 <NA> 0.000000e+00 -1 1
60 78 79 MLTV 9.100000e+00 1 <NA>
61 0 0 <NA> 0.000000e+00 -1 3
62 80 81 Median 1.415000e+02 1 <NA>
63 0 0 <NA> 0.000000e+00 -1 2
64 82 83 LB 1.550000e+02 1 <NA>
65 84 85 AC 4.170140e-04 1 <NA>
66 86 87 FM 9.012281e-03 1 <NA>
67 0 0 <NA> 0.000000e+00 -1 2
68 0 0 <NA> 0.000000e+00 -1 2
69 0 0 <NA> 0.000000e+00 -1 1
70 88 89 Width 2.200000e+01 1 <NA>
71 90 91 Width 1.750000e+01 1 <NA>
72 0 0 <NA> 0.000000e+00 -1 1
73 92 93 Width 2.350000e+01 1 <NA>
74 0 0 <NA> 0.000000e+00 -1 1
75 94 95 MSTV 1.850000e+00 1 <NA>
76 96 97 FM 1.703756e-03 1 <NA>
77 0 0 <NA> 0.000000e+00 -1 2
78 98 99 UC 8.173059e-03 1 <NA>
79 0 0 <NA> 0.000000e+00 -1 1
80 0 0 <NA> 0.000000e+00 -1 1
81 100 101 Width 5.100000e+01 1 <NA>
82 102 103 Median 1.595000e+02 1 <NA>
83 0 0 <NA> 0.000000e+00 -1 1
84 104 105 LB 1.455000e+02 1 <NA>
85 106 107 UC 3.772002e-03 1 <NA>
86 108 109 Nmax 6.500000e+00 1 <NA>
87 0 0 <NA> 0.000000e+00 -1 2
88 0 0 <NA> 0.000000e+00 -1 2
89 0 0 <NA> 0.000000e+00 -1 1
90 110 111 ASTV 5.600000e+01 1 <NA>
91 112 113 DP 8.138395e-04 1 <NA>
92 114 115 Tendency 5.000000e-01 1 <NA>
93 116 117 MLTV 8.450000e+00 1 <NA>
94 0 0 <NA> 0.000000e+00 -1 1
95 118 119 Variance 4.650000e+01 1 <NA>
96 120 121 ALTV 1.500000e+01 1 <NA>
97 0 0 <NA> 0.000000e+00 -1 3
98 122 123 Mode 1.355000e+02 1 <NA>
99 0 0 <NA> 0.000000e+00 -1 1
100 0 0 <NA> 0.000000e+00 -1 1
101 0 0 <NA> 0.000000e+00 -1 2
102 0 0 <NA> 0.000000e+00 -1 2
103 0 0 <NA> 0.000000e+00 -1 1
104 124 125 ASTV 4.350000e+01 1 <NA>
105 0 0 <NA> 0.000000e+00 -1 1
106 0 0 <NA> 0.000000e+00 -1 1
107 0 0 <NA> 0.000000e+00 -1 2
108 126 127 Mode 1.620000e+02 1 <NA>
109 128 129 Variance 6.000000e+00 1 <NA>
110 0 0 <NA> 0.000000e+00 -1 1
111 0 0 <NA> 0.000000e+00 -1 2
112 130 131 Median 1.475000e+02 1 <NA>
113 132 133 FM 2.242762e-01 1 <NA>
114 0 0 <NA> 0.000000e+00 -1 2
115 0 0 <NA> 0.000000e+00 -1 1
116 0 0 <NA> 0.000000e+00 -1 1
117 134 135 Min 1.345000e+02 1 <NA>
118 0 0 <NA> 0.000000e+00 -1 2
119 0 0 <NA> 0.000000e+00 -1 3
120 136 137 Max 1.500000e+02 1 <NA>
121 0 0 <NA> 0.000000e+00 -1 2
122 138 139 Nmax 1.500000e+00 1 <NA>
123 140 141 ASTV 5.850000e+01 1 <NA>
124 0 0 <NA> 0.000000e+00 -1 1
125 142 143 ALTV 9.500000e+00 1 <NA>
126 144 145 ALTV 6.050000e+01 1 <NA>
127 0 0 <NA> 0.000000e+00 -1 2
128 146 147 Min 1.015000e+02 1 <NA>
129 0 0 <NA> 0.000000e+00 -1 1
130 0 0 <NA> 0.000000e+00 -1 1
131 148 149 MLTV 9.350000e+00 1 <NA>
132 0 0 <NA> 0.000000e+00 -1 1
133 150 151 Mode 1.380000e+02 1 <NA>
134 152 153 ASTV 4.350000e+01 1 <NA>
135 154 155 MSTV 7.500000e-01 1 <NA>
136 0 0 <NA> 0.000000e+00 -1 3
137 0 0 <NA> 0.000000e+00 -1 1
138 0 0 <NA> 0.000000e+00 -1 2
139 0 0 <NA> 0.000000e+00 -1 3
140 0 0 <NA> 0.000000e+00 -1 1
141 0 0 <NA> 0.000000e+00 -1 2
142 0 0 <NA> 0.000000e+00 -1 1
143 156 157 Max 1.470000e+02 1 <NA>
144 0 0 <NA> 0.000000e+00 -1 1
145 158 159 Min 1.160000e+02 1 <NA>
146 0 0 <NA> 0.000000e+00 -1 1
147 0 0 <NA> 0.000000e+00 -1 2
148 0 0 <NA> 0.000000e+00 -1 1
149 160 161 AC 1.562341e-03 1 <NA>
150 0 0 <NA> 0.000000e+00 -1 1
151 0 0 <NA> 0.000000e+00 -1 2
152 162 163 Min 8.950000e+01 1 <NA>
153 164 165 DL 3.508204e-03 1 <NA>
154 0 0 <NA> 0.000000e+00 -1 1
155 0 0 <NA> 0.000000e+00 -1 2
156 0 0 <NA> 0.000000e+00 -1 1
157 0 0 <NA> 0.000000e+00 -1 2
158 0 0 <NA> 0.000000e+00 -1 2
159 0 0 <NA> 0.000000e+00 -1 1
160 166 167 Min 1.225000e+02 1 <NA>
161 0 0 <NA> 0.000000e+00 -1 1
162 0 0 <NA> 0.000000e+00 -1 1
163 0 0 <NA> 0.000000e+00 -1 2
164 168 169 ALTV 6.500000e+00 1 <NA>
165 0 0 <NA> 0.000000e+00 -1 2
166 0 0 <NA> 0.000000e+00 -1 2
167 0 0 <NA> 0.000000e+00 -1 1
168 0 0 <NA> 0.000000e+00 -1 1
169 0 0 <NA> 0.000000e+00 -1 2
From the results, whenever the status is -1 it means the node is a terminal node, and the prediction at that terminal node is the patient’s NSP value; for example, a prediction of 2 means the patient is classified as suspect. Similarly, other terminal nodes (status -1) predict NSP values of 1 or 3.
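As a compact summary (a sketch using the same getTree() output), the terminal nodes of this first tree can be tabulated by their predicted class:
tree1 <- getTree(rf2, k = 1, labelVar = TRUE)
table(tree1$prediction[tree1$status == -1])   # number of terminal nodes per predicted class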
patients <- read.csv("patients.csv")
patients$NSP <- as.factor(patients$NSP)
str(patients)
'data.frame': 6 obs. of 22 variables:
$ LB : int 134 122 122 122 151 131
$ AC : num 0.0014 0 0 0 0 ...
$ FM : num 0 0 0 0 0 ...
$ UC : num 0.012623 0 0.001517 0.002967 0.000834 ...
$ DL : num 0.008415 0 0 0 0.000834 ...
$ DS : int 0 0 0 0 0 0
$ DP : num 0.00281 0 0 0 0 ...
$ ASTV : int 29 83 84 86 64 28
$ MSTV : num 6.3 0.5 0.5 0.3 1.9 1.5
$ ALTV : int 0 6 5 6 9 0
$ MLTV : num 0 15.6 13.6 10.6 27.6 5.4
$ Width : int 150 68 68 68 130 87
$ Min : int 50 62 62 62 56 71
$ Max : int 200 130 130 130 186 158
$ Nmax : int 6 0 0 1 2 2
$ Nzeros : int 3 0 0 0 0 0
$ Mode : int 71 122 122 122 150 141
$ Mean : int 107 122 122 122 148 137
$ Median : int 106 123 123 123 151 141
$ Variance: int 215 3 3 1 9 10
$ Tendency: int 0 1 1 1 1 1
$ NSP : Factor w/ 3 levels "1","2","3": 3 3 3 3 2 1
head(patients,5)
LB AC FM UC DL DS DP ASTV MSTV ALTV MLTV
1 134 0.001402525 0 0.012622721 0.008415147 0 0.002805049 29 6.3 0 0.0
2 122 0.000000000 0 0.000000000 0.000000000 0 0.000000000 83 0.5 6 15.6
3 122 0.000000000 0 0.001517451 0.000000000 0 0.000000000 84 0.5 5 13.6
4 122 0.000000000 0 0.002967359 0.000000000 0 0.000000000 86 0.3 6 10.6
5 151 0.000000000 0 0.000834028 0.000834028 0 0.000000000 64 1.9 9 27.6
Width Min Max Nmax Nzeros Mode Mean Median Variance Tendency NSP
1 150 50 200 6 3 71 107 106 215 0 3
2 68 62 130 0 0 122 122 123 3 1 3
3 68 62 130 0 0 122 122 123 3 1 3
4 68 62 130 1 0 122 122 123 1 1 3
5 130 56 186 2 0 150 148 151 9 1 2
# Use the trained random forest model to classify the patient
predicted_class <- predict(rf2, patients)
# The 'predicted_class' variable now contains the predicted class (1, 2, or 3)
# You can print or use it as needed
print(predicted_class)
1 2 3 4 5 6
3 3 3 3 2 1
Levels: 1 2 3
### Actual data
head(patients$NSP)
[1] 3 3 3 3 2 1
Levels: 1 2 3
confusionMatrix(predicted_class, patients$NSP)
Confusion Matrix and Statistics
Reference
Prediction 1 2 3
1 1 0 0
2 0 1 0
3 0 0 4
Overall Statistics
Accuracy : 1
95% CI : (0.5407, 1)
No Information Rate : 0.6667
P-Value [Acc > NIR] : 0.08779
Kappa : 1
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 1 Class: 2 Class: 3
Sensitivity 1.0000 1.0000 1.0000
Specificity 1.0000 1.0000 1.0000
Pos Pred Value 1.0000 1.0000 1.0000
Neg Pred Value 1.0000 1.0000 1.0000
Prevalence 0.1667 0.1667 0.6667
Detection Rate 0.1667 0.1667 0.6667
Detection Prevalence 0.1667 0.1667 0.6667
Balanced Accuracy 1.0000 1.0000 1.0000
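If needed, the class membership probabilities behind these predictions can also be inspected (a one-line sketch using the same fitted forest):
predict(rf2, patients, type = "prob")   # per-class probabilities for the six patients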
str(mydata4)
'data.frame': 2126 obs. of 22 variables:
$ LB : int 120 132 133 134 132 134 134 122 122 122 ...
$ AC : num 0 0.00638 0.00332 0.00256 0.00651 ...
$ FM : num 0 0 0 0 0 0 0 0 0 0 ...
$ UC : num 0 0.00638 0.00831 0.00768 0.00814 ...
$ DL : num 0 0.00319 0.00332 0.00256 0 ...
$ DS : num 0 0 0 0 0 0 0 0 0 0 ...
$ DP : num 0 0 0 0 0 ...
$ ASTV : int 73 17 16 16 16 26 29 83 84 86 ...
$ MSTV : num 0.5 2.1 2.1 2.4 2.4 5.9 6.3 0.5 0.5 0.3 ...
$ ALTV : int 43 0 0 0 0 0 0 6 5 6 ...
$ MLTV : num 2.4 10.4 13.4 23 19.9 0 0 15.6 13.6 10.6 ...
$ Width : int 64 130 130 117 117 150 150 68 68 68 ...
$ Min : int 62 68 68 53 53 50 50 62 62 62 ...
$ Max : int 126 198 198 170 170 200 200 130 130 130 ...
$ Nmax : int 2 6 5 11 9 5 6 0 0 1 ...
$ Nzeros : int 0 1 1 0 0 3 3 0 0 0 ...
$ Mode : int 120 141 141 137 137 76 71 122 122 122 ...
$ Mean : int 137 136 135 134 136 107 107 122 122 122 ...
$ Median : int 121 140 138 137 138 107 106 123 123 123 ...
$ Variance: int 73 12 13 13 11 170 215 3 3 1 ...
$ Tendency: int 1 0 0 1 1 0 0 1 1 1 ...
$ NSP : Factor w/ 3 levels "1","2","3": 2 1 1 1 1 3 3 3 3 3 ...
par(mfrow=c(1,2))
boxplot(LB ~ NSP)
boxplot(ASTV ~ NSP)
boxplot(Width ~ NSP)
boxplot(Median ~ NSP)
par(mfrow=c(1,1))
pairs( cbind(AC, ASTV, MSTV, Width, Min, Max), pch=19, lower.panel=NULL, cex=.5)
train_model <- trainControl(method = "repeatedcv", number = 5, repeats=10)
model.cart <- train( NSP ~.,
data = mydata4,
method = "rpart",
trControl = train_model)
model.cart
CART
2126 samples
21 predictor
3 classes: '1', '2', '3'
No pre-processing
Resampling: Cross-Validated (5 fold, repeated 10 times)
Summary of sample sizes: 1701, 1701, 1701, 1701, 1700, 1701, ...
Resampling results across tuning parameters:
cp Accuracy Kappa
0.08917197 0.8742240 0.6550994
0.17622081 0.8401331 0.5268462
0.20806794 0.8002354 0.2604269
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.08917197.
model.cart$finalModel
n= 2126
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 2126 471 1 (0.778457197 0.138758231 0.082784572)
2) MSTV>=0.55 1754 202 1 (0.884834664 0.053591790 0.061573546)
4) Mean>=107.5 1649 107 1 (0.935112189 0.055791389 0.009096422) *
5) Mean< 107.5 105 12 3 (0.095238095 0.019047619 0.885714286) *
3) MSTV< 0.55 372 171 2 (0.276881720 0.540322581 0.182795699) *
confusionMatrix(predict(model.cart, mydata4),
reference=mydata4$NSP, positive="1")
Confusion Matrix and Statistics
Reference
Prediction 1 2 3
1 1542 92 15
2 103 201 68
3 10 2 93
Overall Statistics
Accuracy : 0.8636
95% CI : (0.8483, 0.8779)
No Information Rate : 0.7785
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.6292
Mcnemar's Test P-Value : 8.841e-14
Statistics by Class:
Class: 1 Class: 2 Class: 3
Sensitivity 0.9317 0.68136 0.52841
Specificity 0.7728 0.90661 0.99385
Pos Pred Value 0.9351 0.54032 0.88571
Neg Pred Value 0.7631 0.94641 0.95893
Prevalence 0.7785 0.13876 0.08278
Detection Rate 0.7253 0.09454 0.04374
Detection Prevalence 0.7756 0.17498 0.04939
Balanced Accuracy 0.8523 0.79398 0.76113
fancyRpartPlot(model.cart$finalModel)
model.rf <- train(
NSP ~.,
data = mydata4,
method = "rf",
trControl = train_model)
model.rf
Random Forest
2126 samples
21 predictor
3 classes: '1', '2', '3'
No pre-processing
Resampling: Cross-Validated (5 fold, repeated 10 times)
Summary of sample sizes: 1701, 1700, 1701, 1701, 1701, 1701, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 0.9318432 0.8044846
11 0.9442125 0.8433624
21 0.9394626 0.8303969
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 11.
summary(model.rf$finalModel)
Length Class Mode
call 4 -none- call
type 1 -none- character
predicted 2126 factor numeric
err.rate 2000 -none- numeric
confusion 12 -none- numeric
votes 6378 matrix numeric
oob.times 2126 -none- numeric
classes 3 -none- character
importance 21 -none- numeric
importanceSD 0 -none- NULL
localImportance 0 -none- NULL
proximity 0 -none- NULL
ntree 1 -none- numeric
mtry 1 -none- numeric
forest 14 -none- list
y 2126 factor numeric
test 0 -none- NULL
inbag 0 -none- NULL
xNames 21 -none- character
problemType 1 -none- character
tuneValue 1 data.frame list
obsLevels 3 -none- character
param 0 -none- list
model.rf$finalModel
Call:
randomForest(x = x, y = y, mtry = param$mtry)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 11
OOB estimate of error rate: 5.08%
Confusion matrix:
1 2 3 class.error
1 1624 23 8 0.01873112
2 57 232 6 0.21355932
3 8 6 162 0.07954545
plot(model.rf$finalModel)
varImp(model.rf$finalModel)
Overall
LB 22.9384967
AC 31.5312698
FM 12.6508906
UC 29.4776634
DL 5.1060651
DS 0.5361335
DP 44.8461939
ASTV 124.7502475
MSTV 137.3937671
ALTV 105.3323077
MLTV 25.7922601
Width 17.2978347
Min 17.1618847
Max 18.6372755
Nmax 14.0623581
Nzeros 2.8341471
Mode 33.9885301
Mean 95.8131455
Median 25.7409495
Variance 11.3537206
Tendency 3.5667884
plot( varImp(model.rf) )
yhat = 1-predict(model.rf$finalModel, type="prob")[,1]
plot(mydata4$ASTV, yhat)
scatter.smooth(mydata4$MSTV, yhat, span=.4)
scatter.smooth(mydata4$ASTV, yhat, span=.4)
scatter.smooth(mydata4$ALTV, yhat, span=.4)
scatter.smooth(mydata4$Mean, yhat, span=.4)
scatter.smooth(mydata4$DP, yhat, span=.4)
scatter.smooth(mydata4$Mode, yhat, span=.4)
scatter.smooth(mydata4$AC, yhat, span=.4)
scatter.smooth(mydata4$UC, yhat, span=.4)
boxplot(yhat ~ Tendency)
boxplot(yhat ~ mydata4$NSP)
confusionMatrix(predict(model.rf, mydata4),
reference=mydata4$NSP, positive="1")
Confusion Matrix and Statistics
Reference
Prediction 1 2 3
1 1655 2 0
2 0 293 0
3 0 0 176
Overall Statistics
Accuracy : 0.9991
95% CI : (0.9966, 0.9999)
No Information Rate : 0.7785
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9974
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 1 Class: 2 Class: 3
Sensitivity 1.0000 0.9932 1.00000
Specificity 0.9958 1.0000 1.00000
Pos Pred Value 0.9988 1.0000 1.00000
Neg Pred Value 1.0000 0.9989 1.00000
Prevalence 0.7785 0.1388 0.08278
Detection Rate 0.7785 0.1378 0.08278
Detection Prevalence 0.7794 0.1378 0.08278
Balanced Accuracy 0.9979 0.9966 1.00000
heart <- read.csv("heart_failure_clinical_records_dataset.csv")
head(heart,5)
time age anaemia creatinine_phosphokinase diabetes ejection_fraction
1 4 75 0 582 0 20
2 6 55 0 7861 0 38
3 7 65 0 146 0 20
4 7 50 1 111 0 20
5 8 65 1 160 1 20
high_blood_pressure platelets serum_creatinine serum_sodium sex smoking
1 1 265000 1.9 130 1 0
2 0 263358 1.1 136 1 0
3 0 162000 1.3 129 1 1
4 0 210000 1.9 137 1 0
5 0 327000 2.7 116 0 0
heart.failure
1 1
2 1
3 1
4 1
5 1
heart$anaemia <- as.factor(heart$anaemia)
heart$diabetes <- as.factor(heart$diabetes)
heart$high_blood_pressure <- as.factor(heart$high_blood_pressure)
heart$sex <- as.factor(heart$sex)
heart$smoking <- as.factor(heart$smoking)
heart$heart.failure <- as.factor(heart$heart.failure)
str(heart)
'data.frame': 299 obs. of 13 variables:
$ time : int 4 6 7 7 8 8 10 10 10 10 ...
$ age : num 75 55 65 50 65 90 75 60 65 80 ...
$ anaemia : Factor w/ 2 levels "0","1": 1 1 1 2 2 2 2 2 1 2 ...
$ creatinine_phosphokinase: int 582 7861 146 111 160 47 246 315 157 123 ...
$ diabetes : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 2 1 1 ...
$ ejection_fraction : int 20 38 20 20 20 40 15 60 65 35 ...
$ high_blood_pressure : Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 1 1 2 ...
$ platelets : num 265000 263358 162000 210000 327000 ...
$ serum_creatinine : num 1.9 1.1 1.3 1.9 2.7 2.1 1.2 1.1 1.5 9.4 ...
$ serum_sodium : int 130 136 129 137 116 132 137 131 138 133 ...
$ sex : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 2 1 2 ...
$ smoking : Factor w/ 2 levels "0","1": 1 1 2 1 1 2 1 2 1 2 ...
$ heart.failure : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
library(rpart.plot)   # rpart.plot() for plotting the tree
library(partykit)     # as.party() for the party-style plot
heart_failure <- rpart(heart.failure ~ .,
data=heart,
method="class")
rpart.plot(heart_failure, main = "Decision Tree for Heart Failure")
tree.model.party1 <- as.party(heart_failure)
plot(tree.model.party1)
fancyRpartPlot(heart_failure, main = "Decision Tree for Heart Failure")
We can use the fitted decision tree above to make predictions on heart failure, as in the short sketch below, before moving on to K-Nearest Neighbours.
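A short sketch of that prediction step, using the fitted rpart object directly rather than reading the plot, and comparing against the observed outcome:
heart_pred <- predict(heart_failure, newdata = heart, type = "class")
table(Predicted = heart_pred, Actual = heart$heart.failure)   # in-sample confusion table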
frq(iris, Species)
Species <categorical>
# total N=150 valid N=150 mean=2.00 sd=0.82
Value | N | Raw % | Valid % | Cum. %
------------------------------------------
setosa | 50 | 33.33 | 33.33 | 33.33
versicolor | 50 | 33.33 | 33.33 | 66.67
virginica | 50 | 33.33 | 33.33 | 100.00
<NA> | 0 | 0.00 | <NA> | <NA>
str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
flowers_data <- iris
flowers_data <- flowers_data[!(flowers_data$Species == 'virginica'),]
tree.model <- rpart(Species ~ . ,
data = flowers_data)
tree.model.party <-
as.party(tree.model)
plot(tree.model.party)
flowers_data <- iris
tree.model <- rpart(Species~ .,
data = flowers_data)
tree.model.party <- as.party(tree.model)
plot(tree.model.party)
K-Nearest Neighbors (K-NN) is a versatile and intuitive machine learning algorithm used for both classification and regression tasks. At its core, K-NN operates on the principle of similarity: it classifies or predicts a data point based on the majority class or average of its K nearest neighbors in the feature space. The choice of K, the number of neighbors, is a critical parameter that influences the algorithm’s performance. A smaller K can make the model sensitive to noise and outliers, potentially leading to overfitting, while a larger K may result in a smoother decision boundary or prediction but could miss important local patterns in the data.
One of the key strengths of K-NN is its simplicity and ease of implementation. It doesn’t require a training phase; instead, it relies on the entire dataset during prediction. Additionally, K-NN can be applied to both numerical and categorical data, making it a versatile choice for various types of datasets. However, it’s important to consider the impact of scaling and distance metric selection when working with K-NN, as the algorithm’s performance can be sensitive to these factors. Overall, K-NN is a valuable tool in the machine learning toolkit, especially for cases where interpretability and simplicity are desired, but it should be used judiciously, with careful consideration of K and data preprocessing steps to achieve optimal results. Note that many people confuse k-nearest neighbours with k-means clustering; they are not the same thing. K-means clustering is an unsupervised machine learning algorithm, while k-NN is a supervised method used for classification and regression.
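As a minimal illustration of the idea (a sketch outside the admissions example that follows, using the class package rather than caret), k-NN can be run directly on the iris data loaded above:
library(class)
set.seed(42)
idx     <- sample(nrow(iris), 100)                  # 100 rows for training
train_x <- scale(iris[idx, 1:4])                    # centre and scale the features
test_x  <- scale(iris[-idx, 1:4],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))
pred    <- knn(train_x, test_x, cl = iris$Species[idx], k = 5)
table(Predicted = pred, Actual = iris$Species[-idx])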
knitr::include_graphics("knn.png")
knitr::include_graphics("knn2.png")
library(caret)
library(pROC)
library(mlbench)
k_nn <- read.csv("binary.csv")
head(k_nn,5)
admit gre gpa rank
1 0 380 3.61 3
2 1 660 3.67 3
3 1 800 4.00 1
4 1 640 3.19 4
5 0 520 2.93 4
str(k_nn)
'data.frame': 400 obs. of 4 variables:
$ admit: int 0 1 1 1 0 1 1 0 1 0 ...
$ gre : int 380 660 800 640 520 760 560 400 540 700 ...
$ gpa : num 3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
$ rank : int 3 3 1 4 4 2 1 2 3 2 ...
k_nn$admit <- as.factor(k_nn$admit)
str(k_nn)
'data.frame': 400 obs. of 4 variables:
$ admit: Factor w/ 2 levels "0","1": 1 2 2 2 1 2 2 1 2 1 ...
$ gre : int 380 660 800 640 520 760 560 400 540 700 ...
$ gpa : num 3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
$ rank : int 3 3 1 4 4 2 1 2 3 2 ...
k_nn$admit <- factor(k_nn$admit, levels = c(0,1),
labels = c("No","Yes"))
str(k_nn)
'data.frame': 400 obs. of 4 variables:
$ admit: Factor w/ 2 levels "No","Yes": 1 2 2 2 1 2 2 1 2 1 ...
$ gre : int 380 660 800 640 520 760 560 400 540 700 ...
$ gpa : num 3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
$ rank : int 3 3 1 4 4 2 1 2 3 2 ...
frq(k_nn, admit)
admit <categorical>
# total N=400 valid N=400 mean=1.32 sd=0.47
Value | N | Raw % | Valid % | Cum. %
--------------------------------------
No | 273 | 68.25 | 68.25 | 68.25
Yes | 127 | 31.75 | 31.75 | 100.00
<NA> | 0 | 0.00 | <NA> | <NA>
set.seed(1234)
ind3 <- sample(2, nrow(k_nn), replace = TRUE, prob = c(0.7, 0.3))
training <- k_nn[ind3 == 1,]
testing <- k_nn[ind3 ==2,]
str(testing)
'data.frame': 116 obs. of 4 variables:
$ admit: Factor w/ 2 levels "No","Yes": 1 1 1 2 2 2 1 2 2 1 ...
$ gre : int 520 700 480 800 520 780 400 500 520 400 ...
$ gpa : num 2.93 3.08 3.44 3.66 3.74 3.22 3.05 3.13 2.68 3.35 ...
$ rank : int 4 2 3 1 4 2 2 2 3 3 ...
str(training)
'data.frame': 284 obs. of 4 variables:
$ admit: Factor w/ 2 levels "No","Yes": 1 2 2 2 2 2 1 2 1 1 ...
$ gre : int 380 660 800 640 760 560 400 540 700 800 ...
$ gpa : num 3.61 3.67 4 3.19 3 2.98 3.08 3.39 3.92 4 ...
$ rank : int 3 3 1 4 2 1 2 3 2 4 ...
Before we estimate our k-NN model, let us specify the cross-validation (train control) settings
trControl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 3)
FIT <- train(admit~.,
data = training,
method = 'knn',
tuneLength = 20,
trControl = trControl,
preProc = c("center","scale"))
FIT
k-Nearest Neighbors
284 samples
3 predictor
2 classes: 'No', 'Yes'
Pre-processing: centered (3), scaled (3)
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 256, 255, 256, 255, 255, 256, ...
Resampling results across tuning parameters:
k Accuracy Kappa
5 0.6351040 0.09145898
7 0.6762425 0.16991017
9 0.6750930 0.15244021
11 0.6682348 0.13513104
13 0.6645430 0.12637105
15 0.6668774 0.11457683
17 0.6751314 0.13646317
19 0.6716010 0.12128138
21 0.6704543 0.11501118
23 0.6784620 0.12415411
25 0.6856431 0.13593414
27 0.6821128 0.12652554
29 0.6821538 0.12157653
31 0.6915134 0.13821970
33 0.6892556 0.13647228
35 0.6870389 0.13270602
37 0.6882321 0.12953000
39 0.6823618 0.10789173
41 0.6766119 0.09507794
43 0.6811713 0.10907586
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 31.
The results above indicate that we have fitted k-nearest neighbours on a sample of 284 observations, with three predictors (GRE, GPA and RANK) and two classes, “Yes” and “No”, where Yes means the student was admitted and No means the student was not admitted. We carried out ten-fold cross-validation repeated three times: in each round the training data is split into ten parts, nine of which are used to build the model while the remaining part is used to assess its performance. The output reports accuracy and kappa for each candidate k. Accuracy was used to select the optimal model using the largest value, and the final value used for the model was k = 31.
plot(FIT)
varImp(FIT)
ROC curve variable importance
Importance
gpa 100.00
rank 25.18
gre 0.00
From the output above, GPA turns out to be the most important variable and GRE the least important variable in our model.
pred <- predict(FIT, newdata = testing)
confusionMatrix(pred, testing$admit)
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 79 31
Yes 3 3
Accuracy : 0.7069
95% CI : (0.6152, 0.7877)
No Information Rate : 0.7069
P-Value [Acc > NIR] : 0.5461
Kappa : 0.0681
Mcnemar's Test P-Value : 3.649e-06
Sensitivity : 0.96341
Specificity : 0.08824
Pos Pred Value : 0.71818
Neg Pred Value : 0.50000
Prevalence : 0.70690
Detection Rate : 0.68103
Detection Prevalence : 0.94828
Balanced Accuracy : 0.52582
'Positive' Class : No
The model has an accuracy of 70.69%, and the no-information rate is also 70.69%. Basically, this means that if the model simply predicted “No” for every student, it would be correct 70.69% of the time, which is why the p-value for accuracy exceeding the NIR is not significant (0.5461). The quick sketch below shows where that rate comes from.
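A one-line sketch: the no-information rate is just the share of the largest class in the test set.
max(prop.table(table(testing$admit)))   # proportion of the majority class ("No")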
We can make a few changes to improve the performance of the model.
trControl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 3,
classProbs = TRUE,
summaryFunction = twoClassSummary)
Class probabilities are needed if we want to use ROC for selecting the best model with the optimal K
set.seed(222)
FIT <- train(admit~.,
data = training,
method = 'knn',
tuneLength = 20,
trControl = trControl,
preProc = c("center","scale"),
metric = "ROC",
tuneGrid = expand.grid(k= 1:60))
FIT
k-Nearest Neighbors
284 samples
3 predictor
2 classes: 'No', 'Yes'
Pre-processing: centered (3), scaled (3)
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 256, 256, 256, 256, 255, 256, ...
Resampling results across tuning parameters:
k ROC Sens Spec
1 0.5412281 0.7100000 0.36962963
2 0.5631871 0.7019298 0.35703704
3 0.5822368 0.7973684 0.34074074
4 0.5622904 0.7770175 0.26111111
5 0.5878558 0.8065789 0.28518519
6 0.5911501 0.8241228 0.27740741
7 0.5884016 0.8585965 0.28296296
8 0.5892982 0.8550877 0.26851852
9 0.5965010 0.8725439 0.29074074
10 0.5899123 0.8657895 0.27444444
11 0.5955945 0.8799123 0.28555556
12 0.5876170 0.8695614 0.27740741
13 0.5942300 0.8659649 0.24185185
14 0.5949854 0.8885965 0.25703704
15 0.5987768 0.8816667 0.22777778
16 0.6125097 0.8974561 0.22074074
17 0.6257943 0.8976316 0.23629630
18 0.6256481 0.9008772 0.21481481
19 0.6286647 0.8976316 0.22851852
20 0.6372612 0.9027193 0.22185185
21 0.6386209 0.9064035 0.21074074
22 0.6393519 0.9113158 0.22518519
23 0.6397710 0.9200000 0.21444444
24 0.6438743 0.9268421 0.21740741
25 0.6460039 0.9215789 0.19962963
26 0.6482359 0.9252632 0.19925926
27 0.6554288 0.9287719 0.20296296
28 0.6596881 0.9356140 0.21037037
29 0.6670419 0.9356140 0.19925926
30 0.6726901 0.9426316 0.19888889
31 0.6659649 0.9460526 0.17740741
32 0.6657797 0.9478070 0.18148148
33 0.6623830 0.9495614 0.18851852
34 0.6627973 0.9461404 0.19259259
35 0.6652242 0.9495614 0.18851852
36 0.6679678 0.9442982 0.18148148
37 0.6682261 0.9477193 0.18148148
38 0.6704825 0.9426316 0.18555556
39 0.6682846 0.9459649 0.18925926
40 0.6617788 0.9442982 0.18185185
41 0.6624708 0.9408772 0.17481481
42 0.6600097 0.9443860 0.15703704
43 0.6633138 0.9478070 0.16037037
44 0.6617008 0.9495614 0.16000000
45 0.6627924 0.9478070 0.14962963
46 0.6605897 0.9478070 0.15037037
47 0.6598294 0.9494737 0.14629630
48 0.6578558 0.9512281 0.14962963
49 0.6581481 0.9511404 0.14259259
50 0.6615692 0.9546491 0.14592593
51 0.6638060 0.9546491 0.13148148
52 0.6637037 0.9581579 0.14222222
53 0.6655507 0.9546491 0.12814815
54 0.6666715 0.9528947 0.11703704
55 0.6662671 0.9597368 0.11370370
56 0.6639133 0.9614912 0.11777778
57 0.6619737 0.9597368 0.10333333
58 0.6612037 0.9632456 0.10333333
59 0.6640156 0.9597368 0.09592593
60 0.6681189 0.9597368 0.09962963
ROC was used to select the optimal model using the largest value.
The final value used for the model was k = 30.
The results above indicate that we have fitted k-nearest neighbours on a sample of 284 observations, with three predictors (GRE, GPA and RANK) and two classes, “Yes” and “No”, where Yes means the student was admitted and No means the student was not. We again carried out ten-fold cross-validation repeated three times: in each round the training data is split into ten parts, nine of which are used to build the model and the remaining part to assess it. This time the summary reports ROC (the area under the ROC curve), sensitivity and specificity rather than accuracy and kappa. ROC was used to select the optimal model using the largest value; the final value used for the model was k = 30, with an area under the curve of about 0.6727, that is, 67.27%.
plot(FIT)
varImp(FIT)
ROC curve variable importance
Importance
gpa 100.00
rank 25.18
gre 0.00
pred <- predict(FIT, newdata = testing)
confusionMatrix(pred, testing$admit)
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 79 29
Yes 3 5
Accuracy : 0.7241
95% CI : (0.6334, 0.803)
No Information Rate : 0.7069
P-Value [Acc > NIR] : 0.3848
Kappa : 0.1423
Mcnemar's Test P-Value : 9.897e-06
Sensitivity : 0.9634
Specificity : 0.1471
Pos Pred Value : 0.7315
Neg Pred Value : 0.6250
Prevalence : 0.7069
Detection Rate : 0.6810
Detection Prevalence : 0.9310
Balanced Accuracy : 0.5552
'Positive' Class : No
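Because class probabilities are now being produced, a test-set ROC curve can be drawn with the pROC package loaded earlier (a sketch; the “Yes” column name matches the positive class label defined above):
prob_yes <- predict(FIT, newdata = testing, type = "prob")[, "Yes"]
roc_test <- roc(response = testing$admit, predictor = prob_yes)
plot(roc_test, main = "Test-set ROC, tuned k-NN")
auc(roc_test)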
For the purposes of training, we will use the Boston Housing data, loaded below with data("Boston"). This version of the dataset (with these column names) comes from the MASS package; the mlbench package loaded earlier provides a similar dataset called BostonHousing.
data("Boston")
str(Boston)
'data.frame': 506 obs. of 14 variables:
$ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
$ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
$ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
$ chas : int 0 0 0 0 0 0 0 0 0 0 ...
$ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
$ rm : num 6.58 6.42 7.18 7 7.15 ...
$ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
$ dis : num 4.09 4.97 4.97 6.06 6.06 ...
$ rad : int 1 2 2 3 3 3 5 5 5 5 ...
$ tax : num 296 242 242 222 222 222 311 311 311 311 ...
$ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
$ black : num 397 397 393 395 397 ...
$ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
$ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
data <- Boston
head(data,5)
crim zn indus chas nox rm age dis rad tax ptratio black lstat
1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
medv
1 24.0
2 21.6
3 34.7
4 33.4
5 36.2
The dependent variable of interest is “medv”. If you want to know more about the Boston Housing data, use the command below
?Boston
Boston Data Description: a data set containing housing values in 506 suburbs of Boston; a data frame with 506 rows and 14 variables.
* crim: per capita crime rate by town.
* zn: proportion of residential land zoned for lots over 25,000 sq.ft.
* indus: proportion of non-retail business acres per town.
* chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
* nox: nitrogen oxides concentration (parts per 10 million).
* rm: average number of rooms per dwelling.
* age: proportion of owner-occupied units built prior to 1940.
* dis: weighted mean of distances to five Boston employment centres.
* rad: index of accessibility to radial highways.
* tax: full-value property-tax rate per $10,000.
* ptratio: pupil-teacher ratio by town.
* black: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town.
* lstat: lower status of the population (percent).
* medv: median value of owner-occupied homes in $1000s.
set.seed(1234)
ind <- sample(2, nrow(data), replace = TRUE, prob = c(0.7, 0.3))
training <- data[ind == 1,]
testing <- data[ind ==2,]
trControl <- trainControl(method = 'repeatedcv',
number = 10,
repeats = 3)
set.seed(333)
attach(data)
FIT <- train(medv~.,
data = training,
tuneGrid = expand.grid(k=1:70),
method = 'knn',
trControl = trControl,
preProc = c('center', 'scale'))
FIT
k-Nearest Neighbors
355 samples
13 predictor
Pre-processing: centered (13), scaled (13)
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 320, 320, 319, 320, 319, 319, ...
Resampling results across tuning parameters:
k RMSE Rsquared MAE
1 4.247396 0.7819818 2.799416
2 3.979579 0.8083403 2.651736
3 3.950952 0.8194244 2.602271
4 4.081028 0.8117352 2.663417
5 4.187455 0.8009220 2.751447
6 4.269958 0.7936822 2.830755
7 4.259754 0.7944348 2.840942
8 4.285647 0.7923607 2.885038
9 4.277810 0.7942649 2.875932
10 4.288972 0.7960606 2.886287
11 4.306883 0.7963191 2.899734
12 4.311103 0.7955246 2.920360
13 4.336054 0.7947843 2.950035
14 4.405470 0.7888339 3.005893
15 4.423224 0.7887495 3.030541
16 4.492374 0.7852007 3.080057
17 4.520643 0.7848370 3.103929
18 4.566745 0.7827552 3.133610
19 4.608306 0.7806416 3.162693
20 4.638719 0.7801816 3.187662
21 4.671443 0.7777975 3.221833
22 4.715731 0.7762199 3.257161
23 4.773820 0.7719154 3.309642
24 4.810877 0.7707091 3.345512
25 4.838732 0.7700164 3.370597
26 4.867535 0.7704611 3.385367
27 4.883757 0.7726904 3.402671
28 4.913500 0.7720243 3.427319
29 4.951461 0.7704870 3.453771
30 4.977230 0.7704819 3.469728
31 5.018419 0.7690603 3.497270
32 5.041679 0.7698973 3.512520
33 5.072888 0.7694062 3.539960
34 5.103519 0.7669878 3.559422
35 5.132717 0.7659705 3.576025
36 5.170067 0.7641995 3.596993
37 5.201979 0.7640409 3.619875
38 5.242001 0.7625534 3.648506
39 5.269697 0.7618512 3.666397
40 5.313147 0.7587825 3.694215
41 5.347397 0.7563999 3.718244
42 5.377001 0.7539976 3.737528
43 5.409438 0.7517684 3.756962
44 5.432267 0.7502295 3.775538
45 5.464056 0.7477553 3.799406
46 5.492641 0.7452937 3.812879
47 5.513741 0.7437770 3.825341
48 5.532227 0.7436683 3.844284
49 5.554146 0.7425758 3.860164
50 5.569495 0.7407185 3.873060
51 5.590183 0.7391508 3.884943
52 5.611812 0.7376250 3.900505
53 5.627486 0.7363782 3.915204
54 5.644846 0.7348713 3.926904
55 5.660921 0.7345985 3.940371
56 5.682031 0.7333447 3.953439
57 5.696958 0.7326176 3.963038
58 5.712698 0.7318748 3.973957
59 5.730909 0.7309311 3.987815
60 5.749974 0.7301051 3.999038
61 5.772356 0.7288745 4.014376
62 5.790233 0.7273679 4.026197
63 5.806207 0.7264301 4.037537
64 5.821958 0.7251828 4.046288
65 5.836567 0.7240744 4.055216
66 5.848333 0.7232492 4.063122
67 5.855005 0.7233243 4.068135
68 5.870225 0.7226127 4.077585
69 5.885600 0.7213242 4.089970
70 5.902702 0.7203636 4.102866
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 3.
The results above show that we have run K-Nearest Neighbors with 355 samples and 13 predictors. We used ten-fold cross-validation repeated three times: the data is divided into ten folds, nine folds are used for model estimation and one fold is used for model assessment and validation. RMSE was used to select the optimal model using the smallest value, and the final value used for the model was k = 3.
plot(FIT)
The graph above shows that the lowest RMSE occurs at k = 3. After that point, the RMSE increases steadily.
varImp(FIT)
loess r-squared variable importance
Overall
lstat 100.00
rm 94.28
ptratio 40.21
tax 37.51
indus 36.41
crim 35.25
nox 31.91
rad 25.82
black 24.68
age 22.83
dis 19.78
zn 16.23
chas 0.00
The output above ranks the variables in our model from the most important to the least. The variable “chas” has an importance of zero; in other words, “chas” is the least important variable in our k-NN model.
pred <- predict(FIT, newdata = testing)
Because the response variable is numeric and continuous, we will calculate the root mean squared error (RMSE) for assessment.
RMSE(pred, testing$medv)
[1] 6.127333
plot(pred, testing$medv)
FIT <- train(medv~.,
data = training,
tuneGrid = expand.grid(k=1:70),
method = 'knn',
metric = 'Rsquared',
trControl = trControl,
preProc = c('center', 'scale'))
FIT
k-Nearest Neighbors
355 samples
13 predictor
Pre-processing: centered (13), scaled (13)
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 320, 320, 320, 319, 319, 320, ...
Resampling results across tuning parameters:
k RMSE Rsquared MAE
1 4.155683 0.7889735 2.751110
2 3.920046 0.8085024 2.615415
3 3.894027 0.8191854 2.576399
4 3.990820 0.8169846 2.633052
5 4.098328 0.8092759 2.721242
6 4.164762 0.8005553 2.782908
7 4.250046 0.7913992 2.849220
8 4.237511 0.7935728 2.849656
9 4.241670 0.7954279 2.867654
10 4.223247 0.8011824 2.864390
11 4.256550 0.7990746 2.888878
12 4.275542 0.7980135 2.909365
13 4.305207 0.7965859 2.951738
14 4.355170 0.7937387 2.994838
15 4.370063 0.7945428 3.019494
16 4.413595 0.7935461 3.052740
17 4.459155 0.7910273 3.086315
18 4.502035 0.7897903 3.119118
19 4.555873 0.7856789 3.160073
20 4.599092 0.7832938 3.192969
21 4.637681 0.7810155 3.224354
22 4.673391 0.7798530 3.253564
23 4.720768 0.7770125 3.283033
24 4.763076 0.7746145 3.315225
25 4.796324 0.7735186 3.353459
26 4.825073 0.7728740 3.378364
27 4.850184 0.7736520 3.402117
28 4.876660 0.7735504 3.423978
29 4.913905 0.7728784 3.453461
30 4.945720 0.7723983 3.475163
31 4.986189 0.7706545 3.500545
32 5.014103 0.7695456 3.513243
33 5.042506 0.7700251 3.532718
34 5.063848 0.7713883 3.547570
35 5.096916 0.7698529 3.572919
36 5.130072 0.7690975 3.594625
37 5.158998 0.7682367 3.616503
38 5.198007 0.7654431 3.638909
39 5.230199 0.7640846 3.656377
40 5.263299 0.7625566 3.676656
41 5.295303 0.7606504 3.696649
42 5.334168 0.7561353 3.721177
43 5.365412 0.7538500 3.744440
44 5.399053 0.7512384 3.763381
45 5.431207 0.7482837 3.789278
46 5.450129 0.7482528 3.802465
47 5.471523 0.7474810 3.817650
48 5.498410 0.7455536 3.828871
49 5.519348 0.7443257 3.845853
50 5.541579 0.7428898 3.861870
51 5.568951 0.7407798 3.880670
52 5.580325 0.7401795 3.892955
53 5.597908 0.7379495 3.906649
54 5.618826 0.7369065 3.920215
55 5.634407 0.7365578 3.932015
56 5.652281 0.7355451 3.945043
57 5.672553 0.7340832 3.961924
58 5.686516 0.7331445 3.974976
59 5.706086 0.7318256 3.989420
60 5.715973 0.7322444 3.992434
61 5.736630 0.7303771 4.004595
62 5.754764 0.7291911 4.017459
63 5.772238 0.7279232 4.029730
64 5.783690 0.7273407 4.035692
65 5.797816 0.7260285 4.042914
66 5.809293 0.7256345 4.048952
67 5.823800 0.7251425 4.057399
68 5.836885 0.7243181 4.068262
69 5.849227 0.7239215 4.081166
70 5.863107 0.7226044 4.092137
Rsquared was used to select the optimal model using the largest value.
The final value used for the model was k = 3.
plot(FIT)
varImp(FIT)
loess r-squared variable importance
Overall
lstat 100.00
rm 94.28
ptratio 40.21
tax 37.51
indus 36.41
crim 35.25
nox 31.91
rad 25.82
black 24.68
age 22.83
dis 19.78
zn 16.23
chas 0.00
pred <- predict(FIT, newdata = testing)
data_pred <- data.frame(testing, pred)
head(data_pred,5)
crim zn indus chas nox rm age dis rad tax ptratio black lstat
5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
14 0.62976 0 8.14 0 0.538 5.949 61.8 4.7075 4 307 21.0 396.90 8.26
16 0.62739 0 8.14 0 0.538 5.834 56.5 4.4986 4 307 21.0 395.62 8.47
26 0.84054 0 8.14 0 0.538 5.599 85.7 4.4546 4 307 21.0 303.42 16.51
28 0.95577 0 8.14 0 0.538 6.047 88.8 4.4534 4 307 21.0 306.38 17.28
medv pred
5 36.2 32.26667
14 20.4 18.93333
16 19.9 19.20000
26 13.9 14.13333
28 14.8 14.40000
str(data_pred)
'data.frame': 151 obs. of 15 variables:
$ crim : num 0.069 0.6298 0.6274 0.8405 0.9558 ...
$ zn : num 0 0 0 0 0 0 0 0 75 0 ...
$ indus : num 2.18 8.14 8.14 8.14 8.14 8.14 5.96 5.96 2.95 6.91 ...
$ chas : int 0 0 0 0 0 0 0 0 0 0 ...
$ nox : num 0.458 0.538 0.538 0.538 0.538 0.538 0.499 0.499 0.428 0.448 ...
$ rm : num 7.15 5.95 5.83 5.6 6.05 ...
$ age : num 54.2 61.8 56.5 85.7 88.8 94.4 68.2 30.2 21.8 62 ...
$ dis : num 6.06 4.71 4.5 4.45 4.45 ...
$ rad : int 3 4 4 4 4 4 5 5 3 3 ...
$ tax : num 222 307 307 307 307 307 279 279 252 233 ...
$ ptratio: num 18.7 21 21 21 21 21 19.2 19.2 18.3 17.9 ...
$ black : num 397 397 396 303 306 ...
$ lstat : num 5.33 8.26 8.47 16.51 17.28 ...
$ medv : num 36.2 20.4 19.9 13.9 14.8 18.4 18.9 24.7 30.8 19.4 ...
$ pred : num 32.3 18.9 19.2 14.1 14.4 ...
# Sample data with two variables
data <- data.frame(data_pred$medv, data_pred$pred
)
# Create a line plot for both variables
plot(data$data_pred.medv, type = "l", col = "blue", xlab = "Time Axis", ylab = "MEDV", main = "Actual vs Predicted MEDV")
lines(data$data_pred.pred, col = "red")
# Add a legend
legend("topright", legend = c("Actual MEDV", "Predicted MEDV"), col = c("blue", "red"), lty = 1)
Because the response variable is numeric and continuous, we will again calculate the root mean squared error (RMSE) for assessment.
RMSE(pred, testing$medv)
[1] 6.127333
plot(pred, testing$medv)
Interpreting a k-Nearest Neighbors (k-NN) regression model is somewhat different from interpreting a traditional linear regression model. In k-NN regression, instead of fitting a mathematical equation to your data, the model makes predictions based on the values of the k nearest data points in the training dataset. Here’s how you can interpret a k-NN regression model:
For each data point you want to predict, the k-NN algorithm identifies the k nearest data points in the training dataset based on some distance metric (e.g., Euclidean distance). It then calculates the average (or weighted average) of the target values (the values you’re trying to predict) for those k nearest data points, and this average is used as the prediction for the new data point.
Tuning Parameter (k): The most important parameter in k-NN regression is “k,” which represents the number of nearest neighbors to consider. A small k (e.g., 1 or 3) may result in a model that closely follows the training data but is sensitive to noise. A large k (e.g., 10 or more) may result in a smoother prediction surface but might not capture local variations. The choice of k should be based on cross-validation and the characteristics of your data.
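As a hedged sketch of that tuning process, the following chooses k by 10-fold cross-validation with caret, assuming the same Boston housing training split used to build FIT above.
library(caret)
set.seed(123)
fit_k <- train(medv ~ ., data = training, method = "knn",
               preProcess = c("center", "scale"),
               trControl = trainControl(method = "cv", number = 10),
               tuneGrid = data.frame(k = seq(1, 25, by = 2)))
fit_k$bestTune   # the k with the best cross-validated performance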
The choice of distance metric (e.g., Euclidean, Manhattan, etc.) can impact the model’s performance. Different distance metrics can lead to different interpretations of “closeness” among data points.
Unlike linear regression, k-NN regression can capture non-linear relationships in the data. Interpretation may not involve coefficients or slope values as in linear regression.
k-NN regression captures local patterns in the data. Interpretation involves understanding the local behavior around a prediction point. The model does not provide a global equation that describes the entire dataset.
Visualizations can be helpful for interpretation. Plotting the k nearest neighbors for specific data points can provide insights into why the model made a particular prediction.
k-NN does not provide feature importance scores like some other models (e.g., Random Forest or Gradient Boosting). However, you can still analyze feature importance indirectly by examining which features are most influential in determining the nearest neighbors.
Use appropriate regression evaluation metrics like Mean Absolute Error (MAE), Mean Squared Error (MSE), or R-squared to assess model performance. These metrics can give you a sense of how well the k-NN regression model fits the data.
Scaling and normalization of features can significantly affect the results, as k-NN is sensitive to the scale of input variables. Interpretation may involve considering the effects of feature scaling.
In summary, interpreting a k-NN regression model involves understanding its prediction process, the choice of hyperparameters (especially k), the impact of distance metrics, and the local nature of the model’s predictions. Visualizations and performance metrics play a crucial role in assessing and explaining the model’s behavior on your specific dataset.
In k-Nearest Neighbors (k-NN) regression, accuracy is not typically used as an evaluation metric because k-NN regression is a type of regression, not classification. Accuracy is more suitable for classification problems where you’re predicting discrete class labels.
For regression tasks, you typically use different evaluation metrics to assess the performance of your model. Common metrics for regression tasks include:
Mean Absolute Error (MAE): measures the average absolute difference between the predicted and actual values. It is less sensitive to outliers than Mean Squared Error.
Mean Squared Error (MSE): measures the average of the squared differences between predicted and actual values, giving higher weight to larger errors.
Root Mean Squared Error (RMSE): the square root of the MSE; it provides an interpretable measure of the average error in the same units as the target variable.
R-squared: measures the proportion of the variance in the dependent variable that is predictable from the independent variables. A higher R-squared indicates a better fit.
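As an illustration only, these metrics can be computed by hand for the k-NN predictions above, assuming pred and testing from the earlier chunks are still in memory.
mae  <- mean(abs(pred - testing$medv))                     # Mean Absolute Error
mse  <- mean((pred - testing$medv)^2)                      # Mean Squared Error
rmse <- sqrt(mse)                                          # Root Mean Squared Error
r2   <- 1 - sum((testing$medv - pred)^2) /
            sum((testing$medv - mean(testing$medv))^2)     # R-squared
c(MAE = mae, MSE = mse, RMSE = rmse, R2 = r2)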
A Support Vector Machine (SVM) is a versatile and widely-used supervised machine learning algorithm designed for classification and regression tasks. SVM aims to find an optimal hyperplane or decision boundary that best separates data points into different classes or predicts continuous values, while maximizing the margin, which is the distance between the boundary and the nearest data points of each class. What sets SVM apart is its ability to handle high-dimensional data effectively and its flexibility to deal with non-linear relationships in the data through the kernel trick, which allows data to be implicitly mapped into higher-dimensional spaces. In binary classification, SVM seeks to find the hyperplane that best separates the two classes, and in multiclass scenarios, it uses strategies like one-versus-rest (OvR) or one-versus-one (OvO) to extend classification capabilities. SVM has found applications in diverse fields, including text classification, image recognition, and financial prediction, owing to its robustness and effectiveness in modeling complex decision boundaries.
SVM’s key concepts include support vectors, which are data points that lie closest to the decision boundary and significantly influence the model, and the regularization parameter (C), which balances the trade-off between achieving a wide margin and minimizing misclassification errors. By maximizing the margin, SVM generally offers good generalization to new, unseen data. However, it can be computationally intensive for large datasets, and its interpretability is limited compared to simpler models like logistic regression. Despite these limitations, SVM remains a valuable tool in machine learning, particularly when dealing with high-dimensional data and complex classification problems, thanks to its ability to find optimal separating hyperplanes in various feature spaces.
Support Vector Machines (SVMs) have several key features and characteristics that make them a powerful and widely-used machine learning algorithm:
Effective for High-Dimensional Data: SVMs are effective in scenarios where the number of features (dimensions) is greater than the number of samples. They excel in high-dimensional spaces, such as text classification or image recognition.
Optimal Hyperplane: SVM aims to find the optimal hyperplane that maximizes the margin, which is the distance between the decision boundary and the nearest data points of each class. This leads to better generalization to unseen data.
Kernel Trick: SVMs can handle non-linear data by implicitly mapping it into higher-dimensional spaces using kernel functions (e.g., polynomial, radial basis function). This allows SVMs to model complex decision boundaries.
Sparsity: SVMs are “sparse” models, meaning only a subset of data points, called support vectors, influence the decision boundary. This makes them memory-efficient and suitable for large datasets.
Binary and Multiclass Classification: SVMs can be used for both binary and multiclass classification tasks. Strategies like one-versus-rest (OvR) or one-versus-one (OvO) are employed for multiclass classification.
Regularization Parameter (C): The parameter C controls the trade-off between maximizing the margin and minimizing the classification error. Higher C values lead to a narrower margin but fewer misclassifications, while lower C values prioritize a wider margin but may tolerate some misclassifications.
Robust to Outliers: SVMs are less sensitive to outliers in the data compared to some other algorithms because the decision boundary is influenced by support vectors rather than all data points.
Interpretability: While not as interpretable as linear models like logistic regression, SVMs can provide insights into feature importance through the examination of support vectors and their associated coefficients.
Versatile: SVMs are versatile and can be applied to various machine learning tasks, including classification, regression (Support Vector Regression), and outlier detection.
Well-Defined Theoretical Framework: SVMs are founded on a well-defined mathematical framework with strong theoretical underpinnings, making them attractive for rigorous analysis.
Popular in Real-World Applications: SVMs have been successfully applied in a wide range of real-world applications, including image and text classification, bioinformatics, finance, and more.
Scalability: While SVMs can be computationally intensive for large datasets, methods like Stochastic Gradient Descent (SGD) and the use of approximate solvers have been developed to enhance scalability.
Overall, Support Vector Machines offer a powerful tool for both linear and non-linear classification problems, with a strong emphasis on maximizing the margin between classes, robustness, and adaptability to various data types and applications.
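As a small, hedged sketch of the regularization parameter in practice, the snippet below uses the e1071 package (introduced formally in the next section) and the built-in iris data to show how the number of support vectors typically shrinks as the cost C grows and the margin narrows; the exact counts depend on the data.
library(e1071)
for (c_val in c(0.1, 1, 100)) {
  m <- svm(Species ~ ., data = iris, kernel = "linear", cost = c_val)
  cat("cost =", c_val, "-> support vectors:", nrow(m$SV), "\n")
}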
In this support vector machine demonstration, we are going to use the built-in “iris” data set.
## View the data set
data <- iris
head(data,5)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
str(data)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
attach(data)
The data set has 150 observations and 5 variables, 4 of which are numeric and one of which is categorical: Species is a factor variable. Using the four numeric variables, we want to build a classification model that will help us predict the correct species. We need to load some libraries, including “ggplot2”.
Let us make a plot of petal length and petal width
qplot(Petal.Length, Petal.Width, data = data)
Let us color the scatter plot and see if there is a possibility of separation.
qplot(Petal.Length, Petal.Width,
color = Species, data = data)
From the scatter plot above, there is separation based on the species, as shown by the different colors. The red points are “setosa”, the green ones are “versicolor”, and the blue ones are “virginica”. We can see that there is some separation, since the setosa species are far from the other two.
Let us make a plot of sepal length and sepal width
qplot(Sepal.Length, Sepal.Width,
color = Species, data = data)
Let us get a quick review of what a support vector machine is. Consider the figure below.
knitr::include_graphics("support.png")
The color-coded points are data points for two categories, that is, two distinct classes. In other words, we are looking for the optimal separating hyperplane between these two classes, and we achieve that by maximizing the margin around the separating hyperplane. Points that lie on the margin boundaries are called the support vectors, and the middle line is the separating hyperplane. In situations where we cannot obtain a linear separator, the data points are projected into a higher-dimensional space so that they become linearly separable; for this we make use of kernels. The program that helps us achieve this is called the support vector machine. For the purpose of this demonstration, load the “e1071” library, which is commonly used for support vector machines. We can now develop a support vector machine.
library(e1071)
model <- svm(Species~., data = data)
summary(model)
Call:
svm(formula = Species ~ ., data = data)
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 1
Number of Support Vectors: 51
( 8 22 21 )
Number of Classes: 3
Levels:
setosa versicolor virginica
The output above gives the formula used. The type used is C-classification with a radial SVM kernel. Since our dependent variable, Species, is a factor variable, classification is the default type; if the dependent variable were continuous, we would instead have a regression. The cost parameter is the cost of constraint violation, with a default of one, and gamma is 0.25. The model has 51 support vectors: 8 belong to the first species, 22 to the second species, and 21 to the third species. The three species are setosa, versicolor, and virginica. Let us plot the results and see what comes out. Since we have four variables, we will need to make some modifications; if we had only two variables (Petal length and Petal width), the code below would be sufficient.
## plot(model, data = data)
We will make use of a slice for the other two quantitative variables: Sepal width is held constant at 3 and Sepal length at 4.
plot(model, data = data,
Petal.Width~Petal.Length,
slice = list(Sepal.Width = 3, Sepal.Length = 4))
The above plot shows a visualization of the support vector machine: the data points, the support vectors (represented by crosses), and the decision boundaries for the three species. The lower part of the graph comprises data points belonging to the first species, the middle part comprises data points belonging to the second species, and the upper part comprises data points belonging to the third species. The background color represents the predicted class in each region. As noted above, we have 51 support vectors: 8, 22, and 21 belonging to the first, second, and third species, respectively.
pred <- predict(model, data)
confusionMatrix(pred, data$Species)
Confusion Matrix and Statistics
Reference
Prediction setosa versicolor virginica
setosa 50 0 0
versicolor 0 48 2
virginica 0 2 48
Overall Statistics
Accuracy : 0.9733
95% CI : (0.9331, 0.9927)
No Information Rate : 0.3333
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.96
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: setosa Class: versicolor Class: virginica
Sensitivity 1.0000 0.9600 0.9600
Specificity 1.0000 0.9800 0.9800
Pos Pred Value 1.0000 0.9600 0.9600
Neg Pred Value 1.0000 0.9800 0.9800
Prevalence 0.3333 0.3333 0.3333
Detection Rate 0.3333 0.3200 0.3200
Detection Prevalence 0.3333 0.3333 0.3333
Balanced Accuracy 1.0000 0.9700 0.9700
From the results above, there were 50 setosa species and the model predicted all 50 as setosa. Additionally, the data has 50 versicolor species, of which the model correctly predicted 48, and 50 virginica species, of which the model also correctly predicted 48. However, there are misclassifications: 2 virginica observations were predicted as versicolor, and 2 versicolor observations were predicted as virginica. Overall, the model has an accuracy of approximately 97.33%, so the misclassification rate is approximately 2.67%. It is also good to note that this model used the radial kernel.
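As a quick sanity check (a sketch, reusing pred and data from the chunks above), the accuracy and misclassification rate can also be computed directly from the predictions.
mean(pred == data$Species)   # accuracy: proportion of correctly classified observations
mean(pred != data$Species)   # misclassification rate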
Other than radial function, we can use linear function as shown below.
model1 <- svm(Species~.,
kernel = "linear",
data = data)
summary(model1)
Call:
svm(formula = Species ~ ., data = data, kernel = "linear")
Parameters:
SVM-Type: C-classification
SVM-Kernel: linear
cost: 1
Number of Support Vectors: 29
( 2 15 12 )
Number of Classes: 3
Levels:
setosa versicolor virginica
Now the kernel type is linear rather than radial. The number of support vectors has been reduced to 29, compared to the previous 51. In this new model, 2 of the support vectors belong to the first species, 15 belong to the second species, and 12 belong to the third species. Let us view the plot and see if there is any change.
plot(model1, data = data,
Petal.Width~Petal.Length,
slice = list(Sepal.Width = 3, Sepal.Length = 4))
Let us look at the confusion matrix
pred1 <- predict(model1, data)
confusionMatrix(pred1, data$Species)
Confusion Matrix and Statistics
Reference
Prediction setosa versicolor virginica
setosa 50 0 0
versicolor 0 46 1
virginica 0 4 49
Overall Statistics
Accuracy : 0.9667
95% CI : (0.9239, 0.9891)
No Information Rate : 0.3333
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.95
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: setosa Class: versicolor Class: virginica
Sensitivity 1.0000 0.9200 0.9800
Specificity 1.0000 0.9900 0.9600
Pos Pred Value 1.0000 0.9787 0.9245
Neg Pred Value 1.0000 0.9612 0.9897
Prevalence 0.3333 0.3333 0.3333
Detection Rate 0.3333 0.3067 0.3267
Detection Prevalence 0.3333 0.3133 0.3533
Balanced Accuracy 1.0000 0.9550 0.9700
In the new model, we have a slightly higher misclassification rate of approximately 3.33%, with a model accuracy of approximately 96.67%.
model2 <- svm(Species~.,
kernel = "polynomial",
data = data)
summary(model2)
Call:
svm(formula = Species ~ ., data = data, kernel = "polynomial")
Parameters:
SVM-Type: C-classification
SVM-Kernel: polynomial
cost: 1
degree: 3
coef.0: 0
Number of Support Vectors: 54
( 6 26 22 )
Number of Classes: 3
Levels:
setosa versicolor virginica
Using the polynomial kernel function, the model has 54 support vectors, with 6, 26, and 22 belonging to the first, second, and third categories, respectively. Let us make the plot.
plot(model2, data = data,
Petal.Width~Petal.Length,
slice = list(Sepal.Width = 3, Sepal.Length = 4))
pred2 <- predict(model2, data)
confusionMatrix(pred2, data$Species)
Confusion Matrix and Statistics
Reference
Prediction setosa versicolor virginica
setosa 50 0 0
versicolor 0 50 7
virginica 0 0 43
Overall Statistics
Accuracy : 0.9533
95% CI : (0.9062, 0.981)
No Information Rate : 0.3333
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.93
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: setosa Class: versicolor Class: virginica
Sensitivity 1.0000 1.0000 0.8600
Specificity 1.0000 0.9300 1.0000
Pos Pred Value 1.0000 0.8772 1.0000
Neg Pred Value 1.0000 1.0000 0.9346
Prevalence 0.3333 0.3333 0.3333
Detection Rate 0.3333 0.3333 0.2867
Detection Prevalence 0.3333 0.3800 0.2867
Balanced Accuracy 1.0000 0.9650 0.9300
From the model above, the accuracy decreased to 95.33% compared to the previous model, with the misclassification rate increasing to approximately 4.67%. The model classifies 7 observations as belonging to the second category (versicolor) when in reality they belong to the third category (virginica).
model3 <- svm(Species~.,
kernel = "sigmoid",
data = data)
summary(model3)
Call:
svm(formula = Species ~ ., data = data, kernel = "sigmoid")
Parameters:
SVM-Type: C-classification
SVM-Kernel: sigmoid
cost: 1
coef.0: 0
Number of Support Vectors: 54
( 6 26 22 )
Number of Classes: 3
Levels:
setosa versicolor virginica
The results above represent the training of a Support Vector Machine (SVM) classifier using the sigmoid kernel for a multi-class classification problem. The formula “Species ~ .” indicates that the goal is to predict the “Species” variable using all the other variables in the dataset. The SVM type is C-classification, the standard soft-margin formulation for classification tasks (multi-class problems are handled internally with pairwise classifiers). The “cost” parameter is set to 1, controlling the trade-off between maximizing the margin and minimizing classification errors. The “coef.0” parameter, a constant term in the sigmoid kernel, is set to 0.
The model has identified 54 support vectors. Support vectors are the data points closest to the decision boundary, and they play a crucial role in defining the decision boundary of the SVM. In this case, there are 6 support vectors for the “setosa” class, 26 support vectors for the “versicolor” class, and 22 support vectors for the “virginica” class.
There are three classes in this classification problem: “setosa,” “versicolor,” and “virginica.” The model has learned to distinguish between these three classes based on the provided features. The sigmoid kernel is a non-linear kernel that can capture complex relationships between features and class labels. Overall, this SVM model with the sigmoid kernel has been trained to classify data into one of these three flower species based on the given dataset.
plot(model3, data = data,
Petal.Width~Petal.Length,
slice = list(Sepal.Width = 3, Sepal.Length = 4))
pred3 <- predict(model3, data)
confusionMatrix(pred3, data$Species)
Confusion Matrix and Statistics
Reference
Prediction setosa versicolor virginica
setosa 49 0 0
versicolor 1 41 7
virginica 0 9 43
Overall Statistics
Accuracy : 0.8867
95% CI : (0.8248, 0.9326)
No Information Rate : 0.3333
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.83
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: setosa Class: versicolor Class: virginica
Sensitivity 0.9800 0.8200 0.8600
Specificity 1.0000 0.9200 0.9100
Pos Pred Value 1.0000 0.8367 0.8269
Neg Pred Value 0.9901 0.9109 0.9286
Prevalence 0.3333 0.3333 0.3333
Detection Rate 0.3267 0.2733 0.2867
Detection Prevalence 0.3267 0.3267 0.3467
Balanced Accuracy 0.9900 0.8700 0.8850
The confusion matrix and associated statistics provide a comprehensive evaluation of the performance of a classification model, presumably for a multiclass problem (with three classes: setosa, versicolor, and virginica). Here’s what each part of the output means:
The confusion matrix displays the counts of correct and incorrect predictions for each class; here the three classes are setosa, versicolor, and virginica. For example, the first row shows that 49 instances were correctly predicted as setosa and that no instances of the other classes were predicted as setosa. The second row shows that 41 instances of versicolor were correctly predicted, while 1 setosa instance and 7 virginica instances were incorrectly predicted as versicolor.
Overall Statistics:
Accuracy is the overall classification accuracy of the model, which is approximately 88.67%. It’s the ratio of correctly predicted instances to the total instances. The 95% Confidence Interval (CI) provides a range for the accuracy estimate. The No Information Rate (NIR) represents the accuracy that would be achieved by simply predicting the majority class all the time (0.3333 in this case). The Kappa statistic measures the agreement between actual and predicted classifications, with a value of 0.83 indicating substantial agreement. This model has a lower accuracy as compared to the previous one, with a higher mis-classification error of approximately 11.33%.
Sensitivity (also known as True Positive Rate or Recall) measures the proportion of true positives for each class. It tells you how well the model is at correctly identifying instances of each class. Specificity measures the proportion of true negatives for each class. It tells you how well the model is at correctly identifying instances not belonging to each class. Pos Pred Value (Positive Predictive Value or Precision) is the proportion of true positives among all predicted positives for each class. Neg Pred Value (Negative Predictive Value) is the proportion of true negatives among all predicted negatives for each class. Prevalence is the proportion of each class in the dataset. Detection Rate is the rate at which the model correctly detects each class. Detection Prevalence is the rate at which the model predicts each class. Balanced Accuracy is a measure that averages sensitivity and specificity and is particularly useful when dealing with imbalanced datasets.
In summary, this analysis provides a detailed breakdown of the model’s performance for each class, including its ability to correctly classify instances, along with overall accuracy and additional statistics. It’s essential to consider these metrics in the context of your specific problem and objectives to determine the model’s effectiveness.
From all the support vector machine models above, the sigmoid kernel function performed the worst and the radial kernel function performed the best. Let us see if we can tune our model to get a better classification.
This kind of optimization helps us select the best model.
set.seed(222) ### To attain reproducibility
### Tuning
t.model<-tune(svm, Species~.,
data = data,
ranges = list(epsilon = seq(0,1,0.1), cost=2^(2:9)))
The code above uses the tune function to tune hyperparameters for a Support Vector Machine (SVM) classifier, with the “Species” variable as the target and all other features in the “data” dataset as predictors. Specifically, it tunes two hyperparameters: “epsilon” and “cost.”
tune: This function is typically used for hyperparameter tuning in R. It takes various arguments to specify the model to be tuned, the data, and the hyperparameter search space.
svm: The svm function is the model that you are trying to tune. It’s an SVM classifier for classification tasks.
Species~.: This formula specifies that you want to predict the “Species” variable based on all other variables in the dataset (denoted by the .).
data = data: This argument specifies the dataset from which the model should be trained and validated.
ranges: This argument specifies the hyperparameter search space. It appears to be defining a range for two hyperparameters:
epsilon: A range from 0 to 1 with a step of 0.1. cost: A range of values obtained by raising 2 to the power of integers from 2 to 9. The tune function will perform a grid search over the specified hyperparameter ranges, training and evaluating the SVM model with different combinations of “epsilon” and “cost.” It will likely use some form of cross-validation to assess the model’s performance for each combination of hyperparameters.
After running this code, you should get the best combination of hyperparameters that optimizes the model’s performance on the dataset, as well as the corresponding model performance metrics for each combination. This is a common approach to find the best hyperparameters for machine learning models.
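One aside before looking at the plot: in e1071, epsilon belongs to the epsilon-insensitive loss used for support vector regression, so for a C-classification task the grid search above is effectively only over cost. A hedged alternative sketch that tunes gamma (the radial kernel width) together with cost could look like this.
set.seed(222)
t.alt <- tune(svm, Species~., data = data,
              ranges = list(gamma = 10^(-2:1), cost = 2^(2:7)))
t.alt$best.parameters   # best gamma/cost combination found by cross-validation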
plot(t.model)
The graph shows the performance of the SVM across the two parameters, “cost” and “epsilon”. Darker regions imply better results. From the graph, if we reduce the cost value we can get equally good results instead of going all the way to 2^9.
t.model1<-tune(svm, Species~.,
data = data,
ranges = list(epsilon = seq(0,1,0.1), cost=2^(2:7)))
plot(t.model1)
The model seems to perform somewhat better. Let us now get the summary of this model.
summary(t.model1)
Parameter tuning of 'svm':
- sampling method: 10-fold cross validation
- best parameters:
epsilon cost
0 4
- best performance: 0.04666667
- Detailed performance results:
epsilon cost error dispersion
1 0.0 4 0.04666667 0.04499657
2 0.1 4 0.04666667 0.04499657
3 0.2 4 0.04666667 0.04499657
4 0.3 4 0.04666667 0.04499657
5 0.4 4 0.04666667 0.04499657
6 0.5 4 0.04666667 0.04499657
7 0.6 4 0.04666667 0.04499657
8 0.7 4 0.04666667 0.04499657
9 0.8 4 0.04666667 0.04499657
10 0.9 4 0.04666667 0.04499657
11 1.0 4 0.04666667 0.04499657
12 0.0 8 0.04666667 0.04499657
13 0.1 8 0.04666667 0.04499657
14 0.2 8 0.04666667 0.04499657
15 0.3 8 0.04666667 0.04499657
16 0.4 8 0.04666667 0.04499657
17 0.5 8 0.04666667 0.04499657
18 0.6 8 0.04666667 0.04499657
19 0.7 8 0.04666667 0.04499657
20 0.8 8 0.04666667 0.04499657
21 0.9 8 0.04666667 0.04499657
22 1.0 8 0.04666667 0.04499657
23 0.0 16 0.05333333 0.04216370
24 0.1 16 0.05333333 0.04216370
25 0.2 16 0.05333333 0.04216370
26 0.3 16 0.05333333 0.04216370
27 0.4 16 0.05333333 0.04216370
28 0.5 16 0.05333333 0.04216370
29 0.6 16 0.05333333 0.04216370
30 0.7 16 0.05333333 0.04216370
31 0.8 16 0.05333333 0.04216370
32 0.9 16 0.05333333 0.04216370
33 1.0 16 0.05333333 0.04216370
34 0.0 32 0.04666667 0.03220306
35 0.1 32 0.04666667 0.03220306
36 0.2 32 0.04666667 0.03220306
37 0.3 32 0.04666667 0.03220306
38 0.4 32 0.04666667 0.03220306
39 0.5 32 0.04666667 0.03220306
40 0.6 32 0.04666667 0.03220306
41 0.7 32 0.04666667 0.03220306
42 0.8 32 0.04666667 0.03220306
43 0.9 32 0.04666667 0.03220306
44 1.0 32 0.04666667 0.03220306
45 0.0 64 0.05333333 0.04216370
46 0.1 64 0.05333333 0.04216370
47 0.2 64 0.05333333 0.04216370
48 0.3 64 0.05333333 0.04216370
49 0.4 64 0.05333333 0.04216370
50 0.5 64 0.05333333 0.04216370
51 0.6 64 0.05333333 0.04216370
52 0.7 64 0.05333333 0.04216370
53 0.8 64 0.05333333 0.04216370
54 0.9 64 0.05333333 0.04216370
55 1.0 64 0.05333333 0.04216370
56 0.0 128 0.07333333 0.04919099
57 0.1 128 0.07333333 0.04919099
58 0.2 128 0.07333333 0.04919099
59 0.3 128 0.07333333 0.04919099
60 0.4 128 0.07333333 0.04919099
61 0.5 128 0.07333333 0.04919099
62 0.6 128 0.07333333 0.04919099
63 0.7 128 0.07333333 0.04919099
64 0.8 128 0.07333333 0.04919099
65 0.9 128 0.07333333 0.04919099
66 1.0 128 0.07333333 0.04919099
The sampling method is 10-fold cross-validation, with the best parameters being epsilon = 0 and cost = 4. Note that the cross-validation error is identical across all epsilon values for a given cost; this is expected because epsilon belongs to the epsilon-insensitive loss used in support vector regression and does not affect C-classification. Using these results we can choose our best model.
mymodel <- t.model1$best.model
summary(mymodel)
Call:
best.tune(METHOD = svm, train.x = Species ~ ., data = data, ranges = list(epsilon = seq(0,
1, 0.1), cost = 2^(2:7)))
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 4
Number of Support Vectors: 37
( 6 17 14 )
Number of Classes: 3
Levels:
setosa versicolor virginica
The output above represents the best-tuned Support Vector Machine (SVM) model for a classification task with three classes (setosa, versicolor, virginica). Here’s an explanation of the key details:
Model Type: The model type is C-classification, indicating that it’s a classification model.
SVM Kernel: The best-tuned SVM model uses the radial kernel. The radial kernel, also known as the Radial Basis Function (RBF) kernel, is a popular choice for SVMs as it can capture complex, non-linear decision boundaries effectively.
Hyperparameters:
The best-tuned hyperparameters are: Cost: 4. Epsilon does not appear in this output because it does not influence C-classification, so only the cost mattered during tuning.
Number of Support Vectors: The model has 37 support vectors in total: 6 belong to the setosa class, 17 to the versicolor class, and 14 to the virginica class.
Number of Classes: There are three classes in this classification problem: setosa, versicolor, and virginica. Levels:
This section lists the levels or class labels for the target variable, which are setosa, versicolor, and virginica. The best-tuned SVM model is selected based on the hyperparameters that maximize its performance on the training dataset, typically through techniques like cross-validation. In this case, the radial kernel and a cost value of 4 have been identified as the optimal hyperparameters. These hyperparameters are chosen to strike a balance between achieving a good fit to the training data and preventing overfitting. The number of support vectors indicates that the model is relying on a relatively small subset of the training data to define the decision boundary. This can be beneficial for generalization and model simplicity.
Overall, this best-tuned SVM model is designed to classify new data points into one of the three classes (setosa, versicolor, or virginica) using the radial kernel and the specified cost parameter, which have been determined as the optimal choices through the tuning process.
plot(mymodel, data = data,
Petal.Width~Petal.Length,
slice = list(Sepal.Width = 3, Sepal.Length = 4))
pred4 <- predict(mymodel, data)
confusionMatrix(pred4, data$Species)
Confusion Matrix and Statistics
Reference
Prediction setosa versicolor virginica
setosa 50 0 0
versicolor 0 48 0
virginica 0 2 50
Overall Statistics
Accuracy : 0.9867
95% CI : (0.9527, 0.9984)
No Information Rate : 0.3333
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.98
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: setosa Class: versicolor Class: virginica
Sensitivity 1.0000 0.9600 1.0000
Specificity 1.0000 1.0000 0.9800
Pos Pred Value 1.0000 1.0000 0.9615
Neg Pred Value 1.0000 0.9804 1.0000
Prevalence 0.3333 0.3333 0.3333
Detection Rate 0.3333 0.3200 0.3333
Detection Prevalence 0.3333 0.3200 0.3467
Balanced Accuracy 1.0000 0.9800 0.9900
From the model above, the accuracy (98.67%) improved significantly compared to all previous models, with a very low misclassification error of about 1.33%.
Principal Component Analysis (PCA) is a widely used dimensionality reduction technique in data analysis and machine learning. It aims to simplify complex datasets while preserving as much of the original information as possible. PCA works by transforming the original features into a new set of orthogonal (uncorrelated) features called principal components. These components are ordered in such a way that the first component explains the maximum variance in the data, the second component explains the second most, and so on.
The first step in PCA involves centering the data by subtracting the mean from each feature, ensuring that the data has a mean of zero. Then, PCA calculates the covariance matrix of the centered data, which captures the relationships between the features. The eigenvectors and eigenvalues of this covariance matrix are computed. The eigenvectors represent the direction of maximum variance in the data, and the eigenvalues indicate the amount of variance explained by each eigenvector. These eigenvectors are the principal components, and they form an orthogonal basis for the data.
In practice, PCA is used for various purposes, such as data compression, visualization, and noise reduction. By retaining only a subset of the top principal components that explain most of the variance, you can reduce the dimensionality of the data while maintaining its essential structure. This can be particularly useful for visualizing high-dimensional data or for speeding up machine learning algorithms that suffer from the curse of dimensionality. However, it’s essential to keep in mind that PCA assumes linear relationships between variables, and it may not work well for data with nonlinear dependencies.
In summary, PCA is a powerful technique for dimensionality reduction that transforms high-dimensional data into a lower-dimensional representation while preserving as much variance as possible. It relies on the computation of eigenvectors and eigenvalues of the covariance matrix and can be applied to a wide range of data analysis tasks, making it a fundamental tool in the toolkit of data scientists and machine learning practitioners.
The Iris dataset is a well-known dataset in the field of machine learning and statistics. It was introduced by the British biologist and statistician Ronald A. Fisher in 1936. The dataset consists of 150 samples of iris flowers from three different species: Setosa, Versicolor, and Virginica. Each sample has four features: sepal length, sepal width, petal length, and petal width, all measured in centimeters. The primary purpose of this dataset is to serve as a benchmark for classification and pattern recognition algorithms. Researchers often use it for tasks like supervised learning, clustering, and data visualization. It’s a simple and widely used dataset for practicing and demonstrating various machine learning techniques.
data <- iris
head(data,5)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
str(data)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
We are going to use the iris data, which has 150 observations and 5 variables. Among the five variables, four are numeric and one is categorical. Check the summary of the data set using the command below.
summary(data)
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Species
setosa :50
versicolor:50
virginica :50
The variables have different scales, so it is a good idea to normalize the data. Normalization helps ensure that one variable does not overshadow the others. For the species variable, each species has fifty observations.
This is the first and most important step in machine learning: partitioning the data into training and testing sets.
set.seed(123)
ind <- sample(2, nrow(data),
replace = TRUE,
prob = c(0.8, 0.2))
training <- data[ind == 1,] ## this is the training data set
testing <- data[ind ==2,] ### this is the testing data set
We are going to make use of the “psych” package.
library(psych)
Now, we want to see the correlation among independent variables.
pairs.panels(training[,-5],
gap = 0,
bg = c("red","yellow","blue")[training$Species],
pch = 21)
This is the plot that we get: the lower triangular part gives the scatter plots, with the species color coded, and the upper triangular part gives the correlation coefficients. The correlation among the independent variables is very high, which gives rise to a multicollinearity problem. In such cases, the estimates will be unstable and the predictions inaccurate. One way of handling such cases is to apply principal component analysis.
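As a quick numeric complement to the plot (a sketch on the same training split), the pairwise correlations can also be printed directly.
round(cor(training[, -5]), 2)   # correlation matrix of the four numeric predictors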
Setting center and scale to TRUE ensures that, before the principal component analysis is performed, all four variables are normalized, that is, each has a mean of zero and a variance of one.
pc <- prcomp(training[,-5],
center = TRUE,
scale. = TRUE)
Check the attributes stored in the pc object.
attributes(pc)
$names
[1] "sdev" "rotation" "center" "scale" "x"
$class
[1] "prcomp"
We can now check what is contained in each of these attributes; for example, the center attribute:
pc$center
Sepal.Length Sepal.Width Petal.Length Petal.Width
5.826446 3.021488 3.759504 1.201653
mean(training$Sepal.Length)
[1] 5.826446
pc$scale
Sepal.Length Sepal.Width Petal.Length Petal.Width
0.8105316 0.4270454 1.7377945 0.7557539
sd(training$Sepal.Length)
[1] 0.8105316
print(pc)
Standard deviations (1, .., p=4):
[1] 1.7237041 0.9309494 0.3749824 0.1468522
Rotation (n x k) = (4 x 4):
PC1 PC2 PC3 PC4
Sepal.Length 0.5201832 -0.37206837 0.7292637 0.2432058
Sepal.Width -0.2937566 -0.92188195 -0.2184802 -0.1269139
Petal.Length 0.5747475 -0.03477237 -0.1603396 -0.8017153
Petal.Width 0.5592690 -0.10241915 -0.6282771 0.5310334
When we print pc, we get the standard deviations and the rotations, which are also called loadings. Because we have four variables, we will have four principal components: PC1, PC2, PC3, and PC4. Each principal component is a normalized linear combination of the original variables. The rotations, or loadings, are the coefficients of the continuous variables, and the values in the matrix above lie between -1 and 1. From the results, we can see that PC1 increases as sepal length, petal length, and petal width increase, while sepal width decreases as PC1 increases.
summary(pc)
Importance of components:
PC1 PC2 PC3 PC4
Standard deviation 1.7237 0.9309 0.37498 0.14685
Proportion of Variance 0.7428 0.2167 0.03515 0.00539
Cumulative Proportion 0.7428 0.9595 0.99461 1.00000
From the results above, principal component 1 (PC1) captures the majority of the variability (74.28%), while PC2 captures 21.67%, PC3 captures 3.515%, and PC4 captures 0.539%. Looking at the cumulative proportion, by PC2 more than 95% of the variability has already been explained (95.95%). Principal components 3 and 4 do not play an important role in capturing the variability in the independent variables, so we can say that PC1 and PC2 capture the majority of the variability.
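A scree plot and the cumulative proportion of variance make this easy to see; here is a minimal sketch using the pc object above.
screeplot(pc, type = "lines", main = "Scree plot of the principal components")
cumsum(pc$sdev^2) / sum(pc$sdev^2)   # cumulative proportion of variance explained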
The orthogonality of principal components is a fundamental concept in Principal Component Analysis (PCA) and plays a crucial role in the dimensionality reduction process. In PCA, the goal is to find a set of new orthogonal axes (principal components) in the data space that captures the maximum variance. Here’s a detailed explanation of why these principal components are orthogonal:
Centering the Data: PCA begins by centering the data, which means subtracting the mean of each feature from the data points. This step ensures that the data has a mean of zero for each feature, which simplifies subsequent calculations.
Covariance Matrix: The next step in PCA involves calculating the covariance matrix of the centered data. The covariance matrix is a square matrix where each element represents the covariance between two features. Importantly, the covariance matrix is symmetric, meaning that the covariance between feature A and feature B is the same as the covariance between feature B and feature A.
Eigenvalue Decomposition: PCA proceeds by performing eigenvalue decomposition on the covariance matrix. This involves finding the eigenvalues and corresponding eigenvectors of the matrix. The eigenvectors are the directions in the original feature space, and the eigenvalues represent the amount of variance explained by each eigenvector.
Orthogonality Emerges: The key observation is that the eigenvectors (principal components) obtained from the eigenvalue decomposition of the covariance matrix are orthogonal to each other. This orthogonality arises from the properties of the covariance matrix. Here’s why:
Symmetric matrices always have orthogonal eigenvectors. In other words, the eigenvectors extracted from a symmetric matrix are perpendicular to each other. Orthogonal Principal Components: As a result of this property, the principal components obtained from PCA are orthogonal axes in the original feature space. This orthogonality means that they are uncorrelated with each other, capturing different directions of maximum variance in the data. The first principal component represents the direction of maximum variance, the second principal component represents the second most significant variance, and so on.
Dimensionality Reduction: The orthogonality of principal components allows for efficient dimensionality reduction. By selecting a subset of the principal components and projecting the data onto these components, you can reduce the dimensionality of the data while preserving most of its variance. This is particularly valuable in cases where the original data has a high dimensionality, as it simplifies analysis and visualization.
In summary, the orthogonality of principal components in PCA emerges from the properties of the covariance matrix, which is symmetric. This orthogonality ensures that the principal components capture different directions of maximum variance in the data, making PCA a powerful technique for dimensionality reduction and data analysis.
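A minimal sketch to verify this numerically with the pc object computed above: the loading matrix should be (approximately) orthonormal and the component scores uncorrelated.
round(t(pc$rotation) %*% pc$rotation, 10)   # close to the identity matrix: loadings are orthonormal
round(cor(pc$x), 10)                        # off-diagonal entries are zero: scores are uncorrelated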
We are going to use the pairs.panels() function; the principal component scores are stored in pc$x.
pairs.panels(pc$x,
gap = 0,
bg = c("red","yellow","blue")[training$Species],
pch = 21)
The correlation coefficients are zero, implying that there is no correlation between the four principal components. Since the principal components are orthogonal to each other, the correlation between them is always zero, which helps us get rid of the multicollinearity problem.
library(devtools)
library(ggbiplot)
g <- ggbiplot(pc,
obs.scale = 1,
var.scale = 1,
groups = training$Species,
ellipse = TRUE,
circle = TRUE,
ellipse.prob = 0.68)
g <- g + scale_color_discrete(name = '')
g <- g + theme(legend.direction = 'horizontal',
legend.position = 'top')
print(g)
On the x-axis we have PC1, which explains 74.3% of the variability in the independent variables, while on the y-axis we have PC2, which explains approximately 21.7%. The species are color coded. Each ellipse is drawn to capture 68% of the data for its species, as specified in the code above. The angle between two arrows tells us about the correlation between the corresponding variables: for instance, Petal Length and Petal Width are strongly correlated, which you can confirm from the earlier correlation plot. Similarly, Sepal Length has a high correlation with Petal Length and Petal Width, whereas the correlation between Sepal Width and the other variables is very low. Additionally, the correlation between PC1 and Sepal Length, Petal Length, and Petal Width is positive, while Sepal Width and PC1 are negatively correlated. PC2 has a negative correlation with all the variables, and its most strongly negative correlation is with Sepal Width.
trg <- predict(pc, training)
head (trg,20)
PC1 PC2 PC3 PC4
1 -2.316985 -0.51656076 0.15198103 0.024535104
2 -2.101400 0.65462004 0.22783845 0.113119002
3 -2.400406 0.31668054 -0.04520373 0.039803496
6 -2.152380 -1.55087801 0.02331317 -0.002195439
7 -2.495085 -0.08471701 -0.32985812 -0.025509039
9 -2.353502 1.10001571 -0.17086821 -0.007190525
10 -2.211116 0.45029655 0.25058344 -0.032999514
12 -2.374584 -0.16697547 -0.08523184 -0.128031117
13 -2.239579 0.71407620 0.22099743 0.012847900
14 -2.659690 0.94960025 -0.20119026 0.001221504
15 -2.277826 -1.91326086 0.54444439 0.178247932
17 -2.284673 -1.54287422 0.06021963 0.182340825
18 -2.242983 -0.53011268 0.06884854 0.094800487
19 -1.965059 -1.45916436 0.42752708 0.047275401
22 -2.273484 -0.97741447 -0.12583235 0.059493671
23 -2.838956 -0.49491022 -0.31214094 0.029323708
25 -2.275364 -0.17297831 -0.11291169 -0.266433315
26 -1.971075 0.60471390 0.29935873 0.050856588
27 -2.098225 -0.28588779 -0.07154981 0.072511087
28 -2.219733 -0.56446595 0.23272793 0.008406757
trg <- data.frame(trg, training[5])
head(trg,5)
PC1 PC2 PC3 PC4 Species
1 -2.316985 -0.51656076 0.15198103 0.024535104 setosa
2 -2.101400 0.65462004 0.22783845 0.113119002 setosa
3 -2.400406 0.31668054 -0.04520373 0.039803496 setosa
6 -2.152380 -1.55087801 0.02331317 -0.002195439 setosa
7 -2.495085 -0.08471701 -0.32985812 -0.025509039 setosa
tst <- predict(pc, testing)
tst <- data.frame(tst, testing[5])
head(tst,5)
PC1 PC2 PC3 PC4 Species
4 -2.329649 0.5744574 -0.10246958 -0.05275129 setosa
5 -2.449951 -0.6865310 0.01084664 -0.03518968 setosa
8 -2.279301 -0.2567830 0.10394179 -0.02188561 setosa
11 -2.228953 -1.0880234 0.31035317 0.00898006 setosa
16 -2.369933 -2.7639612 0.05588250 0.03149451 setosa
Our data is now ready for further analysis. In this case, we are going to fit a multinomial logistic regression model on the principal components. We will use only PC1 and PC2, because together they capture more than 95% of the variability.
library(nnet)
trg$Species <- relevel(trg$Species, ref = "setosa")
mymodel <- multinom(Species~ PC1 + PC2, data = trg)
# weights: 12 (6 variable)
initial value 132.932087
iter 10 value 19.674294
iter 20 value 17.413814
iter 30 value 17.299839
iter 40 value 17.297242
iter 50 value 17.296292
final value 17.294539
converged
summary(mymodel)
Call:
multinom(formula = Species ~ PC1 + PC2, data = trg)
Coefficients:
(Intercept) PC1 PC2
versicolor 10.584590 15.26832 4.019053
virginica 3.007363 21.95663 4.228617
Std. Errors:
(Intercept) PC1 PC2
versicolor 273.3561 177.9735 223.6350
virginica 273.3636 177.9818 223.6359
Residual Deviance: 34.58908
AIC: 46.58908
p <- predict(mymodel, trg)
confusionMatrix(p, trg$Species)
Confusion Matrix and Statistics
Reference
Prediction setosa versicolor virginica
setosa 39 0 0
versicolor 0 38 4
virginica 0 5 35
Overall Statistics
Accuracy : 0.9256
95% CI : (0.8635, 0.9654)
No Information Rate : 0.3554
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.8884
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: setosa Class: versicolor Class: virginica
Sensitivity 1.0000 0.8837 0.8974
Specificity 1.0000 0.9487 0.9390
Pos Pred Value 1.0000 0.9048 0.8750
Neg Pred Value 1.0000 0.9367 0.9506
Prevalence 0.3223 0.3554 0.3223
Detection Rate 0.3223 0.3140 0.2893
Detection Prevalence 0.3223 0.3471 0.3306
Balanced Accuracy 1.0000 0.9162 0.9182
The provided information represents the performance evaluation of a classification model through a confusion matrix and various statistical metrics. The confusion matrix reveals that the model has correctly identified 39 samples of the “setosa” class, 38 samples of the “versicolor” class, and 35 samples of the “virginica” class, with very few misclassifications. Overall, the model demonstrates a high accuracy of 92.56%, significantly outperforming the No Information Rate (NIR) of 35.54%. The Kappa statistic also indicates substantial agreement between predictions and actual values. In a class-specific analysis, the model shows good sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) across all classes, reflecting its ability to correctly classify samples from each class. Additionally, the balanced accuracy metric indicates a well-balanced performance across classes. Overall, this model appears to perform impressively well in classifying the three classes, with a particular strength in distinguishing the “setosa” class and reasonable performance in the “versicolor” and “virginica” classes.
p1 <- predict(mymodel, tst)
confusionMatrix(p1, tst$Species)
Confusion Matrix and Statistics
Reference
Prediction setosa versicolor virginica
setosa 11 0 0
versicolor 0 7 2
virginica 0 0 9
Overall Statistics
Accuracy : 0.931
95% CI : (0.7723, 0.9915)
No Information Rate : 0.3793
P-Value [Acc > NIR] : 7.016e-10
Kappa : 0.8961
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: setosa Class: versicolor Class: virginica
Sensitivity 1.0000 1.0000 0.8182
Specificity 1.0000 0.9091 1.0000
Pos Pred Value 1.0000 0.7778 1.0000
Neg Pred Value 1.0000 1.0000 0.9000
Prevalence 0.3793 0.2414 0.3793
Detection Rate 0.3793 0.2414 0.3103
Detection Prevalence 0.3793 0.3103 0.3103
Balanced Accuracy 1.0000 0.9545 0.9091
K-means clustering is a widely used unsupervised machine learning algorithm that is employed for data segmentation and pattern recognition. It’s a partitioning algorithm that categorizes data points into K distinct and non-overlapping clusters, where K is a user-defined parameter. The algorithm’s objective is to minimize the sum of squared distances (variance) within each cluster, making it a “centroid-based” clustering method.
The K-means algorithm can be summarized in four key steps:
Initialization: It begins with the selection of K initial cluster centroids. These centroids can be randomly chosen from the dataset or through other initialization methods like K-means++ to enhance the convergence of the algorithm. The centroids represent the initial cluster centers.
Assignment: In this step, each data point is assigned to the cluster whose centroid is closest to it. The distance metric, often Euclidean distance, is used to determine the proximity of a point to each centroid. Consequently, each data point becomes a member of the nearest cluster.
Update Centroids: After assigning data points to clusters, the algorithm recalculates the centroids of each cluster. These new centroids are computed by taking the mean (average) of all data points in each cluster. The centroids are updated iteratively until they converge or until a predefined convergence criterion is met.
Iteration: Steps 2 and 3 are repeated iteratively until either the centroids do not change significantly between iterations or a maximum number of iterations is reached. The algorithm converges when the centroids stabilize, indicating that the clusters have been formed.
K-means has several applications, including image compression, customer segmentation, anomaly detection, and more. However, it has some limitations. It assumes that clusters are spherical and equally sized, which may not always hold in real-world data. It is also sensitive to the initial placement of centroids and may converge to local minima. Addressing these limitations may involve using alternative clustering algorithms, such as hierarchical clustering or density-based clustering, depending on the nature of the data and the goals of the analysis. Proper evaluation and validation of clustering results are crucial to ensuring the effectiveness of K-means in a given context.
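A minimal k-means sketch on the iris measurements, assuming K = 3 and scaled features; cluster labels are arbitrary, so the table only shows how the clusters line up with the known species.
set.seed(123)
km <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 25)
table(cluster = km$cluster, species = iris$Species)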
Naive Bayes is a simple yet effective classification algorithm widely used in machine learning and natural language processing. It’s based on Bayes’ theorem and makes a “naive” assumption that all features are conditionally independent given the class label, which means that the presence or absence of one feature does not affect the presence or absence of another feature. This simplifying assumption allows for efficient and straightforward probabilistic classification.
The core idea behind Naive Bayes is to compute the probability of a given instance belonging to a particular class based on the joint probabilities of the individual features and the class label. This is done using Bayes’ theorem, which calculates the posterior probability of the class given the data. Mathematically, it can be represented as:
\[ P(\text{class} \mid \text{features}) = \frac{P(\text{class}) \, P(\text{features} \mid \text{class})}{P(\text{features})} \]
where:
\[ P(\text{class} \mid \text{features}) \] is the posterior probability of the class given the features;
\[ P(\text{class}) \] is the prior probability of the class;
\[ P(\text{features} \mid \text{class}) \] is the likelihood of the features given the class; and
\[ P(\text{features}) \] is the marginal probability of the features.
In practice, the joint likelihood P(features | class) can be challenging to estimate, so Naive Bayes makes the simplifying assumption that all features are conditionally independent given the class, which allows us to factorize it as the product of individual probabilities:
\[ P(\text{features} \mid \text{class}) = P(x_1 \mid \text{class}) \, P(x_2 \mid \text{class}) \cdots P(x_n \mid \text{class}) \]
This simplification significantly reduces the computational complexity and is the reason why it’s called “naive.” Naive Bayes is commonly used in text classification tasks like spam detection and sentiment analysis. It’s particularly efficient with high-dimensional data, where other algorithms may suffer from the curse of dimensionality. Despite its simplicity and the independence assumption, Naive Bayes often performs surprisingly well, especially when the features are conditionally independent or when the data is well-suited to the independence assumption. However, it may not work well when there are strong dependencies between features, as it fails to capture these relationships.
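A hedged sketch of Naive Bayes in R, using the e1071 package already loaded above and the iris data again; with numeric features this fits a Gaussian Naive Bayes.
library(e1071)
nb <- naiveBayes(Species ~ ., data = iris)
nb_pred <- predict(nb, iris[, -5])
table(predicted = nb_pred, actual = iris$Species)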
Extreme Gradient Boosting (XGBoost) is a powerful and widely-used ensemble machine learning algorithm known for its exceptional performance in various predictive modeling tasks, particularly in structured/tabular data problems. XGBoost belongs to the gradient boosting family of algorithms, which iteratively improves the predictive capability of a weak learner, typically decision trees, by combining them into a strong predictive model. What sets XGBoost apart is its efficiency, scalability, and regularized learning approach.
XGBoost employs a gradient boosting framework, where each new decision tree is trained to correct the errors made by the ensemble of previous trees. It optimizes a loss function, typically a measure of prediction error, by fitting each tree to the negative gradient of the loss. This way, it focuses on the most challenging examples, continuously refining its predictions. XGBoost incorporates several techniques to improve model performance, such as regularization, which helps prevent overfitting, and a second-order approximation of the loss function for more accurate and faster convergence. Moreover, it allows users to specify custom evaluation metrics and handles missing data gracefully. XGBoost’s parallel processing capabilities and the use of distributed computing frameworks make it highly scalable and suitable for large datasets. Its success has led to its integration into various machine learning libraries and platforms, making it accessible to a wide range of users.
In practice, XGBoost has demonstrated its effectiveness across a multitude of applications, including Kaggle competitions and real-world problems in areas like finance, healthcare, and recommendation systems. Its versatility, speed, and ability to handle structured data make it a preferred choice for many data scientists and machine learning practitioners when seeking state-of-the-art performance in predictive modeling tasks. As a result, XGBoost has become an essential tool in the machine learning toolbox and continues to drive advances in the field.
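A minimal, hedged sketch of fitting a multiclass XGBoost model in R with the xgboost package (assuming it is installed and using its classic xgboost() interface); class labels must be zero-based integers for the multi:softmax objective.
library(xgboost)
X <- as.matrix(iris[, 1:4])
y <- as.integer(iris$Species) - 1            # classes encoded as 0, 1, 2
bst <- xgboost(data = X, label = y, nrounds = 50,
               objective = "multi:softmax", num_class = 3,
               max_depth = 3, eta = 0.3, verbose = 0)
table(predicted = predict(bst, X), actual = y)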