\(~\)
\(~\)
Below are the libraries used to complete this assignment:
library(tidyverse)
library(skimr)
library(rpart)
library(rpart.plot)
library(knitr)
library(tidyr)
library(ggplot2)
library(gridExtra)
library(stringr)
library(tidymodels)
library(corrplot)
library(randomForest)
library(caret)
library("e1071")
\(~\)
Both articles seek to predict whether COVID-19 is present in a patient by using machine learning algorithms. The Armir Ahmad paper relies on decision trees to predict the presence of COVID-19, while the Soham Guhathakurata paper relies on support vector machines (SVMs) to predict the presence of the disease. The first article investigates the application of decision trees for parameter selection, which is robust in addressing the imbalance between positive and negative COVID-19 cases. The second article, from Guhathakurata, predicts whether a patient is infected with COVID-19 using an SVM, which classifies each patient’s condition into no infection, mild infection, and serious infection categories.
Also, Guhathakurata works with a linear kernel; their 87% accuracy figure could possibly be improved by using a different SVM kernel function. In addition, accuracy is not a great evaluation metric for this classification task, since COVID data is very likely to be significantly imbalanced (most test cases will be negative, and a naive classifier could simply predict the baseline rate of positive cases). Both studies had similar results in terms of accuracy, recall, and F1 scores, with Guhathakurata improving precision for their SVM classifier.
I worked at a law enforcement agency, where I helped analyze arrest and criminal data to monitor trends and make predictions.
With the increase in crime, law enforcement agencies continue to demand advanced systems and new methods to improve crime analytics and better protect their communities. Data mining is an approach to discovering hidden relationships in data using artificial intelligence methods, of which the decision tree (J48) is one. As a machine learning method, a decision tree helps uncover relationships and patterns that exist in criminal data. This study considered the development of a crime prediction prototype model using the decision tree (J48) algorithm, because it has been considered the most efficient machine learning algorithm for predicting crime data, as described in the related literature. From the experimental results, the J48 algorithm predicted the unknown category of crime data with an accuracy of 94.25287%, which is fair enough for the system to be relied on for predicting future crimes.
These two articles present a support vector machine (SVM)-based approach to predicting the location of crime hot spots as an alternative to existing modeling approaches. Support vector machines form a newer generation of machine learning techniques used to find optimal separability between classes within datasets. The authors also compare the SVM with a neural-network-based approach and a spatial auto-regression-based approach. The first article uses an experiment on two different spatial datasets to demonstrate that the former approach performs slightly better and the latter gives reasonable results. Crime mapping is an important area of crime analysis research, as it allows visualizing, analyzing, and tracking high-crime areas and patterns. Furthermore, the study provides a general framework to customize the spatial data classification task for other spatial domains with datasets similar to the analyzed crime datasets.
\(~\)
The data used in Homework 2 was downloaded from Kaggle.com and loaded into my GitHub. The wine+quality data contains two subsets, white and red wine; I decided to use the red wine data.
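Reading the file in might look like the sketch below. The raw GitHub URL is a placeholder (the actual repository path isn't shown), and the wine quality CSV uses a semicolon separator.

```r
# Placeholder URL: substitute the actual raw GitHub path to the CSV
url <- "https://raw.githubusercontent.com/<user>/<repo>/main/winequality-red.csv"

# The wine quality file is semicolon-delimited
wine_data <- read.csv(url, sep = ";")
head(wine_data)
```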
| fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | free.sulfur.dioxide | total.sulfur.dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25 | 67 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 |
| 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15 | 54 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 |
| 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17 | 60 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 |
| 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 7.4 | 0.66 | 0.00 | 1.8 | 0.075 | 13 | 40 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
\(~\)
Based on the description from Kaggle, the two datasets are related to red and white variants of the Portuguese “Vinho Verde” wine. For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure if all input variables are relevant. So it could be interesting to test feature selection methods.
Input variables (based on physicochemical tests):
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
\(~\)
Using the skimr library we can obtain quick summary statistics for the dataset. It has 1599 rows and 12 variables, all numeric, with no missing values.
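A minimal sketch of the call, assuming the data frame is named wine_data:

```r
# One-line overview of every column: type, missingness, quantiles, histogram
skim(wine_data)
```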
| Name | wine_data |
|---|---|
| Number of rows | 1599 |
| Number of columns | 12 |
| Column type frequency: numeric | 12 |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| fixed.acidity | 0 | 1 | 8.32 | 1.74 | 4.60 | 7.10 | 7.90 | 9.20 | 15.90 | ▂▇▂▁▁ |
| volatile.acidity | 0 | 1 | 0.53 | 0.18 | 0.12 | 0.39 | 0.52 | 0.64 | 1.58 | ▅▇▂▁▁ |
| citric.acid | 0 | 1 | 0.27 | 0.19 | 0.00 | 0.09 | 0.26 | 0.42 | 1.00 | ▇▆▅▁▁ |
| residual.sugar | 0 | 1 | 2.54 | 1.41 | 0.90 | 1.90 | 2.20 | 2.60 | 15.50 | ▇▁▁▁▁ |
| chlorides | 0 | 1 | 0.09 | 0.05 | 0.01 | 0.07 | 0.08 | 0.09 | 0.61 | ▇▁▁▁▁ |
| free.sulfur.dioxide | 0 | 1 | 15.87 | 10.46 | 1.00 | 7.00 | 14.00 | 21.00 | 72.00 | ▇▅▁▁▁ |
| total.sulfur.dioxide | 0 | 1 | 46.47 | 32.90 | 6.00 | 22.00 | 38.00 | 62.00 | 289.00 | ▇▂▁▁▁ |
| density | 0 | 1 | 1.00 | 0.00 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | ▁▃▇▂▁ |
| pH | 0 | 1 | 3.31 | 0.15 | 2.74 | 3.21 | 3.31 | 3.40 | 4.01 | ▁▅▇▂▁ |
| sulphates | 0 | 1 | 0.66 | 0.17 | 0.33 | 0.55 | 0.62 | 0.73 | 2.00 | ▇▅▁▁▁ |
| alcohol | 0 | 1 | 10.42 | 1.07 | 8.40 | 9.50 | 10.20 | 11.10 | 14.90 | ▇▇▃▁▁ |
| quality | 0 | 1 | 5.64 | 0.81 | 3.00 | 5.00 | 6.00 | 6.00 | 8.00 | ▁▇▇▂▁ |
\(~\)
\(~\)
- There is no correlation between a wine’s residual sugar and its quality rating.
- There is no visible relationship between chloride content, free sulfur dioxide, and wine quality.
- Wines containing higher levels of total sulfur dioxide are not consistently rated as low quality, so total sulfur dioxide is not a reliable indicator of wine quality.
- There is a slight negative relationship between a wine’s density and its quality rating: higher-density wines tend to have slightly lower quality ratings.
- There is little to no correlation between pH and wine quality.
- There is a slight positive relationship between alcohol content and wine quality: the higher the alcohol content, the higher the average quality rating.
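These observations come from scatter plots that are not reproduced here; a minimal sketch of how such a panel could be generated with ggplot2, assuming the original lowercase column names:

```r
# One scatter panel per predictor, with a linear trend line over wine quality
wine_data %>%
  pivot_longer(-quality, names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value, y = quality)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ variable, scales = "free_x")
```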
\(~\)
Now that I’ve visualized the data, I want to make one minor change to the columns: most of the column names contain a “.”, which I’m changing to an “_”. I’ll also convert the Quality column to a factor. Since there are no missing values, there isn’t much more to do to prepare the data.
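The exact renaming code isn't shown; here is one explicit way to produce the columns in the table below:

```r
# Rename "." columns to title-cased "_" names and convert Quality to a factor
wine_data <- wine_data %>%
  rename(
    Fixed_Acidity = fixed.acidity,
    Volatile_Acidity = volatile.acidity,
    Citric_Acid = citric.acid,
    Residual_Sugar = residual.sugar,
    Chlorides = chlorides,
    Free_Sulfur_Dioxide = free.sulfur.dioxide,
    Total_Sulfur_Dioxide = total.sulfur.dioxide,
    Density = density,
    Sulphates = sulphates,
    Alcohol = alcohol,
    Quality = quality
  ) %>%
  mutate(Quality = as.factor(Quality))
```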
| Fixed_Acidity | Volatile_Acidity | Citric_Acid | Residual_Sugar | Chlorides | Free_Sulfur_Dioxide | Total_Sulfur_Dioxide | Density | pH | Sulphates | Alcohol | Quality |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25 | 67 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 |
| 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15 | 54 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 |
| 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17 | 60 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 |
| 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 7.4 | 0.66 | 0.00 | 1.8 | 0.075 | 13 | 40 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
\(~\)
The correlation plot below measures the degree of linear relationship between variables in the dataset. The measure falls between -1 and +1, with +1 a strong positive correlation and -1 a strong negative correlation; the darker the dot, the stronger the correlation (whether positive or negative). From the results below, fixed acidity has strong positive correlations with citric acid and density, as does free sulfur dioxide with total sulfur dioxide. Strong negative correlations are seen only between fixed acidity and pH, citric acid and volatile acidity, citric acid and pH, and density and alcohol.
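A sketch of how the plot could be produced with corrplot, assuming Quality is converted back to numeric for the correlation computation:

```r
# Pearson correlations of all variables; darker dots = stronger correlation
wine_numeric <- wine_data %>%
  mutate(Quality = as.numeric(as.character(Quality)))
corrplot(cor(wine_numeric), method = "circle")
```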
\(~\)
Building on the previous homework, I am recreating the decision tree and random forest models. If you recall, I had some issues displaying the confusion matrix for both models; I have improved on this so I can report each model’s accuracy properly and compare it with the support vector machine (SVM).
The first decision tree models Quality against the whole data set. I started with the cross-validation setup, using a 75:25 train/validation split. Below is the decision tree created:
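The tree plot itself is not reproduced here; a minimal sketch of the split and fit, with an assumed seed (the original value isn't shown):

```r
set.seed(123)  # assumed seed for reproducibility
split_idx <- createDataPartition(wine_data$Quality, p = 0.75, list = FALSE)
train <- wine_data[split_idx, ]
valid <- wine_data[-split_idx, ]

# Fit and plot the first classification tree on all predictors
tree1 <- rpart(Quality ~ ., data = train, method = "class")
rpart.plot(tree1)
```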
Then we test the model using the validation dataset. The results are seen in the confusion matrix and statistics output:
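A sketch of that test step, assuming the tree1 object from above:

```r
# Predict classes on the held-out 25% and tabulate against the truth
tree1_pred <- predict(tree1, newdata = valid, type = "class")
confusionMatrix(tree1_pred, valid$Quality)
```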
## Confusion Matrix and Statistics
##
## Reference
## Prediction 3 4 5 6 7 8
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 2 6 132 61 4 0
## 6 0 7 38 87 34 3
## 7 0 0 0 11 11 1
## 8 0 0 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.5793
## 95% CI : (0.5291, 0.6284)
## No Information Rate : 0.4282
## P-Value [Acc > NIR] : 1.021e-09
##
## Kappa : 0.3004
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity 0.000000 0.00000 0.7765 0.5472 0.22449 0.00000
## Specificity 1.000000 1.00000 0.6784 0.6555 0.96552 1.00000
## Pos Pred Value NaN NaN 0.6439 0.5148 0.47826 NaN
## Neg Pred Value 0.994962 0.96725 0.8021 0.6842 0.89840 0.98992
## Prevalence 0.005038 0.03275 0.4282 0.4005 0.12343 0.01008
## Detection Rate 0.000000 0.00000 0.3325 0.2191 0.02771 0.00000
## Detection Prevalence 0.000000 0.00000 0.5164 0.4257 0.05793 0.00000
## Balanced Accuracy 0.500000 0.50000 0.7274 0.6013 0.59500 0.50000
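The importance table that follows can be obtained with caret's varImp; a one-line sketch, assuming the fitted tree1:

```r
# Overall importance of each predictor in the fitted tree
varImp(tree1)
```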
Let’s look at the contribution of each variable:
| Variable | Overall |
|---|---|
| Alcohol | 108.070244 |
| Chlorides | 17.151769 |
| Citric_Acid | 28.034564 |
| Density | 42.102816 |
| Fixed_Acidity | 36.316930 |
| Free_Sulfur_Dioxide | 6.357231 |
| pH | 6.169872 |
| Residual_Sugar | 2.400345 |
| Sulphates | 82.234256 |
| Total_Sulfur_Dioxide | 53.237707 |
| Volatile_Acidity | 77.456655 |
We then check the accuracy, which is 58% (the previous homework’s accuracy was 57.4%):
| Metric | Value |
|---|---|
| Accuracy | 0.5793451 |
\(~\)
We were also asked to switch variables and create a second decision tree. I looked at the relationship between Quality and Density, pH, and Alcohol, which previously yielded an accuracy of 57%; with the changes made here, the accuracy went down. Below is the output of this decision tree.
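A sketch of the reduced model, reusing the same split:

```r
# Second tree restricted to Density, pH, and Alcohol
tree2 <- rpart(Quality ~ Density + pH + Alcohol, data = train, method = "class")
rpart.plot(tree2)
```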
Same as before, we create the confusion matrix and statistics for the second decision tree:
## Confusion Matrix and Statistics
##
## Reference
## Prediction 3 4 5 6 7 8
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 8 132 67 5 0
## 6 2 5 38 92 44 4
## 7 0 0 0 0 0 0
## 8 0 0 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.5642
## 95% CI : (0.5139, 0.6136)
## No Information Rate : 0.4282
## P-Value [Acc > NIR] : 3.518e-08
##
## Kappa : 0.2547
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity 0.000000 0.00000 0.7765 0.5786 0.0000 0.00000
## Specificity 1.000000 1.00000 0.6476 0.6092 1.0000 1.00000
## Pos Pred Value NaN NaN 0.6226 0.4973 NaN NaN
## Neg Pred Value 0.994962 0.96725 0.7946 0.6840 0.8766 0.98992
## Prevalence 0.005038 0.03275 0.4282 0.4005 0.1234 0.01008
## Detection Rate 0.000000 0.00000 0.3325 0.2317 0.0000 0.00000
## Detection Prevalence 0.000000 0.00000 0.5340 0.4660 0.0000 0.00000
## Balanced Accuracy 0.500000 0.50000 0.7120 0.5939 0.5000 0.50000
Let’s look at the contribution of each variable for the second dataset:
| Variable | Overall |
|---|---|
| Alcohol | 69.422698 |
| Density | 25.855801 |
| pH | 2.874687 |
And now the accuracy of 56.4%, which is lower than that of the first decision tree:
| Metric | Value |
|---|---|
| Accuracy | 0.5642317 |
\(~\)
From the variable contributions in the first decision tree, I decided to create a third decision tree composed of Quality, Alcohol, Sulphates, Volatile_Acidity, and Total_Sulfur_Dioxide and view the changes in model accuracy. Same as before, I created a new dataset from the original, choosing only the variables above, and followed the same steps to create this final decision tree.
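A sketch of that subsetting and fit, assuming the same split indices as before:

```r
# Keep only the top variables identified by the first tree
wine3 <- wine_data %>%
  select(Quality, Alcohol, Sulphates, Volatile_Acidity, Total_Sulfur_Dioxide)
train3 <- wine3[split_idx, ]
valid3 <- wine3[-split_idx, ]

tree3 <- rpart(Quality ~ ., data = train3, method = "class")
tree3_pred <- predict(tree3, newdata = valid3, type = "class")
confusionMatrix(tree3_pred, valid3$Quality)
```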
The confusion matrix and statistics for the third decision tree:
## Confusion Matrix and Statistics
##
## Reference
## Prediction 3 4 5 6 7 8
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 2 8 146 78 8 0
## 6 0 5 22 68 23 3
## 7 0 0 2 13 18 1
## 8 0 0 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.5844
## 95% CI : (0.5342, 0.6333)
## No Information Rate : 0.4282
## P-Value [Acc > NIR] : 2.896e-10
##
## Kappa : 0.3145
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity 0.000000 0.00000 0.8588 0.4277 0.36735 0.00000
## Specificity 1.000000 1.00000 0.5771 0.7773 0.95402 1.00000
## Pos Pred Value NaN NaN 0.6033 0.5620 0.52941 NaN
## Neg Pred Value 0.994962 0.96725 0.8452 0.6703 0.91460 0.98992
## Prevalence 0.005038 0.03275 0.4282 0.4005 0.12343 0.01008
## Detection Rate 0.000000 0.00000 0.3678 0.1713 0.04534 0.00000
## Detection Prevalence 0.000000 0.00000 0.6096 0.3048 0.08564 0.00000
## Balanced Accuracy 0.500000 0.50000 0.7180 0.6025 0.66068 0.50000
Let’s look at the contribution of each variable for the third dataset:
| Variable | Overall |
|---|---|
| Alcohol | 110.52522 |
| Sulphates | 85.19619 |
| Total_Sulfur_Dioxide | 71.14540 |
| Volatile_Acidity | 83.16022 |
And now the accuracy of 58.4%, which is higher than both the first and second decision tree models:
| Metric | Value |
|---|---|
| Accuracy | 0.5843829 |
\(~\)
For a second recap: we now create a random forest model for the dataset. A Random Forest is an ensemble learning technique in machine learning that combines multiple decision trees to make accurate predictions. It works by creating a collection of decision trees, each trained on a bootstrapped dataset (randomly sampled with replacement) from the original data and considering only a subset of features at each split. The final prediction in a classification task is determined by a majority vote of the individual trees, while in a regression task, it’s an average of their predictions. Random Forests are valued for their high accuracy, resistance to overfitting, and the ability to assess feature importance.
For the random forest model, I first chose the variables of the first decision tree, as it had a higher accuracy than the second model. I create the random forest model using the training data and then apply it to the validation data. A new addition is that I will also create a second random forest model from the third decision tree’s variables and compare the two. Below are the results:
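The formula matches the Call echoed in the output below; the seed is assumed:

```r
set.seed(123)  # assumed seed
rf_model <- randomForest(Quality ~ ., data = train)
rf_model  # prints the OOB error and confusion matrix shown below

# Evaluate on the validation data
rf_pred <- predict(rf_model, newdata = valid)
confusionMatrix(rf_pred, valid$Quality)
```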
##
## Call:
## randomForest(formula = Quality ~ ., data = train)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 31.86%
## Confusion matrix:
## 3 4 5 6 7 8 class.error
## 3 0 0 7 1 0 0 1.0000000
## 4 0 0 27 13 0 0 1.0000000
## 5 1 1 411 93 5 0 0.1956947
## 6 0 1 116 334 27 1 0.3027140
## 7 0 0 9 66 74 1 0.5066667
## 8 0 0 0 8 6 0 1.0000000
## Confusion Matrix and Statistics
##
## Reference
## Prediction 3 4 5 6 7 8
## 3 0 1 0 0 0 0
## 4 0 0 0 0 0 0
## 5 2 7 148 44 1 0
## 6 0 4 22 104 18 2
## 7 0 1 0 11 30 1
## 8 0 0 0 0 0 1
##
## Overall Statistics
##
## Accuracy : 0.7128
## 95% CI : (0.6656, 0.7569)
## No Information Rate : 0.4282
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5349
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity 0.000000 0.00000 0.8706 0.6541 0.61224 0.250000
## Specificity 0.997468 1.00000 0.7621 0.8067 0.96264 1.000000
## Pos Pred Value 0.000000 NaN 0.7327 0.6933 0.69767 1.000000
## Neg Pred Value 0.994949 0.96725 0.8872 0.7773 0.94633 0.992424
## Prevalence 0.005038 0.03275 0.4282 0.4005 0.12343 0.010076
## Detection Rate 0.000000 0.00000 0.3728 0.2620 0.07557 0.002519
## Detection Prevalence 0.002519 0.00000 0.5088 0.3778 0.10831 0.002519
## Balanced Accuracy 0.498734 0.50000 0.8164 0.7304 0.78744 0.625000
From the random forest model we can create a variable importance plot, which shows how important each variable is in classifying the data. From the plot below we note that Alcohol, Total_Sulfur_Dioxide, and Sulphates are among the top variables that play a significant role in classifying the quality of the wine.
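A sketch of the plot call, using randomForest's varImpPlot on the fitted model:

```r
# Mean decrease in Gini impurity for each predictor
varImpPlot(rf_model)
```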
Numerically, we can see the same result below:
| Variable | Overall |
|---|---|
| Fixed_Acidity | 56.41020 |
| Volatile_Acidity | 76.32992 |
| Citric_Acid | 58.62591 |
| Residual_Sugar | 55.81955 |
| Chlorides | 66.23051 |
| Free_Sulfur_Dioxide | 51.89288 |
| Total_Sulfur_Dioxide | 84.75798 |
| Density | 72.14677 |
| pH | 58.72748 |
| Sulphates | 83.95732 |
| Alcohol | 108.65654 |
Lastly, I check the accuracy on the validation data, with the result of 71.3% seen below:
## Accuracy
## 0.7128463
\(~\)
Now we create the second random forest with the third dataset, using the variables Quality, Alcohol, Volatile_Acidity, Sulphates, and Total_Sulfur_Dioxide. The results are below:
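Again the formula matches the Call echoed below; the seed is assumed:

```r
set.seed(123)  # assumed seed
rf_model3 <- randomForest(
  Quality ~ Alcohol + Volatile_Acidity + Sulphates + Total_Sulfur_Dioxide,
  data = train3
)
rf_pred3 <- predict(rf_model3, newdata = valid3)
confusionMatrix(rf_pred3, valid3$Quality)
```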
##
## Call:
## randomForest(formula = Quality ~ Alcohol + Volatile_Acidity + Sulphates + Total_Sulfur_Dioxide, data = train3)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 34.03%
## Confusion matrix:
## 3 4 5 6 7 8 class.error
## 3 0 1 6 1 0 0 1.0000000
## 4 0 2 26 12 0 0 0.9500000
## 5 0 5 400 103 3 0 0.2172211
## 6 0 1 119 317 41 1 0.3382046
## 7 0 0 11 65 73 1 0.5133333
## 8 0 0 0 2 11 1 0.9285714
## Confusion Matrix and Statistics
##
## Reference
## Prediction 3 4 5 6 7 8
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 1 7 134 28 4 0
## 6 1 6 35 116 17 2
## 7 0 0 1 15 28 1
## 8 0 0 0 0 0 1
##
## Overall Statistics
##
## Accuracy : 0.7028
## 95% CI : (0.6552, 0.7473)
## No Information Rate : 0.4282
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5204
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity 0.000000 0.00000 0.7882 0.7296 0.57143 0.250000
## Specificity 1.000000 1.00000 0.8238 0.7437 0.95115 1.000000
## Pos Pred Value NaN NaN 0.7701 0.6554 0.62222 1.000000
## Neg Pred Value 0.994962 0.96725 0.8386 0.8045 0.94034 0.992424
## Prevalence 0.005038 0.03275 0.4282 0.4005 0.12343 0.010076
## Detection Rate 0.000000 0.00000 0.3375 0.2922 0.07053 0.002519
## Detection Prevalence 0.000000 0.00000 0.4383 0.4458 0.11335 0.002519
## Balanced Accuracy 0.500000 0.50000 0.8060 0.7366 0.76129 0.625000
From this random forest model we can again create a variable importance plot showing how important each variable is in classifying the data. From the plot below we note that Alcohol and Total_Sulfur_Dioxide are among the top variables that play a significant role in classifying the quality of the wine.
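The same plot call applies to the second model:

```r
# Importance plot for the reduced random forest
varImpPlot(rf_model3)
```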
Numerically, we can see the same result below:
| Variable | Overall |
|---|---|
| Alcohol | 201.6277 |
| Volatile_Acidity | 186.0988 |
| Sulphates | 182.8026 |
| Total_Sulfur_Dioxide | 200.6466 |
Lastly, we check the second model’s accuracy on the validation data, with the result of 70.3% seen below:
## Accuracy
## 0.7027708
\(~\)
A support vector machine (SVM) is a supervised machine learning algorithm used for classification and regression tasks. It is particularly effective for classification tasks, in which the goal is to divide data points into different classes based on their features. Due to their effectiveness in handling high-dimensional data and their ability to perform well with relatively small datasets, SVMs are used in many fields.
We were asked to create an SVM model with the same data and make the comparison. We follow a procedure similar to the decision tree and random forest models: set up the cross-validation split, create the predictions and confusion matrix, and lastly compute the accuracy. Results are below:
\(~\)
The confusion matrix for SVM:
# Fit the SVM (e1071's svm defaults to a radial kernel)
svm_model <- svm(Quality ~ ., data = svm_train)
# Predict on the validation set
svm_result <- predict(svm_model, newdata = svm_valid)
# Confusion matrix for the SVM predictions
confusionMatrix(svm_result, svm_valid$Quality)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 3 4 5 6 7 8
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 2 9 138 51 1 0
## 6 0 3 31 99 32 3
## 7 0 1 1 9 16 1
## 8 0 0 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.6373
## 95% CI : (0.5878, 0.6847)
## No Information Rate : 0.4282
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4005
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity 0.000000 0.00000 0.8118 0.6226 0.32653 0.00000
## Specificity 1.000000 1.00000 0.7225 0.7101 0.96552 1.00000
## Pos Pred Value NaN NaN 0.6866 0.5893 0.57143 NaN
## Neg Pred Value 0.994962 0.96725 0.8367 0.7380 0.91057 0.98992
## Prevalence 0.005038 0.03275 0.4282 0.4005 0.12343 0.01008
## Detection Rate 0.000000 0.00000 0.3476 0.2494 0.04030 0.00000
## Detection Prevalence 0.000000 0.00000 0.5063 0.4232 0.07053 0.00000
## Balanced Accuracy 0.500000 0.50000 0.7671 0.6664 0.64602 0.50000
A summary of the SVM’s predicted class counts:
## 3 4 5 6 7 8
## 0 0 201 168 28 0
The accuracy of this SVM, on the original data set, is 63.7%:
## Accuracy
## 0.6372796
\(~\)
I decided to fit a second SVM to check for any changes in accuracy; the results are below.
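The text doesn't record what was varied in the second SVM; one plausible change, sketched here purely as an assumption, is swapping the default radial kernel for a linear one:

```r
# Hypothetical second SVM: linear kernel instead of the default radial
svm_model2 <- svm(Quality ~ ., data = svm_train, kernel = "linear")
svm_result2 <- predict(svm_model2, newdata = svm_valid)
confusionMatrix(svm_result2, svm_valid$Quality)
```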
The confusion matrix for the second SVM:
## Confusion Matrix and Statistics
##
## Reference
## Prediction 3 4 5 6 7 8
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 2 7 121 48 1 0
## 6 0 6 49 108 37 1
## 7 0 0 0 3 11 3
## 8 0 0 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.6045
## 95% CI : (0.5545, 0.653)
## No Information Rate : 0.4282
## P-Value [Acc > NIR] : 1.246e-12
##
## Kappa : 0.3396
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity 0.000000 0.00000 0.7118 0.6792 0.22449 0.00000
## Specificity 1.000000 1.00000 0.7445 0.6092 0.98276 1.00000
## Pos Pred Value NaN NaN 0.6760 0.5373 0.64706 NaN
## Neg Pred Value 0.994962 0.96725 0.7752 0.7398 0.90000 0.98992
## Prevalence 0.005038 0.03275 0.4282 0.4005 0.12343 0.01008
## Detection Rate 0.000000 0.00000 0.3048 0.2720 0.02771 0.00000
## Detection Prevalence 0.000000 0.00000 0.4509 0.5063 0.04282 0.00000
## Balanced Accuracy 0.500000 0.50000 0.7281 0.6442 0.60362 0.50000
A summary of the second SVM’s predicted class counts:
## 3 4 5 6 7 8
## 0 0 179 201 17 0
The accuracy of the second SVM on the original data set is 60.5%, which is lower than the first SVM’s accuracy.
## Accuracy
## 0.604534
\(~\)
Lastly, let’s do a model comparison for Decision Tree, Random Forest and SVM:
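The ranking below can be assembled by hand from the three validation accuracies:

```r
# Collect the accuracies and sort from best to worst
model_comparison <- data.frame(
  Model = c("Decision Tree", "Random Forest", "SVM"),
  Accuracy = c(0.5843829, 0.7128463, 0.6372796)
)
model_comparison[order(-model_comparison$Accuracy), ]
```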
## Model Accuracy
## 2 Random Forest 0.7128463
## 3 SVM 0.6372796
## 1 Decision Tree 0.5843829
\(~\)
Decision Tree: The Decision Tree model using the rpart algorithm achieved an accuracy of 58.4%. The confusion matrix revealed a limited ability to predict wine quality, particularly for classes 3, 4, and 8, where the sensitivity was low.
Random Forest: The Random Forest model outperformed the Decision Tree with an accuracy of 71.3%. The confusion matrix showed improved predictions across all classes compared to the Decision Tree, with better specificity, though sensitivity remained low for classes 3 and 4.
Support Vector Machine (SVM): The SVM model achieved an accuracy of 63.7%. While it showed high specificity for most classes, it struggled with low sensitivity in classes 3, 4, and 8.
Overall, I’d recommend Random Forest as the algorithm of choice for this dataset (or similar ones) for more accurate results, since it outperformed the decision tree by about 13 percentage points and the SVM by about 8. Random Forest is a versatile algorithm that performs well in both classification and regression scenarios. Its ability to handle high-dimensional data, deal with non-linear relationships, and reduce overfitting makes it a popular choice across a wide range of machine learning applications. Keep in mind that the choice between using Random Forest for classification or regression often depends on the specific nature of the problem and the characteristics of the dataset being analyzed.
\(~\)
Emmanuel Ahishakiye, Danison Taremwa, Elisha Opiyo Omulo, and Ivan Niyonzima. Crime Prediction Using Decision Tree (J48) Classification Algorithm. https://www.researchgate.net/publication/316960839_Crime_Prediction_Using_Decision_Tree_J48_Classification_Algorithm
Keivan Kianmehr and Reda Alhajj (2008). Effectiveness of Support Vector Machine for Crime Hot-Spots Prediction. https://www.tandfonline.com/doi/full/10.1080/08839510802028405
K. Vinothkumar, Kumar S. Ranjith, Raj R. Vikram, N. Mekala, R. Reshma, and S.P. Sasirekha (2023). Crime Hotspot Identification using SVM in Machine Learning. https://ieeexplore.ieee.org/document/10104689