In this exercise, we will explore Supervised Learning. Supervised Learning algorithms use input data and response values to train the machine learning model to predict response values for future data. There are two categories of these algorithms: Classification and Regression.
For this assignment, we will be focusing on Classification, using the k-Nearest Neighbors (KNN) algorithm. KNN classifies a point based on the classes of its K nearest points. I chose a dataset of zoo animals that includes a variety of their characteristics (such as feathers, fins, hair, and legs) and their respective animal classification: Mammal, Bird, Reptile, Fish, Amphibian, Bug, or Invertebrate. I will be using KNN to predict animal classifications, and checking the results with a Cross Table to see how well the model performed.
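To make the idea of classifying by nearest neighbors concrete, here is a minimal base R sketch. The toy data and the helper name `knn_one` are hypothetical and are not part of the assignment; the sketch assumes Euclidean distance and a simple majority vote.

```r
# Hypothetical toy data: two binary traits for five labelled animals.
train <- data.frame(
  feathers = c(1, 1, 0, 0, 0),
  milk     = c(0, 0, 1, 1, 1)
)
labels <- c("Bird", "Bird", "Mammal", "Mammal", "Mammal")

# Classify one new point by majority vote among its k nearest neighbors.
knn_one <- function(train, labels, new_point, k = 3) {
  # Euclidean distance from the new point to every training row
  diffs <- sweep(as.matrix(train), 2, as.numeric(new_point))
  dists <- sqrt(rowSums(diffs^2))
  # Labels of the k closest rows
  nearest <- labels[order(dists)[seq_len(k)]]
  # Most frequent label among them (ties go to the first label alphabetically)
  names(which.max(table(nearest)))
}

# An animal without feathers that produces milk
knn_one(train, labels, new_point = c(feathers = 0, milk = 1), k = 3)  # "Mammal"
```

The knn() function from the class package used later follows the same idea, but it breaks ties at random rather than alphabetically.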
I loaded in the zoo data, and viewed it to confirm everything was properly retrieved.
zoo <- read.csv("C:/Users/linds/Documents/MSDS/MSDS 650 - Data Analytics/Week 7/zoo.csv")
View(zoo)
Below I ran a summary of the data, which shows all of the different animal characteristics. Every characteristic except legs is binary, showing either a 0 or 1: a 1 means the characteristic is present, and a 0 means it is not. The legs column is a count that ranges from 0 to 8. We can also see there are 101 animal names loaded into this dataset.
summary(zoo)
## animal_name hair feathers eggs
## Length:101 Min. :0.0000 Min. :0.000 Min. :0.0000
## Class :character 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000
## Mode :character Median :0.0000 Median :0.000 Median :1.0000
## Mean :0.4257 Mean :0.198 Mean :0.5842
## 3rd Qu.:1.0000 3rd Qu.:0.000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.000 Max. :1.0000
## milk airborne aquatic predator
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :1.0000
## Mean :0.4059 Mean :0.2376 Mean :0.3564 Mean :0.5545
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## toothed backbone breathes venomous
## Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.000 1st Qu.:1.0000 1st Qu.:1.0000 1st Qu.:0.00000
## Median :1.000 Median :1.0000 Median :1.0000 Median :0.00000
## Mean :0.604 Mean :0.8218 Mean :0.7921 Mean :0.07921
## 3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :1.000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## fins legs tail domestic
## Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :4.000 Median :1.0000 Median :0.0000
## Mean :0.1683 Mean :2.842 Mean :0.7426 Mean :0.1287
## 3rd Qu.:0.0000 3rd Qu.:4.000 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :8.000 Max. :1.0000 Max. :1.0000
## catsize class_type
## Min. :0.0000 Length:101
## 1st Qu.:0.0000 Class :character
## Median :0.0000 Mode :character
## Mean :0.4356
## 3rd Qu.:1.0000
## Max. :1.0000
Per the instructions, I concatenated multiple columns, in this case tail and fins. This is an easier way to narrow down the step above and look at specific information.
Because both columns are binary, each mean is the proportion of animals with that characteristic. I can see here that more animals in the dataset have tails (about 74%) than fins (about 17%).
summary(zoo[c("tail", "fins")])
## tail fins
## Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000
## Median :1.0000 Median :0.0000
## Mean :0.7426 Mean :0.1683
## 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000
If we slice the data again, but this time looking at legs, we can see the mean is not as helpful here. We know there are no animals with 2.8 legs, and even rounding up to 3 is not accurate, as there are no animals in our dataset with 3 legs.
Instead, we can extract other learnings from this, such as the minimum and maximum number of legs. Some animals have as few as 0 legs, while the most legs any of these animals has is 8. Looking at the interquartile range, the middle half of the animals have between 2 and 4 legs.
summary(zoo[c("legs")])
## legs
## Min. :0.000
## 1st Qu.:2.000
## Median :4.000
## Mean :2.842
## 3rd Qu.:4.000
## Max. :8.000
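For a count variable like legs, a frequency table is often more informative than a mean. The sketch below uses a small made-up vector in place of zoo$legs (the values are hypothetical, not the real data) to show the idea.

```r
# Hypothetical leg counts standing in for zoo$legs
legs <- c(0, 0, 2, 2, 2, 4, 4, 4, 4, 6, 8)

# Number of animals with each distinct leg count
leg_counts <- table(legs)
leg_counts

# Most common leg count (the mode)
names(leg_counts)[which.max(leg_counts)]  # "4"
```

With the real data, table(zoo$legs) would show exactly which leg counts occur and how often.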
ggvis is a data visualization package for R similar to ggplot2. We will use this to visually represent some of the data we just explored above.
Please note: I received an error that I needed to install Rtools. I went ahead and installed this via https://cran.rstudio.com/bin/windows/Rtools/
library(ggvis)
By choosing a specific characteristic and plotting it against the animal class types, we can visualize what characteristic these animal classes have or don’t have. In this case, I have plotted animal class against feathers. We can easily see here that if an animal has feathers, it is a part of the Bird class.
Please note: Since results on this dataset were binary, I did not find the fill options to be applicable as they would be layered on top of one another.
zoo %>%
ggvis(~class_type, ~feathers) %>%
layer_points()
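This is where machine learning can be helpful, because it can look at a variety of characteristics together to predict animal class, instead of getting stuck on situations like the one below, where a single characteristic such as hair does not identify one class on its own.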
zoo %>%
ggvis(~class_type, ~hair) %>%
layer_points()
I loaded the class package, which includes the functions needed for KNN.
library(class)
To run KNN for predicting animal class, we first need to train the model.
To prepare for training, we need to set a seed and create a split of our data. We will first test out a split of 80/20. This is loosely based on the Pareto Principle, the 80/20 rule, which claims that 80% of effects come from 20% of the causes.
We will set the seed to 3465 with set.seed() so that the random split is reproducible; the particular number is arbitrary. Next, we will create our sample and call it ‘ind’. We use nrow() to return the number of rows in our dataset, and sample() draws one value from 1:2 for each row, so that every row is assigned a 1 or a 2 with probabilities 0.8 and 0.2. Replacement is set to TRUE here because we are drawing 101 values from only two options, so the same value has to be allowed to come up more than once.
Next, we will use our variable ‘ind’ to set up our training and test sets. These will include all of the animal characteristics, which can be found in indices 2:17. Our training set will be the rows assigned a 1, and our test set will be the rows assigned a 2. Because each row is assigned at random, roughly 80% of our data should end up in the training set and roughly 20% in the test set, but the exact proportions will vary.
We will also set our training and test labels equal to the index 18 in our dataset. This is the label we will be training the model to predict.
Please note: I received the below error when I included animal_name (index 1). From my understanding, this could not be included because it was not a number, but I wasn’t sure why. Can KNN only predict categorical variables, but not base predictions off of categorical variables?
NAs introduced by coercion
NAs introduced by coercion
Error in knn(train = zooTrain, test = zooTest, cl = zooTrainLabels, k = 9) : NA/NaN/Inf in foreign function call (arg 6)
set.seed(3465)
ind <- sample(2, nrow(zoo), replace=TRUE, prob=c(0.8, 0.2))
zooTrain <- zoo[ind==1, 2:17]
zooTest <- zoo[ind==2, 2:17]
zooTrainLabels <- zoo[ind==1, 18]
zooTestLabels <- zoo[ind==2, 18]
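As a quick check on how the split behaves, the sketch below repeats the same sample() call on row numbers only (no zoo data needed) and, as an alternative that is not part of the assignment, draws an exact 20% test set without replacement. The variable names test_rows and train_rows are hypothetical.

```r
set.seed(3465)
n <- 101  # number of rows in the zoo data, per the summary above

# Approach used above: each row is independently labelled 1 or 2,
# so the group sizes only approximate 80/20.
ind <- sample(2, n, replace = TRUE, prob = c(0.8, 0.2))
table(ind)

# Alternative: draw an exact number of test rows without replacement.
test_rows  <- sample(n, size = round(0.2 * n))
train_rows <- setdiff(seq_len(n), test_rows)
length(test_rows)   # 20
length(train_rows)  # 81
```

The first approach is simpler to write; the second guarantees the test set size, which makes accuracy figures easier to compare between runs.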
Now we are ready to train. We will use the supervised learning method, KNN. This will predict animal classification based on the classification of its nearest neighbors.
Please note: This was not in the instructions, but I did some research to determine how to choose the optimal k, as I read this is an important input. I found that a common rule of thumb is to take the square root of your sample size, and to make sure it is odd and not too small or too large. I have 101 samples, which comes out to a square root of about 10. I rounded down to the nearest odd number, 9, to make sure I wasn’t choosing anything too large or even.
zoo_pred <- knn(train = zooTrain, test = zooTest, cl = zooTrainLabels, k=9)
zoo_pred
## [1] Bird Mammal Mammal Mammal Bird Mammal Bug
## [8] Mammal Mammal Bird Fish Fish Mammal Fish
## [15] Amphibian Mammal
## Levels: Amphibian Bird Bug Fish Invertebrate Mammal Reptile
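The square-root rule is only a starting point. One way to compare candidate values of k, sketched below, is leave-one-out cross-validation with knn.cv() from the same class package. The data here is a randomly generated stand-in (hypothetical, not the zoo data), so the accuracies it prints say nothing about the real dataset.

```r
library(class)

# Hypothetical stand-in data: 60 animals, 4 binary traits, 2 classes.
set.seed(1)
n <- 60
cls <- factor(rep(c("Bird", "Mammal"), each = n / 2))
traits <- data.frame(
  feathers = as.numeric(cls == "Bird"),
  milk     = as.numeric(cls == "Mammal"),
  aquatic  = rbinom(n, 1, 0.3),
  predator = rbinom(n, 1, 0.5)
)

# Square-root rule of thumb described above
sqrt(n)

# Leave-one-out accuracy for several odd values of k:
# each row is predicted from all of the other rows.
ks <- c(1, 3, 5, 7, 9)
loo_acc <- sapply(ks, function(k) {
  pred <- knn.cv(train = traits, cl = cls, k = k)
  mean(pred == cls)
})
data.frame(k = ks, accuracy = loo_acc)
```

With the real data, the same loop could be run on zoo[, 2:17] and zoo[, 18] to see whether k = 9 is a reasonable choice.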
Now that we have trained our model, we can evaluate its effectiveness. We can use Cross Table for this, which will illustrate the relationship between the two variables by comparing the Test Label to the result of the KNN model. We are setting chi-square contribution to FALSE, so that this will not show up in the cells.
Per the below, we can see that the KNN model guessed 88% (14 of 16) of the animal classifications correctly. This model worked well for classifying Amphibians, Birds, Fish, and Mammals, predicting all of those cases correctly. Invertebrate and Reptile did not go so well: the one Invertebrate was predicted as a Bug, and the one Reptile was predicted as a Fish.
Please note: My Test Labels only included 6 of the 7 animal classes. I thought this might be due to sampling, but please correct me if I’m wrong.
library(gmodels)
CrossTable(x = zooTestLabels, y = zoo_pred, prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 16
##
##
## | zoo_pred
## zooTestLabels | Amphibian | Bird | Bug | Fish | Mammal | Row Total |
## --------------|-----------|-----------|-----------|-----------|-----------|-----------|
## Amphibian | 1 | 0 | 0 | 0 | 0 | 1 |
## | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.062 |
## | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## | 0.062 | 0.000 | 0.000 | 0.000 | 0.000 | |
## --------------|-----------|-----------|-----------|-----------|-----------|-----------|
## Bird | 0 | 3 | 0 | 0 | 0 | 3 |
## | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.188 |
## | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | |
## | 0.000 | 0.188 | 0.000 | 0.000 | 0.000 | |
## --------------|-----------|-----------|-----------|-----------|-----------|-----------|
## Fish | 0 | 0 | 0 | 2 | 0 | 2 |
## | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.125 |
## | 0.000 | 0.000 | 0.000 | 0.667 | 0.000 | |
## | 0.000 | 0.000 | 0.000 | 0.125 | 0.000 | |
## --------------|-----------|-----------|-----------|-----------|-----------|-----------|
## Invertebrate | 0 | 0 | 1 | 0 | 0 | 1 |
## | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.062 |
## | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | |
## | 0.000 | 0.000 | 0.062 | 0.000 | 0.000 | |
## --------------|-----------|-----------|-----------|-----------|-----------|-----------|
## Mammal | 0 | 0 | 0 | 0 | 8 | 8 |
## | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500 |
## | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.500 | |
## --------------|-----------|-----------|-----------|-----------|-----------|-----------|
## Reptile | 0 | 0 | 0 | 1 | 0 | 1 |
## | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.062 |
## | 0.000 | 0.000 | 0.000 | 0.333 | 0.000 | |
## | 0.000 | 0.000 | 0.000 | 0.062 | 0.000 | |
## --------------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 1 | 3 | 1 | 3 | 8 | 16 |
## | 0.062 | 0.188 | 0.062 | 0.188 | 0.500 | |
## --------------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
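The overall accuracy can also be computed directly rather than read off the table. The sketch below rebuilds the 16 true and predicted labels from the counts in the cross table above (the row order is arbitrary, and the vector names actual and predicted are hypothetical).

```r
# True and predicted labels for the 16 test rows, reconstructed from
# the cell counts in the cross table above.
actual <- rep(
  c("Amphibian", "Bird", "Fish", "Invertebrate", "Mammal", "Reptile"),
  times = c(1, 3, 2, 1, 8, 1)
)
predicted <- rep(
  c("Amphibian", "Bird", "Fish", "Bug", "Mammal", "Fish"),
  times = c(1, 3, 2, 1, 8, 1)
)

# Overall accuracy: share of test rows where the prediction matches the truth
mean(actual == predicted)  # 14 / 16 = 0.875

# The misclassified rows
data.frame(actual, predicted)[actual != predicted, ]
```

In the workflow above, the equivalent one-liner would be mean(zoo_pred == zooTestLabels).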
Now, let’s repeat but with a 67/33 split to see if that is any more effective than 80/20. 67/33 means that we will have a smaller training group than we had prior, but a larger test group. We can revise our code to set up the new split:
ind <- sample(2, nrow(zoo), replace=TRUE, prob=c(0.67, 0.33))
zooTrain <- zoo[ind==1, 2:17]
zooTest <- zoo[ind==2, 2:17]
zooTrainLabels <- zoo[ind==1, 18]
zooTestLabels <- zoo[ind==2, 18]
zoo_pred <- knn(train = zooTrain, test = zooTest, cl = zooTrainLabels, k=9)
zoo_pred
## [1] Mammal Mammal Bug Fish Fish
## [6] Mammal Invertebrate Mammal Mammal Bug
## [11] Mammal Mammal Bird Bug Bird
## [16] Bug Mammal Bug Mammal Bird
## [21] Bird Fish Fish Mammal Mammal
## [26] Fish Mammal Mammal Fish Fish
## [31] Bird Mammal Bird
## Levels: Amphibian Bird Bug Fish Invertebrate Mammal Reptile
This time, we had roughly 2x the observations in the test set; however, the model only predicted 76% (25 of 33) of animal classifications correctly. The model worked well for Birds and Fish, and got 3 of the 4 Bugs right. Amphibian, Invertebrate, and Reptile were completely wrong, and Mammal had 3 incorrect classifications (all predicted as Fish).
Please note: This time, all 7 animal classes were included in the test labels. I concluded that this was due to the larger test sample, but please correct me if I’m wrong.
CrossTable(x = zooTestLabels, y = zoo_pred, prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 33
##
##
## | zoo_pred
## zooTestLabels | Bird | Bug | Fish | Invertebrate | Mammal | Row Total |
## --------------|--------------|--------------|--------------|--------------|--------------|--------------|
## Amphibian | 0 | 0 | 0 | 0 | 1 | 1 |
## | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.030 |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.071 | |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.030 | |
## --------------|--------------|--------------|--------------|--------------|--------------|--------------|
## Bird | 6 | 0 | 0 | 0 | 0 | 6 |
## | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.182 |
## | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## | 0.182 | 0.000 | 0.000 | 0.000 | 0.000 | |
## --------------|--------------|--------------|--------------|--------------|--------------|--------------|
## Bug | 0 | 3 | 0 | 1 | 0 | 4 |
## | 0.000 | 0.750 | 0.000 | 0.250 | 0.000 | 0.121 |
## | 0.000 | 0.600 | 0.000 | 1.000 | 0.000 | |
## | 0.000 | 0.091 | 0.000 | 0.030 | 0.000 | |
## --------------|--------------|--------------|--------------|--------------|--------------|--------------|
## Fish | 0 | 0 | 3 | 0 | 0 | 3 |
## | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.091 |
## | 0.000 | 0.000 | 0.429 | 0.000 | 0.000 | |
## | 0.000 | 0.000 | 0.091 | 0.000 | 0.000 | |
## --------------|--------------|--------------|--------------|--------------|--------------|--------------|
## Invertebrate | 0 | 2 | 0 | 0 | 0 | 2 |
## | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.061 |
## | 0.000 | 0.400 | 0.000 | 0.000 | 0.000 | |
## | 0.000 | 0.061 | 0.000 | 0.000 | 0.000 | |
## --------------|--------------|--------------|--------------|--------------|--------------|--------------|
## Mammal | 0 | 0 | 3 | 0 | 13 | 16 |
## | 0.000 | 0.000 | 0.188 | 0.000 | 0.812 | 0.485 |
## | 0.000 | 0.000 | 0.429 | 0.000 | 0.929 | |
## | 0.000 | 0.000 | 0.091 | 0.000 | 0.394 | |
## --------------|--------------|--------------|--------------|--------------|--------------|--------------|
## Reptile | 0 | 0 | 1 | 0 | 0 | 1 |
## | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.030 |
## | 0.000 | 0.000 | 0.143 | 0.000 | 0.000 | |
## | 0.000 | 0.000 | 0.030 | 0.000 | 0.000 | |
## --------------|--------------|--------------|--------------|--------------|--------------|--------------|
## Column Total | 6 | 5 | 7 | 1 | 14 | 33 |
## | 0.182 | 0.152 | 0.212 | 0.030 | 0.424 | |
## --------------|--------------|--------------|--------------|--------------|--------------|--------------|
##
##
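To double-check the figures for this split, the sketch below copies the cell counts from the cross table above into a matrix and recomputes the overall accuracy and the share of each actual class that was predicted correctly (per-class recall). The object names are hypothetical.

```r
classes_actual <- c("Amphibian", "Bird", "Bug", "Fish",
                    "Invertebrate", "Mammal", "Reptile")
classes_pred   <- c("Bird", "Bug", "Fish", "Invertebrate", "Mammal")

# Counts copied from the cross table above (rows = actual, columns = predicted)
conf <- matrix(
  c(0, 0, 0, 0,  1,
    6, 0, 0, 0,  0,
    0, 3, 0, 1,  0,
    0, 0, 3, 0,  0,
    0, 2, 0, 0,  0,
    0, 0, 3, 0, 13,
    0, 0, 1, 0,  0),
  nrow = 7, byrow = TRUE,
  dimnames = list(actual = classes_actual, predicted = classes_pred)
)

# Correct predictions are the cells whose row and column names match
shared  <- intersect(classes_actual, classes_pred)
correct <- sapply(shared, function(cl) conf[cl, cl])

# Overall accuracy: 25 correct out of 33
sum(correct) / sum(conf)

# Per-class recall for the classes that were predicted at least once
correct / rowSums(conf)[shared]
```

Amphibian and Reptile never appear as predictions in this run, so their recall is 0 even though they are not in the shared list.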
In these runs, the 80/20 split for the KNN model was more effective for predicting animal class than 67/33; however, there are some caveats. The test sets are small (16 and 33 animals), so part of the difference may come down to which animals happened to land in each test set. The model works well for predicting Birds and Fish, and for most Mammals, but it has not proven to work as well for the other animal class types. The other animal class types likely have too many similarities in the characteristics within this dataset to properly differentiate across animal classes. To improve the accuracy, we could look to add more unique characteristics into our dataset that might create stronger differences between them.
#1. How do supervised learning algorithms solve regression and classification problems? (I am not wanting a description of the internal workings of the algorithms.)
Regression and Classification both fall under Supervised Learning; however, their outputs are different. Regression outputs are numerical (or continuous), while Classification outputs are categorical (or discrete).
The example from this exercise shows how Supervised Learning solves for Classification. We trained the model to predict animal classification based on the animal classes of its nearest neighbors in terms of animal characteristics. Regression would be a similar concept, but with the output being numerical. An example of this could be predicting weight based on age and height. Our dataset would look a bit different than the binary dataset we used in the zoo animals example, but the method for prediction would be the same: looking at the weight of its nearest neighbors in terms of age and height to predict weight.
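The weight example could be sketched as follows. The data and the helper knn_regress are hypothetical, and scaling the predictors before measuring distance is an assumption added here so that age and height contribute comparably.

```r
# Hypothetical training data: age (years), height (cm), weight (kg)
people <- data.frame(
  age    = c(25, 30, 35, 40, 45, 50),
  height = c(160, 165, 170, 175, 180, 185),
  weight = c(55, 60, 68, 74, 80, 86)
)

# Predict a numeric response as the mean of the k nearest neighbors' values.
knn_regress <- function(train_x, train_y, new_x, k = 3) {
  # Scale predictors so each contributes comparably to the distance
  centers <- colMeans(train_x)
  scales  <- apply(train_x, 2, sd)
  x_scaled   <- scale(train_x, center = centers, scale = scales)
  new_scaled <- (as.numeric(new_x) - centers) / scales
  # Euclidean distance from the new point to every training row
  dists <- sqrt(rowSums(sweep(x_scaled, 2, new_scaled)^2))
  # Average the response over the k closest rows
  mean(train_y[order(dists)[seq_len(k)]])
}

# The three nearest people weigh 68, 74 and 60 kg, so the prediction is their mean
knn_regress(people[c("age", "height")], people$weight,
            new_x = c(age = 36, height = 171), k = 3)
```

The only change from classification is the last step: the neighbors' values are averaged instead of put to a vote.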
#2. What packages (in R, Python…) perform supervised learning?
In this example, we used KNN via the class package in R for Supervised Learning. Some other packages we could have used are:
1. ada (Adaptive Boosting, which boosts the performance of decision trees on binary classification problems)
2. randomForest (an ensemble learning method using a multitude of decision trees)
3. e1071 (SVM for text/image classification, hand-written character recognition, spam filtering, etc.)
#3. What measures of quality of the learning algorithm might you expect to see?
Quality depends on the consistency and accuracy of the labeled data, and it can be measured by how often the model’s predictions match the known labels, both overall and for each class.
From this exercise, we can see via the Cross Table that KNN did not work perfectly. We also saw that testing different splits was helpful to determine the split with the highest accuracy. We could then look for commonalities among the classes that were consistently not predicted correctly, to see what more we could do to increase the quality of our model. An example of this in the case of zoo animals would be including more unique characteristics in our dataset for the animals that were predicted incorrectly. We could repeat this process until our model was running at the highest accuracy possible.
This is reflective of how Machine Learning operates. It is likely very rare to run at a high accuracy from the start. It is helpful to evaluate runs, look for ways to improve, test again, and then repeat that process until the learning algorithms are running at the highest quality possible.
In this exercise, we used Supervised Learning via KNN to train the algorithm to predict animal classes based on the animal classes of its nearest neighbors in terms of animal characteristics. We saw that the model was not perfect, but was able to successfully predict as high as 88% of the classes correctly.
I could see Supervised Learning being useful in my Advertising career where we are running in-theater ads before the movies. Since we sell ads based on estimated people in-theater, we need a way to predict movie attendance prior to films being released. There is likely a better way to estimate this using machine learning because we have different variables such as genres, actors, movie studio, seasonality, and attendance from past films that could be used to help predict attendance for upcoming films. Since movies are an art and not a science, there could very well be low accuracy with this, at least to start, but it never hurts to have an extra data point (especially considering our research team often predicts these very incorrectly).
I was curious if Supervised Learning could be used in tandem with other learning models, e.g. Reinforcement Learning? As in, Supervised Learning could be the base for the model, and then further improved via Reinforcement training?
I was also curious if there are any benchmarks for machine learning accuracy? For example, should accuracy rates be closer to 100%? I am assuming it depends on the industry, but was curious if there were any general guidelines.
https://financetrain.com/difference-between-model-and-algorithm/#:~:text=To%20summarize%2C%20an%20algorithm%20is,produces%20some%20value%20as%20output.
https://www.programmingr.com/tutorial/nrow-ncol/#:~:text=The%20function%20nrow%20in%20R,the%20elements%20to%20count%20them.&text=Using%20this%20example%20of%20the,set%20has%20thirty%2Dtwo%20rows.
https://stats.oecd.org/glossary/detail.asp?ID=3835#:~:text=Definition%3A,to%20be%20%E2%80%9Cwith%20replacement%E2%80%9D.
https://towardsdatascience.com/finally-why-we-use-an-80-20-split-for-training-and-test-data-plus-an-alternative-method-oh-yes-edc77e96295d
https://quantdev.ssri.psu.edu/sites/qdev/files/kNN_tutorial.html
https://livefreeordichotomize.com/2018/01/22/a-set-seed-ggplot2-adventure/#:~:text=The%20set.,always%20get%20the%20same%20sample.
https://www.analyticsvidhya.com/blog/2018/08/k-nearest-neighbor-introduction-regression-python/
https://en.wikipedia.org/wiki/Random_forest
https://worldclass.regis.edu/content/enforced/257453-CG_MSDS650-CLASS_C70_20F8W2/Course%20Resources/FTE/Week7_Machine%20Learning_v2.pdf?_&d2lSessionVal=lX7PU5awhnMrqTF8aBPapavY4&ou=257453
https://machinelearningmastery.com/boosting-and-adaboost-for-machine-learning/
https://www.engineeringbigdata.com/apriori-algorithm/#:~:text=The%20Apriori%20algorithm%20can%20be,with%20a%20basic%20example%20set.&text=Apriori%20must%20be%20able%20to,and%20connected%20closer%20to%20relationships.
https://hackernoon.com/how-to-measure-quality-when-training-machine-learning-models-cc9196dd377a#:~:text=Training%20data%20quality%20is%20critical,)%2C%20consensus%2C%20and%20review.