I. Introduction (10 points).

In this exercise, we will explore Supervised Learning. Supervised Learning algorithms use input data and response values to train the machine learning model to predict response values for future data. There are two categories of these algorithms: Classification and Regression.

For this assignment, we will focus on Classification, using the k-Nearest Neighbors (KNN) algorithm, which classifies a point based on the classes of its k nearest points. I chose a dataset of zoo animals that includes a variety of their characteristics (such as feathers, fins, hair, and legs) and their respective animal classification: Mammal, Bird, Reptile, Fish, Amphibian, Bug, and Invertebrate. I will use KNN to predict animal classifications, and check the predictions with a Cross Table to see how well the model performed.

II. Methods/Code (20 points), III. Results/Output (20 points).

Step 1: Load the dataset into R & verify data

I loaded in the zoo data, and viewed it to confirm everything was properly retrieved.

zoo <- read.csv("C:/Users/linds/Documents/MSDS/MSDS 650 - Data Analytics/Week 7/zoo.csv")
View(zoo)

Step 2: Explore the data

Below I ran a summary of the data, which shows all of the animal characteristics. Almost all characteristics are binary: a 1 means the characteristic is present, and a 0 means it is not. The exception is legs, which is a count ranging from 0 to 8. We can also see there are 101 animal names loaded into this dataset.

summary(zoo)
##  animal_name             hair           feathers          eggs       
##  Length:101         Min.   :0.0000   Min.   :0.000   Min.   :0.0000  
##  Class :character   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.0000  
##  Mode  :character   Median :0.0000   Median :0.000   Median :1.0000  
##                     Mean   :0.4257   Mean   :0.198   Mean   :0.5842  
##                     3rd Qu.:1.0000   3rd Qu.:0.000   3rd Qu.:1.0000  
##                     Max.   :1.0000   Max.   :1.000   Max.   :1.0000  
##       milk           airborne         aquatic          predator     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :0.0000   Median :1.0000  
##  Mean   :0.4059   Mean   :0.2376   Mean   :0.3564   Mean   :0.5545  
##  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##     toothed         backbone         breathes         venomous      
##  Min.   :0.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.000   1st Qu.:1.0000   1st Qu.:1.0000   1st Qu.:0.00000  
##  Median :1.000   Median :1.0000   Median :1.0000   Median :0.00000  
##  Mean   :0.604   Mean   :0.8218   Mean   :0.7921   Mean   :0.07921  
##  3rd Qu.:1.000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.00000  
##  Max.   :1.000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
##       fins             legs            tail           domestic     
##  Min.   :0.0000   Min.   :0.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:2.000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :4.000   Median :1.0000   Median :0.0000  
##  Mean   :0.1683   Mean   :2.842   Mean   :0.7426   Mean   :0.1287  
##  3rd Qu.:0.0000   3rd Qu.:4.000   3rd Qu.:1.0000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :8.000   Max.   :1.0000   Max.   :1.0000  
##     catsize        class_type       
##  Min.   :0.0000   Length:101        
##  1st Qu.:0.0000   Class :character  
##  Median :0.0000   Mode  :character  
##  Mean   :0.4356                     
##  3rd Qu.:1.0000                     
##  Max.   :1.0000

Per the instructions, I selected multiple columns with c(), in this case tail and fins. This is an easy way to narrow the summary above to specific columns.

Because these columns are binary, each mean is the share of animals with that characteristic: more animals in the dataset have tails (about 74%) than fins (about 17%).

summary(zoo[c("tail", "fins")]) 
##       tail             fins       
##  Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :1.0000   Median :0.0000  
##  Mean   :0.7426   Mean   :0.1683  
##  3rd Qu.:1.0000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.0000

If we slice the data again, this time looking at legs, the mean is not as helpful. No animal has 2.8 legs, and even rounding up to 3 is inaccurate, as no animal in our dataset has 3 legs.

Instead, we can take other learnings from this summary, such as the minimum and maximum number of legs. Some animals have as few as 0 legs, while the maximum is 8. Looking at the interquartile range, the middle half of the animals have between 2 and 4 legs.

summary(zoo[c("legs")]) 
##       legs      
##  Min.   :0.000  
##  1st Qu.:2.000  
##  Median :4.000  
##  Mean   :2.842  
##  3rd Qu.:4.000  
##  Max.   :8.000
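
For a count variable like legs, a frequency table can be more informative than the mean. The following is a minimal sketch using a made-up vector, legs_example (hypothetical values, not the real legs column); with the real data the same call would be table(zoo$legs).

```r
# Hypothetical leg counts standing in for zoo$legs
legs_example <- c(0, 0, 2, 2, 4, 4, 4, 6, 8)

# Count how many animals have each number of legs
leg_counts <- table(legs_example)
print(leg_counts)

# The most common leg count (the mode), returned as a character label
most_common <- names(leg_counts)[which.max(leg_counts)]
print(most_common)
```

Unlike the mean, the mode is always a value that actually occurs in the data.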

Step 3: Load in the ggvis package

ggvis is a data visualization package for R similar to ggplot2. We will use this to visually represent some of the data we just explored above.

Please note: I received an error that I needed to install Rtools. I went ahead and installed this via https://cran.rstudio.com/bin/windows/Rtools/

library(ggvis) 

Step 4: Make a scatterplot of the dataset

By choosing a specific characteristic and plotting it against the animal class types, we can visualize what characteristic these animal classes have or don’t have. In this case, I have plotted animal class against feathers. We can easily see here that if an animal has feathers, it is a part of the Bird class.

Please note: Since results on this dataset were binary, I did not find the fill options to be applicable as they would be layered on top of one another.

zoo %>% 
  ggvis(~class_type, ~feathers) %>% 
  layer_points() 

Let’s take a look at a different characteristic: hair. When I plot below, you can see that this is not as clean as the previous plot. There are 2 animal classes, Mammal and Bug, that include animals both with and without hair. This means that if an animal has hair, we can’t easily predict what type of animal class it belongs to based off this one characteristic.

This is where machine learning can be helpful, because it can look at a variety of factors to predict animal class instead of getting stuck on situations like the below.

zoo %>% 
  ggvis(~class_type, ~hair) %>% 
  layer_points() 
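
Because binary points plot on top of one another, a count table can complement the scatterplot. The sketch below uses a small made-up data frame, zoo_example (hypothetical rows, not the real dataset), to show how table() reveals which classes contain animals both with and without hair.

```r
# Hypothetical rows standing in for the zoo data
zoo_example <- data.frame(
  class_type = c("Mammal", "Mammal", "Mammal", "Bug", "Bug", "Bird", "Fish"),
  hair       = c(1, 1, 0, 1, 0, 0, 0)
)

# Rows are classes, columns are hair = 0 / hair = 1
hair_by_class <- table(zoo_example$class_type, zoo_example$hair)
print(hair_by_class)

# Classes that appear in both columns cannot be separated by hair alone
mixed_classes <- rownames(hair_by_class)[rowSums(hair_by_class > 0) > 1]
print(mixed_classes)
```

With the real data, table(zoo$class_type, zoo$hair) would give the counts hidden behind each plotted point.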

Step 5: Load the “class” package

I loaded the class package, which includes the functions needed for KNN.

library(class) 

Step 6: Training and Test Sets

To run KNN for predicting animal class, we first need to train the model.

To prepare for training, we need to set a seed and create a split of our data. We will first try an 80/20 split. This ratio is often associated with the Pareto Principle (the 80/20 rule), which claims that roughly 80% of effects come from 20% of the causes.

We will set the seed to 3465 with set.seed(), so that the random sampling below gives the same result each time the code is run. Next, we will create our sample and call it ‘ind’. We use nrow() to return the number of rows in our dataset, and sample(2, ...) to assign a 1 or 2 to each row according to the split probabilities. Replacement is set to TRUE, which means each value (1 or 2) is “replaced” after it is drawn. In other words, the same value can be assigned to more than one row, which is necessary because we are drawing 101 values from only two options.

Next, we will use our variable ‘ind’ to set up our training and test sets. These will include all of the animal characteristics, which can be found in indices 2:17. Our training set will be the rows assigned 1, and our test set the rows assigned 2. Using the 80/20 split, roughly 80% of our rows should be assigned 1 (our training set) and roughly 20% assigned 2 (our test set). Because each row is assigned at random, the actual proportions will not be exactly 80/20.

We will also set our training and test labels equal to the index 18 in our dataset. This is the label we will be training the model to predict.

Please note: I received the below error when I included animal_name (index 1). From my understanding, this column could not be included because it is not a number, but I wasn’t sure why. Can KNN only predict categorical variables, but not base predictions off of categorical variables?

NAs introduced by coercion
NAs introduced by coercion
Error in knn(train = zooTrain, test = zooTest, cl = zooTrainLabels, k = 9) : NA/NaN/Inf in foreign function call (arg 6)
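
One likely explanation, offered as a sketch rather than a definitive account of how knn() works internally: KNN measures the distance between rows, so every predictor has to be numeric. When a character column such as animal_name is present, converting the data to numbers produces NA values, which would match the "NAs introduced by coercion" warning. The toy data frame below (two made-up rows, hypothetical) illustrates the effect.

```r
# Hypothetical two-row data frame mixing a text column with a numeric one
mixed <- data.frame(
  animal_name = c("aardvark", "bass"),
  hair        = c(1, 0),
  stringsAsFactors = FALSE
)

# With a character column present, the whole matrix becomes character
as_matrix <- as.matrix(mixed)
print(is.character(as_matrix))

# Converting to numbers turns the names into NA (warning suppressed here)
coerced <- suppressWarnings(as.numeric(as_matrix))
print(coerced)
```

Dropping the name column, as in zoo[, 2:17], leaves only numeric predictors and avoids the NA values.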

set.seed(3465)
ind <- sample(2, nrow(zoo), replace=TRUE, prob=c(0.8, 0.2)) 

zooTrain <- zoo[ind==1, 2:17]
zooTest <- zoo[ind==2, 2:17] 
zooTrainLabels <- zoo[ind==1, 18] 
zooTestLabels <- zoo[ind==2, 18] 
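
Since sample() with prob assigns each row independently, the realized split is only approximately 80/20. The sketch below checks the realized sizes for 101 rows and, as an alternative that is not what the code above does, draws an exact 80/20 split by sampling row numbers without replacement. The names exact_train_rows and exact_test_rows are hypothetical.

```r
set.seed(3465)
n <- 101  # number of rows in the zoo data

# Probabilistic assignment, as in the step above
ind <- sample(2, n, replace = TRUE, prob = c(0.8, 0.2))
split_counts <- table(ind)
print(split_counts)
print(round(prop.table(split_counts), 2))

# Alternative: an exact-size split by sampling row numbers without replacement
exact_train_rows <- sample(n, size = round(0.8 * n))
exact_test_rows  <- setdiff(seq_len(n), exact_train_rows)
print(c(train = length(exact_train_rows), test = length(exact_test_rows)))
```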

Step 7: Find the k Nearest Neighbors of the training set

Now we are ready to train. We will use the supervised learning method, KNN. This will predict animal classification based on the classification of its nearest neighbors.

Please note: This was not in the instructions, but I did some research to determine how to calculate the optimal k, as I read this is an important input. I found that a common rule of thumb is to take the square root of your sample size, make sure it is odd, and make sure it is not too small or too large. I have 101 samples, which comes out to a square root of about 10. I rounded down to the nearest odd number, 9, to make sure I wasn’t choosing anything too large or even.
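
The arithmetic behind this rule of thumb can be sketched as follows. It is only a heuristic starting point, not a guarantee of the best k.

```r
n <- 101                 # number of samples in the zoo data
k_raw <- sqrt(n)         # about 10.05
k <- floor(k_raw)        # 10

# Step down to the nearest odd number to reduce the chance of tied votes
if (k %% 2 == 0) {
  k <- k - 1
}
print(k)
```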

zoo_pred <- knn(train = zooTrain, test = zooTest, cl = zooTrainLabels, k=9)
zoo_pred 
##  [1] Bird      Mammal    Mammal    Mammal    Bird      Mammal    Bug      
##  [8] Mammal    Mammal    Bird      Fish      Fish      Mammal    Fish     
## [15] Amphibian Mammal   
## Levels: Amphibian Bird Bug Fish Invertebrate Mammal Reptile

Step 8: Evaluate the results

Now that we have trained our model, we can evaluate its effectiveness. We can use Cross Table for this, which will illustrate the relationship between the two variables by comparing the Test Label to the result of the KNN model. We are setting chi-square contribution to FALSE, so that this will not show up in the cells.

Per the below, the KNN model predicted 14 of the 16 test animals (about 88%) correctly. It classified every Amphibian, Bird, Fish, and Mammal in the test set correctly. Invertebrate and Reptile did not go so well: the test set held one animal of each, and both were misclassified (the Invertebrate as a Bug, the Reptile as a Fish).

Please note: My Test Labels only included 6 of the 7 animal classes. I thought this might be due to sampling, but please correct me if I’m wrong.

library(gmodels) 

CrossTable(x = zooTestLabels, y = zoo_pred, prop.chisq = FALSE) 
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  16 
## 
##  
##               | zoo_pred 
## zooTestLabels | Amphibian |      Bird |       Bug |      Fish |    Mammal | Row Total | 
## --------------|-----------|-----------|-----------|-----------|-----------|-----------|
##     Amphibian |         1 |         0 |         0 |         0 |         0 |         1 | 
##               |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.062 | 
##               |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
##               |     0.062 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
## --------------|-----------|-----------|-----------|-----------|-----------|-----------|
##          Bird |         0 |         3 |         0 |         0 |         0 |         3 | 
##               |     0.000 |     1.000 |     0.000 |     0.000 |     0.000 |     0.188 | 
##               |     0.000 |     1.000 |     0.000 |     0.000 |     0.000 |           | 
##               |     0.000 |     0.188 |     0.000 |     0.000 |     0.000 |           | 
## --------------|-----------|-----------|-----------|-----------|-----------|-----------|
##          Fish |         0 |         0 |         0 |         2 |         0 |         2 | 
##               |     0.000 |     0.000 |     0.000 |     1.000 |     0.000 |     0.125 | 
##               |     0.000 |     0.000 |     0.000 |     0.667 |     0.000 |           | 
##               |     0.000 |     0.000 |     0.000 |     0.125 |     0.000 |           | 
## --------------|-----------|-----------|-----------|-----------|-----------|-----------|
##  Invertebrate |         0 |         0 |         1 |         0 |         0 |         1 | 
##               |     0.000 |     0.000 |     1.000 |     0.000 |     0.000 |     0.062 | 
##               |     0.000 |     0.000 |     1.000 |     0.000 |     0.000 |           | 
##               |     0.000 |     0.000 |     0.062 |     0.000 |     0.000 |           | 
## --------------|-----------|-----------|-----------|-----------|-----------|-----------|
##        Mammal |         0 |         0 |         0 |         0 |         8 |         8 | 
##               |     0.000 |     0.000 |     0.000 |     0.000 |     1.000 |     0.500 | 
##               |     0.000 |     0.000 |     0.000 |     0.000 |     1.000 |           | 
##               |     0.000 |     0.000 |     0.000 |     0.000 |     0.500 |           | 
## --------------|-----------|-----------|-----------|-----------|-----------|-----------|
##       Reptile |         0 |         0 |         0 |         1 |         0 |         1 | 
##               |     0.000 |     0.000 |     0.000 |     1.000 |     0.000 |     0.062 | 
##               |     0.000 |     0.000 |     0.000 |     0.333 |     0.000 |           | 
##               |     0.000 |     0.000 |     0.000 |     0.062 |     0.000 |           | 
## --------------|-----------|-----------|-----------|-----------|-----------|-----------|
##  Column Total |         1 |         3 |         1 |         3 |         8 |        16 | 
##               |     0.062 |     0.188 |     0.062 |     0.188 |     0.500 |           | 
## --------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 
## 
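
The accuracy figure can be reproduced from the Cross Table. This sketch copies the row totals and the correctly classified counts from the 80/20 table above; the vector names are hypothetical.

```r
# Row totals and correct predictions, copied from the 80/20 Cross Table
test_totals <- c(Amphibian = 1, Bird = 3, Fish = 2,
                 Invertebrate = 1, Mammal = 8, Reptile = 1)
correct     <- c(Amphibian = 1, Bird = 3, Fish = 2,
                 Invertebrate = 0, Mammal = 8, Reptile = 0)

# Overall accuracy: share of all test animals classified correctly
overall_accuracy <- sum(correct) / sum(test_totals)
print(overall_accuracy)

# Per-class accuracy: share of each actual class classified correctly
per_class <- correct / test_totals
print(per_class)
```

With the original vectors available, mean(zoo_pred == zooTestLabels) should give the same overall figure.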

Now, let’s repeat but with a 67/33 split to see if that is any more effective than 80/20. 67/33 means that we will have a smaller training group than we had prior, but a larger test group. We can revise our code to set up the new split:

ind <- sample(2, nrow(zoo), replace=TRUE, prob=c(0.67, 0.33)) 
zooTrain <- zoo[ind==1, 2:17]
zooTest <- zoo[ind==2, 2:17] 
zooTrainLabels <- zoo[ind==1, 18] 
zooTestLabels <- zoo[ind==2, 18] 
zoo_pred <- knn(train = zooTrain, test = zooTest, cl = zooTrainLabels, k=9) 
zoo_pred 
##  [1] Mammal       Mammal       Bug          Fish         Fish        
##  [6] Mammal       Invertebrate Mammal       Mammal       Bug         
## [11] Mammal       Mammal       Bird         Bug          Bird        
## [16] Bug          Mammal       Bug          Mammal       Bird        
## [21] Bird         Fish         Fish         Mammal       Mammal      
## [26] Fish         Mammal       Mammal       Fish         Fish        
## [31] Bird         Mammal       Bird        
## Levels: Amphibian Bird Bug Fish Invertebrate Mammal Reptile

This time, we had roughly 2x the test observations (33 versus 16); however, the model predicted only 25 of 33 (about 76%) of animal classifications correctly. The model classified every Bird and Fish correctly, along with 3 of the 4 Bugs. Amphibian, Invertebrate, and Reptile were completely wrong, and 3 of the 16 Mammals were misclassified as Fish.

Please note: This time, all 7 animal classes were included in the test labels. I concluded that this was due to the larger test sample, but please correct me if I’m wrong.

CrossTable(x = zooTestLabels, y = zoo_pred, prop.chisq = FALSE) 
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  33 
## 
##  
##               | zoo_pred 
## zooTestLabels |         Bird |          Bug |         Fish | Invertebrate |       Mammal |    Row Total | 
## --------------|--------------|--------------|--------------|--------------|--------------|--------------|
##     Amphibian |            0 |            0 |            0 |            0 |            1 |            1 | 
##               |        0.000 |        0.000 |        0.000 |        0.000 |        1.000 |        0.030 | 
##               |        0.000 |        0.000 |        0.000 |        0.000 |        0.071 |              | 
##               |        0.000 |        0.000 |        0.000 |        0.000 |        0.030 |              | 
## --------------|--------------|--------------|--------------|--------------|--------------|--------------|
##          Bird |            6 |            0 |            0 |            0 |            0 |            6 | 
##               |        1.000 |        0.000 |        0.000 |        0.000 |        0.000 |        0.182 | 
##               |        1.000 |        0.000 |        0.000 |        0.000 |        0.000 |              | 
##               |        0.182 |        0.000 |        0.000 |        0.000 |        0.000 |              | 
## --------------|--------------|--------------|--------------|--------------|--------------|--------------|
##           Bug |            0 |            3 |            0 |            1 |            0 |            4 | 
##               |        0.000 |        0.750 |        0.000 |        0.250 |        0.000 |        0.121 | 
##               |        0.000 |        0.600 |        0.000 |        1.000 |        0.000 |              | 
##               |        0.000 |        0.091 |        0.000 |        0.030 |        0.000 |              | 
## --------------|--------------|--------------|--------------|--------------|--------------|--------------|
##          Fish |            0 |            0 |            3 |            0 |            0 |            3 | 
##               |        0.000 |        0.000 |        1.000 |        0.000 |        0.000 |        0.091 | 
##               |        0.000 |        0.000 |        0.429 |        0.000 |        0.000 |              | 
##               |        0.000 |        0.000 |        0.091 |        0.000 |        0.000 |              | 
## --------------|--------------|--------------|--------------|--------------|--------------|--------------|
##  Invertebrate |            0 |            2 |            0 |            0 |            0 |            2 | 
##               |        0.000 |        1.000 |        0.000 |        0.000 |        0.000 |        0.061 | 
##               |        0.000 |        0.400 |        0.000 |        0.000 |        0.000 |              | 
##               |        0.000 |        0.061 |        0.000 |        0.000 |        0.000 |              | 
## --------------|--------------|--------------|--------------|--------------|--------------|--------------|
##        Mammal |            0 |            0 |            3 |            0 |           13 |           16 | 
##               |        0.000 |        0.000 |        0.188 |        0.000 |        0.812 |        0.485 | 
##               |        0.000 |        0.000 |        0.429 |        0.000 |        0.929 |              | 
##               |        0.000 |        0.000 |        0.091 |        0.000 |        0.394 |              | 
## --------------|--------------|--------------|--------------|--------------|--------------|--------------|
##       Reptile |            0 |            0 |            1 |            0 |            0 |            1 | 
##               |        0.000 |        0.000 |        1.000 |        0.000 |        0.000 |        0.030 | 
##               |        0.000 |        0.000 |        0.143 |        0.000 |        0.000 |              | 
##               |        0.000 |        0.000 |        0.030 |        0.000 |        0.000 |              | 
## --------------|--------------|--------------|--------------|--------------|--------------|--------------|
##  Column Total |            6 |            5 |            7 |            1 |           14 |           33 | 
##               |        0.182 |        0.152 |        0.212 |        0.030 |        0.424 |              | 
## --------------|--------------|--------------|--------------|--------------|--------------|--------------|
## 
## 
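
Because this table is not square (seven actual classes, five predicted), the correct predictions are the cells where the row and column names match. The sketch below rebuilds the counts from the 67/33 Cross Table above and sums those cells; the object names are hypothetical.

```r
actual_classes    <- c("Amphibian", "Bird", "Bug", "Fish",
                       "Invertebrate", "Mammal", "Reptile")
predicted_classes <- c("Bird", "Bug", "Fish", "Invertebrate", "Mammal")

# Counts copied row by row from the 67/33 Cross Table
confusion <- matrix(
  c(0, 0, 0, 0,  1,
    6, 0, 0, 0,  0,
    0, 3, 0, 1,  0,
    0, 0, 3, 0,  0,
    0, 2, 0, 0,  0,
    0, 0, 3, 0, 13,
    0, 0, 1, 0,  0),
  nrow = 7, byrow = TRUE,
  dimnames = list(actual = actual_classes, predicted = predicted_classes)
)

# Sum the cells where the predicted class equals the actual class
shared <- intersect(actual_classes, predicted_classes)
n_correct <- sum(vapply(shared, function(cl) confusion[cl, cl], numeric(1)))
accuracy_67_33 <- n_correct / sum(confusion)
print(c(correct = n_correct, total = sum(confusion)))
print(round(accuracy_67_33, 3))
```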

IV. Analysis of Results (20 points).

Based on these two runs, the 80/20 split was more accurate for predicting animal class than 67/33 (14 of 16 correct versus 25 of 33); however, there are some caveats. Each result comes from a single random split with a small test set, so part of the difference could be due to chance. The model predicted Birds and Fish correctly in both runs, and Mammals mostly correctly, but it did not work as well for the other animal class types, most of which had only one or two animals in the test set. These classes likely share too many characteristics within this dataset for the model to properly differentiate them. To improve the accuracy, we could add more distinguishing characteristics to our dataset that might create stronger differences between classes.
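
One way to make the comparison between splits less dependent on a single random draw is to repeat each split many times and average the accuracy. The sketch below uses R's built-in iris data as a stand-in (an assumption, since zoo.csv is not bundled with R), and split_accuracy() is a hypothetical helper, not a function from the class package.

```r
library(class)

# Hypothetical helper: one random split, one KNN run, one accuracy value
split_accuracy <- function(data, label_col, train_prob, k = 9) {
  ind <- sample(2, nrow(data), replace = TRUE,
                prob = c(train_prob, 1 - train_prob))
  features <- data[, setdiff(names(data), label_col)]
  labels   <- data[[label_col]]
  pred <- knn(train = features[ind == 1, ],
              test  = features[ind == 2, ],
              cl    = labels[ind == 1],
              k     = k)
  mean(as.character(pred) == as.character(labels[ind == 2]))
}

set.seed(3465)
# iris stands in for the zoo data here
acc_80 <- replicate(50, split_accuracy(iris, "Species", 0.80))
acc_67 <- replicate(50, split_accuracy(iris, "Species", 0.67))

# Average accuracy and spread across the 50 repeated splits
print(c(mean_80 = mean(acc_80), sd_80 = sd(acc_80)))
print(c(mean_67 = mean(acc_67), sd_67 = sd(acc_67)))
```

For the zoo data, the same helper could be called as split_accuracy(zoo[, 2:18], "class_type", 0.80), assuming the columns are named as in the summary above.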

#1. How do supervised learning algorithms solve regression and classification problems? (I am not wanting a description of the internal workings of the algorithms.)

Regression and Classification both fall under Supervised learning, however their outputs are different. Regression outputs are numerical (or continuous), while Classification outputs are categorical (or discrete).

The example from this exercise shows how Supervised Learning solves for Classification. We trained the model to predict animal classification based on the animal classes of its nearest neighbors in terms of animal characteristics. Regression would be a similar concept, but with the output being numerical. An example of this could be predicting weight based on age and height. Our dataset would look a bit different from the mostly binary dataset we used in the zoo animals example, but the method for prediction would be the same: looking at the weights of the nearest neighbors in terms of age and height to predict weight.
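
The weight example can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions: the people matrix and weight_kg values are made up, and knn_regress() is a hypothetical helper written for this sketch, not a function from the class package.

```r
# Hypothetical helper: predict a number as the mean of the k nearest neighbors
knn_regress <- function(train_x, train_y, new_x, k = 3) {
  # Euclidean distance from the new point to every training row
  dists <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  nearest <- order(dists)[seq_len(k)]
  mean(train_y[nearest])
}

# Made-up training data: age in years, height in cm, weight in kg
people <- matrix(
  c(10, 140,
    12, 150,
    30, 175,
    32, 178,
    50, 170),
  ncol = 2, byrow = TRUE,
  dimnames = list(NULL, c("age", "height_cm"))
)
weight_kg <- c(35, 42, 70, 74, 80)

# Predict the weight of a 31-year-old who is 176 cm tall from the 2 nearest rows
predicted <- knn_regress(people, weight_kg, new_x = c(31, 176), k = 2)
print(predicted)
```

The two nearest neighbors are the 30- and 32-year-olds, so the prediction is the average of their weights.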

#2. What packages (in R, Python…) perform supervised learning?

In this example, we used KNN via the class package in R for Supervised Learning. Some other packages we could have used are:
1. ada (Adaptive Boosting, which boosts the performance of decision trees on binary classification problems)
2. randomForest (an ensemble learning method using a multitude of decision trees)
3. e1071 (SVM for text/image classification, hand-written character recognition, spam filtering, etc.)

#3. What measures of quality of the learning algorithm might you expect to see?

Quality can be measured by the consistency and accuracy of the labeled data, and by how accurately the model predicts labels for data it was not trained on.

From this exercise, we can see via the Cross Table that KNN did not work perfectly. We also saw that testing different splits was helpful to determine the split with the highest accuracy. We could then look at commonalities amongst the classes that were consistently not predicted correctly, to see what more we could do to increase the quality of our model. An example of this in the case of zoo animals would be including more distinguishing characteristics in our dataset for the animals that were predicted incorrectly. We could repeat this process until our model was running at the highest accuracy possible.

This is reflective of how Machine Learning operates. It is likely very rare to run at a high accuracy from the start. It is helpful to evaluate runs, look for ways to improve, test again, and then repeat that process until the learning algorithms are running at the highest quality possible.

V. Conclusion (20 points).

In this exercise, we used Supervised Learning via KNN to predict animal classes based on the animal classes of the nearest neighbors in terms of animal characteristics. We saw that the model was not perfect, but it predicted as many as 88% of the classes correctly.

I could see Supervised Learning being useful in my Advertising career, where we run in-theater ads before the movies. Since we sell ads based on estimated people in-theater, we need a way to predict movie attendance prior to films being released. There is likely a better way to estimate this using machine learning, because we have different variables such as genres, actors, movie studio, seasonality, and attendance from past films that could be used to help predict attendance for upcoming films. Since movies are an art and not a science, there could very well be low accuracy with this, at least to start, but it never hurts to have an extra data point (especially considering our research team often predicts these very incorrectly).

I was curious if Supervised Learning could be used in tandem with other learning models, e.g. Reinforcement? As in, Supervised Learning could be the base for the model, and then further improved via Reinforcement training?

I was also curious if there are any benchmarks for machine learning accuracy? For example, should accuracy rates be closer to 100%? I am assuming it depends on the industry, but was curious if there were any general guidelines.

VI. References (10 points).

https://financetrain.com/difference-between-model-and-algorithm/#:~:text=To%20summarize%2C%20an%20algorithm%20is,produces%20some%20value%20as%20output.
https://www.programmingr.com/tutorial/nrow-ncol/#:~:text=The%20function%20nrow%20in%20R,the%20elements%20to%20count%20them.&text=Using%20this%20example%20of%20the,set%20has%20thirty%2Dtwo%20rows.
https://stats.oecd.org/glossary/detail.asp?ID=3835#:~:text=Definition%3A,to%20be%20%E2%80%9Cwith%20replacement%E2%80%9D.
https://towardsdatascience.com/finally-why-we-use-an-80-20-split-for-training-and-test-data-plus-an-alternative-method-oh-yes-edc77e96295d
https://quantdev.ssri.psu.edu/sites/qdev/files/kNN_tutorial.html
https://livefreeordichotomize.com/2018/01/22/a-set-seed-ggplot2-adventure/#:~:text=The%20set.,always%20get%20the%20same%20sample.
https://www.analyticsvidhya.com/blog/2018/08/k-nearest-neighbor-introduction-regression-python/
https://en.wikipedia.org/wiki/Random_forest
https://worldclass.regis.edu/content/enforced/257453-CG_MSDS650-CLASS_C70_20F8W2/Course%20Resources/FTE/Week7_Machine%20Learning_v2.pdf?_&d2lSessionVal=lX7PU5awhnMrqTF8aBPapavY4&ou=257453
https://machinelearningmastery.com/boosting-and-adaboost-for-machine-learning/
https://www.engineeringbigdata.com/apriori-algorithm/#:~:text=The%20Apriori%20algorithm%20can%20be,with%20a%20basic%20example%20set.&text=Apriori%20must%20be%20able%20to,and%20connected%20closer%20to%20relationships.
https://hackernoon.com/how-to-measure-quality-when-training-machine-learning-models-cc9196dd377a#:~:text=Training%20data%20quality%20is%20critical,)%2C%20consensus%2C%20and%20review.