Classification of Wheat Kernel Varieties

An exploration using k-Nearest Neighbors to classify three types of wheat kernels

Brad Mager

Introduction

k-Nearest Neighbors is a classification algorithm that is easy to apply.

One needs to select:

  • The value of k (how many neighbors to use)
  • Which features to include from the data
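To make these two choices concrete, here is a minimal sketch of how k-Nearest Neighbors labels a single new point: measure the distance to every training point, keep the k closest, and take a majority vote. The function name `knn_predict` and the toy data are hypothetical and serve only as illustration; this is not the app's code.

```r
# Minimal k-NN sketch in base R (illustrative only).
knn_predict <- function(train_x, train_y, new_x, k = 1) {
  train_x <- as.matrix(train_x)
  # Euclidean distance from the new point to every training point
  diffs <- sweep(train_x, 2, as.numeric(new_x))
  dists <- sqrt(rowSums(diffs^2))
  # Labels of the k closest training points
  nearest <- order(dists)[seq_len(k)]
  votes <- table(train_y[nearest])
  # Majority vote; ties go to the first label in table order
  names(votes)[which.max(votes)]
}

# Hypothetical toy data: two features, two made-up classes
train_x <- data.frame(length = c(5.0, 5.2, 6.1, 6.3),
                      width  = c(3.0, 3.1, 3.8, 3.9))
train_y <- c("A", "A", "B", "B")

knn_predict(train_x, train_y, c(5.1, 3.0), k = 3)  # returns "A"
```

Dropping a column from `train_x` corresponds to excluding a feature, and changing `k` changes how many neighbors vote.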

The proposed app applies k-Nearest Neighbors to a data set and allows the user to choose the value of k and which features to include.

A graphical output displays the data points colored by predicted class and marks the ones that were misclassified.

The goal is to help the user understand which features of this particular data set are most important, and how the value of k affects the accuracy of the classifier.

The Project

Researchers examined three varieties of wheat: Kama, Rosa, and Canadian.

They measured seven geometric parameters of the kernels, including length, width, area, and perimeter.

The data set comprises 210 observations: 70 randomly selected kernels of each variety.

We can use this data set to demonstrate how well k-Nearest Neighbors performs in classifying the different wheat varieties.

The data set is available at: http://archive.ics.uci.edu/ml/datasets/seeds

Reference: M. Charytanowicz et al., 'A Complete Gradient Clustering Algorithm for Features Analysis of X-ray Images', in: Information Technologies in Biomedicine, Ewa Pietka, Jacek Kawa (eds.), Springer-Verlag, Berlin-Heidelberg, 2010, pp. 15-24.
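As a rough sketch, a downloaded copy of the data could be read into R as follows. This assumes the file is whitespace-delimited with no header row and lists the seven measurements followed by the variety code. The function name `read_seeds`, the local file name, and the column names are assumptions, chosen so that `kernel_length`, `kernel_width`, and `type` match the code used later in this deck.

```r
# Sketch: read a local copy of the seeds data into a data frame.
# Column names and column order are assumptions, not taken from the app.
read_seeds <- function(path) {
  cols <- c("area", "perimeter", "compactness", "kernel_length",
            "kernel_width", "asymmetry", "groove_length", "type")
  # sep = "" splits on any run of whitespace (spaces or tabs)
  seeds <- read.table(path, header = FALSE, sep = "", col.names = cols)
  seeds$type <- factor(seeds$type)  # varieties coded 1, 2, 3
  seeds
}

# seeds <- read_seeds("seeds_dataset.txt")  # hypothetical local file name
# table(seeds$type)                         # count of kernels per variety
```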

Applying k-Nearest Neighbors

As an example, the data set has been divided into 60% training and 40% testing. We then apply k-Nearest Neighbors using just two of the features, with k = 1 (the default setting for the app).

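library(class)  # knn()
library(caret)  # confusionMatrix()
features <- c("kernel_length", "kernel_width")
# Classify each test kernel by its single nearest training kernel
fit <- knn(training[,features], testing[,features], training$type, k=1)
# confusionMatrix() expects (predictions, reference); with this argument order
# the rows labeled "Prediction" hold the true types. Accuracy is unaffected.
confusionMatrix(testing$type, fit)$table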
##           Reference
## Prediction  1  2  3
##          1 20  2  4
##          2  5 25  0
##          3  6  0 22

The diagonal of the table holds 67 of the 84 test kernels, so this combination of parameters yields an out-of-sample accuracy of about 0.80. The user can change the parameters to see if the accuracy improves.
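Outside the app, the same experiment can be sketched in a few lines: split the data 60/40, then compute the out-of-sample accuracy for several values of k. The helper names `split_data` and `accuracy_by_k`, the seed, and the particular values of k are hypothetical. The simple random split shown here is an assumption and may differ from how the app divides the data.

```r
library(class)  # provides knn()

# Random split: fraction p of the rows goes to training (60% by default)
split_data <- function(df, p = 0.6, seed = 1) {
  set.seed(seed)  # hypothetical seed, only for reproducibility
  idx <- sample(nrow(df), size = round(p * nrow(df)))
  list(training = df[idx, , drop = FALSE],
       testing  = df[-idx, , drop = FALSE])
}

# Out-of-sample accuracy for each k in ks, using the chosen features
accuracy_by_k <- function(training, testing, features, ks = c(1, 3, 5, 7)) {
  sapply(ks, function(k) {
    pred <- knn(training[, features, drop = FALSE],
                testing[, features, drop = FALSE],
                cl = training$type, k = k)
    # Share of test rows whose predicted type matches the true type
    mean(as.character(pred) == as.character(testing$type))
  })
}

# Usage, assuming `seeds` is a data frame with a `type` column:
# parts <- split_data(seeds)
# accuracy_by_k(parts$training, parts$testing,
#               c("kernel_length", "kernel_width"))
```

Calling `accuracy_by_k` with different feature vectors gives a quick way to compare feature combinations as well as values of k.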

Results

The user sees the results in the form of a graph, where each wheat kernel has been classified by color, with an option to show the misclassified kernels with an 'x', as shown below.

[Figure: wheat kernels colored by classified variety, with misclassified kernels marked 'x']
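A plot along the lines described above could be drawn with base R graphics as sketched here. The function name `plot_classification` and its arguments are hypothetical, and the app's actual plotting code may differ. The sketch assumes `testing` has a `type` column and that `pred` holds the predicted types for the same rows.

```r
# Sketch: scatter plot colored by predicted variety, errors marked with "x".
plot_classification <- function(testing, pred, features, show_errors = TRUE) {
  wrong <- as.character(pred) != as.character(testing$type)
  # Circle (pch 1) for every point; cross (pch 4) for errors when requested
  symbols <- ifelse(wrong & show_errors, 4, 1)
  classes <- factor(pred)
  plot(testing[[features[1]]], testing[[features[2]]],
       col = as.integer(classes), pch = symbols,
       xlab = features[1], ylab = features[2])
  legend("topleft", legend = levels(classes),
         col = seq_along(levels(classes)), pch = 1, title = "Classified as")
  invisible(sum(wrong))  # number of misclassified kernels
}

# Usage, assuming `testing`, `fit`, and `features` exist as in the example above:
# plot_classification(testing, fit, features)
```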