Introduction

Image recognition is an evolving technology that relies heavily on machine learning. When working with different machine learning techniques, we have often used datasets containing primarily numerical or character data. However, machine learning is a broad field with applications to data that are not in the form of text. To explore image recognition, I set out to develop a model that classifies different types of fingerprints, potentially as a step toward improving current fingerprint recognition technology.

Data Explanation

Variables:
Fingerprint Class Type: categorical (“W”, “R”, “L”, “T”, “A”)
PNG Image Type: continuous (as a matrix of numeric values)

Data Origin and Collection

http://www.nist.gov/srd/nistsd4.cfm

The data for this model were obtained from a NIST database containing 2000 8-bit grayscale fingerprint image pairs, each image of which is 512 x 512 pixels. The fingerprints are classified as Arch (A), Left Loop (L), Right Loop (R), Tented Arch (T), and Whorl (W) patterns.

Data Preparation

The images are .png files; each is paired with a .txt file of matching ID that contains the gender of the person to whom the fingerprint belongs and the class of the fingerprint. The images were read in as a list of matrices, and the class of each corresponding fingerprint was extracted from the .txt files and stored in a vector. Then, the data were divided into training and testing sets by image ID in preparation for modeling.

library(png)
library(parallel)
library(stringr)
library(e1071)
library(caret)

# list.dirs() returns the top-level folder first, so dirs[2] is its first subfolder
dirs <- list.dirs(path = "./NISTSpecialDatabase4GrayScaleImagesofFIGS/sd04/png_txt/")
count <- 2
txt_files <- list.files(path = dirs[count], pattern = "\\.txt$", full.names = TRUE)
png_files <- list.files(path = dirs[count], pattern = "\\.png$", full.names = TRUE)

# Read every image as a numeric matrix, using two cores
imgs <- mclapply(png_files, readPNG, mc.cores = 2)

# Split each .txt file into whitespace-separated tokens; row 4 holds the class label
txt_tbl <- sapply(txt_files, function(x) scan(x, what = character()))
fp_class <- txt_tbl[4, ]
filenames <- names(fp_class)

# Extract the numeric ID that sits between the leading "f" and the first underscore
filenames <- sapply(filenames, function(x) str_extract(x, "[f][0-9].*"))
img_ids <- sapply(filenames, function(file_name) as.numeric(unlist(strsplit(unlist(strsplit(file_name, "_"))[1], "f"))[2]))

# Split by ID; both sets reuse the same logical masks so the dimensions always agree
train_test_border <- 500
is_train <- img_ids < train_test_border
is_test <- img_ids >= train_test_border
n_pixels <- length(imgs[[1]])

# Flatten each image into one row of the input matrix
train_in <- t(array(unlist(imgs[is_train]), dim = c(n_pixels, sum(is_train))))
train_out <- fp_class[is_train]
test_in <- t(array(unlist(imgs[is_test]), dim = c(n_pixels, sum(is_test))))
test_out <- fp_class[is_test]
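The reshaping step that builds train_in and test_in can be hard to follow on full-size images, so here is a minimal sketch of the same idea on toy data. The three 4 x 4 matrices are synthetic stand-ins for the output of readPNG, and the names toy_imgs and design are hypothetical.

```r
# Synthetic stand-in for the list returned by readPNG: three 4 x 4 "images"
set.seed(1)
toy_imgs <- lapply(1:3, function(i) matrix(runif(16), nrow = 4, ncol = 4))

n_pixels <- length(toy_imgs[[1]])

# unlist() concatenates the images column by column, array() stacks them
# as columns, and t() turns each image into one row of the design matrix
design <- t(array(unlist(toy_imgs), dim = c(n_pixels, length(toy_imgs))))

dim(design)

# Row i holds the pixels of image i in column-major order
all.equal(design[2, ], as.vector(toy_imgs[[2]]))
```

Each row of the result is one image and each column is one pixel position, which is the layout svm() expects for its input matrix.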

Exploratory Analysis

Because the Data Preparation stage proved difficult, there was not sufficient time to perform much exploratory analysis. When experimenting with different models, I realized how computationally expensive it would be to work with thousands of 512 x 512 images as matrices. To avoid complications and to save time, I discarded gender as a variable even though it is provided in the text files. This decision likely helped the accuracy of the model, since memory restrictions forced me to reduce the sample size when modeling, and a smaller sample supports fewer variables.
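A rough back-of-the-envelope estimate shows why memory becomes the bottleneck. The sketch below assumes each pixel is stored as an 8-byte double, which is how readPNG returns pixel values by default; the helper image_memory_gib is hypothetical and ignores the extra copies R makes while reshaping.

```r
# Estimate the memory needed to hold n images as double-precision matrices
image_memory_gib <- function(n_images, side = 512, bytes_per_value = 8) {
  n_images * side * side * bytes_per_value / 1024^3
}

image_memory_gib(1)     # one image: 2 MiB, i.e. 1/512 GiB
image_memory_gib(600)   # the reduced sample used for modeling: 1.171875 GiB
image_memory_gib(4000)  # all 2000 image pairs: 7.8125 GiB
```

These figures cover only the raw pixel values, so the working memory needed during modeling would be higher.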

Modeling

As this project is centered on image recognition and classification, I reasoned that random forest, SVM, or neural net models would be most appropriate. Random forests and neural nets would be effective if my machine were capable of working with all 2000 samples in a timely manner, but since I had to lower my sample size to 600, SVM was the most appropriate choice. A linear SVM model was developed on a training set of 500 samples and tested on 100 samples.

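# Fit a linear SVM; svm() expects factor (or integer) labels for classification
svm_model <- svm(train_in, factor(train_out), type = 'C-classification', kernel = 'linear')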
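As a lighter-weight alternative to a separate cross-validation run, svm() in e1071 accepts a cross argument that performs k-fold cross-validation during fitting. The sketch below is not the model from this report: it uses a small synthetic dataset (60 samples, 10 features, three made-up classes) purely to show the call and the fields it returns.

```r
library(e1071)

# Synthetic stand-in data: 60 samples, 10 features, 3 made-up classes
set.seed(42)
x <- matrix(rnorm(60 * 10), nrow = 60)
y <- factor(rep(c("A", "L", "W"), each = 20))

# Shift the first feature by class so there is some signal to learn
x[, 1] <- x[, 1] + as.integer(y)

# cross = 5 asks svm() for 5-fold cross-validation on the training data
cv_model <- svm(x, y, type = "C-classification", kernel = "linear", cross = 5)

cv_model$accuracies    # per-fold accuracy, in percent
cv_model$tot.accuracy  # overall cross-validated accuracy, in percent
```

Whether this would finish in reasonable time on 500 full-size fingerprint images is untested here; each fold still fits an SVM on 262,144 features.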

Model Evaluation and Results

When the model was tested on 100 samples, its accuracy was approximately 27.72%. This is quite low, but for a relatively small training set and no feature engineering, it is not terrible. Randomly guessing a class for a given fingerprint would theoretically yield 20% accuracy if the five classes are equally represented, so the model is at least better than random guessing. If I had more time, I would draw more random samples and perform statistical significance tests to assess this model’s validity. I attempted to use cross-validation via the train function in the caret library, but I terminated the process after it had run for an hour.
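One simple form such a significance test could take is an exact binomial test against the 20% chance level. The counts below are an assumption: the report does not state them, but 28 correct out of 101 reproduces the printed accuracy of 0.2772277. The test also assumes that the test samples are independent and that 20% is the right chance baseline.

```r
# Assumed counts: 28 correct out of 101 reproduces the reported 0.2772277
n_correct <- 28
n_test <- 101
n_correct / n_test

# One-sided exact test of H0: accuracy = 0.20 against H1: accuracy > 0.20
chance_test <- binom.test(n_correct, n_test, p = 0.20, alternative = "greater")

chance_test$p.value   # probability of doing at least this well by guessing
chance_test$conf.int  # one-sided confidence interval for the true accuracy
```

A small p-value would suggest the model does better than guessing on this test set, though it would say nothing about how well the model generalizes to other samples.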

# Evaluate the SVM on the training set (left disabled)
#p1 <- predict(svm_model, train_in)
#print(p1)
#print(table(p1, train_out))
# Evaluate the SVM on the testing set
p2 <- predict(svm_model, test_in)
print(p2)
print(table(p2, test_out))
# Print accuracy: correct predictions (the table's diagonal) divided by all predictions
print(sum(diag(table(p2,test_out)))/sum(table(p2,test_out)))

0.2772277
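The diagonal-based accuracy above is only correct when the confusion table is square with rows and columns in the same class order, which can fail if a class never appears among the test labels. The following sketch, on hypothetical labels and predictions, shows two ways to make the calculation robust: comparing the vectors directly, and forcing both factors onto the same levels.

```r
# Hypothetical labels and predictions; note that "A" is never predicted
classes <- c("A", "L", "R", "T", "W")
truth <- c("W", "R", "L", "T", "A", "W", "L", "R")
pred  <- c("W", "L", "L", "W", "W", "W", "R", "R")

# Share of positions where the prediction and the label agree
accuracy <- mean(pred == truth)
accuracy

# Forcing both factors onto the same levels keeps the table square
confusion <- table(predicted = factor(pred, levels = classes),
                   actual = factor(truth, levels = classes))
confusion
sum(diag(confusion)) / sum(confusion)
```

Both calculations give the same value here, and the table keeps an empty row for the class that was never predicted.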

Conclusion

In its current state, this model is next to useless. Although it classifies fingerprints slightly better than random guessing, it is likely not based on a large enough training set. I’m unable to draw any significant conclusions other than that developing a model for image classification is a very computationally expensive task. With more processing power and time, I may be able to develop a more accurate model whose performance can be supported by statistical significance tests; if so, this model could become an effective way to identify fingerprint patterns.

NOTE: Due to currently inexplicable errors caused by markdown, this version of the document will not execute the code.