We develop a data pipeline that processes images for classification. Our motivation is to identify objects in scenarios such as televised sports and security footage.
Work on image classification usually starts with very small images and scales up from there.
The CIFAR-10 dataset is a well-known dataset consisting of color images in ten classes, including automobiles, deer, and horses. The images receive very little processing and are scaled down to 32x32 pixels. There are 60,000 images, divided evenly among the 10 classes. The order within the training set is already random, though resampling is always good practice.
The CIFAR-100 dataset is a sister dataset to the original CIFAR-10. It has largely the same specifications, except that each image carries two labels: a coarse (superclass) label and a fine label.
For this project, we use the binary version of the files, so the following path helper is provided.
Cifar100Path <- function() {
  "data/cifar-100-binary/cifar-100-binary"
}
According to the CIFAR-10 binary specification, each image in a batch file is stored as one label byte followed by 3072 pixel bytes. We can confirm that each batch contains ten thousand images by checking the file sizes. Two implementations of this check are shown below.
CountImages(CifarTrain(), 1 + 3072)
## [1] 10000 10000 10000 10000 10000
CountImagesByPixel(CifarTrain(), 1024, 3, 1)
## [1] 10000 10000 10000 10000 10000
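Neither definition appears above; as a minimal sketch consistent with these calls (argument names are assumptions), both checks could simply divide each file's size by the record length in bytes:
# Hypothetical sketches: a CIFAR-10 record is 1 label byte plus
# 3072 pixel bytes, so file size / 3073 yields the image count.
CountImages <- function(files, recordSize) {
  file.size(files) / recordSize
}
CountImagesByPixel <- function(files, pixels, channels, labelBytes) {
  file.size(files) / (pixels * channels + labelBytes)
}
Five values come back because CifarTrain() presumably returns the paths of the five training batch files.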
Since the images are concatenated back to back, simple arithmetic determines the byte offset of each one: image i starts at byte (i - 1) x 3073. This lets us position the file pointer before reading each image into a matrix.
head(FindPos(file.size(CifarTrain()[1]), 3073))
## [1] 0 3073 6146 9219 12292 15365
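FindPos itself is defined elsewhere in the project; a minimal sketch of the byte-offset arithmetic for the file-size case might look like the following (the real function evidently also accepts a plain image count, as used below):
# Hypothetical sketch: records of recordSize bytes laid end to end,
# so each offset is a multiple of 3073 starting at byte 0.
FindPos <- function(totalBytes, recordSize) {
  seq.int(from = 0L, to = totalBytes - recordSize, by = recordSize)
}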
We can verify that this works as intended by first defining our scanner and then the interop. Specifically, we have to read a secondary metadata file to get the names of our labels. Conveniently, the row index in that file corresponds to the class id; otherwise we would need a left join to accomplish the same mapping. After reading the binary data, we have a vector of all M label ids in CIFAR-10, which we wrap in a data frame. Later, we will vectorize this to read all of the files instead of only one.
imageFile <- file(CifarTrain()[1], "rb")
cifarLabels = sapply(FindPos(10000, 3073), Label, imageFile) %>%
  data.frame(labels = .) %>%
  NumbersToLabels(., CifarLabelnames())
invisible(close(imageFile))
head(cifarLabels[[1]])
## [1] "frog" "truck" "truck" "deer" "automobile"
## [6] "automobile"
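Neither Label nor NumbersToLabels is defined above; a minimal sketch, assuming Label reads a single unsigned byte at the record offset and that CifarLabelnames() returns the class names in id order, could be:
# Hypothetical sketch: the first byte of each record is the class id.
Label <- function(pos, con) {
  seek(con, where = pos, origin = "start")
  readBin(con, what = "integer", size = 1L, n = 1L, signed = FALSE)
}
# Hypothetical sketch: CIFAR ids are 0-based, R indexing is 1-based.
NumbersToLabels <- function(df, labelnames = CifarLabelnames()) {
  data.frame(labels = labelnames[df$labels + 1L])
}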
Before we vectorize that function, let’s also take a look at the image data. As discussed, each image is 3072 bytes: 1024 red bytes, then 1024 green bytes, then 1024 blue bytes. Our parser turns each image into a 1024x3 matrix, one column per channel; we will stack these into a three-dimensional array shortly.
imageFile <- file(CifarTrain()[1], "rb")
cifarData = lapply(FindPos(10000, 3073), Image, imageFile, 1024L, 3L, 1L)
invisible(close(imageFile))
str(cifarData[1:6])
## List of 6
## $ : int [1:1024, 1:3] 59 43 50 68 98 119 139 145 149 149 ...
## $ : int [1:1024, 1:3] 154 126 105 102 125 155 172 180 142 111 ...
## $ : int [1:1024, 1:3] 255 253 253 253 253 253 253 253 253 253 ...
## $ : int [1:1024, 1:3] 28 37 38 42 44 40 40 24 32 43 ...
## $ : int [1:1024, 1:3] 170 168 177 183 181 177 181 184 189 189 ...
## $ : int [1:1024, 1:3] 159 150 153 154 138 184 154 77 61 64 ...
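Image is also defined elsewhere; a minimal sketch consistent with the call above (argument names are assumptions) skips the label byte and reads the 3072 pixel bytes into one column per channel:
# Hypothetical sketch: matrix() fills column-wise, so the first 1024
# bytes land in column 1 (red), then green, then blue.
Image <- function(pos, con, pixels, channels, labelBytes) {
  seek(con, where = pos + labelBytes, origin = "start")
  bytes <- readBin(con, what = "integer", size = 1L,
                   n = pixels * channels, signed = FALSE)
  matrix(bytes, nrow = pixels, ncol = channels)
}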
Stacked together, the data forms an array with dimensions 1024x3xM, representing pixels, channel, and image count: the first dimension (imagePixels) holds each pixel's 0 to 255 value, ordered left to right and top to bottom; the second (imageChannels) has three columns corresponding to the RGB channels; the third (imageCount) separates each image into its own layer.
UnlistDims(cifarData) %>%
  str()
## int [1:1024, 1:3, 1:10000] 59 43 50 68 98 119 139 145 149 149 ...
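UnlistDims can be sketched in one line, assuming every matrix in the list shares the same dimensions (the project's actual implementation may differ):
# Hypothetical sketch: stack a list of equally sized matrices into a
# three-dimensional array, one matrix per layer.
UnlistDims <- function(lst) {
  array(unlist(lst), dim = c(dim(lst[[1]]), length(lst)))
}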
To fully appreciate this data structure, suppose we wanted to filter out green. The array makes this extremely easy because the entire green channel lives in m[, 2, ].
FilterGreen <- function(m) {
  m[, 2, ] = 0
  m
}
UnlistDims(cifarData) %>%
  FilterGreen(.) %>%
  .[,,1] %>%
  head()
## [,1] [,2] [,3]
## [1,] 59 0 63
## [2,] 43 0 45
## [3,] 50 0 43
## [4,] 68 0 42
## [5,] 98 0 52
## [6,] 119 0 63
Just in case you are skeptical, we can quickly visualize the color distribution. The graph shows the three color channels, red, green, and blue, before and after the filter.
CoerceLabel <- function(m, groupname) {
  mapply(1:3, c("red", "green", "blue"), FUN = function(x, y) {
    values = list(as.vector(m[, x, ]))
    names(values) = y
    values
  }) %>%
    do.call(cbind, .) %>%
    data.frame(.) %>%
    dplyr::mutate(group = groupname)
}
set.seed(8675309) %>%
  { rbind(
    UnlistDims(cifarData) %>%
      CoerceLabel("Pre Green Filter"),
    UnlistDims(cifarData) %>%
      FilterGreen(.) %>%
      CoerceLabel("Post Green Filter")
  ) } %>%
  reshape2::melt(id = "group") %>%
  .[sample.int(nrow(.), 1000), ] %>%
  ggplot2::ggplot(.) +
  ggplot2::aes(x = variable, y = value, fill = group) +
  ggplot2::geom_boxplot(shape = "circle") +
  ggplot2::scale_fill_hue(direction = 1) +
  ggplot2::theme_minimal()
Since most models take two-dimensional data such as data frames, we can reduce the array from three dimensions back to two. This restores the original dimensions present in the binary file, and because this step is separate, all of our earlier transforms are preserved.
modelReadyCifar = UnlistDims(cifarData) %>%
  FilterGreen(.) %>%
  ThreeToTwoDims(.)
str(modelReadyCifar)
## num [1:10000, 1:3072] 59 154 255 28 170 159 164 28 134 125 ...
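ThreeToTwoDims could be as small as the following sketch: flattening each 1024x3 slice column-major reproduces the red-green-blue byte order of the binary file, and byrow = TRUE keeps one image per row (a sketch, not necessarily the project's implementation):
# Hypothetical sketch: one row per image, 3072 columns per row.
ThreeToTwoDims <- function(a) {
  matrix(a, nrow = dim(a)[3], ncol = prod(dim(a)[1:2]), byrow = TRUE)
}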
Finally, let’s vectorize the whole process now that we know everything works as intended. We vectorize the reading of files first; afterwards, we double-check that our interop functions are distinct and tolerate unexpected arguments. With everything coerced into a common template, we are done.
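As a rough sketch of what the vectorized reader might look like (the project's actual definitions may differ), ReadCifarLabels could simply map the earlier single-file logic over every batch file; ReadCifarData would be analogous, using Image in place of Label:
# Hypothetical sketch: one connection per batch file, closed on exit.
ReadCifarLabels <- function(files) {
  lapply(files, function(f) {
    con <- file(f, "rb")
    on.exit(close(con))
    sapply(FindPos(10000, 3073), Label, con)
  })
}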
finalLabels = ReadCifarLabels(CifarTest()) %>%
  unlist() %>%
  data.frame(labels = .) %>%
  NumbersToLabels()
finalResults = ReadCifarData(CifarTest(), simplify = FALSE) %>%
  FilterGreen(.) %>%
  ThreeToTwoDims(.) %>%
  data.frame(.) %>%
  cbind(finalLabels, .)
str(finalResults, list.len = 6)
## 'data.frame': 10000 obs. of 3073 variables:
## $ labels: chr "cat" "ship" "ship" "airplane" ...
## $ X1 : num 158 235 158 155 65 179 160 83 23 217 ...
## $ X2 : num 159 231 158 167 70 139 185 82 19 210 ...
## $ X3 : num 165 232 139 176 48 77 209 81 21 205 ...
## $ X4 : num 166 232 132 190 30 88 217 77 65 199 ...
## $ X5 : num 160 232 166 177 23 141 230 81 164 218 ...
## [list output truncated]
finalResults[1:6, 1:6]
## labels X1 X2 X3 X4 X5
## 1 cat 158 159 165 166 160
## 2 ship 235 231 232 232 232
## 3 ship 158 158 139 132 166
## 4 airplane 155 167 176 190 177
## 5 frog 65 70 48 30 23
## 6 frog 179 139 77 88 141
From the same resource, we can also get the CIFAR-100 dataset, with 100 classes and two sets of labels per image. These are also 32x32 images, so in theory we should get the same results with only minor changes. One thing of note: we have to write an additional labeling function to keep the process vectorized. Afterwards, it works exactly as expected.
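Per the CIFAR-100 binary format, each record carries a coarse label byte and a fine label byte ahead of the 3072 pixel bytes, so records are 3074 bytes long. A rough sketch of ReadCifarMultiLabels (argument names match the call below; the body is an assumption) just reads both label bytes per record:
# Hypothetical sketch: read the two leading label bytes of each
# 3074-byte record; mapping ids to names is left to NumbersToLabels.
ReadCifarMultiLabels <- function(files, imageLabels = 2L) {
  recordSize <- imageLabels + 3072L
  lapply(files, function(f) {
    con <- file(f, "rb")
    on.exit(close(con))
    sapply(FindPos(file.size(f), recordSize), function(pos) {
      seek(con, where = pos, origin = "start")
      readBin(con, what = "integer", size = 1L,
              n = imageLabels, signed = FALSE)
    })
  })
}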
proofFinalLabels = ReadCifarMultiLabels(CifarTest(filepath = Cifar100Path(), pattern = "test.bin"), imageLabels = 2L) %>%
  unlist() %>%
  data.frame(labels = .) %>%
  NumbersToLabels(., CifarLabelnames(filepath = Cifar100Path(), pattern = ".*label_names.txt"))
proofFinalResults = ReadCifarData(CifarTest(filepath = Cifar100Path(), pattern = "test.bin"), simplify = FALSE) %>%
  FilterGreen(.) %>%
  ThreeToTwoDims(.) %>%
  data.frame(.) %>%
  cbind(proofFinalLabels, .)
str(proofFinalResults, list.len = 6)
## 'data.frame': 10000 obs. of 3074 variables:
## $ labels.1: chr "large_natural_outdoor_scenes" "large_natural_outdoor_scenes" "aquatic_mammals" "fruit_and_vegetables" ...
## $ labels.2: chr "mountain" "forest" "seal" "mushroom" ...
## $ X1 : num 49 10 67 115 173 35 81 67 105 11 ...
## $ X2 : num 199 33 0 89 191 34 44 70 124 11 ...
## $ X3 : num 196 113 72 4 223 32 50 82 96 12 ...
## $ X4 : num 195 88 61 51 10 29 38 85 53 11 ...
## [list output truncated]
proofFinalResults[1:6, 1:6]
## labels.1 labels.2 X1 X2 X3 X4
## 1 large_natural_outdoor_scenes mountain 49 199 196 195
## 2 large_natural_outdoor_scenes forest 10 33 113 88
## 3 aquatic_mammals seal 67 0 72 61
## 4 fruit_and_vegetables mushroom 115 89 4 51
## 5 large_natural_outdoor_scenes sea 173 191 223 10
## 6 flowers tulip 35 34 32 29
Both CIFAR-10 and CIFAR-100 are credited to Alex Krizhevsky (2009). The tech report describing the datasets and methodology is Learning Multiple Layers of Features from Tiny Images: https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf. The datasets themselves are hosted at https://www.cs.toronto.edu/~kriz/cifar.html.