We develop a data pipeline that processes images for classification. Our motivation is to identify objects in scenarios such as televised sports and security footage.
Work on image classification usually starts with very small images and scales up from there.
The CIFAR-10 dataset is a well-known dataset consisting of color images in ten classes, including automobiles, deer, and horses. The images receive very little processing and are scaled down to 32x32 pixels. There are 60,000 images, divided evenly among the 10 classes. The order within the training set is already random, though resampling is always good practice.
The CIFAR-100 dataset is a sister dataset to the original CIFAR-10. It has largely the same specifications, except that each image carries two labels: a coarse (superclass) label and a fine label.
For this project, we use the binary version of the files, so the following path helper is provided.
Cifar100Path <- function() {
  "data/cifar-100-binary/cifar-100-binary"
}
According to the CIFAR-10 binary specification, each image in a batch file is stored as one label byte followed by 3072 pixel bytes. We can confirm that each batch contains ten thousand images by checking the file sizes. Two implementations of this check are shown below.
CountImages(CifarTrain(), 1 + 3072)
## [1] 10000 10000 10000 10000 10000
CountImagesByPixel(CifarTrain(), 1024, 3, 1)
## [1] 10000 10000 10000 10000 10000
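Neither definition appears above; as a minimal sketch consistent with these calls (argument names are assumptions), both checks could simply divide each file's size by the record length in bytes:
# Hypothetical sketches: a CIFAR-10 record is 1 label byte plus
# 3072 pixel bytes, so file size / 3073 yields the image count.
CountImages <- function(files, recordSize) {
  file.size(files) / recordSize
}
CountImagesByPixel <- function(files, pixels, channels, labelBytes) {
  file.size(files) / (pixels * channels + labelBytes)
}
Five values come back because CifarTrain() presumably returns the paths of the five training batch files.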
Since the images are concatenated back to back, simple arithmetic determines the byte offset of each one: image i starts at byte (i - 1) x 3073. This lets us position the file pointer before reading each image into a matrix.
head(FindPos(file.size(CifarTrain()[1]), 3073))
## [1] 0 3073 6146 9219 12292 15365
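FindPos itself is defined elsewhere in the project; a minimal sketch of the byte-offset arithmetic for the file-size case might look like the following (the real function evidently also accepts a plain image count, as used below):
# Hypothetical sketch: records of recordSize bytes laid end to end,
# so each offset is a multiple of 3073 starting at byte 0.
FindPos <- function(totalBytes, recordSize) {
  seq.int(from = 0L, to = totalBytes - recordSize, by = recordSize)
}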
We can verify that this works as intended by first defining our scanner and then the interop. Specifically, we have to read a secondary metadata file to get the names of our labels. Conveniently, the row index in that file corresponds to the class id; otherwise we would need a left join to accomplish the same mapping. After reading the binary data, we have a vector of all M label ids in CIFAR-10, which we wrap in a data frame. Later, we will vectorize this to read all of the files instead of only one.
imageFile <- file(CifarTrain()[1], "rb")
cifarLabels = sapply(FindPos(10000, 3073), Label, imageFile) %>%
  data.frame(labels = .) %>%
  NumbersToLabels(., CifarLabelnames())
invisible(close(imageFile))
head(cifarLabels[[1]])
## [1] "frog" "truck" "truck" "deer" "automobile"
## [6] "automobile"
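Neither Label nor NumbersToLabels is defined above; a minimal sketch, assuming Label reads a single unsigned byte at the record offset and that CifarLabelnames() returns the class names in id order, could be:
# Hypothetical sketch: the first byte of each record is the class id.
Label <- function(pos, con) {
  seek(con, where = pos, origin = "start")
  readBin(con, what = "integer", size = 1L, n = 1L, signed = FALSE)
}
# Hypothetical sketch: CIFAR ids are 0-based, R indexing is 1-based.
NumbersToLabels <- function(df, labelnames = CifarLabelnames()) {
  data.frame(labels = labelnames[df$labels + 1L])
}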
Before we vectorize that function, let’s also take a look at the image data. As discussed, each image is 3072 bytes: 1024 red bytes, then 1024 green bytes, then 1024 blue bytes. Our parser turns each image into a 1024x3 matrix, one column per channel; we will stack these into a three-dimensional array shortly.
imageFile <- file(CifarTrain()[1], "rb")
cifarData = lapply(FindPos(10000, 3073), Image, imageFile, 1024L, 3L, 1L)
invisible(close(imageFile))
str(cifarData[1:6])
## List of 6
## $ : int [1:1024, 1:3] 59 43 50 68 98 119 139 145 149 149 ...
## $ : int [1:1024, 1:3] 154 126 105 102 125 155 172 180 142 111 ...
## $ : int [1:1024, 1:3] 255 253 253 253 253 253 253 253 253 253 ...
## $ : int [1:1024, 1:3] 28 37 38 42 44 40 40 24 32 43 ...
## $ : int [1:1024, 1:3] 170 168 177 183 181 177 181 184 189 189 ...
## $ : int [1:1024, 1:3] 159 150 153 154 138 184 154 77 61 64 ...
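Image is also defined elsewhere; a minimal sketch consistent with the call above (argument names are assumptions) skips the label byte and reads the 3072 pixel bytes into one column per channel:
# Hypothetical sketch: matrix() fills column-wise, so the first 1024
# bytes land in column 1 (red), then green, then blue.
Image <- function(pos, con, pixels, channels, labelBytes) {
  seek(con, where = pos + labelBytes, origin = "start")
  bytes <- readBin(con, what = "integer", size = 1L,
                   n = pixels * channels, signed = FALSE)
  matrix(bytes, nrow = pixels, ncol = channels)
}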
Stacked together, the data forms an array with dimensions 1024x3xM, representing pixels, channel, and image count: the first dimension (imagePixels) holds each pixel's 0 to 255 value, ordered left to right and top to bottom; the second (imageChannels) has three columns corresponding to the RGB channels; the third (imageCount) separates each image into its own layer.
UnlistDims(cifarData) %>%
  str()
## int [1:1024, 1:3, 1:10000] 59 43 50 68 98 119 139 145 149 149 ...
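UnlistDims can be sketched in one line, assuming every matrix in the list shares the same dimensions (the project's actual implementation may differ):
# Hypothetical sketch: stack a list of equally sized matrices into a
# three-dimensional array, one matrix per layer.
UnlistDims <- function(lst) {
  array(unlist(lst), dim = c(dim(lst[[1]]), length(lst)))
}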
To fully appreciate this data structure, suppose we wanted to filter out green. The array makes this extremely easy because the entire green channel lives in m[, 2, ].
FilterGreen <- function(m) {
  m[, 2, ] = 0
  m
}
UnlistDims(cifarData) %>%
  FilterGreen(.) %>%
  .[,,1] %>%
  head()
## [,1] [,2] [,3]
## [1,] 59 0 63
## [2,] 43 0 45
## [3,] 50 0 43
## [4,] 68 0 42
## [5,] 98 0 52
## [6,] 119 0 63
Just in case you are skeptical, we can quickly visualize the color distribution. The graph shows the three color channels, red, green, and blue, before and after the filter.
CoerceLabel <- function(m, groupname) {
  mapply(1:3, c("red", "green", "blue"), FUN = function(x, y) {
    values = list(as.vector(m[, x, ]))
    names(values) = y
    values
  }) %>%
    do.call(cbind, .) %>%
    data.frame(.) %>%
    dplyr::mutate(group = groupname)
}
set.seed(8675309) %>%
  { rbind(
    UnlistDims(cifarData) %>%
      CoerceLabel("Pre Green Filter"),
    UnlistDims(cifarData) %>%
      FilterGreen(.) %>%
      CoerceLabel("Post Green Filter")
  ) } %>%
  reshape2::melt(id = "group") %>%
  .[sample.int(nrow(.), 1000), ] %>%
  ggplot2::ggplot(.) +
  ggplot2::aes(x = variable, y = value, fill = group) +
  ggplot2::geom_boxplot(shape = "circle") +
  ggplot2::scale_fill_hue(direction = 1) +
  ggplot2::theme_minimal()
Since most models take two-dimensional data such as data frames, we can reduce the array from three dimensions back to two. This restores the original dimensions present in the binary file, and because this step is separate, all of our earlier transforms are preserved.
modelReadyCifar = UnlistDims(cifarData) %>%
  FilterGreen(.) %>%
  ThreeToTwoDims(.)
str(modelReadyCifar)
## num [1:10000, 1:3072] 59 154 255 28 170 159 164 28 134 125 ...
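ThreeToTwoDims could be as small as the following sketch: flattening each 1024x3 slice column-major reproduces the red-green-blue byte order of the binary file, and byrow = TRUE keeps one image per row (a sketch, not necessarily the project's implementation):
# Hypothetical sketch: one row per image, 3072 columns per row.
ThreeToTwoDims <- function(a) {
  matrix(a, nrow = dim(a)[3], ncol = prod(dim(a)[1:2]), byrow = TRUE)
}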
Finally, let’s vectorize the whole process now that we know everything works as intended. We vectorize the reading of files first; afterwards, we double-check that our interop functions are distinct and tolerate unexpected arguments. With everything coerced into a common template, we are done.
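As a rough sketch of what the vectorized reader might look like (the project's actual definitions may differ), ReadCifarLabels could simply map the earlier single-file logic over every batch file; ReadCifarData would be analogous, using Image in place of Label:
# Hypothetical sketch: one connection per batch file, closed on exit.
ReadCifarLabels <- function(files) {
  lapply(files, function(f) {
    con <- file(f, "rb")
    on.exit(close(con))
    sapply(FindPos(10000, 3073), Label, con)
  })
}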
finalLabels = ReadCifarLabels(CifarTest()) %>%
  unlist() %>%
  data.frame(labels = .) %>%
  NumbersToLabels()
finalResults = ReadCifarData(CifarTest(), simplify = FALSE) %>%
  FilterGreen(.) %>%
  ThreeToTwoDims(.) %>%
  data.frame(.) %>%
  cbind(finalLabels, .)
str(finalResults, list.len = 6)
## 'data.frame': 10000 obs. of 3073 variables:
## $ labels: chr "cat" "ship" "ship" "airplane" ...
## $ X1 : num 158 235 158 155 65 179 160 83 23 217 ...
## $ X2 : num 159 231 158 167 70 139 185 82 19 210 ...
## $ X3 : num 165 232 139 176 48 77 209 81 21 205 ...
## $ X4 : num 166 232 132 190 30 88 217 77 65 199 ...
## $ X5 : num 160 232 166 177 23 141 230 81 164 218 ...
## [list output truncated]
finalResults[1:6, 1:6]
## labels X1 X2 X3 X4 X5
## 1 cat 158 159 165 166 160
## 2 ship 235 231 232 232 232
## 3 ship 158 158 139 132 166
## 4 airplane 155 167 176 190 177
## 5 frog 65 70 48 30 23
## 6 frog 179 139 77 88 141
From the same resource, we can also get the CIFAR-100 dataset, with 100 classes and two sets of labels per image. These are also 32x32 images, so in theory we should get the same results with only minor changes. One thing of note: we have to write an additional labeling function to keep the process vectorized. Afterwards, it works exactly as expected.
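Per the CIFAR-100 binary format, each record carries a coarse label byte and a fine label byte ahead of the 3072 pixel bytes, so records are 3074 bytes long. A rough sketch of ReadCifarMultiLabels (argument names match the call below; the body is an assumption) just reads both label bytes per record:
# Hypothetical sketch: read the two leading label bytes of each
# 3074-byte record; mapping ids to names is left to NumbersToLabels.
ReadCifarMultiLabels <- function(files, imageLabels = 2L) {
  recordSize <- imageLabels + 3072L
  lapply(files, function(f) {
    con <- file(f, "rb")
    on.exit(close(con))
    sapply(FindPos(file.size(f), recordSize), function(pos) {
      seek(con, where = pos, origin = "start")
      readBin(con, what = "integer", size = 1L,
              n = imageLabels, signed = FALSE)
    })
  })
}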
proofFinalLabels = ReadCifarMultiLabels(CifarTest(filepath = Cifar100Path(), pattern = "test.bin"), imageLabels = 2L) %>%
  unlist() %>%
  data.frame(labels = .) %>%
  NumbersToLabels(., CifarLabelnames(filepath = Cifar100Path(), pattern = ".*label_names.txt"))
proofFinalResults = ReadCifarData(CifarTest(filepath = Cifar100Path(), pattern = "test.bin"), simplify = FALSE) %>%
  FilterGreen(.) %>%
  ThreeToTwoDims(.) %>%
  data.frame(.) %>%
  cbind(proofFinalLabels, .)
str(proofFinalResults, list.len = 6)
## 'data.frame': 10000 obs. of 3074 variables:
## $ labels.1: chr "large_natural_outdoor_scenes" "large_natural_outdoor_scenes" "aquatic_mammals" "fruit_and_vegetables" ...
## $ labels.2: chr "mountain" "forest" "seal" "mushroom" ...
## $ X1 : num 49 10 67 115 173 35 81 67 105 11 ...
## $ X2 : num 199 33 0 89 191 34 44 70 124 11 ...
## $ X3 : num 196 113 72 4 223 32 50 82 96 12 ...
## $ X4 : num 195 88 61 51 10 29 38 85 53 11 ...
## [list output truncated]
proofFinalResults[1:6, 1:6]
## labels.1 labels.2 X1 X2 X3 X4
## 1 large_natural_outdoor_scenes mountain 49 199 196 195
## 2 large_natural_outdoor_scenes forest 10 33 113 88
## 3 aquatic_mammals seal 67 0 72 61
## 4 fruit_and_vegetables mushroom 115 89 4 51
## 5 large_natural_outdoor_scenes sea 173 191 223 10
## 6 flowers tulip 35 34 32 29
Both CIFAR-10 and CIFAR-100 are credited to Alex Krizhevsky (2009). The tech report describing the datasets and methodology is Learning Multiple Layers of Features from Tiny Images: https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf. The datasets themselves are hosted at https://www.cs.toronto.edu/~kriz/cifar.html.