The MNIST database of handwritten digits (see http://yann.lecun.com/exdb/mnist/) has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting.
With some classification methods (particularly template-based methods, such as SVM and K-nearest neighbors), the error rate improves when the digits are centered by bounding box rather than center of mass. If you do this kind of pre-processing, you should report it in your publications.
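To make the bounding-box idea concrete, here is a minimal sketch in R. It assumes a digit is a numeric matrix with 0 as background. The name `center_by_bbox` is a hypothetical helper, not part of MNIST or of any package, and this is not necessarily the exact procedure behind the error rates mentioned above.

```r
# Hypothetical helper: shift a digit so the bounding box of its
# non-zero pixels sits in the middle of the image.
# Assumes `img` is a numeric matrix with 0 as background.
center_by_bbox <- function(img) {
  rows <- which(rowSums(img != 0) > 0)   # rows containing ink
  cols <- which(colSums(img != 0) > 0)   # columns containing ink
  if (length(rows) == 0) return(img)     # blank image: nothing to move
  rr <- min(rows):max(rows)              # bounding box, row range
  cc <- min(cols):max(cols)              # bounding box, column range
  # Whole-pixel shift from the bounding-box centre to the image centre
  dr <- floor((nrow(img) + 1) / 2 - (min(rows) + max(rows)) / 2)
  dc <- floor((ncol(img) + 1) / 2 - (min(cols) + max(cols)) / 2)
  out <- matrix(0, nrow(img), ncol(img))
  out[rr + dr, cc + dc] <- img[rr, cc]
  out
}

# Tiny example: a single ink pixel in the top-left corner of a 5 x 5 image
img <- matrix(0, 5, 5)
img[1, 1] <- 255
centered <- center_by_bbox(img)
which(centered != 0, arr.ind = TRUE)   # the pixel is now at row 3, column 3
```

Centering by center of mass would instead weight each pixel position by its intensity, so the two methods can place the same digit differently.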
Four files are available on the MNIST site:
train-images-idx3-ubyte.gz: training set images (9912422 bytes)
train-labels-idx1-ubyte.gz: training set labels (28881 bytes)
t10k-images-idx3-ubyte.gz: test set images (1648877 bytes)
t10k-labels-idx1-ubyte.gz: test set labels (4542 bytes)
You need to download all four files, unzip them, and save them in a local directory. After unzipping, the files used below are:
train-images.idx3-ubyte: training set images
train-labels.idx1-ubyte: training set labels
t10k-images.idx3-ubyte: test set images
t10k-labels.idx1-ubyte: test set labels
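As an optional alternative to unzipping by hand, base R's `gzfile()` can open a gzip-compressed file directly, and `readBin()` works on that connection. The sketch below does not touch the real MNIST files: it writes a tiny made-up label file to a temporary path (an assumption for illustration only) and reads it back.

```r
# Build a tiny stand-in for a compressed label file (made-up demo data).
tmp <- tempfile(fileext = ".gz")
con <- gzfile(tmp, "wb")
writeBin(2049L, con, size = 4, endian = "big")   # magic number of a label file
writeBin(3L, con, size = 4, endian = "big")      # number of items
writeBin(as.raw(c(5, 0, 4)), con)                # three labels, one byte each
close(con)

# Read it back through gzfile(), with no explicit unzip step.
con <- gzfile(tmp, "rb")
magic  <- readBin(con, "integer", n = 1, size = 4, endian = "big")
n      <- readBin(con, "integer", n = 1, size = 4, endian = "big")
labels <- readBin(con, "integer", n = n, size = 1, signed = FALSE)
close(con)

magic    # 2049
labels   # 5 0 4
```

The same pattern should apply to a downloaded `.gz` file by passing its path to `gzfile()` instead of `tmp`.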
The data is stored in a very simple file format designed for storing vectors and multidimensional matrices.
TRAINING SET LABEL FILE (train-labels-idx1-ubyte):
[offset] [type]          [value]          [description]
0000     32 bit integer  0x00000801(2049) magic number (most significant byte, MSB, first)
0004     32 bit integer  60000            number of items
0008     unsigned byte   ??               label
0009     unsigned byte   ??               label
........
xxxx     unsigned byte   ??               label
The label values are 0 to 9.
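"MSB first" means the header integers are stored big-endian. As a small worked example (a sketch, with `be_int` as a hypothetical helper name), the four bytes of the magic number can be combined by hand. In the IDX format, the third byte encodes the data type (0x08 for unsigned byte) and the fourth byte the number of dimensions, which is why a label file carries 2049 and an image file 2051.

```r
# Hypothetical helper: combine 4 bytes, most significant byte first.
be_int <- function(b) sum(as.integer(b) * 256^(3:0))

label_magic <- as.raw(c(0x00, 0x00, 0x08, 0x01))  # 1 dimension  (labels)
image_magic <- as.raw(c(0x00, 0x00, 0x08, 0x03))  # 3 dimensions (images)

be_int(label_magic)   # 8 * 256 + 1 = 2049
be_int(image_magic)   # 8 * 256 + 3 = 2051

# readBin() gives the same answer when told the bytes are big-endian
readBin(label_magic, "integer", n = 1, size = 4, endian = "big")   # 2049
```

Reading the same bytes with `endian = "little"` would give a different, meaningless number, which is a quick way to spot a wrong byte order.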
TRAINING SET IMAGE FILE (train-images-idx3-ubyte):
[offset] [type]          [value]          [description]
0000     32 bit integer  0x00000803(2051) magic number
0004     32 bit integer  60000            number of images
0008     32 bit integer  28               number of rows
0012     32 bit integer  28               number of columns
0016     unsigned byte   ??               pixel
0017     unsigned byte   ??               pixel
........
xxxx     unsigned byte   ??               pixel
Pixels are organized row-wise. Pixel values are 0 to 255. 0 means background (white), 255 means foreground (black).
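Because the header is 16 bytes long and the pixels are stored row-wise with one byte each, the byte offset of any pixel can be worked out directly. The sketch below uses a hypothetical helper, `pixel_offset`, with 1-based image, row, and column indices (the R convention); the offsets it returns are 0-based, as in the table above.

```r
# Hypothetical helper: 0-based byte offset of pixel (r, c) of image i,
# where i, r and c are 1-based. Assumes a 16-byte header and 1 byte per pixel.
pixel_offset <- function(i, r, c, nrow = 28, ncol = 28) {
  16 + (i - 1) * nrow * ncol + (r - 1) * ncol + (c - 1)
}

pixel_offset(1, 1, 1)         # 16: first pixel of the first image
pixel_offset(1, 1, 2)         # 17: next pixel in the same row
pixel_offset(1, 2, 1)         # 44: first pixel of the second row (16 + 28)
pixel_offset(2, 1, 1)         # 800: first pixel of the second image (16 + 784)
pixel_offset(60000, 28, 28)   # 47040015: last byte of the training image file
```

The last value implies an unzipped training image file of 16 + 60000 * 784 = 47,040,016 bytes.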
f = file("C:/Python36/my_project/project_0/MNIST_data/train-images.idx3-ubyte", "rb")
# Read the magic number (a constant numerical or text value used to identify a file format)
readBin(f, integer(), n=1, endian="big")
## [1] 2051
# Read Number of Images
readBin(f, integer(), n=1, endian="big")
## [1] 60000
# Read Number of Rows
readBin(f, integer(), n=1, endian="big")
## [1] 28
# Read Number of Columns
readBin(f, integer(), n=1, endian="big")
## [1] 28
# Read the pixels of the first image; each image has nrow x ncol pixels
# Store them in matrix form for easy visualization
# signed=FALSE keeps pixel values in 0..255 (by default, 1-byte integers are read as signed)
m = matrix(readBin(f, integer(), size=1, n=28*28, signed=FALSE, endian="big"), 28, 28)
image(m)
# Let's flip the image (since we know the first digit is a "5")
m = m[, 28:1]
image(m)
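# Do the same for the next 25 images (images 2 to 26; the first one was read above)
par(mfrow=c(5,5))
par(mar=c(0.1,0.1,0.1,0.1))
for(i in 1:25){
  m = matrix(readBin(f, integer(), size=1, n=28*28, signed=FALSE, endian="big"), 28, 28)
  image(m[,28:1])
}
# Close the image file before opening the label file
close(f)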
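Why the flip is needed: the file stores pixels row by row, while `matrix()` fills column by column, so `m` holds the transposed image. `image()` then draws the first matrix index along the x axis and the second along the y axis, increasing upward, which leaves the digit upside down. The sketch below uses a made-up 2 x 3 "image" (all names are illustrative) to check that reversing the columns of `m` is the same as building the image row-wise, flipping it top to bottom, and transposing it.

```r
px <- 1:6               # row-major pixels of a 2-row, 3-column image:
n_row <- 2; n_col <- 3  # row 1 is 1 2 3, row 2 is 4 5 6

# Column-wise fill of row-major data, as in the code above: transposed image
m <- matrix(px, n_col, n_row)

# Conventional orientation, img[row, col]
img <- matrix(px, n_row, n_col, byrow = TRUE)

identical(m, t(img))                          # TRUE: m is the transpose

# Two equivalent ways to prepare the matrix for image()
flipped_a <- m[, n_row:1]                     # reverse the columns of m
flipped_b <- t(img[n_row:1, ])                # flip rows, then transpose
identical(flipped_a, flipped_b)               # TRUE
```

Keeping an `img[row, col]` matrix and applying `t(img[nrow(img):1, ])` only at plotting time may be easier to reason about when the images are not square.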
f = file("C:/Python36/my_project/project_0/MNIST_data/train-labels.idx1-ubyte", "rb")
# Read Magic Number
readBin(f, integer(), n=1, endian="big")
## [1] 2049
# Read Number of Labels
n = readBin(f,'integer',n=1,size=4,endian='big')
# Read All the Labels
y = readBin(f,'integer',n=n,size=1,signed=F)
close(f)
# See if the first digit is "5"
y[1]
## [1] 5
# Display the labels of images 2 to 26, laid out row by row like the 5 x 5 grid of images
mlabel=t(matrix(y[2:26],5,5))
mlabel
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    4    1    9    2
## [2,]    1    3    1    4    3
## [3,]    5    3    6    1    7
## [4,]    2    8    6    9    4
## [5,]    0    9    1    1    2
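# Load an IDX image file into a list: n = number of images,
# x = matrix with one image per row (nrow * ncol pixel columns)
load_image_file <- function(filename) {
  ret = list()
  f = file(filename, 'rb')
  readBin(f, 'integer', n=1, size=4, endian='big')          # magic number (ignored)
  ret$n = readBin(f, 'integer', n=1, size=4, endian='big')  # number of images
  nrow = readBin(f, 'integer', n=1, size=4, endian='big')   # rows per image
  ncol = readBin(f, 'integer', n=1, size=4, endian='big')   # columns per image
  # Pixels are unsigned bytes (0 to 255)
  x = readBin(f, 'integer', n=ret$n*nrow*ncol, size=1, signed=FALSE)
  ret$x = matrix(x, ncol=nrow*ncol, byrow=TRUE)
  close(f)
  ret
}
# Load an IDX label file into an integer vector of labels (0 to 9)
load_label_file <- function(filename) {
  f = file(filename, 'rb')
  readBin(f, 'integer', n=1, size=4, endian='big')          # magic number (ignored)
  n = readBin(f, 'integer', n=1, size=4, endian='big')      # number of labels
  y = readBin(f, 'integer', n=n, size=1, signed=FALSE)
  close(f)
  y
}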
train <- load_image_file("C:/Python36/my_project/project_0/MNIST_data/train-images.idx3-ubyte")
test <- load_image_file("C:/Python36/my_project/project_0/MNIST_data/t10k-images.idx3-ubyte")
train$y <- load_label_file("C:/Python36/my_project/project_0/MNIST_data/train-labels.idx1-ubyte")
test$y <- load_label_file("C:/Python36/my_project/project_0/MNIST_data/t10k-labels.idx1-ubyte")
class(train)
## [1] "list"
lengths(train)
##        n        x        y
##        1 47040000    60000
class(test)
## [1] "list"
lengths(test)
##       n       x       y
##       1 7840000   10000
df_train = as.data.frame(train)
dim(df_train)
## [1] 60000 786
df_test = as.data.frame(test)
dim(df_test)
## [1] 10000 786
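The 786 columns come from how `as.data.frame()` treats the list: the single value `n` is recycled into one column, the 784-column pixel matrix `x` is expanded into one column per pixel, and `y` adds one more (1 + 784 + 1 = 786). A toy list with made-up values, mirroring the structure of `train`, shows the same behaviour on a small scale.

```r
# Toy stand-in for `train`: 2 "images" of 3 pixels each (made-up values)
toy <- list(n = 2L, x = matrix(1:6, nrow = 2), y = c(5L, 0L))

df_toy <- as.data.frame(toy)
dim(df_toy)     # 2 rows, 5 columns: 1 (n) + 3 (pixels) + 1 (y)
names(df_toy)   # "n" "x.1" "x.2" "x.3" "y"
```

Since the `n` column only repeats the image count on every row, it can probably be dropped before modelling, for example with `df_train$n <- NULL`.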