The Keypoint is an observation or a data sample. It’s one distinct, interesting thing you found in the image.
The 128-element Descriptor is the feature vector for that observation. Each of the 128 numbers is a variable that measures a specific visual property of the patch.
The Machine Learning Analogy

Imagine you were trying to describe a person for a database. You might use a feature vector like this: [Height, Weight, Age, Shoe_Size, …]
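In the SIFT world, the algorithm is describing a patch of an image. Its “person” is a 16x16 neighborhood of samples around the keypoint, and its feature vector has 128 “measurements” instead of just 4. The neighborhood is divided into a 4x4 grid of cells, and each cell contributes an 8-bin histogram of gradient orientations, giving 4 x 4 x 8 = 128 values.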
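
As a rough illustration of where the 128 numbers come from, the sketch below builds a simplified SIFT-style descriptor with NumPy. It assumes a 16x16 grayscale patch has already been cut out around the keypoint, and it deliberately skips parts of real SIFT (Gaussian weighting, interpolation between bins, rotation to the keypoint's dominant orientation). The function name `simplified_descriptor` and the random patch are hypothetical, chosen only for this example.

```python
import numpy as np

def simplified_descriptor(patch):
    """Build a 128-value SIFT-style descriptor from a 16x16 grayscale patch.

    Simplified sketch: no Gaussian weighting, no interpolation between bins,
    and no rotation to the keypoint's dominant orientation.
    """
    patch = np.asarray(patch, dtype=np.float64)
    if patch.shape != (16, 16):
        raise ValueError("expected a 16x16 patch")

    # np.gradient returns the derivative along rows (y) first, then columns (x).
    dy, dx = np.gradient(patch)
    magnitude = np.hypot(dx, dy)
    angle = np.arctan2(dy, dx) % (2 * np.pi)  # orientation in [0, 2*pi)

    # Quantize each orientation into one of 8 bins of 45 degrees each.
    bin_index = np.floor(angle / (2 * np.pi / 8)).astype(int) % 8

    # 4x4 grid of cells, each holding an 8-bin orientation histogram.
    histograms = np.zeros((4, 4, 8))
    for row in range(16):
        for col in range(16):
            histograms[row // 4, col // 4, bin_index[row, col]] += magnitude[row, col]

    # Flatten to 128 values, normalize, clip large values at 0.2, renormalize.
    vec = histograms.ravel()
    norm = np.linalg.norm(vec)
    if norm > 0:
        vec = np.minimum(vec / norm, 0.2)
        vec = vec / np.linalg.norm(vec)
    return vec

rng = np.random.default_rng(0)
patch = rng.random((16, 16))            # stand-in patch, not real image data
desc = simplified_descriptor(patch)
darker_desc = simplified_descriptor(0.5 * patch)  # same patch at half brightness

print(desc.shape)                       # (128,)
print(np.allclose(desc, darker_desc))   # normalization cancels the brightness scaling
```

Because the vector is normalized, uniformly scaling the patch brightness leaves the descriptor unchanged in this sketch, which is one reason such descriptors are more stable than raw pixels.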
| Concept | Traditional ML Example | SIFT Equivalent |
|---|---|---|
| One Data Sample / Observation | One Person | One Keypoint |
| Feature Vector | [Height, Weight, Age, Shoe_Size], e.g. [175, 68, 32, 10] | The 128-element Descriptor, e.g. [43., 2., 1., ..., 0., 0., 0.] |
| Number of Features | 4 features | 128 features |
| Feature 1 | Height = 175 cm | Histogram Bin 1 Value = 43.0 |
| Feature 2 | Weight = 68 kg | Histogram Bin 2 Value = 2.0 |
| Feature 3 | Age = 32 years | Histogram Bin 3 Value = 1.0 |
| … | … | … |
| Feature 128 | … | Histogram Bin 128 Value = 0.0 |
| Full Dataset | A table of many people, each with 4 measurements | The descriptors array: many keypoints, each with 128 measurements |
Why 128 Features Is Powerful

This fixed-length description is what makes SIFT so useful for computer vision. Instead of using raw pixel values, which change with lighting, rotation, and scale, it computes 128 higher-level numerical “measurements” that are:
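Informative: They summarize the strength and direction of local intensity gradients, which capture the texture and edge structure in the patch.
Consistent: The same patch in a rotated or slightly darker image will have a very similar 128-value vector, because the descriptor is computed relative to the keypoint's dominant orientation and then normalized.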
Structured: Because every keypoint is described by the same number of features (128), you can easily compare them, cluster them, and feed them into machine learning models.
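
To make the "easy to compare" point concrete, here is a minimal sketch of nearest-neighbor matching by Euclidean distance. The descriptors are random stand-ins rather than real SIFT output, and the names `desc_a`, `desc_b`, and `order` are hypothetical; the second set is just a shuffled, slightly perturbed copy of the first, imitating the same keypoints seen in a second image.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in descriptors for image A: 5 keypoints x 128 features (random, not real SIFT output).
desc_a = rng.random((5, 128)).astype(np.float32)

# Image B: the same keypoints in a different order, with a small perturbation.
order = np.array([3, 0, 4, 1, 2])
noise = rng.normal(0.0, 0.01, size=(5, 128)).astype(np.float32)
desc_b = desc_a[order] + noise

# Pairwise Euclidean distances: dist[i, j] = distance between A's row i and B's row j.
diff = desc_a[:, None, :] - desc_b[None, :, :]
dist = np.linalg.norm(diff, axis=2)

# For each keypoint in A, the index of its closest descriptor in B.
nearest_in_b = dist.argmin(axis=1)
print(nearest_in_b)
```

Every row has the same 128 columns, so a single broadcasted subtraction compares all pairs at once. The same property lets the array be passed directly to clustering or classification code.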
So, when you run SIFT on an image and get 150 keypoints, you are essentially creating a dataset with 150 rows (samples/observations) and 128 columns (features/variables). This dataset compactly describes the visually interesting parts of your image in a form a computer can compare and process.
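
The rows-and-columns view can be sketched with a plain NumPy array. The array below is a random stand-in with the shape described above, not real SIFT output; with OpenCV, an array of this shape would typically come from `cv2.SIFT_create().detectAndCompute(gray, None)`, which is mentioned here as an assumption about your setup rather than something the example runs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a descriptors array: 150 keypoints x 128 features.
descriptors = rng.integers(0, 256, size=(150, 128)).astype(np.float32)

n_samples, n_features = descriptors.shape
print(n_samples, n_features)            # 150 128

first_keypoint = descriptors[0]         # one row: the feature vector of one keypoint
first_feature = descriptors[:, 0]       # one column: histogram bin 1 across all keypoints

# Per-feature statistics, exactly as you would compute them for any tabular dataset.
feature_means = descriptors.mean(axis=0)
print(first_keypoint.shape, first_feature.shape, feature_means.shape)
```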