The Palmer Penguins dataset covers three penguin species living on three islands in the Palmer Archipelago, Antarctica, where researchers at Palmer Station took scientific measurements of the birds. Although the raw data is largely complete, it has a few missing values, mostly where a sample could not be collected. Since most machine learning models cannot handle missing inputs, what artificial value should we use instead? Can an expected value be inferred from the other features of the data?
Comparatively speaking, the Gentoo penguins are much larger birds: their beaks, flippers, and body mass are all considerably bigger than those of the Adelie penguins.
A clear example of missing data that could plausibly be filled in is provided below. Generally, the males of each species are heavier than the females. Although this does not hold for all animals, we can use the data to establish such generalizations on a case-by-case basis.
By contrast, it may be harder to tell an Adelie from a Chinstrap penguin if you are given only the flipper length. This is not to say the two species lack distinguishing features, only that their measurements tend to overlap.
A more complex analysis may involve a scoring algorithm that compares each feature. Returning to the earlier graph, we may want to classify sex with a statistical approach. This dependency on body mass can be modeled with a machine learning algorithm such as linear regression, or logistic regression when the target is a binary label. Certain SAS methods can also evaluate features against a statistical range or a boolean condition.
If we instead wanted to predict the species, we would need to produce a nominal (multi-class) label rather than a strict boolean classification. One machine learning algorithm that classifies data into distinct groups is K-Nearest Neighbors, or KNN.
Given that our data is quite simple, we will fill in artificial values with a statistical strategy. Checking whether a body mass lies within the mean +/- 2 standard deviations will suffice, since our earlier analysis showed a strong correlation between body mass and sex. This is easy to do on almost any platform. Below is a sample of our dataset in which the Sex column is missing:
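To make the KNN idea concrete, here is a minimal sketch in Python using only the standard library. It is an illustration and not the method used for the results in this article. The training rows are the flipper length and body mass values from the nine-row sample printed in this section; the function name `knn_predict`, the choice of `k=3`, and the z-score scaling step are assumptions made for the example.

```python
import math
from collections import Counter
from statistics import mean, pstdev

# (flipper length in mm, body mass in g, species), copied from the
# nine-row sample shown in this section.
TRAIN = [
    (193, 3475, "Adelie"), (190, 4250, "Adelie"), (186, 3300, "Adelie"),
    (180, 3700, "Adelie"), (179, 2975, "Adelie"),
    (216, 4100, "Gentoo"), (214, 4650, "Gentoo"),
    (216, 4725, "Gentoo"), (217, 4875, "Gentoo"),
]


def knn_predict(train, query, k=3):
    """Majority vote among the k nearest training rows (hypothetical helper).

    Each feature is z-scored first so that body mass, which is measured in
    much larger numbers than flipper length, does not dominate the distance.
    """
    n_features = len(query)
    columns = [[row[i] for row in train] for i in range(n_features)]
    centers = [mean(col) for col in columns]
    scales = [pstdev(col) for col in columns]

    def scale(values):
        return [(v - c) / s for v, c, s in zip(values, centers, scales)]

    scaled_query = scale(query)
    # Pair every training row's distance to the query with its label.
    neighbours = sorted(
        (math.dist(scale(row[:n_features]), scaled_query), row[-1])
        for row in train
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]


large_bird = knn_predict(TRAIN, (215, 4700))
small_bird = knn_predict(TRAIN, (185, 3400))
```

Because the vote is taken over labels rather than a yes/no flag, the same function works unchanged if Chinstrap rows are added to the training list.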
## # A tibble: 9 x 7
##   Study Species Island  `Culmen Length (~ `Flipper Length~ `Body Mass (g)` Sex
##   <int> <chr>   <chr>               <dbl>            <dbl>           <dbl> <chr>
## 1     1 Adelie  Torger~              34.1              193            3475 <NA>
## 2     1 Adelie  Torger~              42                190            4250 <NA>
## 3     1 Adelie  Torger~              37.8              186            3300 <NA>
## 4     1 Adelie  Torger~              37.8              180            3700 <NA>
## 5     1 Adelie  Dream                37.5              179            2975 <NA>
## 6     1 Gentoo  Biscoe               44.5              216            4100 <NA>
## 7     2 Gentoo  Biscoe               46.2              214            4650 <NA>
## 8     3 Gentoo  Biscoe               47.3              216            4725 <NA>
## 9     3 Gentoo  Biscoe               44.5              217            4875 <NA>
Scoring a single feature, body mass within each species, gives the following logical vector, where TRUE marks a row whose sex we can fill in. A more extensible method would score every feature. For brevity's sake, we use this one metric rather than combining all three measurements (culmen length, flipper length, and body mass).
## [1] TRUE FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
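The scoring step can be sketched roughly as follows, in Python with only the standard library. This is one possible reading of the "mean +/- 2 standard deviations" rule, not the exact procedure behind the vector above: a sex is assigned only when the body mass falls inside the range of exactly one sex. The reference values are the six labelled Adelie rows from the cleaned sample printed at the end of this section, so the thresholds are purely illustrative and are not expected to reproduce the output above. The names `mass_ranges` and `infer_sex` are hypothetical.

```python
from statistics import mean, stdev

# (body mass in g, sex) for the six labelled Adelie rows printed in the
# cleaned sample in this section. A real run would use every labelled row.
LABELLED = [
    (3750, "M"), (3800, "F"), (3250, "F"),
    (3450, "F"), (3650, "M"), (3625, "F"),
]


def mass_ranges(labelled, n_sd=2):
    """Return {sex: (low, high)} with bounds at mean -/+ n_sd sample std devs."""
    ranges = {}
    for sex in sorted({s for _, s in labelled}):
        masses = [m for m, s in labelled if s == sex]
        centre, spread = mean(masses), stdev(masses)
        ranges[sex] = (centre - n_sd * spread, centre + n_sd * spread)
    return ranges


def infer_sex(mass, ranges):
    """Return the single sex whose range contains mass, else None.

    None covers both "no range matches" and "more than one range matches",
    so ambiguous birds stay missing instead of being guessed.
    """
    matches = [sex for sex, (low, high) in ranges.items() if low <= mass <= high]
    return matches[0] if len(matches) == 1 else None


ranges = mass_ranges(LABELLED)
missing = [3475, 4250, 3300, 3700, 2975]  # Adelie body masses with unknown sex
filled = [infer_sex(m, ranges) for m in missing]
can_fill = [sex is not None for sex in filled]  # analogous to the logical vector
```

Leaving ambiguous rows as missing is a deliberate choice in this sketch: with overlapping ranges, forcing a label would add noise rather than information.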
Thus, we can produce artificial values where there previously were none. This preserves rows we would otherwise have to discard and strengthens our dataset. The augmented dataset, built from the same rows as before, follows.
## # A tibble: 9 x 7
##   Study Species Island  `Culmen Length (~ `Flipper Length~ `Body Mass (g)` Sex
##   <int> <chr>   <chr>               <dbl>            <dbl>           <dbl> <chr>
## 1     1 Adelie  Torger~              34.1              193            3475 F
## 2     1 Adelie  Torger~              42                190            4250 <NA>
## 3     1 Adelie  Torger~              37.8              186            3300 F
## 4     1 Adelie  Torger~              37.8              180            3700 F
## 5     1 Adelie  Dream                37.5              179            2975 <NA>
## 6     1 Gentoo  Biscoe               44.5              216            4100 F
## 7     2 Gentoo  Biscoe               46.2              214            4650 F
## 8     3 Gentoo  Biscoe               47.3              216            4725 F
## 9     3 Gentoo  Biscoe               44.5              217            4875 F
With this augmentation, we were able to fill in the sex for 7 of the 9 rows where it was missing. A more thorough algorithm is advisable for real use. Examples can be found online, although this is a diverse problem with many solutions.
For data visualization purposes, the data was cleaned by omitting missing data and applying string substitution to studyName, Species, and Sex. Additionally, 2 of the 344 observations were omitted entirely due to lack of data.
Before:
## # A tibble: 6 x 7
##   studyName Species                             Island    `Culmen Length (mm)`
##   <chr>     <chr>                               <chr>                    <dbl>
## 1 PAL0708   Adelie Penguin (Pygoscelis adeliae) Torgersen                 39.1
## 2 PAL0708   Adelie Penguin (Pygoscelis adeliae) Torgersen                 39.5
## 3 PAL0708   Adelie Penguin (Pygoscelis adeliae) Torgersen                 40.3
## 4 PAL0708   Adelie Penguin (Pygoscelis adeliae) Torgersen                 36.7
## 5 PAL0708   Adelie Penguin (Pygoscelis adeliae) Torgersen                 39.3
## 6 PAL0708   Adelie Penguin (Pygoscelis adeliae) Torgersen                 38.9
##   `Flipper Length (mm)` `Body Mass (g)` Sex
##                   <dbl>           <dbl> <chr>
## 1                   181            3750 MALE
## 2                   186            3800 FEMALE
## 3                   195            3250 FEMALE
## 4                   193            3450 FEMALE
## 5                   190            3650 MALE
## 6                   181            3625 FEMALE
After:
## # A tibble: 6 x 7
##   Study Species Island    `Beak Size (mm)` `Flipper (mm)` `Body Mass (g)` Sex
##   <int> <chr>   <chr>                <dbl>          <dbl>           <dbl> <chr>
## 1     1 Adelie  Torgersen             39.1            181            3750 M
## 2     1 Adelie  Torgersen             39.5            186            3800 F
## 3     1 Adelie  Torgersen             40.3            195            3250 F
## 4     1 Adelie  Torgersen             36.7            193            3450 F
## 5     1 Adelie  Torgersen             39.3            190            3650 M
## 6     1 Adelie  Torgersen             38.9            181            3625 F
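The cleaning step that turns the "Before" rows into the "After" rows could look something like the sketch below, written in Python with only the standard library. It is an approximation under stated assumptions, not the code used to produce the tables. Only the mapping of `PAL0708` to study 1 is visible in the output above; the other two study codes and their numbers are assumed. Dropping a row only when one of its three measurements is missing is also an assumption, and the helper name `clean_row` is hypothetical.

```python
# Only PAL0708 -> 1 appears in the tables; the other two entries are assumed.
STUDY_IDS = {"PAL0708": 1, "PAL0809": 2, "PAL0910": 3}
SEX_CODES = {"MALE": "M", "FEMALE": "F"}
MEASUREMENTS = ("Culmen Length (mm)", "Flipper Length (mm)", "Body Mass (g)")


def clean_row(row):
    """Return a cleaned copy of one raw record, or None if it should be dropped.

    A record is dropped when any of its three measurements is missing
    (an assumption). A missing or unrecognized Sex is kept as None.
    """
    if any(row.get(col) is None for col in MEASUREMENTS):
        return None
    return {
        "Study": STUDY_IDS[row["studyName"]],
        # "Adelie Penguin (Pygoscelis adeliae)" -> "Adelie"
        "Species": row["Species"].split()[0],
        "Island": row["Island"],
        "Beak Size (mm)": row["Culmen Length (mm)"],
        "Flipper (mm)": row["Flipper Length (mm)"],
        "Body Mass (g)": row["Body Mass (g)"],
        "Sex": SEX_CODES.get(row["Sex"]),
    }


# First row of the "Before" table.
raw = {
    "studyName": "PAL0708",
    "Species": "Adelie Penguin (Pygoscelis adeliae)",
    "Island": "Torgersen",
    "Culmen Length (mm)": 39.1,
    "Flipper Length (mm)": 181,
    "Body Mass (g)": 3750,
    "Sex": "MALE",
}
cleaned = clean_row(raw)
```

Keeping the substitutions in small lookup tables makes it easy to see, and to change, exactly which raw strings map to which cleaned values.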