Can we generate and utilize artificial data in classifying our penguins?

The Palmer Penguins dataset covers three penguin species found on three islands in Antarctica's Palmer Archipelago. Researchers at Palmer Station studied these penguins and recorded physical measurements. Although the raw data is generally complete, it has a few missing values, mostly where a sample could not be collected. Since most machine learning models cannot handle missing values, what artificial value should we use instead? Can an expected value be inferred from the other features of the data?

Analyzing the data for correlations

Comparatively speaking, the Gentoo penguins are much larger birds. Their beaks, flippers, and body weight are all much bigger than those of the Adelie penguins.

One clear example of missing data that could plausibly be filled in is sex. Generally, the males of each species are heavier than the females. Although this is not necessarily true of all animals, we can use the data to establish such generalizations on a case-by-case basis.

By comparison, it can be harder to tell an Adelie from a Chinstrap penguin if you are given only the flipper length. This is not to say they lack distinguishing features, only that their measurements tend to overlap.

Looking for a solution

A more complex analysis may involve a scoring algorithm that compares each feature. Returning to the earlier graph, we may want to classify sex statistically. Because sex depends on body mass, this relationship can be modeled with a machine learning algorithm such as linear regression, or logistic regression, which is built for two-valued outcomes. Certain SAS procedures can also evaluate features against a statistical range or a boolean condition.

If instead we wanted to predict the species, we would need to produce a nominal (multi-class) label rather than a strict boolean classification. One machine learning algorithm that classifies data into distinct groups is K-Nearest Neighbors, or KNN.
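To make the KNN idea concrete, here is a minimal sketch in standard-library Python. It is not the method used in this post. The names `TRAIN` and `knn_predict`, the choice of k = 3, and the z-score scaling step are all assumptions made for illustration. The training points are the flipper length and body mass of the nine rows printed later in this post, which is far too few for a real model.

```python
import math
import statistics
from collections import Counter

# (flipper length mm, body mass g, species), copied from the nine rows
# printed later in this post. Illustration only: nine points is not a model.
TRAIN = [
    (193, 3475, "Adelie"), (190, 4250, "Adelie"), (186, 3300, "Adelie"),
    (180, 3700, "Adelie"), (179, 2975, "Adelie"),
    (216, 4100, "Gentoo"), (214, 4650, "Gentoo"),
    (216, 4725, "Gentoo"), (217, 4875, "Gentoo"),
]

def knn_predict(train, query, k=3):
    """Majority vote among the k nearest training points (hypothetical helper)."""
    n = len(query)
    # Standardize each feature so that grams do not swamp millimetres.
    means = [statistics.mean([row[i] for row in train]) for i in range(n)]
    sds = [statistics.stdev([row[i] for row in train]) for i in range(n)]

    def scale(point):
        return [(point[i] - means[i]) / sds[i] for i in range(n)]

    q = scale(query)
    # Sort training rows by Euclidean distance to the scaled query.
    nearest = sorted(train, key=lambda row: math.dist(scale(row[:n]), q))[:k]
    votes = Counter(row[-1] for row in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict(TRAIN, (215, 4700)))  # a large bird with long flippers
print(knn_predict(TRAIN, (185, 3400)))  # a smaller bird
```

Scaling matters here because body mass is measured in thousands of grams while flipper length is in hundreds of millimetres. Without it, the distance would be driven almost entirely by body mass.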

Putting an action plan together

Given that our data is quite simple, we will fill in artificial values with a statistical strategy. Checking whether a value lies within 2 standard deviations of the mean will suffice, since our earlier analysis showed a strong correlation between body mass and sex. This is easy to do on almost any platform. Below is the subset of our dataset where the sex column is missing:

## # A tibble: 9 x 7
##   Study Species Island  `Culmen Length (~ `Flipper Length~ `Body Mass (g)` Sex  
##   <int> <chr>   <chr>               <dbl>            <dbl>           <dbl> <chr>
## 1     1 Adelie  Torger~              34.1              193            3475 <NA> 
## 2     1 Adelie  Torger~              42                190            4250 <NA> 
## 3     1 Adelie  Torger~              37.8              186            3300 <NA> 
## 4     1 Adelie  Torger~              37.8              180            3700 <NA> 
## 5     1 Adelie  Dream                37.5              179            2975 <NA> 
## 6     1 Gentoo  Biscoe               44.5              216            4100 <NA> 
## 7     2 Gentoo  Biscoe               46.2              214            4650 <NA> 
## 8     3 Gentoo  Biscoe               47.3              216            4725 <NA> 
## 9     3 Gentoo  Biscoe               44.5              217            4875 <NA>

For one feature, body mass within each species, we get the following array, with one entry per row above. An extensible method would do this for every feature, but for brevity’s sake we score this single feature instead of all three.

## [1]  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE

Thus, we can produce artificial data where there previously wasn’t any. This keeps rows we would otherwise have to discard and makes the dataset more useful. The augmented version of the same subset follows.

## # A tibble: 9 x 7
##   Study Species Island  `Culmen Length (~ `Flipper Length~ `Body Mass (g)` Sex  
##   <int> <chr>   <chr>               <dbl>            <dbl>           <dbl> <chr>
## 1     1 Adelie  Torger~              34.1              193            3475 F    
## 2     1 Adelie  Torger~              42                190            4250 <NA> 
## 3     1 Adelie  Torger~              37.8              186            3300 F    
## 4     1 Adelie  Torger~              37.8              180            3700 F    
## 5     1 Adelie  Dream                37.5              179            2975 <NA> 
## 6     1 Gentoo  Biscoe               44.5              216            4100 F    
## 7     2 Gentoo  Biscoe               46.2              214            4650 F    
## 8     3 Gentoo  Biscoe               47.3              216            4725 F    
## 9     3 Gentoo  Biscoe               44.5              217            4875 F

With this augmentation, we were able to fill in the sex for 7 of the 9 rows that were missing it. A more thorough algorithm is highly recommended for a real use case. Examples can be found online, although it is a diverse problem with many solutions.
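The mean +/- 2 standard deviations idea from this section can be sketched roughly as follows, in standard-library Python. This is one possible reading of the rule, not the exact procedure that produced the output above, so its results are not expected to match. The helper names `sex_bands` and `impute_sex` are hypothetical. The labelled rows are the six cleaned Adelie observations printed in this post, which gives very crude statistics.

```python
import statistics

# Labelled Adelie rows (body mass g, sex) from the cleaned sample printed in
# this post. Six rows is far too few for trustworthy statistics.
LABELLED = [(3750, "M"), (3800, "F"), (3250, "F"),
            (3450, "F"), (3650, "M"), (3625, "F")]

# Adelie body masses from the table above whose sex is missing.
MISSING = [3475, 4250, 3300, 3700, 2975]

def sex_bands(labelled, n_sd=2.0):
    """Return {sex: (low, high)} as mean +/- n_sd sample standard deviations."""
    bands = {}
    for sex in sorted({s for _, s in labelled}):
        masses = [m for m, s in labelled if s == sex]
        mean, sd = statistics.mean(masses), statistics.stdev(masses)
        bands[sex] = (mean - n_sd * sd, mean + n_sd * sd)
    return bands

def impute_sex(mass, bands):
    """Assign a sex only when the mass falls inside exactly one band."""
    matches = [sex for sex, (low, high) in bands.items() if low <= mass <= high]
    # Zero or multiple matches: stay honest and keep the value missing.
    return matches[0] if len(matches) == 1 else None

bands = sex_bands(LABELLED)
imputed = [impute_sex(m, bands) for m in MISSING]
print(imputed)  # ['F', None, 'F', None, None]
```

Leaving a value missing when it fits no band, or fits both, is a deliberate choice in this sketch: an ambiguous guess would add noise rather than data.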

Data cleaning methods

For data visualization purposes, the data was cleaned by omitting missing values and substituting shorter strings in studyName, Species, and Sex. Additionally, 2 of the 344 observations were omitted entirely because they lacked data.

Before:

## # A tibble: 6 x 7
##   studyName Species                             Island    `Culmen Length (mm)`
##   <chr>     <chr>                               <chr>                    <dbl>
## 1 PAL0708   Adelie Penguin (Pygoscelis adeliae) Torgersen                 39.1
## 2 PAL0708   Adelie Penguin (Pygoscelis adeliae) Torgersen                 39.5
## 3 PAL0708   Adelie Penguin (Pygoscelis adeliae) Torgersen                 40.3
## 4 PAL0708   Adelie Penguin (Pygoscelis adeliae) Torgersen                 36.7
## 5 PAL0708   Adelie Penguin (Pygoscelis adeliae) Torgersen                 39.3
## 6 PAL0708   Adelie Penguin (Pygoscelis adeliae) Torgersen                 38.9
##   `Flipper Length (mm)` `Body Mass (g)` Sex   
##                   <dbl>           <dbl> <chr> 
## 1                   181            3750 MALE  
## 2                   186            3800 FEMALE
## 3                   195            3250 FEMALE
## 4                   193            3450 FEMALE
## 5                   190            3650 MALE  
## 6                   181            3625 FEMALE

After:

## # A tibble: 6 x 7
##   Study Species Island    `Beak Size (mm)` `Flipper (mm)` `Body Mass (g)` Sex  
##   <int> <chr>   <chr>                <dbl>          <dbl>           <dbl> <chr>
## 1     1 Adelie  Torgersen             39.1            181            3750 M    
## 2     1 Adelie  Torgersen             39.5            186            3800 F    
## 3     1 Adelie  Torgersen             40.3            195            3250 F    
## 4     1 Adelie  Torgersen             36.7            193            3450 F    
## 5     1 Adelie  Torgersen             39.3            190            3650 M    
## 6     1 Adelie  Torgersen             38.9            181            3625 F
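
The string substitutions and renamed columns shown in the before and after output could be reproduced along these lines. This is a sketch in standard-library Python under stated assumptions, not the code used for this post. The function name `clean_row` is hypothetical, only the PAL0708 to 1 mapping is visible in the output above (the other two study codes are guesses), and dropping a row when any measurement is missing is an assumed reading of "lack of data".

```python
# Assumed mapping from studyName codes to integers. Only PAL0708 -> 1 is
# visible in the output above; the other two entries are guesses.
STUDY_CODES = {"PAL0708": 1, "PAL0809": 2, "PAL0910": 3}
SEX_CODES = {"MALE": "M", "FEMALE": "F"}
MEASUREMENTS = ("Culmen Length (mm)", "Flipper Length (mm)", "Body Mass (g)")

def clean_row(row):
    """Return a cleaned copy of one raw record, or None if a measurement is missing."""
    if any(row.get(col) is None for col in MEASUREMENTS):
        return None  # assumed rule for omitting an observation entirely
    return {
        "Study": STUDY_CODES.get(row["studyName"]),
        # Keep the first word: "Adelie Penguin (Pygoscelis adeliae)" -> "Adelie"
        "Species": row["Species"].split()[0],
        "Island": row["Island"],
        "Beak Size (mm)": row["Culmen Length (mm)"],
        "Flipper (mm)": row["Flipper Length (mm)"],
        "Body Mass (g)": row["Body Mass (g)"],
        # A missing or unrecognized sex stays missing (None).
        "Sex": SEX_CODES.get(row.get("Sex")),
    }

# First row of the "Before" output above.
RAW = {
    "studyName": "PAL0708",
    "Species": "Adelie Penguin (Pygoscelis adeliae)",
    "Island": "Torgersen",
    "Culmen Length (mm)": 39.1,
    "Flipper Length (mm)": 181,
    "Body Mass (g)": 3750,
    "Sex": "MALE",
}
cleaned = clean_row(RAW)
print(cleaned)
```

Rows with a missing sex are kept rather than dropped, which matches the nine-row table earlier in the post where Sex is the only missing field.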