Overview

Bimodal features in a dataset are problematic and interesting, and they can be an opportunity for exploration. A bimodal distribution suggests that the feature may be observing two different subgroups or classes. I will use the classic Moneyball dataset and its TEAM_BATTING_SO feature to illustrate.

Load Sample Data

library(tidyverse)
library(mixtools)
library(ggplot2)

# Load Moneyball baseball dataset
df <- read.csv('datasets/moneyball-training-data.csv')

# Histogram of TEAM_BATTING_SO with a kernel density overlay
ggplot(df) + 
  geom_histogram(aes(x = TEAM_BATTING_SO, y = after_stat(density)), bins = 30) + 
  geom_density(aes(x = TEAM_BATTING_SO), color = 'blue')

Notice the clear bimodal structure, which suggests one group of teams that is more likely to strike out at bat and a second group that is less likely to. Used as-is in a standard linear model, this feature may cost predictive accuracy, since a single linear term treats both subgroups as one population. A better approach is to identify the underlying subgroup distributions, then estimate the probability that a given team’s TEAM_BATTING_SO value belongs to each distribution.

Determine Subgroups

The first step is to understand and model the likely subgroup distributions. R provides a package, mixtools (see R Vignette), that fits mixture models, in which the data are treated as draws from several subgroup distributions. Here is a quick example showing a possible mixture within TEAM_BATTING_SO. Note that the normalmixEM() call takes starting values for the subgroup means in mu. From the figure above, the peaks are at ~600 and ~900, so I chose starting values of 300 and 1200, which bracket both peaks, to help the mixture model capture both subgroups.
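For reference, a two-component normal mixture of the kind fit here can be written as below. The symbols \(\lambda_k\), \(\mu_k\), and \(\sigma_k\) stand for the mixing proportions, means, and standard deviations, matching the lambda, mu, and sigma arguments, and \(\phi\) is the normal density; the notation is my own shorthand rather than something taken from the package documentation.

\[
f(x) = \lambda_1 \, \phi(x; \mu_1, \sigma_1) + \lambda_2 \, \phi(x; \mu_2, \sigma_2), \qquad \lambda_1 + \lambda_2 = 1
\]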

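# Select the TEAM_BATTING_SO column and remove any missing data
df_mix <- df %>% 
  dplyr::select(TEAM_BATTING_SO) %>%
  drop_na()

# Fit a two-component normal mixture to TEAM_BATTING_SO
model <- normalmixEM(df_mix$TEAM_BATTING_SO, 
                     lambda = .5,         # starting mixing proportion
                     mu = c(300, 1200),   # starting values for the two component means
                     sigma = 5,           # starting standard deviation
                     maxit = 100)         # maximum number of EM iterations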
## number of iterations= 61
# Simple plot to illustrate possible bimodal mix of groups
plot(model, 
     whichplots = 2,
     density = TRUE, 
     main2 = "TEAM_BATTING_SO Possible Distributions", 
     xlab2 = "TEAM_BATTING_SO")

Observe that there is a crossover point at ~715, where a team is equally likely to belong to either group. Below this point, a team more likely belongs to the red group; above it, a team more likely belongs to the green group. However, because the two distributions overlap, any single assignment carries some uncertainty.
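The crossover can also be located numerically instead of read off the plot. The sketch below is a minimal base-R illustration that uses the fitted means and standard deviation reported in this post. The mixing weights are not reported here, so equal weights are an assumption made purely for illustration.

```r
# Fitted parameters as reported in this post
mu    <- c(523.8452395, 908.9373793)
sigma <- 158.2155465   # both components share this standard deviation

# ASSUMPTION: mixing weights are not reported in the post; equal weights
# are used here only for illustration (the fitted values are in model$lambda).
lambda <- c(0.5, 0.5)

# Closed form for equal sigmas, from solving
#   lambda1 * dnorm(x, mu1, sigma) == lambda2 * dnorm(x, mu2, sigma)
crossover <- mean(mu) + sigma^2 * log(lambda[1] / lambda[2]) / (mu[2] - mu[1])

# Numerical check: root of the difference between the weighted densities
weighted_diff <- function(x) {
  lambda[1] * dnorm(x, mean = mu[1], sd = sigma) -
    lambda[2] * dnorm(x, mean = mu[2], sd = sigma)
}
crossover_num <- uniroot(weighted_diff, interval = mu, tol = 1e-9)$root

print(round(c(closed_form = crossover, numeric = crossover_num), 2))
```

With equal weights the crossover is simply the midpoint of the two means, about 716.39, which is close to the ~715 seen in the plot. A small difference would be expected if the fitted weights are not exactly equal.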

Split the Bimodal Features

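The first distribution (red) has \(\mu_1=523.8452395\) and \(\sigma_1 = 158.2155465\) and the second distribution (green) has \(\mu_2=908.9373793\) and \(\sigma_2 = 158.2155465\). The identical standard deviations suggest the two components were fit with a shared \(\sigma\), which appears to follow from passing a single starting value for sigma.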

We now have two possible feature engineering options.

  1. Create one new categorical feature where TEAM_BATTING_SO < 715 is assigned a value of 1 and >= 715 is assigned a value of 2.
  2. Create two new features that indicate the probability the team belongs to each group.
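As a rough sketch of the first option, the cutoff rule can be written with ifelse(). The strikeout values and the column name TEAM_BATTING_SO_group below are hypothetical stand-ins, not taken from the dataset.

```r
# Hypothetical strikeout totals standing in for df_mix$TEAM_BATTING_SO
so <- c(450, 600, 714, 715, 900, 1100)

# Crossover point read from the mixture plot
cutoff <- 715

# Group 1: below the cutoff; group 2: at or above the cutoff
so_group <- factor(ifelse(so < cutoff, 1, 2), levels = c(1, 2))

print(data.frame(TEAM_BATTING_SO = so, TEAM_BATTING_SO_group = so_group))
```

This hard split discards the uncertainty near the crossover: a value of 714 and a value of 450 receive the same label.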

For this example, I will use the second option. With the fitted model, we can add two new features to the dataset that hold, for each value of TEAM_BATTING_SO, the probability that the value belongs to the lower or the higher distribution.

# model$posterior holds, for each observation, the probability of belonging to
# each component (rows sum to 1); columns follow the order of model$mu
df_mix$TEAM_BATTING_SO_p1 <- model$posterior[, 1]  # lower-mean group
df_mix$TEAM_BATTING_SO_p2 <- model$posterior[, 2]  # higher-mean group
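Membership probabilities for a mixture follow from Bayes' rule: weight each component's density by its mixing proportion, then normalize so the values sum to 1. mixtools returns these for the training data in model$posterior. The base-R sketch below shows the same calculation for arbitrary values. The helper name posterior_prob and the sample values are hypothetical, and equal mixing weights are assumed because the fitted weights are not reported in this post.

```r
# Fitted parameters as reported in this post
mu    <- c(523.8452395, 908.9373793)
sigma <- c(158.2155465, 158.2155465)

# ASSUMPTION: equal mixing weights, for illustration only
# (the fitted values are in model$lambda).
lambda <- c(0.5, 0.5)

# Hypothetical helper: posterior probability of each component for each x
posterior_prob <- function(x, lambda, mu, sigma) {
  # One column per component: mixing weight times normal density
  w <- sapply(seq_along(mu), function(k) {
    lambda[k] * dnorm(x, mean = mu[k], sd = sigma[k])
  })
  w <- matrix(w, nrow = length(x))
  # Normalize each row so the probabilities sum to 1
  w / rowSums(w)
}

# Hypothetical strikeout totals: low, midpoint of the two means, high
so <- c(400, mean(mu), 1000)
p  <- posterior_prob(so, lambda, mu, sigma)

print(round(cbind(TEAM_BATTING_SO = so, p1 = p[, 1], p2 = p[, 2]), 3))
```

Under these assumptions the midpoint of the two means gets a probability of 0.5 for each group, and values further from it lean toward one group or the other.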

At this point, we could optionally drop the original TEAM_BATTING_SO feature and continue with any modeling approach using the two new features, TEAM_BATTING_SO_p1 and TEAM_BATTING_SO_p2.

References