Bimodal features in a dataset are problematic and interesting, and they can be an area of opportunity and exploration. A bimodal distribution suggests that the feature mixes two different subgroups or classes. I will use the classic Moneyball dataset and its TEAM_BATTING_SO feature to illustrate.
library(tidyverse)
library(mixtools)
library(ggplot2)
# Load Moneyball baseball dataset
df <- read.csv('datasets/moneyball-training-data.csv')
# Histogram of TEAM_BATTING_SO with a kernel density overlay
ggplot(df) +
  geom_histogram(aes(x = TEAM_BATTING_SO, y = after_stat(density)), bins = 30) +
  geom_density(aes(x = TEAM_BATTING_SO), color = 'blue')
Notice the clear bimodal structure, which suggests one group of teams that strikes out more often at bat and a second group that strikes out less often. If we use this feature as-is with standard linear modeling techniques, a single coefficient has to describe both groups, and prediction accuracy is likely to suffer. A better approach is to identify the underlying subgroup distributions, then determine the probability that a given team's TEAM_BATTING_SO value belongs to each distribution.
The first step is to understand and model the likely subgroup distributions. R provides a package, mixtools (see the package vignette), for fitting finite mixture models, in which the data are treated as draws from several subgroup distributions. Here is a quick example showing a possible mixture within TEAM_BATTING_SO. Note that in the normalmixEM() call, lambda, mu, and sigma are starting values for the fit, not hard bounds. From the figure above, we see peaks at ~600 and ~900. I chose starting means of 300 and 1200, which bracket both peaks, so that the mixture model could capture both subgroups.
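As an aside, the peak locations read off the histogram can also be estimated programmatically, which may help when choosing starting means. The following is a minimal sketch in base R, not part of the original analysis: the helper name find_modes is hypothetical, and the data are simulated stand-ins rather than the Moneyball values.

```r
# Hypothetical helper: locate the highest local maxima of a kernel density estimate.
find_modes <- function(x, n_modes = 2) {
  d <- density(x)
  # A local maximum is where the slope changes from positive to negative
  peaks <- which(diff(sign(diff(d$y))) == -2) + 1
  # Keep the n_modes tallest peaks and report their x positions in ascending order
  top <- peaks[order(d$y[peaks], decreasing = TRUE)][seq_len(n_modes)]
  sort(d$x[top])
}

# Simulated bimodal data (an assumption for illustration only)
set.seed(42)
x_sim <- c(rnorm(1000, mean = 500, sd = 80), rnorm(1000, mean = 900, sd = 80))
modes <- find_modes(x_sim)
print(modes)
```

The result depends on the bandwidth that density() picks, so treat the estimated modes as rough starting points, not as final parameter estimates.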
# Select the TEAM_BATTING_SO column and remove any missing data
df_mix <- df %>%
  dplyr::select(TEAM_BATTING_SO) %>%
  drop_na()
# Fit a two-component normal mixture to TEAM_BATTING_SO
# (lambda, mu, and sigma are starting values for the EM algorithm)
model <- normalmixEM(df_mix$TEAM_BATTING_SO,
                     lambda = .5,
                     mu = c(300, 1200),
                     sigma = 5,
                     maxit = 100)
## number of iterations= 61
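To make the fitting step less of a black box, here is a minimal sketch of the expectation-maximization idea for a two-component normal mixture, written in base R. It is an illustration under stated assumptions, not the mixtools implementation: the function name em_two_normals, its defaults, and the simulated data are all hypothetical.

```r
# Hypothetical minimal EM for a two-component univariate normal mixture.
em_two_normals <- function(x, mu, sigma = rep(sd(x), 2), lambda = c(0.5, 0.5),
                           maxit = 500, tol = 1e-8) {
  loglik_old <- -Inf
  for (iter in seq_len(maxit)) {
    # E-step: weighted component densities, one column per component
    dens <- cbind(lambda[1] * dnorm(x, mu[1], sigma[1]),
                  lambda[2] * dnorm(x, mu[2], sigma[2]))
    total <- rowSums(dens)
    post <- dens / total  # membership probabilities for each observation
    # M-step: update weights, means, and standard deviations
    n_k <- colSums(post)
    lambda <- n_k / length(x)
    mu <- colSums(post * x) / n_k
    sigma <- sqrt(colSums(post * outer(x, mu, "-")^2) / n_k)
    # Stop once the log-likelihood no longer improves
    loglik <- sum(log(total))
    if (abs(loglik - loglik_old) < tol) break
    loglik_old <- loglik
  }
  list(lambda = lambda, mu = mu, sigma = sigma, posterior = post,
       loglik = loglik, iterations = iter)
}

# Simulated data standing in for a bimodal feature (illustration only)
set.seed(1)
x_sim <- c(rnorm(1000, mean = 500, sd = 80), rnorm(1000, mean = 900, sd = 80))
fit <- em_two_normals(x_sim, mu = c(300, 1200))
print(round(fit$mu, 1))
```

Starting means that bracket the true peaks, as in the normalmixEM() call above, are usually enough for the iterations to settle on the two subgroups when they are well separated.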
# Simple plot to illustrate possible bimodal mix of groups
plot(model,
     whichplots = 2,
     density = TRUE,
     main2 = "TEAM_BATTING_SO Possible Distributions",
     xlab2 = "TEAM_BATTING_SO")
Observe that there is a crossover point at ~715 where the probability of belonging to either group is equal. Below this point, it is more likely that the team belongs to the red group; above it, it is more likely that the team belongs to the green group. However, because the two distributions overlap, there is some uncertainty.
The first distribution (red) has \(\mu_1 \approx 523.85\) and \(\sigma_1 \approx 158.22\), and the second distribution (green) has \(\mu_2 \approx 908.94\) and \(\sigma_2 \approx 158.22\).
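The crossover point can also be found numerically as the value where the two weighted component densities are equal. The sketch below uses base R's uniroot() with the means and standard deviations reported above. The mixing weights are not reported in the text, so equal weights of 0.5 are an assumption here, and the function name crossover is hypothetical.

```r
# Hypothetical helper: solve lambda1 * f1(x) = lambda2 * f2(x) between the two means.
crossover <- function(lambda, mu, sigma) {
  f <- function(x) lambda[1] * dnorm(x, mu[1], sigma[1]) -
    lambda[2] * dnorm(x, mu[2], sigma[2])
  uniroot(f, interval = c(mu[1], mu[2]), tol = 1e-9)$root
}

mu <- c(523.8452395, 908.9373793)     # fitted means reported in the text
sigma <- c(158.2155465, 158.2155465)  # fitted standard deviations reported in the text
lambda <- c(0.5, 0.5)                 # ASSUMPTION: equal mixing weights

x_cross <- crossover(lambda, mu, sigma)
print(x_cross)
```

With equal weights and equal standard deviations the crossover is simply the midpoint of the two means, about 716, which is close to the ~715 read off the plot. Unequal fitted weights would shift it toward the smaller group.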
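We now have two possible feature engineering options.
The first is a hard split at the crossover point: TEAM_BATTING_SO < 715 is assigned a value of 1 and TEAM_BATTING_SO >= 715 is assigned a value of 2. The second is a soft assignment: for each value of TEAM_BATTING_SO, we add two new features that score how strongly that value is associated with the lower or the higher distribution.
For this example, I will use the latter approach. The code below uses normal tail probabilities as those scores: the upper tail under the lower distribution and the lower tail under the higher distribution.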
# Upper-tail probability under component 1, lower-tail probability under component 2
df_mix$TEAM_BATTING_SO_p1 <- pnorm(df_mix$TEAM_BATTING_SO, mean = model$mu[1], sd = model$sigma[1], lower.tail = FALSE)
df_mix$TEAM_BATTING_SO_p2 <- pnorm(df_mix$TEAM_BATTING_SO, mean = model$mu[2], sd = model$sigma[2], lower.tail = TRUE)
At this point, we could optionally drop the original TEAM_BATTING_SO feature and continue with any modeling approach using the two new features, TEAM_BATTING_SO_p1 and TEAM_BATTING_SO_p2.
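One variation worth considering: a tail probability from pnorm() is not the same quantity as the probability that a value belongs to a component. Under the mixture model, the membership probability follows from Bayes' rule, \(p_k(x) = \lambda_k f_k(x) / (\lambda_1 f_1(x) + \lambda_2 f_2(x))\), where \(f_k\) is the normal density of component \(k\). The sketch below computes it in base R using the parameters reported earlier. The helper name membership_prob is hypothetical and the equal mixing weights are an assumption; in practice the fitted weights would come from model$lambda, and mixtools should already expose these probabilities for the fitted data in model$posterior.

```r
# Hypothetical helper: posterior membership probabilities for a two-component normal mixture.
membership_prob <- function(x, lambda, mu, sigma) {
  w1 <- lambda[1] * dnorm(x, mu[1], sigma[1])
  w2 <- lambda[2] * dnorm(x, mu[2], sigma[2])
  cbind(p1 = w1 / (w1 + w2), p2 = w2 / (w1 + w2))
}

mu <- c(523.8452395, 908.9373793)     # fitted means reported in the text
sigma <- c(158.2155465, 158.2155465)  # fitted standard deviations reported in the text
lambda <- c(0.5, 0.5)                 # ASSUMPTION: equal mixing weights

# Evaluate below the crossover, at the midpoint of the means, and above the crossover
probs <- membership_prob(c(400, mean(mu), 1000), lambda, mu, sigma)
print(round(probs, 3))
```

Unlike the two tail probabilities, these two columns always sum to 1 for each row, so only one of them carries independent information in a downstream model.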