TPOT Regressor vs. Neural Network Regressor

TPOT is an automated machine learning library for Python that uses genetic programming to search for a well-performing model pipeline for a given classification or regression problem.

TPOT documentation can be found here.

TPOT tutorials are offered through DataCamp and Kaggle.

More detailed explanations of TPOT can be found on Towards Data Science.

First, we will use TPOT Regression to predict the number of wins an NBA team will have and evaluate model performance. Then we will tackle the same problem with a Keras neural network and evaluate its performance, allowing us to compare TPOT's automated regression with deep learning on a regression problem.

Let’s begin with TPOT Regression.

Read in the data found here

import pandas as pd
dt = pd.read_csv('C:/Users/aengland/Documents/dt_NBA_reg.csv')

Get the names of the columns in the dataset

print(dt.columns)
## Index(['W', 'PTS', 'oppPTS', 'FG', 'FGA', '2P', '2PA', '3P', '3PA', 'FT',
##        'FTA', 'ORB', 'DRB', 'AST', 'STL', 'BLK', 'TOV'],
##       dtype='object')

We will be using PTS (points scored), oppPTS (opposing team points), FG (field goals made), FGA (field goal attempts), 2P (2-point field goals made), 2PA (2-point field goal attempts), 3P (3-point field goals made), 3PA (3-point field goal attempts), FT (free throws made), FTA (free throw attempts), ORB (offensive rebounds), DRB (defensive rebounds), AST (assists), STL (steals), BLK (blocks), and TOV (turnovers) to predict W (number of wins).

Preview the data

print(dt.head(5))
##     W   PTS  oppPTS    FG   FGA    2P  ...    ORB   DRB   AST  STL  BLK   TOV
## 0  44  8032    7999  3084  6644  2378  ...    758  2593  2007  664  369  1219
## 1  49  7944    7798  2942  6544  2314  ...   1047  2460  1668  599  391  1206
## 2  21  7661    8418  2823  6649  2354  ...    917  2389  1587  591  479  1153
## 3  45  7641    7615  2926  6698  2480  ...   1026  2514  1886  588  417  1171
## 4  24  7913    8297  2993  6901  2446  ...   1004  2359  1694  647  334  1149
## 
## [5 rows x 17 columns]

Get the DataFrame info

print(dt.info())
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 863 entries, 0 to 862
## Data columns (total 17 columns):
## W         863 non-null int64
## PTS       863 non-null int64
## oppPTS    863 non-null int64
## FG        863 non-null int64
## FGA       863 non-null int64
## 2P        863 non-null int64
## 2PA       863 non-null int64
## 3P        863 non-null int64
## 3PA       863 non-null int64
## FT        863 non-null int64
## FTA       863 non-null int64
## ORB       863 non-null int64
## DRB       863 non-null int64
## AST       863 non-null int64
## STL       863 non-null int64
## BLK       863 non-null int64
## TOV       863 non-null int64
## dtypes: int64(17)
## memory usage: 114.7 KB
## None

Get the descriptive statistics for each variable

print(dt.describe())
##                 W           PTS     ...              BLK          TOV
## count  863.000000    863.000000     ...       863.000000   863.000000
## mean    40.989571   8360.232908     ...       419.793743  1299.221321
## std     12.744268    577.260038     ...        81.956890   153.200143
## min     11.000000   6901.000000     ...       204.000000   931.000000
## 25%     31.000000   7930.500000     ...       359.000000  1192.000000
## 50%     42.000000   8296.000000     ...       410.000000  1280.000000
## 75%     50.500000   8769.000000     ...       468.500000  1391.500000
## max     72.000000  10371.000000     ...       716.000000  1873.000000
## 
## [8 rows x 17 columns]

Check each column for the proportion of missing values

print(dt.isnull().sum()/dt.shape[0])
## W         0.0
## PTS       0.0
## oppPTS    0.0
## FG        0.0
## FGA       0.0
## 2P        0.0
## 2PA       0.0
## 3P        0.0
## 3PA       0.0
## FT        0.0
## FTA       0.0
## ORB       0.0
## DRB       0.0
## AST       0.0
## STL       0.0
## BLK       0.0
## TOV       0.0
## dtype: float64

Get X’s and y

X = dt.drop('W', axis=1)
y = dt['W']

Create train and test sets

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

Instantiate a TPOTRegressor model with an early-stop patience of 3 generations and a maximum of 20 generations

from tpot import TPOTRegressor
model = TPOTRegressor(generations=20, verbosity=2, scoring='r2', early_stop=3)

Fit the model to the training dataset

model.fit(X_train, y_train)

Show the best pipeline

print(model.fitted_pipeline_)
## Pipeline(memory=None,
##      steps=[('stackingestimator-1', StackingEstimator(estimator=LassoLarsCV(copy_X=True, cv=None, eps=2.220446049250313e-16,
##       fit_intercept=True, max_iter=500, max_n_alphas=1000, n_jobs=1,
##       normalize=True, positive=False, precompute='auto', verbose=False))), ('normalizer', Normalizer(copy=True,...x_n_alphas=1000, n_jobs=1,
##       normalize=True, positive=False, precompute='auto', verbose=False))])

Get predictions on the testing data

predictions = model.predict(X_test)

Plot a scatterplot of the actual vs. predicted values with a trendline and a Pearson correlation (r)
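
Here is a minimal sketch of that plot, assuming matplotlib and scipy are installed; the labels, colors, and formatting are illustrative rather than taken from the original:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

# Pearson correlation between actual and predicted wins
r, p_value = pearsonr(y_test, predictions)

# Scatterplot with a least-squares trendline
plt.scatter(y_test, predictions)
slope, intercept = np.polyfit(y_test, predictions, 1)
xs = np.sort(y_test)
plt.plot(xs, slope * xs + intercept, color='red')
plt.xlabel('Actual Wins')
plt.ylabel('Predicted Wins')
plt.title('Actual vs. Predicted Wins (r = {0:.3f})'.format(r))
plt.show()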

Print the interpretation of the Pearson r correlation coefficient
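
A sketch of how such an interpretation might be printed, reusing the r computed in the plot above; the cutoffs are common rules of thumb, not fixed standards:

# Map |r| onto a rough verbal scale
if abs(r) >= 0.9:
    strength = 'very strong'
elif abs(r) >= 0.7:
    strength = 'strong'
elif abs(r) >= 0.5:
    strength = 'moderate'
else:
    strength = 'weak'
direction = 'positive' if r > 0 else 'negative'
print('There is a {0}, {1} linear relationship between the '
      'predicted and actual values.'.format(strength, direction))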

## There is a very strong, positive linear relationship between the predicted and actual values.

Print the regression metrics
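
These are standard metrics available in sklearn.metrics; a minimal sketch of how they could be produced (the label spacing mirrors the output below):

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

print('MAE:        {0:.3f}'.format(mean_absolute_error(y_test, predictions)))
print('MSE:        {0:.3f}'.format(mean_squared_error(y_test, predictions)))
print('RMSE:       {0:.3f}'.format(np.sqrt(mean_squared_error(y_test, predictions))))
print('R-Squared:  {0:.3f}'.format(r2_score(y_test, predictions)))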

## MAE:        2.476
## MSE:        9.263
## RMSE:       3.043
## R-Squared:  0.946

Plot histogram of the residuals
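
A minimal sketch with matplotlib (density=True gives a normalized histogram):

import matplotlib.pyplot as plt

# Residuals: actual minus predicted wins
residuals = y_test - predictions
plt.hist(residuals, density=True)
plt.xlabel('Residual')
plt.ylabel('Density')
plt.title('Distribution of Residuals')
plt.show()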

Check the residuals for normality using the Shapiro-Wilk test
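
A sketch using scipy.stats.shapiro on the residuals from the histogram step above, with the conventional alpha of 0.05 assumed:

from scipy.stats import shapiro

stat, p = shapiro(residuals)
print('Shapiro-Wilk test statistic: {0}'.format(round(stat, 2)))
print('p-value: {0}'.format(round(p, 2)))
if p >= 0.05:
    print('Fail to reject the null hypothesis. Data is normally distributed.')
else:
    print('Reject the null hypothesis. Data is not normally distributed.')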

## Shapiro-Wilk test statistic: 1.0
## p-value: 0.67
## Fail to reject the null hypothesis. Data is normally distributed.

How accurate was this model?

## This model was able to predict within +/- 2.476 wins, on average (per the MAE)

How long did it take to build this model?
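
One way to measure this is to wrap the TPOT fit from earlier with the standard time module; a sketch:

import time

start = time.time()
model.fit(X_train, y_train)  # the TPOT fit shown earlier
elapsed_minutes = (time.time() - start) / 60
print('Time to complete the TPOT model: {0:.2f} min.'.format(elapsed_minutes))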

## Time to complete the TPOT model: 11.67 min.

Now, let’s use a Keras neural network to predict how many wins an NBA team will have.

We will use 19 nodes in the hidden layer. To choose that number, I ran a nested loop over candidate models with 12-24 nodes in the hidden layer, 10 iterations each, and the model with 19 nodes had the greatest mean R-Squared and the lowest mean RMSE (for more information on this see this article).
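
A sketch of that kind of search, assuming the standardized features and train/test split built below; the exact loop in the linked article may differ:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from sklearn.metrics import r2_score

mean_r2_by_nodes = {}
for n_nodes in range(12, 25):            # candidate hidden-layer sizes 12-24
    r2_scores = []
    for i in range(10):                  # repeat to average out random initialization
        model = Sequential()
        model.add(Dense(units=X_train.shape[1], activation='relu',
                        input_shape=(X_train.shape[1],)))
        model.add(Dense(units=n_nodes, activation='relu'))
        model.add(Dense(units=1))
        model.compile(loss='mean_squared_error', optimizer='adam')
        model.fit(X_train, y_train, epochs=50, verbose=0)
        r2_scores.append(r2_score(y_test, model.predict(X_test)[:, 0]))
    mean_r2_by_nodes[n_nodes] = np.mean(r2_scores)

# Best node count by mean R-Squared
print(max(mean_r2_by_nodes, key=mean_r2_by_nodes.get))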

Read in the data

import pandas as pd
dt = pd.read_csv('C:/Users/aengland/dt_NBA_reg.csv')

Standardize the predictor variables.

from sklearn.preprocessing import StandardScaler

# Standardize predictor variables
scaler = StandardScaler()
# Fit scaler to the features
scaler.fit(dt.drop('W', axis=1))
# Transform features to the scaled version
scaled_features = scaler.transform(dt.drop('W', axis=1))
# Save into a data frame
dt_feat = pd.DataFrame(scaled_features, columns=dt.loc[:, dt.columns != 'W'].columns)

Get X’s and y

X = dt_feat
y = dt['W']

Create train and test sets

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

Save the weights of the model with the lowest validation loss

from keras.callbacks import ModelCheckpoint, EarlyStopping

filepath = 'C:/Users/aengland/model_weights.hdf5'
# Save weights to filepath whenever val_loss improves
callbacks_list = [ModelCheckpoint(filepath, monitor='val_loss', verbose=True, save_best_only=True, mode='min')]

from keras.models import Sequential
from keras.layers import Dense

# This model has 1 hidden layer with 19 nodes
model = Sequential()
# Add first layer (one unit per feature)
model.add(Dense(units=len(X.columns), activation='relu', input_shape=(len(X.columns),)))
# Add hidden layer
model.add(Dense(units=19, activation='relu'))
# Add output layer (a single unit for the predicted win total)
model.add(Dense(units=1))

Compile the model

model.compile(loss='mean_squared_error', optimizer='adam')

Fit the model, letting the ModelCheckpoint callback save the best weights

model.fit(X, y, validation_split=0.33, epochs=300, callbacks=callbacks_list, verbose=True)

Load in the weights of the best model

model.load_weights('C:/Users/aengland/model_weights.hdf5')

Re-compile the model

model.compile(loss='mean_squared_error', optimizer='adam')

Re-fit the model

model.fit(X_train, y_train, validation_split=0.33, epochs=10, callbacks=[EarlyStopping(patience=3)], verbose=True)
## Train on 387 samples, validate on 191 samples
## Epoch 1/10
## 387/387 [==============================] - 0s 394us/step - loss: 9.6275 - val_loss: 9.0791
## Epoch 2/10
## 387/387 [==============================] - 0s 23us/step - loss: 9.5175 - val_loss: 9.1212
## Epoch 3/10
## 387/387 [==============================] - 0s 23us/step - loss: 9.4638 - val_loss: 9.0533
## Epoch 4/10
## 387/387 [==============================] - 0s 21us/step - loss: 9.2282 - val_loss: 9.0321
## Epoch 5/10
## 387/387 [==============================] - 0s 23us/step - loss: 9.0893 - val_loss: 9.0648
## Epoch 6/10
## 387/387 [==============================] - 0s 23us/step - loss: 9.1010 - val_loss: 9.1082
## Epoch 7/10
## 387/387 [==============================] - 0s 23us/step - loss: 9.1061 - val_loss: 9.2011

Get predictions

predictions = model.predict(X_test)[:,0]

Plot a scatterplot of the actual vs. predicted values with a trendline and a Pearson correlation (r)
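
The same plotting sketch used in the TPOT section applies here, now with the neural network's predictions (again assuming matplotlib and scipy; the plot styling is illustrative):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

# Pearson correlation between actual and predicted wins
r, p_value = pearsonr(y_test, predictions)

# Scatterplot with a least-squares trendline
plt.scatter(y_test, predictions)
slope, intercept = np.polyfit(y_test, predictions, 1)
xs = np.sort(y_test)
plt.plot(xs, slope * xs + intercept, color='red')
plt.xlabel('Actual Wins')
plt.ylabel('Predicted Wins')
plt.title('Actual vs. Predicted Wins (r = {0:.3f})'.format(r))
plt.show()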