project1

2022-11-17

The Dataset

For this project, I will be using the NFL Combine dataset from kaggle.com. The NFL Combine is a set of athletic performance tests for college football players entering the NFL. The dataset records each player's results in the combine events they took part in.

Some of the information in this dataset that is useful for this project:

the names of the players
their performances in the events they participated in (including 40 yard dash time and vertical jump)
the position they play
their draft position
their AV (Approximate Value), the Pro Football Reference metric that puts a single number on each player’s performance

Importing The Dataset

The dataset is available on Kaggle at: https://www.kaggle.com/datasets/savvastj/nfl-combine-data

df <- read.csv(file = 'data/combine_data_since_2000_PROCESSED_2018-04-26.csv')

The Objective

With this dataset, I want to build models that relate a player’s draft position to their combine performance. By comparing combine results among players at the same position, we can see which events (if any) play a critical role in draft position, and so better understand how much the physical tests really matter.
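One way to start on this objective is to look, within each position, at how strongly each combine event correlates with draft pick. The sketch below is only an illustration: it assumes the column names used later in this report (Pos, Pick, Forty, Vertical, Cone), the helper name event_pick_correlations is hypothetical, and the table is made up rather than taken from the dataset.

```python
import pandas as pd

def event_pick_correlations(df, events=("Forty", "Vertical", "Cone")):
    """Pearson correlation of each event with Pick, computed per position."""
    cols = ["Pick", *events]
    # Only rows with a pick and all event results can be compared
    complete = df.dropna(subset=cols)
    rows = {}
    for pos, group in complete.groupby("Pos"):
        # corr() gives the full matrix; keep the Pick column for the events
        rows[pos] = group[cols].corr()["Pick"].loc[list(events)]
    # One row per position, one column per event
    return pd.DataFrame(rows).T

# Made-up numbers for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "Pos":      ["WR", "WR", "WR", "WR", "RB", "RB", "RB", "RB"],
    "Pick":     [10, 45, 90, 200, 20, 60, 150, None],
    "Forty":    [4.35, 4.42, 4.50, 4.61, 4.40, 4.48, 4.60, 4.55],
    "Vertical": [40.0, 38.5, 36.0, 33.0, 39.0, 35.5, 34.0, 36.0],
    "Cone":     [6.70, 6.85, 6.90, 7.10, 6.80, 6.95, 7.05, 7.00],
})
corr_by_pos = event_pick_correlations(toy)
print(corr_by_pos.round(2))
```

A positive value for Forty would mean slower times go with later picks, while a negative value for Vertical would mean higher jumps go with earlier picks. Correlation alone does not show that an event drives draft position, so this is a first look rather than a model.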

Python

library(reticulate)
conda_create('r-reticulate', packages = "python")

## + '/Users/derrickhiebert/opt/anaconda3/bin/conda' 'create' '--yes' '--name' 'r-reticulate' 'python' '--quiet' '-c' 'conda-forge'

conda_install('r-reticulate', packages = 'pandas')

## + '/Users/derrickhiebert/opt/anaconda3/bin/conda' 'install' '--yes' '--name' 'r-reticulate' '-c' 'conda-forge' 'pandas'

#conda_install('r-reticulate', packages = 'scikit-learn')
#use_python("/usr/bin/python3")
py_install("scikit-learn")

## + '/Users/derrickhiebert/.virtualenvs/r-reticulate/bin/python' -m pip install --upgrade --no-user 'scikit-learn'
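The output above shows packages going into both a conda environment and a virtualenv, so it may be worth confirming from a Python chunk that the packages are importable in whichever interpreter reticulate ends up using. This is a minimal sketch using only the standard library; the helper name missing_packages is hypothetical.

```python
import importlib.util

def missing_packages(names):
    """Return the names from `names` that cannot be imported here."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Import names, not install names: scikit-learn is imported as "sklearn"
required = ["pandas", "sklearn"]
missing = missing_packages(required)
print("missing:", missing if missing else "none")
```

If anything is reported missing, the install probably went to a different environment than the one in use.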

Python Linear Regression Model

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Load the combine data and create the regression model
df = pd.read_csv('data/combine_data_since_2000_PROCESSED_2018-04-26.csv')
lr = LinearRegression()

# Keep drafted players who have results for all three events
df1 = df.dropna(subset=["Pick", "Forty", "Vertical", "Cone"])
# Restrict to wide receivers and running backs
df1 = df1[df1["Pos"].isin(["WR", "RB"])]
# Features: the three combine events; target: draft pick number
x1 = df1[["Forty", "Vertical", "Cone"]]
y1 = df1[["Pick"]]
# Hold out 25% of the players for testing
x_train1, x_test1, y_train1, y_test1 = train_test_split(x1, y1, test_size = 0.25, random_state = 0)
lr.fit(x_train1, y_train1)

LinearRegression()
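Once a model like this is fitted, its coefficients and its R² on the held-out data give a quick read on how much each event contributes. The sketch below uses synthetic data so it runs on its own; the names demo, model, coefs and r2_test are hypothetical, and the relationship between 40 time and pick is an assumption made up for the demo, not a result from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the combine table (values are made up)
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "Forty": rng.normal(4.5, 0.1, n),
    "Vertical": rng.normal(36.0, 3.0, n),
    "Cone": rng.normal(6.9, 0.15, n),
})
# Assumed relationship for the demo: slower 40 time -> later pick, plus noise
demo["Pick"] = 120 + 300 * (demo["Forty"] - 4.5) + rng.normal(0, 60, n)

x = demo[["Forty", "Vertical", "Cone"]]
y = demo["Pick"]
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0
)

model = LinearRegression().fit(x_train, y_train)

# One coefficient per feature: change in predicted pick per unit of the event
coefs = pd.Series(model.coef_, index=x.columns)
# score() returns R^2 on the data passed in; here, the held-out test set
r2_test = model.score(x_test, y_test)
print(coefs.round(1))
print("test R^2:", round(r2_test, 3))
```

The events are measured in different units (seconds, inches), so the raw coefficients are not directly comparable with each other unless the features are standardized first.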


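# Predict picks for the held-out players
y_pred1 = lr.predict(x_test1)
# Put actual and predicted picks in matching one-column frames
dfx1 = pd.DataFrame(y_test1.iloc[:,0].values)  # actual picks
dfy1 = pd.DataFrame(y_pred1)                   # predicted picks
# Residuals, computed here as predicted minus actual
dfr1 = dfy1 - dfx1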
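Alongside a histogram, a few summary numbers can make the residuals easier to judge. The sketch below keeps the same sign convention as the code above (predicted minus actual); the helper name residual_summary is hypothetical and the picks are made up for illustration.

```python
import numpy as np

def residual_summary(actual, predicted):
    """Summarise prediction errors, with residual = predicted - actual."""
    actual = np.asarray(actual, dtype=float).ravel()
    predicted = np.asarray(predicted, dtype=float).ravel()
    resid = predicted - actual
    return {
        # > 0 means predicted pick numbers are larger (later) on average
        "mean_error": resid.mean(),
        # typical size of a miss, in draft slots
        "mae": np.abs(resid).mean(),
        # like MAE, but penalises large misses more
        "rmse": np.sqrt((resid ** 2).mean()),
    }

# Made-up picks for illustration only
actual_picks = [10, 50, 120, 200]
predicted_picks = [60, 80, 110, 140]
summary = residual_summary(actual_picks, predicted_picks)
print(summary)
```

With the variables from this report, the call would be residual_summary(y_test1, y_pred1).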

LinearRegression()

[Figure: Predicted Pick based on Combine Tests]

[Figure: Histogram of Residuals from Linear Regression Model]

[Figure: Predicted Pick based on Approximate Value]

[Figure: Histogram of Residuals from Linear Regression Model]

Conclusion

From the data and the prediction models, the combine events on their own are not a good basis for determining draft position, because many factors outside the combine affect where a player is picked. Taken as a whole, and alongside what scouts already know about these players, the combine can still be viewed as important. In the plot of predicted versus actual picks for AV (approximate value), the model rarely projected players to go late, but a correlation between combine score and draft position can still be seen in it. To conclude, combine performance may not move a player’s draft position much by itself, but when the right variables are compared a relation does appear. That gives us a rough way to predict draft position from the combine, which is not normally viewed as the best way to judge draft stock.
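One possible reading of the model rarely projecting late picks is a general property of least squares, not something specific to this data: when the predictors explain only part of the variation, the fitted values are less spread out than the actual values. The sketch below shows this on made-up data with one weak predictor; all names and numbers are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
# Made-up data: one weak predictor of pick number
score = rng.normal(0.0, 1.0, n)
pick = 130 + 20 * score + rng.normal(0.0, 70.0, n)

# Ordinary least squares with an intercept
X = np.column_stack([np.ones(n), score])
beta, *_ = np.linalg.lstsq(X, pick, rcond=None)
pred = X @ beta

# In-sample R^2, and the ratio of predicted to actual variance
r2 = 1 - ((pick - pred) ** 2).sum() / ((pick - pick.mean()) ** 2).sum()
var_ratio = pred.var() / pick.var()
print("R^2:", round(r2, 3), "variance ratio:", round(var_ratio, 3))
print("latest actual pick:", round(pick.max()),
      "latest predicted:", round(pred.max()))
```

On the data used for fitting, the two printed ratios agree, so a low R² goes hand in hand with predictions bunched around the average pick and few very early or very late projections.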