# Checkpoint #1

# Libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)
library(dplyr)
library(ggplot2)
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
NFL <- read.csv("~/Downloads/local/bing/2025 - 2026 junior/fall 2025/dida 325/NFL.csv")
View(NFL)

# Who has hired you?
# We work as scouts for the Buffalo Bills. Our focus is finding new talent and predicting which players will be drafted in the NFL draft, and when.

# What are their goals for this analysis?
# To find new talent to enhance our team's roster: young players with strong results across all combine events.

# What is the origin of the dataset
# The dataset comes from the Pro Football Reference site:
# https://www.pro-football-reference.com/?utm_source=pfr&utm_medium=sr_xsite&utm_campaign=2023_01_srnav

# Finding if the columns are numerical or categorical
str(NFL)
## 'data.frame':    3477 obs. of  18 variables:
##  $ Year               : int  2009 2009 2009 2009 2009 2009 2009 2009 2009 2009 ...
##  $ Player             : chr  "Beanie Wells\\WellCh00" "Will Davis\\DaviWi99" "Herman Johnson\\JohnHe23" "Rashad Johnson\\JohnRa98" ...
##  $ Age                : int  20 22 24 23 22 23 24 21 23 22 ...
##  $ School             : chr  "Ohio St." "Illinois" "LSU" "Alabama" ...
##  $ Height             : num  1.85 1.88 2.01 1.8 1.88 ...
##  $ Weight             : num  106.6 118.4 165.1 92.1 110.7 ...
##  $ Sprint_40yd        : num  4.38 4.84 5.5 4.49 4.76 5.28 4.98 5.32 4.53 4.44 ...
##  $ Vertical_Jump      : num  85.1 83.8 NA 94 92.7 ...
##  $ Bench_Press_Reps   : int  25 27 21 15 26 29 NA 19 28 14 ...
##  $ Broad_Jump         : num  325 292 NA 305 305 ...
##  $ Agility_3cone      : num  NA 7.38 NA 7.09 7.1 NA NA 7.87 7.46 6.93 ...
##  $ Shuttle            : num  NA 4.45 NA 4.23 4.4 NA NA 4.88 4.43 4.16 ...
##  $ Drafted..tm.rnd.yr.: chr  "Arizona Cardinals / 1st / 31st pick / 2009" "Arizona Cardinals / 6th / 204th pick / 2009" "Arizona Cardinals / 5th / 167th pick / 2009" "Arizona Cardinals / 3rd / 95th pick / 2009" ...
##  $ BMI                : num  31 33.5 41 28.3 31.3 ...
##  $ Player_Type        : chr  "offense" "defense" "offense" "defense" ...
##  $ Position_Type      : chr  "backs_receivers" "defensive_lineman" "offensive_lineman" "defensive_back" ...
##  $ Position           : chr  "RB" "DE" "OG" "FS" ...
##  $ Drafted            : chr  "Yes" "Yes" "Yes" "Yes" ...
# Explaining what each column represents:
# Year: The year of the record (2009 in the rows shown above, matching the draft year)
# Player: The player's name, followed by a backslash and an ID code
# Age: The player's age in years
# School: The college the player attended
# Height: The player's height in meters
# Weight: The player's weight in kilograms
# Sprint_40yd: The player's 40-yard sprint time in seconds
# Vertical_Jump: The player's vertical jump in centimeters
# Bench_Press_Reps: Maximum bench press reps completed at 102.1 kg
# Broad_Jump: The player's broad jump in centimeters
# Agility_3cone: 3-cone agility drill time in seconds
# Shuttle: Lateral shuttle time in seconds
# Drafted..tm.rnd.yr.: The team that drafted the player, the draft round, the draft pick, and the year
# BMI: Body mass index in kilograms per square meter (kg/m^2)
# Player_Type: Whether the player is classified as offense, defense, or special teams
# Position_Type: Broad classification of the player's position
# Position: The player's specific position
# Drafted: Whether the player was drafted in the NFL draft
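
# The Drafted..tm.rnd.yr. column packs four values into one string, and the Player column carries an ID code after a backslash. Below is a minimal sketch of how both could be split apart with base R. The helper names parse_draft() and clean_player() are hypothetical and not part of the original analysis, and the sketch assumes every drafted row follows the "team / round / pick / year" layout seen in the str() output.

```r
# Split the combined draft string into team, round, pick, and year.
# parse_draft() is a hypothetical helper, not part of the original analysis.
parse_draft <- function(x) {
  parts <- strsplit(as.character(x), " / ", fixed = TRUE)
  # Return the i-th piece, or NA when the string is missing or malformed
  piece <- function(i) {
    vapply(parts, function(p) if (length(p) == 4) p[i] else NA_character_, character(1))
  }
  data.frame(
    Team  = piece(1),
    Round = as.integer(gsub("[^0-9]", "", piece(2))),  # "1st" -> 1
    Pick  = as.integer(gsub("[^0-9]", "", piece(3))),  # "31st pick" -> 31
    Year  = as.integer(piece(4)),
    stringsAsFactors = FALSE
  )
}

# Drop the backslash and the ID code that follows the player's name
clean_player <- function(x) sub("\\\\.*$", "", x)

# Example values copied from the str() output above, plus one missing entry
draft_example  <- parse_draft(c("Arizona Cardinals / 1st / 31st pick / 2009", NA))
player_example <- clean_player("Beanie Wells\\WellCh00")
print(draft_example)
print(player_example)
```

# Having Round and Pick as integers would make it possible to study when a player is drafted, not only whether he is drafted.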
# Checkpoint #2

# Research Questions
# These questions explore the data: which combine metrics matter most, whether their effects are statistically significant, and how results vary by position and school.

# Research Question 1:
# Can we predict which quarterback should be drafted first based on their overall combine performance score?

# Research Question 2:
# Regression: Which of the combine attributes are most significant when it comes to being drafted?

# Research Question 3:
# Data Visualization: Do combine performance metrics differ across positions? Do WRs, CBs, RBs, etc. show similar speed and strength profiles?

# Research Question 4:
# Data Visualization: Does coming from a certain school significantly impact the odds of being drafted into the NFL?
# Checkpoint #3

# Research Question 2:
# Regression: Which of the combine attributes are most significant when it comes to being drafted?
NFL$Drafted <- as.factor(NFL$Drafted)
model_draft <- glm(Drafted ~ Sprint_40yd + Vertical_Jump + Bench_Press_Reps + Broad_Jump + Agility_3cone + Shuttle + Weight, data = NFL, family = binomial)
summary(model_draft)
## 
## Call:
## glm(formula = Drafted ~ Sprint_40yd + Vertical_Jump + Bench_Press_Reps + 
##     Broad_Jump + Agility_3cone + Shuttle + Weight, family = binomial, 
##     data = NFL)
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      13.395531   3.281109   4.083 4.45e-05 ***
## Sprint_40yd      -3.443932   0.504409  -6.828 8.63e-12 ***
## Vertical_Jump     0.019707   0.009751   2.021  0.04327 *  
## Bench_Press_Reps  0.032711   0.012303   2.659  0.00785 ** 
## Broad_Jump        0.003151   0.004969   0.634  0.52594    
## Agility_3cone    -0.643511   0.315738  -2.038  0.04154 *  
## Shuttle          -0.627600   0.476012  -1.318  0.18735    
## Weight            0.072456   0.007983   9.077  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2156.6  on 1730  degrees of freedom
## Residual deviance: 1946.3  on 1723  degrees of freedom
##   (1746 observations deleted due to missingness)
## AIC: 1962.3
## 
## Number of Fisher Scoring iterations: 4
# Regression diagnosis:
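# The 40-yard sprint, weight, and bench press reps are the most statistically significant predictors, with the lowest p-values (all below 0.01). The signs of the coefficients show that a faster (lower) 40-yard time, higher weight, and more bench press reps all increase the odds of getting drafted, holding the other metrics constant. Vertical jump and 3-cone time are significant at the 0.05 level: a higher jump raises the odds, and a slower 3-cone time lowers them. Broad jump and shuttle are not statistically significant (p > 0.05), so once the other metrics are accounted for, the model finds no clear evidence that they change the likelihood of being drafted. The model was fit on the 1,731 players with complete data, because 1,746 observations were dropped for missing values. All in all, the regression shows which attributes are most strongly associated with getting drafted.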
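
# Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. The sketch below copies the estimates from the summary output above rather than refitting the model. It assumes the Drafted factor has the levels "No" and "Yes", in which case glm() models the probability of "Yes". The 0.1-second step for the sprint is an illustrative choice, since a full second is far larger than the differences seen between prospects.

```r
# Coefficient estimates copied from summary(model_draft) above (log-odds scale)
coefs <- c(
  Sprint_40yd      = -3.443932,
  Vertical_Jump    =  0.019707,
  Bench_Press_Reps =  0.032711,
  Broad_Jump       =  0.003151,
  Agility_3cone    = -0.643511,
  Shuttle          = -0.627600,
  Weight           =  0.072456
)

# Odds ratio for a one-unit increase in each predictor, others held constant
odds_ratio_per_unit <- exp(coefs)

# Odds ratio for running the 40 one tenth of a second slower (illustrative step)
or_sprint_tenth <- exp(coefs[["Sprint_40yd"]] * 0.1)

# Odds ratio for one extra kilogram of body weight
or_weight_kg <- exp(coefs[["Weight"]])

print(round(odds_ratio_per_unit, 3))
print(round(c(sprint_plus_0.1s = or_sprint_tenth, weight_plus_1kg = or_weight_kg), 3))
```

# With these estimates, a 40-yard time that is 0.1 seconds slower multiplies the odds of being drafted by roughly 0.71, and each extra kilogram multiplies them by roughly 1.075. Because the predictors are measured in different units, the raw coefficient sizes should not be compared directly across metrics.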

# Omitted Columns
# Height was left out because it showed a negative coefficient, which makes no logical sense for predicting draft chances. BMI was left out of the model because it is redundant: it is calculated from height and weight.
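
# As a quick check on the redundancy of BMI, the sketch below recomputes it as weight divided by height squared for the first four players. The values are copied from the rounded str() output above, so small differences from the stored BMI are expected. The column names BMI_recomputed and Difference are introduced here for illustration only.

```r
# First four rows of Height (m), Weight (kg), and BMI as printed by str(NFL)
shown <- data.frame(
  Height = c(1.85, 1.88, 2.01, 1.80),
  Weight = c(106.6, 118.4, 165.1, 92.1),
  BMI    = c(31.0, 33.5, 41.0, 28.3)
)

# BMI is weight in kilograms divided by height in meters squared
shown$BMI_recomputed <- shown$Weight / shown$Height^2

# Gap between the recomputed and stored values (rounding in the printout)
shown$Difference <- round(shown$BMI_recomputed - shown$BMI, 2)

print(shown)
```

# The recomputed values stay close to the stored ones, which is consistent with BMI adding no information beyond height and weight.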

# The regression helps us understand which physical measurements from the combine matter most when predicting whether a player gets drafted. The 40-yard sprint, weight, and bench press reps are the most statistically significant factors. In other words, NFL teams tend to draft players who are fast, heavy, and strong. Knowing which traits matter most lets us identify players who meet those physical requirements and avoid prioritizing characteristics that have little bearing on draft decisions. Overall, this allows the team to improve its draft model by identifying overlooked and undervalued prospects whose statistically significant measurements exceed those of players at similar positions, and to project who is most likely to succeed at the professional level in the NFL.

# Research Question 3:
# Data Visualization: Do combine performance metrics differ across positions? Do WRs, CBs, RBs, etc. show similar speed and strength profiles?

# Explanation: 
# The bar graph compares average 40-yard sprint times (speed) and bench press reps (strength) across NFL positions. The results show clear differences in athletic profiles: speed-oriented positions such as wide receivers, cornerbacks, and running backs consistently post faster sprint times, while strength-heavy positions such as offensive and defensive linemen average many more bench press reps. Linebackers and edge rushers fall between these extremes, with a balance of speed and strength. Overall, this visualization shows how athletic performance varies by position, which helps the Bills scouting department evaluate the physical traits that matter most for each role.
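
# The code that produced the graph is not shown in this report. The sketch below is one way such a chart could be built with dplyr, tidyr, and ggplot2. It is not necessarily the layout used for the original figure. The toy data frame is made up for illustration and only reuses the column names from the NFL data. The two metrics are drawn in separate panels with free y-axes because seconds and reps are on different scales.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Toy stand-in for the NFL data frame (values are invented for illustration)
toy_combine <- data.frame(
  Position         = c("WR", "WR", "CB", "CB", "OT", "OT", "DT", "DT"),
  Sprint_40yd      = c(4.45, 4.51, 4.42, 4.50, 5.25, 5.31, 5.05, NA),
  Bench_Press_Reps = c(14, 16, 13, NA, 26, 28, 30, 27)
)

# Average each metric by position, ignoring missing values,
# then reshape to one row per position and metric for plotting
position_means <- toy_combine %>%
  group_by(Position) %>%
  summarise(
    `40-yard sprint (s)` = mean(Sprint_40yd, na.rm = TRUE),
    `Bench press (reps)` = mean(Bench_Press_Reps, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  pivot_longer(-Position, names_to = "Metric", values_to = "Average")

# One bar per position, one panel per metric
position_plot <- ggplot(position_means, aes(x = Position, y = Average)) +
  geom_col() +
  facet_wrap(~ Metric, scales = "free_y") +
  labs(
    title = "Average speed and strength by position",
    x = "Position",
    y = "Average"
  )

print(position_plot)
```

# Swapping toy_combine for the real NFL data frame would apply the same summary to every position in the dataset.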

# Filtered Positions:
# Quarterbacks, kickers, punters, and long snappers were removed from this analysis because their performance in the 40-yard dash and bench press does not meaningfully reflect their actual effectiveness or draft value. These positions rely more on technical skills, accuracy, or specialized roles rather than speed or upper-body strength. Removing them ensures the visualization focuses on positions where combine metrics provide relevant and comparable insights.
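
# A minimal sketch of this filtering step in base R is shown below. The position codes "QB", "K", "P", and "LS" are assumptions about how these positions are labeled in the Position column, so they should be checked against unique(NFL$Position) before use. The toy data frame is invented for illustration.

```r
# Position codes to drop (assumed labels; verify against unique(NFL$Position))
excluded_positions <- c("QB", "K", "P", "LS")

# Toy stand-in for the NFL data frame (values are invented for illustration)
toy_positions <- data.frame(
  Position    = c("QB", "WR", "K", "OT", "LS", "CB", "P"),
  Sprint_40yd = c(4.80, 4.45, 4.95, 5.25, 5.10, 4.42, 4.90),
  stringsAsFactors = FALSE
)

# Keep only rows whose position is not in the excluded set
kept_positions <- toy_positions[!toy_positions$Position %in% excluded_positions, ]

print(kept_positions)
```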

# The visualization helps our scouting model by showing how much speed and strength differ between NFL positions. Some positions rely more on speed, while others stand out on strength. By putting both traits side by side, we can immediately see which physical characteristics define each position. For the Buffalo Bills, this helps us evaluate prospects more accurately: if a player's speed or strength does not match the typical profile for their position, that may signal a poor athletic fit for the role. If a prospect exceeds the average for their position, such as a faster-than-normal linebacker or a stronger-than-typical tight end, that could point to an overlooked and undervalued pick. Overall, this chart strengthens our ability to predict which recruits fit the Bills' system and have the physical tools to succeed at the next level.
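
# One way to put a number on "exceeds the average for their position" is a within-position z-score: how many standard deviations a player sits above or below the mean of his position group. The sketch below uses base R and invented toy data. The helper z_within() and the columns speed_z and strength_z are hypothetical names, not part of the original analysis.

```r
# Toy prospects (values are invented for illustration)
toy_prospects <- data.frame(
  Player           = c("A", "B", "C", "D", "E", "F"),
  Position         = c("ILB", "ILB", "ILB", "TE", "TE", "TE"),
  Sprint_40yd      = c(4.55, 4.75, 4.85, 4.60, 4.70, 4.80),
  Bench_Press_Reps = c(22, 24, 26, 18, 20, 28),
  stringsAsFactors = FALSE
)

# Standardize a metric within each group: (value - group mean) / group sd
z_within <- function(x, group) {
  ave(x, group, FUN = function(v) (v - mean(v, na.rm = TRUE)) / sd(v, na.rm = TRUE))
}

# Lower sprint times are better, so flip the sign to make higher = faster
toy_prospects$speed_z    <- -z_within(toy_prospects$Sprint_40yd, toy_prospects$Position)
toy_prospects$strength_z <-  z_within(toy_prospects$Bench_Press_Reps, toy_prospects$Position)

print(toy_prospects)
```

# In this toy data, player D is one standard deviation faster than the average tight end, and player F is more than one standard deviation stronger. Scores like these could be used to flag prospects who stand out relative to their own position instead of the whole draft class.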

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.