My goal for this analysis is to tidy and transform a wide dataset of basic statistics for NFL players and examine how their physical attributes, specifically height and weight, correlate with the length of their careers. The raw data, taken from my classmate’s post in Discussion 5A, includes fields such as a player’s birthplace, college, name, and experience. The data is untidy, with a lot of missing values and inconsistent formatting, so it’ll be important to reshape height and weight into a long format for analysis.
Planned Workflow
I’ll use tidyverse to load the CSV containing the player statistics, rename the column headers into a consistent format, and convert the text strings for seasons into numeric values. I’ll also separate the birthplace column into its parts and split years played into start and end years so that career length can be calculated. After I finish tidying, I’ll use ggplot to assess visually whether height and weight appear to predict career length.
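The steps above can be sketched on a small made-up table. This is only a minimal sketch, not the exact code run on the full file: the toy rows, the lowercase column names, and the value formats ("7 Seasons" for experience, "City , ST" for birthplace, "1987 - 1993" for years played) are assumptions for illustration, and career length is taken here as end year minus start year.

```r
library(tidyverse)

# Toy stand-in for the raw file. The rows and the value formats are
# assumptions for illustration, not values from Basic_Stats.csv.
toy_raw <- tibble(
  Name              = c("Player A", "Player B", "Player C"),
  `Birth Place`     = c("Dallas , TX", "Miami , FL", NA),
  Experience        = c("7 Seasons", "2 Seasons", NA),
  `Height (inches)` = c(74, 70, NA),
  `Weight (lbs)`    = c(220, 195, 250),
  `Years Played`    = c("1987 - 1993", "2001 - 2002", NA)
)

toy_tidy <- toy_raw |>
  # Rename headers into a consistent snake_case format
  rename(
    name         = Name,
    birth_place  = `Birth Place`,
    experience   = Experience,
    height_in    = `Height (inches)`,
    weight_lbs   = `Weight (lbs)`,
    years_played = `Years Played`
  ) |>
  # Pull the number out of strings such as "7 Seasons"
  mutate(experience = parse_number(experience)) |>
  # Split birthplace into city and state
  separate(birth_place, into = c("birth_city", "birth_state"),
           sep = "\\s*,\\s*") |>
  # Split years played into start and end years, converted to integers
  separate(years_played, into = c("start_year", "end_year"),
           sep = "\\s*-\\s*", convert = TRUE) |>
  mutate(career_length = end_year - start_year) |>
  # Move height and weight into a long format: one row per player per trait
  pivot_longer(
    cols      = c(height_in, weight_lbs),
    names_to  = "physical_trait",
    values_to = "trait_value"
  )

toy_tidy
```

Each player ends up with two rows, one for height and one for weight, which is the shape that `facet_wrap(~physical_trait)` expects in the plot.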
Anticipated Challenges
A challenge with this dataset is the inconsistency of the columns. Many players do not have all of their information provided, so those fields are left blank. Many player positions are also blank. Position could plausibly play a part in career longevity, but because it is so often missing, I’ll rely on height and weight, as discussed above, to assess their relationship with career length.
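One way to see how much of each column is blank is to count the missing values per column. The sketch below uses a made-up table; the column names `position`, `height_in`, and `weight_lbs` are hypothetical stand-ins, and the same pipeline could be pointed at the real data after renaming.

```r
library(tidyverse)

# Toy data with gaps; column names and values are assumptions for illustration
toy <- tibble(
  position   = c("QB", NA, "WR", NA),
  height_in  = c(74, NA, 72, 70),
  weight_lbs = c(220, 195, NA, 180)
)

missing_summary <- toy |>
  # Count NA values in every column
  summarise(across(everything(), ~ sum(is.na(.x)))) |>
  # Reshape to one row per column for easier reading
  pivot_longer(everything(), names_to = "column", values_to = "n_missing") |>
  mutate(pct_missing = 100 * n_missing / nrow(toy))

missing_summary
```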
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.2.0
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
nfl_raw <- read_csv("Basic_Stats.csv")
Rows: 17172 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (12): Birth Place, Birthday, College, Current Status, Current Team, Expe...
dbl (4): Age, Height (inches), Number, Weight (lbs)
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ggplot(nfl_tidy, aes(x = trait_value, y = career_length, color = physical_trait)) +
  geom_jitter(alpha = 0.3) +
  geom_smooth(method = "lm", color = "black") +
  facet_wrap(~physical_trait, scales = "free_x") +
  labs(
    title = "NFL Career Length vs. Physical Attributes",
    x = "Measured Value (Inches or Lbs)",
    y = "Career Length (Years)",
    color = "Trait Type"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 6192 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 6192 rows containing missing values or values outside the scale range
(`geom_point()`).
Conclusion
Based on the trend in the graph, there appears to be a positive correlation between a player’s height and weight and the length of their career. As height and weight increase, career length also tends to increase, based on the information available for players in this dataset. You may also notice that many players in the point cloud fall below the trend line, which shows that even above-average height and weight won’t guarantee a long career in the NFL.
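To put a number on the trend described above, the correlation and fitted slope could be computed separately for each trait. This is a rough sketch on made-up values, assuming a long table with the same three columns used in the plot (`physical_trait`, `trait_value`, `career_length`); the numbers it produces say nothing about the real dataset.

```r
library(tidyverse)

# Toy long-format data; values are invented for illustration only
toy_tidy <- tibble(
  physical_trait = rep(c("height_in", "weight_lbs"), each = 5),
  trait_value    = c(68, 70, 72, 74, 76, 180, 200, 220, 240, 260),
  career_length  = c(2, 3, 5, 4, 7, 1, 4, 3, 6, 8)
)

trait_summary <- toy_tidy |>
  # Drop incomplete rows, as ggplot does before drawing
  drop_na(trait_value, career_length) |>
  group_by(physical_trait) |>
  summarise(
    n         = n(),
    # Pearson correlation between the trait and career length
    pearson_r = cor(trait_value, career_length),
    # Least-squares slope: years of career per one unit of the trait
    slope     = cov(trait_value, career_length) / var(trait_value)
  )

trait_summary
```

A correlation close to zero with a positive slope would match a plot where the line tilts upward but the points are widely scattered around it.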