setwd("C:/Users/danie/OneDrive/Documents/Data 110 work/Final Project")
nba_data <- read.csv("all_seasons.csv")
As a basketball fan, I chose to explore a dataset that covers National Basketball Association (NBA) players across many seasons. The dataset, sourced from Basketball Reference (https://www.basketball-reference.com/) and accessible on Kaggle, compiles player statistics and attributes, offering insight into the performance and progression of athletes in one of the world’s most prominent basketball leagues.
Basketball Reference, the original source, is widely relied on by basketball analysts, enthusiasts, and professionals for the depth and breadth of its data, which covers nearly every aspect of the sport at the professional level.
The dataset does not include a ReadMe describing how the data was collected, so the methodology can only be inferred from the standard practices of Basketball Reference. The data is likely gathered from a combination of official NBA records, game footage analysis, and statistical aggregation, covering game data, player statistics (both on-court performance and physical attributes), and team information.
Upon acquiring the dataset, the initial step involved a thorough examination of its structure and contents. The cleaning process entailed:
Removing Redundancies: Eliminating any duplicate records to ensure the uniqueness of each data entry.
Handling Missing Values: Assessing and addressing missing data points, either by imputing values where appropriate or removing records with substantial gaps.
Data Type Conversions: Ensuring that each variable was in its correct format (e.g., converting numerical data stored as strings into numeric types, parsing dates correctly).
Normalization: Standardizing certain variables for consistency, such as converting all heights and weights to a uniform measurement system.
These steps were crucial in preparing the dataset for a robust analysis, ensuring the integrity and reliability of the insights derived from it.
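The cleaning steps listed above could be sketched roughly as follows. This is a minimal illustration rather than the exact code used for this report: `clean_nba_data` is a hypothetical helper, and it assumes the draft columns are stored as text with "Undrafted" marking undrafted players. No unit conversion is shown because the variable list already gives height in centimeters and weight in kilograms. If the file carries a row-index column, duplicates would need to be checked on the remaining columns instead.

```r
library(dplyr)

# Hypothetical helper: applies the cleaning steps described in the text.
clean_nba_data <- function(df) {
  df %>%
    # Removing redundancies: drop rows that are exact duplicates
    distinct() %>%
    # Data type conversions: draft fields are assumed to be text, with
    # "Undrafted" for undrafted players; those entries become NA
    mutate(across(c(draft_year, draft_round, draft_number),
                  ~ suppressWarnings(as.integer(as.character(.x))))) %>%
    # Handling missing values: keep rows that have the core per-game stats
    filter(!is.na(pts), !is.na(reb), !is.na(ast))
}

# Usage (assumes nba_data has been read in as shown at the top):
# nba_clean <- clean_nba_data(nba_data)
```

Converting "Undrafted" to NA keeps undrafted players in the data while letting the draft columns be treated as numbers in later steps.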
My choice of this dataset is driven by a love of basketball (and of sports in general) and by curiosity about player performance and how it evolves.
Here’s a breakdown of the variables in the dataset:
player_name: The name of the basketball player.
team_abbreviation: Abbreviation of the team name.
age: Age of the player.
player_height: Height of the player in centimeters.
player_weight: Weight of the player in kilograms.
college: The college where the player studied, if applicable.
country: The player’s country of origin.
draft_year: The year the player was drafted.
draft_round: The round of the draft in which the player was selected.
draft_number: The number at which the player was drafted.
gp: Games played.
pts: Points per game.
reb: Rebounds per game.
ast: Assists per game.
net_rating: The player’s net rating.
oreb_pct: Offensive rebound percentage.
dreb_pct: Defensive rebound percentage.
usg_pct: Usage percentage.
ts_pct: True shooting percentage.
ast_pct: Assist percentage.
season: The season of play.
Player Performance Trends: How does a player’s performance (e.g., points, rebounds, assists) change with age?
Physical Attributes and Performance: Is there a correlation between a player’s height or weight and their performance metrics?
Team and Player Analysis: Which teams have had the most successful players in terms of points, assists, and rebounds?
Draft Analysis: How does the draft round or number correlate with a player’s performance in the league?
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)
## Warning: package 'plotly' was built under R version 4.3.2
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
# Player Performance Trends
performance_trends_plot <- nba_data %>%
  filter(!is.na(pts), !is.na(reb), !is.na(ast)) %>%
  group_by(age) %>%
  summarize(avg_points = mean(pts, na.rm = TRUE),
            avg_rebounds = mean(reb, na.rm = TRUE),
            avg_assists = mean(ast, na.rm = TRUE)) %>%
  # Reshape to long format: one row per age and metric
  pivot_longer(-age, names_to = "metric", values_to = "average") %>%
  ggplot(aes(x = age, y = average, color = metric)) +
  geom_line() +
  # Colors are assigned to the metrics in alphabetical order
  scale_color_manual(values = c("blue", "green", "red")) +
  labs(title = "Player Performance Trends Across Ages",
       x = "Age",
       y = "Average Metric",
       color = "Metric",
       caption = "Data source: NBA Dataset") +
  theme_minimal() +
  theme(plot.caption = element_text(hjust = 0))
# Convert to interactive plot
performance_trends_interactive <- ggplotly(performance_trends_plot)
# Display the plot
performance_trends_interactive
Player Performance Trends: This visualization shows the average points, rebounds, and assists of NBA players across different ages. The lines represent the three key performance metrics. The plot provides insights into how a player’s performance evolves with age, helping to identify peak performance ages.
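To read peak ages off the data rather than off the chart, the same per-age averages can be computed and the maximum located for each metric. The sketch below is one possible approach, not part of the original analysis: `find_peak_ages` is a hypothetical helper, and `min_seasons` is an assumed cutoff that drops ages with too few player-seasons to give a stable average.

```r
library(dplyr)

# Hypothetical helper: age with the highest average value for each metric.
# min_seasons is an assumed cutoff that excludes sparsely populated ages.
find_peak_ages <- function(df, min_seasons = 30) {
  by_age <- df %>%
    filter(!is.na(pts), !is.na(reb), !is.na(ast)) %>%
    group_by(age) %>%
    summarize(n_rows = n(),
              avg_points = mean(pts),
              avg_rebounds = mean(reb),
              avg_assists = mean(ast),
              .groups = "drop") %>%
    filter(n_rows >= min_seasons)

  # Fail early if the cutoff removed every age group
  stopifnot(nrow(by_age) > 0)

  data.frame(
    metric = c("points", "rebounds", "assists"),
    peak_age = c(by_age$age[which.max(by_age$avg_points)],
                 by_age$age[which.max(by_age$avg_rebounds)],
                 by_age$age[which.max(by_age$avg_assists)])
  )
}

# Usage (assumes nba_data has been read in as shown at the top):
# find_peak_ages(nba_data)
```

The cutoff matters because the youngest and oldest ages contain few players, so a single strong season can dominate the average there.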
# Physical Attributes and Performance
attributes_performance_plot <- nba_data %>%
  filter(!is.na(player_height), !is.na(player_weight), !is.na(pts)) %>%
  ggplot(aes(x = player_height, y = pts, color = player_weight)) +
  geom_point() +
  # Lighter players appear purple, heavier players orange
  scale_color_gradient(low = "purple", high = "orange") +
  labs(title = "Height, Weight, and Points Correlation in NBA Players",
       x = "Player Height (cm)",
       y = "Points per Game",
       color = "Player Weight (kg)",
       caption = "Data source: NBA Dataset") +
  theme_light()
# Convert to interactive plot
attributes_performance_interactive <- ggplotly(attributes_performance_plot)
# Display the plot
attributes_performance_interactive
Physical Attributes and Performance: This scatter plot shows the relationship between players’ height, weight, and points per game. Each marker is one player-season, positioned by height and scoring and colored by weight, showing how physical attributes might influence scoring ability.
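A scatter plot suggests a relationship but does not quantify it. As a rough complement, the Pearson correlation of height and weight with points per game can be computed directly. The helper name `attribute_correlations` is hypothetical, and this sketch uses base R only.

```r
# Hypothetical helper: Pearson correlations of height and weight with
# points per game, using only rows where all three values are present.
attribute_correlations <- function(df) {
  cols <- c("player_height", "player_weight", "pts")
  complete <- df[complete.cases(df[, cols]), ]
  c(height_pts = cor(complete$player_height, complete$pts),
    weight_pts = cor(complete$player_weight, complete$pts))
}

# Usage (assumes nba_data has been read in as shown at the top):
# attribute_correlations(nba_data)
```

Values near 0 would indicate little linear association, while values near 1 or -1 would indicate a strong one. Correlation alone does not show that height or weight causes the difference in scoring.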
# Team and Player Analysis
team_player_analysis_plot <- nba_data %>%
  filter(!is.na(pts), !is.na(reb), !is.na(ast), !is.na(gp)) %>%
  group_by(team_abbreviation) %>%
  # pts, reb, and ast are per-game averages, so multiply by games played
  # to approximate each player-season's totals before summing
  summarize(total_points = sum(pts * gp, na.rm = TRUE),
            total_rebounds = sum(reb * gp, na.rm = TRUE),
            total_assists = sum(ast * gp, na.rm = TRUE)) %>%
  # Reshape to long format: one row per team and metric
  pivot_longer(-team_abbreviation, names_to = "metric", values_to = "total") %>%
  ggplot(aes(x = reorder(team_abbreviation, total), y = total, fill = metric)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("cyan", "magenta", "yellow")) +
  labs(title = "Total Performance Metrics by Team",
       x = "Team",
       y = "Total Metric",
       fill = "Metric",
       caption = "Data source: NBA Dataset") +
  theme_classic() +
  coord_flip()
# Convert to interactive plot
team_player_analysis_interactive <- ggplotly(team_player_analysis_plot)
# Display the plot
team_player_analysis_interactive
Team and Player Analysis: This bar chart displays the total points, rebounds, and assists accumulated by players in each team. The metrics are differentiated by color, providing a clear comparison between teams’ overall performance.
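Totals summed over every player-season grow with the number of rows a team has in the data, so franchises with more seasons on record rank higher regardless of player quality. One way to put teams on a comparable footing, sketched here as an assumption rather than as part of the original analysis, is to average the per-game statistics per player-season. The helper name `team_averages` is hypothetical.

```r
library(dplyr)

# Hypothetical helper: average per-game statistics per player-season,
# so teams with more rows in the data are not favored automatically.
team_averages <- function(df) {
  df %>%
    filter(!is.na(pts), !is.na(reb), !is.na(ast)) %>%
    group_by(team_abbreviation) %>%
    summarize(player_seasons = n(),
              avg_points = mean(pts),
              avg_rebounds = mean(reb),
              avg_assists = mean(ast),
              .groups = "drop") %>%
    # Highest-scoring rosters first
    arrange(desc(avg_points))
}

# Usage (assumes nba_data has been read in as shown at the top):
# head(team_averages(nba_data))
```

The `player_seasons` column is kept so that averages based on only a handful of rows can be spotted and treated with caution.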
# Draft Analysis
draft_analysis_plot <- nba_data %>%
  # Undrafted players have no pick number, so exclude them
  filter(!is.na(draft_number), draft_number != "Undrafted") %>%
  mutate(draft_number = as.numeric(draft_number)) %>%
  group_by(draft_number) %>%
  summarize(avg_points = mean(pts, na.rm = TRUE),
            avg_rebounds = mean(reb, na.rm = TRUE),
            avg_assists = mean(ast, na.rm = TRUE)) %>%
  # Reshape to long format: one row per draft number and metric
  pivot_longer(-draft_number, names_to = "metric", values_to = "average") %>%
  ggplot(aes(x = draft_number, y = average, color = metric)) +
  geom_line() +
  scale_color_manual(values = c("darkred", "darkgreen", "darkblue")) +
  labs(title = "Impact of Draft Number on NBA Players' Performance",
       x = "Draft Number",
       y = "Average Metric",
       color = "Metric",
       caption = "Data source: NBA Dataset") +
  theme_bw()
# Convert to interactive plot
draft_analysis_interactive <- ggplotly(draft_analysis_plot)
# Display the plot
draft_analysis_interactive
Draft Analysis: This line graph shows the average points, rebounds, and assists of NBA players based on their draft number. Different colors represent different metrics. This visualization can help identify trends in how the draft number impacts a player’s performance in the league.
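The strength of the draft trend can also be summarized in a single number. The sketch below computes a Spearman rank correlation between draft number and points per game. Spearman is used here as an assumption, since the relationship is unlikely to be linear. `draft_correlation` is a hypothetical helper written in base R.

```r
# Hypothetical helper: Spearman rank correlation between draft number and
# points per game, excluding undrafted players.
draft_correlation <- function(df) {
  drafted <- df[!is.na(df$draft_number) & df$draft_number != "Undrafted", ]
  # Pick numbers are assumed to be stored as text, so convert them first
  pick <- suppressWarnings(as.numeric(as.character(drafted$draft_number)))
  keep <- !is.na(pick) & !is.na(drafted$pts)
  cor(pick[keep], drafted$pts[keep], method = "spearman")
}

# Usage (assumes nba_data has been read in as shown at the top):
# draft_correlation(nba_data)
```

A negative value would mean that earlier picks (lower draft numbers) tend to score more. A value close to 0 would mean draft position says little about scoring.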
Player Performance Trends: This visualization charts how average points, rebounds, and assists change with age. It helps locate peak performance ages and the effect of aging on a player’s game. Because the lines are averages, individual players may defy the typical trend, maintaining or even improving their performance with age.
Physical Attributes and Performance: This analysis explores the relationship between height, weight, and points per game. Certain physical attributes might favor specific statistics, and different body types may affect the game in different ways.
Team and Player Analysis: This visualization compares teams on their players’ aggregate performance metrics. It offers a starting point for comparing teams; separating reliance on star players from a balanced approach would require looking at how these metrics are distributed within each team.
Draft Analysis: This analysis examines the relationship between draft position and NBA performance. It tests the expectation that higher draft picks always perform better, and any departures from that pattern point to the unpredictability of player development and the limits of scouting.
Each of these analyses offers a different lens on the NBA: aging curves, physical attributes, team-level output, and the uncertainties of the draft. Together they highlight patterns in the data and point to questions for further exploration, such as how these factors interact to shape success in the league. The dataset is rich enough to support many more analyses for anyone interested in the nuances of basketball.