Data Selection

  • For this project, I chose to analyze NBA player data. The data set contains over two decades of stats for every player who has been on an NBA team's roster.

Data Description

The data set includes various stats for NBA players from 1996 to 2020.
The stats that we will mostly use are:

  • Counting stats such as points, rebounds, assists, and games played
  • Info about the player such as name, weight, height, age and nationality
  • Info about the player’s team such as NBA team, college and draft position
  • Team percentage stats such as usage pct and assist pct
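
The plots below rely on a numeric `year` column alongside the `season` label. How that column was built is not shown here. One possible sketch, assuming `season` is a string such as "1996-97" and using a small made-up data frame (`toy_stats`, a hypothetical name) in place of the real `stats`, is:

```r
# Toy stand-in for the real data; names and values are illustrative only.
toy_stats <- data.frame(
  player_name = c("Player A", "Player B", "Player C"),
  season      = c("1996-97", "2008-09", "2019-20"),
  stringsAsFactors = FALSE
)

# Assumption: the first four characters of `season` are the starting year.
toy_stats$year <- as.integer(substr(toy_stats$season, 1, 4))

print(toy_stats)
```

If the real `season` values use a different format, the `substr()` positions would need to change accordingly.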

Plot 1:

Best individual seasons

For this plot, my goal is to show only the player seasons that were better than average in each main statistical category. The data was prepared with the following code:

library(dplyr)

#only keep players who are above average in PTS, REB, and AST
#                                         with net_rating > 2
plot1data <- stats %>% filter(
                      pts > mean(pts, na.rm = TRUE),
                      reb > mean(reb, na.rm = TRUE),
                      ast > mean(ast, na.rm = TRUE),
                      gp > 41,         #played over half a season
                      net_rating > 2)  #clearly positive impact

This scatter plot shows every player season that was above average in PTS, REB and AST and also had a net rating above 2 while on the court (more than 41 games played).

Code for the previous plot:

library(plotly)

plot1 <- plot_ly(plot1data, x = ~pts, y = ~reb, z = ~ast,
                 #hover text: player name and season
                 text = ~paste(player_name, season),
                 marker = list(
                   size = 5,
                   color = plot1data$year,  #color each point by year
                   colorbar = list(title = 'Year'),
                   colorscale = 'Viridis',
                   reversescale = TRUE)) %>%
  add_markers() %>%
  layout(title = "Best NBA seasons",
         scene = list(xaxis = list(title = 'PPG'),
                      yaxis = list(title = 'RPG'),
                      zaxis = list(title = 'APG')))

Plot 2:

NBA player heights

NBA Player Heights through the years

This plot shows the distribution of player heights in each NBA season. It uses the full data set, so no extra filtering code was needed.

  • Below is the code used for the previous plot:
plot2 <- plot_ly(stats, x = ~year, y = ~player_height,
                 type = "box",
                 marker = list(color = 'rgb(255,1,1)'),  #red outlier points
                 line = list(color = 'rgb(0,0,0)'),      #black box outlines
                 text = ~paste(player_name, season)) %>%
  layout(title = "NBA Player Height Each Year",
         yaxis = list(title = "Player Height (cm)"),
         xaxis = list(title = "Year"))

Plot 3

What colleges produced the most NBA players?

College contributions to the NBA

This plot shows which colleges the players on 2019 NBA rosters attended before entering the NBA.
To do this, I used the code below.

#get players from 2019
plot3data <- stats %>% filter(year == 2019)
#attach the number of players per college to every row
plot3data <- plot3data %>% group_by(college) %>% mutate(count = n()) %>% ungroup()
#keep one row per college
uniqueColleges <- plot3data %>% distinct(college, .keep_all = TRUE)
#only count colleges with at least 5 players
uniqueColleges <- uniqueColleges %>% filter(count >= 5)
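
As a cross-check on the counts above, the same tally can be sketched in base R with `table()`. The data frame below (`toy_2019`) is a made-up stand-in for the 2019 subset, the college names are placeholders, and the threshold is lowered to 2 so the toy data has something to show:

```r
# Hypothetical miniature of the 2019 subset; colleges are placeholders.
toy_2019 <- data.frame(
  player_name = paste("Player", 1:6),
  college     = c("College A", "College A", "College B",
                  "College A", "College C", "College B"),
  stringsAsFactors = FALSE
)

# Count players per college, then keep colleges at or above the threshold.
min_players <- 2  # the analysis above uses 5
college_counts <- as.data.frame(table(college = toy_2019$college),
                                responseName = "count",
                                stringsAsFactors = FALSE)
college_counts <- college_counts[college_counts$count >= min_players, ]

print(college_counts)
```

Comparing a table like this against `uniqueColleges$count` is a quick way to confirm that the grouped counts were computed as intended.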

Plot Code

Below is the code that I used to produce the previous plot:

library(ggplot2)

#refer to columns by name inside aes() instead of uniqueColleges$...
plot3 <- ggplot(uniqueColleges, aes(x = "", y = count, fill = college)) +
            geom_bar(width = 1, stat = "identity", color = "white") +
            labs(x = "", y = "", title = "2019 NBA Players' Colleges \n",
                 fill = "Colleges") +
            geom_text(aes(label = count),
                      position = position_stack(vjust = 0.5)) +
            coord_polar(theta = "y")  #turn the stacked bar into a pie chart

Plot 4

Highest PPG for each year

#group by each year and keep the top scorer (ties are kept)
plot4data <- stats %>% group_by(year) %>% slice_max(pts, n = 1)

#refer to columns by name inside aes(); bar width is set on the geom
plot4 <- ggplot(data = plot4data, aes(x = year, y = pts)) +
              geom_bar(stat = "identity", fill = "orange", width = 0.65) +
              geom_text(aes(label = pts), vjust = -0.3, size = 3.5) +
              ggtitle("Highest PPG for Each Season") +
              ylab("Points Per Game") +
              xlab("Year") +
              theme_minimal()

Simple Linear Regression

  • My goal for this linear regression model is to see whether taller players tend to score more points per game.
  • To do this, I gave the model the height and pts of every player season in our data set.
height <- stats$player_height
pts <- stats$pts
traindata <- data.frame(height, pts)

Linear Regression Model

  • Next, I fit the regression model and plot it with a trendline:
model <- lm(pts ~ height, data = traindata)
traindata %>%
  plot_ly(x = ~height) %>%
  add_markers(y = ~pts) %>%
  add_lines(x = ~height, y = fitted(model)) %>%
  layout(title = 'Height Vs Pts', xaxis = list(title = 'Player Height (cm)'),
         yaxis = list(title = 'PTS'), showlegend = FALSE)

Linear Regression plot

Evaluation

MultipleR
## [1] 0.003663154
AdjustedR
## [1] 0.003577983
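
The code that produced `MultipleR` and `AdjustedR` is not shown. Both values look like the R-squared and adjusted R-squared that `summary()` reports for an `lm` fit, so one plausible way to obtain them is sketched below. The data is simulated and the names (`toy`, `toy_model`, `multiple_r2`, `adjusted_r2`) are hypothetical, so the numbers it prints are not the NBA results:

```r
# Simulated heights (cm) and points; purely illustrative, not the NBA data.
set.seed(1)
toy <- data.frame(height = rnorm(200, mean = 200, sd = 9))
toy$pts <- 8 + rnorm(200, sd = 6)  # points generated independently of height

toy_model <- lm(pts ~ height, data = toy)
fit_summary <- summary(toy_model)

multiple_r2 <- fit_summary$r.squared      # share of variance explained
adjusted_r2 <- fit_summary$adj.r.squared  # penalized for number of predictors

# With a single predictor, R-squared equals the squared Pearson correlation.
pearson_r <- cor(toy$height, toy$pts)

print(c(multiple_r2 = multiple_r2, adjusted_r2 = adjusted_r2,
        r_squared_from_cor = pearson_r^2))
```

Read the same way, an R-squared of about 0.0037 corresponds to a correlation of roughly 0.06 in absolute value between height and points.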

With an R-squared of about 0.004, height explains less than 1% of the variation in points per game, so our data show almost no linear relationship between height and scoring ability in the NBA.

The end