NBA Final Project

Author

Ben Coyle

Introduction

The purpose of this report is to study different factors that play a part in an average NBA game. Players, coaches, and front offices obviously play a big part in the NBA, but in this report I want to look at the more miscellaneous variables, including the referees and the different NBA arenas.

Inquiry

I am a big NBA fan. I know most of the players and follow all 30 teams, so I knew that I wanted to choose a basketball-related project. I wanted to know if there are external factors that play a part in NBA games, which is why I chose to look at the referees and different arenas in this report. I started by researching referees in the NBA.

Data Dictionary

REFEREE: Referee Name

ROLE: Crew or Chief

GENDER: Male or Female

EXPERIENCE..YEARS.: Years of officiating experience

GAMES.OFFICIATED: NBA games officiated in 2023

HOME.TEAM.WIN: Percentage of games won by the home team when this referee officiates

HOME.TEAM.POINTS.DIFFERENTIAL: Point differential for the home team when this referee officiates

TOTAL.POINTS.PER.GAME: Total points scored per game when this referee officiates

CALLED.FOULS.PER.GAME: Fouls called per game when this referee officiates

FOUL..AGAINST.ROAD.TEAM: Fouls called against the road team

FOUL..AGAINST.HOME.TEAM: Fouls called against the home team

FOUL.DIFFERENTIAL..Against.Road.Team.....Against.Home.Team.: Fouls called against the road team minus fouls called against the home team

reviewer_name: User name

reviewer_location: User location

review_date: Date of review

review_title: Title of review

review_content: What the review says

arena: Madison Square Garden or Chase Center
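The code later in this report refers to columns such as EXPERIENCE..YEARS. with runs of dots in the name. A likely explanation, assuming the spreadsheet headers contain spaces and parentheses, is that read.csv() passes every header through make.names() by default (check.names = TRUE). The sketch below illustrates this; the two header strings are guesses at the original spreadsheet headers, not values taken from the file.

```r
# Hypothetical spreadsheet headers (assumed, not read from the actual file).
raw_headers <- c("EXPERIENCE (YEARS)",
                 "FOUL DIFFERENTIAL (Against Road Team) - (Against Home Team)")

# read.csv() applies make.names() to headers by default (check.names = TRUE).
# make.names() replaces every character that is not valid in an R name
# (spaces, parentheses, hyphens) with a dot.
clean_headers <- make.names(raw_headers)

print(clean_headers)
```

Running names() on a data frame right after loading it shows the exact column names to use in plotting code.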

How do Referees Impact the NBA?

Whenever there is a close game, the losing team often blames the refs. I am curious to see whether there has been any bias in reffing this season, or at least any patterns in how games are called. I ended up making five different graphs.

First, I wanted to see if the amount of experience a ref had made a difference in which refs were Crew and which refs were Chief.

Second, I wanted to see if refs called more fouls on the road team compared to the home team.

Third, I wanted to see how often the home team won under each ref.

Fourth, I wanted to see if there was a big disparity among refs in fouls called per game.

And last, I wanted to see the gender diversity among refs.

library(tidyverse)
library(tidytext)
library(ggwordcloud)
library(textdata)
library(readr)
library(rvest)
library(lubridate)
library(httr)
library(stringr)
library(ggplot2)
library(knitr)
refstats <- read.csv("https://myxavier-my.sharepoint.com/:x:/g/personal/coyleb2_xavier_edu/EbAFXKKGRYBKoFjMtinm9skB4MrSppjCqSmglgELMfRdNg?download=1")

Role vs Experience

average_experience <- refstats %>%
  group_by(ROLE) %>%
  summarise(average_experience = mean(EXPERIENCE..YEARS., na.rm = TRUE))

ggplot(average_experience, aes(x = ROLE, y = average_experience, fill = ROLE)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Average Experience Years by NBA Referee Role", x = "Referee Role", y = "Average Experience Years") +
  theme_minimal()

Looking at the graph above, the average Chief referee has close to 19 years of experience, whereas the average Crew referee has about 12. Although the disparity did not surprise me, I was surprised by how long people stay refs. Being a ref for 19 years is a long time.

Home vs Away Foul Differential

# One bar per referee; facet_wrap() must be chained with `+` to take effect
ggplot(refstats, aes(x = REFEREE, y = FOUL.DIFFERENTIAL..Against.Road.Team.....Against.Home.Team., fill = ROLE)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Foul Differential (Against Road Team) - (Against Home Team) by Referee",
       x = "Referee",
       y = "Foul Differential",
       fill = "Role") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  facet_wrap(~ROLE, scales = "free_y")

For this graph, a value of zero means that the ref calls as many fouls on the home team as on the road team. Above zero means the ref calls more fouls on the road team, and below zero means more on the home team. Although fouls do not have to be even, it does seem that refs call more fouls on the road team.
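To make the reading of the differential concrete, here is a minimal sketch in base R. The three referees and their foul averages are made up for illustration and are not taken from the refstats data.

```r
# Toy data: hypothetical per-game foul averages for three made-up referees.
toy_refs <- data.frame(
  referee = c("Ref A", "Ref B", "Ref C"),
  fouls_against_road = c(20.5, 19.0, 21.2),
  fouls_against_home = c(19.5, 19.0, 21.8)
)

# Differential = fouls against the road team minus fouls against the home team.
toy_refs$foul_differential <-
  toy_refs$fouls_against_road - toy_refs$fouls_against_home

# Share of referees who call more fouls on the road team (differential > 0).
share_road_heavy <- mean(toy_refs$foul_differential > 0)

print(toy_refs)
print(share_road_heavy)
```

In this toy example only Ref A sits above zero, so one of the three referees calls more fouls on the road team.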

Home vs Away Win Percentage

# Assumption: the home-win column is named HOME.TEAM.WIN as in the data dictionary;
# read.csv() may have altered it, so confirm with names(refstats) first
ggplot(refstats, aes(x = REFEREE, y = HOME.TEAM.WIN, fill = ROLE)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Home Team Win Percentage by Referee",
       x = "Referee",
       y = "Home Team Win Percentage",
       fill = "Role") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  facet_wrap(~ROLE, scales = "free_y")

For this graph, a value of 0.5 means that the road team wins just as often as the home team. Above 0.5 means the home team wins more, and below 0.5 means the road team wins more. Thanks to home court advantage, the home team has won around 59% of the time since 2004. Although a couple of refs are well above or below that mark, this is about what I was expecting.
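One way to judge whether a single referee's home win rate is really unusual is to compare it with the roughly 59% baseline using an exact binomial test. The sketch below uses base R's binom.test(); the game counts are hypothetical and only illustrate the idea.

```r
# Hypothetical referee: 60 games officiated, home team won 42 of them (70%).
games_officiated <- 60
home_wins <- 42

# League-wide home win rate quoted in the text (about 59%).
baseline <- 0.59

# Exact binomial test: how surprising is 42 of 60 if the true rate is 59%?
result <- binom.test(home_wins, games_officiated, p = baseline)

print(result$estimate)  # observed home win rate
print(result$p.value)   # large values mean the gap could easily be chance
```

Because each referee works a limited number of games, even a rate that looks extreme on the graph may not be distinguishable from the baseline.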

Called Fouls by Ref

ggplot(refstats, aes(x = ROLE, y = CALLED.FOULS.PER.GAME, fill = ROLE)) +
  geom_boxplot() +
  labs(title = "Box Plot of Called Fouls per Game by Referee Role",
       x = "Referee Role",
       y = "Called Fouls per Game") +
  theme_minimal()

For this graph, I wanted to see if there was a big disparity in fouls called. Outside of a couple of outliers, most refs call around 40 fouls a game on average.
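The outliers in a box plot are the points that fall more than 1.5 times the box height beyond the box, which is also the default in geom_boxplot() (coef = 1.5). The sketch below shows the same rule with base R's boxplot.stats(), whose quartile calculation can differ slightly from ggplot2's. The ten values are hypothetical.

```r
# Hypothetical fouls-per-game averages for ten made-up referees.
fouls_per_game <- c(38.5, 39.2, 39.8, 40.0, 40.1, 40.3, 40.6, 41.0, 41.4, 46.0)

# boxplot.stats() flags points that lie more than 1.5 times the box height
# (upper hinge minus lower hinge) beyond the hinges.
stats <- boxplot.stats(fouls_per_game)

print(stats$stats)  # lower whisker, lower hinge, median, upper hinge, upper whisker
print(stats$out)    # values flagged as outliers
```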

Ref by Gender

ggplot(refstats, aes(x = GENDER, y = EXPERIENCE..YEARS., color = GENDER)) +
  geom_point() +
  labs(title = "Scatter Plot of Gender vs Years of Experience",
       x = "Gender",
       y = "Years of Experience") +
  theme_minimal()

write.csv(refstats, "referee_data.csv", row.names = FALSE)

This surprised me: there are only four female refs in the NBA. It makes sense that they do not have much experience yet, and I expect this number to grow over the years.

Referee Conclusion

Based on my results, I learned that every ref is different. Although I do not believe they are rigging games, every ref calls a game differently. Experience and the home crowd may be some of the factors. Overall, I had never really thought about officiating, so this was a cool project to do.

How does Madison Square Garden Compare to Chase Center?

Every arena across the NBA is different in its own way. Some arenas have different color courts, some are bigger, and some are louder. I want to compare the oldest arena in the league (Madison Square Garden) with the newest arena (Chase Center) to see how their reviews compare to each other. The reviews come from tripadvisor.com.

arena_stats <- read.csv("https://myxavier-my.sharepoint.com/:x:/g/personal/coyleb2_xavier_edu/Ed8jUQqCzI9BgpEdY9t8RD0BwMJzlGFDBwk-rehS1n824g?download=1")

Madison Square Garden vs Chase Center Net Sentiment Comparison

words <- arena_stats %>%
  unnest_tokens(word, review_content)

# Get Bing sentiment lexicon
bing_lexicon <- get_sentiments("bing")

# Join with Bing sentiments
word_sentiments <- words %>%
  inner_join(bing_lexicon, by = "word")

# Count positive and negative sentiments for each arena
sentiment_counts <- word_sentiments %>%
  count(arena, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(total = positive + negative, net_sentiment = positive - negative)

# Filter for Madison Square Garden and Chase Center
filtered_arenas <- sentiment_counts %>%
  filter(arena %in% c("Madison Square Garden", "Chase Center"))

# Compare sentiments
print(filtered_arenas)

ggplot(filtered_arenas, aes(x = arena, y = net_sentiment, fill = arena)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(title = "Net Sentiment Comparison: Madison Square Garden vs Chase Center",
       x = "Arena", y = "Net Sentiment Score")

Looking at the table and graph above, Madison Square Garden seems to have both more positive and more negative words in its reviews. This makes sense: the arena is so historic that people have strong opinions about it. Overall, Chase Center has the higher net sentiment, which suggests that reviewers have been positive about it since it opened.
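Because net sentiment is a raw count, an arena with more reviews can post a larger score simply by having more words. A minimal sketch of a size-independent comparison is to look at the share of sentiment words that are positive. The counts below are hypothetical and are not the TripAdvisor results.

```r
# Hypothetical sentiment word counts (not the real results from this report).
toy_counts <- data.frame(
  arena = c("Madison Square Garden", "Chase Center"),
  positive = c(900, 300),
  negative = c(450, 100)
)

# Raw net sentiment grows with the number of reviews.
toy_counts$net_sentiment <- toy_counts$positive - toy_counts$negative

# The share of positive words does not depend on how many reviews there are.
toy_counts$positive_share <-
  toy_counts$positive / (toy_counts$positive + toy_counts$negative)

print(toy_counts)
```

In this toy example the first arena has the larger net score but the smaller positive share, which shows why the two measures can disagree.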

Sentiment Over Time

bing_lexicon <- get_sentiments("bing")

# Tokenize reviews, keep sentiment words, and count them per arena and date
# Note: mdy() warns that 84 rows have a review_date that fails to parse (they become NA)
arena_sentiments <- arena_stats %>%
  filter(arena %in% c("Madison Square Garden", "Chase Center")) %>%
  unnest_tokens(word, review_content) %>%
  inner_join(bing_lexicon, by = "word") %>%
  mutate(review_date = mdy(review_date)) %>%
  group_by(arena, review_date, sentiment) %>%
  summarize(count = n(), .groups = 'drop')

# Calculate sentiment score (positive - negative)
sentiment_scores <- arena_sentiments %>%
  spread(sentiment, count, fill = 0) %>%
  mutate(sentiment_score = positive - negative)

# Plotting sentiment over time
# Note: geom_line() drops 1 row with a missing value
ggplot(sentiment_scores, aes(x = review_date, y = sentiment_score, color = arena)) +
  geom_line() +
  labs(title = "Sentiment Over Time for Madison Square Garden",
       x = "Review Date", y = "Sentiment Score") +
  theme_minimal()

For this graph, I only looked at Madison Square Garden’s sentiment over time because Chase Center is so new. I noticed that Madison Square Garden’s sentiment over time seems to track the Knicks’ performance. They were disappointing in 2021 and 2022 and are now getting back on track. In 2020 the sentiment was probably so high because not many people could visit and review MSG during the COVID-19 pandemic.
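The date conversion with mdy() turns any date it cannot read into NA, and those rows then drop out of the time plot. A minimal sketch for finding the problem values is shown below. The date strings are hypothetical, since the formats in the real review file are not shown in this report.

```r
library(lubridate)

# Hypothetical review dates; the last two are not in month-day-year form.
toy_dates <- c("03/14/2022", "12/25/2022", "25/12/2022", "unknown")

# quiet = TRUE suppresses the "failed to parse" warning so failures can be inspected.
parsed <- mdy(toy_dates, quiet = TRUE)

# Entries that mdy() could not read come back as NA.
failed <- toy_dates[is.na(parsed)]

print(parsed)
print(failed)
```

Looking at the failed values usually shows whether a second format (for example day-first dates) needs its own parsing step.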

Word Analysis

arenastats_filtered <- arena_stats %>%
  filter(arena %in% c("Madison Square Garden", "Chase Center")) %>%
  unnest_tokens(word, review_content) %>%
  anti_join(stop_words, by = "word")  # Remove common stop words

# Count word frequencies
word_counts <- arenastats_filtered %>%
  count(arena, word, sort = TRUE) %>%
  group_by(arena) %>%
  top_n(10, n)  # Get top 10 words for each arena

# Plotting
ggplot(word_counts, aes(x = reorder(word, n), y = n, fill = arena)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~arena, scales = "free_y") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Top Words in Reviews for Madison Square Garden and Chase Center",
       x = "Word", y = "Frequency")

For this graph, I compare the ten most common words used in reviews of Madison Square Garden and Chase Center. For Chase Center, they were the words you would expect, like seats, game, and food. For Madison Square Garden, one of the most common words was “people”, which makes sense given how packed it is. Money was also a common theme among the reviews.

Arena Conclusion

Based on my results, I learned that the experience at Madison Square Garden is a lot different from the experience at Chase Center. Every arena in the NBA is unique, and each comes with its own pros and cons. That is why it is so cool to visit a new arena to watch a basketball game.

Verdict

Although the NBA is focused on the players, coaches, and front offices, there are a lot of other factors that most people take for granted. First, the type of ref does have an impact on the game. Second, each arena is different from the next. These factors are what help make the NBA so great.