Week 5 Data Dive

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.5.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

NBA Dataset

Loading in the data:

NBA_Data <- read_csv("NBA Dataset for Submission.csv")

## Rows: 46977 Columns: 76
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (10): season_type, game_id, team_abbreviation_home, team_name_home, tea...
## dbl  (65): season_id, season, team_id_home, team_id_away, fgm_home, fga_home...
## date  (1): game_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Filtering out unnecessary games:

NBA_Data <- NBA_Data |>
  filter(season_type != 'All-Star',
         season_type != 'All Star',
         season_type != 'Pre Season')

Looking at Two Columns

two_columns <- NBA_Data |>
  select(dreb_home, pts_qtr1_home)

I chose these two columns (defensive rebounds for home teams and 1st quarter points for home teams) due to their presence of NAs and an unknown reasoning why. Within the scope of the game of basketball, both of these columns should not be left with NAs.

sum_of_dreb_nas <- sum(is.na(two_columns$dreb_home))
print(paste("Defensive rebound NA count:", sum_of_dreb_nas))

## [1] "Defensive rebound NA count: 359"

sum_of_dreb_q1pts <- sum(is.na(two_columns$pts_qtr1_home))
print(paste("Q1 points NA count:", sum_of_dreb_q1pts))

## [1] "Q1 points NA count: 23831"

The documentation dictates that this data is scraped daily from NBA stats, but offers no further explanation as to why there are these aforementioned NAs.

Visualizing the Defensive Rebound Numbers

missing_dreb <- sum(is.na(two_columns$dreb_home))

two_columns |>
  ggplot(aes(x = dreb_home)) +
  geom_histogram(binwidth = 1, fill = "gray60", color = "white") +
  annotate("text",
           x = Inf, y = Inf,
           label = paste("Missing values:", missing_dreb),
           hjust = 1.1, vjust = 1.5,
           color = "red", size = 5) +
  labs(
    title = "Distribution of Home Defensive Rebounds (Missing Count Annotated)",
    x = "Home Defensive Rebounds",
    y = "Count"
  ) +
  theme_minimal()

## Warning: Removed 359 rows containing non-finite outside the scale range
## (`stat_bin()`).

While any NAs in the column is less than comforting, I think there are enough rows to get by that if I chose to pursue this column further, I could omit them and continue with analysis on the effects of rebounding on games. That being said, there are certainly risks associated with that, especially with the fact that this data is simply scraped from the NBA and therefore really should not have any null values in the rebounding categories - it is literally impossible for a team to go an entire game without a single rebound. It illuminates other challenges and risks with the data like potentially the next analysis.

Visualizing the First Quarter Points Number

missing_count <- sum(is.na(two_columns$pts_qtr1_home))

two_columns |>
  ggplot(aes(x = pts_qtr1_home)) +
  geom_histogram(binwidth = 2, fill = "gray60", color = "white") +
  annotate("text",
           x = Inf, y = Inf,
           label = paste("Missing values:", missing_count),
           hjust = 1.1, vjust = 1.5,
           color = "red", size = 5) +
  labs(
    title = "Distribution of Home Q1 Points (Missing Count Annotated)",
    x = "Home Points in Q1",
    y = "Count"
  ) +
  theme_minimal()

## Warning: Removed 23831 rows containing non-finite outside the scale range
## (`stat_bin()`).

I think the biggest concern here is the volume of rows affected in this graphic. Having nearly half of the rows missing data is a large concern and if I was counting on this particular statistic for insight than it would cause me great pause on my data set. However, I think this just discourages me from specific types of analyses, mainly quarter by quarter scoring. When I look at the general points scored for teams in a game, I don’t have the same NA issue so I will stick to utilizing those columns. I am concerned with the data set containing NAs, but I also think there is enough other data to work around this issue and continue to pursue it to not force me to switch data sets.

Looking at Two Categorical Columns

two_cat_columns <- NBA_Data |>   
  select(team_abbreviation_home, wl_home)

I chose these two columns (home team’s abbreviation and the game result) for their direct involvement in every game.

Explicitly Missing Rows

sum(is.na(two_cat_columns$team_abbreviation_home))

## [1] 0

If this returns anything > 0, those rows are explicitly missing. However, we see here that there are no explicitly missing rows, unlike earlier in the rebounding or first quarter point numbers.

Implicitly Missing Rows

length(unique(two_cat_columns$team_abbreviation_home))

## [1] 42

There are 30 active NBA teams so I was expecting this number to be at least 30. Anything less than that would indicate that there are missing rows. The opposite also makes sense since teams have relocated and changed names/cities over the years leading to new abbreviations.

There are also no empty groups relating to these two columns as well.

Continuous Column: Attendance

attendance_col <- NBA_Data |>
  select(attendance) |>
  filter(!is.na(attendance))

summary(attendance_col)

##    attendance    
##  Min.   :     0  
##  1st Qu.: 14890  
##  Median : 17317  
##  Mean   : 16716  
##  3rd Qu.: 19068  
##  Max.   :200049

Q1 <- quantile(attendance_col$attendance, 0.25, na.rm = TRUE)
Q3 <- quantile(attendance_col$attendance, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1

attendance_col |>
  ggplot(aes(x = attendance)) +
  geom_histogram(binwidth = 1000, fill = "steelblue", color = "white", alpha = 0.8) +
  labs(
    title = "Distribution of Game Attendance",
    x = "Attendance",
    y = "Count"
  ) +
  theme_minimal()

We see in the above visualization and summary statistics that there is quite a large distribution within the attendance column. However, it seems pretty skewed below the 50K marker, with very few observations above it, so I am choosing that as a starting place for ridding outliers.

Getting rid of anything above 50K

attendance_col <- attendance_col |>
  filter(attendance <= 50000)

summary(attendance_col)

##    attendance   
##  Min.   :    0  
##  1st Qu.:14890  
##  Median :17317  
##  Mean   :16705  
##  3rd Qu.:19068  
##  Max.   :49551

attendance_col |>
  ggplot(aes(x = attendance)) +
  geom_histogram(binwidth = 1000, fill = "steelblue", color = "white", alpha = 0.8) +
  labs(
    title = "Distribution of Game Attendance",
    x = "Attendance",
    y = "Count"
  ) +
  theme_minimal()

This distribution looks a lot more “normal” than the earlier “unfiltered” version. With very few visual observations approaching 50K, one could still argue that they are outliers though, so I am curious what the arbitrary 1.5 times IQR approach will yield.

Trying the 1.5IQR Method

lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR

attendance_filtered <- attendance_col |>
  filter(attendance >= lower_bound & attendance <= upper_bound)

summary(attendance_filtered)

##    attendance   
##  Min.   : 8625  
##  1st Qu.:15025  
##  Median :17389  
##  Mean   :16989  
##  3rd Qu.:19127  
##  Max.   :25297

attendance_filtered |>
  ggplot(aes(x = attendance)) +
  geom_histogram(binwidth = 1000, fill = "steelblue", color = "white", alpha = 0.8) +
  labs(
    title = "Distribution of Game Attendance",
    x = "Attendance",
    y = "Count"
  ) +
  theme_minimal()

In contrast to any of the previous distributions, we don’t see any obvious outliers here and this resembles a relatively normal distribution. However, we also trimmed off a decent chunk of data to get here so was it worth it?

attendance_filtered |>
  ggplot(aes(y = attendance)) +
  geom_boxplot(fill = "steelblue", alpha = 0.7) +
  labs(
    title = "Distribution of Game Attendance",
    y = "Attendance",
    x = ""
  ) +
  theme_minimal()

I think we get a lot more cohesive, and normal, of a distribution when we eliminate the outliers and get closer to the center. For the purpose of this exercise, I think it is safe to label the upper and lower bounds, as determined by the arbitrary test we discussed in class as the outliers, but I think one could also make the argument for different numbers depending on the issue at hand. I think the distribution for everything under 50,000 is not a poor distribution and could serve as a worthy use case if necessary as well. I would utilize one of the final two distributions depending on how much variance I am looking for in the data and what the question at hand is.