Week 5 Data Dive

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.5.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

NBA Dataset

Loading in the data:

NBA_Data <- read_csv("NBA Dataset for Submission.csv")

## Rows: 46977 Columns: 76
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (10): season_type, game_id, team_abbreviation_home, team_name_home, tea...
## dbl  (65): season_id, season, team_id_home, team_id_away, fgm_home, fga_home...
## date  (1): game_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Filtering out unnecessary games:

NBA_Data <- NBA_Data |>
  filter(season_type != 'All-Star',
         season_type != 'All Star',
         season_type != 'Pre Season')

Looking at Two Columns

two_columns <- NBA_Data |>
  select(dreb_home, pts_qtr1_home)

I chose these two columns (defensive rebounds for home teams and 1st quarter points for home teams) due to their presence of NAs and an unknown reasoning why. Within the scope of the game of basketball, both of these columns should not be left with NAs.

sum_of_dreb_nas <- sum(is.na(two_columns$dreb_home))
print(paste("Defensive rebound NA count:", sum_of_dreb_nas))

## [1] "Defensive rebound NA count: 359"

sum_of_dreb_q1pts <- sum(is.na(two_columns$pts_qtr1_home))
print(paste("Q1 points NA count:", sum_of_dreb_q1pts))

## [1] "Q1 points NA count: 23831"

The documentation dictates that this data is scraped daily from NBA stats, but offers no further explanation as to why there are these aforementioned NAs.

Visualizing the Defensive Rebound Numbers

missing_dreb <- sum(is.na(two_columns$dreb_home))

two_columns |>
  ggplot(aes(x = dreb_home)) +
  geom_histogram(binwidth = 1, fill = "gray60", color = "white") +
  annotate("text",
           x = Inf, y = Inf,
           label = paste("Missing values:", missing_dreb),
           hjust = 1.1, vjust = 1.5,
           color = "red", size = 5) +
  labs(
    title = "Distribution of Home Defensive Rebounds (Missing Count Annotated)",
    x = "Home Defensive Rebounds",
    y = "Count"
  ) +
  theme_minimal()

## Warning: Removed 359 rows containing non-finite outside the scale range
## (`stat_bin()`).

Visualizing the First Quarter Points Number

missing_count <- sum(is.na(two_columns$pts_qtr1_home))

two_columns |>
  ggplot(aes(x = pts_qtr1_home)) +
  geom_histogram(binwidth = 2, fill = "gray60", color = "white") +
  annotate("text",
           x = Inf, y = Inf,
           label = paste("Missing values:", missing_count),
           hjust = 1.1, vjust = 1.5,
           color = "red", size = 5) +
  labs(
    title = "Distribution of Home Q1 Points (Missing Count Annotated)",
    x = "Home Points in Q1",
    y = "Count"
  ) +
  theme_minimal()

## Warning: Removed 23831 rows containing non-finite outside the scale range
## (`stat_bin()`).

I think the biggest concern here is the volume of rows affected in the secondary graphic. Having nearly half of the rows missing data is a large concern and if I was counting on this particular statistic for insight than it would cause me great pause on my data set. However, I think this just discourages me from specific types of analysis, mainly quarter by quarter scoring. When I look at the general points scored for teams in a game, I don’t have the same NA issue so I will stick to utilizing those columns.

Looking at Two Categorical Columns

two_cat_columns <- NBA_Data |>   
  select(team_abbreviation_home, wl_home)

I chose these two columns (home team’s abbreviation and the game result) for their direct involvement in every game.

Explicitly Missing Rows

sum(is.na(two_cat_columns$team_abbreviation_home))

## [1] 0

If this returns anything > 0, those rows are explicitly missing. However, we see here that there are no explicitly missing rows, unlike earlier in the rebounding or first quarter point numbers.

Implicitly Missing Rows

length(unique(two_cat_columns$team_abbreviation_home))

## [1] 42

There are 30 active NBA teams so I was expecting this number to be at least 30. Anything less than that would indicate that there are missing rows. The opposite also makes sense since teams have relocated and changed names/cities over the years leading to new abbreviations.

There are also no empty groups relating to these two columns as well.

Continuous Column: Attendance

attendance_col <- NBA_Data |>
  select(attendance) |>
  filter(!is.na(attendance))

summary(attendance_col)

##    attendance    
##  Min.   :     0  
##  1st Qu.: 14890  
##  Median : 17317  
##  Mean   : 16716  
##  3rd Qu.: 19068  
##  Max.   :200049

Q1 <- quantile(attendance_col$attendance, 0.25, na.rm = TRUE)
Q3 <- quantile(attendance_col$attendance, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1

attendance_col |>
  ggplot(aes(x = attendance)) +
  geom_histogram(binwidth = 1000, fill = "steelblue", color = "white", alpha = 0.8) +
  labs(
    title = "Distribution of Game Attendance",
    x = "Attendance",
    y = "Count"
  ) +
  theme_minimal()

Getting rid of anything above 50K

attendance_col <- attendance_col |>
  filter(attendance <= 50000)

summary(attendance_col)

##    attendance   
##  Min.   :    0  
##  1st Qu.:14890  
##  Median :17317  
##  Mean   :16705  
##  3rd Qu.:19068  
##  Max.   :49551

attendance_col |>
  ggplot(aes(x = attendance)) +
  geom_histogram(binwidth = 1000, fill = "steelblue", color = "white", alpha = 0.8) +
  labs(
    title = "Distribution of Game Attendance",
    x = "Attendance",
    y = "Count"
  ) +
  theme_minimal()

Trying the 1.5IQR Method

lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR

attendance_filtered <- attendance_col |>
  filter(attendance >= lower_bound & attendance <= upper_bound)

summary(attendance_filtered)

##    attendance   
##  Min.   : 8625  
##  1st Qu.:15025  
##  Median :17389  
##  Mean   :16989  
##  3rd Qu.:19127  
##  Max.   :25297

attendance_filtered |>
  ggplot(aes(x = attendance)) +
  geom_histogram(binwidth = 1000, fill = "steelblue", color = "white", alpha = 0.8) +
  labs(
    title = "Distribution of Game Attendance",
    x = "Attendance",
    y = "Count"
  ) +
  theme_minimal()

attendance_filtered |>
  ggplot(aes(y = attendance)) +
  geom_boxplot(fill = "steelblue", alpha = 0.7) +
  labs(
    title = "Distribution of Game Attendance",
    y = "Attendance",
    x = ""
  ) +
  theme_minimal()

I think we get a lot more cohesive, and normal, of a distribution when we eliminate the outliers and get closer to the center. For the purpose of this exercise, I think it is safe to label the upper and lower bounds, as determined by the arbitrary test we discussed in class as the outliers, but I think one could also make the argument for different numbers depending on the issue at hand. I think the distribution for everything under 50,000 is not a poor distribution and could serve as a worthy use case if necessary as well.