library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
NBA_Data <- read_csv("NBA Dataset for Submission.csv")
## Rows: 46977 Columns: 76
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): season_type, game_id, team_abbreviation_home, team_name_home, tea...
## dbl (65): season_id, season, team_id_home, team_id_away, fgm_home, fga_home...
## date (1): game_date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
NBA_Data <- NBA_Data |>
filter(season_type != 'All-Star',
season_type != 'All Star',
season_type != 'Pre Season')
two_columns <- NBA_Data |>
select(dreb_home, pts_qtr1_home)
I chose these two columns (defensive rebounds for home teams and 1st quarter points for home teams) due to their presence of NAs and an unknown reasoning why. Within the scope of the game of basketball, both of these columns should not be left with NAs.
sum_of_dreb_nas <- sum(is.na(two_columns$dreb_home))
print(paste("Defensive rebound NA count:", sum_of_dreb_nas))
## [1] "Defensive rebound NA count: 359"
sum_of_dreb_q1pts <- sum(is.na(two_columns$pts_qtr1_home))
print(paste("Q1 points NA count:", sum_of_dreb_q1pts))
## [1] "Q1 points NA count: 23831"
The documentation dictates that this data is scraped daily from NBA stats, but offers no further explanation as to why there are these aforementioned NAs.
missing_dreb <- sum(is.na(two_columns$dreb_home))
two_columns |>
ggplot(aes(x = dreb_home)) +
geom_histogram(binwidth = 1, fill = "gray60", color = "white") +
annotate("text",
x = Inf, y = Inf,
label = paste("Missing values:", missing_dreb),
hjust = 1.1, vjust = 1.5,
color = "red", size = 5) +
labs(
title = "Distribution of Home Defensive Rebounds (Missing Count Annotated)",
x = "Home Defensive Rebounds",
y = "Count"
) +
theme_minimal()
## Warning: Removed 359 rows containing non-finite outside the scale range
## (`stat_bin()`).
missing_count <- sum(is.na(two_columns$pts_qtr1_home))
two_columns |>
ggplot(aes(x = pts_qtr1_home)) +
geom_histogram(binwidth = 2, fill = "gray60", color = "white") +
annotate("text",
x = Inf, y = Inf,
label = paste("Missing values:", missing_count),
hjust = 1.1, vjust = 1.5,
color = "red", size = 5) +
labs(
title = "Distribution of Home Q1 Points (Missing Count Annotated)",
x = "Home Points in Q1",
y = "Count"
) +
theme_minimal()
## Warning: Removed 23831 rows containing non-finite outside the scale range
## (`stat_bin()`).
I think the biggest concern here is the volume of rows affected in the secondary graphic. Having nearly half of the rows missing data is a large concern and if I was counting on this particular statistic for insight than it would cause me great pause on my data set. However, I think this just discourages me from specific types of analysis, mainly quarter by quarter scoring. When I look at the general points scored for teams in a game, I don’t have the same NA issue so I will stick to utilizing those columns.
two_cat_columns <- NBA_Data |>
select(team_abbreviation_home, wl_home)
I chose these two columns (home team’s abbreviation and the game result) for their direct involvement in every game.
sum(is.na(two_cat_columns$team_abbreviation_home))
## [1] 0
If this returns anything > 0, those rows are explicitly missing. However, we see here that there are no explicitly missing rows, unlike earlier in the rebounding or first quarter point numbers.
length(unique(two_cat_columns$team_abbreviation_home))
## [1] 42
There are 30 active NBA teams so I was expecting this number to be at least 30. Anything less than that would indicate that there are missing rows. The opposite also makes sense since teams have relocated and changed names/cities over the years leading to new abbreviations.
There are also no empty groups relating to these two columns as well.
attendance_col <- NBA_Data |>
select(attendance) |>
filter(!is.na(attendance))
summary(attendance_col)
## attendance
## Min. : 0
## 1st Qu.: 14890
## Median : 17317
## Mean : 16716
## 3rd Qu.: 19068
## Max. :200049
Q1 <- quantile(attendance_col$attendance, 0.25, na.rm = TRUE)
Q3 <- quantile(attendance_col$attendance, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
attendance_col |>
ggplot(aes(x = attendance)) +
geom_histogram(binwidth = 1000, fill = "steelblue", color = "white", alpha = 0.8) +
labs(
title = "Distribution of Game Attendance",
x = "Attendance",
y = "Count"
) +
theme_minimal()
attendance_col <- attendance_col |>
filter(attendance <= 50000)
summary(attendance_col)
## attendance
## Min. : 0
## 1st Qu.:14890
## Median :17317
## Mean :16705
## 3rd Qu.:19068
## Max. :49551
attendance_col |>
ggplot(aes(x = attendance)) +
geom_histogram(binwidth = 1000, fill = "steelblue", color = "white", alpha = 0.8) +
labs(
title = "Distribution of Game Attendance",
x = "Attendance",
y = "Count"
) +
theme_minimal()
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
attendance_filtered <- attendance_col |>
filter(attendance >= lower_bound & attendance <= upper_bound)
summary(attendance_filtered)
## attendance
## Min. : 8625
## 1st Qu.:15025
## Median :17389
## Mean :16989
## 3rd Qu.:19127
## Max. :25297
attendance_filtered |>
ggplot(aes(x = attendance)) +
geom_histogram(binwidth = 1000, fill = "steelblue", color = "white", alpha = 0.8) +
labs(
title = "Distribution of Game Attendance",
x = "Attendance",
y = "Count"
) +
theme_minimal()
attendance_filtered |>
ggplot(aes(y = attendance)) +
geom_boxplot(fill = "steelblue", alpha = 0.7) +
labs(
title = "Distribution of Game Attendance",
y = "Attendance",
x = ""
) +
theme_minimal()
I think we get a lot more cohesive, and normal, of a distribution when we eliminate the outliers and get closer to the center. For the purpose of this exercise, I think it is safe to label the upper and lower bounds, as determined by the arbitrary test we discussed in class as the outliers, but I think one could also make the argument for different numbers depending on the issue at hand. I think the distribution for everything under 50,000 is not a poor distribution and could serve as a worthy use case if necessary as well.