Week 2 Data Dive - Dan Harris

Overview of Tasks:

  1. A numeric summary of at least two columns of data

    • For categorical columns, this should include unique values and counts

    • For numeric columns, this includes min/max, central tendency, and some notion of distribution (e.g., quantiles)

    • These summaries can be combined

  2. A set of at least 3 novel questions to investigate…

  3. Address at least one question using an aggregate function

  4. Visual summaries of at least two columns of data

    • This should include distributions at least

    • In addition, you should consider trends, correlations, and interactions between variables

    • Use different channels (e.g., color) to show how categorical variables interact with continuous variables

*explain what insight was gained.

Initiation Steps

Step 1 - Load Libraries

library(tidyverse) #load the tidyverse library
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Step 2 - Load data set

*using the players.csv dataset for this week only

t_players <- read_delim("C:/Users/danjh/Grad School/H510 Stats for DS/Datasets/players.csv", delim = ",")  #load the players.csv 
## Rows: 1873 Columns: 30
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): name, link, college, bbrID
## dbl (26): rank, draft_year, draft_rd, draft_pk, recruit_year, pick_overall, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
view(t_players)

Task Demonstrations

Task 1 - A numeric summary of at least two columns of data

#looking at what % of entries have a draft year

i_NA_draft_yr <- sum(is.na(t_players$draft_year)) #variable to store result of number of rows with NA value
paste("# of NA Rows: ", i_NA_draft_yr)            #text version of results
## [1] "# of NA Rows:  1103"
i_total_rows <- nrow(t_players)                   #variable to store the total number of rows
paste("total # of rows: ", i_total_rows)          #text version of results
## [1] "total # of rows:  1873"
d_percent_w_year <- (1 - (i_NA_draft_yr/i_total_rows))                  #proportion of rows with a draft year (0.41 = 41%)

paste("% with draft year: ", d_percent_w_year)
## [1] "% with draft year:  0.411105178857448"
summary(t_players$total_seasons)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   1.000   4.000   5.071   8.000  18.000     940
summary(t_players$draft_pk)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00   10.00   22.00   24.35   38.00   60.00    1466
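Insights
  • Only 41% of players (770 of 1,873) have a draft year recorded. If a missing draft year means the player was not drafted, then only about 41% of the players in this data set were drafted into the NBA.

  • For players' total seasons:
    Among the 933 players with a recorded value, the mean was 5.07 seasons and the median was 4, with 25% lasting 1 season or less. The other 940 rows are NA.

  • For players' draft pick:
    On average players were picked around 24th (mean 24.35, median 22), with 25% selected 10th or earlier. The completeness of this column is questionable, as 1,466 of 1,873 rows (about 78%) have no value, which is more than the 1,103 rows missing a draft year. This may mean that the player was not drafted or that picks past 60 were not recorded. Further investigation would be needed to determine which case is more likely.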
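
One way to start the further investigation mentioned above is to cross-tabulate missingness in the two draft columns. The sketch below is only an illustration: toy_players is a small invented data frame (not the real players.csv), and it assumes the column names draft_year and draft_pk from the data set.

```r
library(dplyr)

# Toy stand-in for t_players; the values are invented for illustration only
toy_players <- data.frame(
  name       = c("A", "B", "C", "D", "E"),
  draft_year = c(2005, NA, 2010, 2012, NA),
  draft_pk   = c(3, NA, NA, 45, NA)
)

# Count rows for each combination of "draft_year missing" and "draft_pk missing"
na_overlap <- toy_players |>
  count(
    draft_year_missing = is.na(draft_year),
    draft_pk_missing   = is.na(draft_pk)
  )

na_overlap   #show the cross-tabulation
```

On the real data, a large count of rows where both columns are missing would be consistent with undrafted players, while rows with a draft year but no pick would point toward picks that were simply not recorded.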

Task 2 - A set of at least 3 novel questions to investigate…

  • What is the distribution of ranked players that were drafted into the NBA?
    The expectation is that players ranked in the top quartile were drafted more frequently than lower-ranked players.

  • What schools have the highest number of drafted players?
    A follow up question might be, what was the success rate of those players in the NBA?

  • What was the average # of seasons for drafted players?

  • What is the percentage of NA values in the data set for each column?

Insights
  • Looking at the data, a significant number of columns contain NA values. It is unclear whether the data was not collected or whether the NA and null values reflect information that was out of scope (for example, draft details for a player who was never drafted). Ideally, if the latter is the case, the data should reflect that more explicitly.
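
The fourth question (percentage of NA values per column) could be approached with across() inside summarise(). This is a minimal sketch on an invented data frame; toy_players and "College X" are placeholders, and only the column names are borrowed from players.csv.

```r
library(dplyr)

# Toy stand-in for t_players; the values are invented for illustration only
toy_players <- data.frame(
  college       = c("Duke University", NA, "College X", NA),
  draft_year    = c(2005, NA, NA, 2012),
  total_seasons = c(8, NA, NA, NA)
)

# For every column, compute the share of NA values as a percentage
pct_na <- toy_players |>
  summarise(across(everything(), ~ mean(is.na(.x)) * 100))

pct_na   #one row, one percentage per column
```

Replacing toy_players with t_players should give the same one-row summary for all 30 columns of the real data set.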

Task 3 - Address at least one question using an aggregate function

t_players |>
  summarise(mean_number_of_seasons = mean(total_seasons, na.rm=TRUE))
## # A tibble: 1 × 1
##   mean_number_of_seasons
##                    <dbl>
## 1                   5.07
Insights
  • This shows a more specific function that returns a single value, compared to the summary() function used in Task 1. The mean is the same, but summary() on the total_seasons column provides a wider view (min, quartiles, max, and NA count). That wider view is useful when trying to understand the data, but of less value when one specific value is required for further calculation.
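
The same aggregate can be split by group, which speaks more directly to the "drafted players" question. The sketch below uses invented values and assumes, as a labeled simplification, that a recorded draft_year means the player was drafted.

```r
library(dplyr)

# Toy stand-in for t_players; the values are invented for illustration only
toy_players <- data.frame(
  draft_year    = c(2005, NA, 2010, NA, 2012, 2003),
  total_seasons = c(10, 2, 6, NA, 4, NA)
)

seasons_by_draft <- toy_players |>
  mutate(drafted = !is.na(draft_year)) |>   #assumption: non-NA draft_year = drafted
  group_by(drafted) |>                      #one group per drafted status
  summarise(
    n_players      = n(),                                  #rows in each group
    mean_seasons   = mean(total_seasons, na.rm = TRUE),    #mean, ignoring NA
    median_seasons = median(total_seasons, na.rm = TRUE)   #median, ignoring NA
  )

seasons_by_draft
```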

Task 4 - Visual summaries of at least two columns of data

basic plotting reminder - grammar of graphics plot = ggplot

General approach to visualization

What’s the question?
What data supports/informs the answer to the question?
What visualization strategy best illustrates the answer?

Q1. Which colleges have the most players drafted into the NBA?

#build a vector to use for limiting the colleges to just the top
colleges <- t_players |>           #create the variable and identify the data
  group_by(college) |>             #group by the colleges
  summarise(row_count = n()) |>    #count rows per college (all players, drafted or not)
  arrange(desc(row_count)) |>      #arrange the rows in descending order
  filter(row_count >= 33, !is.na(college)) |>   #keep colleges with at least 33 players and drop NA
  pluck("college")                 #pull out the remaining college names into a vector

#colleges                          #test the results, commented out after testing

plt <- t_players |>                   #create the variable and id the data
  filter(college %in% colleges) |>    #filter the dataframe to just the top colleges
  ggplot() +                          #initiate the plot
  geom_bar(mapping = aes(x = college)) +    #bar height = number of rows per college
  labs(title = "Which colleges have the most drafted players") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) #rotate the x-axis labels

plt                               #show the plot
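
Because the bar chart counts every row per college, a variant that counts only drafted players may match the question more closely. This is a hedged sketch on invented data: toy_players and "College X" are placeholders, and it assumes that a non-NA draft_year marks a drafted player.

```r
library(dplyr)

# Toy stand-in for t_players; the values are invented for illustration only
toy_players <- data.frame(
  college    = c("Duke University", "Duke University", "College X",
                 "College X", "College X", NA),
  draft_year = c(2005, NA, 2010, 2011, NA, 2012)
)

drafted_by_college <- toy_players |>
  filter(!is.na(college), !is.na(draft_year)) |>   #keep drafted players with a known college
  count(college, name = "n_drafted", sort = TRUE)  #count per college, largest first

drafted_by_college
```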

Q2. Line chart of Duke players drafted into the NBA

summary_data <- t_players |>
  filter(college == "Duke University" & !is.na(draft_year)) |>  #Duke with a draft value
  group_by(college, draft_year) |>                    #grouping 
  summarise(count = n(), .groups = "keep")            #include the count of rows

plt_duke <- ggplot(summary_data, aes(x = draft_year, y = count)) +   #initiate the plot (named plt_duke to avoid reusing the name of base plot())
  geom_line() +                                                  #use a line plot
  labs(title = "# of Duke players drafted to NBA", x = "Year", y = "# Drafted") +
  theme(legend.position = "bottom")

plt_duke                                          #show the plot

Insights
  • Plotting in R does not feel intuitive.

  • In plot one, the full data set had too many colleges for the chart to be readable. After filtering down to only schools with at least 33 players in the data set, it was much easier to see that Kentucky, Kansas and Duke had the most players. Note that the bar chart counts every player from each college, drafted or not, so it shows which colleges are most represented in the data set rather than strictly which had the most players drafted into the NBA.

  • In plot two, it was more interesting to look at players from one school. The early 2000s and 2010s stand out as banner years. An improvement to this graph might be better labels on the data points.
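
The label improvement suggested for plot two could look something like the sketch below, which adds points and value labels on top of the line. The counts are invented for illustration (toy_counts is not the Duke data), and the vjust offset is just one reasonable choice.

```r
library(ggplot2)

# Toy yearly counts; the values are invented for illustration only
toy_counts <- data.frame(
  draft_year = c(1999, 2000, 2001, 2002, 2003),
  count      = c(4, 1, 2, 3, 1)
)

plt_labeled <- ggplot(toy_counts, aes(x = draft_year, y = count)) +
  geom_line() +                                    #same line as before
  geom_point() +                                   #mark each year with a point
  geom_text(aes(label = count), vjust = -0.8) +    #print the count just above each point
  labs(title = "# of players drafted per year (toy data)", x = "Year", y = "# Drafted")

plt_labeled                                        #show the plot
```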