A numeric summary of at least two columns of data
For categorical columns, this should include unique values and counts
For numeric columns, this includes min/max, central tendency, and some notion of distribution (e.g., quantiles)
These summaries can be combined
A set of at least 3 novel questions to investigate…
Address at least one question using an aggregate function
Visual summaries of at least two columns of data
This should include distributions at least
In addition, you should consider trends, correlations, and interactions between variables
Use different channels (e.g., color) to show how categorical variables interact with continuous variables
*explain what insight was gained.
library(tidyverse) #load the tidyverse library
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
*using the players.csv dataset for this week only
t_players <- read_delim("C:/Users/danjh/Grad School/H510 Stats for DS/Datasets/players.csv", delim = ",") #load the players.csv
## Rows: 1873 Columns: 30
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): name, link, college, bbrID
## dbl (26): rank, draft_year, draft_rd, draft_pk, recruit_year, pick_overall, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
view(t_players)
#looking at what % of entries have a draft year
i_NA_draft_yr <- sum(is.na(t_players$draft_year)) #variable to store result of number of rows with NA value
paste("# of NA Rows: ", i_NA_draft_yr) #text version of results
## [1] "# of NA Rows: 1103"
i_total_rows <- nrow(t_players) #variable to store result of total number of rows
paste("total # of rows: ", i_total_rows) #text version of results
## [1] "total # of rows: 1873"
d_percent_w_year <- (1 - (i_NA_draft_yr/i_total_rows)) #proportion of rows with a non-missing draft year
paste("% with draft year: ", d_percent_w_year)
## [1] "% with draft year: 0.411105178857448"
summary(t_players$total_seasons)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 1.000 4.000 5.071 8.000 18.000 940
summary(t_players$draft_pk)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 10.00 22.00 24.35 38.00 60.00 1466
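Only about 41% of players have a recorded draft year, which is taken here to mean they were drafted into the NBA
For players' total seasons:
Players with a recorded value lasted about 5 seasons on average (median
4), with 25% lasting 1 season or less. 940 of 1873 rows were NA.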
For players' draft pick:
On average players were picked around 24th, with 25% being selected 10th
or earlier. The validity of this data is brought into question as 1466 of
1873 rows did not have values, even though only 1103 rows are missing a
draft year. This may mean that the player was not drafted or picks past
60 were not recorded. Further investigation would be needed to determine
which case is more likely.
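One way to start that investigation is to cross-tabulate which rows are missing `draft_year` against which are missing `draft_pk`. The sketch below uses a small made-up data frame (`toy_players` is hypothetical, not the real players.csv) and only assumes the two column names shown in the summaries above.

```r
library(dplyr)

# Hypothetical toy data standing in for t_players; the values are made up
toy_players <- data.frame(
  draft_year = c(2005, 2006, NA, NA, 2010, 2011),
  draft_pk   = c(3, NA, NA, NA, 45, NA)
)

# Count rows by whether draft_year and draft_pk are present
missing_counts <- toy_players |>
  count(has_draft_year = !is.na(draft_year),
        has_draft_pk   = !is.na(draft_pk))

missing_counts
```

On the real data, a large count of rows with a draft year but no draft pick would suggest unrecorded picks rather than undrafted players.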
What is the distribution of ranked players that were
drafted into the NBA?
The expectation is that players ranked in the 1st quartile were drafted
more frequently than players lower ranked.
What schools have the highest number of drafted
players?
A follow-up question might be: what was the success rate of
those players in the NBA?
What was the average # of seasons for drafted players?
What is the percentage of NA values in the data set for each column?
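The last question can be answered with a single aggregate over all columns. This is a minimal sketch on a made-up data frame (`toy_players` is hypothetical); on the real data the same `summarise()` call would be applied to `t_players`.

```r
library(dplyr)

# Hypothetical toy data standing in for t_players; the values are made up
toy_players <- data.frame(
  name          = c("A", "B", "C", "D"),
  draft_year    = c(2005, NA, NA, 2010),
  total_seasons = c(3, NA, 1, 8)
)

# Percentage of NA values in every column
na_pct <- toy_players |>
  summarise(across(everything(), ~ mean(is.na(.x)) * 100))

na_pct
```

The base R equivalent, `colMeans(is.na(toy_players)) * 100`, returns the same percentages as a named vector.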
t_players |>
  summarise(mean_number_of_seasons = mean(total_seasons, na.rm=TRUE))
## # A tibble: 1 × 1
## mean_number_of_seasons
## <dbl>
## 1 5.07
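The first question, on how draft outcomes vary by ranking, could be approached with a grouped aggregate in the same style. The sketch below uses a made-up data frame (`toy_players` is hypothetical) and makes two assumptions: a lower `rank` means a higher-ranked player, and a non-missing `draft_year` means the player was drafted.

```r
library(dplyr)

# Hypothetical toy data standing in for t_players; the values are made up
toy_players <- data.frame(
  rank       = 1:8,
  draft_year = c(2005, 2006, 2007, NA, 2008, NA, NA, NA)
)

# Split players into rank quartiles, then compute the share drafted in each
draft_by_quartile <- toy_players |>
  mutate(rank_quartile = ntile(rank, 4)) |>
  group_by(rank_quartile) |>
  summarise(players    = n(),
            drafted    = sum(!is.na(draft_year)),
            draft_rate = mean(!is.na(draft_year)))

draft_by_quartile
```

If the expectation holds on the real data, `draft_rate` should be highest for quartile 1 and fall off for the lower-ranked quartiles.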
basic plotting reminder - grammar of graphics plot = ggplot
General approach to visualization
What’s the question?
What data supports/informs the answer to the question?
What visualization strategy best illustrates the answer?
#build a vector of the colleges with the most players, used to limit the plot
colleges <- t_players |> #create the variable and identify the data
  group_by(college) |> #group by the colleges
  summarise(row_count = n()) |> #create a summary column called row_count
  arrange(desc(row_count)) |> #arrange the rows in descending order
  filter(row_count >=33, !is.na(college)) |> #keep colleges with at least 33 rows and a non-missing name
  pluck("college") #pull out the remaining college data into a vector
#colleges #test the results, commented out after testing
plt <- t_players |> #create the variable and id the data
  filter(college %in% colleges) |> #keep only the colleges in the vector
  ggplot() + #initiate the plot
  geom_bar(mapping = aes(x=college)) + #bar height = number of rows per college (all players, not only drafted ones)
  labs(title = "Which colleges have the most drafted players") +
  theme(axis.text.x =element_text(angle=90,hjust=1, vjust=0.5)) #rotate the x-axis labels
plt #show the plot
summary_data <- t_players |>
  filter(college == "Duke University" & !is.na(draft_year)) |> #Duke with a draft value
  group_by(college, draft_year) |> #grouping
  summarise(count = n(), .groups = "keep") #include the count of rows
plot <- ggplot(summary_data, aes(x = draft_year, y = count)) + #initiate the plot
  geom_line() + #use a line plot
  labs(title = "# of Duke players drafted to NBA", x = "Year", y = "# Drafted") +
  theme(legend.position = "bottom")
plot #show the plot
Plotting in R does not feel intuitive.
In plot one there was too much data to show any significance. After filtering down to only schools that had at least 33 players in the data set, it was much easier to see that Kentucky, Kansas and Duke had the most players. Note that the bar chart counts every player listed for a college, not only those with a draft year.
In plot two, it was more interesting to look at players from one school. The interesting thing from this data is that the early 2000s and the 2010s had banner years. An improvement to this graph might be better labels on the data points.
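One possible way to label the data points is to add a point layer and a text layer on top of the line. This is only a sketch on made-up counts (`toy_summary` is hypothetical and stands in for `summary_data`); the `vjust` offset is an arbitrary choice that would need tuning on the real plot.

```r
library(ggplot2)

# Hypothetical yearly counts standing in for summary_data; the values are made up
toy_summary <- data.frame(
  draft_year = c(2000, 2001, 2002, 2003),
  count      = c(1, 3, 2, 4)
)

labelled_plot <- ggplot(toy_summary, aes(x = draft_year, y = count)) +
  geom_line() +                                  # same line as before
  geom_point() +                                 # mark each year
  geom_text(aes(label = count), vjust = -0.8) +  # print the count above each point
  labs(title = "Players drafted per year (toy data)", x = "Year", y = "# Drafted")

labelled_plot #show the plot
```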