Final Project
Load Packages
First, I will load the packages necessary for this project.
library(tidyverse)
library(readxl)
Read-in Data
Next, I will read my data in. The dataset contains my basketball stats from the past 3 seasons.
bballstats <- read_excel("data/bballstats.xlsx")
bballstats
# A tibble: 85 × 27
date opponent site started min fgm fga fgper `3fgm`
<dttm> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2023-11-11 00:00:00 Valley City… Home No 14 2 6 0.333 0
2 2023-11-18 00:00:00 Parkside Away No 7 4 5 0.8 4
3 2023-11-25 00:00:00 Northern Mi… Home No 5 0 0 0 0
4 2023-11-26 00:00:00 Michigan Te… Home No 2 0 1 0 0
5 2023-12-01 00:00:00 Crookston Home No 5 1 5 0.2 1
6 2023-12-02 00:00:00 Minot Home No 2 0 3 0 0
7 2023-12-08 00:00:00 Northern Away No 3 1 2 0.5 0
8 2023-12-09 00:00:00 Moorhead Away No 8 0 1 0 0
9 2023-12-12 00:00:00 Mary Away No 6 0 1 0 0
10 2024-01-05 00:00:00 Sioux Falls Away No 6 2 2 1 2
# ℹ 75 more rows
# ℹ 18 more variables: `3fga` <dbl>, `3fgper` <dbl>, ftm <dbl>, fta <dbl>,
# ftper <dbl>, offreb <dbl>, defreb <dbl>, totreb <dbl>, avgreb <dbl>,
# pf <dbl>, assists <dbl>, turnovers <dbl>, blocks <dbl>, steals <dbl>,
# points <dbl>, avgpoints <dbl>, eff <dbl>, outcome <chr>
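The data has a date column but no season column, even though the discussion of the line plot compares seasons. Below is a minimal sketch of one way a season label could be derived. It assumes a season runs from fall into the following spring, so games played before July belong to the season that started the previous calendar year. The `games` tibble, the `season_start` column, and the July cutoff are my own assumptions for illustration, not part of the original data.

```r
library(dplyr)
library(lubridate)

# Four dates copied from the printed output above; the real data has 85 rows.
games <- tibble(
  date = as.POSIXct(c("2023-11-11", "2024-01-05", "2024-11-08", "2026-02-05"),
                    tz = "UTC")
)

# Assumption: games from July onward count toward the season starting that
# year; games earlier in the year belong to the season that started the
# year before.
games_season <- games |>
  mutate(season_start = if_else(month(date) >= 7, year(date), year(date) - 1))

games_season
```

With a column like this, a plot could be colored or faceted by season instead of relying on reading the dates by eye.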
Displaying and Analyzing Data
This first graph is a box plot of points by game site. It helps show where I typically score the most. From this plot, no site clearly stands out.
bballstats |>
  ggplot(mapping = aes(x = site, y = points)) +
  geom_boxplot() +
  labs(title = "Average Points for Home, Away, and Neutral Games",
       x = "Home, Away, or Neutral",
       y = "Average Points")
Comparing points in wins and losses shows whether I contribute more in one or the other. Overall, I score slightly more in wins, but the box plot alone does not show that the difference is statistically significant, and scoring above a certain number of points does not guarantee a win.
bballstats |>
  ggplot(mapping = aes(x = outcome, y = points)) +
  geom_boxplot() +
  labs(title = "Average Points for Win vs. Loss",
       x = "Win or Loss",
       y = "Average Points")
This graph shows my points for each game in chronological order. The trend shows how my scoring has fluctuated over time and whether it has gone up, down, or stayed the same. Here, my scoring went up from the first season to the second and then fluctuated quite a bit in the third, meaning it was not consistent throughout this last season.
bballstats |>
  ggplot(mapping = aes(x = date, y = points)) +
  geom_line() +
  labs(title = "Points Over Time",
       x = "Date",
       y = "Points")
This next plot combines several variables, since looking at these stats together can help reveal a trend. One example would be a game with few points, little to no assists, little to no rebounds, and a loss. To me, there is no very apparent trend, but the two major outliers I see each have one or more values that are substantially greater than the median.
bballstats |>
  ggplot(mapping = aes(x = assists, y = points, size = totreb,
                       color = outcome)) +
  geom_point() +
  labs(title = "Points, Assists, and Rebounds for Wins and Losses",
       x = "Assists",
       y = "Points",
       color = "Outcome",
       size = "Total Rebounds")
Plotting 2-point field goal percentage against 2-point attempts gives me an idea of whether shooting more helps my percentage. I also colored the points by made field goals so the makes are easier to read, although they can also be worked out from attempts and percentage. Clearly, the more shots I take, the more I can make, but free throws and 3-pointers are not included here, so my total points cannot be calculated from this plot.
bballstats |>
  # 2-point attempts and makes are the overall totals minus the 3-point totals
  mutate(
    `2fga` = fga - `3fga`,
    `2fgm` = fgm - `3fgm`,
    `2fgper` = `2fgm` / `2fga`) |>
  relocate(`2fgm`, `2fga`, `2fgper`, .after = fgper) |>
  # Games with no 2-point attempts give 0 / 0 = NaN; treat those as 0
  mutate(
    `2fgper` = if_else(is.na(`2fgper`), 0, `2fgper`)) |>
  mutate(
    `2fgper` = round(`2fgper`, 3)) |>
  ggplot(mapping = aes(x = `2fga`, y = `2fgper`, color = `2fgm`)) +
  geom_point() +
  labs(title = "Field Goal Percentage for 2-pointers with Attempts and Makes",
       x = "Field Goals Attempted",
       y = "Field Goal Percentage",
       color = "Field Goals Made")
Computing the median of each stat below shows the values that are “average” for me and that I want to aim for in each game. Seeing these values separately shows what a typical game has looked like for me over the past 3 seasons.
bballstats |>
  summarize(
    med_points = median(points, na.rm = TRUE),
    med_reb = median(totreb, na.rm = TRUE),
    med_ast = median(assists, na.rm = TRUE),
    med_to = median(turnovers, na.rm = TRUE))
# A tibble: 1 × 4
med_points med_reb med_ast med_to
<dbl> <dbl> <dbl> <dbl>
1 8 3 2 2
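The same kind of summary can be written more compactly when the list of stats grows. This is a minimal sketch using `across()` from dplyr on a hypothetical toy tibble (`toy` is made up for illustration and is not the real game data); the `med_{.col}` naming pattern is my own choice and gives slightly different column names than the ones above.

```r
library(dplyr)

# Hypothetical toy rows for illustration only; not the real game data.
toy <- tibble(
  points    = c(2, 8, 15, 20),
  totreb    = c(1, 3, 4, 7),
  assists   = c(0, 2, 2, 5),
  turnovers = c(1, 2, 2, 4)
)

# Apply the same median call to every listed column in one step.
toy_medians <- toy |>
  summarize(across(c(points, totreb, assists, turnovers),
                   \(x) median(x, na.rm = TRUE),
                   .names = "med_{.col}"))

toy_medians
```

Adding another stat then only means adding its column name to the `c(...)` list.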
Now, I want to see the games in which I was at or above the median in points, rebounds, and assists and at or below the median in turnovers.
bballstats |>
  filter(points >= median(points) &
           totreb >= median(totreb) &
           assists >= median(assists) &
           turnovers <= median(turnovers))
# A tibble: 11 × 27
date opponent site started min fgm fga fgper `3fgm`
<dttm> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2024-11-08 00:00:00 Central Okl… Home Yes 31 3 7 0.429 0
2 2024-12-14 00:00:00 Northern Home Yes 29 3 7 0.429 2
3 2025-01-11 00:00:00 Duluth Away Yes 28 7 13 0.538 3
4 2025-01-24 00:00:00 Sioux Falls Home Yes 27 4 14 0.286 3
5 2025-01-25 00:00:00 SMSU Home Yes 31 5 19 0.263 3
6 2025-02-15 00:00:00 Augie Away Yes 30 6 11 0.545 4
7 2025-02-22 00:00:00 Winona Home Yes 25 7 17 0.412 3
8 2025-11-14 00:00:00 Washburn Neut… Yes 26 8 14 0.571 2
9 2025-11-20 00:00:00 Northern Mi… Away Yes 35 9 21 0.429 1
10 2026-02-05 00:00:00 SMSU Home Yes 32 4 10 0.4 1
11 2026-02-14 00:00:00 Augie Away Yes 27 7 11 0.636 5
# ℹ 18 more variables: `3fga` <dbl>, `3fgper` <dbl>, ftm <dbl>, fta <dbl>,
# ftper <dbl>, offreb <dbl>, defreb <dbl>, totreb <dbl>, avgreb <dbl>,
# pf <dbl>, assists <dbl>, turnovers <dbl>, blocks <dbl>, steals <dbl>,
# points <dbl>, avgpoints <dbl>, eff <dbl>, outcome <chr>
There are 11 games in which I met or beat my “average” in points, rebounds, assists, and turnovers. Next, I want to see the win or loss right away, so I keep only the selected variables.
bballstats |>
  filter(points >= median(points) &
           totreb >= median(totreb) &
           assists >= median(assists) &
           turnovers <= median(turnovers)) |>
  select(date, opponent, outcome, site, points, totreb, assists, turnovers)
# A tibble: 11 × 8
date opponent outcome site points totreb assists turnovers
<dttm> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 2024-11-08 00:00:00 Central Ok… Win Home 9 6 3 1
2 2024-12-14 00:00:00 Northern Win Home 8 4 2 2
3 2025-01-11 00:00:00 Duluth Win Away 25 13 2 0
4 2025-01-24 00:00:00 Sioux Falls Loss Home 15 7 5 1
5 2025-01-25 00:00:00 SMSU Loss Home 13 5 2 2
6 2025-02-15 00:00:00 Augie Win Away 20 7 7 2
7 2025-02-22 00:00:00 Winona Win Home 17 7 2 2
8 2025-11-14 00:00:00 Washburn Win Neut… 19 3 3 2
9 2025-11-20 00:00:00 Northern M… Loss Away 22 3 2 1
10 2026-02-05 00:00:00 SMSU Win Home 9 4 4 2
11 2026-02-14 00:00:00 Augie Win Away 19 8 2 2
Now I can see that my most well-rounded games, in which I scored 8 or more points, had 3 or more rebounds, had 2 or more assists, and had 2 or fewer turnovers, resulted in more wins than losses (8 wins and 3 losses). This leads me to believe that when I am not only scoring more but also assisting my teammates, grabbing rebounds, and taking care of the ball, my team typically plays better.
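The win and loss split for these games can also be tallied in code instead of being read off the table. This is a minimal sketch that retypes the dates and outcomes of the 11 games from the printed table into a small tibble; the `well_rounded` and `outcome_counts` names and the `share` column are my own additions. In the actual project, the same `count()` step could presumably be piped straight onto the filtered data.

```r
library(dplyr)

# Dates and outcomes copied from the 11-row table above.
well_rounded <- tibble(
  date = as.Date(c("2024-11-08", "2024-12-14", "2025-01-11", "2025-01-24",
                   "2025-01-25", "2025-02-15", "2025-02-22", "2025-11-14",
                   "2025-11-20", "2026-02-05", "2026-02-14")),
  outcome = c("Win", "Win", "Win", "Loss", "Loss", "Win",
              "Win", "Win", "Loss", "Win", "Win")
)

# Count games per outcome and express each count as a share of the total.
outcome_counts <- well_rounded |>
  count(outcome) |>
  mutate(share = n / sum(n))

outcome_counts
```

This only describes the 11 well-rounded games. Comparing it with the win share in the remaining games would be needed before saying these games are won more often than the rest.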