What Does It Take to Win a Volleyball Game?

Author

Tessa McCollum

Introduction

The data set I explore in this project describes the play styles of all teams in NCAA Division 1 women’s volleyball from 2022-2023. The data was collected and posted on Score Network, but its original source is the official NCAA stats page. It covers per-team statistics such as assists per set, team attacks per set, and digs per set. For my purposes, I analyze and compare blocks per set, kills per set, aces per set, win/loss percentage, digs per set, opponent hitting percentage, and region.

In volleyball, according to the NCAA, a block is when one or more players sharply deflect a spike from the opposing team, either scoring on the opponent’s court or sending the ball out of bounds; it is a defensive move. A kill is when a player spikes the ball onto the opponent’s side, scoring a point for their team; it is an offensive move. An ace is a serve delivered so well that the opponents cannot effectively defend against it before it touches their court, meaning the serving team touched the ball only once before scoring. Like a kill, an ace is an offensive move. A dig is when a player gets under a spiked ball in time to stop it from hitting the ground; it is a defensive move. Finally, a set is a period of play (one of up to five in a match) in which the teams compete to reach 25 points, leading by at least two.

For my analysis, I am interested in comparing Division 1 teams by playing style, whether more offensive or more defensive, to see which strength generally results in more wins.
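As a small illustration of the set rule described above (reach the target score while leading by at least two), here is a minimal sketch in base R. The function name `set_is_won` and its `target` argument are my own hypothetical choices, not something from the NCAA or the data set.

```r
# Hypothetical helper: has a set been won at this score?
# `target` defaults to 25, the score described in the introduction.
# Assumption: a deciding fifth set would use a lower target (15).
set_is_won <- function(points_a, points_b, target = 25) {
  leader <- max(points_a, points_b)
  margin <- abs(points_a - points_b)
  # Both conditions must hold: target reached AND a lead of two or more.
  leader >= target && margin >= 2
}

set_is_won(25, 23)  # TRUE
set_is_won(25, 24)  # FALSE: the lead is only one point
set_is_won(27, 25)  # TRUE: play continued past 25 until the lead was two
```

This is only meant to make the scoring rule concrete; none of the analysis below depends on it.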

Loading All Libraries and Importing Data Set

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(GGally)

Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2

library(ggfortify)
library(ggplot2)
library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

setwd("~/Documents/Data 110")
volleyball_set <- readr::read_csv("volleyball_ncaa_div1_2022_23.csv")

Rows: 334 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): Team, Conference, region
dbl (11): aces_per_set, assists_per_set, team_attacks_per_set, blocks_per_se...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

volleyball_set

# A tibble: 334 × 14
   Team      Conference region aces_per_set assists_per_set team_attacks_per_set
   <chr>     <chr>      <chr>         <dbl>           <dbl>                <dbl>
 1 Lafayette Patriot    East           2.33            11.0                 34.5
 2 Delaware… MEAC       South…         2.2             11.4                 30.0
 3 Yale      Ivy League East           2.15            12.6                 35.4
 4 Coppin S… MEAC       South…         2.15            10.6                 32.5
 5 Saint Lo… Atlantic … East           2.03            11.6                 34.1
 6 UTEP      C-USA      South          1.98            11.5                 31.5
 7 Samford   SoCon      South…         1.93            12.3                 37.0
 8 Lipscomb  ASUN       South…         1.91            12.9                 35.6
 9 Southern… MVC        Midwe…         1.91            13.2                 36.4
10 Howard    MEAC       South…         1.91            11.9                 31.6
# ℹ 324 more rows
# ℹ 8 more variables: blocks_per_set <dbl>, digs_per_set <dbl>,
#   hitting_pctg <dbl>, kills_per_set <dbl>, opp_hitting_pctg <dbl>, W <dbl>,
#   L <dbl>, win_loss_pctg <dbl>

Data Cleaning:

Subsetting the original data set to include only the relevant variables

volleyball_linear <- volleyball_set |>
  select(win_loss_pctg,
         blocks_per_set,
         aces_per_set,
         digs_per_set,
         opp_hitting_pctg,
         kills_per_set)
volleyball_linear

# A tibble: 334 × 6
   win_loss_pctg blocks_per_set aces_per_set digs_per_set opp_hitting_pctg
           <dbl>          <dbl>        <dbl>        <dbl>            <dbl>
 1         0.348           1.31         2.33         13.6            0.227
 2         0.774           2.17         2.2          12.6            0.137
 3         0.885           1.82         2.15         15.3            0.155
 4         0.676           1.81         2.15         14.2            0.17 
 5         0.581           1.83         2.03         14.3            0.188
 6         0.567           2.39         1.98         12.6            0.175
 7         0.594           1.73         1.93         15.4            0.202
 8         0.552           1.85         1.91         13.2            0.23 
 9         0.581           1.36         1.91         15.2            0.237
10         0.667           1.73         1.91         13.2            0.179
# ℹ 324 more rows
# ℹ 1 more variable: kills_per_set <dbl>

Linear Regression Win/Loss Percentage and Summary Statistics:

lm_wins <- lm(win_loss_pctg ~ blocks_per_set + kills_per_set + aces_per_set + digs_per_set + opp_hitting_pctg, data = volleyball_linear)

summary(lm_wins)


Call:
lm(formula = win_loss_pctg ~ blocks_per_set + kills_per_set + 
    aces_per_set + digs_per_set + opp_hitting_pctg, data = volleyball_linear)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.18800 -0.04988 -0.00263  0.04491  0.50786 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       0.4160293  0.1403802   2.964  0.00326 ** 
blocks_per_set   -0.0004115  0.0170761  -0.024  0.98079    
kills_per_set     0.0933211  0.0048392  19.284  < 2e-16 ***
aces_per_set      0.0312029  0.0237766   1.312  0.19032    
digs_per_set     -0.0252110  0.0042676  -5.908 8.69e-09 ***
opp_hitting_pctg -3.6293317  0.2580834 -14.063  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.08487 on 328 degrees of freedom
Multiple R-squared:  0.8213,    Adjusted R-squared:  0.8186 
F-statistic: 301.6 on 5 and 328 DF,  p-value: < 2.2e-16

Diagnostic Plot 1:

Using autoplot to create 6 plots, arranged 2 by 3

diagnostic_plot_1 <- autoplot(lm_wins, which = 1:6, nrow = 2, ncol = 3)
diagnostic_plot_1

Diagnostic Plot 2:

Scatterplot matrix of the model variables

diagnostic_plot_2 <- ggpairs(volleyball_linear)
diagnostic_plot_2

Linear Regression Equation:

win/loss percentage = -0.0004(blocks_per_set) + 0.0933(kills_per_set) + 0.0312(aces_per_set) - 0.0252(digs_per_set) - 3.6293(opp_hitting_pctg) + 0.416
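To show how the regression equation is used, here is a minimal sketch that plugs one team's numbers into it. The coefficients are copied from the `summary(lm_wins)` output; the team itself is hypothetical, invented for illustration, and is not a row from the data set.

```r
# Coefficients copied from the summary(lm_wins) output.
coefs <- c(
  intercept        =  0.4160293,
  blocks_per_set   = -0.0004115,
  kills_per_set    =  0.0933211,
  aces_per_set     =  0.0312029,
  digs_per_set     = -0.0252110,
  opp_hitting_pctg = -3.6293317
)

# Hypothetical team, invented for illustration (not from the NCAA data).
team <- c(
  blocks_per_set   = 2.0,
  kills_per_set    = 13.0,
  aces_per_set     = 1.5,
  digs_per_set     = 14.0,
  opp_hitting_pctg = 0.200
)

# Intercept plus the sum of each coefficient times its predictor value.
predicted <- coefs[["intercept"]] + sum(coefs[names(team)] * team)
round(predicted, 3)  # about 0.596
```

Read this way, the model says that one extra kill per set is associated with roughly a 0.093 increase in predicted win percentage, holding the other variables fixed. With the fitted model in memory, `predict(lm_wins, newdata = ...)` should give the same kind of answer without copying coefficients by hand.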

Analysis:

Data Cleaning:

Subsetting the original data set and adding a new column for the absolute value of the difference between a team’s kills per set and digs per set (offensive and defensive plays)

final_org <- volleyball_set |>
  select(digs_per_set,
         kills_per_set,
         win_loss_pctg,
         region) |>
  # absolute gap between offensive (kills) and defensive (digs) plays
  mutate(difference = abs(kills_per_set - digs_per_set)) |>
  # highest win percentage first
  arrange(desc(win_loss_pctg))

final_org

# A tibble: 334 × 5
   digs_per_set kills_per_set win_loss_pctg region    difference
          <dbl>         <dbl>         <dbl> <chr>          <dbl>
 1         13.4          14.4         0.966 South          1    
 2         13.2          13.7         0.939 West           0.450
 3         13.2          13.8         0.935 East           0.640
 4         14.3          14.6         0.933 Southeast      0.320
 5         12.8          13           0.912 East           0.220
 6         13.4          14           0.886 East           0.610
 7         15.3          13.9         0.885 East           1.39 
 8         15.2          13.8         0.882 Southeast      1.32 
 9         12.5          13.6         0.879 South          1.12 
10         15.7          14.4         0.879 Midwest        1.29 
# ℹ 324 more rows
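The `difference` column is an absolute value, so it records how far apart kills and digs are but not which one is larger. As a hedged sketch of one way to keep that direction, the base R example below adds a signed gap and a label next to the absolute gap. The three teams and the column names `signed_diff` and `leaning` are hypothetical, invented for illustration.

```r
# Hypothetical teams, invented for illustration (not rows from the data set).
teams <- data.frame(
  team          = c("A", "B", "C"),
  kills_per_set = c(14.4, 13.9, 11.0),
  digs_per_set  = c(13.4, 15.3, 14.5)
)

# Signed gap: positive means more kills than digs, negative means more digs.
teams$signed_diff <- teams$kills_per_set - teams$digs_per_set

# Absolute gap, the same calculation as the `difference` column.
teams$difference <- abs(teams$signed_diff)

# Label which side each team leans toward (an exact tie falls into "defense").
teams$leaning <- ifelse(teams$signed_diff > 0, "offense", "defense")

teams
```

Keeping both columns would allow the same plot to separate offense-leaning from defense-leaning teams, which the absolute value alone cannot do.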

Plot 1:

plot_1 <- final_org |>
  ggplot(aes(x = digs_per_set, y = kills_per_set, color = region, fill = win_loss_pctg)) +
  geom_point(stroke = 0.7, shape = 21, size = 4) +
  scale_color_manual(values = c(`Midwest` = 'blue', `South` = 'yellow', `East` = 'red', `Southeast` = 'orange', `West` = 'purple')) +
  scale_fill_gradient(low = 'hotpink', high = 'green') +
  labs(color = 'Region',
       title = "Women's Volleyball Teams Compared by Defensive and Offensive Strength",
       x = 'Digs Per Set (defensive)',
       y = 'Kills Per Set (offensive)',
       subtitle = 'Division 1 NCAA, 2022-2023',
       fill = 'Win Percentage',
       caption = 'Source: NCAA Stats') +
  theme_dark()

plot_1

Plot 1 Part 2:

Adding Plotly interactivity, which garbles the legends

ggplotly(plot_1)

Plot 2:

plot_2 <- final_org |>
  ggplot(aes(x = difference, y = win_loss_pctg, color = region)) +
  geom_point(stroke = 0.7, shape = 21, size = 4) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_color_manual(values = c(`Midwest` = 'blue', `South` = 'yellow', `East` = 'red', `Southeast` = 'orange', `West` = 'purple')) +
  labs(color = 'Region',
       title = "Women's Volleyball Teams Compared by Difference in Defensive and Offensive Plays",
       x = 'Difference in Kills and Digs Per Set',
       y = 'Win Percentage',
       caption = 'Source: NCAA Stats') +
  theme_dark()

plot_2

`geom_smooth()` using formula = 'y ~ x'

Plot 2 Interactive Version (final):

ggplotly(plot_2)

`geom_smooth()` using formula = 'y ~ x'
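The trend lines in Plot 2 come from `geom_smooth(method = "lm")`, which fits a separate simple regression for each color group. As a sketch of how those slopes could be read off as numbers, the base R example below fits one `lm()` per region. It runs on simulated data with a negative slope built in, so the numbers are invented and say nothing about the real teams; the seed and the name `sim` are arbitrary assumptions.

```r
set.seed(110)  # arbitrary seed so the simulated example is repeatable

# Simulated stand-in for final_org: invented numbers, not the NCAA data.
n_per_region <- 30
sim <- data.frame(
  region     = rep(c("East", "West"), each = n_per_region),
  difference = runif(2 * n_per_region, min = 0, max = 3)
)
# Build in a negative relationship (slope -0.1) plus some noise.
sim$win_loss_pctg <- 0.8 - 0.1 * sim$difference + rnorm(nrow(sim), sd = 0.05)

# One simple regression per region, the same kind of model that
# geom_smooth(method = "lm") fits for each color group.
slopes <- sapply(split(sim, sim$region), function(d) {
  coef(lm(win_loss_pctg ~ difference, data = d))[["difference"]]
})
slopes

# Overall strength of the linear relationship across all teams.
cor(sim$difference, sim$win_loss_pctg)
```

Replacing `sim` with `final_org` would give the actual per-region slopes and the overall correlation between `difference` and `win_loss_pctg`.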

Essay:

To create my visualization, I first created a subset of the original data set that included only the variables digs_per_set, kills_per_set, win_loss_pctg, and region. I then added a column for the absolute value of the difference between digs_per_set and kills_per_set, to see whether the gap between offensive and defensive plays widens as a team’s win percentage goes down. For the visualizations I wanted to make, no further cleaning was required, so I jumped right into ggplot. I used the subset to create a scatterplot in which each team in NCAA Division 1 women’s volleyball is a dot filled according to the team’s win percentage and outlined in a color representing one of the five regions in the data. The x-axis is digs_per_set (a defensive play), and the y-axis is kills_per_set (an offensive play). I chose digs_per_set and kills_per_set because my linear regression showed both to be significant predictors of win percentage, unlike the offensive metric aces_per_set, which was not significant (p = 0.19).

My aim was to answer two questions: (1) were there any teams with a stronger defense than offense that still had a high win percentage, and (2) were there any teams that were relatively strong in offense but weak in defense, and therefore did not do well? The plot shows basically what one might expect, meaning the answer to both questions is essentially no. Most teams with a high win percentage simply have excellent defense and offense instead of relying on one or the other, and the same goes for teams with a low win percentage: they were simply weak in both areas (or so I thought at first glance).

I did notice some teams with relatively few digs per set that still had quite a high win percentage because their offense made up for their weaker defense, which is close to my line of thought. I also noticed that some teams that ranked highly in kills and digs did not do as well in win percentage as one might expect, which just seems unlucky; I might have learned more about that had I inspected opp_hitting_pctg (opponent hitting percentage) more closely. Region does not seem to be much of a factor.

Then I realized that, to some degree, the gap between a team’s kills per set and digs per set may grow wider as win percentage goes down, favoring one or the other, whereas the teams with the most wins seem to have kills and digs that are closer together. Maybe it is not as simple as lower-ranked teams having equally low stats in defense and offense; every team is in Division 1, after all. This inspired my second visualization, a scatterplot comparing the absolute value of the difference between each team’s defensive and offensive plays per set with its win percentage. According to this scatterplot, there is some sort of negative relationship between the two variables, which I find interesting and a little more complex.

One thing I wish I could have done is make the interactivity for my first plot make more sense. Plotly seems to combine legends that are filled in automatically instead of explicitly stated in ggplot, so while I still decided to include the interactive version because I think it is decipherable, I recognize the keys are messy and confusing to read. I really struggled with this visualization: I spent many hours trying to get more interesting-looking things to work and eventually settled for this.

I also realize that the dots in my first plot are so numerous and clumped together that it is a little hard to differentiate individual teams (which is why I wanted to make the plot interactive), so I wish I had thought of a good way to downsize. It always seemed that I either did not have enough data (for example, I tried taking the top 5 and bottom 5 teams, which did not tell me much), or I was exceeding 10 colors for a categorical variable, or ran into some other problem of the sort.
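One possible way to downsize the first plot, offered only as a sketch: group teams into a few win-percentage tiers and plot one averaged point per tier instead of one dot per team. The base R example below uses eight hypothetical teams invented for illustration, and the tier boundaries and labels are arbitrary assumptions rather than anything taken from the analysis.

```r
# Hypothetical teams, invented for illustration (not the NCAA data).
teams <- data.frame(
  win_loss_pctg = c(0.95, 0.88, 0.74, 0.66, 0.58, 0.47, 0.35, 0.21),
  kills_per_set = c(14.4, 13.8, 13.1, 12.9, 12.4, 12.0, 11.3, 10.6),
  digs_per_set  = c(13.4, 15.2, 14.0, 13.1, 14.6, 13.0, 15.1, 12.2)
)

# Cut win percentage into four equal-width tiers between 0 and 1.
teams$tier <- cut(
  teams$win_loss_pctg,
  breaks = c(0, 0.25, 0.5, 0.75, 1),
  labels = c("bottom", "lower-middle", "upper-middle", "top"),
  include.lowest = TRUE
)

# One row per tier: average kills and digs for the teams in it.
tier_means <- aggregate(
  cbind(kills_per_set, digs_per_set) ~ tier,
  data = teams,
  FUN = mean
)
tier_means
```

Four tiers stay well under the 10-color limit for a categorical variable while still using every team, which sits between plotting all teams and plotting only the top 5 and bottom 5.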

Source: NCAA Division 1 Women’s Volleyball stats page 2022-2023