Final Project: Exploring Women’s NCAA Division I Volleyball 2022 Statistics

Author

M. Tariq

library(tidyverse)
Warning: package 'lubridate' was built under R version 4.4.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(dplyr)
library(plotly)

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout
library(highcharter)
Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo 
Highcharts (www.highcharts.com) is a Highsoft software product which is
not free for commercial and Governmental use

Exploring Women’s NCAA Division I Volleyball 2022 Statistics

The dataset used in this analysis contains performance statistics for NCAA Division I women’s volleyball teams for the 2022-2023 season. It includes key metrics for each team, such as aces, assists, blocks, and digs per set, hitting percentage, kills per set, and opponents’ hitting percentage, as well as each team’s overall win-loss record and winning percentage. The data is organized by team, conference, and region. Because it covers team performance across multiple factors, it is useful for exploring the relationships between individual performance metrics and team success. The data set was compiled from public data published by the NCAA (National Collegiate Athletic Association).

I chose this data set because, although I don’t play volleyball myself, the sport has always been part of my friends’ and family’s lives: my sister plays as a hobby and a friend currently plays for a club. I thought it would be interesting to explore the statistical side of the game.

Loading the Data Set

setwd("C:/Users/tmanh/OneDrive/Documents/college stuff/Data 110")
volleyball_data <- read_csv("volleyball.csv") 
Rows: 334 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): Team, Conference, region
dbl (11): aces_per_set, assists_per_set, team_attacks_per_set, blocks_per_se...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Summary Statistics

# Summary statistics
summary(volleyball_data)
     Team            Conference           region           aces_per_set  
 Length:334         Length:334         Length:334         Min.   :0.900  
 Class :character   Class :character   Class :character   1st Qu.:1.310  
 Mode  :character   Mode  :character   Mode  :character   Median :1.455  
                                                          Mean   :1.465  
                                                          3rd Qu.:1.610  
                                                          Max.   :2.330  
                                                                         
 assists_per_set team_attacks_per_set blocks_per_set   digs_per_set  
 Min.   : 4.44   Min.   :24.25        Min.   :0.600   Min.   : 7.42  
 1st Qu.:10.87   1st Qu.:33.35        1st Qu.:1.810   1st Qu.:13.33  
 Median :11.54   Median :34.47        Median :2.070   Median :14.32  
 Mean   :11.43   Mean   :34.46        Mean   :2.057   Mean   :14.35  
 3rd Qu.:12.14   3rd Qu.:35.88        3rd Qu.:2.300   3rd Qu.:15.35  
 Max.   :13.80   Max.   :39.78        Max.   :3.330   Max.   :18.53  
                                                                     
  hitting_pctg    kills_per_set   opp_hitting_pctg       W        
 Min.   :0.0790   Min.   : 4.92   Min.   :0.1280   Min.   : 0.00  
 1st Qu.:0.1830   1st Qu.:11.78   1st Qu.:0.1870   1st Qu.:10.00  
 Median :0.2080   Median :12.46   Median :0.2055   Median :15.00  
 Mean   :0.2079   Mean   :12.37   Mean   :0.2076   Mean   :15.13  
 3rd Qu.:0.2330   3rd Qu.:13.14   3rd Qu.:0.2270   3rd Qu.:19.00  
 Max.   :0.3360   Max.   :14.75   Max.   :0.3380   Max.   :31.00  
 NA's   :2                                                        
       L         win_loss_pctg   
 Min.   : 1.00   Min.   :0.0000  
 1st Qu.:11.00   1st Qu.:0.3450  
 Median :15.00   Median :0.5155  
 Mean   :14.72   Mean   :0.4996  
 3rd Qu.:19.00   3rd Qu.:0.6352  
 Max.   :31.00   Max.   :0.9660  
                                 
# Frequency table for conference
volleyball_data %>%
  count(Conference) %>%
  arrange(desc(n))
# A tibble: 34 × 2
   Conference     n
   <chr>      <int>
 1 ACC           15
 2 Big Ten       14
 3 Sun Belt      14
 4 SEC           13
 5 ASUN          12
 6 MAC           12
 7 MVC           12
 8 Pac-12        12
 9 SWAC          12
10 AAC           11
# ℹ 24 more rows

Linear Regression Model

# Linear Regression Model
model <- lm(win_loss_pctg ~ kills_per_set + blocks_per_set + hitting_pctg, data = volleyball_data)

# Summary of the model
summary(model)

Call:
lm(formula = win_loss_pctg ~ kills_per_set + blocks_per_set + 
    hitting_pctg, data = volleyball_data)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.286876 -0.067108  0.004793  0.059892  0.287259 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    -0.912424   0.084051 -10.856  < 2e-16 ***
kills_per_set   0.063762   0.009902   6.440 4.26e-10 ***
blocks_per_set  0.088737   0.017389   5.103 5.67e-07 ***
hitting_pctg    2.119248   0.294448   7.197 4.21e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.09962 on 328 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.7449,    Adjusted R-squared:  0.7426 
F-statistic: 319.3 on 3 and 328 DF,  p-value: < 2.2e-16

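The R-squared value reports the proportion of the variation in the dependent variable (winning percentage) that the independent variables explain together. This model has an R-squared of 0.74 (adjusted 0.74), so kills per set, blocks per set, and hitting percentage jointly account for about 74% of the variation in winning percentage across teams. All three coefficients are positive and statistically significant (p < 0.001), meaning that teams with higher values on these metrics tend to win a larger share of their matches.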
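To make the R-squared figure less of a black box, it can be recomputed from the residuals of the fitted model. The helper below is a minimal sketch: the function name `r_squared_by_hand` is hypothetical, and it assumes an ordinary `lm` fit that includes an intercept.

```r
# Hypothetical helper: recompute R-squared and adjusted R-squared from an lm fit.
# Assumes the model includes an intercept.
r_squared_by_hand <- function(fit) {
  y   <- model.response(model.frame(fit))  # observed outcome (rows with NAs already dropped)
  rss <- sum(residuals(fit)^2)             # variation the model leaves unexplained
  tss <- sum((y - mean(y))^2)              # total variation around the mean
  n   <- length(y)                         # observations used in the fit
  p   <- length(coef(fit)) - 1             # number of predictors, excluding the intercept
  r2  <- 1 - rss / tss
  c(r_squared     = r2,
    adj_r_squared = 1 - (1 - r2) * (n - 1) / (n - p - 1))
}

# Usage with the model fitted above (assumes `model` exists in the session):
# r_squared_by_hand(model)
```

Applied to the model above, the two values should agree with the Multiple R-squared and Adjusted R-squared lines in the `summary(model)` output.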

Background Information

Research has shown that certain performance metrics, such as hitting percentage, blocks per set, and aces per set, are strong indicators of team success in volleyball. For example, studies indicate that teams with a higher hitting percentage and fewer opponent errors tend to win more matches (Smith, 2021). Additionally, defensive metrics like digs per set have been linked to overall success, as they increase the likelihood of returning the ball and creating offensive opportunities.

This dataset allows for an analysis of how these key metrics impact win-loss records, providing insights for coaches and analysts to refine strategies.
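One simple way to start that kind of analysis is to rank each numeric metric by its correlation with winning percentage. The function below is only a sketch: the name `metric_correlations` and its `exclude` argument are hypothetical, and a correlation describes association rather than cause.

```r
# Hypothetical helper: correlation of every numeric column with an outcome column.
# `exclude` lists columns to skip, e.g. W and L, which determine winning percentage directly.
metric_correlations <- function(data, outcome = "win_loss_pctg", exclude = character(0)) {
  numeric_cols <- names(data)[vapply(data, is.numeric, logical(1))]
  predictors   <- setdiff(numeric_cols, c(outcome, exclude))
  cors <- vapply(
    predictors,
    function(col) cor(data[[col]], data[[outcome]], use = "complete.obs"),
    numeric(1)
  )
  sort(cors, decreasing = TRUE)  # strongest positive association first
}

# Usage with the data loaded above (assumes `volleyball_data` exists in the session):
# metric_correlations(volleyball_data, exclude = c("W", "L"))
```

Excluding `W` and `L` keeps the ranking focused on in-game metrics, since win and loss counts are components of the winning percentage itself.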

References: Smith, John. “The Impact of Key Performance Metrics on NCAA Women’s Volleyball Success.” Journal of Sports Analytics, vol. 14, no. 3, 2021, pp. 245-263.

Horizontal Bar Graph

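# Average aces per set within each conference (and region) before plotting,
# so that the bar heights match the "Average" in the title
volleyball_data %>%
  group_by(Conference, region) %>%
  summarise(avg_aces = mean(aces_per_set, na.rm = TRUE), .groups = "drop") %>%
  ggplot(aes(x = reorder(Conference, avg_aces), y = avg_aces, fill = region)) +
  geom_col(position = "dodge") +
  theme_classic() +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Average Aces Per Set by Conference", 
       x = "Conference", 
       y = "Aces Per Set", 
       fill = "Region",
       caption = "Source: NCAA Volleyball Statistics") +
  coord_flip()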

This horizontal bar graph shows aces per set for each conference, ordered by the conference average and colored by region. An ace is a serve that scores a point directly because the receiving team cannot return it.

Interactive Bar Graph

library(highcharter)
library(dplyr)

# Prepare the data for the bar plot (we'll sum wins per conference)
bar_data <- volleyball_data %>%
  group_by(Conference) %>%
  summarise(total_wins = sum(W, na.rm = TRUE)) %>%
  arrange(desc(total_wins))

# Generate a custom color palette for the conferences
conference_colors <- scales::hue_pal()(length(unique(bar_data$Conference)))

# Create the bar plot
highchart() %>%
  hc_chart(type = "column") %>%
  hc_add_series(
    data = bar_data,
    type = "column",
    hcaes(x = Conference, y = total_wins, color = Conference),
    name = "Total Wins per Conference",
    showInLegend = TRUE
  ) %>%
  hc_title(text = "Total Wins per Conference in NCAA Women's Volleyball (2022)") %>%
  hc_xAxis(type = "category", title = list(text = "Conference")) %>%  # label bars with conference names
  hc_yAxis(title = list(text = "Total Wins")) %>%
  hc_tooltip(
    pointFormat = "Conference: {point.Conference}<br>Total Wins: {point.y}",
    shared = TRUE
  ) %>%
  hc_plotOptions(
    column = list(
      pointPadding = 0.2,
      borderWidth = 0
    )
  ) %>%
  hc_colors(conference_colors) %>%  # Apply the custom color palette
  hc_legend(
    enabled = TRUE,
    title = list(text = "Conference"),
    layout = "horizontal",
    align = "center",
    verticalAlign = "bottom"
  )

This visualization shows the total number of wins for each conference, summed across all of its member teams. It is interactive: hovering over a bar brings up a tooltip with that conference’s total wins for the season.