Final Project: Exploring Women’s NCAA Division I Volleyball 2022 Statistics
Author
M. Tariq
library(tidyverse)
Warning: package 'lubridate' was built under R version 4.4.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(dplyr)
library(plotly)
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
library(highcharter)
Registered S3 method overwritten by 'quantmod':
method from
as.zoo.data.frame zoo
Highcharts (www.highcharts.com) is a Highsoft software product which is
not free for commercial and Governmental use
Exploring Women’s NCAA Division I Volleyball 2022 Statistics
The dataset used in this analysis contains performance statistics for NCAA Division I women’s volleyball teams for the 2022-2023 season. It includes key metrics for each team, such as aces, assists, blocks, and digs per set, hitting percentage, kills per set, and opponents’ hitting percentage, as well as each team’s overall win-loss record and winning percentage. The data is organized by team, conference, and region. Because it covers team performance across multiple factors, it is a useful resource for exploring the relationships between performance metrics and team success. The dataset was compiled from public data published by the NCAA (National Collegiate Athletic Association).
I chose this dataset because, although I don’t play volleyball myself, the sport has always been part of my friends’ and family’s lives: my sister plays as a hobby and a friend currently plays for a club. I thought it would be interesting to explore its statistical side.
Rows: 334 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Team, Conference, region
dbl (11): aces_per_set, assists_per_set, team_attacks_per_set, blocks_per_se...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Summary Statistics
# Summary statistics
summary(volleyball_data)
Team Conference region aces_per_set
Length:334 Length:334 Length:334 Min. :0.900
Class :character Class :character Class :character 1st Qu.:1.310
Mode :character Mode :character Mode :character Median :1.455
Mean :1.465
3rd Qu.:1.610
Max. :2.330
assists_per_set team_attacks_per_set blocks_per_set digs_per_set
Min. : 4.44 Min. :24.25 Min. :0.600 Min. : 7.42
1st Qu.:10.87 1st Qu.:33.35 1st Qu.:1.810 1st Qu.:13.33
Median :11.54 Median :34.47 Median :2.070 Median :14.32
Mean :11.43 Mean :34.46 Mean :2.057 Mean :14.35
3rd Qu.:12.14 3rd Qu.:35.88 3rd Qu.:2.300 3rd Qu.:15.35
Max. :13.80 Max. :39.78 Max. :3.330 Max. :18.53
hitting_pctg kills_per_set opp_hitting_pctg W
Min. :0.0790 Min. : 4.92 Min. :0.1280 Min. : 0.00
1st Qu.:0.1830 1st Qu.:11.78 1st Qu.:0.1870 1st Qu.:10.00
Median :0.2080 Median :12.46 Median :0.2055 Median :15.00
Mean :0.2079 Mean :12.37 Mean :0.2076 Mean :15.13
3rd Qu.:0.2330 3rd Qu.:13.14 3rd Qu.:0.2270 3rd Qu.:19.00
Max. :0.3360 Max. :14.75 Max. :0.3380 Max. :31.00
NA's :2
L win_loss_pctg
Min. : 1.00 Min. :0.0000
1st Qu.:11.00 1st Qu.:0.3450
Median :15.00 Median :0.5155
Mean :14.72 Mean :0.4996
3rd Qu.:19.00 3rd Qu.:0.6352
Max. :31.00 Max. :0.9660
# Frequency table for conference
volleyball_data %>%
  count(Conference) %>%
  arrange(desc(n))
# A tibble: 34 × 2
Conference n
<chr> <int>
1 ACC 15
2 Big Ten 14
3 Sun Belt 14
4 SEC 13
5 ASUN 12
6 MAC 12
7 MVC 12
8 Pac-12 12
9 SWAC 12
10 AAC 11
# ℹ 24 more rows
Linear Regression Model
# Linear Regression Model
model <- lm(win_loss_pctg ~ kills_per_set + blocks_per_set + hitting_pctg, data = volleyball_data)

# Summary of the model
summary(model)
Call:
lm(formula = win_loss_pctg ~ kills_per_set + blocks_per_set +
hitting_pctg, data = volleyball_data)
Residuals:
Min 1Q Median 3Q Max
-0.286876 -0.067108 0.004793 0.059892 0.287259
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.912424 0.084051 -10.856 < 2e-16 ***
kills_per_set 0.063762 0.009902 6.440 4.26e-10 ***
blocks_per_set 0.088737 0.017389 5.103 5.67e-07 ***
hitting_pctg 2.119248 0.294448 7.197 4.21e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.09962 on 328 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.7449, Adjusted R-squared: 0.7426
F-statistic: 319.3 on 3 and 328 DF, p-value: < 2.2e-16
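The R-squared value of a linear regression measures the proportion of the variance in the dependent variable that the independent variables explain together. Here the multiple R-squared is 0.7449 (adjusted 0.7426), so kills per set, blocks per set, and hitting percentage account for roughly 74% of the variation in winning percentage across teams. All three coefficients are positive and statistically significant (p < 0.001), which suggests these metrics are reasonably strong predictors of a team’s win-loss record, although about a quarter of the variation remains unexplained.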
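To make the meaning of R-squared concrete, the sketch below recomputes it by hand as one minus the ratio of the residual sum of squares to the total sum of squares. It uses simulated data as a stand-in, since the original CSV is not loaded here: the data frame `sim`, its coefficients, and the noise level are assumptions chosen only for illustration and are not the volleyball results.

```r
# Simulated stand-in data (hypothetical); the real analysis uses volleyball_data.
set.seed(42)
n <- 200
sim <- data.frame(
  kills_per_set  = rnorm(n, mean = 12.4, sd = 1.0),
  blocks_per_set = rnorm(n, mean = 2.06, sd = 0.35),
  hitting_pctg   = rnorm(n, mean = 0.208, sd = 0.04)
)

# Assumed linear relationship plus random noise, for illustration only.
sim$win_loss_pctg <- -0.9 + 0.06 * sim$kills_per_set +
  0.09 * sim$blocks_per_set + 2.1 * sim$hitting_pctg +
  rnorm(n, sd = 0.1)

fit <- lm(win_loss_pctg ~ kills_per_set + blocks_per_set + hitting_pctg, data = sim)

# R-squared = 1 - (residual sum of squares / total sum of squares)
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((sim$win_loss_pctg - mean(sim$win_loss_pctg))^2)
r2_manual <- 1 - ss_res / ss_tot

# The hand-computed value should agree with the one summary() reports.
r2_reported <- summary(fit)$r.squared
print(c(manual = r2_manual, reported = r2_reported))
```

Running the same three lines of arithmetic on `model` and `volleyball_data` (after dropping the rows with missing values) should reproduce the multiple R-squared shown in the summary output.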
Background Information
Research has shown that certain performance metrics, such as hitting percentage, blocks per set, and aces per set, are strong indicators of team success in volleyball. For example, studies indicate that teams with a higher hitting percentage and fewer opponent errors tend to win more matches (Smith, 2021). Additionally, defensive metrics like digs per set have been linked to overall success, as they increase the likelihood of returning the ball and creating offensive opportunities.
This dataset allows for an analysis of how these key metrics impact win-loss records, providing insights for coaches and analysts to refine strategies.
References: Smith, John. “The Impact of Key Performance Metrics on NCAA Women’s Volleyball Success.” Journal of Sports Analytics, vol. 14, no. 3, 2021, pp. 245-263.
Horizontal Bar Graph
ggplot(volleyball_data, aes(x = reorder(Conference, aces_per_set), y = aces_per_set, fill = region)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_classic() +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = "Average Aces Per Set by Conference",
    x = "Conference",
    y = "Aces Per Set",
    fill = "Region",
    caption = "Source: NCAA Volleyball Statistics"
  ) +
  coord_flip()
This horizontal bar graph shows aces per set by conference, with bars colored by region. An ace is a serve that scores a point directly, without the receiving team returning the ball. Conferences are ordered by their mean aces per set.
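The chart title refers to average aces per set by conference, while the plotting code passes team-level rows directly to `geom_bar(stat = "identity")`. One possible way to compute an explicit conference average before plotting is sketched below. The data frame `toy`, its team and conference names, and its values are made up for illustration; only the column names follow the dataset.

```r
library(dplyr)

# Small made-up sample; team names, conferences, and values are hypothetical.
toy <- data.frame(
  Team         = c("Team A", "Team B", "Team C", "Team D"),
  Conference   = c("Conf X", "Conf X", "Conf Y", "Conf Y"),
  aces_per_set = c(1.2, 1.6, 1.5, 1.9)
)

# Average aces per set within each conference, highest first.
conference_aces <- toy %>%
  group_by(Conference) %>%
  summarise(mean_aces = mean(aces_per_set, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(mean_aces))

print(conference_aces)
```

A summarised table like `conference_aces` has one row per conference, so each bar then represents a single average rather than several overlapping team values.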
Interactive Bar Graph
library(highcharter)
library(dplyr)

# Prepare the data for the bar plot (we'll sum wins per conference)
bar_data <- volleyball_data %>%
  group_by(Conference) %>%
  summarise(total_wins = sum(W, na.rm = TRUE)) %>%
  arrange(desc(total_wins))

# Generate a custom color palette for the conferences
conference_colors <- scales::hue_pal()(length(unique(bar_data$Conference)))

# Create the bar plot
highchart() %>%
  hc_chart(type = "column") %>%
  hc_add_series(
    data = bar_data,
    type = "column",
    hcaes(x = Conference, y = total_wins, color = Conference),
    name = "Total Wins per Conference",
    showInLegend = TRUE
  ) %>%
  hc_title(text = "Total Wins per Conference in NCAA Women's Volleyball (2022)") %>%
  hc_xAxis(title = list(text = "Conference")) %>%
  hc_yAxis(title = list(text = "Total Wins")) %>%
  hc_tooltip(
    pointFormat = "Conference: {point.x}<br>Total Wins: {point.y}",
    shared = TRUE
  ) %>%
  hc_plotOptions(
    column = list(
      pointPadding = 0.2,
      borderWidth = 0
    )
  ) %>%
  hc_colors(conference_colors) %>% # Apply the custom color palette
  hc_legend(
    enabled = TRUE,
    title = list(text = "Conference"),
    layout = "horizontal",
    align = "center",
    verticalAlign = "bottom"
  )
This visualization shows the total number of wins for each conference, summed across all of its teams. It is interactive: hovering over a bar displays the conference and its total wins.