
Project 2

Source: From NCAA [Logo], by National Collegiate Athletic Association, 2023, NCAA

(https://www.google.com/url?sa=t&source=web&rct=j&url=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2F2023_NCAA_Division_I_women%27s_volleyball_tournament&ved=0CBYQjRxqFwoTCMiLyKqJgZQDFQAAAAAdAAAAABAG&opi=89978449)

Fair Use

Introduction

For this project, I wanted to see which variable or variables had the greatest effect on win percentage in volleyball matches. This would allow me to see which specific team action had the greatest effect on a team’s success. Volleyball is a sport centered around two teams trying to land the ball on the opposing team’s side of the court. Teams must coordinate offensive and defensive actions to win the match. I have personally enjoyed playing and watching this sport, and I have friends who currently play collegiate volleyball. The data-set I used for this project was “Team Statistics for Division I Women’s Volleyball” from the SCORE Sports Data Repository. Its authors, Jack Fay, A.J. Dykstra, and Ivan Ramler, published it on July 24th, 2023. The statistics cover the NCAA Division I Women’s Volleyball 2022-2023 season, with the official information gathered directly from NCAA statistics. The data-set contained 334 observations with 14 variables: three categorical and eleven quantitative. The categorical variables were the team name (Team), the athletic conference the team played in (Conference), and the region the team plays in (region). The quantitative variables were more detailed and included a mixture of per-set averages and season totals for the team’s overall performance. To start, the data-set included the total number of wins (W) and losses (L) the team had in the season.

The per-set averages are based on the sets of each match. Each NCAA Division I volleyball match is played as a best of five sets, with the winner being the first team to win three sets. The data-set included three percentage statistics, two of them based on hitting percentage. Hitting percentage subtracts the total number of attack errors from the total number of successful attacks (kills) and divides the difference by the total number of attack attempts. This is calculated for each team (hitting_pctg) and for its opponents (opp_hitting_pctg). Each team’s match record for the season is calculated by dividing its total wins by the matches it played (win_loss_pctg). Individual offensive and defensive statistics were also included. Defensive averages included blocks (blocks_per_set) and defensive passes (digs_per_set). Offensive averages included hits that scored a point (kills_per_set), serves that directly led to a point (aces_per_set), the number of times the ball is hit into the opposing team’s side (team_attacks_per_set), and the number of assists (assists_per_set). Thankfully, the data-set was already in a tidy format, so I didn’t need to do much wrangling. I only needed to remove rows with missing values. This keeps incomplete rows from causing problems later in my regression analysis and visualizations. I also removed teams with a win/loss percentage of 0.5 or lower, so that only teams with a winning record were included. This lets me focus the comparison on how the variables relate to performance among winning teams. Finally, I created a new variable, efficiency_hitting, by subtracting the opponents’ hitting percentage from the team’s hitting percentage. This gives a more complete picture of a team’s overall performance, as it accounts for both its offensive and defensive capabilities and how they interact.
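
As a rough illustration of the two calculations described above, the sketch below computes a hitting percentage and the efficiency differential in base R. It assumes the conventional NCAA definition of hitting percentage, (kills - errors) / attempts. The season totals and the helper name `hitting_pct` are made up for illustration and do not come from the data-set.

```r
# Hypothetical helper: (kills - errors) / total attack attempts
hitting_pct <- function(kills, errors, attempts) {
  (kills - errors) / attempts
}

# Made-up season totals for one team and for its opponents
team_pct <- hitting_pct(kills = 1400, errors = 500, attempts = 3600)
opp_pct  <- hitting_pct(kills = 1200, errors = 600, attempts = 3600)

# Same definition as efficiency_hitting: team minus opponent
efficiency_hitting <- team_pct - opp_pct

round(c(team = team_pct, opponent = opp_pct, efficiency = efficiency_hitting), 3)
```

A positive differential means the team attacked more efficiently than the teams it faced, which is why the variable reflects offense and defense together.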

Loading necessary libraries

# Loading necessary libraries
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(DataExplorer)
library(highcharter)

Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo

library(RColorBrewer)

# Loading the dataset
volleyball <- readr::read_csv("volleyball_ncaa_div1_2022_23.csv")

Rows: 334 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): Team, Conference, region
dbl (11): aces_per_set, assists_per_set, team_attacks_per_set, blocks_per_se...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data cleaning/DPLYR

# Removing observations with null values 
wrangledset <- volleyball %>% filter(complete.cases(.))

# Filter the dataset so that the season record is positive.
filteredset <- 
  filter(wrangledset, win_loss_pctg > 0.500) %>%
  mutate(efficiency_hitting = hitting_pctg - opp_hitting_pctg)

# Filtered set uses around 170 observations, less than 800
head(filteredset)

# A tibble: 6 × 15
  Team       Conference region aces_per_set assists_per_set team_attacks_per_set
  <chr>      <chr>      <chr>         <dbl>           <dbl>                <dbl>
1 Delaware … MEAC       South…         2.2             11.4                 30.0
2 Yale       Ivy League East           2.15            12.6                 35.4
3 Coppin St. MEAC       South…         2.15            10.6                 32.5
4 Saint Lou… Atlantic … East           2.03            11.6                 34.1
5 UTEP       C-USA      South          1.98            11.5                 31.5
6 Samford    SoCon      South…         1.93            12.3                 37.0
# ℹ 9 more variables: blocks_per_set <dbl>, digs_per_set <dbl>,
#   hitting_pctg <dbl>, kills_per_set <dbl>, opp_hitting_pctg <dbl>, W <dbl>,
#   L <dbl>, win_loss_pctg <dbl>, efficiency_hitting <dbl>
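
To check what each cleaning step removes, a small sketch like the one below can help. It runs the same three steps on a made-up four-row tibble; the column names mirror the data-set, but the teams and values are invented.

```r
library(dplyr)

# Made-up rows; column names mirror the data-set, values do not
toy <- tibble(
  Team             = c("A", "B", "C", "D"),
  hitting_pctg     = c(0.250, 0.210, NA, 0.180),
  opp_hitting_pctg = c(0.180, 0.200, 0.190, 0.220),
  win_loss_pctg    = c(0.700, 0.500, 0.650, 0.400)
)

cleaned <- toy %>%
  filter(complete.cases(.)) %>%      # drops C (missing hitting_pctg)
  filter(win_loss_pctg > 0.500) %>%  # drops B (exactly 0.500) and D
  mutate(efficiency_hitting = hitting_pctg - opp_hitting_pctg)

# Row counts before and after show how much the cleaning removes
c(raw = nrow(toy), cleaned = nrow(cleaned))
```

Note that the strict `>` comparison also drops a team sitting at exactly 0.500, so only teams with more wins than losses remain.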

Exploration through Simple Plots

# Creates a new plot, p1, with the x-values being the region and the y-values being the win/loss percentage
p1 <- ggplot(filteredset, aes(x = region, y = win_loss_pctg))+
  
# Sets the plot to a boxplot
  geom_boxplot()+
  
# Changes the plot theme to a dark plot
  theme_dark()

# Shows the plot
p1

# Creates a second plot, p2, with the x-values being the kills per set, the y-values being the win/loss percentage,
# and the color representing the region.
p2 <- ggplot(filteredset, aes(x = kills_per_set, y = win_loss_pctg, color = region))+
  
# Plots the points, and sets the transparency to 0.3
  geom_point(alpha = 0.3)+
  
# Adds a linear regression line for each region; se = FALSE hides the confidence band
  geom_smooth(method = 'lm',formula = y~x, se = FALSE)
  
# Shows the plot
p2

# Creates the third plot, p3, with the exact same method as p2 but changes the x-values to the blocks
p3 <- ggplot(filteredset, aes(x = blocks_per_set, y = win_loss_pctg, color = region))+
  geom_point(alpha = 0.3)+
  geom_smooth(method = 'lm',formula = y~x, se = FALSE)
  
p3

Multiple Linear Regression Analysis

Correlation:

plot_correlation(filteredset)

2 features with more than 20 categories ignored!
Team: 170 categories
Conference: 32 categories
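
The correlation heatmap can be complemented by a numeric ranking of each variable's correlation with the response. The sketch below shows one way to do that in base R. It uses simulated data, so the column names echo the data-set but the values and resulting correlations are invented.

```r
set.seed(7)

# Simulated stand-in for the numeric columns of the data-set
sim <- data.frame(
  kills_per_set = rnorm(100, mean = 13, sd = 1),
  digs_per_set  = rnorm(100, mean = 15, sd = 1.5)
)
sim$win_loss_pctg <- 0.65 + 0.05 * (sim$kills_per_set - 13) + rnorm(100, sd = 0.05)

# Keep numeric columns and take the correlation column for the response
num_cols <- sim[sapply(sim, is.numeric)]
cors <- cor(num_cols, use = "complete.obs")[, "win_loss_pctg"]

# Rank predictors by their correlation with win/loss percentage
sort(cors[names(cors) != "win_loss_pctg"], decreasing = TRUE)
```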

Multiple Linear Regression

mlr0 <- lm(formula = win_loss_pctg ~ aces_per_set + assists_per_set + team_attacks_per_set + blocks_per_set + digs_per_set + kills_per_set, data = filteredset)
summary(mlr0)


Call:
lm(formula = win_loss_pctg ~ aces_per_set + assists_per_set + 
    team_attacks_per_set + blocks_per_set + digs_per_set + kills_per_set, 
    data = filteredset)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.180743 -0.040959 -0.008785  0.044802  0.186040 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)          -0.379787   0.183159  -2.074 0.039696 *  
aces_per_set          0.104684   0.027294   3.835 0.000179 ***
assists_per_set      -0.008765   0.035352  -0.248 0.804491    
team_attacks_per_set -0.046098   0.006254  -7.371 8.05e-12 ***
blocks_per_set        0.108233   0.018333   5.904 2.00e-08 ***
digs_per_set          0.054544   0.008080   6.751 2.45e-10 ***
kills_per_set         0.117719   0.035164   3.348 0.001012 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.07014 on 163 degrees of freedom
Multiple R-squared:  0.6169,    Adjusted R-squared:  0.6028 
F-statistic: 43.74 on 6 and 163 DF,  p-value: < 2.2e-16

For the multiple linear regression, I used the offensive and defensive variables as my predictors and the winning percentage as my response variable. This allowed me to see how each variable, both offensive and defensive, contributed to the performance of the team. Originally, I considered converting region to a numerical variable to see how it related to team performance. However, I was unable to find a strong correlation between region and winning percentage, so I stayed with my offensive and defensive variables. Every predictor except assists_per_set (p = 0.80) was statistically significant at the 0.05 level.
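
As a side note on region, a categorical predictor does not have to be converted to numbers by hand, because `lm()` expands a factor into indicator columns on its own. The sketch below illustrates this on simulated data. The region labels, sample size, and effect sizes are assumptions made for the example and are not taken from the data-set.

```r
set.seed(1)

# Simulated data; only the column names echo the data-set
n <- 120
sim <- data.frame(
  region        = factor(sample(c("East", "Midwest", "South", "West"), n, replace = TRUE)),
  kills_per_set = rnorm(n, mean = 13, sd = 1)
)
sim$win_loss_pctg <- 0.65 + 0.05 * (sim$kills_per_set - 13) + rnorm(n, sd = 0.05)

# lm() turns the factor into indicator (dummy) columns automatically
fit_base   <- lm(win_loss_pctg ~ kills_per_set, data = sim)
fit_region <- lm(win_loss_pctg ~ kills_per_set + region, data = sim)

# One coefficient per non-baseline region; the first level is the baseline
names(coef(fit_region))

# F-test for whether region adds anything beyond kills_per_set
anova(fit_base, fit_region)
```

Comparing the two nested models with `anova()` is one way to judge whether region is worth keeping as a predictor.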

My equation was: win_loss_pctg = B0 + B1(aces_per_set) + B2(assists_per_set) + B3(team_attacks_per_set) + B4(blocks_per_set) + B5(digs_per_set) + B6(kills_per_set). My adjusted R-squared of 0.6028 indicated that these predictors explained roughly 60% of the variance in winning percentage, allowing me to say that these statistics are strongly related to overall team performance. This is further supported by the model’s overall p-value, which was below 2.2e-16, the smallest value R reports. A p-value this small means the model as a whole is statistically significant.
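
The two summary figures discussed here can be recomputed by hand from the regression output. The sketch below uses only the rounded values printed by `summary(mlr0)`, so small rounding differences from the fitted model are possible.

```r
# Values copied from the summary(mlr0) output (rounded as printed)
r_squared <- 0.6169
f_stat    <- 43.74
df_model  <- 6     # number of predictors
df_resid  <- 163   # residual degrees of freedom
n         <- df_model + df_resid + 1  # 170 teams

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / df_resid

# Upper-tail probability of the F distribution gives the overall p-value
p_value <- pf(f_stat, df_model, df_resid, lower.tail = FALSE)

round(adj_r_squared, 4)
p_value < .Machine$double.eps  # double.eps is the "< 2.2e-16" cutoff R prints
```

The p-value is a real, very small number. R prints it as "< 2.2e-16" because that is its default display threshold.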

Visualization

Main:

highchart() |>
  hc_add_series(
    # Data
    data = filteredset,
    # Method
    type = "scatter",
    hcaes(
      x = efficiency_hitting,
      y = win_loss_pctg,
      group = region
      )
  ) |>
  hc_xAxis(title = list(text = "Hitting Efficiency")) |>
  hc_yAxis(title = list(text = "Win/Loss Percentage")) |>
  hc_title(text = "Team Hitting Efficiency Correlation to Win/Loss Percentage") |>
  hc_caption(text = "Source: SCORE Sports Data Repository")

Analysis/Conclusion

The visualization showed a clear positive trend between hitting efficiency and winning percentage. This relationship did not appear to depend strongly on region, although teams in the West and Midwest performed strongly here. There is still noticeable spread around the trend, meaning that other variables also played a role in a team’s winning percentage. I would have liked to incorporate other factors, like conference or offensive errors, to get a better understanding of it.

In conclusion, I was able to see that certain variables, like kills per set, had a larger estimated effect on winning percentage than others. This indicates that specific team actions had a greater effect on a team’s success, as the multiple linear regression showed. The adjusted R-squared of roughly 0.60 supported this, and the very small overall p-value showed that the model was statistically significant. The regression suggested that offensive variables such as kills and aces per set were strong indicators of team success, although the defensive variables (blocks and digs per set) were also statistically significant.
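
Because the predictors are measured on different scales, raw coefficients are hard to compare directly. One common approach is to standardize the predictors first. The sketch below shows this on simulated data; the column names echo the data-set, but the values and effect sizes are invented, so the ranking it prints says nothing about the real teams.

```r
set.seed(42)

# Simulated data; column names echo the data-set, values are invented
n <- 150
sim <- data.frame(
  kills_per_set  = rnorm(n, mean = 13, sd = 1.0),
  blocks_per_set = rnorm(n, mean = 2.2, sd = 0.4),
  digs_per_set   = rnorm(n, mean = 15, sd = 1.5)
)
sim$win_loss_pctg <- 0.65 +
  0.05 * (sim$kills_per_set - 13) +
  0.08 * (sim$blocks_per_set - 2.2) +
  0.02 * (sim$digs_per_set - 15) +
  rnorm(n, sd = 0.05)

# Convert each predictor to z-scores so slopes are per standard deviation
predictors <- c("kills_per_set", "blocks_per_set", "digs_per_set")
sim_std <- sim
sim_std[predictors] <- lapply(sim[predictors], function(x) as.numeric(scale(x)))

fit_std <- lm(win_loss_pctg ~ kills_per_set + blocks_per_set + digs_per_set,
              data = sim_std)

# Rank predictors by absolute standardized slope
sort(abs(coef(fit_std)[predictors]), decreasing = TRUE)
```

Applied to filteredset, the same steps would put the offensive and defensive coefficients on a common footing before comparing them.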

Bibliography

Fay, J., Dykstra, A. J., & Ramler, I. (2023, July 24). Team Statistics for Division I Women’s Volleyball – SCORE Sports Data Repository. Scorenetwork.org. https://data.scorenetwork.org/volleyball/volleyball_ncaa_team_stats.html

complete.cases function | R Documentation. (n.d.). Www.rdocumentation.org. https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/complete.cases
