Introduction:

For some context, I am a huge Women’s Basketball fan. I grew up in Connecticut, my mom played Division III basketball, and as a result, I went to many UConn Women’s Basketball games as a child. I now live in San Francisco, and when I am back on the East Coast during the NCAA Women’s Basketball Season, I make a point to attend a game. Couple my passion for Women’s Basketball with the current time of the year (March Madness), and I thought it would be a fun idea to make the subject of this blog Women’s Basketball.

For this blog, I wanted to use a model to determine whether I could predict the success of a team based on shooting percentages and per-game averages for standard basketball metrics (Assists, Steals, Points Scored (for and against), Turnovers, etc.). “Success,” in the case of this model, will mean wins vs. losses, measured as a team’s season win percentage. I do understand there are such things as “Good Losses” and “Bad Wins” in the world of basketball, but my model will not consider such complexity.

Packages:

I will use the following packages for my modeling process.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(rvest)
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.1
## Warning: package 'lubridate' was built under R version 4.3.1
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.4
## ✔ ggplot2   3.4.4     ✔ stringr   1.5.0
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()         masks stats::filter()
## ✖ readr::guess_encoding() masks rvest::guess_encoding()
## ✖ dplyr::lag()            masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(glue)
## Warning: package 'glue' was built under R version 4.3.1
library(ggplot2)
library(ggfortify)

NCAA WBB Teams Data (2005-2024):

I realized that the data I wanted did not exist in an easily “downloadable” format. However, it does exist on Sports Reference’s college basketball site (sports-reference.com/cbb), so all I had to do was figure out a way to scrape the data from that webpage.

# Create an empty dataframe with column names
seasons <- data.frame(matrix(ncol = 33, nrow = 0))
colnames(seasons) <- c('School', 'Games_Played', 'Wins', 'Losses', 'W_L%', 'SRS', 'SOS', 
                       'Conf_Wins', 'Conf_Losses', 'Home_Wins', 'Home_Losses', 'Away_Wins', 'Away_Losses', 
                       'Points_Scored', 'Points_Against', 'Minutes_Played', 'FGM', 'FGA', 'FG%', '3PM', 
                       '3PA', '3P%', 'FTM', 'FTA', 'FT%', 'ORB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 
                       'Season')

# Loop through years and scrape data
for(i in 2005:2024) {
  teams_url <- glue('https://www.sports-reference.com/cbb/seasons/women/{i}-school-stats.html')
  
  teams <- teams_url %>%
    read_html() %>%
    html_nodes(xpath = '//*[@id="basic_school_stats"]') %>%
    html_table(header = TRUE)
  
  teams <- teams[[1]]
  
  teams <- teams[,-1]
  
  colnames(teams) <- c('School', 'Games_Played', 'Wins', 'Losses', 'W_L%', 'SRS', 'SOS', 'Blank_1', 'Conf_Wins', 'Conf_Losses', 'Blank_2', 'Home_Wins', 'Home_Losses', 'Blank_3', 'Away_Wins', 'Away_Losses', 'Blank_4', 'Points_Scored', 'Points_Against', 'Blank_5', 'Minutes_Played', 'FGM', 'FGA', 'FG%', '3PM', '3PA', '3P%', 'FTM', 'FTA', 'FT%', 'ORB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF')
  
  teams <- teams[-1,]
  
  teams <- teams %>%
    select(-c(Blank_1, Blank_2, Blank_3, Blank_4, Blank_5)) %>%
    mutate(Season = glue('{i-1}-{i}'))
  
  seasons <- rbind(seasons, teams)
  
  # Add a delay to avoid making too many requests too quickly
  Sys.sleep(4) # Delay for 4 seconds
}

I started by creating an empty data frame with the column names of the table I would scrape from. When using Sports Reference, I realized that the same sort of data for each season of NCAA Women’s Basketball could be found just by changing the year number in the URL. This led me to discover that if I ran a for loop to replace the year number in the URL, I could scrape a dataset that included multiple seasons. Working with a larger dataset is more desirable when building a model, so I went ahead and built my for loop.
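As a small illustration of the year-substitution idea, the sketch below builds the URL and the season label for every year up front, using base R's sprintf() instead of glue(). The variable names (years, urls, season_labels) are my own and are not part of the scraping loop above.

```r
# Build one URL and one season label per year (base R only).
years <- 2005:2024

urls <- sprintf(
  "https://www.sports-reference.com/cbb/seasons/women/%d-school-stats.html",
  years
)

# A season is labeled by the two calendar years it spans, e.g. "2004-2005".
season_labels <- sprintf("%d-%d", years - 1L, years)

# Peek at the first few rows to confirm the pattern.
head(data.frame(year = years, season = season_labels, url = urls), 3)
```

Building the vectors first makes it easy to eyeball the URLs before sending any requests.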

My web scraping process was not without challenge, though. I did not realize that Sports Reference has anti-bot rules in place, namely a twenty-requests-per-minute rule. I submitted over twenty requests in one minute and was put in “bot-jail.” This placement in “bot-jail” was only temporary (1 hour), and it taught me to check a website’s scraping rules before I start. To protect myself from future issues related to requests per minute, I added a command to implement a 4-second delay after each iteration of the loop.
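To make the arithmetic behind the delay explicit, here is a minimal sketch. Both min_delay_seconds() and throttled_lapply() are hypothetical helpers written for this post, and the fetch argument is a stand-in for the read_html() pipeline, so nothing here touches the network.

```r
# Smallest delay (in seconds) that stays within a requests-per-window limit.
min_delay_seconds <- function(max_requests, window_seconds = 60) {
  window_seconds / max_requests
}

# Call `fetch` on each input, pausing between calls (but not after the last).
# `fetch` is a hypothetical stand-in for the real scraping step.
throttled_lapply <- function(inputs, fetch, delay_seconds) {
  results <- vector("list", length(inputs))
  for (k in seq_along(inputs)) {
    results[[k]] <- fetch(inputs[[k]])
    if (k < length(inputs)) Sys.sleep(delay_seconds)
  }
  results
}

# Twenty requests per minute works out to a floor of 3 seconds per request.
min_delay_seconds(20)

# Tiny demo with a fake fetch function and a very short delay.
demo <- throttled_lapply(1:3, function(x) x^2, delay_seconds = 0.01)
```

The 4-second delay I used sits a little above the 3-second floor, which leaves some headroom.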

Simple Data Cleaning:

teams_2005_2024 = seasons

teams_2005_2024 = teams_2005_2024 %>%
  mutate_at(vars(Games_Played, Wins, Losses, `W_L%`, SRS, SOS, Conf_Wins, Conf_Losses, Home_Wins, Home_Losses, Away_Wins, Away_Losses, Points_Scored, Points_Against, Minutes_Played, FGM, FGA, `FG%`, `3PM`, `3PA`, `3P%`, FTA, FTM, `FT%`, ORB, TRB, AST, STL, BLK, TOV, PF), as.double)
## Warning: There were 31 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `Games_Played = .Primitive("as.double")(Games_Played)`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 30 remaining warnings.
teams_2005_2024$Season = as.character(teams_2005_2024$Season)

As an initial step, I copied the seasons dataset into a new variable, teams_2005_2024. I did this so that if I wanted to restart the data cleaning process, I would not have to rerun my for loop. I wanted to implement extra protections after my short stay in “bot-jail.” I then converted each column to the relevant data type. The “NAs introduced by coercion” warnings mean that some cells could not be parsed as numbers and became NA.
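The sketch below shows, on made-up data, how a non-numeric cell turns into NA during conversion and how such rows could be dropped before converting. I am assuming the stray cells are something like a header row repeated inside the scraped table. I have not confirmed that this is the cause in the real data, and the raw data frame here is invented.

```r
# Toy data: the middle row mimics a header row repeated inside the table.
raw <- data.frame(
  School = c("Team A", "School", "Team B"),
  Wins   = c("21", "W", "12"),
  stringsAsFactors = FALSE
)

# Flag cells that parse as numbers; suppressWarnings() hides the coercion
# warning because the NA is exactly what we are testing for here.
is_number <- !is.na(suppressWarnings(as.double(raw$Wins)))

# Keep only the parseable rows, then convert without any warning.
clean <- raw[is_number, ]
clean$Wins <- as.double(clean$Wins)

clean
```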

Shooting Percentages and Per Game Fields:

teams_2005_2024 = teams_2005_2024 %>%
  mutate_at(vars(Points_Scored, Points_Against, Minutes_Played, FGM, FGA,`3PM`, `3PA`, FTM, FTA, ORB, TRB, AST, STL, BLK, TOV, PF), list(per_game = ~ ./Games_Played))

game_stats = teams_2005_2024 %>%
  select(matches("per_game$|%|^Wins|^Losses")) %>%
  mutate(AST_vs_TOV = AST_per_game / TOV_per_game)

game_stats <- na.omit(game_stats)

game_stats
## # A tibble: 6,865 × 23
##     Wins Losses `W_L%` `FG%` `3P%` `FT%` Points_Scored_per_game
##    <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>                  <dbl>
##  1     7     21  0.25  0.368 0.287 0.644                   58.3
##  2     3     25  0.107 0.362 0.299 0.684                   55  
##  3    13     15  0.464 0.385 0.332 0.664                   69.3
##  4    17     11  0.607 0.375 0.279 0.703                   57.4
##  5    15     14  0.517 0.367 0.269 0.638                   55.9
##  6    14     14  0.5   0.4   0.356 0.652                   61.1
##  7    21      7  0.75  0.391 0.288 0.627                   65.5
##  8    12     16  0.429 0.38  0.284 0.624                   56.7
##  9    11     17  0.393 0.408 0.305 0.649                   63.3
## 10    20     12  0.625 0.43  0.336 0.651                   70.6
## # ℹ 6,855 more rows
## # ℹ 16 more variables: Points_Against_per_game <dbl>,
## #   Minutes_Played_per_game <dbl>, FGM_per_game <dbl>, FGA_per_game <dbl>,
## #   `3PM_per_game` <dbl>, `3PA_per_game` <dbl>, FTM_per_game <dbl>,
## #   FTA_per_game <dbl>, ORB_per_game <dbl>, TRB_per_game <dbl>,
## #   AST_per_game <dbl>, STL_per_game <dbl>, BLK_per_game <dbl>,
## #   TOV_per_game <dbl>, PF_per_game <dbl>, AST_vs_TOV <dbl>

The data I scraped reflected season totals, and I wanted per-game averages for my analysis, so I created a new per-game column for each metric by dividing it by Games_Played. I then filtered the dataset down to per-game averages, shooting percentages, and win and loss counts, and added an assist-to-turnover ratio (AST_vs_TOV). Finally, I removed the rows with NA values, which were not a significant portion of the original dataset, leaving me with 6,865 observations.
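Here is the same per-game calculation on a tiny invented table, written with across(), which dplyr now recommends in place of mutate_at(). The toy numbers are made up so the results are easy to check by hand.

```r
library(dplyr)

# Two made-up teams with season totals.
toy <- tibble(
  Games_Played = c(30, 25),
  AST          = c(450, 300),
  TOV          = c(300, 375)
)

toy_per_game <- toy %>%
  # Divide each total by games played; new columns get a "_per_game" suffix.
  mutate(across(c(AST, TOV), ~ .x / Games_Played, .names = "{.col}_per_game")) %>%
  # Assist-to-turnover ratio, as in the post.
  mutate(AST_vs_TOV = AST_per_game / TOV_per_game)

toy_per_game
```

Unlike the list(per_game = ...) form, the .names argument spells out the naming pattern for the new columns.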

Checking Variable Distributions:

hist_list <- list()

for (col in names(game_stats)) {
  # Exclude non-numeric variables if needed
  if (is.numeric(game_stats[[col]])) {
    # Create a histogram for the current predictor variable
    hist_list[[col]] <- hist(game_stats[[col]], main = paste("Histogram of", col), xlab = col)
  }
}

# Print the histograms
print(hist_list)
## $Wins
## $breaks
## [1]  0  5 10 15 20 25 30 35 40
## 
## $counts
## [1]  474 1248 1829 1602 1149  430  113   20
## 
## $density
## [1] 0.0138091770 0.0363583394 0.0532847779 0.0466715222 0.0334741442
## [6] 0.0125273125 0.0032920612 0.0005826657
## 
## $mids
## [1]  2.5  7.5 12.5 17.5 22.5 27.5 32.5 37.5
## 
## $xname
## [1] "game_stats[[col]]"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"
## 
## $Losses
## $breaks
##  [1]  0  2  4  6  8 10 12 14 16 18 20 22 24 26 28 30
## 
## $counts
##  [1]  48 139 233 447 642 838 878 990 797 675 523 329 223  90  13
## 
## $density
##  [1] 0.0034959942 0.0101238165 0.0169701384 0.0325564457 0.0467589221
##  [6] 0.0610342316 0.0639475601 0.0721048798 0.0580480699 0.0491624181
## [11] 0.0380917698 0.0239621267 0.0162418063 0.0065549891 0.0009468318
## 
## $mids
##  [1]  1  3  5  7  9 11 13 15 17 19 21 23 25 27 29
## 
## $xname
## [1] "game_stats[[col]]"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"
## 
## $`W_L%`
## $breaks
##  [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
## 
## $counts
##  [1]  131  389  736  976 1223 1189 1083  754  301   83
## 
## $density
##  [1] 0.1908230 0.5666424 1.0721049 1.4217043 1.7815004 1.7319738 1.5775674
##  [8] 1.0983248 0.4384559 0.1209031
## 
## $mids
##  [1] 0.05 0.15 0.25 0.35 0.45 0.55 0.65 0.75 0.85 0.95
## 
## $xname
## [1] "game_stats[[col]]"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"
## 
## $`FG%`
## $breaks
##  [1] 0.28 0.30 0.32 0.34 0.36 0.38 0.40 0.42 0.44 0.46 0.48 0.50 0.52 0.54
## 
## $counts
##  [1]    3   19  162  529 1201 1742 1573  976  428  150   56   22    4
## 
## $density
##  [1]  0.02184996  0.13838310  1.17989803  3.85287691  8.74726875 12.68754552
##  [7] 11.45666424  7.10852149  3.11726147  1.09249818  0.40786599  0.16023307
## [13]  0.02913328
## 
## $mids
##  [1] 0.29 0.31 0.33 0.35 0.37 0.39 0.41 0.43 0.45 0.47 0.49 0.51 0.53
## 
## $xname
## [1] "game_stats[[col]]"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"
## 
## $`3P%`
## $breaks
##  [1] 0.14 0.16 0.18 0.20 0.22 0.24 0.26 0.28 0.30 0.32 0.34 0.36 0.38 0.40 0.42
## [16] 0.44 0.46
## 
## $counts
##  [1]    1    0    4   20   78  281  757 1273 1727 1402  801  362  131   24    3
## [16]    1
## 
## $density
##  [1]  0.007283321  0.000000000  0.029133285  0.145666424  0.568099053
##  [6]  2.046613256  5.513474144  9.271667881 12.578295703 10.211216315
## [11]  5.833940277  2.636562272  0.954115076  0.174799709  0.021849964
## [16]  0.007283321
## 
## $mids
##  [1] 0.15 0.17 0.19 0.21 0.23 0.25 0.27 0.29 0.31 0.33 0.35 0.37 0.39 0.41 0.43
## [16] 0.45
## 
## $xname
## [1] "game_stats[[col]]"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"
## 
## $`FT%`
## $breaks
##  [1] 0.50 0.52 0.54 0.56 0.58 0.60 0.62 0.64 0.66 0.68 0.70 0.72 0.74 0.76 0.78
## [16] 0.80 0.82 0.84 0.86
## 
## $counts
##  [1]    1    4   11   27   87  223  460  716 1040 1212 1172  900  581  286  113
## [16]   26    5    1
## 
## $density
##  [1] 0.007283321 0.029133285 0.080116533 0.196649672 0.633648944 1.624180626
##  [7] 3.350327749 5.214857975 7.574654042 8.827385288 8.536052440 6.554989075
## [13] 4.231609614 2.083029862 0.823015295 0.189366351 0.036416606 0.007283321
## 
## $mids
##  [1] 0.51 0.53 0.55 0.57 0.59 0.61 0.63 0.65 0.67 0.69 0.71 0.73 0.75 0.77 0.79
## [16] 0.81 0.83 0.85
## 
## $xname
## [1] "game_stats[[col]]"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"
## 
## $Points_Scored_per_game
## $breaks
##  [1]  20  40  60  80 100 120 140 160 180 200 220 240 260 280
## 
## $counts
##  [1]    5 1793 4925  134    4    0    2    0    0    0    0    1    1
## 
## $density
##  [1] 3.641661e-05 1.305899e-02 3.587036e-02 9.759650e-04 2.913328e-05
##  [6] 0.000000e+00 1.456664e-05 0.000000e+00 0.000000e+00 0.000000e+00
## [11] 0.000000e+00 7.283321e-06 7.283321e-06
## 
## $mids
##  [1]  30  50  70  90 110 130 150 170 190 210 230 250 270
## 
## $xname
## [1] "game_stats[[col]]"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"
## 
## $Points_Against_per_game
## $breaks
##  [1]   0  20  40  60  80 100 120 140 160 180 200 220
## 
## $counts
##  [1]    2    4 2125 4699   29    2    2    0    0    0    2
## 
## $density
##  [1] 1.456664e-05 2.913328e-05 1.547706e-02 3.422433e-02 2.112163e-04
##  [6] 1.456664e-05 1.456664e-05 0.000000e+00 0.000000e+00 0.000000e+00
## [11] 1.456664e-05
## 
## $mids
##  [1]  10  30  50  70  90 110 130 150 170 190 210
## 
## $xname
## [1] "game_stats[[col]]"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"
## 
## $Minutes_Played_per_game
## $breaks
##  [1]  10  20  30  40  50  60  70  80  90 100 110 120 130 140 150 160
## 
## $counts
##  [1]    2    2 1817 5031    5    3    1    0    2    0    0    0    1    0    1
## 
## $density
##  [1] 2.913328e-05 2.913328e-05 2.646759e-02 7.328478e-02 7.283321e-05
##  [6] 4.369993e-05 1.456664e-05 0.000000e+00 2.913328e-05 0.000000e+00
## [11] 0.000000e+00 0.000000e+00 1.456664e-05 0.000000e+00 1.456664e-05
## 
## $mids
##  [1]  15  25  35  45  55  65  75  85  95 105 115 125 135 145 155
## 
## $xname
## [1] "game_stats[[col]]"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"
## 
## $FGM_per_game
## $breaks
##  [1]  10  15  20  25  30  35  40  45  50  55  60  65  70  75  80  85  90  95 100
## 
## $counts
##  [1]    5  603 4587 1571   91    2    2    0    2    0    0    0    0    0    0
## [16]    1    0    1
## 
## $density
##  [1] 1.456664e-04 1.756737e-02 1.336344e-01 4.576839e-02 2.651129e-03
##  [6] 5.826657e-05 5.826657e-05 0.000000e+00 5.826657e-05 0.000000e+00
## [11] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## [16] 2.913328e-05 0.000000e+00 2.913328e-05
## 
## $mids
##  [1] 12.5 17.5 22.5 27.5 32.5 37.5 42.5 47.5 52.5 57.5 62.5 67.5 72.5 77.5 82.5
## [16] 87.5 92.5 97.5
## 
## $xname
## [1] "game_stats[[col]]"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"
## 
## $FGA_per_game
## $breaks
##  [1]  20  40  60  80 100 120 140 160 180 200 220 240
## 
## $counts
##  [1]    4 4599 2251    5    2    2    0    0    0    1    1
## 
## $density
##  [1] 2.913328e-05 3.349599e-02 1.639476e-02 3.641661e-05 1.456664e-05
##  [6] 1.456664e-05 0.000000e+00 0.000000e+00 0.000000e+00 7.283321e-06
## [11] 7.283321e-06
## 
## $mids
##  [1]  30  50  70  90 110 130 150 170 190 210 230
## 
## $xname
## [1] "game_stats[[col]]"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"
## 
## $`3PM_per_game`
## $breaks
##  [1]  0  2  4  6  8 10 12 14 16 18 20 22 24 26 28 30
## 
## $counts
##  [1]   28 1159 3441 1807  375   48    5    0    0    0    0    1    0    0    1
## 
## $density
##  [1] 2.039330e-03 8.441369e-02 2.506191e-01 1.316096e-01 2.731245e-02
##  [6] 3.495994e-03 3.641661e-04 0.000000e+00 0.000000e+00 0.000000e+00
## [11] 0.000000e+00 7.283321e-05 0.000000e+00 0.000000e+00 7.283321e-05
## 
## $mids
##  [1]  1  3  5  7  9 11 13 15 17 19 21 23 25 27 29
## 
## $xname
## [1] "game_stats[[col]]"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"
## 
## $`3PA_per_game`
## $breaks
##  [1]  0  5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85
## 
## $counts
##  [1]    7  174 1938 3119 1334  259   26    5    1    0    0    0    0    0    1
## [16]    0    1
## 
## $density
##  [1] 2.039330e-04 5.069192e-03 5.646031e-02 9.086672e-02 3.886380e-02
##  [6] 7.545521e-03 7.574654e-04 1.456664e-04 2.913328e-05 0.000000e+00
## [11] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 2.913328e-05
## [16] 0.000000e+00 2.913328e-05
## 
## $mids
##  [1]  2.5  7.5 12.5 17.5 22.5 27.5 32.5 37.5 42.5 47.5 52.5 57.5 62.5 67.5 72.5
## [16] 77.5 82.5
## 
## $xname
## [1] "game_stats[[col]]"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"
## 
## $FTM_per_game
## $breaks
##  [1]  0  5 10 15 20 25 30 35 40 45 50 55
## 
## $counts
##  [1]    1  930 5299  627    3    1    1    1    0    1    1
## 
## $density
##  [1] 2.913328e-05 2.709395e-02 1.543773e-01 1.826657e-02 8.739985e-05
##  [6] 2.913328e-05 2.913328e-05 2.913328e-05 0.000000e+00 2.913328e-05
## [11] 2.913328e-05
## 
## $mids
##  [1]  2.5  7.5 12.5 17.5 22.5 27.5 32.5 37.5 42.5 47.5 52.5
## 
## $xname
## [1] "game_stats[[col]]"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"
## 
## $FTA_per_game
## $breaks
##  [1]  5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80
## 
## $counts
##  [1]   16 1151 4285 1342   65    1    1    0    1    1    0    0    0    1    1
## 
## $density
##  [1] 4.661326e-04 3.353241e-02 1.248361e-01 3.909687e-02 1.893664e-03
##  [6] 2.913328e-05 2.913328e-05 0.000000e+00 2.913328e-05 2.913328e-05
## [11] 0.000000e+00 0.000000e+00 0.000000e+00 2.913328e-05 2.913328e-05
## 
## $mids
##  [1]  7.5 12.5 17.5 22.5 27.5 32.5 37.5 42.5 47.5 52.5 57.5 62.5 67.5 72.5 77.5
## 
## $xname
## [1] "game_stats[[col]]"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"
## 
## $ORB_per_game
## $breaks
##  [1]  0  5 10 15 20 25 30 35 40 45 50
## 
## $counts
##  [1]    7 2027 4463  359    3    2    2    0    0    2
## 
## $density
##  [1] 2.039330e-04 5.905317e-02 1.300218e-01 1.045885e-02 8.739985e-05
##  [6] 5.826657e-05 5.826657e-05 0.000000e+00 0.000000e+00 5.826657e-05
## 
## $mids
##  [1]  2.5  7.5 12.5 17.5 22.5 27.5 32.5 37.5 42.5 47.5
## 
## $xname
## [1] "game_stats[[col]]"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"
## 
## $TRB_per_game
## $breaks
##  [1]  10  20  30  40  50  60  70  80  90 100 110 120 130 140 150
## 
## $counts
##  [1]    2  494 5827  531    3    4    0    2    0    0    0    0    1    1
## 
## $density
##  [1] 2.913328e-05 7.195921e-03 8.487983e-02 7.734887e-03 4.369993e-05
##  [6] 5.826657e-05 0.000000e+00 2.913328e-05 0.000000e+00 0.000000e+00
## [11] 0.000000e+00 0.000000e+00 1.456664e-05 1.456664e-05
## 
## $mids
##  [1]  15  25  35  45  55  65  75  85  95 105 115 125 135 145
## 
## $xname
## [1] "game_stats[[col]]"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"
## 
## $AST_per_game
## $breaks
##  [1]  5 10 15 20 25 30 35 40 45 50 55 60 65
## 
## $counts
##  [1]  435 5284 1099   43    2    0    0    0    1    0    0    1
## 
## $density
##  [1] 1.267298e-02 1.539403e-01 3.201748e-02 1.252731e-03 5.826657e-05
##  [6] 0.000000e+00 0.000000e+00 0.000000e+00 2.913328e-05 0.000000e+00
## [11] 0.000000e+00 2.913328e-05
## 
## $mids
##  [1]  7.5 12.5 17.5 22.5 27.5 32.5 37.5 42.5 47.5 52.5 57.5 62.5
## 
## $xname
## [1] "game_stats[[col]]"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"
## 
## $STL_per_game
## $breaks
##  [1]  0  5 10 15 20 25 30 35 40 45
## 
## $counts
## [1]  140 5801  912   10    0    0    1    0    1
## 
## $density
## [1] 4.078660e-03 1.690022e-01 2.656956e-02 2.913328e-04 0.000000e+00
## [6] 0.000000e+00 2.913328e-05 0.000000e+00 2.913328e-05
## 
## $mids
## [1]  2.5  7.5 12.5 17.5 22.5 27.5 32.5 37.5 42.5
## 
## $xname
## [1] "game_stats[[col]]"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"
## 
## $BLK_per_game
## $breaks
##  [1]  0  1  2  3  4  5  6  7  8  9 10 11 12 13
## 
## $counts
##  [1]   29  759 2336 2237 1045  335   95   23    4    1    0    0    1
## 
## $density
##  [1] 0.0042243263 0.1105608157 0.3402767662 0.3258557902 0.1522214130
##  [6] 0.0487982520 0.0138383103 0.0033503277 0.0005826657 0.0001456664
## [11] 0.0000000000 0.0000000000 0.0001456664
## 
## $mids
##  [1]  0.5  1.5  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5 10.5 11.5 12.5
## 
## $xname
## [1] "game_stats[[col]]"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"
## 
## $TOV_per_game
## $breaks
##  [1]  5 10 15 20 25 30 35 40 45 50 55 60 65 70
## 
## $counts
##  [1]   16 2054 4265  508   15    2    3    0    0    0    1    0    1
## 
## $density
##  [1] 4.661326e-04 5.983977e-02 1.242535e-01 1.479971e-02 4.369993e-04
##  [6] 5.826657e-05 8.739985e-05 0.000000e+00 0.000000e+00 0.000000e+00
## [11] 2.913328e-05 0.000000e+00 2.913328e-05
## 
## $mids
##  [1]  7.5 12.5 17.5 22.5 27.5 32.5 37.5 42.5 47.5 52.5 57.5 62.5 67.5
## 
## $xname
## [1] "game_stats[[col]]"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"
## 
## $PF_per_game
## $breaks
##  [1]  5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80
## 
## $counts
##  [1]    3  959 5354  536    6    3    1    1    0    0    0    0    1    0    1
## 
## $density
##  [1] 8.739985e-05 2.793882e-02 1.559796e-01 1.561544e-02 1.747997e-04
##  [6] 8.739985e-05 2.913328e-05 2.913328e-05 0.000000e+00 0.000000e+00
## [11] 0.000000e+00 0.000000e+00 2.913328e-05 0.000000e+00 2.913328e-05
## 
## $mids
##  [1]  7.5 12.5 17.5 22.5 27.5 32.5 37.5 42.5 47.5 52.5 57.5 62.5 67.5 72.5 77.5
## 
## $xname
## [1] "game_stats[[col]]"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"
## 
## $AST_vs_TOV
## $breaks
##  [1] 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
## [20] 2.1
## 
## $counts
##  [1]    3   22  218  639 1191 1411 1308  829  539  345  161   92   46   34   13
## [16]    4    6    3    1
## 
## $density
##  [1] 0.004369993 0.032046613 0.317552804 0.930808449 1.734887109 2.055353241
##  [7] 1.905316824 1.207574654 0.785142025 0.502549162 0.234522942 0.134013110
## [13] 0.067006555 0.049526584 0.018936635 0.005826657 0.008739985 0.004369993
## [19] 0.001456664
## 
## $mids
##  [1] 0.25 0.35 0.45 0.55 0.65 0.75 0.85 0.95 1.05 1.15 1.25 1.35 1.45 1.55 1.65
## [16] 1.75 1.85 1.95 2.05
## 
## $xname
## [1] "game_stats[[col]]"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"

Based on these distributions, it may be helpful to transform some of these variables or at least reinspect the distributions with smaller bin widths (namely, I believe that BLK_per_game could be transformed and variables such as STL_per_game may require a different bin width).

Reinspecting Predictor Variables:

stl = game_stats$STL_per_game

hist(stl, breaks = seq(min(stl), max(stl) + 1, by = 1), 
     main = "Histogram with Custom Bin Width", xlab = "Values", ylab = "Frequency")

`3pa` = game_stats$`3PA_per_game`

hist(`3pa`, breaks = seq(min(`3pa`), max(`3pa`) + 1, by = 2), 
     main = "Histogram with Custom Bin Width", xlab = "Values", ylab = "Frequency")

STL_per_game has a slight skew, but is generally normally distributed. I do not believe it would be worthwhile to transform STL_per_game as it would likely not change much. I would say the same for 3PA_per_game. Generally, these variables seem to be normally distributed.
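One way to put a number on “slight skew” is the sample skewness, which is 0 for a symmetric sample and positive when the right tail is longer. The helper below is a minimal base R sketch. sample_skewness() is a name I made up, and it uses the simple moment-based formula rather than any bias-corrected variant.

```r
# Moment-based sample skewness: third central moment over variance^(3/2).
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# A symmetric sample has zero skewness.
sample_skewness(c(1, 2, 3, 4, 5))

# A sample with one large value on the right has positive skewness.
sample_skewness(c(1, 1, 1, 2, 10))
```

Applied to a column such as game_stats$STL_per_game, a value near 0 would support leaving the variable untransformed.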

Initial Model (model 1):

# Fit a linear regression model (lm) with win percentage as the response
model1 <- lm(`W_L%` ~ `FG%` + `3P%` + `FT%` + Points_Scored_per_game + Points_Against_per_game + Minutes_Played_per_game + FGM_per_game + FGA_per_game + `3PM_per_game` + `3PA_per_game` + FTM_per_game + FTA_per_game + ORB_per_game + TRB_per_game + AST_per_game + STL_per_game + BLK_per_game + TOV_per_game + PF_per_game + AST_per_game * TOV_per_game,
             data = game_stats)

# Summary of the model
summary(model1)
## 
## Call:
## lm(formula = `W_L%` ~ `FG%` + `3P%` + `FT%` + Points_Scored_per_game + 
##     Points_Against_per_game + Minutes_Played_per_game + FGM_per_game + 
##     FGA_per_game + `3PM_per_game` + `3PA_per_game` + FTM_per_game + 
##     FTA_per_game + ORB_per_game + TRB_per_game + AST_per_game + 
##     STL_per_game + BLK_per_game + TOV_per_game + PF_per_game + 
##     AST_per_game * TOV_per_game, data = game_stats)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.82915 -0.04819  0.00284  0.05030  0.41890 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -5.216e-01  1.721e-01  -3.030 0.002452 ** 
## `FG%`                      3.250e+00  3.909e-01   8.314  < 2e-16 ***
## `3P%`                     -1.254e-01  1.152e-01  -1.089 0.276408    
## `FT%`                      3.504e-01  1.279e-01   2.739 0.006177 ** 
## Points_Scored_per_game    -2.017e-01  5.775e-01  -0.349 0.726841    
## Points_Against_per_game   -1.357e-02  2.307e-04 -58.824  < 2e-16 ***
## Minutes_Played_per_game   -7.063e-03  8.998e-04  -7.850 4.80e-15 ***
## FGM_per_game               3.885e-01  1.155e+00   0.336 0.736600    
## FGA_per_game               9.250e-03  2.721e-03   3.400 0.000678 ***
## `3PM_per_game`             2.306e-01  5.775e-01   0.399 0.689617    
## `3PA_per_game`            -3.396e-03  2.163e-03  -1.570 0.116479    
## FTM_per_game               1.984e-01  5.774e-01   0.344 0.731190    
## FTA_per_game               1.647e-02  4.971e-03   3.314 0.000924 ***
## ORB_per_game               4.153e-03  9.894e-04   4.198 2.73e-05 ***
## TRB_per_game               7.855e-03  6.265e-04  12.537  < 2e-16 ***
## AST_per_game               2.567e-03  1.142e-03   2.249 0.024560 *  
## STL_per_game               1.437e-02  8.281e-04  17.354  < 2e-16 ***
## BLK_per_game               5.518e-03  9.550e-04   5.778 7.88e-09 ***
## TOV_per_game              -1.929e-02  9.239e-04 -20.884  < 2e-16 ***
## PF_per_game               -3.861e-03  6.047e-04  -6.385 1.83e-10 ***
## AST_per_game:TOV_per_game  5.220e-05  5.124e-05   1.019 0.308351    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07626 on 6844 degrees of freedom
## Multiple R-squared:  0.8507, Adjusted R-squared:  0.8503 
## F-statistic:  1951 on 20 and 6844 DF,  p-value: < 2.2e-16
autoplot(model1, label.size = 3)

For this model, I simply chose to have the per-game averages and shooting percentages as the predictor variables. The model yielded a very small p-value (implying overall statistical significance) and a multiple \(R^2\) value of 0.8507, which means the model can account for about 85.07% of the variance in win percentage (a rather high \(R^2\) value).
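To show where the \(R^2\) figures in a summary() printout come from, the sketch below recomputes them by hand. It uses R's built-in mtcars data and an unrelated toy model so that it runs without the scraped dataset.

```r
# Toy model on a built-in dataset (not the basketball data).
fit <- lm(mpg ~ wt + hp, data = mtcars)

# R^2 = 1 - (residual sum of squares) / (total sum of squares).
rss <- sum(residuals(fit)^2)
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
r2_manual <- 1 - rss / tss

# Adjusted R^2 penalizes the number of predictors p.
n <- nrow(mtcars)
p <- length(coef(fit)) - 1
adj_r2_manual <- 1 - (1 - r2_manual) * (n - 1) / (n - p - 1)

c(manual = r2_manual, from_summary = summary(fit)$r.squared)
```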

The model does seem to meet the assumption of normality based on the plot of residuals, although it may help to remove some of the outliers labeled in the plots.
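One thing worth a look in the coefficient table is that Points_Scored_per_game, FGM_per_game, 3PM_per_game, and FTM_per_game all have very large standard errors. A plausible reason, and this is my own reading of the output rather than something I tested on the real data, is that points are almost an exact linear combination of the other three: points = 2 × FGM + 3PM + FTM, since made threes are also counted in FGM. The simulated sketch below shows what lm() does when that relationship is exact. All numbers are invented.

```r
set.seed(1)
n <- 50

# Simulated per-game shooting numbers (invented, not real teams).
fgm <- rnorm(n, mean = 24, sd = 3)
tpm <- rnorm(n, mean = 6,  sd = 1.5)
ftm <- rnorm(n, mean = 13, sd = 2)

# Points are fully determined by the three makes columns.
pts <- 2 * fgm + tpm + ftm

# A made-up response that depends on points plus noise.
y <- 0.02 * pts + rnorm(n, sd = 0.05)

fit <- lm(y ~ pts + fgm + tpm + ftm)

# With exact collinearity, lm() cannot estimate every coefficient
# and reports NA for the redundant one.
coef(fit)
```

In real data, small recording or rounding differences can keep the relationship from being exact, in which case lm() estimates every coefficient but with inflated standard errors.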

Interactions Model (model 2):

# Fit a linear regression model (lm) using interaction terms
model2 <- lm(`W_L%` ~ Points_Scored_per_game : Points_Against_per_game + Minutes_Played_per_game + FGM_per_game : FGA_per_game + `3PM_per_game` : `3PA_per_game` + FTM_per_game : FTA_per_game + ORB_per_game + TRB_per_game + AST_per_game + STL_per_game + BLK_per_game + TOV_per_game + PF_per_game + AST_per_game : TOV_per_game,
             data = game_stats)

# Summary of the model
summary(model2)
## 
## Call:
## lm(formula = `W_L%` ~ Points_Scored_per_game:Points_Against_per_game + 
##     Minutes_Played_per_game + FGM_per_game:FGA_per_game + `3PM_per_game`:`3PA_per_game` + 
##     FTM_per_game:FTA_per_game + ORB_per_game + TRB_per_game + 
##     AST_per_game + STL_per_game + BLK_per_game + TOV_per_game + 
##     PF_per_game + AST_per_game:TOV_per_game, data = game_stats)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.78815 -0.06274  0.00195  0.06495  0.51194 
## 
## Coefficients:
##                                                  Estimate Std. Error t value
## (Intercept)                                     9.869e-01  5.092e-02  19.382
## Minutes_Played_per_game                        -2.251e-02  1.125e-03 -20.001
## ORB_per_game                                   -1.327e-02  1.191e-03 -11.136
## TRB_per_game                                    1.331e-02  7.808e-04  17.042
## AST_per_game                                    3.796e-02  1.280e-03  29.660
## STL_per_game                                    1.273e-02  1.030e-03  12.360
## BLK_per_game                                    1.103e-03  1.218e-03   0.906
## TOV_per_game                                   -1.597e-02  1.429e-03 -11.175
## PF_per_game                                    -1.030e-02  7.342e-04 -14.027
## Points_Scored_per_game:Points_Against_per_game -1.981e-04  4.329e-06 -45.763
## FGM_per_game:FGA_per_game                       4.056e-04  1.446e-05  28.048
## `3PM_per_game`:`3PA_per_game`                   4.685e-04  2.965e-05  15.804
## FTM_per_game:FTA_per_game                       1.092e-03  2.208e-05  49.429
## AST_per_game:TOV_per_game                      -4.847e-04  7.446e-05  -6.509
##                                                Pr(>|t|)    
## (Intercept)                                     < 2e-16 ***
## Minutes_Played_per_game                         < 2e-16 ***
## ORB_per_game                                    < 2e-16 ***
## TRB_per_game                                    < 2e-16 ***
## AST_per_game                                    < 2e-16 ***
## STL_per_game                                    < 2e-16 ***
## BLK_per_game                                      0.365    
## TOV_per_game                                    < 2e-16 ***
## PF_per_game                                     < 2e-16 ***
## Points_Scored_per_game:Points_Against_per_game  < 2e-16 ***
## FGM_per_game:FGA_per_game                       < 2e-16 ***
## `3PM_per_game`:`3PA_per_game`                   < 2e-16 ***
## FTM_per_game:FTA_per_game                       < 2e-16 ***
## AST_per_game:TOV_per_game                      8.08e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09742 on 6851 degrees of freedom
## Multiple R-squared:  0.7562, Adjusted R-squared:  0.7558 
## F-statistic:  1635 on 13 and 6851 DF,  p-value: < 2.2e-16
autoplot(model2, label.size = 3)

I want to predict the success of a team’s season based on per-game averages, and to do this I chose a linear regression model with some added complexity in my predictors. Note that this model leaves out the shooting percentage columns. I chose to model the interactions between specific variables given my contextual knowledge of the sport. For example, I did not include FGM_per_game and FGA_per_game as stand-alone predictors and instead used FGM_per_game : FGA_per_game (the interaction between field goals made and field goals attempted per game). AST_per_game and TOV_per_game are the exception: they appear both as stand-alone predictors and as the interaction AST_per_game : TOV_per_game.
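The difference between `:` and `*` in an R formula is easy to miss, so here is a quick base R check. The variable names y, a, and b are placeholders; terms() only parses the formula, so no data is needed.

```r
# `a * b` expands to both main effects plus their interaction.
labels_star <- attr(terms(y ~ a * b), "term.labels")

# `a:b` is the interaction alone, with no main effects.
labels_colon <- attr(terms(y ~ a:b), "term.labels")

labels_star
labels_colon
```

This is why model 1, which used AST_per_game * TOV_per_game, and model 2, which mostly used `:`, end up with different sets of terms.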

The model is statistically significant overall, with a very small p-value and a multiple \(R^2\) value of 0.7562, which means the model can account for about 75.62% of the variance in win percentage.

Though, it is noteworthy that this model does have a lower \(R^2\) than model 1, which just took all variables at face value.

The model does seem to meet the assumption of normality based on the plot of residuals, although it may help to remove some of the outliers labeled in the plots (if this is the model that I end up choosing).

Model Selection:

Based on the \(R^2\) values, it would seem to be a no-brainer to use model1; however, it would make more sense to run some more diagnostics before I make that decision.

First, I’ll compare the AIC values.

AIC Comparison:

model1_aic = AIC(model1)
model2_aic = AIC(model2)

print(model1_aic)
## [1] -15829.91
print(model2_aic)
## [1] -12475.54

For AIC, lower is better, and only the difference between models fit to the same data is meaningful; the sign of the value says nothing about quality by itself. model1’s AIC (-15829.91) is far lower than model2’s (-12475.54), so this diagnostic tells me that Model 1 provides the better fit.

BIC Comparison:

model1_bic = BIC(model1)
model2_bic = BIC(model2)

print(model1_bic)
## [1] -15679.56
print(model2_bic)
## [1] -12373.03
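Both criteria can be rebuilt from a model’s log-likelihood, which helps show why only the comparison between models matters. The sketch below uses the built-in mtcars data and a toy model rather than the basketball models.

```r
# Toy model on a built-in dataset (not the basketball data).
fit <- lm(mpg ~ wt + hp, data = mtcars)

ll <- logLik(fit)
k  <- attr(ll, "df")   # estimated parameters: coefficients + residual variance
n  <- nobs(fit)

# AIC = -2 * logLik + 2 * k;  BIC = -2 * logLik + log(n) * k
aic_manual <- -2 * as.numeric(ll) + 2 * k
bic_manual <- -2 * as.numeric(ll) + log(n) * k

c(aic_manual = aic_manual, AIC = AIC(fit))
c(bic_manual = bic_manual, BIC = BIC(fit))
```

Because BIC multiplies the parameter count by log(n) instead of 2, it penalizes extra predictors more heavily once n exceeds about 7. Model 1 has the lower BIC as well, so the BIC comparison agrees with the AIC comparison.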

Summary:

Given that Model 1 has a rather desirable \(R^2\), more desirable (lower) AIC and BIC scores, and appears to meet the normality assumption, I would say it is both a solid model and a better fit than Model 2. Sometimes the best model is the simpler model.