Data Analysis Challenge (DAC) - Spring 2024

In this Rmarkdown file, we will demonstrate how to read in datasets directly from Github and follow that with some explorations.

Import Libraries

These are some packages you could consider using for your analysis based on what you plan to do.

tidyverse - collection of packages for diverse tasks which include data manipulation (dplyr), reading and writing data (readr), and creating static data visualizations (ggplots).
Plotly - for intereactive data visualization
stats - for statistical analysis
#Install the packages first before import.

suppressPackageStartupMessages(library(tidyverse))
## Warning: package 'tidyverse' was built under R version 4.3.2
## Warning: package 'ggplot2' was built under R version 4.3.2
## Warning: package 'tibble' was built under R version 4.3.2
## Warning: package 'tidyr' was built under R version 4.3.2
## Warning: package 'readr' was built under R version 4.3.2
## Warning: package 'purrr' was built under R version 4.3.2
## Warning: package 'dplyr' was built under R version 4.3.2
## Warning: package 'stringr' was built under R version 4.3.2
## Warning: package 'forcats' was built under R version 4.3.2
## Warning: package 'lubridate' was built under R version 4.3.2
suppressPackageStartupMessages(library(plotly))
## Warning: package 'plotly' was built under R version 4.3.2
suppressPackageStartupMessages(library(stats))
suppressPackageStartupMessages(library(knitr))
## Warning: package 'knitr' was built under R version 4.3.2

Data Exploration

df <- read_csv("https://raw.githubusercontent.com/cmich-stats-club/spring2024-dac/main/datasets/cbb.csv", show_col_types = FALSE)
str(df)
## spc_tbl_ [2,455 × 24] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ TEAM      : chr [1:2455] "North Carolina" "Wisconsin" "Michigan" "Texas Tech" ...
##  $ CONF      : chr [1:2455] "ACC" "B10" "B10" "B12" ...
##  $ G         : num [1:2455] 40 40 40 38 39 40 38 39 38 39 ...
##  $ W         : num [1:2455] 33 36 33 31 37 29 30 35 35 33 ...
##  $ ADJOE     : num [1:2455] 123 129 114 115 118 ...
##  $ ADJDE     : num [1:2455] 94.9 93.6 90.4 85.2 86.3 96.2 93.7 90.6 89.9 91.5 ...
##  $ BARTHAG   : num [1:2455] 0.953 0.976 0.938 0.97 0.973 ...
##  $ EFG_O     : num [1:2455] 52.6 54.8 53.9 53.5 56.6 49.9 54.6 56.6 55.2 51.7 ...
##  $ EFG_D     : num [1:2455] 48.1 47.7 47.7 43 41.1 46 48 46.5 44.7 48.1 ...
##  $ TOR       : num [1:2455] 15.4 12.4 14 17.7 16.2 18.1 14.6 16.3 14.7 16.2 ...
##  $ TORD      : num [1:2455] 18.2 15.8 19.5 22.8 17.1 16.1 18.7 18.6 17.5 18.6 ...
##  $ ORB       : num [1:2455] 40.7 32.1 25.5 27.4 30 42 32.5 35.8 30.4 41.3 ...
##  $ DRB       : num [1:2455] 30 23.7 24.9 28.7 26.2 29.7 29.4 30.2 25.4 25 ...
##  $ FTR       : num [1:2455] 32.3 36.2 30.7 32.9 39 51.8 28.4 39.8 29.1 34.3 ...
##  $ FTRD      : num [1:2455] 30.4 22.4 30 36.6 26.9 36.8 22.7 23.9 26.3 31.6 ...
##  $ 2P_O      : num [1:2455] 53.9 54.8 54.7 52.8 56.3 50 53.4 55.9 52.5 51 ...
##  $ 2P_D      : num [1:2455] 44.6 44.7 46.8 41.9 40 44.9 47.6 46.3 45.7 46.3 ...
##  $ 3P_O      : num [1:2455] 32.7 36.5 35.2 36.5 38.2 33.2 37.9 38.7 39.5 35.5 ...
##  $ 3P_D      : num [1:2455] 36.2 37.5 33.2 29.7 29 32.2 32.6 31.4 28.9 33.9 ...
##  $ ADJ_T     : num [1:2455] 71.7 59.3 65.9 67.5 71.5 65.9 64.8 66.4 60.7 72.8 ...
##  $ WAB       : num [1:2455] 8.6 11.3 6.9 7 7.7 3.9 6.2 10.7 11.1 8.4 ...
##  $ POSTSEASON: chr [1:2455] "2ND" "2ND" "2ND" "2ND" ...
##  $ SEED      : num [1:2455] 1 1 3 3 1 8 4 1 1 1 ...
##  $ YEAR      : num [1:2455] 2016 2015 2018 2019 2017 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   TEAM = col_character(),
##   ..   CONF = col_character(),
##   ..   G = col_double(),
##   ..   W = col_double(),
##   ..   ADJOE = col_double(),
##   ..   ADJDE = col_double(),
##   ..   BARTHAG = col_double(),
##   ..   EFG_O = col_double(),
##   ..   EFG_D = col_double(),
##   ..   TOR = col_double(),
##   ..   TORD = col_double(),
##   ..   ORB = col_double(),
##   ..   DRB = col_double(),
##   ..   FTR = col_double(),
##   ..   FTRD = col_double(),
##   ..   `2P_O` = col_double(),
##   ..   `2P_D` = col_double(),
##   ..   `3P_O` = col_double(),
##   ..   `3P_D` = col_double(),
##   ..   ADJ_T = col_double(),
##   ..   WAB = col_double(),
##   ..   POSTSEASON = col_character(),
##   ..   SEED = col_double(),
##   ..   YEAR = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

There are 2,455 records across 24 columns in the dataset.

summary(df)
##      TEAM               CONF                 G               W        
##  Length:2455        Length:2455        Min.   :15.00   Min.   : 0.00  
##  Class :character   Class :character   1st Qu.:30.00   1st Qu.:11.00  
##  Mode  :character   Mode  :character   Median :31.00   Median :16.00  
##                                        Mean   :31.49   Mean   :16.28  
##                                        3rd Qu.:33.00   3rd Qu.:21.00  
##                                        Max.   :40.00   Max.   :38.00  
##                                                                       
##      ADJOE           ADJDE          BARTHAG           EFG_O      
##  Min.   : 76.6   Min.   : 84.0   Min.   :0.0050   Min.   :39.20  
##  1st Qu.: 98.3   1st Qu.: 98.5   1st Qu.:0.2822   1st Qu.:47.75  
##  Median :103.0   Median :103.5   Median :0.4750   Median :49.70  
##  Mean   :103.3   Mean   :103.3   Mean   :0.4940   Mean   :49.81  
##  3rd Qu.:108.0   3rd Qu.:107.9   3rd Qu.:0.7122   3rd Qu.:51.90  
##  Max.   :129.1   Max.   :124.0   Max.   :0.9842   Max.   :59.80  
##                                                                  
##      EFG_D           TOR             TORD            ORB             DRB       
##  Min.   :39.6   Min.   :11.90   Min.   :10.20   Min.   :15.00   Min.   :18.40  
##  1st Qu.:48.0   1st Qu.:17.30   1st Qu.:17.20   1st Qu.:27.10   1st Qu.:27.90  
##  Median :50.0   Median :18.70   Median :18.60   Median :29.90   Median :30.00  
##  Mean   :50.0   Mean   :18.76   Mean   :18.69   Mean   :29.88   Mean   :30.08  
##  3rd Qu.:52.0   3rd Qu.:20.10   3rd Qu.:20.10   3rd Qu.:32.60   3rd Qu.:32.20  
##  Max.   :59.5   Max.   :27.10   Max.   :28.50   Max.   :43.60   Max.   :40.40  
##                                                                                
##       FTR             FTRD            2P_O           2P_D            3P_O      
##  Min.   :21.60   Min.   :21.80   Min.   :37.7   Min.   :37.70   Min.   :24.90  
##  1st Qu.:32.40   1st Qu.:31.90   1st Qu.:46.5   1st Qu.:46.70   1st Qu.:32.50  
##  Median :35.80   Median :35.80   Median :48.7   Median :49.00   Median :34.40  
##  Mean   :35.99   Mean   :36.27   Mean   :48.8   Mean   :48.98   Mean   :34.41  
##  3rd Qu.:39.50   3rd Qu.:40.20   3rd Qu.:51.0   3rd Qu.:51.30   3rd Qu.:36.30  
##  Max.   :58.60   Max.   :60.70   Max.   :62.6   Max.   :61.20   Max.   :44.10  
##                                                                                
##       3P_D          ADJ_T            WAB           POSTSEASON       
##  Min.   :27.1   Min.   :57.20   Min.   :-25.200   Length:2455       
##  1st Qu.:33.0   1st Qu.:65.70   1st Qu.:-13.000   Class :character  
##  Median :34.6   Median :67.80   Median : -8.300   Mode  :character  
##  Mean   :34.6   Mean   :67.81   Mean   : -7.802                     
##  3rd Qu.:36.2   3rd Qu.:70.00   3rd Qu.: -3.150                     
##  Max.   :43.1   Max.   :83.40   Max.   : 13.100                     
##                                                                     
##       SEED             YEAR     
##  Min.   : 1.000   Min.   :2013  
##  1st Qu.: 5.000   1st Qu.:2014  
##  Median : 9.000   Median :2016  
##  Mean   : 8.803   Mean   :2016  
##  3rd Qu.:13.000   3rd Qu.:2018  
##  Max.   :16.000   Max.   :2019  
##  NA's   :1979

From the YEAR summaries, we see that this data is the seasons 2013 through 2019.

#Get the number of missing values for each variable
sapply(df, function(x) sum(is.na(x)))
##       TEAM       CONF          G          W      ADJOE      ADJDE    BARTHAG 
##          0          0          0          0          0          0          0 
##      EFG_O      EFG_D        TOR       TORD        ORB        DRB        FTR 
##          0          0          0          0          0          0          0 
##       FTRD       2P_O       2P_D       3P_O       3P_D      ADJ_T        WAB 
##          0          0          0          0          0          0          0 
## POSTSEASON       SEED       YEAR 
##       1979       1979          0

The only columns with missing records are POSTSEASON and SEED. Are these missing on purpose or it is a data anomaly?

#Print out first 6 records
df %>%
  head(.) %>%
  kable(.)
TEAM CONF G W ADJOE ADJDE BARTHAG EFG_O EFG_D TOR TORD ORB DRB FTR FTRD 2P_O 2P_D 3P_O 3P_D ADJ_T WAB POSTSEASON SEED YEAR
North Carolina ACC 40 33 123.3 94.9 0.9531 52.6 48.1 15.4 18.2 40.7 30.0 32.3 30.4 53.9 44.6 32.7 36.2 71.7 8.6 2ND 1 2016
Wisconsin B10 40 36 129.1 93.6 0.9758 54.8 47.7 12.4 15.8 32.1 23.7 36.2 22.4 54.8 44.7 36.5 37.5 59.3 11.3 2ND 1 2015
Michigan B10 40 33 114.4 90.4 0.9375 53.9 47.7 14.0 19.5 25.5 24.9 30.7 30.0 54.7 46.8 35.2 33.2 65.9 6.9 2ND 3 2018
Texas Tech B12 38 31 115.2 85.2 0.9696 53.5 43.0 17.7 22.8 27.4 28.7 32.9 36.6 52.8 41.9 36.5 29.7 67.5 7.0 2ND 3 2019
Gonzaga WCC 39 37 117.8 86.3 0.9728 56.6 41.1 16.2 17.1 30.0 26.2 39.0 26.9 56.3 40.0 38.2 29.0 71.5 7.7 2ND 1 2017
Kentucky SEC 40 29 117.2 96.2 0.9062 49.9 46.0 18.1 16.1 42.0 29.7 51.8 36.8 50.0 44.9 33.2 32.2 65.9 3.9 2ND 8 2014
length(unique(df$TEAM)); length(unique(df$CONF))
## [1] 355
## [1] 35

There are 355 teams (schools) spanning 35 conferences.

NCAA Post-Season

#I want to focus on only the teams that made it to the post season
df_post_season <- df %>%
  filter(!is.na(POSTSEASON))
#Number of records in the post season subset table
dim(df_post_season)
## [1] 476  24
#Which conferences had the most representation post season 2013-2019
df_post_season %>%
  count(CONF) %>%
  arrange(desc(n))  %>%
  kable(.)
CONF n
ACC 48
B10 46
B12 45
BE 40
SEC 34
P12 32
A10 25
Amer 19
MWC 16
WCC 12
MVC 10
BSth 8
OVC 8
SB 8
AE 7
BSky 7
BW 7
CAA 7
CUSA 7
Horz 7
Ivy 7
MAAC 7
MAC 7
MEAC 7
NEC 7
Pat 7
SC 7
SWAC 7
Slnd 7
Sum 7
WAC 7
ASun 6

The ACC is the most represented conference, followed by the Big 10 (B10).
CMU is in the MAC. This conference has one representative per season. Who are these MAC representatives in the March Madness?

df_post_season %>%
  filter(CONF == "MAC") %>%
  arrange(desc(YEAR)) %>%
  kable(.)
TEAM CONF G W ADJOE ADJDE BARTHAG EFG_O EFG_D TOR TORD ORB DRB FTR FTRD 2P_O 2P_D 3P_O 3P_D ADJ_T WAB POSTSEASON SEED YEAR
Buffalo MAC 35 32 113.6 95.3 0.8819 53.4 47.3 16.0 20.3 31.7 26.9 32.7 31.9 55.7 49.3 33.7 29.3 74.9 5.5 R32 6 2019
Buffalo MAC 35 27 114.5 102.7 0.7761 54.8 48.8 16.5 18.1 31.9 29.1 29.9 39.5 54.3 48.4 37.1 33.1 73.8 -1.3 R32 13 2018
Kent St. MAC 35 22 104.1 102.9 0.5322 48.2 50.1 17.9 19.0 38.1 29.8 32.5 34.3 48.5 48.5 31.7 35.3 69.5 -7.2 R64 14 2017
Buffalo MAC 34 19 105.1 102.4 0.5746 49.6 49.8 18.8 17.7 31.8 28.7 41.6 37.8 49.0 48.8 33.7 34.3 72.9 -5.8 R64 14 2016
Buffalo MAC 33 23 109.1 98.0 0.7742 48.7 48.5 16.7 19.7 34.5 30.4 42.6 34.3 47.6 48.0 34.1 33.0 68.9 -1.3 R64 12 2015
Western Michigan MAC 32 22 104.7 102.5 0.5605 52.0 47.6 20.9 18.5 31.0 32.0 51.4 34.7 53.7 47.8 32.6 31.5 67.0 -2.9 R64 14 2014
Akron MAC 32 25 103.4 94.4 0.7402 51.0 44.6 20.8 20.2 37.6 30.5 36.9 35.0 52.1 42.7 32.7 32.8 66.5 0.3 R64 12 2013

Buffalo is the most dominant team in the MAC, 4 March Madness appearances in 7 years. CMU need to improve their odds of making it to the MAC.
Unfortunately, the farthest the MAC teams have gone is the Round of 32.