In this Rmarkdown file, we will demonstrate how to read in datasets directly from Github and follow that with some explorations.
These are some packages you could consider using for your analysis based on what you plan to do.
tidyverse - collection of packages for diverse tasks which include data manipulation (dplyr), reading and writing data (readr), and creating static data visualizations (ggplots).
Plotly - for intereactive data visualization
stats - for statistical analysis
#Install the packages first before import.
suppressPackageStartupMessages(library(tidyverse))
## Warning: package 'tidyverse' was built under R version 4.3.2
## Warning: package 'ggplot2' was built under R version 4.3.2
## Warning: package 'tibble' was built under R version 4.3.2
## Warning: package 'tidyr' was built under R version 4.3.2
## Warning: package 'readr' was built under R version 4.3.2
## Warning: package 'purrr' was built under R version 4.3.2
## Warning: package 'dplyr' was built under R version 4.3.2
## Warning: package 'stringr' was built under R version 4.3.2
## Warning: package 'forcats' was built under R version 4.3.2
## Warning: package 'lubridate' was built under R version 4.3.2
suppressPackageStartupMessages(library(plotly))
## Warning: package 'plotly' was built under R version 4.3.2
suppressPackageStartupMessages(library(stats))
suppressPackageStartupMessages(library(knitr))
## Warning: package 'knitr' was built under R version 4.3.2
df <- read_csv("https://raw.githubusercontent.com/cmich-stats-club/spring2024-dac/main/datasets/cbb.csv", show_col_types = FALSE)
str(df)
## spc_tbl_ [2,455 × 24] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ TEAM : chr [1:2455] "North Carolina" "Wisconsin" "Michigan" "Texas Tech" ...
## $ CONF : chr [1:2455] "ACC" "B10" "B10" "B12" ...
## $ G : num [1:2455] 40 40 40 38 39 40 38 39 38 39 ...
## $ W : num [1:2455] 33 36 33 31 37 29 30 35 35 33 ...
## $ ADJOE : num [1:2455] 123 129 114 115 118 ...
## $ ADJDE : num [1:2455] 94.9 93.6 90.4 85.2 86.3 96.2 93.7 90.6 89.9 91.5 ...
## $ BARTHAG : num [1:2455] 0.953 0.976 0.938 0.97 0.973 ...
## $ EFG_O : num [1:2455] 52.6 54.8 53.9 53.5 56.6 49.9 54.6 56.6 55.2 51.7 ...
## $ EFG_D : num [1:2455] 48.1 47.7 47.7 43 41.1 46 48 46.5 44.7 48.1 ...
## $ TOR : num [1:2455] 15.4 12.4 14 17.7 16.2 18.1 14.6 16.3 14.7 16.2 ...
## $ TORD : num [1:2455] 18.2 15.8 19.5 22.8 17.1 16.1 18.7 18.6 17.5 18.6 ...
## $ ORB : num [1:2455] 40.7 32.1 25.5 27.4 30 42 32.5 35.8 30.4 41.3 ...
## $ DRB : num [1:2455] 30 23.7 24.9 28.7 26.2 29.7 29.4 30.2 25.4 25 ...
## $ FTR : num [1:2455] 32.3 36.2 30.7 32.9 39 51.8 28.4 39.8 29.1 34.3 ...
## $ FTRD : num [1:2455] 30.4 22.4 30 36.6 26.9 36.8 22.7 23.9 26.3 31.6 ...
## $ 2P_O : num [1:2455] 53.9 54.8 54.7 52.8 56.3 50 53.4 55.9 52.5 51 ...
## $ 2P_D : num [1:2455] 44.6 44.7 46.8 41.9 40 44.9 47.6 46.3 45.7 46.3 ...
## $ 3P_O : num [1:2455] 32.7 36.5 35.2 36.5 38.2 33.2 37.9 38.7 39.5 35.5 ...
## $ 3P_D : num [1:2455] 36.2 37.5 33.2 29.7 29 32.2 32.6 31.4 28.9 33.9 ...
## $ ADJ_T : num [1:2455] 71.7 59.3 65.9 67.5 71.5 65.9 64.8 66.4 60.7 72.8 ...
## $ WAB : num [1:2455] 8.6 11.3 6.9 7 7.7 3.9 6.2 10.7 11.1 8.4 ...
## $ POSTSEASON: chr [1:2455] "2ND" "2ND" "2ND" "2ND" ...
## $ SEED : num [1:2455] 1 1 3 3 1 8 4 1 1 1 ...
## $ YEAR : num [1:2455] 2016 2015 2018 2019 2017 ...
## - attr(*, "spec")=
## .. cols(
## .. TEAM = col_character(),
## .. CONF = col_character(),
## .. G = col_double(),
## .. W = col_double(),
## .. ADJOE = col_double(),
## .. ADJDE = col_double(),
## .. BARTHAG = col_double(),
## .. EFG_O = col_double(),
## .. EFG_D = col_double(),
## .. TOR = col_double(),
## .. TORD = col_double(),
## .. ORB = col_double(),
## .. DRB = col_double(),
## .. FTR = col_double(),
## .. FTRD = col_double(),
## .. `2P_O` = col_double(),
## .. `2P_D` = col_double(),
## .. `3P_O` = col_double(),
## .. `3P_D` = col_double(),
## .. ADJ_T = col_double(),
## .. WAB = col_double(),
## .. POSTSEASON = col_character(),
## .. SEED = col_double(),
## .. YEAR = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
There are 2,455 records across 24 columns in the dataset.
summary(df)
## TEAM CONF G W
## Length:2455 Length:2455 Min. :15.00 Min. : 0.00
## Class :character Class :character 1st Qu.:30.00 1st Qu.:11.00
## Mode :character Mode :character Median :31.00 Median :16.00
## Mean :31.49 Mean :16.28
## 3rd Qu.:33.00 3rd Qu.:21.00
## Max. :40.00 Max. :38.00
##
## ADJOE ADJDE BARTHAG EFG_O
## Min. : 76.6 Min. : 84.0 Min. :0.0050 Min. :39.20
## 1st Qu.: 98.3 1st Qu.: 98.5 1st Qu.:0.2822 1st Qu.:47.75
## Median :103.0 Median :103.5 Median :0.4750 Median :49.70
## Mean :103.3 Mean :103.3 Mean :0.4940 Mean :49.81
## 3rd Qu.:108.0 3rd Qu.:107.9 3rd Qu.:0.7122 3rd Qu.:51.90
## Max. :129.1 Max. :124.0 Max. :0.9842 Max. :59.80
##
## EFG_D TOR TORD ORB DRB
## Min. :39.6 Min. :11.90 Min. :10.20 Min. :15.00 Min. :18.40
## 1st Qu.:48.0 1st Qu.:17.30 1st Qu.:17.20 1st Qu.:27.10 1st Qu.:27.90
## Median :50.0 Median :18.70 Median :18.60 Median :29.90 Median :30.00
## Mean :50.0 Mean :18.76 Mean :18.69 Mean :29.88 Mean :30.08
## 3rd Qu.:52.0 3rd Qu.:20.10 3rd Qu.:20.10 3rd Qu.:32.60 3rd Qu.:32.20
## Max. :59.5 Max. :27.10 Max. :28.50 Max. :43.60 Max. :40.40
##
## FTR FTRD 2P_O 2P_D 3P_O
## Min. :21.60 Min. :21.80 Min. :37.7 Min. :37.70 Min. :24.90
## 1st Qu.:32.40 1st Qu.:31.90 1st Qu.:46.5 1st Qu.:46.70 1st Qu.:32.50
## Median :35.80 Median :35.80 Median :48.7 Median :49.00 Median :34.40
## Mean :35.99 Mean :36.27 Mean :48.8 Mean :48.98 Mean :34.41
## 3rd Qu.:39.50 3rd Qu.:40.20 3rd Qu.:51.0 3rd Qu.:51.30 3rd Qu.:36.30
## Max. :58.60 Max. :60.70 Max. :62.6 Max. :61.20 Max. :44.10
##
## 3P_D ADJ_T WAB POSTSEASON
## Min. :27.1 Min. :57.20 Min. :-25.200 Length:2455
## 1st Qu.:33.0 1st Qu.:65.70 1st Qu.:-13.000 Class :character
## Median :34.6 Median :67.80 Median : -8.300 Mode :character
## Mean :34.6 Mean :67.81 Mean : -7.802
## 3rd Qu.:36.2 3rd Qu.:70.00 3rd Qu.: -3.150
## Max. :43.1 Max. :83.40 Max. : 13.100
##
## SEED YEAR
## Min. : 1.000 Min. :2013
## 1st Qu.: 5.000 1st Qu.:2014
## Median : 9.000 Median :2016
## Mean : 8.803 Mean :2016
## 3rd Qu.:13.000 3rd Qu.:2018
## Max. :16.000 Max. :2019
## NA's :1979
From the YEAR summaries, we see that this data is the seasons 2013 through 2019.
#Get the number of missing values for each variable
sapply(df, function(x) sum(is.na(x)))
## TEAM CONF G W ADJOE ADJDE BARTHAG
## 0 0 0 0 0 0 0
## EFG_O EFG_D TOR TORD ORB DRB FTR
## 0 0 0 0 0 0 0
## FTRD 2P_O 2P_D 3P_O 3P_D ADJ_T WAB
## 0 0 0 0 0 0 0
## POSTSEASON SEED YEAR
## 1979 1979 0
The only columns with missing records are POSTSEASON and SEED. Are these missing on purpose or it is a data anomaly?
#Print out first 6 records
df %>%
head(.) %>%
kable(.)
| TEAM | CONF | G | W | ADJOE | ADJDE | BARTHAG | EFG_O | EFG_D | TOR | TORD | ORB | DRB | FTR | FTRD | 2P_O | 2P_D | 3P_O | 3P_D | ADJ_T | WAB | POSTSEASON | SEED | YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| North Carolina | ACC | 40 | 33 | 123.3 | 94.9 | 0.9531 | 52.6 | 48.1 | 15.4 | 18.2 | 40.7 | 30.0 | 32.3 | 30.4 | 53.9 | 44.6 | 32.7 | 36.2 | 71.7 | 8.6 | 2ND | 1 | 2016 |
| Wisconsin | B10 | 40 | 36 | 129.1 | 93.6 | 0.9758 | 54.8 | 47.7 | 12.4 | 15.8 | 32.1 | 23.7 | 36.2 | 22.4 | 54.8 | 44.7 | 36.5 | 37.5 | 59.3 | 11.3 | 2ND | 1 | 2015 |
| Michigan | B10 | 40 | 33 | 114.4 | 90.4 | 0.9375 | 53.9 | 47.7 | 14.0 | 19.5 | 25.5 | 24.9 | 30.7 | 30.0 | 54.7 | 46.8 | 35.2 | 33.2 | 65.9 | 6.9 | 2ND | 3 | 2018 |
| Texas Tech | B12 | 38 | 31 | 115.2 | 85.2 | 0.9696 | 53.5 | 43.0 | 17.7 | 22.8 | 27.4 | 28.7 | 32.9 | 36.6 | 52.8 | 41.9 | 36.5 | 29.7 | 67.5 | 7.0 | 2ND | 3 | 2019 |
| Gonzaga | WCC | 39 | 37 | 117.8 | 86.3 | 0.9728 | 56.6 | 41.1 | 16.2 | 17.1 | 30.0 | 26.2 | 39.0 | 26.9 | 56.3 | 40.0 | 38.2 | 29.0 | 71.5 | 7.7 | 2ND | 1 | 2017 |
| Kentucky | SEC | 40 | 29 | 117.2 | 96.2 | 0.9062 | 49.9 | 46.0 | 18.1 | 16.1 | 42.0 | 29.7 | 51.8 | 36.8 | 50.0 | 44.9 | 33.2 | 32.2 | 65.9 | 3.9 | 2ND | 8 | 2014 |
length(unique(df$TEAM)); length(unique(df$CONF))
## [1] 355
## [1] 35
There are 355 teams (schools) spanning 35 conferences.
#I want to focus on only the teams that made it to the post season
df_post_season <- df %>%
filter(!is.na(POSTSEASON))
#Number of records in the post season subset table
dim(df_post_season)
## [1] 476 24
#Which conferences had the most representation post season 2013-2019
df_post_season %>%
count(CONF) %>%
arrange(desc(n)) %>%
kable(.)
| CONF | n |
|---|---|
| ACC | 48 |
| B10 | 46 |
| B12 | 45 |
| BE | 40 |
| SEC | 34 |
| P12 | 32 |
| A10 | 25 |
| Amer | 19 |
| MWC | 16 |
| WCC | 12 |
| MVC | 10 |
| BSth | 8 |
| OVC | 8 |
| SB | 8 |
| AE | 7 |
| BSky | 7 |
| BW | 7 |
| CAA | 7 |
| CUSA | 7 |
| Horz | 7 |
| Ivy | 7 |
| MAAC | 7 |
| MAC | 7 |
| MEAC | 7 |
| NEC | 7 |
| Pat | 7 |
| SC | 7 |
| SWAC | 7 |
| Slnd | 7 |
| Sum | 7 |
| WAC | 7 |
| ASun | 6 |
The ACC is the most represented conference, followed by the Big 10
(B10).
CMU is in the MAC. This conference has one representative per season.
Who are these MAC representatives in the March Madness?
df_post_season %>%
filter(CONF == "MAC") %>%
arrange(desc(YEAR)) %>%
kable(.)
| TEAM | CONF | G | W | ADJOE | ADJDE | BARTHAG | EFG_O | EFG_D | TOR | TORD | ORB | DRB | FTR | FTRD | 2P_O | 2P_D | 3P_O | 3P_D | ADJ_T | WAB | POSTSEASON | SEED | YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Buffalo | MAC | 35 | 32 | 113.6 | 95.3 | 0.8819 | 53.4 | 47.3 | 16.0 | 20.3 | 31.7 | 26.9 | 32.7 | 31.9 | 55.7 | 49.3 | 33.7 | 29.3 | 74.9 | 5.5 | R32 | 6 | 2019 |
| Buffalo | MAC | 35 | 27 | 114.5 | 102.7 | 0.7761 | 54.8 | 48.8 | 16.5 | 18.1 | 31.9 | 29.1 | 29.9 | 39.5 | 54.3 | 48.4 | 37.1 | 33.1 | 73.8 | -1.3 | R32 | 13 | 2018 |
| Kent St. | MAC | 35 | 22 | 104.1 | 102.9 | 0.5322 | 48.2 | 50.1 | 17.9 | 19.0 | 38.1 | 29.8 | 32.5 | 34.3 | 48.5 | 48.5 | 31.7 | 35.3 | 69.5 | -7.2 | R64 | 14 | 2017 |
| Buffalo | MAC | 34 | 19 | 105.1 | 102.4 | 0.5746 | 49.6 | 49.8 | 18.8 | 17.7 | 31.8 | 28.7 | 41.6 | 37.8 | 49.0 | 48.8 | 33.7 | 34.3 | 72.9 | -5.8 | R64 | 14 | 2016 |
| Buffalo | MAC | 33 | 23 | 109.1 | 98.0 | 0.7742 | 48.7 | 48.5 | 16.7 | 19.7 | 34.5 | 30.4 | 42.6 | 34.3 | 47.6 | 48.0 | 34.1 | 33.0 | 68.9 | -1.3 | R64 | 12 | 2015 |
| Western Michigan | MAC | 32 | 22 | 104.7 | 102.5 | 0.5605 | 52.0 | 47.6 | 20.9 | 18.5 | 31.0 | 32.0 | 51.4 | 34.7 | 53.7 | 47.8 | 32.6 | 31.5 | 67.0 | -2.9 | R64 | 14 | 2014 |
| Akron | MAC | 32 | 25 | 103.4 | 94.4 | 0.7402 | 51.0 | 44.6 | 20.8 | 20.2 | 37.6 | 30.5 | 36.9 | 35.0 | 52.1 | 42.7 | 32.7 | 32.8 | 66.5 | 0.3 | R64 | 12 | 2013 |
Buffalo is the most dominant team in the MAC, 4 March Madness
appearances in 7 years. CMU need to improve their odds of making it to
the MAC.
Unfortunately, the farthest the MAC teams have gone is the Round of
32.